Thursday, June 12, 2025

Solr query with n-gram

A Search story with Solr N-Gram part 2


Querying our index

In part 1 of this search story I described the setup we did to create a custom Solr index in Sitecore that had a few fields with the n-gram tokenizer. 

A small recap: we are trying to create a search on a bunch of similar Sitecore items that uses tagging but also free text search in the title and description. We want to make sure the users always get results if possible, with the most relevant on top.

In this second part I will describe how we queried that index to get what we need. We use only Solr - not retrieving any data from Sitecore - as we want to be ready to move this solution out of the Sitecore environment some day. That is also the reason we are not using the Sitecore search layer, but the SolrNet library instead.


Query options and basics

Let's start easy with setting some query options.
var options = new QueryOptions
{
  Rows = parameters.Rows,
  StartOrCursor = new StartOrCursor.Start(parameters.Start)
};
We are just setting the parameters for paging here - number of rows and the start row.
var query = new List<ISolrQuery>()
  {
    new SolrQueryByField("_template", "bdd6ede443e889619bc01314c027b3da"),
    new SolrQueryByField("_language", language),
    new SolrQueryByField("_path", "5bbbd9fa6d764b01813f0cafd6f5de31")
  };
We start the query by setting the desired template, language and path.
We use SolrQueryInList with an IEnumerable to add the tagging parts to the query, but as that is not the most relevant part here I will not go into more detail. You can find all the information on querying with SolrNet in their docs on GitHub.
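
As an illustration, such a tag filter could look like this (a minimal sketch - the tags_sm field and the Tags parameter are hypothetical names, not the actual ones):
if (parameters.Tags != null && parameters.Tags.Any())
{
  // translates to tags_sm:(tagA OR tagB ...) - any selected tag matches
  query.Add(new SolrQueryInList("tags_sm", parameters.Tags));
}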


Search query

The next step and most interesting one is adding the search part to the query.
if (!string.IsNullOrEmpty(parameters.SearchTerm))
{
  var searchQuery = new List<ISolrQuery>()
  {
    new SolrQueryByField("titlestring_s", parameters.SearchTerm),
    new SolrQueryByField("descriptionstring_s", parameters.SearchTerm),
    new SolrQueryByField("titlesearch_txts", parameters.SearchTerm),
    new SolrQueryByField("descriptionsearch_txts", parameters.SearchTerm)
  };
  var search = new SolrMultipleCriteriaQuery(searchQuery, SolrMultipleCriteriaQuery.Operator.OR);
  query.Add(search);
  options.AddOrder(new SortOrder("score", Order.DESC));
  options.ExtraParams = new Dictionary<string, string>
  {
      { "defType", "edismax" },
      { "qf", "titlestring_s^9 descriptionstring_s^5 titlesearch_txts^2 descriptionsearch_txts" }
  };
}
else
{
  options.AddOrder(new SortOrder("__smallupdateddate_tdt", Order.DESC));
}

What are we doing here? First of all, we check if we actually have a search parameter. If we do not, we add no search query and keep the default sorting - the last updated date in our case.

But what if we do have a search string? We build a new Solr query that combines four field queries: we search in the string and n-gram versions of both the title and the description. We combine the field queries with an OR operator and add the result to the global Solr query.

We then set the sorting on the score field - the score calculated by Solr that indicates the relevancy of the result.

Lastly, we add extra parameters to enable the edismax query parser with the boosting we want: full string matches are boosted the most, and title more than description.

This delivers us the requirements we wanted:
  • search in title and description
  • get results as often as possible
  • show exact matches first
  • get the most relevant results on top


Wrap up

To wrap things up we combine everything and execute the query:
var q = new SolrMultipleCriteriaQuery(query, SolrMultipleCriteriaQuery.Operator.AND);
logger.LogDebug($"[Portal] Information center search: {solrQuerySerializer.Serialize(q)}");
var results = await solrDocuments.QueryAsync(q, options);
Besides gathering the results, note that we can also use the provided serializer to log our queries for debugging.

As a final remark I do need to add that a search like this needs fine-tuning: tuning the size of the n-grams as well as the boost factors. Change the parameters (one at a time) and test until you get the results you want.

And that's it for this second and final part of this n-gram search series. As mentioned in the first post, this information is not new and most of it can be found in several docs and posts, but I thought it would be a good idea to bring it all together. Enjoy your search ;)

Wednesday, June 4, 2025

Search with Solr n-gram in Sitecore

A Search story with Solr N-Gram 

For a customer on Sitecore XM 10.2 we have a headless site running JSS with Next.js and a very specific search request.
One section of their content is an unstructured bunch of help-related articles - like a frequently asked questions section. This content is heavily tagged and contains quite a few items (in a bucket). We already had an application showing this data with the option to use the tags to filter and get to the required content. But now we also had to add free text search.

There is nothing more frustrating than finding no results, especially when looking for help - so we want to give as many relevant results as possible, with of course the most relevant on top.

Also note that we do not have a solution like Sitecore Search or Algolia at our disposal here. So we need to create something with basic Solr. 

As I gathered information from several resources and also found quite a bit of outdated information, this post seemed like a good idea. I will split it in two - a first part here on the Solr setup and a second post on the search code itself.

Solr N-Gram

To be able to (almost) always get results, we decided to use the N-Gram tokenizer.  An n-gram tokenizer splits text into overlapping sequences of characters of a specified length. This tokenizer is useful when you want to perform partial word matching because it generates substrings (character n-grams) of the original input text.
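
For example, with a minimum gram size of 3 and a maximum of 5 (the values we will configure below), the word "search" is tokenized into:
sea, ear, arc, rch, sear, earc, arch, searc, earch
A user typing "arch" will therefore still match an item containing "search", even though that exact word was never entered on its own.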

Step 1 in the process is to create a field type in the Solr schema that will use this tokenizer. We will be using it on indexing and on querying, meaning the indexed value and the search string will be split into n-grams.

We could update the schema in Solr (manually) - but every time someone would populate the index schema our change would be gone. 

Customize index schema population 

An article in the Sitecore documentation helped us to customize the index schema population - which is exactly what we need. We took the code from https://doc.sitecore.com/xp/en/developers/latest/platform-administration-and-architecture/add-custom-fields-to-a-solr-schema.html and changed the relevant methods as follows:
private IEnumerable<XElement> GetAddCustomFields()
{
  yield return CreateField("*_txts",
    "text_searchable",
    isDynamic: true,
    required: false,
    indexed: true,
    stored: true,
    multiValued: false,
    omitNorms: false,
    termOffsets: false,
    termPositions: false,
    termVectors: false);
}
So we are creating a new dynamic field with the suffix _txts, of the type "text_searchable", that will get indexed and stored.

private IEnumerable<XElement> GetAddCustomFieldTypes()
{
  var fieldType = CreateFieldType("text_searchable", "solr.TextField",
    new Dictionary<string, string>
    {
      { "positionIncrementGap", "100" },
      { "multiValued", "false" },
    });
  var indexAnalyzer = new XElement("indexAnalyzer");
  indexAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(indexAnalyzer);
  
  var queryAnalyzer = new XElement("queryAnalyzer");
  queryAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.SynonymFilterFactory"), new XElement("synonyms", "synonyms.txt"), new XElement("ignoreCase", "true"), new XElement("expand", "true")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(queryAnalyzer);
  yield return fieldType;
}
Here we are adding the field type text_searchable as a text field that uses the NGramTokenizerFactory. We also set the min and max gram size, which determine the minimum and maximum number of characters used to create the fragments of your text (check the Solr docs for more details).

Don't forget to also add the populate helper factory class and the configuration patch, and that's it.
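
For reference, that patch follows the pattern from the same Sitecore article - a minimal sketch, assuming the factory class from that article is named CustomPopulateHelperFactory and lives in an assembly called X (the exact patch selector may differ per Sitecore version):
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <contentSearch.PopulateSolrSchema>
        <processor type="Sitecore.ContentSearch.SolrProvider.Pipelines.PopulateSolrSchema.PopulateFields, Sitecore.ContentSearch.SolrProvider">
          <!-- swap the default populate helper factory for our custom one -->
          <param type="X.Index.CustomPopulateHelperFactory, X"
                 patch:instead="param[@type='Sitecore.ContentSearch.SolrProvider.Factories.DefaultPopulateHelperFactory, Sitecore.ContentSearch.SolrProvider']" />
        </processor>
      </contentSearch.PopulateSolrSchema>
    </pipelines>
  </sitecore>
</configuration>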

We created a custom index for this purpose, in order to have a custom configuration with computed fields and such, specific to this index - with a limited number of items. If we now populate the schema for that index, our n-gram field type is added.

Sitecore index configuration

As mentioned earlier we have a custom index configured. This was done for two reasons:
  • setting the crawlers: plural, as we have two - one for each location holding items that should be included in the application
  • custom index configuration: we wanted our own index configuration, so we are completely free to customize it just for this index without consequences for all the others. The default Solr configuration is referenced, though, so we don't need to copy all the basics:
    <ourcustomSolrIndexConfiguration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration">
In order to get what we need in the index, we configure:
  • AddIncludedTemplate: the templates to be added to the index
  • AddComputedIndexField: all computed fields to be added to the index
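
To make this more tangible, a trimmed sketch of such an index definition could look like this (the index name, paths and strategy are illustrative assumptions, not the actual configuration):
<index id="ourcustom_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
  <param desc="name">$(id)</param>
  <param desc="core">$(id)</param>
  <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" />
  <configuration ref="contentSearch/indexConfigurations/ourcustomSolrIndexConfiguration" />
  <strategies hint="list:AddStrategy">
    <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync" />
  </strategies>
  <locations hint="list:AddCrawler">
    <!-- two crawlers: one per location that holds items for the application -->
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/content/site/home/help</Root>
    </crawler>
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/content/site/data/faq</Root>
    </crawler>
  </locations>
</index>
The included template is then listed in the documentOptions of the custom index configuration - the GUID here being the template we filter on in the query from part two:
<documentOptions type="Sitecore.ContentSearch.SolrProvider.SolrDocumentOptions, Sitecore.ContentSearch.SolrProvider">
  <indexAllFields>false</indexAllFields>
  <include hint="list:AddIncludedTemplate">
    <faqArticle>{BDD6EDE4-43E8-8961-9BC0-1314C027B3DA}</faqArticle>
  </include>
</documentOptions>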

Computed Fields

Next to a number of computed fields for the extra tagging and such, we also used computed fields to add the title and the description field two more times in the index. Why? Well, it's an easy way to copy a field (and apply some extra logic if needed). And we do need a copy. Well, copies actually.

The first copy will be set as a text_searchable field as we just created, the second copy will be a string field. Again, why?

As you will see in the next part of this blog, where we talk about querying the data, we will use only data from the index and not go to Sitecore to fetch anything. This means everything we want to return has to be in the index, and that is why we create a string field copy of our text fields - it's all about tokenizers ☺. The text_searchable copy is there to have an n-gram version as well.

I am not going to dive into the details of computed fields here - that has been documented enough already, and a simple copy of a field is really very basic.
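
Still, as a minimal sketch, such a copy field could look something like this (hypothetical code, assuming Sitecore passes the configuration node to the constructor so the referenceField attribute can be read):
using System.Xml;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Xml;

namespace X.Index
{
  public class CopyField : IComputedIndexField
  {
    private readonly string referenceField;

    public CopyField(XmlNode configNode)
    {
      // read the referenceField attribute from the config node (e.g. "title")
      referenceField = XmlUtil.GetAttribute("referenceField", configNode);
    }

    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
      // simply return the raw value of the referenced field
      var item = (indexable as SitecoreIndexableItem)?.Item;
      return item?[referenceField];
    }
  }
}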

Configuration

I will share the configuration parts to add the computed fields.
<fields hint="raw:AddComputedIndexField">
  <field fieldName="customtagname" type="Sitecore.XA.Foundation.Search.ComputedFields.ResolvedLinks, Sitecore.XA.Foundation.Search" returnType="stringCollection" referenceField="contenttype" contentField="title"/>
  ...
  <field fieldName="titlesearch" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionsearch" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
  <field fieldName="titlestring" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionstring" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
</fields>
This config will create all the computed index fields. Note that we are also using the ResolvedLinks from SXA to handle reference fields.
Adding the fields with the correct type to the field map:
<fieldMap ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration/fieldMap">
  <typeMatches hint="raw:AddTypeMatch">
    <typeMatch type="System.String" typeName="text_searchable" fieldNameFormat="{0}_txts" settingType="Sitecore.ContentSearch.SolrProvider.SolrSearchFieldConfiguration, Sitecore.ContentSearch.SolrProvider" />
  </typeMatches>
  <fieldNames hint="raw:AddFieldByFieldName">
    <field fieldName="titlesearch" returnType="text_searchable"/>
    <field fieldName="descriptionsearch" returnType="text_searchable"/>
  </fieldNames>
</fieldMap>  


Our index is ready now. In part 2 we will query this index to get the required results.


Thursday, May 8, 2025

SUGCON 2025 - the part with XM Cloud Content

SUGCON 2025 - XM Cloud Content

Save the best for last... that does not only apply to this blog post series, but also to Sugcon 2025 itself. Before we continue: as this is part three, I also have a part one and a part two 🙂


XM Cloud Content


Alistair Deneys had the honor of presenting us XM Cloud Content. 

Let's start with "what is XM Cloud Content?". It seems a very simple question, but to be honest I have seen a few different answers since the conference. After the presentation one would have said this is a new CMS, but some people who had heard about something like this before seem to think it's not a new product, rather the base for a brand new version of XM Cloud.

The session description on the Sugcon website tells me this:
XM Cloud Content is going to be an evolution of Content Hub ONE, Sitecore's fully managed headless CMS. Come and have a peek at this new product which is currently under development at Sitecore.
So for me, at the moment, we are looking at a new CMS. That would be great, as it has become very clear that XM Cloud as it is now is not suited for everyone, and Content Hub ONE is dead. A new product to fill that gap would be tremendous news, so let's just assume that is really what we saw here.


XM Cloud Content should become the result of all the knowledge Sitecore captured over the past years about creating a CMS - the good things, but certainly also the bad ones. "Learn from your mistakes" is a good credo, and we did see some of that already in the first basics of the newborn.


Foundation


For SaaS products we shouldn't care what is behind the wall; if it works well, it's OK. And although the architecture diagram probably doesn't mean much, it did help make a point about something that has been bothering Sitecore users for a long time: publishing. Publishing will still be possible here, but there will be no database transfers anymore, which should finally make it fast.

Probably a bit more interesting already is the domain model. 
  • Items will be defined by content types
  • Taxonomies will be used to classify items
  • Fragments can be used for re-usable content type parts, but as they can also be used in searches (and maybe security?) they become a vital part in composing a good content type structure


It all seems pretty simple and very similar to other equivalent products.


Queries

Queries are done with GraphQL and it seems we will get many options. One interesting option is querying on fragments, as that might avoid having to list content types in your query.


Note that the GraphQL schema is definitely not final yet (as is all the rest) and Alistair is looking for feedback on this part. 

There would also be a way to save GraphQL queries - a bit like stored procedures in a SQL database. For complex queries this could save quite a bit when sending the requests.

Demo 

The main part of the presentation was actually a demo - which is nice as this means something already exists and this is not just a theoretical exercise.


We did get a glimpse of a UI - completely in line with what all the Sitecore products should look like these days: clean, white, simple. The demo itself, however, was done entirely with the CLI.

If you can prepare scripts and JSON files, this all goes a bit smoother of course. We saw Alistair creating the content types and taxonomy, then creating some actual content to test with, and finally querying that content in several ways.

The demo went pretty well to be honest - one would wonder what he sacrificed to the demo gods 😈 


We were also introduced to the security aspects. Those looked pretty nice - you might think this is pretty common, but there are some CMS systems out there where this is not so trivial.

Anyway, it will be possible to restrict access via tokens based on several aspects, going from the publish state and the environment to types or fragments, and apparently even saved queries.


Conclusion 


I can only say I am really looking forward to this XM Cloud Content. It looks very promising. Hopefully Sitecore can really deliver this time, and can put a price on it that also suits smaller markets.

To be continued...  maybe on Symposium?


Friday, May 2, 2025

SUGCON 2025 - part two with XM Cloud

SUGCON 2025 - the story continues with XM Cloud

Make sure to read part 1 of the Sugcon 2025 saga...

Vercel

Let's start this second part with something maybe not completely Sitecore related, but very relevant to most current projects, so I was glad to see Vercel present at the conference. Not only could I have a nice chat with them at the booth, trying to get answers about their version and support strategy (which, to be honest, is still not clear to me), they also gave a session about optimizing Next.js - which was pretty interesting, even for someone like me who is not (yet) into that Next stuff at all.


Alex Hawley presented, in a clear and very comprehensible way, a few pitfalls and how to solve them. Very interesting for headless implementations of Sitecore (or even other platforms).


JSS - XM Cloud

This brings us to the next topic - proudly presented by Christian Hahn and Liz Nelson. The JSS SDK and starter kits have had a major cleanup. Note that we are talking about the XM Cloud version here: by decoupling it from the XP version, a lot became possible.

In general: they removed a lot of code, making the packages a lot smaller and faster to load. All developers will know that this is a very pleasant step to take. It brings a fresh start, and there are indeed more things on the roadmap to keep on improving.

It's nice to see some of the recommendations from the Vercel session coming back here in the new and improved JSS - or should we say Content SDK now...

As we are talking about XM Cloud, we cannot not mention Andy Cohen. His session was not really what I expected, but it was an eye-opener - as was his first implementation experience apparently 😐


XM Cloud - Marketplace

We are staying in the XM Cloud sphere with the session by Krassi Eneva and Justin Vogt about the marketplace Sitecore is creating for modules to extend XM Cloud. 

Well, actually they are talking about a hub to extend and customize Sitecore products. So it's not limited to XM Cloud, but the examples went in that direction and that makes sense. As you probably know, Sitecore heavily discourages custom code in your XM Cloud - something that, in the eyes of some, made the product not really "SaaS" in the beginning. Even though I am a developer who always liked extending everything, I do believe not putting custom code next to a product that should be SaaS is a good idea. We already had interaction points with several event webhooks; with this marketplace, we can also build real extensions into the editing UI.

There are a few levels in the marketplace - each will have a (slightly) different path to register your module. A single-tenant module can be used to extend the product for a single customer's use case; this can cover a very specific business need. The next step is the multi-tenant module, which is probably targeted at partners who want to build their own extensions and use them for multiple customers.
The public modules are available to everyone. They can be either free or paid, but they must be approved by Sitecore to make sure the quality is good and the look and feel is similar to the original product.

In order to achieve all this, there is an SDK.

Next

There is one more story to tell... but that will be for part three in this year's Sugcon series.


SUGCON 2025 - part one and AI

SUGCON 2025 - Antwerp, Belgium

Yeah, Sugcon in Belgium - which meant I could attend this edition. As it is always a pleasure to meet up with people from the Sitecore community, I was really looking forward to it, and it did not disappoint.

With the venue approximately 60 km from my doorstep, there was no hotel and flight for me. Instead I got a pleasant drive... up until you get a glimpse of the Antwerp skyline, which also means: traffic jams 🙂


But we got safe and sound to a very sunny Antwerp and a tremendous venue with a view of the zoo. It should be no surprise that there was another traffic jam when entering the main room for the keynote, as everyone was taking pictures.

So the venue is fine, the weather is good and the crowd very nice. Now we just need some good content as well. Ready for take-off...


Keynotes

I must admit that the keynotes were not really mind-blowing nor unforgettable. That is to be expected at an event like this - big announcements are for Symposium. Sugcon got simple, to-the-point keynotes with the usual AI buzzwords and mentions of the community, as one would expect at a community event.

It was nice to see all the familiar faces behind Dave O'Flanagan - with Jason in plain sight, but I'm also in there somewhere 🙂.


AI

Of course AI was present in many presentations - whether it was used to generate the images in the slides, as Jason Wilkerson did, or as the real main focus, as in the CCC demo from Morten Ljungberg.

Morten showed us a demo of the possibilities with Sitecore Stream - the free version and the paid version with brand awareness. It looked pretty good (although his CCC/K joke was unaware of Belgian history). The demo was nice and gave us a pretty good idea of how these things work.

Let's continue a bit more on the AI path as Vignesh Vishwanath woke us up on Friday with his talk on Stream in XP. 


He made it very clear that there are opportunities here. It also became clear that Stream is not only for the SaaS customers, but also available to those who didn't take that step (yet). There is a free tier that includes the basics - probably the biggest difference is brand awareness; if you want that, you will need the subscription version.

At the moment in the CMS you can mostly generate content, but there is a roadmap to increase the possibilities and add more functionality within the product. 


One of the things coming up is help with translating - but with a translation company as a sponsor, he also made it clear that these are AI translations and that those are not yet perfect 🙂

There were (lots) more interesting sessions of course. And also very interesting talks during the breaks. More of that in part two...