Thursday, June 12, 2025

Solr query with n-gram

A Search story with Solr N-Gram part 2


Querying our index

In part 1 of this search story I described the setup we did to create a custom Solr index in Sitecore that had a few fields with the n-gram tokenizer. 

A small recap: we are building a search over a bunch of similar Sitecore items that uses tagging but also free-text search in the title and description. We want to make sure users get results whenever possible, with the most relevant ones on top.

In this second part I will describe how we queried that index to get what we need. We are trying to use only Solr - not retrieving any data from Sitecore - as we want to be ready to move this solution out of the Sitecore environment some day. That is why we are not using the Sitecore search layer but the SolrNet library instead.
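Setting up SolrNet itself is out of scope for this post, but as a minimal sketch - the core URL, document class and registration below are assumptions for illustration, not the actual project code:
// Document class mapped to the Solr fields we return - a minimal sketch.
public class HelpDocument
{
  [SolrNet.Attributes.SolrUniqueKey("_uniqueid")]
  public string UniqueId { get; set; }

  [SolrNet.Attributes.SolrField("titlestring_s")]
  public string Title { get; set; }

  [SolrNet.Attributes.SolrField("descriptionstring_s")]
  public string Description { get; set; }
}

// Registration with the SolrNet.Microsoft.DependencyInjection package (assumed setup):
// services.AddSolrNet<HelpDocument>("https://mysolr:8983/solr/ourcustom_index");
// The code below then injects ISolrReadOnlyOperations<HelpDocument> (solrDocuments)
// and ISolrQuerySerializer (solrQuerySerializer).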


Query options and basics

Let's start easy with setting some query options.
var options = new QueryOptions
{
  Rows = parameters.Rows,
  StartOrCursor = new StartOrCursor.Start(parameters.Start)
};
We are just setting the parameters for paging here - number of rows and the start row.
var query = new List<ISolrQuery>()
  {
    new SolrQueryByField("_template", "bdd6ede443e889619bc01314c027b3da"),
    new SolrQueryByField("_language", language),
    new SolrQueryByField("_path", "5bbbd9fa6d764b01813f0cafd6f5de31")
  };
We start the query by setting the desired template, language and path.
We use SolrQueryInList with an IEnumerable to add the tagging parts to the query, but as that is not the most relevant part here I will not go into more detail. You can find all the information on querying with SolrNet in their docs on GitHub.
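Just to sketch the idea, a tag filter could look like this (the field name and the Tags parameter are hypothetical):
// Hypothetical tag filter - field name and parameters.Tags are illustrative only.
if (parameters.Tags != null && parameters.Tags.Any())
{
  query.Add(new SolrQueryInList("contenttype_sm", parameters.Tags));
}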


Search query

The next step and most interesting one is adding the search part to the query.
if (!string.IsNullOrEmpty(parameters.SearchTerm))
{
  var searchQuery = new List<ISolrQuery>()
  {
    new SolrQueryByField("titlestring_s", parameters.SearchTerm),
    new SolrQueryByField("descriptionstring_s", parameters.SearchTerm),
    new SolrQueryByField("titlesearch_txts", parameters.SearchTerm),
    new SolrQueryByField("descriptionsearch_txts", parameters.SearchTerm)
  };
  var search = new SolrMultipleCriteriaQuery(searchQuery, SolrMultipleCriteriaQuery.Operator.OR);
  query.Add(search);
  options.AddOrder(new SortOrder("score", Order.DESC));
  options.ExtraParams = new Dictionary<string, string>
  {
      { "defType", "edismax" },
      { "qf", "titlestring_s^9 descriptionstring_s^5 titlesearch_txts^2 descriptionsearch_txts" }
  };
}
else
{
  options.AddOrder(new SortOrder("__smallupdateddate_tdt", Order.DESC));
}

What are we doing here? First of all, we check if we actually have a search parameter. If we do not, we do not add any search query and keep the default sorting - the last updated date in our case.

But what if we do have a search string? We build a new Solr query that combines four field queries: we search in both the string and the n-gram version of the title and the description. We combine the field queries with an OR operator and add the result to the global Solr query.

We then set the sorting on the score field - the score calculated by Solr that indicates the relevancy of the result.

Lastly, we add extra parameters to enable the edismax query parser and the boosting we want to use. We boost the full string matches the most, and the title more than the description.

This delivers us the requirements we wanted:
  • search in title and description
  • get results as often as possible
  • show exact matches first
  • get the most relevant results on top


Wrap up

To wrap things up we combine everything and execute the query:
var q = new SolrMultipleCriteriaQuery(query, SolrMultipleCriteriaQuery.Operator.AND);
logger.LogDebug($"[Portal] Information center search: {solrQuerySerializer.Serialize(q)}");
var results = await solrDocuments.QueryAsync(q, options);
Besides gathering the results, note that we can also use the provided serializer to log our queries for debugging.

As a final remark, I do need to add that a search like this needs fine-tuning: tuning the size of the n-grams and the boost factors. Change the parameters (one at a time) and test until you get the results you want.

And that's it for this second and final part of this n-gram search series. As mentioned in the first post, this information is not new and most of it can be found in several docs and posts, but I thought it would be a good idea to bring it all together. Enjoy your search ;)

Wednesday, June 4, 2025

Search with Solr n-gram in Sitecore

A Search story with Solr N-Gram 

For a customer on Sitecore XM 10.2 we have a headless site running JSS with NextJS and a very specific search request. 
One section of their content is an unstructured bunch of help related articles - like a frequently asked questions section. This content is heavily tagged and contains quite a few items (in a bucket). We already had an application showing this data with the option to use the tags to filter and get to the required content. But now we also had to add free-text search.

There is nothing more frustrating than finding no results, especially when looking for help - so we want to give as many relevant results as possible, but of course with the most relevant on top.

Also note that we do not have a solution like Sitecore Search or Algolia at our disposal here. So we need to create something with basic Solr. 

As I gathered information from several resources and also found quite a bit of outdated information, this post seemed like a good idea. I will split it in two: a first part here on the Solr setup and a second post on the search code itself.

Solr N-Gram

To be able to (almost) always get results, we decided to use the N-Gram tokenizer.  An n-gram tokenizer splits text into overlapping sequences of characters of a specified length. This tokenizer is useful when you want to perform partial word matching because it generates substrings (character n-grams) of the original input text.
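For example, with a minimum gram size of 3 and a maximum of 5 (the sizes we will configure below), the word "invoice" is split into:
inv, nvo, voi, oic, ice        (3-grams)
invo, nvoi, voic, oice         (4-grams)
invoi, nvoic, voice            (5-grams)
A search term like "voice" - or even a fragment like "voi" - will therefore still match the document, which is exactly the partial matching we are after.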

Step 1 in the process is to create a field type in the Solr schema that will use this tokenizer. We will be using it on indexing and on querying, meaning the indexed value and the search string will be split into n-grams.

We could update the schema in Solr manually - but every time someone populated the index schema, our change would be gone.

Customize index schema population 

An article in the Sitecore documentation helped us customize the index schema population - which is exactly what we need. We took the code from https://doc.sitecore.com/xp/en/developers/latest/platform-administration-and-architecture/add-custom-fields-to-a-solr-schema.html and changed the relevant methods as follows:
private IEnumerable<XElement> GetAddCustomFields()
{
  yield return CreateField("*_txts",
    "text_searchable",
    isDynamic: true,
    required: false,
    indexed: true,
    stored: true,
    multiValued: false,
    omitNorms: false,
    termOffsets: false,
    termPositions: false,
    termVectors: false);
}
So we are creating a new dynamic field with the _txts suffix, of type "text_searchable", that will get indexed and stored.
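After populating the schema, this should show up in the Solr managed schema as a dynamic field definition along these lines:
<dynamicField name="*_txts" type="text_searchable" indexed="true" stored="true" multiValued="false"/>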

private IEnumerable<XElement> GetAddCustomFieldTypes()
{
  var fieldType = CreateFieldType("text_searchable", "solr.TextField",
    new Dictionary<string, string>
    {
      { "positionIncrementGap", "100" },
      { "multiValued", "false" },
    });
  var indexAnalyzer = new XElement("indexAnalyzer");
  indexAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(indexAnalyzer);
  
  var queryAnalyzer = new XElement("queryAnalyzer");
  queryAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.SynonymFilterFactory"), new XElement("synonyms", "synonyms.txt"), new XElement("ignoreCase", "true"), new XElement("expand", "true")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(queryAnalyzer);
  yield return fieldType;
}
Here we are adding the field type for text_searchable as a text field that uses the NGramTokenizerFactory. We are also setting the min and max gram size. These determine the minimum and maximum number of characters that are used to create the fragments of your text (check the Solr docs for more details).
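Once the schema is populated, the resulting field type in the managed schema should look roughly like this:
<fieldType name="text_searchable" class="solr.TextField" positionIncrementGap="100" multiValued="false">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>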

Don't forget to also add the factory class and the configuration patch, and that's it.
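The configuration patch swaps the default populate helper factory for ours. Roughly it looks like the snippet below - the factory class name is a placeholder, and you should check the linked Sitecore article for the exact processor and parameter definition for your version:
<!-- Approximate patch - factory class name is a placeholder; verify against the Sitecore article linked above. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <contentSearch.PopulateSolrSchema>
        <processor type="Sitecore.ContentSearch.SolrProvider.Pipelines.PopulateSolrSchema.PopulateFields, Sitecore.ContentSearch.SolrProvider">
          <param desc="schemaPopulateHelperFactory" type="X.Index.CustomPopulateHelperFactory, X" patch:instead="param[@desc='schemaPopulateHelperFactory']"/>
        </processor>
      </contentSearch.PopulateSolrSchema>
    </pipelines>
  </sitecore>
</configuration>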

We created a custom index for this purpose, in order to have a custom configuration with computed fields and such specific to this index - with a limited number of items. If we now populate the schema for that index, our n-gram field type is added.

Sitecore index configuration

As mentioned earlier we have a custom index configured.  This was done for 2 reasons:
  • setting the crawlers: plural, as we have two - one for each location where we have items that should be included in the application (see the sketch after this list)
  • custom index configuration: we wanted our own index configuration to be completely free to customize it just for this index, without consequences for all the others. The default Solr configuration is referenced, though, so we don't need to copy all the basics:
    <ourcustomSolrIndexConfiguration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration">
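A sketch of what the crawler part of such an index definition looks like (the content paths are placeholders):
<locations hint="list:AddCrawler">
  <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
    <Database>web</Database>
    <Root>/sitecore/content/site/home/help</Root>
  </crawler>
  <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
    <Database>web</Database>
    <Root>/sitecore/content/site/data/faq</Root>
  </crawler>
</locations>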
In order to get what we need in the index, we configure:
  • AddIncludedTemplate: list the templates to be added to the index (see the example after this list)
  • AddComputedIndexField: all computed fields to be added to the index
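For example, an included-templates list could look something like this (the element name is free to choose; the GUID shown is the template we also filter on in the query in part 2):
<include hint="list:AddIncludedTemplate">
  <helpArticle>{BDD6EDE4-43E8-8961-9BC0-1314C027B3DA}</helpArticle>
</include>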

Computed Fields

Next to a number of computed fields for the extra tagging and such, we also used computed fields to add the title and the description field two more times in the index. Why? Well, it's an easy way to copy a field (and apply some extra logic if needed). And we do need a copy. Well, copies actually.

The first copy will be set as a text_searchable field as we just created, the second copy will be a string field. Again, why?

As you will see in the next part of this blog, where we talk about querying the data, we will use all data from the index and not go to Sitecore to fetch anything. This means we need everything we want to return in the index, and that is why we are creating a string field copy of our text fields. It's all about tokenizers ☺. The text_searchable copy is there to have an n-gram version as well.

I am not going to share code for a computed field here - that has been documented enough already and a simple copy of a field is really very basic. 
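Just to give an idea anyway, a bare-bones copy field could look roughly like the sketch below - the class name and namespace match the configuration further down, but the implementation (reading the source field name from the referenceField attribute on the configuration node) is an assumption, not the actual project code.
using System.Xml;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

namespace X.Index
{
  // Bare-bones sketch of a copy computed field: it re-indexes the value of another
  // field under a new name so it can be mapped to a different Solr field type.
  public class CopyField : IComputedIndexField
  {
    private readonly string referenceField;

    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    // Assumption: the referenceField attribute on the configuration node
    // tells us which field to copy.
    public CopyField(XmlNode configurationNode)
    {
      referenceField = configurationNode?.Attributes?["referenceField"]?.Value;
    }

    public object ComputeFieldValue(IIndexable indexable)
    {
      Item item = indexable as SitecoreIndexableItem;
      if (item == null || string.IsNullOrEmpty(referenceField))
      {
        return null;
      }

      var value = item[referenceField];
      return string.IsNullOrEmpty(value) ? null : value;
    }
  }
}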

Configuration

I will share the configuration parts to add the computed fields.
<fields hint="raw:AddComputedIndexField">
  <field fieldName="customtagname" type="Sitecore.XA.Foundation.Search.ComputedFields.ResolvedLinks, Sitecore.XA.Foundation.Search" returnType="stringCollection" referenceField="contenttype" contentField="title"/>
  ...
  <field fieldName="titlesearch" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionsearch" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
  <field fieldName="titlestring" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionstring" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
</fields>
This config will create all the computed index fields. Note that we are also using the ResolvedLinks from SXA to handle reference fields.
Adding the fields with the correct type to the field map:
<fieldMap ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration/fieldMap">
  <typeMatches hint="raw:AddTypeMatch">
    <typeMatch type="System.String" typeName="text_searchable" fieldNameFormat="{0}_txts" settingType="Sitecore.ContentSearch.SolrProvider.SolrSearchFieldConfiguration, Sitecore.ContentSearch.SolrProvider" />
  </typeMatches>
  <fieldNames hint="raw:AddFieldByFieldName">
    <field fieldName="titlesearch" returnType="text_searchable"/>
    <field fieldName="descriptionsearch" returnType="text_searchable"/>
  </fieldNames>
</fieldMap>  


Our index is ready now. In part 2 we will query this index to get the required results.