
Thursday, June 12, 2025

Solr query with n-gram

A Search story with Solr N-Gram part 2


Querying our index

In part 1 of this search story I described the setup we did to create a custom Solr index in Sitecore that had a few fields with the n-gram tokenizer. 

A small recap: we are building a search over a set of similar Sitecore items that combines tag filtering with free-text search in the title and description. We want users to get results whenever possible, with the most relevant ones on top.

In this second part I will describe how we query that index to get what we need. We use only Solr - not retrieving any data from Sitecore - as we want to be able to move this solution out of the Sitecore environment some day. That is why we are not using the Sitecore search layer, but the SolrNet library instead.


Query options and basics

Let's start easy with setting some query options.
var options = new QueryOptions
{
  Rows = parameters.Rows,
  StartOrCursor = new StartOrCursor.Start(parameters.Start)
};
We are just setting the paging parameters here - the number of rows and the start row.
var query = new List<ISolrQuery>()
  {
    new SolrQueryByField("_template", "bdd6ede443e889619bc01314c027b3da"),
    new SolrQueryByField("_language", language),
    new SolrQueryByField("_path", "5bbbd9fa6d764b01813f0cafd6f5de31")
  };
We start the query by filtering on the desired template, language and path.
We use SolrQueryInList with an IEnumerable to add the tagging parts to the query, but as that is not the most relevant part here I will not go into more detail. You can find all the information on querying with SolrNet in their docs on GitHub.


Search query

The next step and most interesting one is adding the search part to the query.
if (!string.IsNullOrEmpty(parameters.SearchTerm))
{
  var searchQuery = new List<ISolrQuery>()
  {
    new SolrQueryByField("titlestring_s", parameters.SearchTerm),
    new SolrQueryByField("descriptionstring_s", parameters.SearchTerm),
    new SolrQueryByField("titlesearch_txts", parameters.SearchTerm),
    new SolrQueryByField("descriptionsearch_txts", parameters.SearchTerm)
  };
  var search = new SolrMultipleCriteriaQuery(searchQuery, SolrMultipleCriteriaQuery.Operator.OR);
  query.Add(search);
  options.AddOrder(new SortOrder("score", Order.DESC));
  options.ExtraParams = new Dictionary<string, string>
  {
      { "defType", "edismax" },
      { "qf", "titlestring_s^9 descriptionstring_s^5 titlesearch_txts^2 descriptionsearch_txts" }
  };
}
else
{
  options.AddOrder(new SortOrder("__smallupdateddate_tdt", Order.DESC));
}

What are we doing here? First of all, we check whether we actually have a search term. If we do not, we add no search query and keep the default sorting - the last updated date in our case.

But what if we do have a search string? We create a new Solr query that combines four field queries: the string and n-gram versions of both the title and the description. We combine the field queries with an OR operator and add the result to the global Solr query.

We then sort on the score field - the relevancy score calculated by Solr, indicating how well each result matches.

Lastly, we add extra parameters to enable the edismax boosting we want to use. We boost full string matches the most, and title matches more than description matches.
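For reference, with a hypothetical search term like "install", the request this code builds boils down to roughly these raw Solr parameters (a sketch - the exact serialized query syntax may differ slightly; the field names are the ones from our index):

```
q=(titlestring_s:install OR descriptionstring_s:install OR titlesearch_txts:install OR descriptionsearch_txts:install)
defType=edismax
qf=titlestring_s^9 descriptionstring_s^5 titlesearch_txts^2 descriptionsearch_txts
sort=score desc
```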

This delivers us the requirements we wanted:
  • search in title and description
  • get results as often as possible
  • show exact matches first
  • get the most relevant results on top


Wrap up

To wrap things up we combine everything and execute the query:
var q = new SolrMultipleCriteriaQuery(query, SolrMultipleCriteriaQuery.Operator.AND);
logger.LogDebug($"[Portal] Information center search: {solrQuerySerializer.Serialize(q)}");
var results = await solrDocuments.QueryAsync(q, options);
Next to gathering the results, note that we can also use the provided serializer to log our queries for debugging.

As a final remark: a search like this needs fine-tuning - tuning the n-gram sizes and the boost factors. Change the parameters (one at a time) and test until you get the results you want.

And that's it for this second and final part of this n-gram search series. As mentioned in the first post, this information is not new and most of it can be found in various docs and posts, but I thought it would be a good idea to bring it all together. Enjoy your search ;)

Wednesday, June 4, 2025

Search with Solr n-gram in Sitecore

A Search story with Solr N-Gram 

For a customer on Sitecore XM 10.2 we have a headless site running JSS with NextJS and a very specific search request. 
One section of their content is an unstructured bunch of help-related articles - like a frequently asked questions section. This content is heavily tagged and contains quite a few items (in a bucket). We already had an application showing this data with the option to filter on the tags to get to the required content. But now we also had to add free-text search.

There is nothing more frustrating than finding no results, especially when looking for help - so we want to return as many relevant results as possible, and of course the most relevant on top.

Also note that we do not have a solution like Sitecore Search or Algolia at our disposal here. So we need to create something with basic Solr. 

As I gathered information from several resources and also found quite a bit of outdated information, this post seemed like a good idea. I will split it in two: a first part here on the Solr setup and a second post on the search code itself.

Solr N-Gram

To be able to (almost) always return results, we decided to use the n-gram tokenizer. An n-gram tokenizer splits text into overlapping sequences of characters of a specified length. This tokenizer is useful when you want to perform partial word matching, because it generates substrings (character n-grams) of the original input text.
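To make this concrete, here is a small Python sketch (my own illustration, not Solr code) of what a character n-gram tokenizer with minGramSize=3 and maxGramSize=5 produces. Note how the grams of a partial query like "arch" overlap the grams of the indexed word "search" - that overlap is what makes partial matching work:

```python
def char_ngrams(text, min_size=3, max_size=5):
    """All substrings of length min_size..max_size, like solr.NGramTokenizerFactory
    (the ordering of Solr's actual token stream may differ)."""
    return [text[start:start + size]
            for size in range(min_size, max_size + 1)
            for start in range(len(text) - size + 1)]

indexed = set(char_ngrams("search"))   # sea, ear, arc, rch, sear, earc, arch, searc, earch
query = set(char_ngrams("arch"))       # arc, rch, arch
print(sorted(indexed & query))         # shared grams -> a match: ['arc', 'arch', 'rch']
```

With both the indexed value and the search string tokenized this way, any shared gram produces a hit, and more shared grams mean a higher score.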

Step 1 in the process is to create a field type in the Solr schema that will use this tokenizer. We will be using it on indexing and on querying, meaning the indexed value and the search string will be split into n-grams.

We could update the schema in Solr manually - but every time someone populated the index schema, our change would be gone.

Customize index schema population 

An article in the Sitecore documentation helped us to customize the index schema population - which is exactly what we need. We took the code from https://doc.sitecore.com/xp/en/developers/latest/platform-administration-and-architecture/add-custom-fields-to-a-solr-schema.html and changed the relevant methods as follows:
private IEnumerable<XElement> GetAddCustomFields()
{
  yield return CreateField("*_txts",
    "text_searchable",
    isDynamic: true,
    required: false,
    indexed: true,
    stored: true,
    multiValued: false,
    omitNorms: false,
    termOffsets: false,
    termPositions: false,
    termVectors: false);
}
So we are creating a new dynamic field definition with the suffix _txts, of type text_searchable, that will get indexed and stored.
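In Solr's managed schema, that should end up as a dynamic field definition along these lines (a sketch of the expected result, not code we wrote):

```xml
<dynamicField name="*_txts" type="text_searchable" indexed="true" stored="true"
              multiValued="false" omitNorms="false"/>
```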

private IEnumerable<XElement> GetAddCustomFieldTypes()
{
  var fieldType = CreateFieldType("text_searchable", "solr.TextField",
    new Dictionary<string, string>
    {
      { "positionIncrementGap", "100" },
      { "multiValued", "false" },
    });
  var indexAnalyzer = new XElement("indexAnalyzer");
  indexAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(indexAnalyzer);
  
  var queryAnalyzer = new XElement("queryAnalyzer");
  queryAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.SynonymFilterFactory"), new XElement("synonyms", "synonyms.txt"), new XElement("ignoreCase", "true"), new XElement("expand", "true")));
  queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
  fieldType.Add(queryAnalyzer);
  yield return fieldType;
}
Here we add the field type text_searchable as a text field that uses the NGramTokenizerFactory, setting the minimum and maximum gram size. These determine the smallest and largest number of characters used to create the fragments of your text (check the Solr docs for more details).
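For reference, the field type these methods populate should end up in the managed schema looking roughly like this (a sketch of the expected result - the exact serialization Solr stores may differ):

```xml
<fieldType name="text_searchable" class="solr.TextField" positionIncrementGap="100" multiValued="false">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```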

Don't forget to also add the factory class and the configuration patch - and that's it.

We created a custom index for this purpose in order to have a custom configuration with computed fields and other settings specific to this index - with a limited number of items. If we now populate the schema for that index, our n-gram field type is added.

Sitecore index configuration

As mentioned earlier we have a custom index configured. This was done for two reasons:
  • setting the crawlers: plural, as we have two - one for each location containing items that should be included in the application
  • custom index configuration: we wanted our own index configuration so we are completely free to customize it just for this index without consequences for all the others. The default Solr configuration is referenced, so we don't need to copy all the basics though
    <ourcustomSolrIndexConfiguration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration">
In order to get what we need in the index, we configure:
  • AddIncludedTemplate: list the templates to be added in the index
  • AddComputedIndexField: all computed fields to be added in the index

Computed Fields

Next to a number of computed fields for the extra tagging and such, we also used computed fields to add the title and the description field two more times to the index. Why? Well, it's an easy way to copy a field (and apply some extra logic if needed). And we do need a copy. Well, copies actually.

The first copy will be set as a text_searchable field as we just created, the second copy will be a string field. Again, why?

As you will see in the next part of this blog, where we talk about querying the data, we will use only data from the index and not go to Sitecore to fetch anything. This means everything we want to return must be in the index, and that is why we create a string field copy of our text fields. It's all about tokenizers☺. The text_searchable copy is there to have an n-gram version as well.

I am not going to share code for a computed field here - that has been documented enough already and a simple copy of a field is really very basic. 

Configuration

I will share the configuration parts to add the computed fields.
<fields hint="raw:AddComputedIndexField">
  <field fieldName="customtagname" type="Sitecore.XA.Foundation.Search.ComputedFields.ResolvedLinks, Sitecore.XA.Foundation.Search" returnType="stringCollection" referenceField="contenttype" contentField="title"/>
  ...
  <field fieldName="titlesearch" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionsearch" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
  <field fieldName="titlestring" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionstring" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
</fields>
This config will create all the computed index fields. Note that we are also using the ResolvedLinks from SXA to handle reference fields.
Adding the fields with the correct type to the field map:
<fieldMap ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration/fieldMap">
  <typeMatches hint="raw:AddTypeMatch">
    <typeMatch type="System.String" typeName="text_searchable" fieldNameFormat="{0}_txts" settingType="Sitecore.ContentSearch.SolrProvider.SolrSearchFieldConfiguration, Sitecore.ContentSearch.SolrProvider" />
  </typeMatches>
  <fieldNames hint="raw:AddFieldByFieldName">
    <field fieldName="titlesearch" returnType="text_searchable"/>
    <field fieldName="descriptionsearch" returnType="text_searchable"/>
  </fieldNames>
</fieldMap>  


Our index is ready now. In part 2 we will query this index to get the required results.


Wednesday, February 15, 2023

XMCloud SxaStarter local setup - solr-init issue

 XMCloud - solr-init error on a local container setup with SxaStarter 


Headless SXA

I wanted to try the new headless SXA. As this can be installed on Sitecore 10.3 I installed that version locally and installed SXA on top (tip: a good blog post from Dan about this setup). That went fine, but I bumped into issues with SXA. I asked on StackExchange and Slack but nobody seemed to know the answer (if you do, please answer the question on SSE). I was going to open a support ticket but as this was just a test... well, you know... I'll do that when I need it in a project.

I heard there were differences between SXA in Sitecore 10.3 and the version in XM Cloud. So I decided to try that one.

XM Cloud

As XM Cloud is sort of a SaaS solution, there is some discussion about why people would install it locally, and it does make sense not to do so - but I was in test and mess-around mode, so let's do it on a local setup (just because we can).

As I'm no expert (yet) in containers nor nextJS and such, I was looking for a simple way to do this. With the information I gathered during various sessions about XM Cloud, that should be possible.

The setup 

I found a blog post from Serge van den Oever who already succeeded in such a setup and documented it very well - thanks Serge for the very informative blog.

The setup seemed to go very well... it takes a while to download all the images, but in the end I had a site that started and I could create the SXA site. Even the creation of the rendering container went fine - some issues mentioned in Serge's post seem to be gone, some are still there. But I ended up with a running environment.

Even Headless SXA worked. So the issues I had on my local XM were not present in this XM Cloud version. Hooray, cheers, all happy...   so why am I writing this post?

The issue

Of course, there had to be an issue. The next day the containers wouldn't start anymore. Solr was going crazy, and that means nothing works. Well, in fact, solr-init won't run because Solr isn't healthy.

I started two tracks: as I had another laptop, I tried the same installation there, and at the same time started searching on Sitecore StackExchange, Google... The second installation on the other laptop worked. Once 😞 After that, the result was the same: solr-init would fail when trying to restart the containers.

In the meantime my search quest led me to a blog post from Jeremy Davis that didn't sound very hopeful - as you can read in his blog, he had the same issue with Sitecore containers (not XM Cloud) and had already tried several things. So I was not going to try all of those again - and on my machine his workaround "docker network prune" didn't work.

But he also mentions an alternative provided by Rob Ahnemann. So that is a third Sitecore MVP bumping into the same issue 😨 Rob notes that the issue could be solved by removing ZooKeeper - so actually using Solr standalone instead of SolrCloud. This sounds reasonable, but it requires a change to the solr-init. As I'm no Solr expert (nor a container expert), I'm very glad Rob provided a full-fledged solution on Docker Hub.

So, let's try this. If you want to do this as well, read his post to understand the options for the solr-init image. His example is for XM, so we have to make a few small changes to get this working for XM Cloud. It results in a solr-init section in the docker-compose yml like this:
solr-init:
    isolation: ${ISOLATION}
    image: rahnemann/solr-init:1.0-ltsc2019
    environment:
      TOPOLOGY: xm-sxa
      SITECORE_SOLR_CONNECTION_STRING: http://solr:8983/solr
      SOLR_CORE_PREFIX_NAME: ${SOLR_CORE_PREFIX_NAME}
      ADDITIONAL_SITECORE_CORES: _horizon_index
    volumes:
      - type: bind
        source: ${LOCAL_DATA_PATH}\solr
        target: c:\solr
So we are using the image from Rob Ahnemann here. We kept the topology xm-sxa as that fits our purpose, but we have to add an additional Sitecore core for (oh boy) "Horizon". This is the Pages editor (which is not Horizon, but actually is).

You will also need to change the Solr mode and the Solr connection string:
solr:
    ...
    environment:
      SOLR_MODE: standalone
cm:
    ...
    environment:
      ...
      Sitecore_ConnectionStrings_Solr.Search: http://solr:8983/solr
    ...
After applying these changes, I can now run the containers again. More than once. 

It doesn't feel like a very decent solution. And I guess lots of people will not have these issues - but as some people did and blogged about it, I assume there are more out there who bump into this, and maybe this post about my experience can save you some time if you do.

Tuesday, January 9, 2018

Custom Sitecore DocumentOptions with Solr

Almost 2 years ago I wrote a post about using custom indexes in a Helix environment. That post is still accurate, but the code was based on Lucene. As we are now all moving towards using Solr with our (non-PaaS) Sitecore setups, I thought it might be a good idea to bring this topic back to the table with a Solr example this time.

(custom) indexes

I am assuming that you know about Helix, and about custom indexes. If you ever created a custom index you probably have used the documentOptions configuration section - maybe without noticing. It is used to include and/or exclude fields and templates and define computed fields. So you probably used it :)

And it wouldn't be Sitecore if we couldn't customize this...

Our own documentOptions

Why? Because we can. No... we might have a good reason, like making our custom index definitions (more) Helix compliant. Normally your feature will not have a clue about "page" templates. But what if you want to define the included templates in your index? Those could be page templates... or at least templates that inherit from your feature template. That is why I built my own documentOptions - to add a way to include templates derived from a given base template.

Configuration

So the idea now is to create a custom document options class by inheriting from the SolrDocumentBuilderOptions. We add a new method to allow adding templates in a new section with included base templates. This will not break any other existing configuration sections.

An example config looks like:
<documentOptions type="YourNamespace.TestOptions, YourAssembly">
    <indexAllFields>true</indexAllFields>
    <include hint="list:AddIncludedBaseTemplate">
        <BaseTemplate1>{B6FADEA4-61EE-435F-A9EF-B6C9C3B9CB2E}</BaseTemplate1>
    </include>
</documentOptions>
This looks very familiar - as intended. We create a new include section with the hint "list:AddIncludedBaseTemplate". The name 'AddIncludedBaseTemplate' will come back later in our code.

Code

AddIncludedBaseTemplate

public virtual void AddIncludedBaseTemplate(string templateId)
{
  Assert.ArgumentNotNull(templateId, nameof(templateId));
  Assert.IsTrue(ID.TryParse(templateId, out ID id), "Configuration: AddIncludedBaseTemplate entry is not a valid GUID. Template ID Value: " + templateId);
  // Add an include filter for every template that inherits from this base template
  foreach (var linkedId in GetLinkedTemplates(id))
  {
    AddTemplateFilter(linkedId, true);
  }
}
To see the rest of the code, I refer to the original post as nothing has to be changed to that in order to make it work on Solr (instead of Lucene).

Conclusion

To change the code from the Lucene example to a Solr one, we just had to change the base class to SolrDocumentBuilderOptions. 
We are now again able to configure our index to only use templates that inherit from our base templates. Still cool. And remember you can easily re-use this logic to create other document options to tweak your index behavior.