A Search story with Solr N-Gram
For a customer on Sitecore XM 10.2 we have a headless site running JSS with Next.js, and a very specific search requirement.
One section of their content is an unstructured bunch of help related articles - like a frequently asked questions section. This content is heavily tagged and contains quite a few items (in a bucket). We already had an application showing this data with the option to use the tags to filter and get to the required content. But now we also had to add free text search.
There is nothing more frustrating than finding no results, especially when looking for help - so we want to return as many relevant results as possible, with the most relevant on top of course.
Also note that we do not have a solution like Sitecore Search or Algolia at our disposal here. So we need to create something with basic Solr.
As I gathered information from several resources and also found quite a bit of outdated information, this post seemed like a good idea. I will split it in two: this first part covers the Solr setup, and a second post will cover the search code itself.
Solr N-Gram
To be able to (almost) always get results, we decided to use the N-Gram tokenizer. An n-gram tokenizer splits text into overlapping sequences of characters of a specified length. This tokenizer is useful when you want to perform partial word matching because it generates substrings (character n-grams) of the original input text.
Step 1 in the process is to create a field type in the Solr schema that will use this tokenizer. We will be using it on indexing and on querying, meaning the indexed value and the search string will be split into n-grams.
We could update the schema in Solr (manually) - but every time someone populated the index schema, our change would be gone.
Customize index schema population
An article in the Sitecore documentation helped us to customize the index schema population - which is exactly what we need. We took the code from https://doc.sitecore.com/xp/en/developers/latest/platform-administration-and-architecture/add-custom-fields-to-a-solr-schema.html and changed the relevant methods as follows:
private IEnumerable<XElement> GetAddCustomFields()
{
    yield return CreateField("*_txts",
        "text_searchable",
        isDynamic: true,
        required: false,
        indexed: true,
        stored: true,
        multiValued: false,
        omitNorms: false,
        termOffsets: false,
        termPositions: false,
        termVectors: false);
}
So we are creating a new dynamic field with the extension _txts that uses the text_searchable field type and gets indexed and stored.
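Once the schema is populated, this should roughly translate to a dynamic field entry like the following in the Solr managed schema (plain schema XML, shown here just for illustration):
<dynamicField name="*_txts" type="text_searchable" indexed="true" stored="true" multiValued="false" omitNorms="false"/>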
private IEnumerable<XElement> GetAddCustomFieldTypes()
{
    var fieldType = CreateFieldType("text_searchable", "solr.TextField",
        new Dictionary<string, string>
        {
            { "positionIncrementGap", "100" },
            { "multiValued", "false" },
        });

    // Index-time analysis: split the indexed text into n-grams of 3 to 5 characters.
    var indexAnalyzer = new XElement("indexAnalyzer");
    indexAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
    indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
    indexAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
    fieldType.Add(indexAnalyzer);

    // Query-time analysis: the same n-grams, plus synonym expansion on the search string.
    var queryAnalyzer = new XElement("queryAnalyzer");
    queryAnalyzer.Add(new XElement("tokenizer", new XElement("class", "solr.NGramTokenizerFactory"), new XElement("minGramSize", "3"), new XElement("maxGramSize", "5")));
    queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.StopFilterFactory"), new XElement("ignoreCase", "true"), new XElement("words", "stopwords.txt")));
    queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.SynonymFilterFactory"), new XElement("synonyms", "synonyms.txt"), new XElement("ignoreCase", "true"), new XElement("expand", "true")));
    queryAnalyzer.Add(new XElement("filters", new XElement("class", "solr.LowerCaseFilterFactory")));
    fieldType.Add(queryAnalyzer);

    yield return fieldType;
}
Here we are adding the field type text_searchable as a text field that uses the NGramTokenizerFactory. We also set the minimum and maximum gram size; these determine the smallest and largest number of characters used to create the fragments of your text (check the Solr docs for more details). With minGramSize 3 and maxGramSize 5, for example, the word "search" is split into sea, sear, searc, ear, earc, earch, arc, arch and rch.
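To make this concrete, the field type that should end up in the Solr managed schema looks roughly like this (plain schema XML for illustration - verify against your own managed schema after populating):
<fieldType name="text_searchable" class="solr.TextField" positionIncrementGap="100" multiValued="false">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>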
Don't forget to also add the factory class and the configuration patch from the documentation article, and that's it.
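For completeness, the patch that swaps in the custom populate helper factory follows the pattern from the linked article and looks roughly like this - a sketch only, with X.Index.CustomPopulateHelperFactory and the assembly name as placeholders; check the article for the exact processor and parameter definition in your version:
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:search="http://www.sitecore.net/xmlconfig/search/">
  <sitecore search:require="solr">
    <pipelines>
      <contentSearch.PopulateSolrSchema>
        <processor type="Sitecore.ContentSearch.SolrProvider.Pipelines.PopulateSolrSchema.PopulateFields, Sitecore.ContentSearch.SolrProvider">
          <!-- Placeholder factory class returning the custom populate helper; depending on your version you may need patch:instead to replace the default param -->
          <param type="X.Index.CustomPopulateHelperFactory, X" />
        </processor>
      </contentSearch.PopulateSolrSchema>
    </pipelines>
  </sitecore>
</configuration>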
We created a custom index for this purpose, in order to have a custom configuration with computed fields and such, specific to this index - which only holds a limited number of items. If we now populate the schema for that index (using the Populate Solr Managed Schema option in the Control Panel), our n-gram field type is added.
Sitecore index configuration
As mentioned earlier we have a custom index configured. This was done for 2 reasons:
- setting the crawlers: plural, as we have two - one for each location where we have items that should be included in the application (see the index definition sketch below)
- custom index configuration: we wanted our own index configuration so we are completely free to customize it just for this index, without consequences for all the others. The default Solr configuration is referenced though, so we don't need to copy all the basics:
<ourcustomSolrIndexConfiguration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration">
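For context, the index definition that wires the two crawlers to this custom configuration looks roughly like this (a sketch - the index id, root paths and update strategy are placeholders):
<index id="ourcustom_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
  <param desc="name">$(id)</param>
  <param desc="core">$(id)</param>
  <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" />
  <configuration ref="contentSearch/indexConfigurations/ourcustomSolrIndexConfiguration" />
  <strategies hint="list:AddStrategy">
    <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync" />
  </strategies>
  <locations hint="list:AddCrawler">
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/content/Site/Home/Help</Root>
    </crawler>
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/content/Site/Data/FAQ</Root>
    </crawler>
  </locations>
</index>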
In order to get what we need in the index, we configure:
- AddIncludedTemplate: lists the templates whose items should be added to the index (see the sketch below)
- AddComputedIndexField: all computed fields to be added to the index
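The included templates live in the documentOptions of the custom configuration; a minimal sketch, assuming the standard Solr documentOptions type and with the template name and GUID as placeholders:
<documentOptions type="Sitecore.ContentSearch.SolrProvider.SolrDocumentBuilderOptions, Sitecore.ContentSearch.SolrProvider">
  <include hint="list:AddIncludedTemplate">
    <HelpArticle>{00000000-0000-0000-0000-000000000000}</HelpArticle>
  </include>
</documentOptions>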
Computed Fields
Next to a number of computed fields for the extra tagging and such, we also used computed fields to add the title and the description field two more times to the index. Why? Well, it's an easy way to copy a field (and apply some extra logic if needed). And we do need a copy. Well, copies actually.
The first copy will be set as a text_searchable field as we just created, the second copy will be a string field. Again, why?
As you will see in the next part of this blog, where we talk about querying the data, we will use all data from the index and not go to Sitecore to fetch anything. This means everything we want to return needs to be in the index, and that is why we are creating a string field copy of our text fields. It's all about tokenizers ☺. The text_searchable copy is there to have an n-gram version as well.
I am not going to share code for a computed field here - that has been documented enough already and a simple copy of a field is really very basic.
Configuration
I will share the configuration parts to add the computed fields.
<fields hint="raw:AddComputedIndexField">
  <field fieldName="customtagname" type="Sitecore.XA.Foundation.Search.ComputedFields.ResolvedLinks, Sitecore.XA.Foundation.Search" returnType="stringCollection" referenceField="contenttype" contentField="title"/>
  ...
  <field fieldName="titlesearch" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionsearch" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
  <field fieldName="titlestring" type="X.Index.CopyField, X" returnType="string" referenceField="title" />
  <field fieldName="descriptionstring" type="X.Index.CopyField, X" returnType="string" referenceField="description" />
</fields>
This config will create all the computed index fields. Note that we are also using the ResolvedLinks computed field from SXA to handle reference fields.
Next, we add the fields with the correct type to the field map:
<fieldMap ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration/fieldMap">
  <typeMatches hint="raw:AddTypeMatch">
    <typeMatch type="System.String" typeName="text_searchable" fieldNameFormat="{0}_txts" settingType="Sitecore.ContentSearch.SolrProvider.SolrSearchFieldConfiguration, Sitecore.ContentSearch.SolrProvider" />
  </typeMatches>
  <fieldNames hint="raw:AddFieldByFieldName">
    <field fieldName="titlesearch" returnType="text_searchable"/>
    <field fieldName="descriptionsearch" returnType="text_searchable"/>
  </fieldNames>
</fieldMap>
Because of the type match, the titlesearch and descriptionsearch computed fields end up in Solr as titlesearch_txts and descriptionsearch_txts, analyzed with our n-gram field type. Our index is ready now. In part 2 we will query this index to get the required results.