Wednesday, October 5, 2016

Custom Sitecore index crawler

Why? - The case

We needed to find items in the master database quickly, based on some criteria. Language didn't matter and we didn't need the actual item, a few properties would do. So we decided to create a custom index with the fields we needed indexed (the criteria) and/or stored (the properties). All fine, but as we were using the master database we had different versions of each item in the index. As we needed only the latest version in only one language we though we could optimize the index to only contain those versions.

First attempt

We had to override the SitecoreItemCrawler. That was clear. The first attempt was creating a custom IsExcludedFromIndex function that would stop all entries not in English and not the latest version. Pretty simple, but does not work.
First of all, all entries were in English.. this function is only called per item and not per version. So actually, we could not use this. Furthermore, we did not take into account the fact that when adding a new version, we have to remove the previously indexed one.

Don't index multiple versions

I started searching the internet and found this post by Martin English on (not) indexing multiple versions. Great post, and pointing as well towards a solution with inbound filters. But as those filters work for every index, that would be no solution here. I needed it configurable per index. So back to Martins' post. We had to overwrite DoAdd and DoUpdate.

A custom index crawler

The result was a bit different as I was using Sitecore 8.1 and also wanted to include a language filter. I checked the code from the original SitecoreItemCrawler, created a class overwriting it and adapted where needed.

Language

I made the language configurable by putting it into a property:
private string indexLanguage;

public string Language
{
  get  {  return !string.IsNullOrEmpty(indexLanguage) ? indexLanguage : null; }
  set  {  indexLanguage = value; }
}

DoAdd

The DoAdd method was changed by adding an early check in the language-loop to get out when not the requested language. I also removed the version-loop with a request for the latest version so that only that version gets send to the index.
protected override void DoAdd(IProviderUpdateContext context, SitecoreIndexableItem indexable)
{
  Assert.ArgumentNotNull(context, "context");
  Assert.ArgumentNotNull(indexable, "indexable");
  using (new LanguageFallbackItemSwitcher(context.Index.EnableItemLanguageFallback))
  {
    Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:adding", context.Index.Name, indexable.UniqueId, indexable.AbsolutePath);
    if (!IsExcludedFromIndex(indexable, false))
    {
      foreach (var language in indexable.Item.Languages)
      {
        // only include English
        if (!language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
        {
          continue;
        }

        Item item;
        using (new WriteCachesDisabler())
        {
          item = indexable.Item.Database.GetItem(indexable.Item.ID, language, Version.Latest);
        }

        if (item == null)
        {
          CrawlingLog.Log.Warn(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : AddItem : Could not build document data {0} - Latest version could not be found. Skipping.", indexable.Item.Uri));
        }
        else
        {
          SitecoreIndexableItem sitecoreIndexableItem;
          using (new WriteCachesDisabler())
          {
            // only latest version
            sitecoreIndexableItem = item.Versions.GetLatestVersion();
          }

          if (sitecoreIndexableItem != null)
          {
            IIndexableBuiltinFields indexableBuiltinFields = sitecoreIndexableItem;
            indexableBuiltinFields.IsLatestVersion = indexableBuiltinFields.Version == item.Version.Number;
            sitecoreIndexableItem.IndexFieldStorageValueFormatter = context.Index.Configuration.IndexFieldStorageValueFormatter;
            Operations.Add(sitecoreIndexableItem, context, index.Configuration);
          }
        }
      }
    }

    Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:added", context.Index.Name, indexable.UniqueId, indexable.AbsolutePath);
  }
}


DoUpdate

For the DoUpdate method I did something similar although I had to change a bit more here.
protected override void DoUpdate(IProviderUpdateContext context, SitecoreIndexableItem indexable, IndexEntryOperationContext operationContext)
{
  Assert.ArgumentNotNull(context, "context");
  Assert.ArgumentNotNull(indexable, "indexable");
  using (new LanguageFallbackItemSwitcher(Index.EnableItemLanguageFallback))
  {
    if (IndexUpdateNeedDelete(indexable))
    {
      Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:deleteitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      Operations.Delete(indexable, context);
    }
    else
    {
      Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updatingitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      if (!IsExcludedFromIndex(indexable, true))
      {
        if (operationContext != null && !operationContext.NeedUpdateAllLanguages)
 {
   if (!indexable.Item.Language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
   {
     CrawlingLog.Log.Debug(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : Update : English not requested {0}. Skipping.", indexable.Item.Uri));
            return;
   }
        }
     
 Item item;
 var languageItem = LanguageManager.GetLanguage(indexLanguage);
 using (new WriteCachesDisabler())
 {
   item = indexable.Item.Database.GetItem(indexable.Item.ID, languageItem, Version.Latest);
 }

 if (item == null)
 {
    CrawlingLog.Log.Warn(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : Update : Latest version not found for item {0}. Skipping.", indexable.Item.Uri));
 }
 else
 {
   Item[] versions;
   using (new SitecoreCachesDisabler())
   {
     versions = item.Versions.GetVersions(false);
   }

   foreach (var version in versions)
   {
     if (version.Version.Equals(item.Version))
     {
       UpdateItemVersion(context, version, operationContext);
     }
     else  
     {
       Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:deleteitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
       Delete(context, ((SitecoreIndexableItem)version).UniqueId);
     }
   }
 }
    
 Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updateditem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      }


      if (!DocumentOptions.ProcessDependencies)
      {
        return;
      }

      if (indexable.Item.Language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
      {
        Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updatedependents", index.Name, indexable.UniqueId, indexable.AbsolutePath);
 UpdateDependents(context, indexable);
      }
    }
  }
}

I did a few things here:
  • if the operationContext is not asking to update all languages, I check the language and get it out if it is not the index language
  • I get all versions, loop trough them and update the latest - other versions get a delete instruction
    • not sure if this is really needed as it might be sufficient to delete only the previous one
  • the call to update the dependent items was put in a language condition so that it was only executed when the requested language is the index language

Testing

And I started testing. Rebuild. Add versions. Update items. Constantly using Luke to investigate the index. It all seemed to work. 
Until I tried to add a new version in a language that was not supposed to be in the index. The new version was not send to the index, but it's previous version was. I tried to figure out what was happening and by following the flow through the existing SitecoreItemCrawler I found some options in the "IndexEntryOperationContext" that were used in the base Update function.

Update

So we also override the Update method:
public override void Update(IProviderUpdateContext context, IIndexableUniqueId indexableUniqueId, IndexEntryOperationContext operationContext, IndexingOptions indexingOptions = IndexingOptions.Default)
{
  operationContext.NeedUpdatePreviousVersion = false;
  base.Update(context, indexableUniqueId, operationContext, indexingOptions);
}

What I'm doing here is actually quite simple: I tell the crawler that he does not need to update previous versions, no matter what. As I am already updating all versions in the DoUpdate this seemed ok to do. By doing this, the problem was fixed and I did not had to copy too much code anymore.

Conclusion

The custom crawler works and does what it is supposed to do. It would have been nice though if the functions in the crawler provided by Sitecore were cut into smaller pieces to make it easier to override the pieces we want to change. I remember reading somewhere that Pavel Veller already managed to get this on a roadmap, so I hope that is true...

But for now, this worked for me. Glad to hear any remarks, suggestions, ...

1 comment: