Showing posts with label Lucene. Show all posts

Wednesday, October 5, 2016

Custom Sitecore index crawler

Why? - The case

We needed to find items in the master database quickly, based on some criteria. Language didn't matter and we didn't need the actual item; a few properties would do. So we decided to create a custom index with the fields we needed indexed (the criteria) and/or stored (the properties). All fine, but because we were using the master database, the index contained every version of each item. As we needed only the latest version in a single language, we thought we could optimize the index to contain only those versions.

First attempt

We had to override the SitecoreItemCrawler. That was clear. The first attempt was a custom IsExcludedFromIndex function that would reject every entry that was not in English or not the latest version. Pretty simple, but it does not work.
First of all, every entry the function received was in English, because it is called only once per item and not per version. So we could not use it for this. Furthermore, we had not taken into account that when a new version is added, the previously indexed one has to be removed.

Don't index multiple versions

I started searching the internet and found this post by Martin English on (not) indexing multiple versions. Great post, which also points towards a solution with inbound filters. But as those filters apply to every index, they were no solution here: I needed this configurable per index. So back to Martin's post. We had to override DoAdd and DoUpdate.

A custom index crawler

The result was a bit different as I was using Sitecore 8.1 and also wanted to include a language filter. I checked the code of the original SitecoreItemCrawler, created a class that inherits from it and adapted where needed.

Language

I made the language configurable by putting it into a property:
private string indexLanguage;

public string Language
{
  get  {  return !string.IsNullOrEmpty(indexLanguage) ? indexLanguage : null; }
  set  {  indexLanguage = value; }
}
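
Because Language is a public property, it should be possible to set it per index from the crawler definition in the index configuration, the same way Database and Root are set. A minimal sketch, assuming the crawler lives in a class called YourNamespace.SingleLanguageLatestVersionCrawler (the type name, assembly, root path and language value are placeholders, not taken from the actual project):

```xml
<!-- Sketch only: type, assembly, root and language are placeholders. -->
<locations hint="list:AddCrawler">
  <crawler type="YourNamespace.SingleLanguageLatestVersionCrawler, YourAssembly">
    <Database>master</Database>
    <Root>/sitecore/content</Root>
    <!-- maps to the public Language property of the crawler -->
    <Language>en</Language>
  </crawler>
</locations>
```

Note that when no language is configured the getter returns null, and the language checks in the methods below will then skip everything.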

DoAdd

The DoAdd method was changed by adding an early check in the language loop that skips every language other than the requested one. I also replaced the version loop with a request for the latest version, so that only that version gets sent to the index.
protected override void DoAdd(IProviderUpdateContext context, SitecoreIndexableItem indexable)
{
  Assert.ArgumentNotNull(context, "context");
  Assert.ArgumentNotNull(indexable, "indexable");
  using (new LanguageFallbackItemSwitcher(context.Index.EnableItemLanguageFallback))
  {
    Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:adding", context.Index.Name, indexable.UniqueId, indexable.AbsolutePath);
    if (!IsExcludedFromIndex(indexable, false))
    {
      foreach (var language in indexable.Item.Languages)
      {
        // only include the configured index language
        if (!language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
        {
          continue;
        }

        Item item;
        using (new WriteCachesDisabler())
        {
          item = indexable.Item.Database.GetItem(indexable.Item.ID, language, Version.Latest);
        }

        if (item == null)
        {
          CrawlingLog.Log.Warn(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : AddItem : Could not build document data {0} - Latest version could not be found. Skipping.", indexable.Item.Uri));
        }
        else
        {
          SitecoreIndexableItem sitecoreIndexableItem;
          using (new WriteCachesDisabler())
          {
            // only latest version
            sitecoreIndexableItem = item.Versions.GetLatestVersion();
          }

          if (sitecoreIndexableItem != null)
          {
            IIndexableBuiltinFields indexableBuiltinFields = sitecoreIndexableItem;
            indexableBuiltinFields.IsLatestVersion = indexableBuiltinFields.Version == item.Version.Number;
            sitecoreIndexableItem.IndexFieldStorageValueFormatter = context.Index.Configuration.IndexFieldStorageValueFormatter;
            Operations.Add(sitecoreIndexableItem, context, index.Configuration);
          }
        }
      }
    }

    Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:added", context.Index.Name, indexable.UniqueId, indexable.AbsolutePath);
  }
}


DoUpdate

For the DoUpdate method I did something similar although I had to change a bit more here.
protected override void DoUpdate(IProviderUpdateContext context, SitecoreIndexableItem indexable, IndexEntryOperationContext operationContext)
{
  Assert.ArgumentNotNull(context, "context");
  Assert.ArgumentNotNull(indexable, "indexable");
  using (new LanguageFallbackItemSwitcher(Index.EnableItemLanguageFallback))
  {
    if (IndexUpdateNeedDelete(indexable))
    {
      Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:deleteitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      Operations.Delete(indexable, context);
    }
    else
    {
      Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updatingitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      if (!IsExcludedFromIndex(indexable, true))
      {
        if (operationContext != null && !operationContext.NeedUpdateAllLanguages)
        {
          if (!indexable.Item.Language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
          {
            CrawlingLog.Log.Debug(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : Update : English not requested {0}. Skipping.", indexable.Item.Uri));
            return;
          }
        }

        Item item;
        var languageItem = LanguageManager.GetLanguage(indexLanguage);
        using (new WriteCachesDisabler())
        {
          item = indexable.Item.Database.GetItem(indexable.Item.ID, languageItem, Version.Latest);
        }

        if (item == null)
        {
          CrawlingLog.Log.Warn(string.Format(CultureInfo.InvariantCulture, "SitecoreItemCrawler : Update : Latest version not found for item {0}. Skipping.", indexable.Item.Uri));
        }
        else
        {
          Item[] versions;
          using (new SitecoreCachesDisabler())
          {
            versions = item.Versions.GetVersions(false);
          }

          foreach (var version in versions)
          {
            if (version.Version.Equals(item.Version))
            {
              UpdateItemVersion(context, version, operationContext);
            }
            else
            {
              Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:deleteitem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
              Delete(context, ((SitecoreIndexableItem)version).UniqueId);
            }
          }
        }

        Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updateditem", index.Name, indexable.UniqueId, indexable.AbsolutePath);
      }

      if (!DocumentOptions.ProcessDependencies)
      {
        return;
      }

      if (indexable.Item.Language.Name.Equals(indexLanguage, StringComparison.OrdinalIgnoreCase))
      {
        Index.Locator.GetInstance<IEvent>().RaiseEvent("indexing:updatedependents", index.Name, indexable.UniqueId, indexable.AbsolutePath);
        UpdateDependents(context, indexable);
      }
    }
  }
}

I did a few things here:
  • if the operationContext is not asking to update all languages, I check the language of the item and return early when it is not the index language
  • I get all versions, loop through them and update the latest; the other versions get a delete instruction
    • not sure if this is really needed, as it might be sufficient to delete only the previous one
  • the call to update the dependent items was put in a language condition so that it is only executed when the item's language is the index language

Testing

And I started testing. Rebuild. Add versions. Update items. Constantly using Luke to investigate the index. It all seemed to work.
Until I tried to add a new version in a language that was not supposed to be in the index. The new version was not sent to the index, but its previous version was. I tried to figure out what was happening, and by following the flow through the existing SitecoreItemCrawler I found some options in the "IndexEntryOperationContext" that are used in the base Update function.

Update

So we also override the Update method:
public override void Update(IProviderUpdateContext context, IIndexableUniqueId indexableUniqueId, IndexEntryOperationContext operationContext, IndexingOptions indexingOptions = IndexingOptions.Default)
{
  // the context can be null (DoUpdate checks for that as well), so guard before touching it
  if (operationContext != null)
  {
    operationContext.NeedUpdatePreviousVersion = false;
  }

  base.Update(context, indexableUniqueId, operationContext, indexingOptions);
}

What I'm doing here is actually quite simple: I tell the crawler that it does not need to update previous versions, no matter what. As I am already handling all versions in DoUpdate, this seemed ok to do. By doing this, the problem was fixed and I did not have to copy too much code.

Conclusion

The custom crawler works and does what it is supposed to do. It would have been nice though if the functions in the crawler provided by Sitecore were cut into smaller pieces to make it easier to override the pieces we want to change. I remember reading somewhere that Pavel Veller already managed to get this on a roadmap, so I hope that is true...

But for now, this worked for me. Glad to hear any remarks, suggestions, ...

Wednesday, December 23, 2015

Sitecore Lucene index and DateTime fields

[Sitecore 8.1]

DateTime field in Lucene index

I was trying to create an index search for an event calendar that would give me items (based on a template etc.) that have a date field:
  • from today onwards (today included)
  • up until today
The field in Sitecore is a date field (so no time indication), but our query seemed to have issues with the time part. The code to create the predicate looks like this:

private Expression<Func<EventItem, bool>> GetDatePredicate(OverviewMode mode)
{
  var predicate = PredicateBuilder.True<EventItem>();
  switch (mode)
  {
    case OverviewMode.Future:
    {
      var minDate = DateTime.Today.ToUniversalTime();
      predicate = predicate.And(n => n.StartDate > minDate);
      break;
    }
    case OverviewMode.Past:
    {
      var maxDate = DateTime.Today.ToUniversalTime();
      var minDate = DateTime.MinValue.ToUniversalTime();
      predicate = predicate.And(n => n.StartDate < maxDate).And(n => n.StartDate > minDate);
      break;
    }
    default:
    {
      return null;
    }
  }
  return predicate;
}


This did not work correctly for events happening today. We had to call AddDays(-1) on DateTime.Today before converting it to UTC. So why?

The first reason is that Sitecore stores its DateTimes in UTC, which differed by one hour from our local time. So our dates shifted to the previous day: "12/12/2015" becomes "12/11/2015 23:00". This is known and should be no issue, as we also convert to UTC in our predicate.
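
To make the shift concrete, here is a small stand-alone sketch (no Sitecore involved). It assumes a fixed UTC+1 zone without daylight saving, created in code purely for illustration; the class and method names are made up for this example.

```csharp
using System;
using System.Globalization;

public static class UtcShiftDemo
{
    // Converts local midnight of the given date to UTC for an assumed fixed UTC+1 zone.
    public static DateTime LocalMidnightToUtc(DateTime date)
    {
        var zone = TimeZoneInfo.CreateCustomTimeZone("UTC+1", TimeSpan.FromHours(1), "UTC+1", "UTC+1");
        var localMidnight = DateTime.SpecifyKind(date.Date, DateTimeKind.Unspecified);
        return TimeZoneInfo.ConvertTimeToUtc(localMidnight, zone);
    }

    public static void Main()
    {
        var utc = LocalMidnightToUtc(new DateTime(2015, 12, 12));
        // prints 2015-12-11 23:00
        Console.WriteLine(utc.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture));
    }
}
```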

But still.. we did not get the correct results.

The logs

So we look at the logs. Sitecore logs all requests in the Search log file. We saw that our predicate was translated into something like this:
"+(+date_from:[* TO 20151111t230000000z} +date_from:{00010101t000000000z TO *])"

Looks fine, but note that the "t" and "z" in the dates are lowercase. In my index, however, they are all uppercase. If I run the query in Luke, it indeed gives me the wrong results. When I alter the query in Luke to use an uppercase T, it works correctly.
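
Why the case matters can be shown without Lucene. The sketch below assumes that the range query compares the indexed terms with the bounds as plain text, character by character (I did not verify this in the Lucene.Net sources); the class name is made up for the example.

```csharp
using System;

public static class TermCaseDemo
{
    // True when 'term' sorts strictly before 'bound' in ordinal (character code) order.
    public static bool SortsBefore(string term, string bound)
    {
        return string.CompareOrdinal(term, bound) < 0;
    }

    public static void Main()
    {
        const string lowerCaseBound = "20151211t230000000z"; // lower case, as in the logged query
        const string indexedLater = "20151211T233000000Z";   // 23:30, half an hour after the bound

        // 'T' (0x54) sorts before 't' (0x74), so the later value still compares as smaller.
        Console.WriteLine(SortsBefore(indexedLater, lowerCaseBound));                    // True
        // With matching case the comparison follows the actual time again.
        Console.WriteLine(SortsBefore(indexedLater, lowerCaseBound.ToUpperInvariant())); // False
    }
}
```

Under that assumption, every indexed value on the boundary day ends up on the wrong side of the lower-case bound, whatever its time part is.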

Support, here we come!


Solution(s)

Support gave us 2 possible solutions, next to the one we already had (skipping a day).

1. Format

We could alter our index to use a format attribute:
<field fieldName="datefrom" storageType="YES" indexType="UNTOKENIZED" vectorType="NO" boost="1f" 
format="yyyyMMdd" type="System.DateTime" 
settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider"/>

After rebuilding our index, the "DateFrom" field values, stored in the index, will contain only dates (like "20151209"), so search by dates should return results as expected (since there are no "T" and "Z" symbols). 
This works if you really don't need the times..
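
A quick way to see why the date-only format sidesteps the problem: a value formatted as yyyyMMdd contains only digits, so lowercasing the query cannot change it. A small stand-alone sketch (whether Sitecore applies the format attribute exactly like DateTime.ToString is an assumption on my side):

```csharp
using System;
using System.Globalization;

public static class DateOnlyFormatDemo
{
    // Formats a date the way the "yyyyMMdd" format attribute is assumed to.
    public static string ToIndexValue(DateTime value)
    {
        return value.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        var stored = ToIndexValue(new DateTime(2015, 12, 9));
        Console.WriteLine(stored);                             // 20151209
        Console.WriteLine(stored == stored.ToLowerInvariant()); // True: nothing to lowercase
    }
}
```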

2. Custom Converter

Another solution is to override the "Sitecore.ContentSearch.Converters.IndexFieldUtcDateTimeValueConverter" class so that dates are stored in the index in lower case.

Add your converter to the index config:
<converters hint="raw:AddConverter">
  ...
  <converter handlesType="System.DateTime" 
         typeConverter="YourNamespace.LowerCaseIndexFieldUtcDateTimeValueConverter, YourAssembly" />
  ...
</converters>

As a result, all dates should be stored in the index in lower case. As the search query is in lower case too, all expected results should be found.
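
The converter itself could look roughly like the sketch below. To keep it compilable without the Sitecore assemblies, it derives from a stand-in base class that I wrote for this example; in a real project the base class would be Sitecore's IndexFieldUtcDateTimeValueConverter, and I am assuming (not verifying here) that its ConvertTo method can be overridden in the same way. The date format in the stand-in is only modelled on the values seen in the index.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;

// Stand-in for Sitecore's converter (hypothetical), so the sketch runs on its own.
public class StandInUtcDateTimeConverter : TypeConverter
{
    public override object ConvertTo(ITypeDescriptorContext context, CultureInfo culture, object value, Type destinationType)
    {
        if (value is DateTime && destinationType == typeof(string))
        {
            // upper-case T and Z, like the values found in the index
            return ((DateTime)value).ToUniversalTime().ToString("yyyyMMdd'T'HHmmssfff'Z'", CultureInfo.InvariantCulture);
        }
        return base.ConvertTo(context, culture, value, destinationType);
    }
}

// Lowercases whatever string the base converter produces.
public class LowerCaseUtcDateTimeConverter : StandInUtcDateTimeConverter
{
    public override object ConvertTo(ITypeDescriptorContext context, CultureInfo culture, object value, Type destinationType)
    {
        var result = base.ConvertTo(context, culture, value, destinationType);
        var text = result as string;
        return text != null ? text.ToLowerInvariant() : result;
    }
}
```

For example, converting 11 December 2015 23:00 UTC to a string with LowerCaseUtcDateTimeConverter gives "20151211t230000000z", which matches the casing of the generated query.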


Future solution

Search queries are currently always generated in lower case, and this behavior is not configurable: the "LowercaseExpandedTerms" property of the "Lucene.Net.QueryParsers.QueryParser" class is always set to true, which lowercases the parameters in a search query string. A feature request for the product was made so that this can be considered for future versions. That should make these tweaks unnecessary.

Monday, December 7, 2015

Sitecore Lucene index with integers

The situation

We recently discovered an issue when using a facet on an integer field in a Sitecore (8.1) Lucene index. We had a number of articles (items) with a date field. We had to query these items, order them by date and determine the number of items in each year.

The code

We created a ComputedField "year" and filled it with the year part of the date:
var dateTime = ((DateField)publicationDateField).DateTime;
return dateTime.Year;
We added the field to a custom index and created an entry in the fieldmap to mark it as System.Int32. We rebuilt the index, checked the contents with Luke, and all was fine. So we created a class based on SearchResultItem to use for the query:

class NewsItem : SearchResultItem
{
    [IndexField("title")]
    public string Title { get; set; }

    [IndexField("publication date")]
    public DateTime Date { get; set; }

    [IndexField("category")]
    public Guid Category { get; set; }

    [IndexField("year")]
    public int PublicationYear { get; set; }
}

The query

When we use this class for querying, we get no results when filtering on the year. Apparently integer fields need to be tokenized to be usable in searches (indexType="TOKENIZED"). That sounds weird, as it is surely not true for text fields, but the NumericField constructor makes it clear:

Lucene.Net.Documents.NumericField.NumericField(string name, int precisionStep, Field.Store store, bool index) : base(name, store, index ? Field.Index.ANALYZED_NO_NORMS : Field.Index.NO, Field.TermVector.NO)

So, we changed the field in the fieldmap and set it to tokenized. We added an analyzer to prevent the integer from being cut into parts (Lucene.Net.Analysis.KeywordAnalyzer or Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer).

Success?

Yeah! We have results! We got the news items for 2015! And 2014..  But... there is always a but or this post would be too easy. We still needed a facet. And there it went wrong. The facet resulted in this:


Not what we expected actually...

So back to our query and index. Sitecore Support found out that this happens because of the specific way Lucene indexes numeric fields: they are not indexed as simple tokens but as a tree structure (http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/NumericField.html).

Unfortunately, Sitecore cannot do faceting on such fields at this moment - this is now logged as a bug.

The Solution

The solution was actually very simple. We removed the field from the fieldmap and changed the int in our NewsItem to a string. If we want to use the values as integers we need to parse them afterwards, but for now we don't even need that.
Luckily for us, even the sorting doesn't care, as our ints are four-digit years. So we were set: queries are working and facets are fine.
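
The reason the sorting still works is that strings of equal length sort the same way as the numbers they represent. A small stand-alone sketch (class and method names are made up for the example) that also shows where this would break, and how to get the integer back when needed:

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class YearStringDemo
{
    // Sorts strings by character code, which is how a string field is ordered.
    public static string[] SortOrdinal(string[] values)
    {
        return values.OrderBy(v => v, StringComparer.Ordinal).ToArray();
    }

    public static void Main()
    {
        // Four-digit years: text order equals numeric order.
        Console.WriteLine(string.Join(",", SortOrdinal(new[] { "2015", "2013", "2014" }))); // 2013,2014,2015
        // Numbers of different length: text order no longer matches numeric order.
        Console.WriteLine(string.Join(",", SortOrdinal(new[] { "9", "10", "100" })));       // 10,100,9
        // Parsing the stored string when an integer is needed after all.
        Console.WriteLine(int.Parse("2015", CultureInfo.InvariantCulture) + 1);             // 2016
    }
}
```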