November 9, 2016

Spellchecking in Solr for Sitecore 8.1

Disclaimer: This work is based on Solr 5.4.1 and Sitecore 8.1 update-3, an unsupported combination.

Sitecore stores all language versions in the same Solr core by default, so we wanted Solr to keep a spellcheck index for each relevant language and to be able to switch to the suitable index for any given query.

To accomplish this, the solrconfig.xml file must be updated. The file is located inside the individual Solr core directories, and in our case, the path was:
c:\solr\server\solr\ContentSearch\conf\solrconfig.xml

The Solr Spellcheck component works by building a dedicated index for spellchecking, meaning that in our case, we'd have multiple spellchecking indexes, one for each supported language.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent" >
  <!-- update this field to "textSpell" -->
  <str name="queryAnalyzerFieldType">textSpell</str>

  <!-- for Danish language -->
  <lst hint="da" name="spellchecker">
    <str name="name">spell_da</str>
    <str name="field">spell_t_da</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="combineWords">true</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">./spellchecker_da</str>
  </lst>

  <!-- for English language -->
  <lst hint="en" name="spellchecker">
    <str name="name">spell_en</str>
    <str name="field">spell_t_en</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="combineWords">true</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
  </lst>

and so on.

Next, an updateRequestProcessorChain needed to be added/uncommented:

<updateRequestProcessorChain name="script">
   <processor class="solr.StatelessScriptUpdateProcessorFactory">
     <str name="script">update-script.js</str>
   </processor>
   <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

This piece of configuration instructs Solr to run a JavaScript file(!) on every document update to the index. That gives us a way to modify or extend every document going into the index, which makes it a perfect place to populate a field for the spellchecking component.

The updateRequestProcessorChain also needs to be hooked into every call to the Solr /update handler:

<initParams path="/update/**">
  <lst name="defaults">
     <str name="update.chain">script</str>
  </lst>
</initParams>

The update-script.js file (located in the conf directory) contains a few JavaScript functions, and we need to override the one called processAdd:

function processAdd(cmd) {
    var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
    var locale = doc.getFieldValue("_language"); // Standard Sitecore field
    var isSearchable = doc.getFieldValue("is_searchable_b"); // Custom field
    var id = doc.getFieldValue("_uniqueid"); // Standard Sitecore field

    if (locale && isSearchable) {
      locale = locale.substring(0,2);
      doc.addField("spell_t_" + locale, doc.getFieldValue("_content"));
      // The existing per-language dynamic field definitions in schema.xml (*_txt_en, *_txt_de, etc.) tokenize this field.
      logger.info("update-script#processAdd: id=" + id + " - locale=" + locale);
    } 
}

At this point, Solr will begin populating the spellcheck field when an index rebuild is performed from Sitecore, but we can't query it yet. Remember to reload the Solr core or restart the service, otherwise the configuration changes might not be picked up.
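
To sanity-check the script logic without rebuilding an index, the function can be exercised outside Solr with stand-in objects. The following is a minimal sketch only: MockDoc and the no-op logger are hypothetical replacements for the SolrInputDocument and logger objects that Solr injects, and the field values are made up for illustration.

```javascript
// Hypothetical stand-in for org.apache.solr.common.SolrInputDocument,
// implementing only the two methods the update script calls.
function MockDoc(fields) {
  this.fields = fields;
}
MockDoc.prototype.getFieldValue = function (name) {
  return Object.prototype.hasOwnProperty.call(this.fields, name)
    ? this.fields[name]
    : null;
};
MockDoc.prototype.addField = function (name, value) {
  this.fields[name] = value;
};

// Stand-in for the logger that Solr provides to the script.
var logger = { info: function (msg) { /* no-op outside Solr */ } };

// Same logic as processAdd in update-script.js.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var locale = doc.getFieldValue("_language");
  var isSearchable = doc.getFieldValue("is_searchable_b");
  var id = doc.getFieldValue("_uniqueid");

  if (locale && isSearchable) {
    locale = locale.substring(0, 2); // "da-DK" becomes "da"
    doc.addField("spell_t_" + locale, doc.getFieldValue("_content"));
    logger.info("update-script#processAdd: id=" + id + " - locale=" + locale);
  }
}

// A searchable Danish document gets a spell_t_da field.
var danish = new MockDoc({
  _language: "da-DK", is_searchable_b: true, _content: "lad os", _uniqueid: "1"
});
processAdd({ solrDoc: danish });

// A non-searchable document is left untouched.
var hidden = new MockDoc({
  _language: "en", is_searchable_b: false, _content: "hello", _uniqueid: "2"
});
processAdd({ solrDoc: hidden });
```

Inside Solr the field values are Java objects rather than JavaScript primitives, so this only checks the branching and the field naming, not the type handling.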

Now we must make the spellcheck work when querying Solr. Add the following to the <lst name="defaults"> element inside the requestHandler named "/select":

<str name="spellcheck.dictionary">spell_en</str>
<str name="spellcheck.dictionary">spell_da</str>
<str name="spellcheck.count">1</str>

If you have more spellcheck dictionaries defined, add them here accordingly.

Finally, at the bottom of the requestHandler element, add this:

<arr name="last-components">
    <str>spellcheck</str>
</arr>

Woohoo, now Solr is capable of running queries with spellcheck in the right language. Try it by issuing the following request in a browser:

http://localhost:8983/solr/ContentSearch/select?q=ladde&wt=json&indent=true&spellcheck=true&spellcheck.dictionary=spell_da&spellcheck.collate=true

We get this response:

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "spellcheck":"true",
      "indent":"true",
      "q":"ladde",
      "spellcheck.dictionary":"spell_da",
      "spellcheck.collate":"true",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[] },
  "spellcheck":{
    "suggestions":[
      "ladde",{
        "numFound":1,
        "startOffset":0,
        "endOffset":5,
        "suggestion":["lad"]}],
    "collations":[
      "collation","lad"]}
}
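
The request and the response can also be handled in code. The sketch below builds the same URL for a given language and pulls the collations out of a response shaped like the one above. The function names are made up for illustration, the "spell_" + language naming simply follows the dictionary names in this post, and only plain string collations are handled (not extended collation results).

```javascript
// Build the select URL with spellcheck enabled for one language.
// baseUrl is the core URL, e.g. http://localhost:8983/solr/ContentSearch
function buildSpellcheckUrl(baseUrl, query, language) {
  var params = [
    ["q", query],
    ["wt", "json"],
    ["indent", "true"],
    ["spellcheck", "true"],
    ["spellcheck.dictionary", "spell_" + language], // naming used in this post
    ["spellcheck.collate", "true"]
  ];
  var queryString = params.map(function (pair) {
    return encodeURIComponent(pair[0]) + "=" + encodeURIComponent(pair[1]);
  }).join("&");
  return baseUrl + "/select?" + queryString;
}

// The response lists collations as a flat array of alternating
// labels and values: ["collation", "lad", "collation", ...].
function extractCollations(solrResponse) {
  var spellcheck = solrResponse.spellcheck || {};
  var flat = spellcheck.collations || [];
  var result = [];
  for (var i = 0; i + 1 < flat.length; i += 2) {
    if (flat[i] === "collation") {
      result.push(flat[i + 1]);
    }
  }
  return result;
}

var url = buildSpellcheckUrl(
  "http://localhost:8983/solr/ContentSearch", "ladde", "da");

// Trimmed copy of the response shown above.
var sample = {
  response: { numFound: 0, start: 0, docs: [] },
  spellcheck: {
    suggestions: ["ladde", { numFound: 1, startOffset: 0, endOffset: 5, suggestion: ["lad"] }],
    collations: ["collation", "lad"]
  }
};
var collations = extractCollations(sample); // ["lad"]
```

Returning an empty array when the spellcheck section is missing keeps the caller simple: no suggestion and no spellcheck component both look the same to the search page.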

Now it is time to make this work in Sitecore. Unfortunately, it does not work with the out-of-the-box version of SolrNet.dll, as it appears that the collation part of the Solr response differs between Solr 4 (which Sitecore supports) and Solr 5 (which it does not, at least for Sitecore 8.1).

I used Ehab ElGindy's very cool and concise extension to IProviderSearchContext to get easy access to the spellcheck functionality:
http://www.ehabelgindy.com/sitecore-7-solr-spellcheck/

As stated earlier, it did not work out of the box because we had chosen Solr 5.4.1.
Instead, we found the source code for SolrNet and made changes to the SpellCheckResponseParser.

I stand on the shoulders of these great posts:

http://pavelbogomolenko.github.io/multi-language-handling-in-solr.html
http://wiki.apache.org/solr/ScriptUpdateProcessor
http://solr.pl/en/2011/05/23/“car-sale-application”-–-spellcheckcomponent-–-did-you-really-mean-that-part-5/
http://www.ehabelgindy.com/sitecore-7-solr-spellcheck/

Gotchas

Be careful not to enable spellchecking on Solr cores that do not need it for ContentSearch. Also, rebuilding the spellcheck index on every commit and every optimize (buildOnCommit and buildOnOptimize) might be too frequent for your use case.
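
If rebuilding on every commit turns out to be too frequent, one option is to set buildOnCommit and buildOnOptimize to false and trigger the build on demand with Solr's spellcheck.build request parameter. The sketch below only assembles such a URL. The function name is hypothetical, and calling the URL on a schedule (and how often) is an assumption about your setup, not something we have tested here.

```javascript
// Assemble a request that asks Solr to rebuild one spellcheck dictionary.
// rows=0 keeps the response small, since only the build side effect matters.
function buildDictionaryRebuildUrl(baseUrl, dictionaryName) {
  return baseUrl + "/select" +
    "?q=" + encodeURIComponent("*:*") +
    "&rows=0" +
    "&spellcheck=true" +
    "&spellcheck.dictionary=" + encodeURIComponent(dictionaryName) +
    "&spellcheck.build=true";
}

var rebuildDanish = buildDictionaryRebuildUrl(
  "http://localhost:8983/solr/ContentSearch", "spell_da");
```

One such request per dictionary would be needed, for example after a full index rebuild from Sitecore.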