Drupal Search API Solr spellcheck setup for foreign language

The task is to set up Solr's spellcheck feature in Drupal for a foreign language. In this article I will demonstrate enabling the German language and adding a new language, Filipino. Solr spellcheck returns a list of suggestions when a search yields no results because of a spelling error. This article assumes that you have already set up Apache Solr, an index, and Drupal with Search API; if you haven't done that, please see this article first: Step by step guide to setup Apache Solr 5.x in CentOS 7 for Drupal 7 Panopoly distro using Search API.

Install the Search API Spellcheck module:
```
drush en search_api_spellcheck -y
```

We need to add a new Solr field type for our foreign language. If you followed the steps in Step by step guide to setup Apache Solr 5.x in CentOS 7 for Drupal 7 Panopoly distro using Search API, you copied the predefined Solr config files from the Search API Solr module. Those config files already include German language support; enabling it is just a matter of uncommenting it. To do so, open the schema_extra_types.xml file:


```
vi /var/solr/data/webfoobar/conf/schema_extra_types.xml
```

Note: replace "webfoobar" with the name of the Solr core you created.

Then remove the comment markers around the fieldType tags. The file content should look something like this:


```
<types>
  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" format="snowball" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnCaseChange="1" splitOnNumerics="1" catenateWords="1" catenateNumbers="1" catenateAll="0" protected="protwords.txt" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" format="snowball" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnCaseChange="1" splitOnNumerics="1" catenateWords="0" catenateNumbers="0" catenateAll="0" protected="protwords.txt" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
```

Note: Make sure that the "stopwords_de.txt" and "synonyms_de.txt" files exist in /var/solr/data/webfoobar/conf/lang/.
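
A quick way to confirm this before restarting Solr is a small shell check. The sketch below is only an illustration: `check_conf_files` is a hypothetical helper name, not something shipped with Solr or Drupal, and it simply reports any listed file that is missing from the conf directory you pass in.

```shell
# Hypothetical helper (not part of Solr or Drupal): report which of the
# given files are missing from a core's conf directory.
# Usage: check_conf_files CONF_DIR FILE...
check_conf_files() {
  conf_dir="$1"
  shift
  status=0
  for file in "$@"; do
    if [ ! -f "$conf_dir/$file" ]; then
      echo "missing: $conf_dir/$file"
      status=1
    fi
  done
  # Returns 0 when every file exists, 1 otherwise.
  return "$status"
}
```

For the core used in this article, the call would be `check_conf_files /var/solr/data/webfoobar/conf lang/stopwords_de.txt lang/synonyms_de.txt`; no output means both files are in place.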

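To add a new language, define a field type for it. In this example we will add the Filipino language. The schema_extra_types.xml file should contain this field type, placed inside the <types> element like the German one: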


```
<fieldType name="text_ph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ph.txt"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>
```

The Filipino field type references its own stop word file, "stopwords_ph.txt"; make sure it exists in /var/solr/data/webfoobar/conf/.
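
If you do not have a stop word list yet, a starter file can be created from the shell. This is a minimal sketch under assumptions: `write_stopwords_ph` is a hypothetical helper, and the seven words are just a few common Tagalog function words picked for illustration, not an official or complete list. The file uses the plain one-word-per-line format, where lines starting with "#" are comments.

```shell
# Hypothetical helper: write a small starter stop word list for Filipino
# into the conf directory given as $1.
# The words are illustrative only; extend the list for real use.
write_stopwords_ph() {
  conf_dir="$1"
  cat > "$conf_dir/stopwords_ph.txt" <<'EOF'
# Filipino stop words, one per line.
ang
ng
sa
mga
at
na
ay
EOF
}
```

Calling `write_stopwords_ph /var/solr/data/webfoobar/conf` would create the file for the core used in this article. The file presumably needs to be readable by the user that runs Solr.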

Next, define the Solr fields. For German these are already defined, so we will just uncomment them. Open the "schema_extra_fields.xml" file:


```
vi /var/solr/data/webfoobar/conf/schema_extra_fields.xml
```

Then remove the comment markers around the field and dynamicField tags. The file content should look something like this:


```
<fields>
  <field name="label_de" type="text_de" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
  <field name="content_de" type="text_de" indexed="true" stored="true" termVectors="true"/>
  <field name="teaser_de" type="text_de" indexed="false" stored="true"/>
  <field name="path_alias_de" type="text_de" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
  <field name="taxonomy_names_de" type="text_de" indexed="true" stored="false" termVectors="true" multiValued="true" omitNorms="true"/>
  <field name="spell_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
  <copyField source="label_de" dest="spell_de"/>
  <copyField source="content_de" dest="spell_de"/>
  <dynamicField name="tags_de_*" type="text_de" indexed="true" stored="false" omitNorms="true"/>
  <dynamicField name="ts_de_*" type="text_de" indexed="true" stored="true" multiValued="false" termVectors="true"/>
  <dynamicField name="tm_de_*" type="text_de" indexed="true" stored="true" multiValued="true" termVectors="true"/>
  <dynamicField name="tos_de_*" type="text_de" indexed="true" stored="true" multiValued="false" termVectors="true" omitNorms="true"/>
  <dynamicField name="tom_de_*" type="text_de" indexed="true" stored="true" multiValued="true" termVectors="true" omitNorms="true"/>
</fields>
```

For the Filipino language, the "schema_extra_fields.xml" file should have this content:


```
<fields>
  <field name="spell_ph" type="text_ph" tokenized="true" indexed="true" stored="true" multiValued="true"/>
  <copyField source="tm_title" dest="spell_ph"/>
</fields>
```

Tip: To get more accurate results, avoid a heavily processed field when selecting the source field for the spellcheck index. With such a field, the dictionary ends up containing many word variants produced by synonym expansion and/or stemming in addition to the valid spellings. In my use case, I chose only the node title as the field to look up for correct spellings.

Define the dictionary as a spellchecker entry in the spellcheck search component. In "solrconfig_extra.xml", German already has a predefined dictionary, so we will just remove the comments. Open the "solrconfig_extra.xml" file:


```
vi /var/solr/data/webfoobar/conf/solrconfig_extra.xml
```

The file content should look something like this:


```
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">spellchecker_de</str>
    <str name="field">spell_de</str>
    <str name="spellcheckIndexDir">./spellchecker_de</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>
```

Note the dictionary name "spellchecker_de"; we will use it to let Drupal know about the new dictionary we have enabled.

For the Filipino language, the "solrconfig_extra.xml" file should have this content:


```
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">spellchecker_ph</str>
    <str name="field">spell_ph</str>
    <str name="spellcheckIndexDir">./spellchecker_ph</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>
```

Note the dictionary name "spellchecker_ph"; we will use it to let Drupal know about the new dictionary we have defined.
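
If you want to double-check which dictionary names are defined after editing, they can be pulled out of the config file from the shell. The sketch below assumes each name sits on its own line as `<str name="name">…</str>`, as in the listings in this article; `list_dictionary_names` is a hypothetical helper, and it prints every such entry in the file, including any that belong to components other than the spellchecker.

```shell
# Hypothetical helper: print the value of every <str name="name">...</str>
# entry found in the config file given as $1, one value per line.
list_dictionary_names() {
  sed -n 's/.*<str name="name">\([^<]*\)<\/str>.*/\1/p' "$1"
}
```

For example, `list_dictionary_names /var/solr/data/webfoobar/conf/solrconfig_extra.xml` lists the names in the file edited in this step.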

Restart the Solr service:
```
systemctl restart solr.service
```
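
A mistake in the edited config files can stop the core from loading, so it may be worth checking the core after the restart. The sketch below queries Solr's CoreAdmin STATUS action with curl. It assumes Solr listens on localhost port 8983; `solr_core_status` is a hypothetical helper name.

```shell
# Hypothetical helper: ask Solr's CoreAdmin API for the status of one core.
#   $1 = Solr base URL, assumed to look like http://localhost:8983/solr
#   $2 = core name
solr_core_status() {
  curl -s -G "$1/admin/cores" \
    --data-urlencode "action=STATUS" \
    --data-urlencode "core=$2" \
    --data-urlencode "wt=json"
}
```

Running `solr_core_status http://localhost:8983/solr webfoobar` prints the status as JSON; a core that failed to load is typically reported in the response's failure section rather than with its normal details.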

Clear and re-index Solr in our Drupal site:


```
drush search-api-clear
drush search-api-index
```

To test, go to the Solr admin web interface, which can be accessed at http://www.yourwebsite.com:8983/solr/#/webfoobar/query (Note: change "www.yourwebsite.com" to your website's domain name and "webfoobar" to your Solr core name). Tick the "spellcheck" checkbox and populate the "spellcheck.dictionary" field with the dictionary name: for German in our example, fill it in with "spellchecker_de". In my case I filled it in with "spellchecker_ph". The rest of the spellcheck parameters depend on your use case (to learn more about the other parameters, see https://cwiki.apache.org/confluence/display/solr/Spell+Checking). Take note of every spellcheck parameter you changed to suit your use case, as you will need to define them in Drupal.

Now, in the "q" field, enter the term you want to search for, deliberately misspelled. I entered "kapittt bisii", as I have a Drupal content node titled "kapit bisig", and clicked the "Execute Query" button. Solr should then display its response.

Notice that Solr suggested "kapit" as the correct spelling for the misspelled "kapittt" and "bisig" for "bisii". If you ticked the "spellcheck.collate" option, Solr also suggests "kapit bisig" as the collation for the phrase search "kapittt bisii".

Note: If your search did not return any suggestions, try ticking the "spellcheck.build" and "spellcheck.reload" checkboxes, then click the "Execute Query" button. We are using the "IndexBasedSpellChecker", whose dictionary has to be rebuilt regularly.
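
The same test can be run from the command line instead of the admin interface. The following is a sketch, not a definitive recipe: `solr_spellcheck` is a hypothetical helper, and it assumes that the core's /select request handler has the spellcheck component attached. The parameters it sends are the ones discussed above.

```shell
# Hypothetical helper: send a spellcheck query to a core's /select handler.
#   $1 = core URL, assumed to look like http://localhost:8983/solr/webfoobar
#   $2 = dictionary name, e.g. spellchecker_ph
#   $3 = search terms, deliberately misspelled
#   $4 = optional; pass "build" to rebuild the dictionary before checking
solr_spellcheck() {
  build="false"
  if [ "${4:-}" = "build" ]; then
    build="true"
  fi
  # -G sends the url-encoded parameters as a GET query string.
  curl -s -G "$1/select" \
    --data-urlencode "q=$3" \
    --data-urlencode "spellcheck=true" \
    --data-urlencode "spellcheck.dictionary=$2" \
    --data-urlencode "spellcheck.collate=true" \
    --data-urlencode "spellcheck.build=$build" \
    --data-urlencode "wt=json"
}
```

For example, `solr_spellcheck http://localhost:8983/solr/webfoobar spellchecker_ph "kapittt bisii" build` rebuilds the Filipino dictionary and then asks for suggestions; leave out the last argument for ordinary checks.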

Let's make Drupal aware of the spellcheck parameters we noted earlier. In your custom module (if you don't have one yet, please create one; see https://www.drupal.org/node/1074360 on how to create a module), implement hook_search_api_solr_query_alter() and define all the spellcheck parameters there:


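```
/**
 * Implements hook_search_api_solr_query_alter().
 */
function YOURMODULE_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  // Turn the parsed search keys back into a plain string for Solr.
  $keys = $query->getKeys();
  if (is_array($keys)) {
    // Drop the '#conjunction' control entry so that only the terms remain.
    unset($keys['#conjunction']);
    $keys = implode(' ', $keys);
  }
  // Spellcheck parameters noted during the test in the Solr admin interface.
  $call_args['params']['spellcheck.q'] = $keys;
  $call_args['params']['spellcheck.count'] = 10;
  $call_args['params']['spellcheck.accuracy'] = 0.7;
  $call_args['params']['spellcheck.collate'] = 'true';
  // Use 'spellchecker_de' here for the German dictionary.
  $call_args['params']['spellcheck.dictionary'] = 'spellchecker_ph';
}
```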
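
If you still need a module to hold this hook, its skeleton can be created from the shell. This is a minimal sketch for Drupal 7 under assumptions: `scaffold_module` is a hypothetical helper, and the module name and description are placeholders. It writes a .info file that declares a dependency on search_api_solr and a .module file containing only the opening PHP tag, below which the hook implementation would go.

```shell
# Hypothetical helper: create a bare Drupal 7 module skeleton.
#   $1 = modules directory, e.g. sites/all/modules/custom
#   $2 = module machine name (placeholder; use your own)
scaffold_module() {
  modules_dir="$1"
  name="$2"
  mkdir -p "$modules_dir/$name"
  cat > "$modules_dir/$name/$name.info" <<EOF
name = $name
description = Passes Solr spellcheck parameters to Search API queries.
core = 7.x
dependencies[] = search_api_solr
EOF
  # The hook implementation goes below the opening PHP tag.
  printf '<?php\n\n' > "$modules_dir/$name/$name.module"
}
```

After running, for example, `scaffold_module sites/all/modules/custom mymodule` from the Drupal root and pasting in the hook with YOURMODULE replaced by the module name, the module can be enabled with `drush en mymodule -y`.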

Clear your Drupal cache:


```
drush cc all
```

Let's apply this Solr spellcheck feature in a Drupal view. Create a Search API index view with a search page and add the "Search: Spellcheck" field to the "NO RESULTS BEHAVIOR" area.

Save the view.
Then check the Search API index view search page we created. I'll enter "kapittt bisii" again, and this is the output: