I am using Datastax 6.8. This is my SOLR schema:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
<fieldType class="org.apache.solr.schema.TextField" name="NameField">
<analyzer type="index">
<filter class="solr.ASCIIFoldingFilterFactory"/>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
</analyzer>
</fieldType>
</types>
<fields>
<field indexed="true" multiValued="false" name="nama" type="StrField"/>
<field indexed="true" multiValued="false" name="nama_copy" type="NameField"/>
</fields>
<uniqueKey>(nama)</uniqueKey>
<copyField dest="nama_copy" source="nama"/>
</schema>
I have this field value in a row: batamindo v
Then I ran this query:
http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ v
I got a very nice result:
{
"responseHeader":{
"status":0,
"QTime":8},
"response":{"numFound":579,"start":0,"docs":[
{
"nama":"BATAMINDO V "},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"}]
}}
But when I ran:
http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ vi
my search results were very bad:
{
"responseHeader":{
"status":0,
"QTime":14},
"response":{"numFound":602,"start":0,"docs":[
{
"nama":"MV. VINCA"},
{
"nama":"MV. VINASHIP PEARL"},
{
"nama":"MV. VINASHIP PEARL"},
{
"nama":"MV. VINCENT TRADER"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"NAVI SUNNY"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"MT. GOLDEN VIOLET"},
{
"nama":"MT. GOLDEN VIOLET"}]
}}
What is happening here?
What you are seeing is expected behaviour.
The NGramFilterFactory class splits each token into grams (substrings) of N characters. In your case, tokens are broken up into grams of 2 to 15 characters, based on this line in your schema:
<filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
For an input string like cassandra, the N-gram filter generates grams that include the following (longer grams, up to the full 9-character string, are omitted here):
ca as ss sa an nd dr ra
cas ass ssa san and ndr dra
cass assa ssan sand andr ndra
For the search term ss, the Solr query will get a match for ss, ass, ssa, assa, ssan and so on.
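To make the gram generation concrete, here is a minimal Python sketch of what an N-gram filter does to a single token. It is not Solr's implementation: the function name ngrams is made up for illustration, the defaults mirror the minGramSize and maxGramSize values from the schema, and the output is grouped by gram length for readability, which may differ from the order Solr emits.

```python
def ngrams(token, min_gram=2, max_gram=15):
    """Return every substring of token whose length is min_gram..max_gram."""
    grams = []
    # A gram can never be longer than the token itself.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


grams = ngrams("cassandra")
print(grams[:8])      # the 2-character grams
print("ss" in grams)  # is the search term one of the indexed grams?
print(ngrams("v"))    # a 1-character token is shorter than min_gram
```

In this sketch a token shorter than min_gram, such as a lone v, produces no grams at all.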
In your case, where the search term is vi, it is expected to match vinca, vinaship, vincent, victory, navi, violet and so on, because each of those tokens contains the gram vi.
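The same idea can be applied to your two queries. The sketch below is a rough approximation, not Solr itself: analyze is a hypothetical helper that lowercases the text, splits it on non-letters (roughly what LowerCaseTokenizerFactory does) and collects grams of 2 to 15 characters. It assumes no stop words are removed, since the contents of stopwords.txt are not shown, and it does not model relevance scoring at all. The names are taken from your results.

```python
import re


def analyze(text, min_gram=2, max_gram=15):
    """Approximate the analyzer chain: lowercase, split on non-letters, n-gram."""
    grams = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            for start in range(len(token) - size + 1):
                grams.add(token[start:start + size])
    return grams


# Grams that the second query has and the first one does not.
extra = analyze("batamindo vi") - analyze("batamindo v")
print(extra)

# Indexed names that contain one of those extra grams.
names = ["BATAMINDO V", "MV. VINCA", "NAVI SUNNY", "MT. GOLDEN VIOLET"]
print([name for name in names if analyze(name) & extra])
```

Under these assumptions, the only difference between the two queries is the gram vi, and that gram is present in names that have nothing to do with batamindo.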
For more information, see Document Analysis in Solr. Cheers!