parsing solr lucene tokenize solr6

Filename getting parsed incorrectly in filter query in Solr 6.6


How do I prevent a filter query such as ss_content:70756_box4_v29.jpg from being parsed as shown below?

"filter_queries":["ss_content:(41339_box4_v29.jpg)"],
"parsed_filter_queries":["ss_content:41339_box4_v29 ss_content:jpg"]

In parsed_filter_queries, the filename has been split into two separate query terms. Even if I wrap the filename in double quotes, it is still split into two parts.

"filter_queries":["ss_content:\\\"70756_box4_v29.jpg\\\""],
"parsed_filter_queries":["ss_content:70756_box4_v29 ss_content:jpg"],

This causes the query to return incorrect results, because ss_content is a keywords field and the separate jpg term can match unrelated documents.

For example:

"ss_content":"628_test.jpg none  facets media image file type jpg type packaging graphics packaging generic year 1996 "

Solution

  • You need to change the tokenizer that the ss_content field uses. Right now it splits on the dot, which produces two terms to query. Set the field's analysis chain to use a tokenizer that does not split on the dot, for example the WhitespaceTokenizer (study carefully which tokenizer best matches your use case).
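
As a rough sketch of what that change might look like in the schema: the field type name text_ws_keywords is made up for illustration, and this assumes ss_content is declared explicitly rather than matched by a dynamic field rule, so adapt it to your actual schema. The lowercase filter is optional and only there to keep matching case-insensitive.

```xml
<!-- Hypothetical field type: splits on whitespace only, so "70756_box4_v29.jpg" stays one term -->
<fieldType name="text_ws_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- WhitespaceTokenizerFactory does not break tokens on dots or underscores -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Optional: case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed explicit declaration of the field from the question -->
<field name="ss_content" type="text_ws_keywords" indexed="true" stored="true"/>
```

Because the analyzer also runs at index time, existing documents need to be reindexed after the change. The Analysis screen in the Solr admin UI lets you check how a value such as 628_test.jpg is tokenized for index and query before you reindex.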