html postgresql postgresql-9.5

Why is PostgreSQL stripping HTML tags in ts_headline()?


I'm writing a prototype of a full-text search feature that returns a "headline" for each matching document in the search results. Here's a slightly modified example from the Postgres docs:

SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms <b>and</b> return them in <order> of their similarity to the query.',
  to_tsquery('query & similarity'),
  'StartSel = XXX, StopSel = YYY');

I would expect something like

"documents containing given XXXqueryYYY terms <b>and</b> return them in <order> of their XXXsimilarityYYY to the XXXqueryYYY."

What I get instead is

"documents containing given XXXqueryYYY terms  and  return them in   of their XXXsimilarityYYY to the XXXqueryYYY."

It looks like everything that remotely resembles an HTML tag is stripped and replaced with a single space character (note the double spaces around "and").

I couldn't find anything in the docs stating that Postgres assumes the input text is HTML and that the user wants the tags stripped. The API allows overriding StartSel and StopSel from the defaults <b> and </b>, so I'd think it was meant to serve a more general use case.

Is there some setting or comment in the docs that I'm missing?


Solution

  • <b> and </b> are recognized as tokens of type tag, which are ignored by default. You need to modify an existing configuration or create a new one:

    =# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
    =# alter text search configuration english_tag
       add mapping for tag with simple;
    

    Then tags aren't skipped:

    =# select * from ts_debug('english_tag', 'query <b>test</b>');
       alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
    -----------+-----------------+-------+----------------+--------------+---------
     asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
     blank     | Space symbols   |       | {}             | (null)       | (null)
     tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
     asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
     tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}
    

    But even in this case ts_headline will still skip tags, because that behavior is hardcoded in the default parser's source:

    #define HLIDREPLACE(x)  ( (x)==TAG_T )
    

    There is a workaround, of course: you can create your own text search parser extension (there is an example on GitHub) and change

    #define HLIDREPLACE(x)  ( (x)==TAG_T )
    

    to

    #define HLIDREPLACE(x)  ( false )
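
Before going down the custom-parser route, it may be worth confirming on your own server that the tag mapping alone does not change what ts_headline returns. The following is only a minimal sketch: the configuration name english_tag_demo is made up for this check, and the whole thing runs inside a transaction that is rolled back so nothing is left behind.

```sql
BEGIN;

-- Throwaway configuration (hypothetical name), built the same way as english_tag above.
CREATE TEXT SEARCH CONFIGURATION english_tag_demo (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_tag_demo
    ADD MAPPING FOR tag WITH simple;

-- Same headline call as in the question, but using the configuration that maps tags.
-- Given the hardcoded HLIDREPLACE check, the tags should still come back as spaces.
SELECT ts_headline('english_tag_demo',
  'The most common type of search is to find all documents containing given query terms <b>and</b> return them in <order> of their similarity to the query.',
  to_tsquery('english_tag_demo', 'query & similarity'),
  'StartSel = XXX, StopSel = YYY');

-- Discard the throwaway configuration.
ROLLBACK;
```

If the output still shows spaces where the tags were, the mapping is not the limiting factor and the change has to happen in the parser itself, as described above.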