
Elasticsearch raw HTML document search


I store the raw HTML of websites in Elasticsearch, for example in a field called "html_content":

"\ufeff<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"><html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">\t<head>\t  \t<base href=\"http://forum.pl\">\t  \t\t  \t<!-- Google Webmaster Tools -->\t\t\t\t<meta name=\"google-site-verification\" content=\"F6oatYg_CzKAKO7hA3Sy11S10eS0_ZYC1yGaoMbKYTU\" />\t\t\t  \t    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\t    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=EmulateIE7\" />\t    \t    \t    \t   \t<title>Dolnośląska Fundacja Rozowju Regionalnego - Forum.pl</title>\t    \t    <link href=\"/public/css/style.css\" rel=\"stylesheet\" type=\"text/css\">\t\t<link rel=\"stylesheet\" href=\"/public/css/menu.css\" type=\"text/css\" />\t\t<!--[if IE 6]>\t\t<link href=\"/public/css/clean_ie6.css\" rel=\"stylesheet\" type=\"text/css\" />\t\t<![endif]-->\t\t<!--[if IE 7]>\t\t<link href=\"/public/css/clean_ie.css\" rel=\"stylesheet\" type=\"text/css\"    

Now I want to run a search and find all documents that contain:

 rel="stylesheet" type="text/css    

in the html_content field.

How should I create my index (which mappings and analyzer should I use)? How should I write the search query?

I have tried a lot of things from the docs and Google, but I can't find the answer.


Solution

  • For the analyzer I used:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "testowy": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }
    

    To search, for example:

    {
      "query": {
        "match_phrase": {
          "html_content": {
            "query": "rel=\"stylesheet\" type=\"text/css"
          }
        }
      }
    }
    

    Or, to find documents that contain both phrases:

    {
      "query": {
        "bool": {
          "must": [
            {"match_phrase": {"html_content": "rel=\"stylesheet\" type=\"text/css"}},
            {"match_phrase": {"html_content": "<meta name=\"distribution\""}}
          ]
        }
      }
    }
    

    I still don't know why a search for "rel=\"stylesheet\" type=\"text/css" does not return the same results as a search for

    "rel=\"stylesheet\" type=\"text/cs"
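
On that last point, a likely explanation is that match_phrase compares whole analyzed tokens, not characters. With the standard tokenizer and the lowercase filter, text/css becomes the two tokens text and css, and cs is a different token from css, so the shortened phrase finds nothing. The Python sketch below only approximates that behaviour and is not Elasticsearch itself. The helper names analyze and phrase_matches are made up for illustration, and the regex split is an assumption standing in for the real standard tokenizer, which follows Unicode word-boundary rules and differs in edge cases.

```python
import re


def analyze(text):
    """Rough stand-in for the "testowy" analyzer (assumption, not the real thing):
    split on non-word characters, then lowercase every token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def phrase_matches(document, phrase):
    """Return True if the analyzed phrase tokens appear consecutively,
    in order, among the analyzed document tokens."""
    doc_tokens = analyze(document)
    phrase_tokens = analyze(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# A shortened sample in the spirit of the html_content value from the question.
html = '<link href="/public/css/style.css" rel="stylesheet" type="text/css">'

print(analyze('rel="stylesheet" type="text/css'))
# ['rel', 'stylesheet', 'type', 'text', 'css']

print(phrase_matches(html, 'rel="stylesheet" type="text/css'))  # True
print(phrase_matches(html, 'rel="stylesheet" type="text/cs'))   # False: "cs" != "css"
```

To see the tokens Elasticsearch actually produces, the _analyze API can be run against the index with the testowy analyzer. If matching on an incomplete last word is really what is needed, the match_phrase_prefix query treats the final term as a prefix; whether it is a good fit depends on the data.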