html, indexing, solr, nutch

How to index crawled "html" from Apache Nutch to Solr?


I want to index the raw HTML source of the web pages I crawl with Apache Nutch (v1.17) into Solr (8.6.3), but I don't know how to do that. So far, only a processed version of the page content ends up in the Solr `content` field (see below).

{
  "tstamp":"2020-11-19T08:41:15.908Z",
  "digest":"fdc7532e799d4a3a434be4be67c36bb3b",
  "boost":1.0,
  .
  .
  .
  "content":"Algorithm Engineering Group ....",
  "_version_":16837969286885539843
}

I have already looked at index-writers.xml, but I still can't see how to configure this. Any pointers would be appreciated.


Solution

  • The Nutch index tool provides a command-line option to index the raw content of web pages:

    $> bin/nutch index
    ...
    -addBinaryContent  index raw/binary content in field `binaryContent`
    -base64            use Base64 encoding for binary content
    ...
    

    Note: the raw content is indexed regardless of document type, so be aware that PDFs and other binary formats the crawler visits will also end up in this field.
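
If the index job is run with both -addBinaryContent and -base64, the `binaryContent` field should hold the page source as a Base64 string, which has to be decoded on the client side to get the HTML back. Below is a minimal Python sketch of that decoding step. The helper name `decode_binary_content`, the sample document, and the UTF-8 default are assumptions for illustration only; whether Solr returns the field as a single string or as a list depends on your schema, so the sketch handles both.

```python
import base64


def decode_binary_content(doc, encoding="utf-8"):
    """Return the raw page source stored in a Solr document's binaryContent field.

    Assumes the field was written with -addBinaryContent -base64, i.e. it
    holds a standard Base64 string (or a list containing one, if the Solr
    field is multi-valued).
    """
    raw = doc["binaryContent"]
    if isinstance(raw, list):
        # Multi-valued Solr fields come back as lists; take the first value.
        raw = raw[0]
    # errors="replace" keeps the call from failing on pages whose real
    # charset differs from the assumed one.
    return base64.b64decode(raw).decode(encoding, errors="replace")


# Hypothetical document, built here only to show the round trip.
sample_doc = {
    "binaryContent": base64.b64encode(b"<html><body>Hello</body></html>").decode("ascii")
}

html = decode_binary_content(sample_doc)
print(html)
```

For non-HTML documents such as PDFs, skip the `.decode(...)` step and work with the bytes returned by `base64.b64decode` directly.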