I want to index the raw source code of the web pages crawled by Apache Nutch (v1.17) in Solr (8.6.3), but I don't know how. So far, only a processed version of each page's content ends up in Solr (see below).
{
"tstamp":"2020-11-19T08:41:15.908Z",
"digest":"fdc7532e799d4a3a434be4be67c36bb3b",
"boost":1.0,
.
.
.
"content":"Algorithm Engineering Group ....",
"_version_":16837969286885539843
}
I have already looked at index-writers.xml, but I still can't see how to do this. Does anyone know how?
The Nutch index tool provides a command-line option to index the raw content of web pages:
$> bin/nutch index
...
-addBinaryContent index raw/binary content in field `binaryContent`
-base64 use Base64 encoding for binary content
...
Note: the crawler may also fetch PDFs and other binary formats, and their raw content would be indexed the same way.
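If the index job is run with both -addBinaryContent and -base64, the `binaryContent` field should hold the page source as a Base64 string, so a client has to decode it to get the HTML back. The following is a minimal Python sketch of that step, not part of Nutch or Solr. The helper name `decode_binary_content`, the sample document, and the UTF-8 charset are assumptions made for illustration; real pages may use other encodings, and binary formats such as PDF should not be decoded to text at all.

```python
import base64


def decode_binary_content(doc, encoding="utf-8"):
    """Return the page source stored in a Solr document, or None if absent.

    `doc` is assumed to be a dict parsed from a Solr JSON response.
    `encoding` is an assumption: the page's real charset may differ.
    """
    raw = doc.get("binaryContent")
    if raw is None:
        return None
    # A multi-valued field comes back as a list; use the first value.
    if isinstance(raw, list):
        raw = raw[0]
    page_bytes = base64.b64decode(raw)
    # Replace undecodable bytes instead of failing on a wrong charset guess.
    return page_bytes.decode(encoding, errors="replace")


# Hypothetical document, built here only to make the sketch runnable.
sample_doc = {
    "binaryContent": base64.b64encode(
        b"<html><body>Hello</body></html>"
    ).decode("ascii")
}
html_source = decode_binary_content(sample_doc)
print(html_source)
```

Returning None when the field is missing keeps documents that were indexed without -addBinaryContent from raising errors in a loop over search results.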