How can I configure the Nutch crawler to crawl only English pages?
This is what I set in nutch-site.xml, but it does not work:
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.
</description>
</property>
The value you set, <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>, means that English is preferred but any other language (*) is still accepted, just with a lower weight (q=0.3). To crawl only English pages, set the value as below:
<value>en-us,en-gb,en</value>
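For reference, here is a minimal sketch of how the whole property might look once edited. It assumes the file is conf/nutch-site.xml and that it uses the usual <configuration> root element; the description is left out for brevity.

```xml
<!-- conf/nutch-site.xml (assumed location) -->
<configuration>
  <property>
    <name>http.accept.language</name>
    <!-- No "*" entry, so no other language is advertised as acceptable -->
    <value>en-us,en-gb,en</value>
  </property>
</configuration>
```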
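To be sure the setting takes effect, change the value in nutch-default.xml as well. Keep in mind that Accept-Language is only a preference sent to the server, so a site that ignores the header may still return non-English pages.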
Hope this helps
-Le Quoc Do