I have to crawl around 30k to 50k domains with Nutch 1.x on the AWS EMR service. The crawl will be incremental: first all pages, and later only the new or updated pages of these websites. For indexing, I am using Apache Solr. I have a few questions about best practices with EMR.
org.apache.hadoop.io.compress.ZStandardCodec is a good option.
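If the codec is meant for compressing MapReduce map and job output (an assumption, since it is not stated where it should be applied), one way to enable it on EMR is through a `mapred-site` configuration classification. Below is a minimal sketch that only builds that JSON. The helper name `build_emr_configurations` is made up for illustration, and whether ZStandard is actually usable depends on the Hadoop build in your EMR release having native zstd support, so please verify that before relying on it.

```python
import json

# Codec class named in the answer above.
ZSTD_CODEC = "org.apache.hadoop.io.compress.ZStandardCodec"


def build_emr_configurations(codec: str = ZSTD_CODEC) -> list:
    """Build an EMR 'Configurations' list enabling `codec` in mapred-site.

    Hypothetical helper: it only assembles a data structure and does not
    call AWS. The property names are the standard Hadoop MapReduce ones.
    """
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Compress intermediate map output (shuffle data).
                "mapreduce.map.output.compress": "true",
                "mapreduce.map.output.compress.codec": codec,
                # Compress final job output (e.g. Nutch segments).
                "mapreduce.output.fileoutputformat.compress": "true",
                "mapreduce.output.fileoutputformat.compress.codec": codec,
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be supplied as cluster configurations.
    print(json.dumps(build_emr_configurations(), indent=2))
```

The printed JSON can be pasted into the software settings when creating the cluster. The same properties could also be passed per job with `-D` options instead of being set cluster-wide.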