indexing solr ibm-cloud nutch retrieve-and-rank

Indexer IOException "Job failed!" while indexing Nutch-crawled data in Bluemix Solr


I'm trying to index Nutch-crawled data in Bluemix Solr. I ran the following command at my command prompt:

bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*

But the indexing fails to finish. The output is as follows:

Indexer: starting at 2016-06-16 16:31:50
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SolrIndexWriter
        solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
        solr.server.url : URL of the Solr instance (mandatory)
        solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
        solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.commit.size : buffer size when sending to Solr (default 1000)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexing 153 documents
Indexing 153 documents
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I guess it has something to do with the solr.server.url address, maybe the end of it. I changed it in different ways, e.g.:

"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update".

(since it is used for indexing JSON/CSV/... files with Bluemix Solr). But no luck so far.

Does anyone know how I can fix this? And if the problem is what I guessed, what exactly should solr.server.url be? By the way, "example_collection" is my collection's name, and I'm working with Nutch 1.11.


Solution

  • As far as I know, it is not possible to index Nutch-crawled data in Bluemix Retrieve and Rank (R&R) with the index command provided by Nutch itself (bin/nutch index ...). Instead, to index Nutch-crawled data in the Bluemix Retrieve and Rank service, one should:

    1. Crawl the seeds with Nutch, e.g.:

      $ bin/crawl -w 5 urls crawl 25

      You can check the status of the crawl with:

      $ bin/nutch readdb crawl/crawldb/ -stats

    2. Dump the crawled data as files:

      $ bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

    3. Post the files that can be indexed directly (e.g., XML files) to the Solr collection on Retrieve and Rank:

      Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' % (solr_cluster_id, solr_collection_name)
      cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' % (Cont_type_xml, solr_credentials, Post_url, myfilename)
      subprocess.call(cmd, shell=True)

    4. Convert the rest to JSON with the Bluemix Document Conversion service:

      doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
      cmd = '''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' % (doc_conv_credentials, myfilename, doc_conv_url)
      process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
      stdout, stderr = process.communicate()  # stdout holds the JSON response

      Then save these JSON results to a JSON file.

    5. Post this JSON file to the collection:

      Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' % (solr_cluster_id, solr_collection_name)
      cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' % (Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
      subprocess.call(cmd, shell=True)

    6. Send queries:

      pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
      results = pysolr_client.search(Query_term)
      print(results.docs)
      

      The code is in Python.
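
The snippets above use several names that are defined elsewhere in the script (Cont_type_xml, Cont_type_json, solr_credentials, myfilename). As a rough sketch, assuming the content-type variables hold a Content-Type header and the credentials are a "username:password" pair, the same curl calls could be built as argument lists, which avoids the nested shell quoting. The helper names, the placeholder values, and the file path are hypothetical; only the URL pattern comes from the steps above.

```python
import subprocess

# Base URL taken from the steps above.
RR_BASE = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"


def solr_update_url(cluster_id, collection, suffix="update"):
    """Build the collection's update endpoint (hypothetical helper)."""
    return "%s/solr_clusters/%s/solr/%s/%s" % (RR_BASE, cluster_id, collection, suffix)


def curl_post_argv(url, credentials, content_type, filename):
    """Return the curl command as a list, so no shell quoting is needed."""
    return [
        "curl", "-X", "POST",
        "-H", "Content-Type: " + content_type,
        "-u", credentials,          # assumed to be "username:password"
        url,
        "--data-binary", "@" + filename,
    ]


def post_file(url, credentials, content_type, filename):
    """Run curl without shell=True and return its exit code."""
    return subprocess.call(curl_post_argv(url, credentials, content_type, filename))


if __name__ == "__main__":
    # Placeholder values; nothing is sent, the command is only printed.
    url = solr_update_url("CLUSTER-ID", "example_collection")
    print(curl_post_argv(url, "USERNAME:PASS", "application/xml", "dumpData/page.xml"))
```

Passing a list to subprocess.call means the URL, with its & and ? characters, never goes through the shell, so the extra pair of quotes embedded in the URL strings is no longer needed.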

    For beginners: you can run the curl commands directly in your command prompt. I hope this helps others.
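
The split and f parameters in the update/json/docs URL ask Solr to create one document per answer unit, with id, title, and body fields. To preview what would be indexed before posting, the same mapping can be sketched locally. The input shape used here is an assumption inferred from those field paths (an answer_units list whose entries carry id, title, and a content list with text), the function name is hypothetical, and joining the text parts with newlines is a choice made for this sketch only.

```python
import json


def flatten_answer_units(converted):
    """Map each answer unit to a flat dict with id, title, and body."""
    docs = []
    for unit in converted.get("answer_units", []):
        # Collect the text of every content part that has one.
        texts = [part["text"] for part in unit.get("content", []) if "text" in part]
        docs.append({
            "id": unit.get("id"),
            "title": unit.get("title"),
            "body": "\n".join(texts),
        })
    return docs


if __name__ == "__main__":
    # Made-up sample in the assumed shape, not real service output.
    sample = {
        "answer_units": [
            {
                "id": "unit-1",
                "title": "Example title",
                "content": [{"media_type": "text/plain", "text": "Example body text."}],
            }
        ]
    }
    print(json.dumps(flatten_answer_units(sample), indent=2))
```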