I am trying to index some web pages in the Bluemix Retrieve and Rank service. I crawled my seeds with Nutch 1.11, dumped the crawled data (about 9000 URLs) as files, and posted the files that can be posted directly, e.g. XML files, to my collection:
Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
subprocess.call(cmd,shell=True)
and converted the rest to JSON with the Bluemix Document Conversion service:
doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
cmd ='''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
process = subprocess.Popen(cmd, shell= True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
I then saved the JSON results in a JSON file and posted it to my collection:
Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
subprocess.call(cmd,shell=True)
Everything appears to work. The JSON file looks as it should, and when I post the data I receive status 0, which I thought meant the post had succeeded. But when I send queries:
pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
results = pysolr_client.search(Query_term)
print(results.docs)
the result is empty; it finds nothing. I have done the same thing before, with the same command structure, and it worked. I just created a new collection, and now it doesn't work.
Has my data been indexed? If so, why does the query not work? When I request usage statistics for my Solr cluster, the result is:
{"disk_usage":{"used_bytes":2210,"total_bytes":34359738368,"used":"2.1582 KB","total":"32 GB","percent_used":6.4319465309381485E-6},
"memory_usage":{"used_bytes":2069028864,"total_bytes":4194304000,"used":"1.9269 GB","total":"3.9063 GB","percent_used":49.3294921875}}
which I thought meant my data had been indexed and stored in my cluster. I have just realized, though, that the disk usage and memory usage do not change when I post my data. Does that mean the post is not going through, even though I receive status 0? If so, any ideas about what the problem is and why it is happening?
Does it have anything to do with the solr_config?
Any help or ideas on how to get results from a query would be highly appreciated.
The URL used for posting the converted files has to split the data by /answer_units, not by /answer_units/id, so it should be:
Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
Pay attention to the split=/answer_units part.
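To make the intended outcome concrete, here is a small offline sketch of what splitting at /answer_units is meant to produce: one document per answer unit, with id, title and body taken from the paths in the f= parameters. It only emulates the mapping in plain Python and is not how Solr implements it. The helper name split_answer_units and the sample data are made up for illustration, and the shape of the sample (an answer_units list whose entries carry id, title and a content list with text entries) is an assumption based on the field paths in the URL above.

```python
import json


def split_answer_units(converted):
    """Emulate the split=/answer_units mapping: one flat doc per answer unit.

    Hypothetical helper for illustration only; Solr does this server-side.
    """
    docs = []
    for unit in converted.get("answer_units", []):
        docs.append({
            "id": unit["id"],                # f=id:/answer_units/id
            "title": unit.get("title", ""),  # f=title:/answer_units/title
            # f=body:/answer_units/content/text (a unit may hold several entries)
            "body": [part["text"] for part in unit.get("content", [])
                     if "text" in part],
        })
    return docs


# Made-up sample in the assumed shape of a converted file.
sample = {
    "answer_units": [
        {"id": "a1", "title": "First",
         "content": [{"media_type": "text/plain", "text": "alpha"}]},
        {"id": "a2", "title": "Second",
         "content": [{"media_type": "text/plain", "text": "beta"}]},
    ]
}

docs = split_answer_units(sample)
print(json.dumps(docs, indent=2))
```

Running this against one of your own converted files (loaded with json.load) gives a quick idea of how many documents the collection should contain after posting, which you can compare with what a query actually returns.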