Here is what I have done so far.
I added a request handler to solrconfig.xml as follows:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">wiki-data-config.xml</str>
  </lst>
</requestHandler>
In the same configuration directory I created a file, wiki-data-config.xml, which contains the following:
<dataSource type="FileDataSource" encoding="UTF-8" />
<entity name="page"
flatten="true" >
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="revision" xpath="/mediawiki/page/revision/id" />
<field column="user" xpath="/mediawiki/page/revision/contributor/username" />
<field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
<field column="text" xpath="/mediawiki/page/revision/text" />
<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
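To see which pages the $skipDoc rule above would drop, here is a minimal Python sketch of the same pattern. It only approximates the match step: Solr's RegexTransformer uses Java regular expressions, and the helper name `would_skip` and the sample strings are made up for illustration. Note that the pattern is case-sensitive and requires a space after #REDIRECT.

```python
import re

# Same pattern as the regex attribute on the $skipDoc field.
REDIRECT = re.compile(r"^#REDIRECT .*")


def would_skip(text):
    """Return True when the text starts with '#REDIRECT ' (hypothetical helper)."""
    return text is not None and REDIRECT.search(text) is not None


samples = [
    "#REDIRECT [[World Open (snooker)]]",  # redirect page
    "#redirect [[World Open (snooker)]]",  # lower case, so not matched
    "Some article text",                   # ordinary page
    "",                                    # empty text, as in a stub dump
]

for sample in samples:
    print(repr(sample), would_skip(sample))
```

Only the first sample matches, so an empty text value never triggers the skip.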
My schema.xml contains the following:
<!-- Tanny edit starts -->
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="false"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userId" type="int" indexed="true" stored="true"/>
<field name="text" type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="titleText" type="text_en" indexed="true" stored="true"/>
<copyField source="title" dest="titleText"/>
<!-- Tanny edit ends -->
After restarting Solr, I tried to post the MediaWiki XML data using the ./bin/post script as follows:
tanny@localhost:~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml
It prints the following to the console:
/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file enwiki-20150702-stub-articles8.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...
Time spent: 0:00:00.863
However, when I open the admin UI and check the core overview, it shows 0 documents indexed. I cannot work out which piece of configuration I am missing. Any help or guidance would be highly appreciated.
P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the Wikimedia page. A few sample lines from the file follow:
<mediawiki xmlns="" xmlns:xsi="" xsi:schemaLocation="" version="0.10" xml:lang="en">
<generator>MediaWiki 1.26wmf11</generator>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2600" case="first-letter">Topic</namespace>
<title>700 (number)</title>
<comment>Disambiguated: [[Tintin]] → [[The Adventures of Tintin]]</comment>
<text id="669059875" bytes="12464" />
<title>Canadian federal election, 1957</title>
<comment>/* Impact */ clarify</comment>
<text id="671713242" bytes="77788" />
<title>Professional Players Tournament (snooker)</title>
<redirect title="World Open (snooker)" />
<comment>Robot: Fixing double redirect to [[World Open (snooker)]]</comment>
<text id="360810125" bytes="34" />
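As a quick check of what the XPaths in wiki-data-config.xml would pick up from a stub dump, the sketch below parses a tiny hand-written page with Python's ElementTree. The `<page>` and `<revision>` wrappers, the id value 1, and the omitted XML namespace are assumptions for illustration; only the title and the `<text>` element are taken from the sample above. In these stub lines `<text>` carries only id and bytes attributes, so the text field comes out empty.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one page of the dump (namespace omitted, id invented).
SAMPLE = """<mediawiki>
  <page>
    <title>700 (number)</title>
    <id>1</id>
    <revision>
      <text id="669059875" bytes="12464" />
    </revision>
  </page>
</mediawiki>"""

# Field name -> path relative to <page>, mirroring the xpath attributes.
PATHS = {"id": "id", "title": "title", "text": "revision/text"}


def extract(page):
    """Collect the text of each mapped element ('' if empty, None if absent)."""
    return {name: page.findtext(path) for name, path in PATHS.items()}


root = ET.fromstring(SAMPLE)
docs = [extract(page) for page in root.findall("page")]
print(docs)
```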
The data got indexed after I ran the import with the command: curl "http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import".
For some reason ./bin/post was not able to do the same. I have not researched this further; if anyone has figured out why, please share your findings.
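For anyone retrying this, here is a small Python sketch of the call that did work, plus a status check. One likely reason for the difference, judging from the console output above: ./bin/post sends the file to the core's /update handler, which expects Solr's own update format, whereas the import configured in wiki-data-config.xml only runs when /dataimport is called. The helper names are hypothetical, and the host, port, and core name are the ones from the question. The status command and wt=json are standard parameters, but treat the response handling as an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8983/solr"  # host and port from the question
CORE = "core-base-wiki"


def dih_url(command, base=BASE, core=CORE):
    """Build a DataImportHandler URL for the given command."""
    query = urlencode({"command": command, "wt": "json"})
    return f"{base}/{core}/dataimport?{query}"


def run_command(command):
    """Call the handler and return the decoded JSON response (needs a running Solr)."""
    with urlopen(dih_url(command)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(dih_url("full-import"))
    # Uncomment against a live instance:
    # print(run_command("full-import"))
    # print(run_command("status"))
```

Running full-import first and then polling status until it reports idle is one way to confirm that documents were actually added before checking the overview page.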