solr dih

Solr: Avoid duplicate records while importing from another Solr core


I am trying to import a single field from one Solr core into another core using the DataImportHandler (DIH). The Solr version is 6.4.0.

My managed-schema file has the following entries:

<uniqueKey>journal</uniqueKey>
<field name="journal" type="text_general" multiValued="false" indexed="true" stored="true" />
<field name="fjournal" type="string" indexed="true" stored="false"/>

and also one copyField setting, as shown below:

<copyField source="journal" dest="fjournal" />

In solrconfig.xml, I configured the following elements:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler>
    <lst name="defaults">
        <str name="config">solr-data-config.xml</str>
    </lst>
</requestHandler>

<updateRequestProcessorChain>
    <processor class="solr.UniqFieldsUpdateProcessorFactory">
        <str name="fieldName">journal</str>
    </processor>

    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The file "solr-data-config.xml" contains the following:

<dataConfig>
  <document>
    <entity name="journalMaster" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/journalMaster"
            query="*:*"
            fl="journal"/>
  </document>
</dataConfig>

When I run the import, the target core still contains duplicate values after the import has completed:

      {
        "journal":"Journal of Immunology",
        "_version_":1559554209274134528,
        "fjournal":"Journal of Immunology"},
      {
        "journal":"Journal of Immunology",
        "_version_":1559554209373749248,
        "fjournal":"Journal of Immunology"},
      {
        "journal":"Journal of Immunology",
        "_version_":1559554209375846400,
        "fjournal":"Journal of Immunology"},

How do I prevent this? I am importing the data from one local core into another.

Any help will be really appreciated.


Solution

  • A uniqueKey field does not need to be analysed. It should be a plain string that uniquely identifies each document. Many Lucene/Solr features rely on this identifier, so it is important to define it properly.

    In your example I would use 'fjournal' as the unique key.

    Then there is nothing else to worry about: every time you re-index the same fjournal value, the existing Solr document is overwritten, so you end up with a single entry per value.

    A more interesting question is probably why you need to index documents with a single field ...
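
The schema change suggested above could look roughly like the sketch below. It is only one possible layout, not a tested configuration: it swaps the roles of the two fields so that the unanalysed string field is the one filled directly by the import, because, as far as I know, Solr rejects a uniqueKey that is the destination of a copyField. The field name journal_text is hypothetical and only needed if you still want full-text search on the journal name.

```xml
<!-- managed-schema: sketch only; "journal_text" is a hypothetical field name -->
<uniqueKey>journal</uniqueKey>

<!-- Unanalysed string key: identical journal names map to the same document -->
<field name="journal" type="string" multiValued="false" indexed="true" stored="true" required="true"/>

<!-- Optional analysed copy of the key, used only for full-text search -->
<field name="journal_text" type="text_general" indexed="true" stored="false"/>
<copyField source="journal" dest="journal_text"/>
```

After changing the uniqueKey, the target core has to be reloaded and the data re-imported from scratch, so that the duplicates already in the index are removed and each later occurrence of the same journal name overwrites the earlier one.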