jdbcelasticsearchelasticsearch-jdbc-river

ElasticSearch JDBC River creates duplicates


I'm trying to use a JDBC river to copy my MySQL database to my ElasticSearch index.

This however creates double the amount of documents compared to the count(*) of the MySQL table whenever I start the server. I tested multiple times by emptying the Index and recreating the mapping and re applying the river.

Just to be complete this is the mapping of my products index and products type:

{
"products":{
  "properties":{
     "product_id":{
        "type":"string"
     },
     "naam":{
        "type":"string"
     },
     "merk":{
        "type":"string"
     },
     "lijn":{
        "type":"string"
     },
     "sku":{
        "type":"string"
     },
     "omschrijving":{
        "type":"string",
        "boost":"0.5"
     },
     "groep":{
        "type":"string"
     },
     "ean":{
        "type":"string",
        "boost":"2.0"
     },
     "kenmerken":{
        "type":"nested",
        "dynamic":true
     },
     "levertijd_min":{
        "type":"string"
     },
     "levertijd_max":{
        "type":"string"
     }
  }
}
}

And this is my config.json:

 {
"type": "jdbc",
"jdbc":{
    "url": "jdbc:mysql://localhost:3306/db",
    "strategy":"simple",
    "schedule" : "0 */30 8-16  ? * *",
    "user":"user",
    "versioning":true,
    "password":"password",
    "sql":"select * from producten_elasticsearch",
    "index":"products",
    "type":"products",
    "autocommit":true,
    "timezone":"TimeZone.getDefault()"
}

}

I tried switching from strategy: simple to column and back again but it keeps on happening.

This the the database schema:

enter image description here

As you can see the _id is the PK on the schema.

Why am I seeing twice as many documents in the elasticsearch index?


Solution

  • Have you tried comparing or even looking for dupes in Elasticsearch?

    Or you're just comparing counts from your SQL COUNT(*) query to document count in head plugin?

    Document count in head plugin

    If that's the case, then your issue should be that you're mapping kenmerken as nested type. They are stored as seperate documents in your index

    From documentation:

    Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.

    Which means that your one row, which you imported is stored as a document and kenmerken is stored as another document (and linked to, let's name it, core document), which means that for one imported row, you're having two documents indexed. That's how I can explain double number. If that's not the case - ignore my answer.