I have an elasticsearch index which has multiple documents, now I want to update the index with some new documents which might also contain duplicates of the existing documents. What's the best way to do this? I'm using elasticsearch py for all CRUD operations
Every update in elasticsearch deletes the old document and create a new document as the smallest unit of document collection is called segments in elastic-search which are immutable, hence when you index a new document or update any exiting documents, it gets into the new segments which are merged into bigger segments during the merge process.
Now even if you have duplicate data but with the same id, it will replace the existing document, and its fine and more performant than first fetching the document and than comparing both the documents to see if they are duplicate and than discard the update/upsert request from an application, rather than just index whatever if coming and ES will again insert the duplicate docs.