elasticsearch

Indexing on sensitive data


We need to be able to search a lot of personally identifiable information. We were thinking of using Elasticsearch for this, but it stores the original document, which is a problem for us.

Is there a way to index a field but not store it? In this case, if we got a hit on a record, we would get back the guid - or more likely the encrypted guid - of a record in DynamoDB that would contain the original document. But if someone managed to pinch the ES database, they couldn't easily reconstruct the original information.

Thanks,

Adam.


Solution

  • Just found the answer: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

    You can disable storing the original source document and keep only the index - just what I wanted. We will be storing the original docs in Amazon's DynamoDB - encrypted, of course. The ES indexes will allow us to perform the searches we want without storing the original doc.

    The ES docs do stress thinking carefully about this approach - for example, if we needed to re-index, we'd have to pull everything out of DynamoDB and feed it through ES again.

    As a side note, we've recently been to the AWS Summit and they encouraged us to look at Kinesis as the pipeline for indexing the docs and then storing them in DynamoDB.

    Adam
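
For anyone landing here later, this is a rough sketch of what the approach above might look like, not a tested production setup. It only builds the JSON bodies you would send to Elasticsearch, so it runs without a cluster. The field names (`full_name`, `email`) and the helper function names are made up for illustration, and it assumes the typeless mapping format of Elasticsearch 7 and later, with the (encrypted) guid used as the document `_id`.

```python
import json


def build_index_body():
    """Body for creating an index with _source disabled.

    Fields are still analyzed and indexed for search, but the original
    JSON document is not kept in Elasticsearch.
    Field names here are hypothetical.
    """
    return {
        "mappings": {
            "_source": {"enabled": False},
            "properties": {
                "full_name": {"type": "text"},
                "email": {"type": "keyword"},
            },
        }
    }


def build_search_body(name):
    """A simple match query against the hypothetical full_name field."""
    return {"query": {"match": {"full_name": name}}}


def record_ids(search_response):
    """Pull the document ids out of a search response.

    With _source disabled, hits still carry _id, which is what we would
    use to look up the original record in DynamoDB.
    """
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_search_body("adam"), indent=2))
```

The ids returned by `record_ids` would then be the keys for fetching (and decrypting) the original documents from DynamoDB. Note that the indexed terms themselves still live in the ES index, so this makes reconstruction harder rather than impossible.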