elasticsearch, full-text-search, inverted-index, elasticsearch-analyzers

Elasticsearch Analyzer for Dynamically Defined Regular Expression Searches


We have lots of documents in an Elasticsearch index and are doing full-text searches on them at the moment. My next requirement in a project is to find all credit card data in the documents. The user will also be able to define some regular-expression search rules dynamically in the future. But with the standard analyzer it is not possible to search for credit card info or any user-defined rule. For instance, let's say a document contains credit card info such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as four parts, as seen below:

  "tokens" : [
    {
      "token" : "4321",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "4321",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "4321",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "4321",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
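
For reference, the tokenization above can be reproduced with the standard _analyze API; a request along these lines (shown for illustration only) produces those four tokens:

  POST _analyze
  {
    "analyzer": "standard",
    "text": "4321-4321-4321-4321"
  }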


I am not taking the Luhn algorithm into account for now. If I do a basic regular-expression search to find a credit card with the regexp "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing, because the data is not analyzed and indexed for that. I thought that for this purpose I would need to define a custom analyzer for regular-expression searches and store another version of the data in another field or index. But, as I said before, in the future the user will define his/her own custom rule patterns for searching. How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that? With an ngram tokenizer I think I could search for all defined regular-expression rules, but is that reasonable? The project has to work with huge amounts of data without any performance problems (a company's whole file system will be indexed). Do you have any other suggestion for this type of data-discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
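
For reference, an ngram tokenizer with those bounds would be declared roughly as follows (just a sketch; the index and analyzer names are illustrative, and a 2–20 gram spread also requires raising the index.max_ngram_diff index setting, which defaults to 1):

  PUT ngram_test
  {
    "settings": {
      "index.max_ngram_diff": 18,
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "type": "custom",
            "tokenizer": "ngram_tokenizer",
            "filter": [ "lowercase" ]
          }
        },
        "tokenizer": {
          "ngram_tokenizer": {
            "type": "ngram",
            "min_gram": 2,
            "max_gram": 20
          }
        }
      }
    }
  }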


Solution

  • OK, here is a pair of custom analyzers that can help you detect credit card numbers and social security numbers. Feel free to adapt the regular expressions as you see fit (by adding/removing other character separators that you find in your data).

    PUT test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "card_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "card_number"
              ]
            },
            "ssn_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "social_number"
              ]
            }
          },
          "filter": {
            "card_number": {
              "type": "pattern_replace",
              "preserve_original": false,
              "pattern": """.*(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4}).*""",
              "replacement": "$1$2$3$4"
            },
            "social_number": {
              "type": "pattern_replace",
              "preserve_original": false,
              "pattern": """.*(\d{3})[\s\.\-]+(\d{2})[\s\.\-]+(\d{4}).*""",
              "replacement": "$1$2$3"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "text": {
            "type": "text",
            "fields": {
              "card": {
                "type": "text",
                "analyzer": "card_analyzer"
              },
              "ssn": {
                "type": "text",
                "analyzer": "ssn_analyzer"
              }
            }
          }
        }
      }
    }
    

    Let's test this.

    POST test/_analyze
    {
      "analyzer": "card_analyzer",
      "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
    }
    

    This will yield a nice digit-only credit card number (the offsets span the whole input because the keyword tokenizer emits the entire text as a single token):

    {
      "tokens" : [
        {
          "token" : "3526472847236374",
          "start_offset" : 0,
          "end_offset" : 86,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    Similarly for SSN:

    POST test/_analyze
    {
      "analyzer": "ssn_analyzer",
      "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
    }
    

    This will yield a nice digit-only social security number:

    {
      "tokens" : [
        {
          "token" : "442231452",
          "start_offset" : 0,
          "end_offset" : 86,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    And now we can search for either a credit card or an SSN. Let's say we have the following two documents; the SSN and credit card numbers are the same, yet they use different character separators:

    POST test/_doc
    { "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374" }
    
    POST test/_doc
    { "text": "SSN is 442.23.1452 belongs to Mr. XYZ. He paid $20 via credit card number 3526-4728-4723-6374" }
    

    You can now find both documents by looking for the credit card number and/or SSN in any format:

    POST test/_search 
    {
      "query": {
        "match": {
          "text.card": "3526 4728 4723 6374"
        }
      }
    }
    
    POST test/_search 
    {
      "query": {
        "match": {
          "text.card": "3526 4728 4723-6374"
        }
      }
    }
    
    POST test/_search 
    {
      "query": {
        "match": {
          "text.ssn": "442 23-1452"
        }
      }
    }
    

    All the above queries will match and return both documents.
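
    As a further check, searching with the raw, separator-free digit string should also match: the keyword tokenizer emits the query as a single token, and since the pattern_replace pattern finds no separators, the token is left as-is, which is exactly what was indexed. (This is an inference from how the filter behaves, not one of the cases tested above.)

    POST test/_search
    {
      "query": {
        "match": {
          "text.card": "3526472847236374"
        }
      }
    }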