java regex

Extracting customer IDs from text


I need to extract customer IDs from text.

The customer IDs are unique alphanumeric character sequences. They can contain:

We can assume that they are longer than 5 characters. They might be capitalized or not.

I thought about using a dictionary: if a character sequence is longer than 5 characters and is not a word in the dictionary, it is a good candidate.

Any ideas or sample Java code?


Solution

  • Here is a simple regular expression that will match alphanumeric sequences of 6 characters or more:

    (?<![A-Za-z0-9])[A-Za-z0-9]{6,}
    

    I used a negative lookbehind here instead of a word boundary (\b) in case there are underscores in your text: \b treats the underscore as a word character, so an ID that directly follows an underscore would not be matched. If your regex flavor doesn't support lookbehind, use the word boundary instead (you mention Java in your question, and Java does support lookbehind).
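
To make the difference between the lookbehind and \b concrete, here is a small sketch that runs both variants over the same input. The class name, the helper method and the sample text are made up for illustration; only the first pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Lookbehind version from the answer: the match may not be preceded by a letter or digit.
    static final Pattern LOOKBEHIND = Pattern.compile("(?<![A-Za-z0-9])[A-Za-z0-9]{6,}");
    // Word-boundary version for comparison: \b counts the underscore as a word character.
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b[A-Za-z0-9]{6,}");

    // Collects every match of the given pattern in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "ref_AB12345 and XY98765z"; // made-up sample text
        System.out.println(findAll(LOOKBEHIND, text));    // [AB12345, XY98765z]
        System.out.println(findAll(WORD_BOUNDARY, text)); // [XY98765z]
    }
}
```

With \b there is no boundary between the underscore and the "A", so the first ID is skipped; the lookbehind version picks it up.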

    If the customer ID must contain a number, then a regular expression to match these would look like this:

    (?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,}
    

    See Regex101 demo.

    Is there a limit to how long your customer IDs can be? If so, then putting that limit in would probably be helpful - any alphanumeric character sequence longer than that number obviously won't be a match. If the limit is 25 characters, for example, the regex would look like this:

    (?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])
    

    (I added the lookahead at the end, otherwise this could simply match the first 25 characters of a long alphanumeric sequence!)
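
The effect of that trailing lookahead can be checked with a short sketch. It assumes the 25-character limit used as an example above; the class name, helper and sample text are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthLimitDemo {
    // Upper bound of 25 characters, but no trailing lookahead.
    static final Pattern NO_LOOKAHEAD = Pattern.compile(
            "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}");
    // Same pattern with the trailing lookahead from the answer.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile(
            "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])");

    // Collects every match of the given pattern in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // A 30-character alphanumeric run (too long to be an ID) followed by a plausible ID.
        String text = "ABCDEFGHIJ0123456789ABCDEFGHIJ ok12345";
        System.out.println(findAll(NO_LOOKAHEAD, text));   // [ABCDEFGHIJ0123456789ABCDE, ok12345]
        System.out.println(findAll(WITH_LOOKAHEAD, text)); // [ok12345]
    }
}
```

Without the lookahead, the first 25 characters of the over-long run are reported as if they were an ID; with it, the run is rejected as a whole.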

    Once you have the matches extracted from your text, then you could do a dictionary lookup. I know there are questions and answers on StackOverflow on this subject.
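
As a rough illustration of the dictionary step, the sketch below drops every candidate that appears in a word list. The dictionary here is a tiny hard-coded set of lower-case words, which is purely an assumption for the example; a real word list would be loaded from a file or another source. It uses the first (letters-or-digits) pattern, since that is where ordinary words would slip through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DictionaryFilter {
    // Candidate pattern from the answer: 6 or more letters or digits.
    static final Pattern CANDIDATE = Pattern.compile("(?<![A-Za-z0-9])[A-Za-z0-9]{6,}");

    // Returns the candidates that are NOT dictionary words.
    // The dictionary is assumed to hold lower-case words (hypothetical stand-in).
    static List<String> extract(String text, Set<String> dictionary) {
        List<String> ids = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String candidate = m.group();
            if (!dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                ids.add(candidate);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("please", "refund", "customer", "payment"); // made-up word list
        String text = "Please refund customer QWERTZ after payment";
        System.out.println(extract(text, dictionary)); // [QWERTZ]
    }
}
```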

    To actually use this regex in Java, you would use the Pattern and Matcher classes. For example,

    String mypattern = "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])";
    Pattern tomatch = Pattern.compile(mypattern);
    

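    From there, call tomatch.matcher(...) on your text to get a Matcher, then call find() in a loop and read each match with group(). Hope this helps.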
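
To round out the Pattern and Matcher usage, here is a minimal self-contained sketch that compiles the pattern once and collects every match. The class name, method name and sample text are placeholders, and the 25-character limit is still only the example value from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomerIdExtractor {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern CUSTOMER_ID = Pattern.compile(
            "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])");

    // Returns all substrings of the text that look like customer IDs.
    static List<String> extract(String text) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = CUSTOMER_ID.matcher(text);
        while (matcher.find()) {       // advance to the next match, if any
            ids.add(matcher.group());  // the text of the current match
        }
        return ids;
    }

    public static void main(String[] args) {
        String text = "Order for customer ab12cd34 shipped; see also Z9Y8X7W6 and invoice 77.";
        System.out.println(extract(text)); // [ab12cd34, Z9Y8X7W6]
    }
}
```

Plain words such as "customer" and "shipped" are skipped because they contain no digit, and "77" is too short.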

    UPDATE

    This just occurred to me: rather than trying a dictionary match, it might be better to store the extracted values in a database table and then compare them against your customers table.
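
The comparison idea can be sketched in memory before involving a database. Below, a Set stands in for the IDs in the customers table; the upper-case storage and the case-insensitive comparison are assumptions for the example, as are all names. In a real setup the same check would more likely be a join or an IN query against the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class KnownCustomerMatcher {
    // Keeps only the extracted candidates that exist among the known IDs.
    // knownIds is a hypothetical stand-in for the customers table, assumed upper case.
    static List<String> keepKnown(List<String> extracted, Set<String> knownIds) {
        List<String> confirmed = new ArrayList<>();
        for (String candidate : extracted) {
            if (knownIds.contains(candidate.toUpperCase(Locale.ROOT))) {
                confirmed.add(candidate);
            }
        }
        return confirmed;
    }

    public static void main(String[] args) {
        Set<String> knownIds = Set.of("AB12CD34", "Z9Y8X7W6");                    // made-up customer IDs
        List<String> extracted = List.of("ab12cd34", "20240101", "Z9Y8X7W6");     // made-up regex matches
        System.out.println(keepKnown(extracted, knownIds)); // [ab12cd34, Z9Y8X7W6]
    }
}
```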