
Extracting email addresses from messy text in OpenRefine


I am trying to extract just the email addresses from a text column in OpenRefine. Some cells contain only the email, but others contain the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"].

value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)

Any help is much appreciated.


Solution

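  • The n is captured because of the .* before the capturing group. It is greedy: it first matches the whole line (any zero or more characters other than line break characters) and then backtracks one character at a time. Backtracking stops as soon as the rest of the pattern can match, so the only character that lands in Group 1 in front of @ is the one right before it.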
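The backtracking can be reproduced outside OpenRefine. Below is a minimal sketch in plain Python, assuming the pattern is tested against the whole cell value and that Python's re engine treats these basic constructs the same way as the engine behind GREL. The sample string is the one from the question.

```python
import re

# The pattern from the question, with the greedy .* in front of the group.
GREEDY = re.compile(r'.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*')

sample = 'john doe <john@doe.com>'

# fullmatch requires the pattern to cover the entire string.
m = GREEDY.fullmatch(sample)

# The leading .* gives back as little as possible while backtracking,
# so only the single character right before @ ends up in Group 1.
captured = m.group(1)
print(captured)  # n@doe.com
```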

    If partial matches are acceptable, get rid of the .* and use

    /[^<\s]+@[^\s>]+/
    


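    Details

      • [^<\s]+ - one or more characters other than < and whitespace
      • @ - a literal @
      • [^\s>]+ - one or more characters other than whitespace and >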

    Python/Jython implementation:

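    import re
    res = ''  # returned when the cell contains no email
    # Find the first email-like substring anywhere in the cell
    m = re.search(r'[^<\s]+@[^\s>]+', value)
    if m:
        res = m.group(0)  # the whole matched address
    return res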
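To try the same logic outside OpenRefine, the snippet can be wrapped in a function and run on a few sample cells. This is only a sketch: extract_email and the sample values are made up for illustration, whereas OpenRefine itself supplies value and expects the bare return statement.

```python
import re

# Same partial-match pattern as in the answer.
EMAIL = re.compile(r'[^<\s]+@[^\s>]+')

def extract_email(value):
    """Return the first email-like substring of value, or '' if there is none."""
    m = EMAIL.search(value)
    return m.group(0) if m else ''

# Hypothetical cell values covering both formats from the question.
cells = ['john doe <john@doe.com>', 'john@doe.com', 'no address here']
results = [extract_email(c) for c in cells]
print(results)  # ['john@doe.com', 'john@doe.com', '']
```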
    

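    There are other ways to match these strings. In case you need a full-string match, use .*<([^<]+@[^>]+)>.*, where the leading .* will not gobble the name part of the address since it has to stop before the obligatory <. Note that this pattern requires the angle brackets, so it does not match cells that contain only the bare email.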
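The full-string variant can be checked the same way. This is again a plain-Python sketch: extract_bracketed is a hypothetical helper name, and re.fullmatch stands in for a whole-string match. The second call uses a cell without angle brackets to show what the pattern does with that format.

```python
import re

# Full-string pattern: the address must be wrapped in < and >.
BRACKETED = re.compile(r'.*<([^<]+@[^>]+)>.*')

def extract_bracketed(value):
    """Return the address inside <...>, or '' if the cell does not match."""
    m = BRACKETED.fullmatch(value)
    return m.group(1) if m else ''

with_brackets = extract_bracketed('john doe <john@doe.com>')
bare = extract_bracketed('john@doe.com')
print(repr(with_brackets))  # 'john@doe.com'
print(repr(bare))           # ''
```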