html grep

grep regex to parse HTML for email addresses


I've got just over 1000 HTML pages to grep for email addresses. My issue is that the command below also returns JavaScript strings such as @typeface, which contain an @, mixed in with the email addresses I'm trying to get. Are there any recommendations for ignoring these unwanted strings and getting only the email addresses?

grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' temp9.txt | sort | uniq -i 
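For reference, a minimal reproduction of the problem. The sample file and its contents are made up for illustration and are not taken from the real pages; the pattern is the one from the command above.

```shell
# Hypothetical sample input: one address and one CSS-style @ token.
sample=$(mktemp)
printf '%s\n' '<a href="mailto:jane.doe@example.com">mail</a>' \
              '<style>@typeface { }</style>' > "$sample"

# The original pattern: * also matches zero characters before the @,
# so a bare @typeface is reported next to the real address.
matches=$(grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' "$sample")
printf '%s\n' "$matches"
rm -f "$sample"
```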

Solution

  • FWIW I'd use a regexp like

    grep -Eo '[[:alnum:]._%+-]{2,}@[[:alnum:].-]{2,}\.[[:alpha:]]{2,}'
    

    to get a good-enough-for-me list of email addresses from a file. If you need something perfect then you should find some other solution.

    You should not escape regexp metachars inside bracket expressions: a) it's not necessary, b) POSIX treats a backslash inside a bracket expression as a literal char, so the set then also matches \, and c) some greps interpret the escape instead, which can turn the char into a metachar. So [[:alnum:]+\.\_\-] should be [[:alnum:]+._-].

    You also want at least 1 of those chars on each side of the @. The * repetition metachar allows zero of them, which is why a bare @typeface matches, so use + or {2,} instead of * (and add -E to the grep args). Personally, I test for 2 as I sometimes find garbage if I test for just 1.

    Also make sure to add a regexp segment to match at least a couple of letters after a . at the end so you ensure the string ends in .com or .uk or similar TLD.

    See also https://www.regular-expressions.info/email.html but most regexps described there are PCREs so you could only use them in grep if it's GNU grep and you use grep -P instead of grep -E.
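The points in the answer above can be sketched as one pipeline over a directory of pages. This is only a sketch under assumptions: the temporary directory and the sample contents are made up, -h (omit file names, supported by GNU and BSD grep) is an addition for the multi-file case, and lowercasing with tr before sort -u stands in for the case-insensitive de-duplication the question does with uniq -i.

```shell
# Hypothetical sample pages standing in for the ~1000 real HTML files.
pages=$(mktemp -d)
printf '%s\n' '<a href="mailto:Jane.Doe@Example.com">mail</a>' > "$pages/a.html"
printf '%s\n' '<style>@typeface { }</style> jane.doe@example.com' \
              'bob+news@mail.example.co.uk x@y' > "$pages/b.html"

# -E: extended regexps, -o: print only the match, -h: omit file names.
# Unescaped bracket expressions, {2,} on each side of the @, and a
# trailing dot plus at least two letters for the TLD.
# tr folds case so that sort -u collapses duplicates differing only in case.
emails=$(grep -Eoh '[[:alnum:]._%+-]{2,}@[[:alnum:].-]{2,}\.[[:alpha:]]{2,}' "$pages"/*.html |
  tr '[:upper:]' '[:lower:]' |
  sort -u)
printf '%s\n' "$emails"
rm -rf "$pages"
```

With this sample input, @typeface and x@y are both skipped: the first has nothing before the @, and the second has too few chars on each side and no TLD.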