Tags: regex, sed, pcre, case-insensitive

Regex case-insensitive search, but have output match the case in the regex


Recently I rewrote my program to find English words that are made of chemical symbol abbreviations, for example "HErSHeY". I came up with this dynamic regex:

grep -Pi "^($(paste -s -d'|' element_symbols.txt))+$" /usr/share/dict/words

The paste command joins the lines of element_symbols.txt with |, so the group in the regex expands to something like (H|He|Li| ... ). The file element_symbols.txt starts with

H
He
Li

Sample words list (longest word: formaldehydesulphoxylate - 24 characters)

The -i makes the search case-insensitive, so "Hershey" is in the output, but is there a way to preserve the capitalizations of the letters within the regex, so the output is like "HErSHeY"? This would require replacing the letters in the words file, so maybe something with sed instead.


Solution

  • You can also implement this in Python using the third-party regex module (not the standard-library re), which has the nice additional feature of keeping every match of a repeated group instead of only the last one. This allows the regex to be just a repeated capture group:

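    import regex  # third-party module (pip install regex), not the stdlib re

    with open("element_symbols.txt") as f:
        symbols = f.read().split()
    with open("/usr/share/dict/words") as f:
        words = f.read().split()

    # \L<symbols> is a named list: it matches any one of the strings in `symbols`
    pattern = regex.compile(r'(\L<symbols>)+', symbols=symbols, flags=regex.I)
    for word in words:
        match = pattern.fullmatch(word)
        if match is not None:
            # captures(1) returns every repetition of group 1, not just the last;
            # title() restores the symbol capitalization, e.g. "he" -> "He"
            res = ''.join(s.title() for s in match.captures(1))
            print(res)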
    

    Output on Ubuntu 22.04 with american-english words:

    Ac
    AcLu
    AcTh
    Al
    Am
    Ar
    AsCII
    AsCIIS
    AsPCa
    AtP
    AtV
    Ac
    AcCRa
    AcHeBe
    AcHErNAr
    AcHEsON
    AcOSTa
    ...
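
If installing the regex module is not an option, the same re-casing can be sketched with only the standard library. The stdlib re module keeps just the last repetition of a repeated group, so this sketch swaps the regex for a small memoized backtracking search. The function name `spell_with_symbols` and the short symbol list are hypothetical, chosen only for illustration; in practice the symbols would be read from element_symbols.txt as above.

```python
from functools import lru_cache


def spell_with_symbols(word, symbols):
    """Return `word` re-cased as a sequence of symbols, or None if impossible.

    Hypothetical helper for illustration; not part of the original answer.
    """
    if not word:
        return None  # mirror the regex's `+`: at least one symbol is required

    # Map each lowercased symbol back to its canonical capitalization.
    by_lower = {s.lower(): s for s in symbols}
    max_len = max(map(len, by_lower), default=0)
    lowered = word.lower()

    @lru_cache(maxsize=None)
    def solve(i):
        # Spelling of lowered[i:] as symbols, or None if there is none.
        if i == len(lowered):
            return ""
        longest = min(max_len, len(lowered) - i)
        for n in range(1, longest + 1):  # try shorter symbols first
            sym = by_lower.get(lowered[i:i + n])
            if sym is not None:
                rest = solve(i + n)
                if rest is not None:
                    return sym + rest
        return None

    return solve(0)


# Assumed subset of element_symbols.txt, just enough for the demo.
demo_symbols = ["H", "He", "Li", "Er", "S", "Y"]
print(spell_with_symbols("Hershey", demo_symbols))  # → HErSHeY
print(spell_with_symbols("hello", demo_symbols))    # → None
```

When a word can be split in more than one way, this sketch returns the first split found by trying shorter symbols first, which may differ from the split the regex version reports.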