Recently I rewrote my program to find English words that are made of chemical symbol abbreviations, for example "HErSHeY". I came up with this dynamic regex:
grep -Pi "^($(paste -s -d'|' element_symbols.txt))+$" /usr/share/dict/words
The command substitution with paste expands to an alternation like (H|He|Li| ... ). element_symbols.txt is a file starting with
H
He
Li
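To make the expansion concrete, here is a minimal Python sketch of the pattern that the paste substitution builds. The three-symbol list is a hypothetical stand-in for element_symbols.txt, and the standard-library re module stands in for grep -P, so treat it as an illustration rather than an exact equivalent of the shell command.

```python
import re

# Hypothetical three-line sample standing in for element_symbols.txt.
symbols = ["H", "He", "Li"]

# Rough equivalent of: paste -s -d'|' element_symbols.txt
alternation = "|".join(symbols)

# Rough equivalent of the pattern handed to grep -Pi.
pattern = re.compile(f"^({alternation})+$", re.IGNORECASE)

print(pattern.pattern)
print(bool(pattern.match("heli")))  # splits as He + Li
print(bool(pattern.match("help")))  # "lp" cannot be covered by these symbols
```

Like grep, this only reports whether a word matches; it does not say which symbols were used.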
Sample words list (longest word: formaldehydesulphoxylate, 24 characters)
The -i option makes the search case-insensitive, so "Hershey" is in the output. But is there a way to apply the capitalization of the symbols in the regex, so that the output reads "HErSHeY"? This would require replacing the letters of the words from the words file, so maybe sed is a better fit.
You can also implement this in Python using the third-party regex module, which has the nice additional feature of keeping every match of a repeated group. This allows the regex to be just a single repeated capture group:
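import regex  # third-party module (pip install regex), not the stdlib re

with open("element_symbols.txt") as f:
    symbols = f.read().split()
with open("/usr/share/dict/words") as f:
    words = f.read().split()

# \L<symbols> is a named list: it matches any string in the symbols keyword argument
pattern = regex.compile(r'(\L<symbols>)+', symbols=symbols, flags=regex.I)
for word in words:
    match = pattern.fullmatch(word)
    if match is not None:
        # captures(1) returns every repetition of group 1, not just the last one
        res = ''.join(s.title() for s in match.captures(1))
        print(res)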
Output on Ubuntu 22.04 with the american-english words list:
Ac
AcLu
AcTh
Al
Am
Ar
AsCII
AsCIIS
AsPCa
AtP
AtV
Ac
AcCRa
AcHeBe
AcHErNAr
AcHEsON
AcOSTa
...
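The captures feature matters because the standard-library re module keeps only the last repetition of a repeated group. If installing the regex module is not an option, one possible fallback is to do the split by hand. The sketch below is a swapped-in technique, not part of the approach above: it uses a hypothetical subset of the symbols, assumes every symbol is one or two letters long, and returns the first split it finds. The name split_into_symbols is made up for this example.

```python
import re
from functools import lru_cache

# Hypothetical subset of element_symbols.txt, lowercased for lookup.
SYMBOLS = ("h", "he", "er", "s", "y", "li")

# With the stdlib re module, group(1) holds only the final repetition,
# so the individual symbols cannot be recovered from the match.
stdlib_match = re.fullmatch("(" + "|".join(SYMBOLS) + ")+", "Hershey", re.IGNORECASE)
print(stdlib_match.group(1))  # y


def split_into_symbols(word):
    """Return word re-capitalized as element symbols, or None if no split exists."""
    lower = word.lower()

    @lru_cache(maxsize=None)
    def solve(i):
        # Reaching the end of the word means everything before it was covered.
        if i == len(lower):
            return ""
        # Assumption: symbols are one or two letters long.
        for length in (1, 2):
            piece = lower[i:i + length]
            if len(piece) == length and piece in SYMBOLS:
                rest = solve(i + length)
                if rest is not None:
                    return piece.title() + rest
        return None

    return solve(0)


print(split_into_symbols("Hershey"))  # HErSHeY
print(split_into_symbols("help"))     # None
```

Memoizing solve keeps the search linear in the word length, even for words where many prefixes split but the full word does not.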