python regex python-re

Extract only words using regex


I want to extract only words. A word should not contain any digit or special character attached to it, as in (64-bit), WebView2, or x86_64. My current regex correctly ignores WebView2 and x86_64, but for (64-bit) it returns bit, which I don't want. That token should be excluded entirely because it contains digits along with the -, (, and ) characters.

I have this input data:

Brave
Google Chrome
Microsoft Edge WebView2 Runtime
Robo 3T 1.4.4
WinRAR 7.01 (64-bit)
Python 3.12.3 Core Interpreter (64-bit)

and this regex:

\b[a-zA-Z]+\b

For the last line, the above regex returns this result:

['Python', 'Core', 'Interpreter', 'bit']

instead of the expected:

['Python', 'Core', 'Interpreter']

Solution

  • IIUC, you don't need a regex. You can split the text on whitespace and keep only the tokens for which str.isalpha() is true:

    txt = 'Python 3.12.3 Core Interpreter (64-bit)'
    
    out = [s for s in txt.split() if s.isalpha()]
    

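    If you really want to use a regex, be aware that \b matches between - and b, because - is not a word character, so bit counts as a standalone word. To avoid this, require whitespace or a string edge on both sides of the word: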

    import re
    
    out = re.findall(r'(?:^|\s)([a-zA-Z]+)(?=\s|$)',
                     'Python 3.12.3 Core Interpreter (64-bit)')
    

    Output:

    ['Python', 'Core', 'Interpreter']
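
To check both approaches against the full input from the question, here is a minimal sketch. The helper names `words_regex` and `words_split` are illustrative and not part of the original answer. The pattern is a lookaround variant of the regex above that avoids the capture group, and the `isascii()` check is an added assumption so that the split-based version accepts only the same letters as `[a-zA-Z]` (plain `str.isalpha()` also accepts non-ASCII letters such as `é`).

```python
import re

lines = [
    "Brave",
    "Google Chrome",
    "Microsoft Edge WebView2 Runtime",
    "Robo 3T 1.4.4",
    "WinRAR 7.01 (64-bit)",
    "Python 3.12.3 Core Interpreter (64-bit)",
]

# A run of ASCII letters that is not preceded and not followed by a
# non-whitespace character, i.e. a whole whitespace-delimited token.
WORD = re.compile(r"(?<!\S)[a-zA-Z]+(?!\S)")


def words_regex(text):
    """Return the purely alphabetic tokens of text using the regex."""
    return WORD.findall(text)


def words_split(text):
    """Return the same tokens using split() and string methods only."""
    return [tok for tok in text.split() if tok.isascii() and tok.isalpha()]


for line in lines:
    print(words_regex(line))
# ['Brave']
# ['Google', 'Chrome']
# ['Microsoft', 'Edge', 'Runtime']
# ['Robo']
# ['WinRAR']
# ['Python', 'Core', 'Interpreter']
```

Both helpers should return the same lists for this input. Tokens such as 3T, WebView2, and (64-bit) are dropped as a whole instead of contributing a fragment.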