python regex

Python regex ignoring pattern


I have a list of two keywords like below:

keywords = ["Azure", "Azure cloud"]

but Python is unable to find the second keyword, "Azure cloud":

>>> keywords = ["Azure", "Azure cloud"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = "Azure and Azure cloud"
>>> r.findall(word)
['Azure', 'Azure']

I am expecting output like this: ['Azure', 'Azure', 'Azure cloud']

Any guidance or help would be highly appreciated!


Solution

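  • You can run a separate search for each keyword. In a single alternation, "Azure" is tried first and matches at every position, and findall only returns non-overlapping matches, so "Azure cloud" is never reported.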

    import itertools
    import re
    
    keywords = ["Azure", "Azure cloud"]
    patterns = [re.compile(re.escape(w), flags=re.I) for w in keywords]
    word = "Azure and Azure cloud"
    results = list(itertools.chain.from_iterable(
        r.findall(word) for r in patterns
    ))
    print(results)
    

    output:

    ['Azure', 'Azure', 'Azure cloud']
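
To see why the single combined pattern from the question cannot produce all three results, here is a minimal sketch. The helper name `combined` is made up for illustration. It shows that alternatives are tried left to right and that findall reports only non-overlapping matches, so even putting the longer keyword first yields two results rather than three.

```python
import re

keywords = ["Azure", "Azure cloud"]
word = "Azure and Azure cloud"

def combined(words):
    # Hypothetical helper: one alternation pattern, alternatives tried left to right.
    return re.compile("|".join(re.escape(w) for w in words), flags=re.I)

# Original order: "Azure" wins at every position, so "Azure cloud" never matches.
print(combined(keywords).findall(word))
# ['Azure', 'Azure']

# Longest first: the longer keyword now wins, but the text it consumed
# cannot also be reported as a separate "Azure" match.
longest_first = sorted(keywords, key=len, reverse=True)
print(combined(longest_first).findall(word))
# ['Azure', 'Azure cloud']
```

Running one pattern per keyword, as above, avoids this because each search scans the text independently.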
    

    Append

    If I had word = "Azure and azure cloud", the output would be ['Azure', 'azure', 'azure cloud'], so the second match is the lowercase "azure". If I need the exact word as it appears in the "keywords" list, which is "Azure", what modification has to be made in the code?

    The flag re.I means ignore case, so simply remove it.

    patterns = [re.compile(re.escape(w)) for w in keywords]
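
As a quick check of what dropping re.I does, the following sketch compares both settings on the sentence from the comment. The helper `search_all` is a made-up name for illustration. Note that without the flag the lowercase occurrences are not matched at all, rather than being reported in the keyword's case.

```python
import re

keywords = ["Azure", "Azure cloud"]
word = "Azure and azure cloud"

def search_all(flags=0):
    # Hypothetical helper: search each keyword separately and collect the matched text.
    results = []
    for w in keywords:
        results.extend(re.findall(re.escape(w), word, flags=flags))
    return results

print(search_all(re.I))  # ['Azure', 'azure', 'azure cloud']
print(search_all())      # ['Azure']
```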
    

    Append 2

    Sorry, my last comment was vague. I want the pattern matching to ignore case, but in the output I want the keywords in the exact case they have in the "keywords" list, not the case found in "word".

    Sorry for the misunderstanding. Try this:

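    import re
    
    keywords = ["Azure", "azure cloud"]
    # Each keyword is compiled as-is, so the pattern string equals the keyword.
    patterns = [re.compile(w, flags=re.I) for w in keywords]
    word = "Azure and azure cloud"
    results = [
        # Report the pattern string (the keyword), not the matched text.
        match_obj.re.pattern
        for r in patterns
        for match_obj in r.finditer(word)
    ]
    print(results)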
    

    output:

    ['Azure', 'Azure', 'azure cloud']
    

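    I'm not sure it is an efficient way, but it works, provided the keywords contain no regex metacharacters, since they are now compiled unescaped.
    Note that I removed re.escape because it escapes the space, so the result was: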

    ['Azure', 'Azure', 'azure\\ cloud']
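
If the keywords may contain regex metacharacters, one possible variation keeps re.escape and reports the keyword itself instead of the pattern string. This is only a sketch: the function name `find_keywords` and the "C++" example are assumptions added for illustration, not part of the original answer.

```python
import re

def find_keywords(keywords, text):
    """Return each keyword once per case-insensitive match found in text."""
    results = []
    for keyword in keywords:
        # re.escape keeps metacharacters such as "+" or "." literal.
        pattern = re.compile(re.escape(keyword), flags=re.I)
        # Emit the keyword as written in the list, not the matched text.
        results.extend(keyword for _ in pattern.finditer(text))
    return results

print(find_keywords(["Azure", "azure cloud"], "Azure and azure cloud"))
# ['Azure', 'Azure', 'azure cloud']
print(find_keywords(["C++"], "I write c++ and C++"))
# ['C++', 'C++']
```

Because the output comes from the keyword list rather than from the compiled pattern, the escaped space never shows up in the results.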