python-3.x regex text pattern-matching

Python regular expression exclude string


I want to exclude the text after "(trading as". My regular expression so far is shown below. I tried a negative lookahead, (?!\s\(trading as), but it isn't working as expected. Any help is appreciated.

import re
def extract_company_name(title):
    match = re.findall(r'\b[A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*(?:\b|(?<=\)))', title)
    return ','.join(match) if match else None
  
text = """TEST LIMITED (trading as FOO Limited) (in relation), TEST (2005) LTD, WINDING LIMITED (in liquidation)"""
print(extract_company_name(text))

Text : TEST LIMITED (trading as FOO Limited) (in relation), TEST (2005) LTD, WINDING LIMITED (in liquidation)

Expected Output : TEST LIMITED, TEST (2005) LTD, WINDING LIMITED


Solution

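  • To exclude the text after "(trading as", you can use the usual regex trick: match what you do not need, and capture what you need.

    You also need to adapt your code for that trick to work. Once the pattern contains a capturing group, re.findall returns the contents of that group rather than the whole match, so every match of the unwanted alternative comes back as an empty string and has to be filtered out.

    So, the code will look like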

    import re
    def extract_company_name(title):
        match = re.findall(r'\(trading as.*|\b([A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*)(?:\b|(?<=\)))', title)
        return ','.join(x for x in match if x) if match else None
      
    text = """TEST LIMITED (trading as FOO Limited) (in relation)"""
    print(extract_company_name(text))
    

    See the online demo
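
    The "match what you do not need, capture what you need" trick depends on how re.findall treats a single capture group. The following is a minimal sketch of that behaviour only; the pattern and the sample string are made up for illustration and are much simpler than the company-name pattern above.

```python
import re

# Hypothetical miniature pattern (not the one from the question):
# the left branch matches a parenthesised note we want to discard,
# the right branch captures a run of upper-case letters.
pattern = re.compile(r'\([^()]*\)|\b([A-Z]+)\b')

sample = "ACME (formerly BETA) GAMMA"

# With exactly one capture group, findall returns that group for every match,
# so matches made by the left branch appear as empty strings.
raw = pattern.findall(sample)

# Dropping the empty strings leaves only the captured names.
kept = [name for name in raw if name]

print(raw)   # ['ACME', '', 'GAMMA']
print(kept)  # ['ACME', 'GAMMA']
```

    "BETA" is not reported because the left branch consumes the whole parenthesised note before the right branch gets a chance to match inside it.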

    Changes:

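    The first alternative above, \(trading as.*, discards everything from "(trading as" to the end of the line. To discard only the text between "(trading as" and the next ")", and keep matching after it, adjust the first alternative and use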

    \(trading as[^()]*\)|\b([A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*)(?:\b|(?<=\)))
    

    See the regex demo.
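
    For completeness, here is a sketch of the function using the adjusted pattern, run against the full text from the question. Two details are assumptions rather than part of the answer: the names are joined with ", " to match the expected output in the question, and None is returned when nothing is captured. The name COMPANY_RE is also introduced here for illustration.

```python
import re

# Left branch: consume a whole "(trading as ...)" group and capture nothing.
# Right branch: capture an upper-case company name, optionally with "(digits)".
COMPANY_RE = re.compile(
    r'\(trading as[^()]*\)'
    r'|\b([A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*)(?:\b|(?<=\)))'
)

def extract_company_name(title):
    # findall yields '' for every "(trading as ...)" match; keep the rest.
    names = [name for name in COMPANY_RE.findall(title) if name]
    # Assumption: join with ", " as in the question's expected output.
    return ', '.join(names) if names else None

text = ("TEST LIMITED (trading as FOO Limited) (in relation), "
        "TEST (2005) LTD, WINDING LIMITED (in liquidation)")
print(extract_company_name(text))
# TEST LIMITED, TEST (2005) LTD, WINDING LIMITED
```

    Because [^()]* cannot cross a parenthesis, the discarded span stops at the first ")" after "(trading as", so the names later on the same line are still found.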