python regex pdf-extraction

How to ignore an unwanted pattern in a regex


I have the following Python code:

import re
from io import BytesIO

import pdfplumber
import requests

test_case = {
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0514/2020051400555.pdf': 59,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0529/2020052902118.pdf': 55,
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0618/2020061800366.pdf': 47,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0630/2020063002674.pdf': 30,
}

for url, page in test_case.items():
    rq = requests.get(url)
    pdf = pdfplumber.load(BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    txt = re.sub("([^\x00-\x7F])+", "", txt)  # strip non-ASCII characters (e.g. Chinese text)
    pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    try:
        auditor = re.search(pattern, txt, flags=re.MULTILINE).group('auditor').strip()
        print(repr(auditor))
    except AttributeError:
        print(txt)
        print('============')
        print(url)

It produces the following output:

'ShineWing'
'ShineWing'
'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
'Hong Kong Financial Reporting Standards issued by the Hong Kong Institute of'

The desired result is:

'ShineWing'
'ShineWing'
'Ernst & Young'
'Elite Partners CPA Limited'

I tried:

pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This pattern captures the last two cases but not the first two.

pattern = r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This produces the desired result, but ^(?!Hong|Kong) is risky: it may reject legitimate auditor names that begin with "Hong" or "Kong" in the future, so it is not a good candidate.

$(?!Institute) seems more general and appropriate, but I have no idea why it fails to match in the first two cases. It would be great if there were a way to ignore matches that contain "issued by the Hong Kong Institute of".

Any suggestions would be appreciated. Thank you.


Solution

  • pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    

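    This works. The negative lookahead (?!.*Institute) is checked at the position where the auditor name would start. Because . does not match a newline by default, it rejects any candidate that has "Institute" later on the same line. The \s* after the capture group can still span a line break, so the name and "Certified Public Accountants" may sit on separate lines.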
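
    As a rough sanity check, the pattern above can be run against made-up text. The sample strings and the find_auditor helper below are illustrative assumptions, not content taken from the actual PDFs. The sketch also runs the same pattern with the lookahead removed, to show what the lookahead is filtering out.

```python
import re

# Pattern from the answer above. The negative lookahead (?!.*Institute) is
# evaluated at the position where the auditor name would begin.
PATTERN = (
    r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*'
    r'((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
)

# Same pattern with the lookahead stripped out, for comparison only.
PATTERN_NO_LOOKAHEAD = PATTERN.replace(r'(?!.*Institute)', '')


def find_auditor(text, pattern=PATTERN):
    """Hypothetical helper: return the captured auditor name, or None."""
    match = re.search(pattern, text)
    return match.group('auditor').strip() if match else None


# Made-up sample: a boilerplate sentence ends in "Institute of" right before
# a line starting with "Certified Public Accountants"; the real signature
# block comes later.
SAMPLE_WRAPPED = (
    "Basis for opinion\n"
    "We conducted our audit in accordance with Hong Kong Standards on "
    "Auditing issued by the Hong Kong Institute of\n"
    "Certified Public Accountants.\n"
    "Ernst & Young\n"
    "Certified Public Accountants\n"
    "Hong Kong\n"
)

# Made-up sample: name and designation on the same line.
SAMPLE_SAME_LINE = (
    "Report ends here.\n"
    "ShineWing Certified Public Accountants\n"
    "Hong Kong\n"
)

print(repr(find_auditor(SAMPLE_WRAPPED)))
print(repr(find_auditor(SAMPLE_SAME_LINE)))
# Without the lookahead, the boilerplate sentence is captured instead.
print(repr(find_auditor(SAMPLE_WRAPPED, PATTERN_NO_LOOKAHEAD)))
```

    On these samples the first two calls should return the firm names, while the last call returns the boilerplate sentence ending in "Institute of". Note that the lookahead filters on the word "Institute" alone, so an auditor whose own name contains that word would also be skipped.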