Tags: python, for-loop, dry, pdf-parsing, pdfplumber

How to avoid duplication in Python PDF parsing code for mismatching table structures?


I have over 100 PDFs of match reports that I want to scrape into dataframes so I can work with the data afterwards. The problem is that those PDFs don't always have the same structure, and pdfplumber gives me tables whose rows do not all have the same length, which makes it almost impossible not to repeat the code several times for each type of row. I'd like to find an approach that makes my code cleaner, easier to read, and easier to debug.

This is the kind of table I'm reading for each PDF: [image: an extract from one of the PDFs]

I need to get data from both columns for every PDF.

This is the code I use to extract the tables from the PDFs:

import os

import pdfplumber

directory = os.fsencode('')  # folder that contains the PDFs
matchs_raw = {}
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    # skip anything that is not a PDF
    if not filename.endswith('.pdf'):
        continue
    matchs_raw[filename] = []

    with pdfplumber.open(f'\\{filename}') as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                # every row of every table becomes one element of the list
                matchs_raw[filename].extend(table)

It correctly stores all the data I want: matchs_raw has one key per PDF, and the value for that key is a list in which each element is one row from any of that PDF's tables.

I then extract the relevant data from matchs_raw and store it in a pandas DataFrame. I've managed to do it with code that looks like this:

import re

import numpy as np
from tqdm import tqdm

columns = ['file_name','minute','role','numero','nom','recevant_ou_visiteur','equipe',
'description','score_1_ponctuel','score_2_ponctuel']

data = {col: [] for col in columns}

for match in tqdm(matchs_raw):
    if matchs_raw[match][0][0]== "Organisateur":
        for l in range(len(matchs_raw[match])):
            if matchs_raw[match][l][0]=="Déroulé du Match" or matchs_raw[match][l][0]=="DérouléduMatch":
                starting_row = l+3
                break
        for j in matchs_raw[match][starting_row:]:
            
            if len(j)!=13:
                ## LEFT COLUMN
                
                data['file_name'].append(matchs_raw[match][0][20])
                
                try:
                    data['minute'].append(j[0])
                except:
                    data['minute'].append(np.nan)
                try:
                    data['role'].append(re.findall("JR|JV|OR|OV",j[3])[0][0])  
                except:
                    data['role'].append(np.nan)
                
                try:
                    if re.findall("JR|JV|OR|OV",j[3])[0][0] == "J":
                        try:
                            data['numero'].append(re.findall(r"N[^\x00-\x7F]+\d*",j[3])[0])
                        except:
                            data['numero'].append(np.nan)
                        try:
                            data['nom'].append(re.findall(r"N[^\x00-\x7F]+\d*(\D+)",j[3])[0])
                        except:
                            data['nom'].append(np.nan)
                        try:
                            data['description'].append(re.findall("(.+?)(JR|JV|OR|OV)N[^\x00-\x7F]",j[3])[0][0])
                        except:
                            data['description'].append(np.nan)
                    
                    else:
                        try:
                            data['numero'].append("Officiel")
                        except:
                            data['numero'].append(np.nan)
                        try:
                            data['nom'].append(re.findall("^(.+?)(OV|OR)(.+)",j[3])[0][2].strip())
                        except:
                            data['nom'].append(np.nan)
                        try:
                            data['description'].append(re.findall("^(.+?)(OV|OR)",j[3])[0][0].strip())
                        except:
                            data['description'].append(np.nan)

And it goes on like this for a while, for this one case alone (type == "Organisateur", len(j) != 13, left column). I have to do the same thing for len(j) == 13, for both the left and right columns, and again for another type of PDF that has 3 different row-length cases. The indexes into j are not related to each other in any consistent way: for instance, the field used in data['minute'].append(j[0]) and the one used in data['role'].append(re.findall("JR|JV|OR|OV",j[3])[0][0]) are not always 3 positions apart.

Do you have any suggestions on how I can avoid repeating all those try/except blocks for each case? Any help would be much appreciated.

Thank you!


Solution

  • It looks like you're parsing these PDF files: https://www.ffhandball.fr/api/s3/fdm/O/A/C/P/OACPSGG.pdf

    It seems like you could use the headers to identify the columns:

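    # `page` is the pdfplumber page that contains the table
    # each column starts with a "Temps Score Action" header: locate every "Temps"
    temps = page.search(r'Temps (?=Score Action)')

    # left edge of each column, plus the right edge of the page
    vlines = sorted(set(t['x0'] for t in temps)) + [ page.bbox[-2] ]

    # draw the boundaries on a 300 dpi render of the page to check them
    im = page.to_image(300)
    im.reset().draw_vlines(vlines, stroke_width=10, stroke='black')

    im.save('tbl.png')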
    


    You can then .crop() out each column.

    You can then crop each column again at the first occurrence of its header (Temps, Score, Action), so that everything above the table is discarded:

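    import itertools  # itertools.pairwise needs Python 3.10+
    import polars as pl

    for left, right in itertools.pairwise(vlines):
        # full-height strip between two neighbouring column boundaries
        crop = page.crop((left, 0, right, page.bbox[-1]))
        left, top, right, bottom = crop.bbox

        # crop again so the strip starts at its first header row
        top = crop.search('Temps Score Action')[0]['top']
        crop = crop.crop((left, top, right, bottom))

        print(pl.DataFrame(crop.extract_table(), orient='row'))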
    
    shape: (64, 4)
    ┌──────────┬──────────┬──────────┬───────────────────────────────────┐
    │ column_0 ┆ column_1 ┆ column_2 ┆ column_3                          │
    │ ---      ┆ ---      ┆ ---      ┆ ---                               │
    │ str      ┆ str      ┆ str      ┆ str                               │
    ╞══════════╪══════════╪══════════╪═══════════════════════════════════╡
    │ Temps    ┆ Score    ┆ Action   ┆ null                              │
    │ 00:57    ┆ 01 - 00  ┆          ┆ But JR N°44 DEMBELE sitha lauree… │
    │ 02:24    ┆ 01 - 00  ┆          ┆ Tir JR N°28 BALLUREAU lea         │
    │ 02:29    ┆ 01 - 00  ┆          ┆ Arrêt JV N°12 SCHAMBACHER laura   │
    │ …        ┆ …        ┆ …        ┆ …                                 │
    │ 25:23    ┆ 16 - 07  ┆          ┆ Arrêt JR N°16 PORTES laura        │
    │ 25:32    ┆ 16 - 07  ┆          ┆ Tir JR N°15 AUGUSTINE anne-emman… │
    │ 25:35    ┆ 16 - 07  ┆          ┆ 2MN JR N°2 JACQUES emma           │
    │          ┆ null     ┆ null     ┆ null                              │
    └──────────┴──────────┴──────────┴───────────────────────────────────┘
    shape: (67, 4)
    ┌──────────┬──────────┬──────────┬───────────────────────────────────┐
    │ column_0 ┆ column_1 ┆ column_2 ┆ column_3                          │
    │ ---      ┆ ---      ┆ ---      ┆ ---                               │
    │ str      ┆ str      ┆ str      ┆ str                               │
    ╞══════════╪══════════╪══════════╪═══════════════════════════════════╡
    │ Temps    ┆ Score    ┆ Action   ┆ null                              │
    │ 25:50    ┆ 16 - 08  ┆          ┆ But JV N°10 SAID AHMED anais      │
    │ 26:28    ┆ 16 - 08  ┆          ┆ Tir JR N°47 DEMBELE mahoua-audre… │
    │ 26:30    ┆ 16 - 08  ┆          ┆ Arrêt JV N°61 NAILI yousra        │
    │ …        ┆ …        ┆ …        ┆ …                                 │
    │ 49:05    ┆ 28 - 17  ┆          ┆ Tir JV N°11 BROUTIN auriane       │
    │ 49:09    ┆ 28 - 17  ┆          ┆ Arrêt JR N°16 PORTES laura        │
    │ 49:41    ┆ 28 - 17  ┆          ┆ Tir JR N°70 LE BLEVEC julie       │
    │ 51:00    ┆ 28 - 18  ┆          ┆ But JV N°9 DIEYE eloise           │
    └──────────┴──────────┴──────────┴───────────────────────────────────┘
    

    You could refine it further, but this should allow you to stack all the columns together to simplify your parsing.
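
Once every column yields rows of the same shape, the per-field try/except blocks from the question can probably be replaced by a single regular expression with named groups. The following is only a sketch under some assumptions: the names `ACTION_RE`, `FIELDS`, `parse_action` and `stack_tables` are made up for illustration, the text is assumed to read "description, role code, optional N° number, name" (with or without spaces), the layout for officials is inferred from the regexes in the question, and unlike the question's code `numero` keeps only the digits.

```python
import re

# One pattern for every action line instead of one try/except per field.
# Assumption: "<description> <JR|JV|OR|OV> [N°<number>] <name>", spaces optional.
ACTION_RE = re.compile(
    r"^(?P<description>.+?)\s*"
    r"(?P<role>JR|JV|OR|OV)\s*"
    r"(?:N[^\x00-\x7F]+\s*(?P<numero>\d+)\s*)?"
    r"(?P<nom>.+)?$"
)

FIELDS = ("description", "role", "recevant_ou_visiteur", "numero", "nom")


def parse_action(text):
    """Return a dict with one entry per field; unparsed fields stay None."""
    row = dict.fromkeys(FIELDS)
    match = ACTION_RE.match(text or "")
    if match is None:
        return row
    code = match["role"]
    row["description"] = match["description"].strip()
    row["role"] = code[0]                  # "J" or "O", as in the question
    row["recevant_ou_visiteur"] = code[1]  # "R" or "V" (assumed meaning)
    row["numero"] = match["numero"] if code[0] == "J" else "Officiel"
    row["nom"] = match["nom"].strip() if match["nom"] else None
    return row


def stack_tables(tables):
    """Concatenate the per-column tables, dropping header and blank rows."""
    rows = []
    for table in tables:
        for temps, score, *rest in table:
            if temps in (None, "", "Temps"):
                continue
            text = rest[-1] if rest else None  # the action text is the last cell
            rows.append({"minute": temps, "score": score, **parse_action(text)})
    return rows


# Rows copied from the output shown above
left = [
    ["Temps", "Score", "Action", None],
    ["02:24", "01 - 00", "", "Tir JR N°28 BALLUREAU lea"],
    ["", None, None, None],
]
right = [
    ["Temps", "Score", "Action", None],
    ["25:50", "16 - 08", "", "But JV N°10 SAID AHMED anais"],
]
rows = stack_tables([left, right])
```

Here `left` and `right` stand in for the lists returned by `crop.extract_table()` for each column. The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`, and a line that does not match the pattern simply produces a row of `None` values, so no exception handling is needed per field.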