pythonpandasdataframeraw-dataunstructured-data

Create a pandas DataFrame from unstructured data (end-of-season football tables)


basically I have a lot of data that represents the final standings table for each La Liga season starting from 1950. The structure of the table remains relatively consistent across different seasons. It looks something like this

1.FC Barcelona                  38  28  9  1  99-29  93       Champions
 2.Atlético de Madrid            38  23 10  5  58-22  79
 3.Real Madrid CF                38  22 10  6  94-44  76  [C]
 4.Valencia CF                   38  22  7  9  65-38  73
 5.Villarreal CF                 38  18  7 13  57-50  61
 6.Real Betis Balompié (Sevilla) 38  18  6 14  60-61  60
 7.Sevilla FC                    38  17  7 14  49-58  58
 8.Getafe CF                     38  15 10 13  42-33  55  [P]
 9.SD Eibar                      38  14  9 15  44-50  51       [2 2 0 0 8-2 6]
10.Girona FC                     38  14  9 15  50-59  51  [P]  [2 0 0 2 2-8 0]
11.RCD Espanyol (Barcelona)      38  12 13 13  36-42  49       [4 2 2 0 7-5 8]
12.Real Sociedad (San Sebastián) 38  14  7 17  66-59  49       [4 1 1 2 6-7 4]
13.RC Celta (Vigo)               38  13 10 15  59-60  49       [4 1 1 2 7-8 3]
14.Deportivo Alavés (Vitoria)    38  15  2 21  40-50  47
15.Levante UD (Valencia)         38  11 13 14  44-58  46  [P]
16.Athletic de Bilbao            38  10 13 15  41-49  43       [2 1 0 1 2-1 3]
17.CD Leganés                    38  12  7 19  34-51  43       [2 1 0 1 1-2 3]**
--------------------------------------------------------**
18.RC Deportivo (La Coruña)      38   6 11 21  38-76  29       Relegated
19.UD Las Palmas                 38   5  7 26  24-74  22       Relegated
20.Málaga CF                     38   5  5 28  24-61  20       Relegated

I want to convert this data (unstructured text) into pandas Data Frame, and I am primarily interested in three columns: position, team and points. So what I hope the outcome to look would be this: -> Expected outcome

I can create the first two columns without issues because they follow a clear structure: integer -> '.' -> text. But I'm struggling extracting 'end-of-season points' from all the other numbers that are irrelevant.

Maybe someone knows how to approach this. Thanks!

Here's my logic:

import re
from tabulate import tabulate
import pandas as pd

# Raw data provided
raw_data = my_unstructured_text

# Remove lines of hyphens
raw_data = re.sub(r'-{40,}', '', raw_data)

# Remove brackets and content inside them
raw_data = re.sub(r'\[.*?\]', '', raw_data)

# Remove "Champions" and "Relegated" words
raw_data = raw_data.replace('Champions', '').replace('Relegated', '')

# Split the data into rows
rows = raw_data.strip().split('\n')

# Split each row into columns
cleaned_data = [re.split(r'\s{2,}', row.strip()) for row in rows]

# Split the first column into position and team name
cleaned_data = [col[0].split('.', 1) + col[1:] for col in cleaned_data if col]

# Define column headers
headers = ['pos', 'tm', 'played', 'a', 'b', 'c', 'gd', 'pts']

# Create a DataFrame
df = pd.DataFrame(cleaned_data, columns=headers)

print(tabulate(df, headers='keys', tablefmt='psql'))

Solution

  • You can findall the relevant columns (using this regex), then make a DataFrame :

    import re
    
    pat = (
        r"(\d+)\.(.+?)"  # get pos & team
        r"\s+(?=\d).+?"  # skip middle columns
        r"(\d+)(?=$|\s+(?:\[|[A-Z]))"  # get points
    )
    
    df = pd.DataFrame(
        re.findall(pat, my_unstructured_text, flags=re.M),
        columns=["pos", "tm", "pts"],
    )
    

    Output :

    print(df.to_string(index=False))
    
    pos                            tm pts
      1                  FC Barcelona  93
      2            Atlético de Madrid  79
      3                Real Madrid CF  76
      4                   Valencia CF  73
      5                 Villarreal CF  61
      6 Real Betis Balompié (Sevilla)  60
      7                    Sevilla FC  58
      8                     Getafe CF  55
      9                      SD Eibar  51
     10                     Girona FC  51
     11      RCD Espanyol (Barcelona)  49
     12 Real Sociedad (San Sebastián)  49
     13               RC Celta (Vigo)  49
     14    Deportivo Alavés (Vitoria)  47
     15         Levante UD (Valencia)  46
     16            Athletic de Bilbao  43
     17                    CD Leganés  43
     18      RC Deportivo (La Coruña)  29
     19                 UD Las Palmas  22
     20                     Málaga CF  20