Tags: python, python-3.x, pdfplumber

How do I continue extracting data from a PDF table in Python using pdfplumber?


I'm currently working on extracting data from a table in a PDF using Python, specifically F1 lap time data, which is provided as a PDF that looks like this: (image: F1 lap time data). I'm using pdfplumber to extract the table data and then Python to process it into a series of lists of dicts holding each driver's laps and lap times, so that I can do some further work with the information.

At the moment, my code looks like this:

import pdfplumber
import re

# Predefined list of drivers
drivers_list = ["Max VERSTAPPEN", "Daniel RICCIARDO", "Nicholas LATIFI", "Lewis HAMILTON", "Lando NORRIS", "Sebastian VETTEL", "Nicholas LATIFI", "Pierre GASLY", "Sergio PEREZ", "Fernando ALONSO", "Charles LECLERC", "George RUSSELL", "Alexander ALBON", "Lance STROLL", "Kevin MAGNUSSEN", "Yuki TSUNODA", "ZHOU Guanyu", "Esteban OCON"]

# Initialize a dict for lap times
driver_lap_times = {driver: [] for driver in drivers_list}

# Define a pattern to detect the start of a lap time section
pattern = re.compile(r'(\d+)\s+(\d{1,2}:\d{2}:\d{2}|\d{1,2}:\d{2}\.\d{3})')

# Extract text from the PDF
pdf_path = "Lap Analysis - SIN.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        
        # Process text to split by driver
        lines = text.split('\n')
        current_driver = None
        current_lap_times = []
        in_lap_time_section = False

        for line in lines:
            line = line.strip()
            
            # Check if the line contains a driver's name
            driver_found = False
            for driver in drivers_list:
                if driver in line:
                    if current_driver:
                        # Save lap times for the previous driver
                        driver_lap_times[current_driver].extend(current_lap_times)

                    current_driver = driver
                    current_lap_times = []
                    driver_found = True
                    in_lap_time_section = False
                    break
            
            if driver_found:
                continue

            # Determine if the line is part of a lap time section
            if "LAP TIME" in line:
                in_lap_time_section = True
                continue 

            # Collect lap times if in a lap time section and a driver is currently identified
            if in_lap_time_section and current_driver:
                match = pattern.match(line)
                if match:
                    lap_number, lap_time = match.groups()
                    if lap_number.isdigit() or 'P' in lap_number:  # Handle 'P' (pit) laps
                        current_lap_times.append({lap_number: lap_time})

        # Save lap times for the last driver in the current page
        if current_driver:
            driver_lap_times[current_driver].extend(current_lap_times)
# Print by driver
for driver, laps in driver_lap_times.items():
    formatted_laps = ', '.join(f"{{'{lap_number}': '{lap_time}'}}" for lap in laps for lap_number, lap_time in lap.items())
    print(f"{driver}: [{formatted_laps}]")

This works in a slightly hit-and-miss way and produces almost-working output. It doesn't find every driver, but for the ones it does find it seems to get the right information. However, it stops at lap 30 for every driver. The output looks like this:

Max VERSTAPPEN: [{'1': '21:11:14'}, {'2': '2:04.389'}, {'3': '2:03.369'}, {'4': '2:03.238'}, {'5': '2:02.703'}, {'6': '2:03.027'}, {'7': '2:03.289'}, {'8': '2:23.240'}, {'9': '2:42.690'}, {'10': '2:38.596'}, {'11': '2:01.612'}, {'12': '2:00.967'}, {'13': '2:01.842'}, {'14': '2:01.558'}, {'15': '2:01.407'}, {'16': '2:01.138'}, {'17': '2:00.909'}, {'18': '2:00.807'}, {'19': '2:00.520'}, {'20': '2:00.559'}, {'21': '2:09.641'}, {'22': '2:35.288'}, {'23': '1:58.377'}, {'24': '1:58.784'}, {'25': '1:59.689'}, {'26': '2:22.695'}, {'27': '2:00.464'}, {'28': '2:13.500'}, {'29': '2:40.814'}, {'30': '2:08.953'}]
Daniel RICCIARDO: []
Nicholas LATIFI: [{'1': '21:11:11'}, {'2': '2:05.790'}, {'3': '2:04.098'}, {'4': '2:03.184'}, {'5': '2:03.366'}, {'6': '2:03.052'}, {'7': '2:03.297'}, {'8': '2:21.496'}, {'9': '2:43.215'}, {'10': '2:39.865'}, {'11': '2:04.900'}, {'12': '2:02.910'}, {'13': '2:02.938'}, {'14': '2:02.701'}, {'15': '2:02.067'}, {'16': '2:01.664'}, {'17': '2:01.690'}, {'18': '2:01.285'}, {'19': '2:01.333'}, {'20': '2:01.365'}, {'21': '2:14.128'}, {'22': '2:30.735'}, {'23': '1:59.802'}, {'24': '2:00.070'}, {'25': '1:59.714'}, {'26': '2:21.882'}, {'27': '1:59.805'}, {'28': '2:17.203'}, {'29': '2:37.487'}, {'30': '2:07.022'}]
Lando NORRIS: []

Setting aside for now the problem that it's not finding all the drivers and appending their info, how do I get it to move beyond lap 30 for the drivers it is finding? Am I missing something very obvious? Also, if anyone does have some advice on why it's only finding the first driver on each page of data, I'd be super grateful for your advice!

I'm keen to keep using pdfplumber, as it's compatible with Python 3.12 and, for the data I am successfully extracting, it maintains a high level of accuracy.


Solution

  • I improved your code; it now finds all drivers as well as all lap times.

    import pdfplumber
    import re
    from itertools import groupby, islice
    
    # Pattern: a lap number, whitespace, then a time (H:MM:SS clock time or M:SS.sss lap time)
    timing_pattern = r'(\d+)\s+(\d\d?:\d\d:\d\d|\d\d?:\d\d\.\d{3})'
    
    # Predefined list of drivers
    drivers_list = ["Max VERSTAPPEN", "Daniel RICCIARDO", "Nicholas LATIFI", "Lewis HAMILTON", "Lando NORRIS", "Sebastian VETTEL", "Pierre GASLY", "Sergio PEREZ", "Fernando ALONSO", "Charles LECLERC", "George RUSSELL", "Alexander ALBON", "Lance STROLL", "Kevin MAGNUSSEN", "Yuki TSUNODA", "ZHOU Guanyu", "Esteban OCON", "Mick SCHUMACHER", "Carlos SAINZ", "Valtteri BOTTAS"]
    
    # Initialize a dict for lap times
    driver_lap_times = {driver: [] for driver in drivers_list}
    
    # Extract text from the PDF
    pdf_path = "Lap Analysis - SIN.pdf"
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            lines = text.split('\n')
    
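            # Drivers named in the most recent header line, in column order
            current_drivers = []
            for line in lines:
                line = line.strip()
    
                if any(driver_name in line for driver_name in drivers_list):
                    # Header line: split on runs of digits and keep the names between them
                    current_drivers = ["".join(g).strip() for _, g in groupby(line, key=str.isdigit)][1::2]
                    continue
                elif ":" in line:
                    # Timing line: drop the "P" (pit) markers, then collect every (lap, time) pair
                    iterator = iter(re.findall(timing_pattern, line.replace("P", "")))
                    cnt = 0
                    # Take the pairs two at a time; each chunk belongs to the next driver
                    while chunk := list(islice(iterator, 2)):
                        for time in chunk:
                            # Insert the time at index (lap number - 1)
                            driver_lap_times[current_drivers[cnt]].insert(int(time[0])-1, time[1])
                        cnt += 1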
    
    for driver, times in driver_lap_times.items():
        print(f"{driver}: {times}")
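A likely reason the original code stopped at lap 30 is that `pattern.match` only matches at the start of a line, so when one text line holds several lap/time pairs, only the first pair is read. The sketch below contrasts `re.match` with `re.findall` and walks through the header-splitting and pairing steps used in the code above. The two sample lines are made up for illustration. They assume a layout with numbered driver names and two lap columns per driver, and are not taken from the PDF.

```python
import re
from itertools import groupby, islice

timing_pattern = r'(\d+)\s+(\d\d?:\d\d:\d\d|\d\d?:\d\d\.\d{3})'

# Hypothetical lines, assumed to resemble page.extract_text() output.
header_line = "1 Max VERSTAPPEN 3 Daniel RICCIARDO"
data_line = "2 2:04.000 32 1:59.000 2 2:08.000 32 1:58.000"

# Split the header on runs of digits; the odd positions hold the names.
drivers = ["".join(g).strip() for _, g in groupby(header_line, key=str.isdigit)][1::2]

# match() anchors at the start of the line, so it sees only the first pair.
first_only = re.match(timing_pattern, data_line).groups()

# findall() returns every (lap, time) pair on the line.
all_pairs = re.findall(timing_pattern, data_line)

# Consume the pairs two at a time: one chunk per driver, in header order.
iterator = iter(all_pairs)
per_driver = {driver: list(islice(iterator, 2)) for driver in drivers}

print(first_only)
print(per_driver)
```

Note that this pairing assumes every driver contributes exactly two pairs per line. If one of a driver's columns is empty on some line, later pairs would shift to the wrong driver, so it is worth checking each list's length against the expected lap count.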
    

    Results in:

    Max VERSTAPPEN: ['21:11:14', '2:04.389', '2:03.369', '2:03.238', '2:02.703', '2:03.027', '2:03.289', '2:23.240', '2:42.690', '2:38.596', '2:01.612', '2:00.967', '2:01.842', '2:01.558', '2:01.407', '2:01.138', '2:00.909', '2:00.807', '2:00.520', '2:00.559', '2:09.641', '2:35.288', '1:58.377', '1:58.784', '1:59.689', '2:22.695', '2:00.464', '2:13.500', '2:40.814', '2:01.087', '2:08.953', '1:59.035', '1:58.925', '1:59.987', '1:59.730', '2:05.082', '2:55.179', '2:37.134', '2:47.754', '2:28.080', '2:13.541', '2:18.778', '1:53.170', '1:51.370', '1:50.616', '1:50.049', '1:51.824', '1:51.455', '1:50.508', '1:49.944', '1:50.250', '1:49.846', '1:49.142', '1:49.979', '1:50.890', '1:50.878', '1:50.652', '1:50.934', '1:50.597', '1:51.068']
    Daniel RICCIARDO: ['21:11:15', '2:08.435', '2:05.767', '2:04.551', '2:04.656', '2:03.973', '2:03.816', '2:22.957', '2:38.839', '2:36.441', '2:03.032', '2:02.862', '2:03.083', '2:03.193', '2:02.900', '2:02.186', '2:02.632', '2:02.034', '2:01.884', '2:01.471', '2:16.073', '2:29.699', '2:01.076', '2:00.897', '2:00.839', '2:23.917', '2:00.418', '2:20.556', '2:37.841', '2:08.887', '1:59.291', '1:59.098', '1:59.323', '1:58.690', '1:59.070', '2:33.600', '2:56.617', '2:38.596', '2:24.180', '1:58.500', '1:56.357', '1:55.744', '1:55.095', '1:54.728', '1:53.419', '1:53.508', '1:51.975', '1:52.198', '1:52.162', '1:51.589', '1:51.688', '1:51.621', '1:51.076', '1:51.061', '1:51.197', '1:51.176', '1:51.006', '1:52.526', '1:52.080']
    Nicholas LATIFI: ['21:11:20', '2:09.155', '2:07.657', '2:03.402', '2:03.659', '2:03.653', '2:03.128', '2:19.609', '2:43.943', '2:41.360', '2:04.843', '2:03.028', '2:02.941', '2:02.308', '2:01.994', '2:01.484', '2:01.533', '2:01.284', '2:01.020', '2:00.747', '2:12.248', '2:31.864', '2:00.568', '2:00.254', '1:59.829', '2:22.811', '2:00.028', '2:16.203', '2:38.033', '2:06.887', '2:05.585', '2:06.200', '3:14.303', '2:01.565', '2:35.635', '2:38.164', '2:37.571', '2:24.004', '2:00.093', '1:56.624', '1:55.031', '1:54.342', '1:53.713', '1:55.876', '1:53.497', '1:52.934', '1:52.678', '1:54.531', '1:52.307', '1:54.699', '1:51.753', '1:51.349', '1:50.728', '1:50.836', '1:50.569', '1:51.662', '1:50.894', '1:52.655']
    Lewis HAMILTON: ['21:11:06', '2:02.920', '2:01.712', '2:01.585', '2:01.454', '2:00.981', '2:00.460', '2:13.316', '2:59.421', '2:45.752', '2:02.192', '2:00.260', '2:00.603', '2:00.810', '2:00.576', '2:00.436', '2:00.320', '2:00.161', '1:59.706', '1:59.735', '2:06.741', '2:36.072', '2:01.904', '1:59.451', '1:59.195', '2:17.933', '2:04.014', '2:09.534', '2:38.755', '2:12.223', '2:13.306', '1:57.601', '1:57.973', '2:14.994', '2:07.369', '2:38.829', '2:36.164', '2:36.898', '2:37.773', '2:23.575', '2:00.250', '1:56.067', '1:54.858', '1:55.027', '1:54.632', '1:55.941', '1:53.348', '1:52.403', '1:52.736', '1:51.903', '1:51.363', '1:51.935', '1:50.994', '1:51.249', '1:50.798', '1:50.794', '1:50.750', '1:54.064', '1:50.622', '1:51.101']
    Lando NORRIS: ['21:11:08', '2:04.065', '2:02.717', '2:02.313', '2:02.322', '2:02.022', '2:02.009', '2:18.617', '2:49.938', '2:43.014', '2:03.436', '2:01.307', '2:01.110', '2:01.304', '2:01.074', '2:01.026', '2:00.715', '2:00.524', '2:00.932', '2:00.536', '2:08.476', '2:36.906', '1:59.786', '1:59.302', '2:00.145', '2:22.891', '1:59.951', '2:13.595', '2:41.231', '1:58.834', '1:58.753', '1:59.385', '1:58.369', '1:58.746', '2:26.728', '3:00.514', '3:00.513', '2:29.021', '1:59.591', '1:54.999', '1:53.336', '1:53.478', '1:52.396', '1:51.109', '1:51.071', '1:50.560', '1:51.165', '1:50.139', '1:49.684', '1:49.993', '1:50.472', '1:50.253', '1:49.929', '1:50.427', '1:49.212', '1:49.749', '1:50.014', '1:50.751']
    Sebastian VETTEL: ['21:11:11', '2:05.790', '2:04.098', '2:03.184', '2:03.366', '2:03.052', '2:03.297', '2:21.496', '2:43.215', '2:39.865', '2:04.900', '2:02.910', '2:02.938', '2:02.701', '2:02.067', '2:01.664', '2:01.690', '2:01.285', '2:01.333', '2:01.365', '2:14.128', '2:30.735', '1:59.802', '2:00.070', '1:59.714', '2:21.882', '1:59.805', '2:17.203', '2:37.487', '2:06.509', '2:07.022', '1:58.680', '1:58.943', '1:58.895', '2:06.308', '2:28.180', '2:35.422', '2:37.784', '2:38.552', '2:23.272', '1:59.228', '1:57.462', '1:55.313', '1:54.864', '1:54.654', '1:55.825', '1:53.467', '1:52.259', '1:52.233', '1:51.999', '1:51.319', '1:51.799', '1:51.396', '1:51.311', '1:51.040', '1:51.022', '1:50.759', '1:51.449', '1:50.669', '1:52.728']
    Pierre GASLY: ['21:11:10', '2:05.132', '2:03.844', '1:59.033', '1:58.769', '2:07.555', '2:29.129']
    Sergio PEREZ: ['21:11:01', '2:01.358', '2:00.875', '2:00.310', '2:00.267', '2:00.094', '2:00.714', '2:15.730', '3:02.306', '2:49.136', '1:59.580', '1:59.473', '1:59.434', '1:59.429', '1:59.018', '1:59.358', '1:59.238', '1:58.905', '1:58.717', '1:58.519', '1:58.780', '2:39.777', '2:03.986', '1:58.332', '1:58.161', '2:14.801', '2:06.003', '2:02.874', '2:39.800', '2:15.377', '2:17.858', '1:57.451', '1:57.603', '1:56.945', '1:56.267', '2:06.352', '2:43.256', '3:06.473', '3:02.549', '2:35.490', '1:56.340', '1:53.693', '1:52.701', '1:51.903', '1:51.355', '1:50.538', '1:50.363', '1:50.501', '1:49.500', '1:49.189', '1:49.285', '1:49.565', '1:48.841', '1:48.578', '1:48.576', '1:48.645', '1:48.251', '1:48.165', '1:49.009', '1:49.652']
    Fernando ALONSO: ['21:11:09', '2:04.847', '2:03.306', '2:02.935', '2:02.858', '2:02.447', '2:02.399', '2:17.774', '2:48.597', '2:42.370', '2:00.177', '1:59.328', '1:59.653', '1:59.603', '1:59.357', '1:59.359', '1:58.983', '1:59.115', '1:59.176', '1:59.107', '1:59.446', '2:39.789', '2:03.112', '1:58.443', '1:58.576', '2:15.018', '2:06.478', '2:04.950', '2:40.754', '2:02.788', '2:01.994', '2:01.844', '2:01.474', '2:01.425', '2:00.911', '2:00.964', '2:00.709', '2:00.463', '2:00.640', '1:53.302', '1:52.533', '1:51.724', '1:51.582', '1:50.328', '1:50.798', '1:50.151', '1:50.469', '1:49.177', '1:49.336', '1:50.099', '1:49.557', '1:48.839', '1:48.753', '1:49.016', '1:49.012', '1:49.069', '1:51.181', '1:49.913']
    Charles LECLERC: ['21:11:02', '2:01.214', '2:00.939', '2:00.734', '2:00.219', '2:00.229', '2:00.138', '2:16.598', '3:02.958', '2:47.301', '1:56.259', '1:57.226', '1:56.811', '2:05.037', '2:27.541', '2:17.673', '3:01.111', '3:03.916', '2:33.178', '1:56.709']
    George RUSSELL: ['21:11:20', '2:08.743', '2:07.696', '2:04.604', '2:02.566', '2:03.817', '2:10.503', '2:23.222', '2:38.861', '2:28.349', '2:03.247', '2:02.443', '2:02.647', '2:03.310', '2:02.758', '2:02.750', '2:02.744', '2:02.454', '2:02.174', '2:02.244', '2:27.932', '2:45.912', '2:09.869', '2:05.338', '2:13.650', '2:20.493', '2:05.212', '2:37.252', '2:36.017', '1:59.285', '2:29.586', '2:02.097', '1:59.791', '1:57.392', '1:56.177', '1:54.795', '2:14.468', '2:52.303', '2:02.336', '2:11.341', '2:19.959', '1:59.931', '3:12.235', '2:25.466', '1:55.808', '2:04.133', '1:55.047', '1:53.640', '1:50.505', '1:51.603', '1:58.702', '1:49.567', '2:01.663', '2:16.484', '1:50.349', '1:46.458', '1:51.674', '1:55.950', '1:51.903']
    Alexander ALBON: ['21:11:21', '2:09.436', '2:07.765', '2:06.625', '2:05.513', '2:05.889', '2:04.953', '2:27.674', '2:36.241', '2:27.761', '2:03.609', '2:03.271', '2:02.665', '2:02.652', '2:03.222', '2:03.057', '2:03.125', '2:03.446', '2:03.364', '2:02.806', '2:03.093', '2:20.921', '2:29.359', '2:03.038', '2:02.121', '2:50.661']
    Lance STROLL: ['21:11:13', '2:08.661', '2:05.367', '2:04.710', '2:04.403', '2:03.964', '2:03.323', '2:24.030', '2:37.465', '2:37.318', '2:03.997', '2:03.049', '2:03.193', '2:02.999', '2:02.272', '2:02.335', '2:02.808', '2:02.165', '2:01.428', '2:01.250', '2:16.746', '2:29.236', '1:59.592', '1:59.199', '1:59.695', '2:22.271', '1:59.002', '2:19.004', '2:37.291', '1:58.710', '2:04.724', '1:59.260', '1:58.733', '1:58.974', '1:59.151', '2:06.733', '2:55.690', '2:37.344', '2:38.824', '2:22.974', '1:59.611', '1:57.095', '1:56.169', '1:55.502', '1:54.765', '1:55.740', '1:52.756', '1:52.270', '1:51.564', '1:51.854', '1:51.786', '1:52.587', '1:51.511', '1:50.823', '1:51.337', '1:51.074', '1:50.708', '1:50.420', '1:50.283', '1:51.958']
    Kevin MAGNUSSEN: ['21:11:14', '2:08.617', '2:05.679', '2:04.881', '2:04.360', '2:04.122', '2:11.492', '3:06.525', '2:35.888', '2:04.183', '2:01.895', '2:02.043', '2:02.299', '2:02.769', '2:02.496', '2:02.777', '2:02.798', '2:02.811', '2:02.761', '2:03.088', '2:20.564', '2:27.787', '2:00.849', '2:00.292', '2:00.860', '2:25.279', '1:59.948', '2:26.924', '2:33.043', '1:58.604', '1:58.696', '2:07.629', '2:31.683', '2:10.970', '2:36.740', '2:09.626', '2:25.490', '2:22.108', '2:01.971', '1:58.852', '1:55.979', '1:56.500', '1:55.982', '1:53.827', '1:54.003', '1:55.269', '1:53.596', '1:52.081', '1:52.540', '1:53.138', '1:52.514', '1:52.228', '1:53.317', '1:52.660', '1:53.041', '1:53.585', '1:54.259', '1:52.067']
    Yuki TSUNODA: ['21:11:12', '2:06.485', '2:05.476', '2:04.333', '2:03.862', '2:03.803', '2:03.162', '2:22.256', '2:41.673', '2:38.012', '2:03.513', '2:03.167', '2:03.129', '2:03.167', '2:02.582', '2:02.443', '2:02.801', '2:01.737', '2:01.240', '2:01.304', '2:28.641', '2:28.707', '2:01.296', '2:00.864', '2:00.402', '2:25.390', '2:00.193', '2:23.205', '2:38.491', '1:59.934', '1:59.622', '1:58.716', '2:06.961', '2:30.183']
    ZHOU Guanyu: ['21:11:20', '2:09.958', '2:07.541', '2:04.482', '2:04.592', '2:04.209', '2:04.065', '2:24.525', '2:38.253', '2:35.211', '2:03.222', '2:02.811', '2:06.289', '2:05.556', '2:06.105', '2:02.641', '2:02.501', '2:02.158', '2:02.038', '2:21.871', '2:28.090', '2:01.732', '2:01.105', '2:01.171', '2:24.936']
    Esteban OCON: ['21:11:17', '2:08.739', '2:06.079', '2:03.245', '2:02.831', '2:02.983']
    Mick SCHUMACHER: ['21:11:16', '2:08.586', '2:05.628', '2:04.531', '2:04.601', '2:03.975', '2:04.039', '2:23.615', '2:37.698', '2:36.977', '2:02.944', '2:02.854', '2:03.170', '2:03.100', '2:02.944', '2:03.052', '2:02.455', '2:02.454', '2:02.038', '2:01.780', '2:20.120', '2:28.565', '2:00.362', '2:00.225', '2:00.487', '2:23.638', '2:00.664', '2:22.204', '2:36.902', '2:01.745', '1:58.777', '1:58.692', '1:59.069', '2:07.583', '2:34.532', '2:36.563', '2:20.553', '2:32.841', '2:24.621', '2:00.769', '3:01.053', '2:32.828', '1:59.502', '2:01.846', '1:57.253', '1:55.624', '1:55.088', '1:52.651', '1:52.416', '1:51.132', '1:52.195', '1:50.731', '1:51.607', '1:52.194', '1:51.917', '1:52.865', '1:51.193', '1:50.290']
    Carlos SAINZ: ['21:11:04', '2:02.702', '2:01.872', '2:01.488', '2:01.111', '2:00.428', '2:00.833', '2:14.044', '3:00.236', '2:46.483', '2:01.252', '2:00.463', '2:00.631', '2:00.917', '2:00.439', '2:00.389', '2:00.144', '2:00.061', '1:59.869', '2:00.003', '2:06.780', '2:36.659', '2:01.685', '1:58.940', '1:59.192', '2:17.465', '2:04.708', '2:10.081', '2:38.981', '1:57.456', '1:58.538', '1:58.780', '1:58.296', '2:07.162', '2:49.952', '2:40.946', '3:00.885', '2:31.655', '1:58.724', '1:54.980', '1:53.471', '1:54.241', '1:52.063', '1:50.988', '1:51.048', '1:50.340', '1:50.101', '1:50.005', '1:50.096', '1:49.424', '1:49.683', '1:49.420', '1:49.626', '1:49.346', '1:49.013', '1:48.712', '1:48.746', '1:48.414']
    Valtteri BOTTAS: ['21:11:18', '2:09.037', '2:06.307', '2:05.300', '2:04.635', '2:04.208', '2:05.534', '2:25.828', '2:37.459', '2:32.122', '2:03.254', '2:02.395', '2:02.810', '2:03.152', '2:02.841', '2:02.980', '2:02.744', '2:02.374', '2:02.142', '2:02.065', '2:21.775', '2:28.554', '2:01.548', '2:01.628', '2:00.992', '2:25.028', '2:01.219', '2:24.225', '1:59.572', '1:59.484', '2:07.048', '2:32.611', '2:11.632', '2:36.790', '2:09.167', '2:25.762', '2:23.782', '2:01.914', '1:57.085', '1:54.236', '1:54.129', '1:53.723', '1:54.985', '1:53.728', '1:52.736', '1:52.579', '1:53.071', '1:54.301', '1:53.656', '1:51.864', '1:52.899', '1:52.228', '1:52.335', '1:52.262', '1:53.276', '1:56.931', '1:56.059']
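
For the further processing the question mentions, the time strings will probably need to become numbers. Here is a minimal sketch, assuming the first entry in each list is a clock time (H:MM:SS) rather than a lap duration, which is how it appears in the output above. `lap_to_seconds` is a hypothetical helper written for this example, not part of pdfplumber.

```python
def lap_to_seconds(value):
    """Convert 'M:SS.sss' or 'H:MM:SS' into seconds as a float."""
    seconds = 0.0
    # Each colon-separated part is worth 60 times the part after it.
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# Sample values copied from the start of the first list above.
times = ['21:11:14', '2:04.389', '2:03.369']

# Skip the first entry (assumed to be a clock time) and convert the rest.
lap_seconds = [lap_to_seconds(t) for t in times[1:]]
print([round(s, 3) for s in lap_seconds])
```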