pythonautomationpython-camelotpython-pdfreaderpdftables

Better Layout Output for PDF Tables Extracted using Camelot


I'm building a python program using Camelot that extracts tables from a PDF (see code below). I am able to successfully execute the code, but I am hitting a road block on how to get a better output result. Specifically, I'm trying to get the code to create a better output for the second table on page 2 (see PDF link).

PDF Link: https://gzhls.at/blob/ldb/d/3/8/d/2e6caff67e5550c07775014eadfd481b39fc.pdf

The extracted table is saved as an excel with the following output result:

output for table 2

I noticed that the extracted result is not quite flowing right when it comes to the alignment of text to match the table in the PDF, which can be confusing when interpreting the data. Another issue is the text "Single Package" and "Single-sided" are combined in row 5 and should (as I picture it) be on its own row.

Any thoughts on how to optimize my code would be appreciated.

Here's my code (please excuse all my notes):

import camelot
import pandas as pd
import xlsxwriter as writer


# Replace 'input.pdf' with the name of your PDF file
# Replace [1, 2, 3, or "all"] with the page numbers to extract tables from

tables = camelot.read_pdf('cSSD_BG3_Series_EOL.pdf', pages='all', layout_kwargs={'detect_vertical': True, 'all_texts': True}, flavor= 'stream', split_text = True, flag_size = True, edge_tol = 900, row_tol = 10, column_tol = 15) #line_scale = 50, shift_text=['']

#Note: Use "stream" to parse tables that have white spaces between cells to simulate a table structure, "line_scale" and "shift_text" cannot be used with 'stream'
#Note: edge_tol,row_tol cannot be used with flavor='lattice'
#Note: use process_background = True when using "lattice"

camelot.plot(tables[0], kind = 'textedge') #see note below 

#Note ^ using the "kind" keyword augment, you can can designate the following plot types: 'text', 'grid', 'line', 'joint', 'contour' and 'textedge' - keep in mind that 'line' and 'joint can only be used with 'lattice' and 'textedge can only be used with 'stream'


# Initialize Pandas Excel writer
writer = pd.ExcelWriter('all_tables.xlsx', engine='xlsxwriter')

# Iterate through each table found in the PDF
for i, table in enumerate(tables):
    # Extract table data as a Pandas DataFrame, including headers
    df = table.df

    # Define sheet name based on table number
    sheet_name = f'Table {i+1}'

    # Write table data to Excel sheet
    df.to_excel(writer, sheet_name=sheet_name, index=False)

# Save Excel file
writer.close()
print('All tables exported to all_tables.xlsx')

Solution

  • Have you tried pdfplumber?

    It seems to give the best results from all the libraries I've tried.

    import pdfplumber
    import pandas as pd
     
    pdf = pdfplumber.open("pdf/2e6caff67e5550c07775014eadfd481b39fc.pdf")
    
    df = pd.DataFrame(pdf.pages[1].extract_table())
    
    0 1 2 3 4 5 6 7
    0 Model Number KBG30ZPZ512G KBG30ZPZ256G KBG30ZPZ128G KBG30ZMS512G KBG30ZMS256G KBG30ZMS128G
    1 SED Model Number KBG3AZPZ512G KBG3AZPZ256G KBG3AZPZ128G KBG3AZMS512G KBG3AZMS256G KBG3AZMS128G
    2 Capacity[1] 512 GB 256 GB 128 GB 512 GB 256 GB 128 GB
    3 Form Factor Single Package Single-Sided
    4 M.2 1620-S3 M.2 1620-S2 M.2 2230-S3 M.2 2230-S2
    5 Performance[2] (Up to)
    6 Sequential Non-SED 1,500 MB/s 1,400 MB/s 1,300 MB/s 1,500 MB/s 1,400 MB/s 1,300 MB/s
    Read {1,430 MiB/s} {1,330 MiB/s} {1,240 MiB/s} {1,430 MiB/s} {1,330 MiB/s} {1,240 MiB/s}
    7 SED 1,300 MB/s 1,250 MB/s 1,200 MB/s 1,300 MB/s 1,250 MB/s 1,200 MB/s
    {1,240 MiB/s} {1,190 MiB/s} {1,140 MiB/s} {1,240 MiB/s} {1,190 MiB/s} {1,140 MiB/s}
    8 Sequential Non-SED 1,000 MB/s 800 MB/s 600 MB/s 1,000 MB/s 800 MB/s 600 MB/s
    Write {950 MiB/s} {760 MiB/s} {570 MiB/s} {950 MiB/s} {760 MiB/s} {570 MiB/s}
    9 SED 950 MB/s 750 MB/s 550 MB/s 950 MB/s 750 MB/s 550 MB/s
    {900 MiB/s} {710 MiB/s} {520 MiB/s} {900 MiB/s} {710 MiB/s} {520 MiB/s}
    10 Power Requirements
    11 Supply Voltage 3.3 V ± 5 % 1.8 V ± 5 % 1.2 V ± 5 % 3.3 V ± 5 %
    12 Power Active 2.8 W typ. 2.7 W typ. 3.3 W typ. 3.2 W typ.
    Consump-
    tion (Typ.)
    13 L1.2 mode 5 mW typ. 5 mW typ.
    14 Reliability[3]
    15 MTTF 1,500,000 hours
    Product Life: Approximately 5 years
    16 Mechanical
    17 Dimension (LxWxH) 20.0 mm x 20.0 mm x 30.0 mm x 30.0 mm x
    16.0 mm x 16.0 mm x 22.0 mm x 22.0 mm x
    1.5 mm 1.3 mm 2.38 mm 2.18 mm
    18 Weight (Typ.) 1.00 g typ. 0.85 g typ. 2.60 g typ. 2.42 g typ.
    19 Environmental
    20 Temperature Operating: 0 °C to 80 °C (Package Surface Temperature) Operating: 0 °C to 80 °C (Components Temperature)
    21 Non-Operating: -40 °C to 85 °C
    22 Vibration (Operating) - 196 m/s2 { 20 G } ( Peak, 10 ~ 2,000 Hz )
    23 Shock (Operating) - 14.7 km/s2 { 1,500 G } ( 0.5 ms )
    24 Additional Features • Device Self-test is supported.
    • Host Controlled Thermal Management (HCTM) is supported.
    • The feature of Host Memory Buffer (HMB) is supported.
    • Firmware security feature (only digitally signed firmware can be installed) is supported.