I'm building a python program using Camelot that extracts tables from a PDF (see code below). I am able to successfully execute the code, but I am hitting a road block on how to get a better output result. Specifically, I'm trying to get the code to create a better output for the second table on page 2 (see PDF link).
PDF Link: https://gzhls.at/blob/ldb/d/3/8/d/2e6caff67e5550c07775014eadfd481b39fc.pdf
The extracted table is saved as an excel with the following output result:
I noticed that the extracted result is not quite flowing right when it comes to the alignment of text to match the table in the PDF, which can be confusing when interpreting the data. Another issue is the text "Single Package" and "Single-sided" are combined in row 5 and should (as I picture it) be on its own row.
Any thoughts on how to optimize my code would be appreciated.
Here's my code (please excuse all my notes):
import camelot
import pandas as pd
import xlsxwriter as writer
# Replace 'input.pdf' with the name of your PDF file
# Replace [1, 2, 3, or "all"] with the page numbers to extract tables from
tables = camelot.read_pdf('cSSD_BG3_Series_EOL.pdf', pages='all', layout_kwargs={'detect_vertical': True, 'all_texts': True}, flavor= 'stream', split_text = True, flag_size = True, edge_tol = 900, row_tol = 10, column_tol = 15) #line_scale = 50, shift_text=['']
#Note: Use "stream" to parse tables that have white spaces between cells to simulate a table structure, "line_scale" and "shift_text" cannot be used with 'stream'
#Note: edge_tol,row_tol cannot be used with flavor='lattice'
#Note: use process_background = True when using "lattice"
camelot.plot(tables[0], kind = 'textedge') #see note below
#Note ^ using the "kind" keyword augment, you can can designate the following plot types: 'text', 'grid', 'line', 'joint', 'contour' and 'textedge' - keep in mind that 'line' and 'joint can only be used with 'lattice' and 'textedge can only be used with 'stream'
# Initialize Pandas Excel writer
writer = pd.ExcelWriter('all_tables.xlsx', engine='xlsxwriter')
# Iterate through each table found in the PDF
for i, table in enumerate(tables):
# Extract table data as a Pandas DataFrame, including headers
df = table.df
# Define sheet name based on table number
sheet_name = f'Table {i+1}'
# Write table data to Excel sheet
df.to_excel(writer, sheet_name=sheet_name, index=False)
# Save Excel file
writer.close()
print('All tables exported to all_tables.xlsx')
Have you tried pdfplumber?
It seems to give the best results from all the libraries I've tried.
import pdfplumber
import pandas as pd
pdf = pdfplumber.open("pdf/2e6caff67e5550c07775014eadfd481b39fc.pdf")
df = pd.DataFrame(pdf.pages[1].extract_table())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ||
---|---|---|---|---|---|---|---|---|---|
0 | Model Number | KBG30ZPZ512G | KBG30ZPZ256G | KBG30ZPZ128G | KBG30ZMS512G | KBG30ZMS256G | KBG30ZMS128G | ||
1 | SED Model Number | KBG3AZPZ512G | KBG3AZPZ256G | KBG3AZPZ128G | KBG3AZMS512G | KBG3AZMS256G | KBG3AZMS128G | ||
2 | Capacity[1] | 512 GB | 256 GB | 128 GB | 512 GB | 256 GB | 128 GB | ||
3 | Form Factor | Single Package | Single-Sided | ||||||
4 | M.2 1620-S3 | M.2 1620-S2 | M.2 2230-S3 | M.2 2230-S2 | |||||
5 | Performance[2] (Up to) | ||||||||
6 | Sequential | Non-SED | 1,500 MB/s | 1,400 MB/s | 1,300 MB/s | 1,500 MB/s | 1,400 MB/s | 1,300 MB/s | |
Read | {1,430 MiB/s} | {1,330 MiB/s} | {1,240 MiB/s} | {1,430 MiB/s} | {1,330 MiB/s} | {1,240 MiB/s} | |||
7 | SED | 1,300 MB/s | 1,250 MB/s | 1,200 MB/s | 1,300 MB/s | 1,250 MB/s | 1,200 MB/s | ||
{1,240 MiB/s} | {1,190 MiB/s} | {1,140 MiB/s} | {1,240 MiB/s} | {1,190 MiB/s} | {1,140 MiB/s} | ||||
8 | Sequential | Non-SED | 1,000 MB/s | 800 MB/s | 600 MB/s | 1,000 MB/s | 800 MB/s | 600 MB/s | |
Write | {950 MiB/s} | {760 MiB/s} | {570 MiB/s} | {950 MiB/s} | {760 MiB/s} | {570 MiB/s} | |||
9 | SED | 950 MB/s | 750 MB/s | 550 MB/s | 950 MB/s | 750 MB/s | 550 MB/s | ||
{900 MiB/s} | {710 MiB/s} | {520 MiB/s} | {900 MiB/s} | {710 MiB/s} | {520 MiB/s} | ||||
10 | Power Requirements | ||||||||
11 | Supply Voltage | 3.3 V ± 5 % | 1.8 V ± 5 % | 1.2 V ± 5 % | 3.3 V ± 5 % | ||||
12 | Power | Active | 2.8 W typ. | 2.7 W typ. | 3.3 W typ. | 3.2 W typ. | |||
Consump- | |||||||||
tion (Typ.) | |||||||||
13 | L1.2 mode | 5 mW typ. | 5 mW typ. | ||||||
14 | Reliability[3] | ||||||||
15 | MTTF | 1,500,000 hours | |||||||
Product Life: Approximately 5 years | |||||||||
16 | Mechanical | ||||||||
17 | Dimension (LxWxH) | 20.0 mm x | 20.0 mm x | 30.0 mm x | 30.0 mm x | ||||
16.0 mm x | 16.0 mm x | 22.0 mm x | 22.0 mm x | ||||||
1.5 mm | 1.3 mm | 2.38 mm | 2.18 mm | ||||||
18 | Weight (Typ.) | 1.00 g typ. | 0.85 g typ. | 2.60 g typ. | 2.42 g typ. | ||||
19 | Environmental | ||||||||
20 | Temperature | Operating: 0 °C to 80 °C (Package Surface Temperature) | Operating: 0 °C to 80 °C (Components Temperature) | ||||||
21 | Non-Operating: -40 °C to 85 °C | ||||||||
22 | Vibration (Operating) | - | 196 m/s2 { 20 G } ( Peak, 10 ~ 2,000 Hz ) | ||||||
23 | Shock (Operating) | - | 14.7 km/s2 { 1,500 G } ( 0.5 ms ) | ||||||
24 | Additional Features | • Device Self-test is supported. | |||||||
• Host Controlled Thermal Management (HCTM) is supported. | |||||||||
• The feature of Host Memory Buffer (HMB) is supported. | |||||||||
• Firmware security feature (only digitally signed firmware can be installed) is supported. |