I am trying to extract table information from pdf using Camelot-py library. Initially using stream function like this:
import camelot
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1', columns=['110,400'], split_text=True, row_tol=10)
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
However could not get the desired outcome, even after adjusting the columns value. Then I switched to lattice flavor. It can now determine the column accurately, however due to the nature that the pdf source does not separate rows using lines, the whole table content are extracted on one row.
Below using lattice:
import camelot
tables = camelot.read_pdf('sample_camelot_extract.pdf', flavor='lattice', pages='1')
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
The logic that I want to implement is that for each new text that exists on the first column (FIG ITEM), it should be the start of the new row.
Have tried both flavors but not sure which is the best approach.
Link for original file here:
Thank you.
You could try using pdfplumber - it allows you to customize all of its table extraction settings.
For example - changing just the default horizontal strategy to text
produces:
table = page.extract_table(table_settings={"horizontal_strategy": "text"})
[['FIG', '', '', 'EFFECT', 'UNITS'],
['ITEM', 'PART NUMBER', '1234567 NOMENCLATURE', 'FROM TO', 'PER\nASSY'],
['', '', '', '', ''],
['1', '', '', '', ''],
['', '', 'SYSTEM INSTL-AIR DISTR MIX', '', ''],
['', '', 'BAY (MAIN AIR', '', ''],
['', '', 'DISTRIBUTION ONLY)', '', ''],
You could play around with more settings to see if it's possible to extract the whole table the way you intend.
From here though - you could manually clean up and extract the column rows:
>>> df = pd.DataFrame(table)
>>> (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
0 FIG ITEM
1 PART NUMBER
2 1234567 NOMENCLATURE
3 EFFECT FROM TO
4 UNITS PER ASSY
dtype: object
df.columns = (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
df = df.tail(-3)
You could then forward fill the FIG ITEM
column and group on that - allowing you to combine the items.
df.groupby(df["FIG ITEM"].replace("", float("nan")).ffill()).agg({
"PART NUMBER": "first",
"1234567 NOMENCLATURE": "\n".join,
"UNITS PER ASSY": "first",
})
PART NUMBER 1234567 NOMENCLATURE UNITS PER ASSY
FIG ITEM
- 1 M0DREF452754 SYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nDIS... RF
1 \nSYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nD...
10 BACS12GU3K8 .SCREW 12
15 BACS12GU3K9 .SCREW 18
20 BACB30NM3K15 .BOLT 15
27 BACB30NM3K17 .BOLT 2
28 BACB30NM3K20 .BOLT 1
30 NAS1149D0332J .WASHER 60
35 BACW10P44AL .WASHER 2
40 PLH53CD .NUT-\nSUPPLIER CODE:\nVF0224\nSPECIFICATION N... 2
45 SLT8LHC6 .STRAP-\nSUPPLIER CODE:\nV06383\nTRUE PART NUM... 7
5 BACS12GU3K7 .SCREW 12