python pdf python-camelot

Python library Camelot not reading all tables on one page


I'm using the Camelot Python library to read all the tables on one page of a PDF document.

I'm trying to read all the tables on page 10 of this PDF.

I tried to debug by plotting the page, and I noticed that the result changes when I change the flavor:

This is with flavor lattice

This is with flavor stream

The problem is that with flavor='lattice' the tables are not read properly; an example here.

With flavor='stream' the data is read properly, but for only one table. The output is something like this.

I tried using table_area/table_regions to detect the two tables with flavor='stream', but it didn't work. The code is pasted below.

Code with lattice:

import camelot

file = "2022/Auto-trend0122.pdf" 
tables = camelot.read_pdf(file,pages='10',flavor='lattice',edge_tool=1500) 
print("Total tables extracted:", tables.n) 
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
print(tables[1].df)

Code with stream, without table_area/table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream', edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream, with table_area:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_area=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream, with table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_regions=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

The output is the same with table_regions, with table_area, and with neither.


Solution

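  • The problem is that you are using table_area instead of the correct parameter, table_areas (see the docs). Judging by your output, the misspelled keyword is accepted without an error and simply has no effect, which is why the result did not change.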

    The following command works perfectly:

    tables = camelot.read_pdf(file,pages='10', flavor='stream', edge_tool=1500, table_areas=['10,450,550,50','10,750,550,450'])
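Since the misspelled table_area raised no error in the question, a small guard that rejects unknown keyword names before they reach camelot.read_pdf can catch this kind of typo early. The following is only a sketch: check_kwargs and KNOWN_STREAM_KWARGS are hypothetical names, and the keyword list is an assumption based on Camelot's documented options for flavor='stream', so check it against the version you have installed.

```python
import difflib

# Assumption: option names for flavor='stream' as listed in Camelot's docs;
# adjust this set to match the Camelot version you actually use.
KNOWN_STREAM_KWARGS = {
    "table_areas", "table_regions", "columns", "split_text",
    "flag_size", "strip_text", "edge_tol", "row_tol", "column_tol",
}


def check_kwargs(kwargs, known=KNOWN_STREAM_KWARGS):
    """Return kwargs unchanged, or raise TypeError on the first unknown name."""
    for name in kwargs:
        if name not in known:
            # Suggest the most similar known option, if there is one.
            close = difflib.get_close_matches(name, sorted(known), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise TypeError(f"unknown keyword {name!r}{hint}")
    return kwargs
```

It could be used as camelot.read_pdf(file, pages='10', flavor='stream', **check_kwargs(options)), where options is a dict of the stream settings. With the question's arguments it would reject table_area and suggest table_areas. It would also reject edge_tool, which looks like a misspelling of Camelot's edge_tol option.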

    Difference between table_areas and table_regions

    table_areas should be used when you know the exact position of each table. In contrast, table_regions only narrows the search: the detection engine looks for tables inside those approximate page regions.
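Both parameters take strings of the form 'x1,y1,x2,y2', where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner in PDF coordinates (origin at the bottom-left of the page, so the top edge has the larger y value). As a rough sketch, a helper can build and sanity-check those strings. The names area and read_two_tables are hypothetical, and the coordinates are the ones from the question, not values verified against the PDF.

```python
def area(x1, y1, x2, y2):
    """Format a box as 'x1,y1,x2,y2' (top-left, then bottom-right corner)."""
    # In PDF space y grows upwards, so the top edge must have the larger y.
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


def read_two_tables(path, page="10"):
    """Read two known table areas from one page with the stream flavor."""
    # Imported here so that area() can be used without Camelot installed.
    import camelot

    areas = [area(10, 450, 550, 50), area(10, 750, 550, 450)]
    return camelot.read_pdf(path, pages=page, flavor="stream", table_areas=areas)
```

Passing the same list as table_regions instead of table_areas would leave the actual table boundaries for Camelot to detect within those boxes.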