python pdf tabula-py

tabula-py does not read all rows from PDFs with alternating row colors when lattice is set to True


I am trying to extract all rows from the PDF attached here.

Here is the code I used:

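import pandas as pd
from tabula import read_pdf

def parse_latticepdf_pages(pdf):
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,             # use the explicit area below, not auto-detection
        lattice=True,            # detect cells from ruling lines
        silent=True,
        area=[43, 5, 568, 774],  # top, left, bottom, right (in points)
        pandas_options={'header': None}
    )

    # Combine the per-table DataFrames into one
    return pd.concat(pages)

parse_latticepdf_pages(pdf="file.pdf")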

The output contains only the rows with a grey background; the rows with a white background are missing. How do I get all rows regardless of their background color?

Note: Initially I tried stream = True, but that caused other problems: each line of text came back as a separate row, which made it impossible to group the rows as needed. That is why I set lattice = True. Enabling or disabling multiple_tables makes no difference.

I would appreciate any help regarding this. Thank you!


Solution

  • I finally managed to solve this. For this particular PDF layout, it is better to use another Python package such as PyMuPDF. I posted a similar question elsewhere on Stack Overflow and am linking it here, in the hope that it helps others struggling with a similar problem.

    Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row
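
For readers who want a starting point, here is a minimal sketch of the PyMuPDF route. It is not the code from the linked post. It assumes that words on the same visual line share roughly the same vertical position, and the names group_words_into_rows, extract_rows and y_tolerance are hypothetical helpers made up for illustration. The only PyMuPDF calls used are fitz.open and page.get_text("words"), which returns tuples of the form (x0, y0, x1, y1, word, block_no, line_no, word_no).

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group PyMuPDF word tuples into text lines by vertical position.

    y_tolerance (in points) is an assumed threshold; tune it for your PDF.
    """
    rows = []  # each entry: (y_mid of the first word in the line, [word tuples])
    # Sort top-to-bottom by vertical midpoint, then left-to-right
    for w in sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0])):
        y_mid = (w[1] + w[3]) / 2
        if rows and abs(y_mid - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(w)      # same visual line as the previous word
        else:
            rows.append((y_mid, [w]))  # start a new line
    # Join the words of each line in left-to-right order
    return [" ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
            for _, ws in rows]


def extract_rows(pdf_path):
    """Return the text lines of every page, ignoring row background colors."""
    import fitz  # PyMuPDF; imported here so the grouping helper has no dependency

    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rows.extend(group_words_into_rows(page.get_text("words")))
    return rows
```

Calling extract_rows("file.pdf") would return one string per visual line. Because this reads the text layer directly instead of looking for ruling lines, the row background color should not matter. If a table row spans several text lines, those lines still have to be merged afterwards, and how to do that depends on the layout of the PDF.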