Tags: python, tabula, tabula-py

I'm using tabula in a for loop and getting this error: IndexError: list index out of range


I'm using a for loop to work through an entire folder of PDFs, converting each one to a CSV file.

import tabula
import os
import pandas as pd
files_in_directory = os.listdir()

filtered_files = [file for file in files_in_directory if file.endswith(".pdf")]
print(range(len(filtered_files)))
for file in range(len(filtered_files)):
    print(file-1)
    print(range(len(filtered_files)))

    print(file)
    print(filtered_files[file-1])
    df = tabula.read_pdf(filtered_files[file-1])
    csv_name = filtered_files[file-1] + '.csv'
    df[file-1].to_csv(csv_name, encoding='utf-8')

Here is my log:

Traceback (most recent call last):
  File "/Users/braydenyates/Documents/Band PDFS/csv_converter.py", line 16, in <module>
    df[file-1].to_csv(csv_name, encoding='utf-8')
IndexError: list index out of range

The code appears to process two of the sixty-three files in the folder, then stops with this error. Thank you for your help!


Solution

  • The number of PDF files you have does not necessarily equal the number of dataframes tabula extracts from any one PDF. file is the index of the Nth file, while df is actually a list of dataframes, one per table found in the current PDF. Indexing that list with df[file-1] therefore makes no sense: as soon as file-1 reaches the length of the list, you get the IndexError. Instead, loop through the dataframes and save each one individually (or do whatever else you intended with them).

    Here is a simpler, more Pythonic solution:

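    import os

    import tabula

    # Collect every PDF in the current working directory.
    files_in_directory = os.listdir()
    filtered_files = [file for file in files_in_directory if file.endswith(".pdf")]

    for file in filtered_files:
        # read_pdf returns a list of DataFrames, one per table it detects.
        # By default only page 1 is read; pass pages='all' to scan every page.
        dfs = tabula.read_pdf(file)

        # Write each table to its own CSV, numbered from 1.
        for nth_frame, df in enumerate(dfs, start=1):
            csv_name = f'{file}_{nth_frame}.csv'
            df.to_csv(csv_name, encoding='utf-8')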
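
To see why the original loop got through two files before failing, here is a minimal sketch that replays its indexing with plain lists standing in for the lists tabula.read_pdf returns. No PDFs are read. The function name simulate_original_loop and the sample data are hypothetical, and the one-table-per-PDF case is an assumption, since the question does not say how many tables each PDF contains.

```python
def simulate_original_loop(tables_per_file):
    """Replay the question's indexing on stand-in data.

    tables_per_file[i] plays the role of the list returned by
    tabula.read_pdf for the i-th PDF (hypothetical data only).
    Returns the loop counter at which df[file - 1] raises
    IndexError, or None if every lookup succeeds.
    """
    for file in range(len(tables_per_file)):
        # Same lookup as the question: the file index minus one picks the PDF.
        df = tables_per_file[file - 1]
        try:
            # Same mistake as the question: the file index also picks the table.
            df[file - 1]
        except IndexError:
            return file
    return None


# Assumption: three PDFs, each yielding exactly one table.
one_table_each = [["table"], ["table"], ["table"]]
print(simulate_original_loop(one_table_each))
```

Under that assumption the function returns 2: the counter values 0 and 1 produce the indices -1 and 0, which are both valid for a one-element list, while the value 2 produces index 1 and fails. That would match two files being converted before the crash.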