I'm using a for loop to work through an entire folder of pdfs, which are converted to csv files.
import tabula
import os
import pandas as pd
files_in_directory = os.listdir()
filtered_files = [file for file in files_in_directory if file.endswith(".pdf")]
print(range(len(filtered_files)))
for file in range(len(filtered_files)):
    print(file-1)
    print(range(len(filtered_files)))
    print(file)
    print(filtered_files[file-1])
    df = tabula.read_pdf(filtered_files[file-1])
    csv_name = filtered_files[file-1] + '.csv'
    df[file-1].to_csv(csv_name, encoding='utf-8')
Here is my log:
Traceback (most recent call last):
  File "/Users/braydenyates/Documents/Band PDFS/csv_converter.py", line 16, in <module>
    df[file-1].to_csv(csv_name, encoding='utf-8')
IndexError: list index out of range
The code appears to process two of the sixty-three files in the folder, then stops with this error. Thank you for your help!
The number of PDF files you have does not necessarily equal the number of dataframes tabula manages to extract from any one of the PDFs. file is the index of the Nth file, while df is actually a list of the dataframes found in that one file. Therefore df[file-1] uses a file index to pick a table, which is not sensible and raises IndexError as soon as the index exceeds the number of tables in the current PDF. Loop through the dataframes and save them individually, or whatever is intended.
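To see the mismatch without running tabula at all, here is a minimal sketch that uses plain lists as a stand-in for what tabula.read_pdf returns. The file names and table counts are made up for illustration; only the indexing pattern is taken from the question.

```python
# Stand-in for tabula.read_pdf: each PDF yields its own list of tables.
# The file names and table counts below are invented for illustration.
fake_tables = {
    "a.pdf": ["table 1"],
    "b.pdf": ["table 1"],
    "c.pdf": ["table 1", "table 2"],
}
filtered_files = list(fake_tables)

failed_at = None
for file in range(len(filtered_files)):
    # df is the list of tables for ONE pdf, not one entry per pdf
    df = fake_tables[filtered_files[file - 1]]
    try:
        df[file - 1]  # file index reused as a table index, as in the question
    except IndexError:
        failed_at = file
        break

print(failed_at)
```

With this made-up data the first two iterations happen to succeed and the third one fails, which is the same pattern as in your log: the loop only survives while the file index is still a valid table index for the current PDF.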
Here, have a more pythonic and simpler solution:
import tabula
import os
import pandas as pd
files_in_directory = os.listdir()
filtered_files = [file for file in files_in_directory if file.endswith(".pdf")]
for file in filtered_files:
    dfs = tabula.read_pdf(file)
    for nth_frame, df in enumerate(dfs, start=1):
        csv_name = f'{file}_{nth_frame}.csv'
        df.to_csv(csv_name, encoding='utf-8')
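One optional tweak: the names built above keep the original extension, so you get files like report.pdf_1.csv. If you would rather have report_1.csv, a small helper along these lines should do it. csv_name_for is a hypothetical name of my own, not part of tabula or pandas; it only relies on pathlib from the standard library.

```python
from pathlib import Path


def csv_name_for(pdf_name, nth_frame):
    """Build an output name such as 'report_2.csv' from 'report.pdf'.

    Hypothetical helper for illustration; not part of tabula or pandas.
    """
    # Path.stem drops only the final suffix, so 'my.band.pdf' -> 'my.band'
    return f"{Path(pdf_name).stem}_{nth_frame}.csv"


print(csv_name_for("report.pdf", 2))
```

Inside the inner loop you would then write csv_name = csv_name_for(file, nth_frame) instead of the f-string.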