excelpdfpowerquery

Loading tables from multiple PDF documents into excel


my work receives monthly bank statements from clients that we need to be able to export to excel (or some other worksheet) and merge into one sheet, ideally one single table.

The bank statements are generally in the same format but some months there are more transactions than other months so the specific table I need to pull out can be in different spots/pages on the pdf.

I can use PowerQuery/GetData to pull information from one PDF and choose the right table manually which works reasonably well but this is slow and merging each table into something we can work with is taking a lot of time.

I know PowerQuery can pull from a folder but it requires a sample table from the first pdf in the folder to work out which tables to pull out and since the tables are in different places it then fails to bring what I need.

Does anyone have any advice on a way to overwrite which table excel extracts for each pdf prior to combining the files or any other work around for this?

Thanks Dan


Solution

  • This is the basic method for combining all PDFs in a directory without knowing their formats in advance

    let Source = Folder.Files("c:\temp9\"),
    #"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".pdf")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each Pdf.Tables(File.Contents([Folder Path]&[Name]))),
    #"Renamed Columns" = Table.RenameColumns(#"Added Custom",{{"Name", "zName"}}),
    List = List.Union(List.Transform(#"Renamed Columns" [Data], each Table.ColumnNames(_))), 
    #"Expanded Data" = Table.ExpandTableColumn(#"Renamed Columns" , "Data", List,List)
    in #"Expanded Data"
    

    Then you would filter the ID column to pick up the types of items you want (starts with table, starts with page, etc), and then transform and expand the Data column