python, pdf, web-scraping, pdfbox, pdftotext

How to properly scrape newspaper PDFs


I'm starting to think it's impossible to do what I want, but I thought I'd ask here before giving up.

I have almost 200 archival PDFs of a newspaper that I would like to analyse. However, I only want to run the analysis on the letters section, and depending on the layout of the adverts and the conversion method (whether pdfbox or pdftotext), there is no consistent beginning or end to the section that I could reliably match with a regex.
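For illustration, this is roughly the kind of thing I've been trying; the section markers here are invented, which is exactly the problem, since the real pages have nothing that consistent:

```python
# Rough sketch of the regex approach (not reliable in practice).
# Assumes poppler's pdftotext is on the PATH; the start/end markers
# below are hypothetical -- the real issues have no such consistent
# boundaries, which is the whole problem.
import re
import subprocess

def extract_letters(pdf_path):
    # Convert the whole PDF to plain text, preserving the layout
    text = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Hypothetical boundary markers: depending on the advert layout,
    # a pattern like this either misses the section or over-captures.
    match = re.search(
        r"LETTERS TO THE EDITOR(.*?)CLASSIFIED ADVERTS",
        text, re.DOTALL | re.IGNORECASE,
    )
    return match.group(1) if match else None
```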

Can anyone think of a way to do something like this? I've looked at it for a while and it seems like the only reliable way might be to manually sift through every PDF.


Solution

  • Thought I'd just write a little about what I did to get this working:

    Following on from what @TilmanHausherr said, I was about to start manually cropping each page individually and then doing the text extraction on those cropped pages (there's a rough sketch of that crop-and-extract step after the numbered list below).

    However, I thought I might as well cut down the manual cropping as much as possible by first getting rid of the pages that were totally unnecessary (99% of them).

    So even if my semi-automated selections weren't 100% accurate, they would at least mean less manual work for me, which would be helpful either way. This was the process:

    1. Using Acrobat, I ran a JavaScript search to extract any pages containing a certain keyword into a new document. The catch is that this has to be a single-word keyword; nonetheless I found a fairly unique word, 'disclaimer', that appeared on all the letters pages. Even if it caught the odd extra page, it didn't matter, since all I wanted was to reduce the eventual manual work. (A scripted sketch of this filtering step is below the list.)

    2. I then wanted to make the pages as easy as possible to crop manually. Since all the images were irrelevant, I used the program pdftoolbox on the 14-day trial for its crazy feature that automatically splits text, images and vectors into different layers, which can then be deleted or made invisible.

    3. This is done by going to the Fixups menu, searching for the "create different layers for vectors.." option and clicking Fix. Then, once it's done, go to the Explore Layers option under the main menu and delete everything but the text layer. This turned out to be super effective at removing the additional junk; it's almost like adblock for newspapers :)

    4. There was still some remaining junk, but after removing all the images, all I had to do was go through a couple of pages in the Acrobat editor and check there was no unrelated text, which was the only manual work left to be done.
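    If you don't have Acrobat, the keyword filtering in step 1 can also be scripted. This is only a sketch using pypdf rather than what I actually ran (I used Acrobat's JavaScript search), and how well `extract_text()` works will depend on the PDFs themselves:

    ```python
    # Sketch of step 1 without Acrobat: copy every page that contains the
    # keyword into a new PDF. Needs pypdf (pip install pypdf); extraction
    # quality varies with how the newspaper PDFs were produced.
    from pypdf import PdfReader, PdfWriter

    KEYWORD = "disclaimer"  # the one-word marker found on all letters pages

    def filter_pages(src_path, dst_path, keyword=KEYWORD):
        reader = PdfReader(src_path)
        writer = PdfWriter()
        for page in reader.pages:
            text = page.extract_text() or ""
            if keyword.lower() in text.lower():
                writer.add_page(page)
        with open(dst_path, "wb") as fh:
            writer.write(fh)

    # Hypothetical file names
    filter_pages("issue_042.pdf", "issue_042_letters.pdf")
    ```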
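    And for the crop-and-extract step mentioned at the start, pdftotext can pull text from just a region of specific pages, which saves doing the extraction inside Acrobat. The page number and crop box below are placeholders that have to be worked out per page:

    ```python
    # Sketch of extracting text from a cropped region of one page by calling
    # poppler's pdftotext. The page number and crop box are placeholders;
    # see the pdftotext man page for the coordinate units.
    import subprocess

    def extract_region(pdf_path, page, x, y, w, h):
        result = subprocess.run(
            ["pdftotext",
             "-f", str(page), "-l", str(page),  # only this page
             "-x", str(x), "-y", str(y),        # top-left corner of the crop box
             "-W", str(w), "-H", str(h),        # width and height of the crop box
             pdf_path, "-"],                    # "-" sends the text to stdout
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    # Hypothetical crop covering a letters column on page 3
    print(extract_region("issue_042_letters.pdf", 3, 50, 120, 400, 650))
    ```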

    I think it's pretty funny that I was completely stuck while trying to automate the entire process, but once I instead tried to reduce the manual work as much as possible, I ended up automating about 99% of it anyway.

    Guess I was subconsciously falling into the perfect solution fallacy when I was trying to automate it.

    ¯\_(ツ)_/¯