python parsing pdf pdfminer pdf-parsing

Extracting data from a PDF under a particular heading in Python


I want to parse a PDF file in Python. I have seen examples using PDFMiner, but none of them cover my requirement.

For example, a resume contains various sections such as Summary, Experience, and Hobbies.

I am interested in extracting only the Experience section. It may appear first, second, or anywhere else in the document, so I need to identify where it is located and then extract its content.

How can I do this?


Solution

  • There are three viable approaches to extracting that field's data:

    1. Search for a predefined keyword, such as Experience, to get its location. Then search for the next section's keyword (for example, Hobbies), determine the coordinates of the text region between the two headings, and extract the text from that region.

    2. If the PDFs are all produced by the same generator, you can find the coordinates of the Experience section once and extract text from the same location every time.

    3. (Easiest) Convert the whole page to text and then parse the resulting text using substring search or regular expressions. This is the easiest and simplest way, because all the PDF-specific work is left to a specialized tool.
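
A minimal sketch of the text-based approach, assuming each heading sits alone on its own line in the extracted text. The heading list and the function names extract_section and extract_section_from_pdf are hypothetical, chosen for illustration. The PDF-to-text step assumes the pdfminer.six package and its extract_text helper; real resumes may need looser heading patterns.

```python
import re

# Hypothetical list of headings expected in the resume; adjust to your documents.
SECTION_HEADINGS = ["Summary", "Experience", "Hobbies"]


def extract_section(text, wanted, headings=SECTION_HEADINGS):
    """Return the text between the `wanted` heading and the next known heading."""
    # Match any known heading that stands alone on a line (optional trailing ':').
    alternatives = "|".join(re.escape(h) for h in headings)
    heading_re = re.compile(
        rf"^[ \t]*({alternatives})[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(heading_re.finditer(text))
    for i, m in enumerate(matches):
        if m.group(1).lower() == wanted.lower():
            # The section ends where the next heading starts, or at end of text.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            return text[m.end():end].strip()
    return None  # heading not found


def extract_section_from_pdf(path, wanted):
    """Convert the whole PDF to text, then cut out the wanted section."""
    # Imported lazily so extract_section works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_section(extract_text(path), wanted)
```

Because the search is driven by the heading text and not by position, it does not matter whether Experience comes first, second, or last. A call such as extract_section_from_pdf("resume.pdf", "Experience") would return the body of that section, or None if the heading is not found ("resume.pdf" is a placeholder path).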