python pdf unicode pypdf

Whitespace gone from PDF extraction, and strange word interpretation


Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!


Solution

  • Your PDF file doesn't contain printable space characters; it simply positions the words where they need to go. You'll have to do extra work to figure out where the spaces belong, perhaps by assuming that multi-character runs are words and putting spaces between them.

    If you can select text in the PDF reader and the spaces appear properly, then at least you know there is enough information in the file to reconstruct the text.

    "fi" is a typographic ligature: the letters f and i drawn as a single character (U+FB01 in Unicode). You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute the plain letters "fi" for the ligature character.