Using the snippet below, I've attempted to extract the text data from this PDF file.
import pyPdf
def get_text(path):
# Load PDF into pyPDF
pdf = pyPdf.PdfFileReader(file(path, "rb"))
# Iterate pages
content = ""
for i in range(0, pdf.getNumPages()):
content += pdf.getPage(i).extractText() + "\n" # Extract text from page and add to content
# Collapse whitespace
content = " ".join(content.replace(u"\xa0", " ").strip().split())
return content
The output I obtain, however,is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal, here).
Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...
Does anybody know why this might be happening? I don't even know where to start!
Your PDF file doesn't have printable space characters, it simply positions the words where they need to go. You'll have to do extra work to figure out the spaces, perhaps by assuming multi-character runs are words, and put spaces between them.
If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.
"fi" is a typographic ligature, shown as a single character. You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute "fi" for the fi ligature.