
word count PDF files when walking directory


Hello Stackoverflow community!

I'm trying to build a Python program that walks a directory (and all sub-directories) and produces an accumulated word count across all .html, .txt, and .pdf files. Reading a .pdf file requires a little something extra (PdfFileReader) to parse it. When parsing a .pdf file I get the following error and the program stops:

AttributeError: 'PdfFileReader' object has no attribute 'startswith'

When not parsing .pdf files, the program completes successfully.

CODE

#!/usr/bin/python

import re
import os
import sys
import os.path
import fnmatch
import collections
from PyPDF2 import PdfFileReader


ignore = [<lots of words>]

def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore and len(x) > 2])

def search(path):
    print path
    counter = collections.Counter()

    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith(('.html', '.txt')):
                        print file
                        extract(os.path.join(root, file), counter)
                if file.lower().endswith(('.pdf')):
                    file_path = os.path.abspath(os.path.join(root, file))
                    print file_path

                    with open(file_path, 'rb') as f:
                        reader = PdfFileReader(f)
                        extract(os.path.join(root, reader), counter)
                        contents = reader.getPage(0).extractText().split('\n')
                        extract(os.path.join(root, contents), counter)
                        pass
    else:
        extract(path, counter)

    print(counter.most_common(50))

search(sys.argv[1])

The full error

Traceback (most recent call last):
  File line 50, in <module>
    search(sys.argv[1])
  File line 36, in search
    extract(os.path.join(root, reader), counter)
  File line 68, in join
    if b.startswith('/'):
AttributeError: 'PdfFileReader' object has no attribute 'startswith'

It appears there is a failure when calling the extract function with the .pdf file. Any help/guidance would be greatly appreciated!

Expected Results (works w/out .pdf files)

[('cyber', 5101), ('2016', 5095), ('date', 4912), ('threat', 4343)]

Solution

  • The problem is that this line

    reader = PdfFileReader(f)
    

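    returns an object of type PdfFileReader. You then pass this object to os.path.join() on its way to extract(). os.path.join() expects string path components, so it fails as soon as it calls startswith() on the object, which is the error in your traceback. extract() itself also expects a file path and not a PdfFileReader object.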

    My suggestion would be to move the PDF-related processing that you currently have in search() into extract() instead. Then, in extract(), check whether the file is a PDF and act accordingly. So, something like this:

    def extract(file_path, counter):
        lower_path = file_path.lower()
        if lower_path.endswith('.pdf'):
            # PdfFileReader wants a binary file object, so open the path here
            with open(file_path, 'rb') as f:
                reader = PdfFileReader(f)
                # Collect the text of every page, not only page 0
                text = ' '.join(reader.getPage(i).extractText()
                                for i in range(reader.getNumPages()))
        elif lower_path.endswith(('.html', '.txt')):
            with open(file_path) as f:
                text = f.read()
        else:
            return  # some other file type: skip it
        # Same word filter for every file type
        words = re.findall(r'\w+', text.lower())
        counter.update(x for x in words if x not in ignore and len(x) > 2)
    

    Haven't tested the code snippet above but hopefully you should get the idea.
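
To show how the whole thing might fit together once the PDF handling lives next to the other readers, here is a minimal sketch rather than a drop-in replacement. The names `IGNORE`, `READERS`, `read_plain_text`, `read_pdf_text` and `count_words` are made up for illustration, and the stop-word set is a placeholder for your own list. The PDF reader assumes an older PyPDF2 release where `PdfFileReader`, `getPage()`, `getNumPages()` and `extractText()` are available, as in your code; newer releases renamed these, so adjust if needed. The single-file case from your `else:` branch is left out for brevity.

```python
import collections
import os
import re

IGNORE = {"the", "and", "for"}  # placeholder stop-word set (hypothetical)


def read_plain_text(file_path):
    """Return the contents of a .txt/.html file as one string."""
    with open(file_path) as f:
        return f.read()


def read_pdf_text(file_path):
    """Return the text of every page of a PDF (assumes an older PyPDF2)."""
    from PyPDF2 import PdfFileReader  # imported lazily: only needed for PDFs
    with open(file_path, "rb") as f:
        reader = PdfFileReader(f)
        return " ".join(reader.getPage(i).extractText()
                        for i in range(reader.getNumPages()))


# Map each supported extension to the function that turns a path into text.
READERS = {
    ".txt": read_plain_text,
    ".html": read_plain_text,
    ".pdf": read_pdf_text,
}


def count_words(path, readers=READERS):
    """Walk `path` and count words in every file with a known extension."""
    counter = collections.Counter()
    for root, _dirs, files in os.walk(path):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            reader = readers.get(ext)
            if reader is None:
                continue  # unsupported file type
            # os.path.join only ever sees strings here, never a reader object
            text = reader(os.path.join(root, name))
            words = re.findall(r"\w+", text.lower())
            counter.update(w for w in words if w not in IGNORE and len(w) > 2)
    return counter
```

Calling `count_words(sys.argv[1]).most_common(50)` would then give the same kind of output as your current `search()`. Keeping the readers in a dictionary means a new file type only needs one more entry, and the walking loop never has to know what a PDF is.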