python-3.x text-parsing text-extraction pdfminer pdf-scraping

pdfminer: extract only text according to font size


I only want to extract text with font size 9.800000000000068 or 10.000000000000057 from my PDF files. The code below returns, for one PDF file, a list holding the font size and the text of each text block.

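from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

# `path` is the path to a single PDF file
Extract_Data=[]
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        # Ends up holding the size of the last character in the block
                        Font_size=character.size
            Extract_Data.append([Font_size,(element.get_text())])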

This gives me an Extract_Data list with the various font sizes:

[[9.800000000000068, 'aaa\n'], [11.0, 'dffg\n'], [10.000000000000057, 'bbb\n'], [10.0, 'hs\n'], [8.0, '2\n']]

Example: my attempt to keep only font size 10.000000000000057:

Extract_Data=[]
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        if character.size == '10.000000000000057':
                            element.get_text()
                Extract_Data.append(element.get_text())
                Data = ''.join(map(str, Extract_Data))

This gives me a Data list with all of the text. How can I make it extract only the characters with font size 10.000000000000057?

['aaa\ndffg\nbbb\nhs\n2\n']

I also want to integrate this into a function that does the same for multiple files, resulting in a pandas DataFrame that has one row for each PDF. Desired output: [['aaa\n bbb\n']]. Converting pixels to points (int(character.size) * 72 / 96), as suggested elsewhere, did not help. Maybe it has something to do with this issue? https://github.com/pdfminer/pdfminer.six/issues/202

This is the function it would later be integrated into:

import os

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

directory = 'C:/Users/Sample/'
# Layout parameters must exist before the device that uses them
params = LAParams(detect_vertical=True, all_texts=True)
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    manager = PDFResourceManager()
    device = PDFPageAggregator(manager, laparams=params)
    interpreter = PDFPageInterpreter(manager, device)
    elements = []
    with open(os.path.join(directory, file), 'rb') as fh:
        parser = PDFParser(fh)
        document = PDFDocument(parser, '')
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed

        for page in PDFPage.create_pages(document):
            # Interpret the page, then walk the layout elements the device collected
            interpreter.process_page(page)
            for element in device.get_result():

Solution

  • pdfminer is the wrong tool for this.

    Use pdfplumber (https://github.com/jsvine/pdfplumber) instead. It uses pdfminer under the hood but adds utility functions for filtering objects (e.g. by font size, as you're trying to do), whereas pdfminer is primarily geared toward extracting all of the text.

    import pdfplumber

    def get_filtered_text(file_to_parse: str) -> str:
        with pdfplumber.open(file_to_parse) as pdf:
            page = pdf.pages[0]  # first page only
            # Keep every non-character object, plus characters whose font size is 9
            filtered_page = page.filter(
                lambda obj: not (obj["object_type"] == "char" and obj["size"] != 9)
            )
            return filtered_page.extract_text()

    print(get_filtered_text("./my_pdf.pdf"))
    

    The example above is simpler than your case because it only checks for font size 9.0, whereas you need

    9.800000000000068 and 10.000000000000057

    so the obj["size"] condition will be more complex in your case.

    obj["size"] has the datatype Decimal (from decimal import Decimal), so you will probably have to do something like obj["size"].compare(Decimal("9.800000000000068")) == 0. Pass the value as a string: a Decimal built directly from a float carries the float's full binary expansion and may not compare equal.
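
As a rough sketch of what that more complex condition could look like, the helper below builds a predicate for page.filter() that keeps characters matching any of several font sizes. The names make_size_filter, pdf_text_rows and TARGET_SIZES are made up for illustration and are not part of pdfplumber. The size is converted to float first, on the assumption that it may arrive as either a Decimal or a float. The tolerance is deliberately tiny, because in your data 10.0 and 10.000000000000057 must count as different sizes.

```python
import math
from decimal import Decimal
from pathlib import Path
from typing import Callable, Iterable, Union

Number = Union[int, float, Decimal]

# The two sizes from the question; replace with your own values.
TARGET_SIZES = (9.800000000000068, 10.000000000000057)


def make_size_filter(sizes: Iterable[Number], abs_tol: float = 1e-14) -> Callable[[dict], bool]:
    """Build a predicate for page.filter() that keeps chars whose size is in `sizes`."""
    targets = [float(s) for s in sizes]

    def keep(obj: dict) -> bool:
        # Mirror the example above: non-character objects are left alone.
        if obj.get("object_type") != "char":
            return True
        size = float(obj["size"])  # accepts Decimal or float
        # rel_tol=0.0 so that only the tiny absolute tolerance applies;
        # anything looser would treat 10.0 and 10.000000000000057 as equal.
        return any(math.isclose(size, t, rel_tol=0.0, abs_tol=abs_tol) for t in targets)

    return keep


def pdf_text_rows(directory: str, sizes: Iterable[Number] = TARGET_SIZES) -> list:
    """Return one {'file': ..., 'text': ...} dict per PDF found in `directory`."""
    import pdfplumber  # third-party; imported here so make_size_filter works without it

    keep = make_size_filter(sizes)
    rows = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        with pdfplumber.open(str(pdf_path)) as pdf:
            # extract_text() may return None for a page with no text left
            texts = [(page.filter(keep).extract_text() or "") for page in pdf.pages]
        rows.append({"file": pdf_path.name, "text": "\n".join(texts)})
    return rows
```

With pandas installed, pd.DataFrame(pdf_text_rows('C:/Users/Sample/')) should then give one row per PDF, with the file name and the filtered text as columns. If your sizes turn out to vary slightly between files, loosening abs_tol is the knob to turn, at the cost of possibly matching neighbouring sizes such as 10.0.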