node.js pdf-parsing

Why can't this PDF be parsed by PDF parsing packages?


I have a PDF document that I'm attempting to parse into text. It is a non-profit financial document in the public domain, so it is safe to share.

Sample page on Google Storage

I've attempted to parse the document using a common npm package called pdf-parse, but it doesn't output any text.

JavaScript code:

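const fs = require('fs');
const pdf = require('pdf-parse');

// Read the whole file into a Buffer; pdf-parse accepts a Buffer as input.
const dataBuffer = fs.readFileSync('./sample-one-page.pdf');

pdf(dataBuffer).then(function(data) {
    console.log(data.numpages);   // number of pages
    console.log(data.numrender);  // number of rendered pages
    console.log(data.info);       // PDF info dictionary
    console.log(data.metadata);   // PDF metadata, if any
    console.log(data.version);    // PDF.js version used by pdf-parse
    console.log(data.text);       // extracted text
}).catch(function(err) {
    // Surface parse failures instead of leaving an unhandled rejection.
    console.error(err);
});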

The script correctly reports the page count and the document info, but the extracted text is empty. The output of running this script is below.

1
1
{
  PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Title: 'PDF TIFF Wrapper',
  Author: 'Awesome Donald',
  Creator: 'ServiceFileCopy',
  Producer: 'macOS Version 13.3.1 (Build 22E261) Quartz PDFContext',
  CreationDate: "D:20230504152855Z00'00'",
  ModDate: "D:20230504152855Z00'00'"
}
null
1.10.100

I've verified that the script works with other PDF documents, and I've replicated the issue with a Python PDF parsing library, pypdf (https://pypi.org/project/pypdf/).

Is there something about this document that prevents text extraction?


Solution

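  • It's just a PDF wrapper around a TIFF image, so the page has no text layer for a parser to extract. The Title field in the output, 'PDF TIFF Wrapper', hints at this. Getting the text out would require OCR. I didn't realize that until Dave pointed it out.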
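
A quick way to catch this kind of file early is to check whether a parse result has pages but no text. The following is only a minimal sketch: looksImageOnly is a hypothetical helper, not part of pdf-parse, and the sample object only mimics the shape of the result printed in the question, with the text value assumed to be blank.

```javascript
// Hypothetical helper (not part of pdf-parse): flags a parse result that has
// pages but no extractable text, which is typical of scanned or image-only PDFs.
function looksImageOnly(data) {
    // Treat a missing or non-string text field as empty.
    const text = typeof data.text === 'string' ? data.text : '';
    return data.numpages > 0 && text.trim().length === 0;
}

// Object shaped like the result in the question; the blank text is an assumption.
const sample = {
    numpages: 1,
    info: { Title: 'PDF TIFF Wrapper' },
    text: '\n\n'
};

console.log(looksImageOnly(sample)); // true
```

In the original script, the same check could be applied to the data object inside the then callback. A true result only suggests the pages are images; it does not prove it, since a PDF can also hold text in a form a given extractor fails to decode.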