I have a PDF document that I'm trying to parse into text. It is a public-domain financial document from a non-profit and is safe to share.
I've tried parsing it with a common npm package called pdf-parse, but it doesn't output any text.
JavaScript code:
const fs = require('fs');
const pdf = require('pdf-parse');

// Read the whole PDF into memory as a Buffer.
const dataBuffer = fs.readFileSync('./sample-one-page.pdf');

// Parse the PDF, then print its metadata followed by the extracted text.
pdf(dataBuffer).then(function(data) {
  console.log(data.numpages);
  console.log(data.numrender);
  console.log(data.info);
  console.log(data.metadata);
  console.log(data.version);
  console.log(data.text);
});
The script accurately detects the number of pages and all of the other metadata, but it doesn't parse the text. The output of running this script is below.
1
1
{
  PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Title: 'PDF TIFF Wrapper',
  Author: 'Awesome Donald',
  Creator: 'ServiceFileCopy',
  Producer: 'macOS Version 13.3.1 (Build 22E261) Quartz PDFContext',
  CreationDate: "D:20230504152855Z00'00'",
  ModDate: "D:20230504152855Z00'00'"
}
null
1.10.100
I've verified that the script works with other PDF documents, and I've reproduced the issue with a Python PDF-parsing library (https://pypi.org/project/pypdf/).
Is there something about this document that prevents text extraction?
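It's just a PDF wrapper around a TIFF image (the Title field in the output above even says 'PDF TIFF Wrapper'). The page holds a picture of the text rather than a text layer, so there is nothing for pdf-parse or pypdf to extract; getting the text out would take OCR. I didn't realize that until Dave pointed it out.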
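
For anyone who wants to catch this case in a script, here is a rough sketch of a check on the result object. It assumes the result has the numpages and text fields used in the question; looksImageOnly is a hypothetical helper name, not part of pdf-parse, and the two sample objects are made-up stand-ins rather than real parser output.

```javascript
// Hypothetical helper (not part of pdf-parse): returns true when a parsed
// result reports at least one page but contains no non-whitespace text.
function looksImageOnly(data) {
  const text = (data.text || '').trim();
  return data.numpages > 0 && text.length === 0;
}

// Made-up stand-ins shaped like the result object from the question.
const scanned = { numpages: 1, text: '\n\n' };
const textual = { numpages: 1, text: '\n\nSome extracted text' };

console.log(looksImageOnly(scanned));
console.log(looksImageOnly(textual));
```

In the original script this could be called inside the .then callback, e.g. looksImageOnly(data), to decide whether to hand the file to an OCR step instead. An empty result is only a hint that the pages are images, not proof.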