Hi, I am trying to read a PDF in Node.js. When I try to read this particular PDF, it shows the following error:
(while reading XRef): Error: Invalid XRef stream header
Error: Error: Invalid XRef stream header
at error (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:195:9)
at XRef_readXRef [as readXRef] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5692:9)
at XRef_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5280:28)
at PDFDocument_setup [as setup] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4622:17)
at PDFDocument_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4506:12)
at LocalPdfManager_ensure [as ensure] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32515:24)
at LocalPdfManager.BasePdfManager_ensureModel [as ensureModel] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32451:19)
at Object.eval [as onResolve] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:27142:22)
at Object.runHandlers (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:864:35)
at listOnTimeout (internal/timers.js:557:17)
Error: Invalid XRef stream header
error: { parserError: 'Error: Invalid XRef stream header' }
Here is my code as well:
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("./GeM-Bidding-3342395.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
But when I try to parse the same PDF using online parsers, it gets parsed, and here is a sample of it. If this approach cannot work, please also suggest how I can extract the data another way, using an API or something similar.
From any OS console (Linux, macOS, Windows), the easiest way to extract text from a PDF is to use either variant of the pdftotext utility command:
- Xpdf or Poppler (generally 64-bit); Windows binary here
To export, say, the first two pages to the console, use: pdftotext -nopgbrk -f 1 -l 2 GeM-Bidding-3342395.pdf -
To save to a file, use a filename in place of the final -,
or pipe the output to another command.
The order of the output text can vary depending on the options, so the command above, without modification, looks like this:
However, if I add -layout in the Poppler version, it's more like this:
There are other options in the Xpdf version, such as -table, -simple and -simple2, so you need to pick the one that best suits your needs.
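Since the question is about Node.js, the same pdftotext command can be called from a script instead of the console. The following is only a minimal sketch, assuming pdftotext (Xpdf or Poppler) is installed and on your PATH. The helper names buildPdftotextArgs and extractText are made up for this illustration and are not part of any library.

```javascript
const { execFile } = require("child_process");

// Hypothetical helper: builds the argument list for pdftotext.
// Only flags mentioned above are used: -nopgbrk, -f, -l and -layout.
function buildPdftotextArgs(pdfPath, { first, last, layout = false } = {}) {
  const args = ["-nopgbrk"];
  if (first !== undefined) args.push("-f", String(first));
  if (last !== undefined) args.push("-l", String(last));
  if (layout) args.push("-layout");
  // A trailing "-" in place of the output filename sends the text to stdout.
  args.push(pdfPath, "-");
  return args;
}

// Hypothetical helper: runs pdftotext and resolves with the extracted text.
// Assumes the pdftotext binary can be found on PATH.
function extractText(pdfPath, options) {
  return new Promise((resolve, reject) => {
    execFile(
      "pdftotext",
      buildPdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

For the first two pages of the file in the question, you could call extractText("./GeM-Bidding-3342395.pdf", { first: 1, last: 2 }).then(console.log, console.error). Using execFile with an argument array rather than a shell string avoids quoting problems with filenames that contain spaces.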