javascript node.js pdf-parsing

(while reading XRef): Error: Invalid XRef stream header?


Hi, I am trying to read a PDF in Node.js. When I try to read this PDF, it shows the following error:

(while reading XRef): Error: Invalid XRef stream header
Error: Error: Invalid XRef stream header
    at error (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:195:9)
    at XRef_readXRef [as readXRef] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5692:9)
    at XRef_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:5280:28)
    at PDFDocument_setup [as setup] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4622:17)
    at PDFDocument_parse [as parse] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:4506:12)
    at LocalPdfManager_ensure [as ensure] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32515:24)
    at LocalPdfManager.BasePdfManager_ensureModel [as ensureModel] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:32451:19)
    at Object.eval [as onResolve] (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:27142:22)
    at Object.runHandlers (eval at <anonymous> (/home/satyaarth/Desktop/react/baby/node_modules/pdf2json/lib/pdf.js:62:1), <anonymous>:864:35)
    at listOnTimeout (internal/timers.js:557:17)
Error: Invalid XRef stream header
error: { parserError: 'Error: Invalid XRef stream header' }

Here is my code as well:

import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("./GeM-Bidding-3342395.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});

But when I try to parse the same PDF using online parsers, it gets parsed without errors. If this approach cannot work, please also suggest how I can extract the data another way, for example through an API.


Solution

  • From any OS console (Linux, macOS, Windows), the easiest way to parse a PDF is to use the pdftotext command-line utility, from either Xpdf or Poppler (generally 64-bit; Windows binaries are available).

    To export, say, the first two pages to the console, use pdftotext -nopgbrk -f 1 -l 2 GeM-Bidding-3342395.pdf - (the trailing - sends the text to standard output). To save to a file, use a filename in place of the -, or pipe the output to another command.

    The order of the output text can vary depending on the options, so the command above, unmodified, produces this: [screenshot of the output]

    However, if I add -layout in the Poppler version, it looks more like this: [screenshot of the output]

    There are other options in the Xpdf version, such as -table, -simple and -simple2, so you need to pick the one best suited to your needs. [screenshot of the output]
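
Since the question is about Node.js, the same command could be driven from a script instead of typed in the console. Below is a minimal sketch, assuming pdftotext (Poppler or Xpdf) is installed and on the PATH. The helper names buildPdftotextArgs and extractText are hypothetical, made up for this illustration, and are not part of any library.

```javascript
const { execFile } = require("node:child_process");

// Hypothetical helper: builds the pdftotext argument list used above.
// The trailing "-" tells pdftotext to write the text to standard output.
function buildPdftotextArgs(pdfPath, { first, last, layout = false } = {}) {
  const args = ["-nopgbrk"];
  if (first !== undefined) args.push("-f", String(first)); // first page
  if (last !== undefined) args.push("-l", String(last));   // last page
  if (layout) args.push("-layout");                        // keep physical layout
  args.push(pdfPath, "-");
  return args;
}

// Hypothetical helper: runs pdftotext (assumed to be on the PATH)
// and resolves with the extracted text.
function extractText(pdfPath, options) {
  return new Promise((resolve, reject) => {
    execFile(
      "pdftotext",
      buildPdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large text output
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Example call (commented out so the sketch has no side effects):
// extractText("./GeM-Bidding-3342395.pdf", { first: 1, last: 2, layout: true })
//   .then((text) => console.log(text))
//   .catch((err) => console.error("error:", err.message));
```

execFile is used rather than exec so the file name is passed as a separate argument and never goes through a shell. If the binary is not installed, the promise rejects with the spawn error instead of throwing.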