javascript typescript deno pdfjs-dist

Is there a way to fetch a PDF from a URL and extract its text in Deno?


I am trying to create a Supabase edge function that reads a PDF file from a URL and returns its text, but I can't find any working libraries for the Deno environment.

This is what I tried so far:

import { PDFDocument } from 'https://cdn.skypack.dev/pdf-lib';

async function fetchPDF(url: string): Promise<Uint8Array> {
    const response = await fetch(url);
    const data = await response.arrayBuffer();
    return new Uint8Array(data);
}

async function readPDFText(url: string): Promise<string> {
    const pdfBytes = await fetchPDF(url);
    const pdfDoc = await PDFDocument.load(pdfBytes);
    const pages = pdfDoc.getPages();

    let text = '';
    for (const page of pages) {
        const content = await page.extractText();
        text += content;
    }

    return text;
}

const pdfUrl = 'URL_GOES_HERE';
const pdfText = await readPDFText(pdfUrl);
console.log(pdfText);

However, I get a TypeError saying that .extractText() is not a function. I also tried getTextContent(), with the same error.


Solution

  • That library (pdf-lib) does not support text extraction:

    It is not currently possible to parse plain text out of a document with pdf-lib (but you can extract the content of acroform fields). I'd suggest you consider using PDF.js to parse/extract text.

    Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
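For reference, here is a rough sketch of what per-page extraction looks like with a PDF.js-style document object. It assumes the document comes from something like `getDocument({ data }).promise` in pdfjs-dist; the import specifier and any worker setup needed under Deno are not shown here and would have to be worked out separately. The interface and function names (`DocumentLike`, `itemsToText`, `extractAllText`) are hypothetical and only describe the small part of the API the loop relies on.

```typescript
// Structural types covering only the parts of the PDF.js API used below.
// These are hypothetical stand-ins, not the library's own type names.
interface TextItemLike { str?: string }
interface PageLike { getTextContent(): Promise<{ items: TextItemLike[] }> }
interface DocumentLike { numPages: number; getPage(n: number): Promise<PageLike> }

// Join the text runs of one page. Items without a `str` field
// (marked-content entries) are skipped.
function itemsToText(items: TextItemLike[]): string {
  return items
    .map((item) => item.str)
    .filter((s): s is string => typeof s === "string")
    .join(" ");
}

// Walk every page (PDF.js pages are 1-indexed) and collect its text.
async function extractAllText(doc: DocumentLike): Promise<string> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return pages.join("\n");
}
```

Joining runs with a single space is a simplification; PDF.js reports positioned text runs, so reconstructing exact line breaks would need more work.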

    As an alternative, you could use any npm package that provides text extraction; Deno can import npm packages through npm: specifiers.

    Here's a working example using pdf-parse:

    import pdf from 'npm:pdf-parse/lib/pdf-parse.js'
    
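    // Download the PDF and return its plain text.
    async function extractTextFromPDF(pdfUrl: string): Promise<string> {
        const response = await fetch(pdfUrl);
        if (!response.ok) {
            throw new Error(`Failed to fetch PDF: ${response.status} ${response.statusText}`);
        }
        // pdf-parse resolves to an object whose `text` property holds the extracted text.
        const data = await pdf(await response.arrayBuffer());
        return data.text;
    }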
    
    const pdfUrl = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf';
    const pdfText = await extractTextFromPDF(pdfUrl);
    console.log(pdfText);
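Since the question is about an edge function, here is a minimal sketch of how the extractor might be wrapped in an HTTP handler. It assumes the caller sends a JSON body of the form `{ "url": "..." }`; that request shape, the status codes, and the names `parsePdfUrl` and `makeHandler` are illustrative choices, not something Supabase prescribes.

```typescript
// Accept only absolute http(s) URLs; return null for anything else.
function parsePdfUrl(input: unknown): URL | null {
  if (typeof input !== "string") return null;
  try {
    const url = new URL(input);
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    return null;
  }
}

type Extractor = (pdfUrl: string) => Promise<string>;

const jsonHeaders = { "Content-Type": "application/json" };

// Build a request handler around any extractor function.
function makeHandler(extract: Extractor) {
  return async (req: Request): Promise<Response> => {
    // A missing or malformed JSON body is treated like a missing URL.
    const body = await req.json().catch(() => null);
    const url = parsePdfUrl(body?.url);
    if (!url) {
      return new Response(
        JSON.stringify({ error: "Expected a JSON body with an http(s) `url`" }),
        { status: 400, headers: jsonHeaders },
      );
    }
    try {
      const text = await extract(url.href);
      return new Response(JSON.stringify({ text }), { headers: jsonHeaders });
    } catch (err) {
      // Fetching or parsing the remote PDF failed.
      return new Response(JSON.stringify({ error: String(err) }), {
        status: 502,
        headers: jsonHeaders,
      });
    }
  };
}
```

In an edge function, the handler would presumably be registered with something like `Deno.serve(makeHandler(extractTextFromPDF))`. Passing the extractor in as a parameter keeps the handler easy to test with a stub.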