I am trying to create a Supabase Edge Function that reads a PDF file from a URL and returns its text, but I can't find any working libraries for the Deno environment.
This is what I have tried so far:
import { PDFDocument } from 'https://cdn.skypack.dev/pdf-lib';

async function fetchPDF(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  const data = await response.arrayBuffer();
  return new Uint8Array(data);
}

async function readPDFText(url: string): Promise<string> {
  const pdfBytes = await fetchPDF(url);
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const pages = pdfDoc.getPages();
  let text = '';
  for (const page of pages) {
    const content = await page.extractText();
    text += content;
  }
  return text;
}

const pdfUrl = 'URL_GOES_HERE';
const pdfText = await readPDFText(pdfUrl);
console.log(pdfText);
However, I get a TypeError saying that .extractText() is not a function. I also tried getTextContent() and got the same error.
That library (pdf-lib) does not support text extraction, which is why neither method exists on its page objects:
It is not currently possible to parse plain text out of a document with pdf-lib (but you can extract the content of acroform fields). I'd suggest you consider using PDF.js to parse/extract text.
Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
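For what it's worth, the PDF.js route suggested above could look roughly like the sketch below. It relies only on the parts of the PDF.js document API I'm confident about (`numPages`, the 1-indexed `getPage()`, and `getTextContent()` returning `items` with a `str` field). The `*Like` interfaces and the `extractText` name are hypothetical, introduced so the function can be exercised without loading the library, and the import path in the comment is an assumption that may differ between pdfjs-dist versions.

```typescript
// Minimal structural types covering only the parts of the PDF.js API used here.
// These interface names are made up for this sketch; they are not PDF.js exports.
interface TextItemLike {
  str?: string;      // text of the item (marked-content items carry no `str`)
  hasEOL?: boolean;  // set on text items that end a line (newer PDF.js versions)
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Concatenate the text of every page. PDF.js pages are 1-indexed.
async function extractText(doc: DocLike): Promise<string> {
  const pages: string[] = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => (item.str ?? "") + (item.hasEOL ? "\n" : ""))
      .join("");
    pages.push(pageText);
  }
  return pages.join("\n");
}

// Assumed usage in Deno (import path not verified for every version):
//   import { getDocument } from "npm:pdfjs-dist/legacy/build/pdf.mjs";
//   const doc = await getDocument({ data: pdfBytes }).promise;
//   console.log(await extractText(doc));
```

Taking a structural type instead of the real PDF.js document keeps the helper independent of how the library is imported, which is the part most likely to need adjusting in an Edge Function.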
As an alternative, you could use any npm package that provides text extraction, since Deno can import npm packages through the npm: specifier.
Here's a working example using pdf-parse:
import pdf from 'npm:pdf-parse/lib/pdf-parse.js';

// Download the PDF and return its text content.
async function extractTextFromPDF(pdfUrl: string): Promise<string> {
  const response = await fetch(pdfUrl);
  // pdf-parse resolves to an object whose `text` property holds the extracted text.
  const data = await pdf(await response.arrayBuffer());
  return data.text;
}

const pdfUrl = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf';
const pdfText = await extractTextFromPDF(pdfUrl);
console.log(pdfText);
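The example above passes whatever the server returns straight to the parser. In an Edge Function it may be worth checking the download first, so that a 404 page or an HTML error body fails with a clear message instead of a parser error. Here is a minimal sketch of that idea; `looksLikePdf` and `fetchPdfBytes` are hypothetical helper names, and the check is deliberately strict (it requires the `%PDF-` header at byte 0, although some readers tolerate a few leading bytes before it).

```typescript
// "%PDF-" as bytes: every well-formed PDF file starts with this header.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Return true if the buffer starts with the PDF header.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((b, i) => bytes[i] === b);
}

// Fetch a URL and return its bytes, rejecting non-2xx responses and non-PDF bodies.
async function fetchPdfBytes(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status} ${response.statusText}`);
  }
  const bytes = new Uint8Array(await response.arrayBuffer());
  if (!looksLikePdf(bytes)) {
    throw new Error("Response body does not look like a PDF");
  }
  return bytes;
}
```

The bytes returned by `fetchPdfBytes` could then be handed to the `pdf(...)` call in place of `await response.arrayBuffer()`.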