pdf, text-extraction, pdf-scraping

What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?


Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.

Something that works with C# or classic ASP (VBScript) would be ideal, and I also need to be able to separate the pages of the PDF.

This question had some interesting suggestions, especially pdftotext, but I'd like to avoid calling an external command-line app if I can.


Solution

  • You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It's a COM interface, so you would have to use the .NET interop facilities.

    You'd also have to download the free PDF IFilter driver from Adobe.
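
As a rough illustration of the interop involved, here is a minimal C# sketch. It assumes a PDF IFilter is installed and registered for the process's bitness, and that the declarations below (transcribed by hand from the Windows SDK header filter.h) are accurate. The class name PdfIFilterSketch and the helper ExtractText are made up for this example, and error handling is deliberately thin.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Struct layouts follow filter.h; only the fields read below matter here.
[StructLayout(LayoutKind.Sequential)]
struct PROPSPEC { public uint ulKind; public IntPtr data; } // union: PROPID or LPWSTR

[StructLayout(LayoutKind.Sequential)]
struct FULLPROPSPEC { public Guid guidPropSet; public PROPSPEC psProperty; }

[StructLayout(LayoutKind.Sequential)]
struct STAT_CHUNK
{
    public uint idChunk;
    public int breakType;            // CHUNK_BREAKTYPE
    public int flags;                // CHUNKSTATE: 1 = CHUNK_TEXT, 2 = CHUNK_VALUE
    public uint locale;
    public FULLPROPSPEC attribute;
    public uint idChunkSource;
    public uint cwcStartSource;
    public uint cwcLenSource;
}

[StructLayout(LayoutKind.Sequential)]
struct FILTERREGION { public uint idChunk; public uint cwcStart; public uint cwcExtent; }

[ComImport, Guid("89BCB740-6119-101A-BCB7-00DD010655AF"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IFilter
{
    [PreserveSig] int Init(uint grfFlags, uint cAttributes, IntPtr aAttributes, out uint pdwFlags);
    [PreserveSig] int GetChunk(out STAT_CHUNK pStat);
    [PreserveSig] int GetText(ref uint pcwcBuffer, IntPtr awcBuffer);
    [PreserveSig] int GetValue(out IntPtr ppPropValue);
    // Never called in this sketch; declared only to keep the vtable order intact.
    [PreserveSig] int BindRegion(FILTERREGION origPos, ref Guid riid, out IntPtr ppunk);
}

static class PdfIFilterSketch
{
    const int CHUNK_TEXT = 1;
    const int FILTER_S_LAST_TEXT = 0x00041709;
    // CANON_PARAGRAPHS | HARD_LINE_BREAKS | CANON_HYPHENS | CANON_SPACES | APPLY_INDEX_ATTRIBUTES
    const uint InitFlags = 1 | 2 | 4 | 8 | 16;
    const int BufferChars = 4096;

    [DllImport("query.dll", CharSet = CharSet.Unicode)]
    static extern int LoadIFilter(string pwcsPath, IntPtr pUnkOuter,
        [MarshalAs(UnmanagedType.Interface)] out IFilter ppIUnk);

    // Hypothetical helper: concatenates every text chunk the filter emits.
    public static string ExtractText(string path)
    {
        IFilter filter;
        int hr = LoadIFilter(path, IntPtr.Zero, out filter);
        if (hr != 0 || filter == null)
            throw new InvalidOperationException("No IFilter could be loaded for " + path);

        IntPtr buffer = Marshal.AllocCoTaskMem(BufferChars * 2); // UTF-16 code units
        try
        {
            uint initResult;
            Marshal.ThrowExceptionForHR(filter.Init(InitFlags, 0, IntPtr.Zero, out initResult));

            var text = new StringBuilder();
            STAT_CHUNK chunk;
            // Simplification: stop at the first non-S_OK result, which includes
            // FILTER_E_END_OF_CHUNKS as well as any real error.
            while (filter.GetChunk(out chunk) == 0)
            {
                if ((chunk.flags & CHUNK_TEXT) == 0) continue; // skip value-only chunks
                while (true)
                {
                    uint count = BufferChars;                  // in: capacity, out: chars written
                    hr = filter.GetText(ref count, buffer);
                    if (hr < 0) break;                         // FILTER_E_NO_MORE_TEXT or an error
                    text.Append(Marshal.PtrToStringUni(buffer, (int)count));
                    if (hr == FILTER_S_LAST_TEXT) break;
                }
                text.AppendLine();
            }
            return text.ToString();
        }
        finally
        {
            Marshal.FreeCoTaskMem(buffer);
            Marshal.ReleaseComObject(filter);
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine(ExtractText(args[0]));
    }
}
```

Note that IFilter hands back text in chunks and has no notion of pages, so a chunk boundary should not be assumed to be a page boundary. Because the flags above include index attributes, text-valued properties such as the title may show up mixed in with the body text.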