
Empty whitespace conversion in PDFClown


I'm having an issue with the TextExtractor class in PDFClown when a document contains empty whitespace characters, also known as "discretionary newlines". These characters are embedded at seemingly random positions but are ignored by Acrobat Reader. As a result, text that displays as a single line in Acrobat is broken into many lines when it is extracted, if I specify '\n' as the newline character in TextExtractor.ToString(...).

It appears that PDFClown simply takes any whitespace character and converts it into a single space (' '). Is there a way to bypass this conversion, so that the original character is extracted instead?


Solution

  • After more research, it appears that the PDFClown library is very buggy. There are several issues:

    To come directly to the issue I had: you can detect and remove these "false" whitespace characters by checking whether their bounding rectangle overlaps that of other, non-whitespace characters. Given all the other issues with the library, though, my advice is to use PDFBox instead.

    If you're using .NET and you'd like to use PDFBox, you can use Tika On Dot Net which is the Apache Tika project brought over to .NET via IKVM.

    Apache Tika is a collection of other libraries, including PDFBox. Tika On Dot Net currently ships PDFBox 1.8.10 and also has a NuGet package, which makes it easy to add to your project.

    I had a project go 1.5 weeks over deadline because all of these issues were discovered halfway through, which required a full rewrite. Just a heads up.
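
The bounding-rectangle check mentioned above can be sketched roughly as follows. This is a library-agnostic illustration in Python, not PDFClown or PDFBox code: the `Glyph` record, the `drop_false_whitespace` helper, and the 50% overlap threshold are all assumptions made for the example. In practice you would fill the glyph list from whatever per-character text and box data your extraction library exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical record: one extracted character plus its bounding box."""
    char: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def overlap_area(a: Glyph, b: Glyph) -> float:
    """Area shared by two boxes; 0.0 if they only touch or are disjoint."""
    w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return w * h if w > 0 and h > 0 else 0.0


def drop_false_whitespace(glyphs, min_overlap=0.5):
    """Drop whitespace glyphs that sit on top of a visible glyph.

    Assumption: a whitespace glyph is "false" when at least `min_overlap`
    of its box is covered by a single non-whitespace glyph, or when its
    box has no area at all.
    """
    visible = [g for g in glyphs if not g.char.isspace()]
    kept = []
    for g in glyphs:
        if g.char.isspace():
            area = g.width * g.height
            if area <= 0:
                continue  # zero-sized whitespace cannot be a real gap
            if any(overlap_area(g, v) >= min_overlap * area for v in visible):
                continue  # box is mostly covered by a visible character
        kept.append(g)
    return kept


# Made-up sample: the "\n" lies inside the box of "a", the " " is a real gap.
sample = [
    Glyph("a", 0, 0, 5, 10),
    Glyph("\n", 1, 0, 3, 10),
    Glyph("b", 5, 0, 5, 10),
    Glyph(" ", 10, 0, 3, 10),
    Glyph("c", 13, 0, 5, 10),
]
cleaned = "".join(g.char for g in drop_false_whitespace(sample))
print(cleaned)  # prints: ab c
```

The threshold is a judgment call. Real spaces normally sit between neighbouring glyphs and overlap them little or not at all, so a fairly high threshold should keep them while discarding whitespace that is drawn over visible text.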