parsing, pdf, text-extraction, cgpdfscanner

Parsing a PDF returns the same text twice on different pages


I have a PDF file that contains 2 pages. When I parse it with my parser, written in Objective-C, I run into the following situation.

For the first page everything is OK: I get the text I expect (the text I see in PDF readers such as Preview or Adobe Reader). For the second page I get the text that I see on the second page PLUS part of the text from the first page, which is not visible on the second page.

I tried other parsers. pdftotext (xpdf) produced the correct result. With pdfminer (Python, https://pypi.python.org/pypi/pdfminer/) I got the same result as with my own parser: part of the text from the first page is extracted twice.

My question is: how can this happen? Have you ever seen this situation? If the text is really present on the second page, why don't PDF readers show it? Do you have any thoughts about this?


Solution

  • I ran your file through Acrobat (using "Examine Document") and it tells me there is some hidden text in it. Take a look at the following screenshot:

    [Screenshot: Acrobat "Examine Document" results, with the hidden text marked in red]

    The text in red in the screenshot marks what is hidden. As mkl indicates, it is positioned OUTSIDE the MediaBox, which makes it invisible when you look at the document in a PDF viewer. That doesn't mean the text isn't there: if you look inside the content stream (which is what parsers do), you'll still find it.

    Your parser should discard everything that is outside the MediaBox. Normally there's an option to do that. I know there is one in iText; I don't know about other parsers.
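
If your parser has no such option, the filtering can be done by hand. The following is only a minimal sketch in Python, assuming your parser can report a bounding box for each piece of extracted text in the same coordinate space as the page's MediaBox. The names `Box`, `TextRun` and `visible_runs` are hypothetical and do not come from any particular library, and the coordinates are made-up sample values.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    """A rectangle given by two opposite corners, (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


class TextRun(NamedTuple):
    """Hypothetical parser output: a piece of text plus its bounding box."""
    text: str
    bbox: Box


def normalized(box: Box) -> Box:
    # A PDF rectangle may list its two corners in either order,
    # so sort them into lower-left / upper-right first.
    return Box(min(box.x0, box.x1), min(box.y0, box.y1),
               max(box.x0, box.x1), max(box.y0, box.y1))


def overlaps(a: Box, b: Box) -> bool:
    # True when the two rectangles share some area.
    a, b = normalized(a), normalized(b)
    return a.x0 < b.x1 and a.x1 > b.x0 and a.y0 < b.y1 and a.y1 > b.y0


def visible_runs(runs: Iterable[TextRun], media_box: Box) -> List[TextRun]:
    # Keep only the text whose box overlaps the MediaBox; text lying
    # entirely outside it is dropped.
    return [run for run in runs if overlaps(run.bbox, media_box)]


# Made-up sample data: a US Letter MediaBox and two text runs,
# the second one positioned above the top edge of the page.
media_box = Box(0, 0, 612, 792)
runs = [
    TextRun("Text shown on page 2", Box(72, 700, 300, 712)),
    TextRun("Leftover text from page 1", Box(72, 900, 300, 912)),
]
print([run.text for run in visible_runs(runs, media_box)])
# prints ['Text shown on page 2']
```

This sketch keeps text that only partly overlaps the MediaBox. Whether to keep or drop such partially clipped text is a design choice that depends on what you need from the extraction.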