pdfpdfboxxpdf

why from scanned documents, text can be extracted, but not image


I asked a similar question before, in stackoverflow. I wanted to ask another related question, so I am rephrasing the original question again.

I was using PDFBox to extract image and text from a pdf, available in skydrive and scribd. I had following code for extraction of text:

 PDFTextStripper p = new PDFTextStripper();
 String thistext=p.getText(document);

Which extracted the text properly. However, when I tried to extract images from the same pdf using ExtractImages class, the images produced were all pages of the pdf, not the actual images (which should be 1).

It appeared to me that the pdf could be a scanned document. The answer said the fact that it is scanned is your issue. I tried once more with pdftotext and pdfimages. The text is extracted, but pdfimages output 5 image files, which are all pages of the pdf (same as PDFBox).

As far I know, the raster images are stored as Xobjects in the pdf. When I opened the pdf with a text editor, I saw 5 appearances of following line:

<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799

Which is probably why PDFBox and XPDF output 5 pages of the pdf as image files. Then how is the text getting extracted from the pdf? Is there a technical documentation which mentions why (or how) text can be extracted from such a document, where the pages are "supposedly" embedded as XObjects. I can cite the documentation in my report.


Solution

  • Having inspected your PDF file the first guess in the comments to your question has been confirmed...

    Your sample document is scanned and essentially consists of one bitmap image per page. When you zoom into the document, you can quickly see that all content looks fairly pixel'ish.

    All the images have a resolution of 2600x3799 and are black and white.

    These images have furthermore been OCR'ed and the resulting text has been invisibly added to the pages which allows for selecting, copying & pasting.

    E.g. have a look at the top of page 885:

    top of your page 885

    Its content stream starts like this:

    1 0 0 1 -0.5998 -0.4801 cm
    1 1 1 rg
    1 i 
    /RelativeColorimetric ri
    /GS0 gs
    0 0 469.2 684.7 re
    f
    q
    467.9972 0 0 683.8015 0.6014 0.4492 cm
    /Im0 Do
    Q
    

    Here /Im0, the page image, is inserted

    1 0 0 1 0.5998 0.4801 cm
    0 0 0 rg
    BT
    /TT0 1 Tf
    3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm
    

    Here addition of text is prepared; especially have a look at 3 Tr: This oparation sets the text rendering mode to 3 which is Neither fill nor stroke text (invisible). (section 9.3.6 Text Rendering Mode in ISO 32000-1:2008)

    (A )Tj
    /TT1 1 Tf
    -0.01 Tc 8.8 0 0 9.5 43.4002 640.4199 Tm
    (%gust )Tj
    

    Here you see text added, starting with an 'A ' and an '%gust '. This actually shows that the result of the OCR'ing does not seem to have been properly checked as that should have been 'August'. The low quality text information continues:

    A %gust , 1978 SHORT PAPERS 885
    where
    and also
    Similarly for B. Also,
    T, = AY-l T
    as a result of the adiabatic cooling of the vapour.
    Stage 2:
    Here a volume of vapour and a volume of liquid I are removed and replaced with an
    equal volume of air containing concentrations Y and s of A and B, respectively. Of course,
    r or s may either or both be negligibly small, with subsequent simplification.
    

    As you see many special characters and formulas have not or not correctly been recognized.