pdf mupdf

Extracting figures (Form XObjects) from scientific papers


In scientific papers, figures are often separate PDF files that the LaTeX compiler embeds into the main file. I'm trying to extract this kind of figure from a PDF file.

Before I started exploring this topic, I knew little about the PDF format, so apologies if I have misunderstood anything.

Taking a publicly available paper as an example, I opened the file in a text editor and found that the figures are Form XObjects like the following one.

152 0 obj
<<
/Type /XObject
/Subtype /Form
/FormType 1
/PTEX.FileName (./graphs/spectrum.pdf)
/PTEX.PageNumber 1
/PTEX.InfoDict 203 0 R
/BBox [135.2168 266.4003 592.6775 381.8708]
/Resources <<
/ProcSet [ /PDF /Text ]
/ColorSpace <<
/Cs1 204 0 R
>>/Font << /TT2 205 0 R>>
>>
/Length 1052
/Filter /FlateDecode
>>
stream
[1052 bytes of FlateDecode-compressed binary stream data omitted]
endstream
endobj

As the file name suggests, this is figure 1 in the paper. I first tried to extract "graphs/spectrum.pdf" directly with mutool extract paper.pdf, but I only got font files.

Then I tried another approach: rendering the page and cropping the pixmap using the "/BBox" values. The script, which I run with mutool run script.js, is as follows.

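var pdf_path = "paper.pdf";
var doc = new Document(pdf_path);
var page = doc.loadPage(1);
// Pixmap bounds taken from the /BBox values of the Form XObject, truncated to integers
var submap = new Pixmap(DeviceRGB, [135, 266, 592, 381], true);
// Render the page into the pixmap without any additional transform
var dev = new DrawDevice(Identity, submap);
page.run(dev, Identity, true);
submap.saveAsPNG("result.png");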

But the resulting PNG does not show the right region of the page (see below). Now I'm confused by the BBox numbers...

expected result:

[image: expected result]

the result I got:

[image: result I got]

In a word, my questions are:

  1. How much does the segment I pasted above tell us? Does it contain all data of the original "spectrum.pdf"?
  2. Can I extract XObjects like the one above as standalone PDF files?
  3. How could I compute the final absolute location of the figures?

Solution

    1. How much does the segment I pasted above tell us? Does it contain all data of the original "spectrum.pdf"?

    No. First of all, only the material required to render the included page needs to be copied into the main file; the original file may contain other information that is not included.

    Furthermore, that segment contains references to other indirect objects (203 0 R, 204 0 R, 205 0 R), which in turn may reference yet more indirect objects. All of those objects are needed as well.
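
To illustrate what "all those objects" means, here is a minimal sketch in plain JavaScript that follows "N G R" references transitively. It is a toy model, not a PDF parser: the objects map, the function name collectReferenced and the contents of objects 203 to 207 are made up for illustration (only the reference numbers 203, 204 and 205 come from the segment above), and a real file additionally needs handling of strings, streams and object streams.

```javascript
// Toy model: object number -> dictionary text. All contents except the
// references 203/204/205 found in object 152 are hypothetical.
var objects = {
  152: "<< /Type /XObject /PTEX.InfoDict 203 0 R /Resources << " +
       "/ColorSpace << /Cs1 204 0 R >> /Font << /TT2 205 0 R >> >> >>",
  203: "<< /Producer (hypothetical) >>",
  204: "[ /ICCBased 206 0 R ]",
  205: "<< /Type /Font /FontDescriptor 207 0 R >>",
  206: "<< /N 3 >>",
  207: "<< /Type /FontDescriptor >>",
  300: "<< /Type /Page >>"
};

// Returns the sorted numbers of all objects reachable from rootNumber.
function collectReferenced(objects, rootNumber) {
  var seen = {};
  var pending = [rootNumber];
  while (pending.length > 0) {
    var num = pending.pop();
    if (seen[num] || !(num in objects)) {
      continue; // already visited, or not present in the toy map
    }
    seen[num] = true;
    // Naive match of "objectNumber generation R"; a real parser must
    // ignore such patterns inside strings and stream data.
    var refPattern = /(\d+)\s+(\d+)\s+R\b/g;
    var match;
    while ((match = refPattern.exec(objects[num])) !== null) {
      pending.push(parseInt(match[1], 10));
    }
  }
  return Object.keys(seen).map(Number).sort(function (a, b) { return a - b; });
}

var needed = collectReferenced(objects, 152);
```

In this toy map, object 300 is never reached from object 152, so it would not have to be copied.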

    2. Can I extract XObjects like the one above as standalone PDF files?

    You can create standalone PDF files from XObjects if you copy all related objects from the original file.

    Beware, though: they may look different from how they appear in the document they were extracted from. On the one hand, the caller of the XObject may set up the graphics state in a way that changes the rendering; on the other hand, content drawn afterwards may cover the area where the XObject has been drawn.

    3. How could I compute the final absolute location of the figures?

    In general, an XObject may be drawn any number of times on a page (I don't know whether LaTeX compilers make use of that, though). Thus, there may not be a single final absolute location: there may be none (if the XObject is not used at all) or several.

    The absolute position of a use of an XObject on a page can be calculated from the BBox entry of the XObject and the current transformation matrix (CTM, a part of the graphics state) at the time the XObject is invoked. To determine that CTM value, you have to parse the page content streams (and possibly other nested content streams, such as those of XObjects or patterns) and track the changes to the CTM up to the use of your XObject.
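
The CTM tracking described above can be sketched in plain JavaScript as follows. This is a simplified illustration under several assumptions: the content stream is already decompressed and contains only whitespace-separated numbers, names and operators (no strings, arrays or inline images); only the operators q, Q, cm and Do are interpreted; the XObject has no /Matrix entry (so the default identity applies); and the CTM starts as the identity. The resource name Fm1, the sample content stream and the function names are hypothetical.

```javascript
// Matrices are PDF-style arrays [a, b, c, d, e, f].
// Returns the product m x n, i.e. "apply m first, then n".
function concat(m, n) {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5]
  ];
}

// Returns one CTM per "/xobjectName Do" found in the content stream.
function findUses(contentStream, xobjectName) {
  var ctm = [1, 0, 0, 1, 0, 0];
  var stack = [];     // graphics state stack (only the CTM is kept here)
  var operands = [];  // operands collected since the last operator
  var uses = [];
  contentStream.split(/\s+/).forEach(function (token) {
    if (token.length === 0) {
      return;
    }
    if (token.charAt(0) === "/" || !isNaN(parseFloat(token))) {
      operands.push(token); // a name or a number
      return;
    }
    if (token === "q") {
      stack.push(ctm.slice());
    } else if (token === "Q" && stack.length > 0) {
      ctm = stack.pop();
    } else if (token === "cm" && operands.length >= 6) {
      // cm premultiplies: the new matrix is applied before the old CTM
      ctm = concat(operands.slice(-6).map(Number), ctm);
    } else if (token === "Do" &&
               operands[operands.length - 1] === "/" + xobjectName) {
      uses.push(ctm.slice());
    }
    operands = []; // every operator consumes its operands
  });
  return uses;
}

// Maps the four corners of rect through m and returns the enclosing
// axis-aligned rectangle [x0, y0, x1, y1].
function transformRect(rect, m) {
  var xs = [];
  var ys = [];
  [[rect[0], rect[1]], [rect[2], rect[1]],
   [rect[0], rect[3]], [rect[2], rect[3]]].forEach(function (p) {
    xs.push(m[0] * p[0] + m[2] * p[1] + m[4]);
    ys.push(m[1] * p[0] + m[3] * p[1] + m[5]);
  });
  return [Math.min.apply(null, xs), Math.min.apply(null, ys),
          Math.max.apply(null, xs), Math.max.apply(null, ys)];
}

// Hypothetical page content that draws the figure once, scaled and shifted.
var bbox = [135.2168, 266.4003, 592.6775, 381.8708];
var sample = "q 0.5 0 0 0.5 100 200 cm /Fm1 Do Q";
var locations = findUses(sample, "Fm1").map(function (ctm) {
  return transformRect(bbox, ctm);
});
```

An empty result means the XObject is never drawn by that stream, and several entries mean it is drawn several times. The rectangles are in PDF default user space, so they may still need converting to the coordinate system of whatever renderer is used.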