I am trying to extract vector graphics from a PDF file and create corresponding SVG files. I am using SVGOutputDev (https://github.com/immateriel/pdf2svg/blob/master/SVGOutputDev.cc) with the xpdf library for this purpose. SVGOutputDev does not implement clip path extraction, so I am trying to implement it myself. While I am able to extract the clip path definitions themselves, I am unable to determine which of these definitions apply to a normal stroke or fill region.
For instance, please refer to http://pastebin.com/jTdzv3YZ for the SVG I extracted from a page of a PDF, and the corresponding dump of the sequence of PDF graphics commands as seen during extraction. As seen from that SVG, there are multiple clip paths and one rectangular fill region. Even though there are multiple clip paths defined before the filled rectangle is defined, only the circular clip paths defined just before the rectangle definition are expected to be associated with the rectangle (going by how the PDF page was rendered on various PDF readers, which show only 2 black-filled circles in a white background).
The question is how does one know which clip paths are associated with a regular fill/stroke region defined in a PDF? FYI, I went through the relevant section of the PDF specification document but it wasn't very clear to me ("A clipping path operation may appear after the last path construction operator and before the path-painting operator that terminates a path object. Although the clipping path operator appears before the painting operator, it does not alter the clipping path at the point where it appears. Rather, it modifies the effect of the succeeding painting operator"). Can someone explain how to identify the relevant clip paths to apply to any normal path?
The question is how does one know which clip paths are associated with a regular fill/stroke region defined in a PDF?
In a nutshell: the clip area that applies is the intersection of all clip paths that have been defined by the time the fill or stroke operation is executed, except for those that were discarded by a Q (restore graphics state) operator.
Thus, your analysis for your sample file
Even though there are multiple clip paths defined before the filled rectangle is defined, only the circular clip paths defined just before the rectangle definition are expected to be associated with the rectangle (going by how the PDF page was rendered on various PDF readers, which show only 2 black-filled circles in a white background)
is wrong: it is not the last clip area alone but the intersection of all clip areas defined before the rectangle that determines the current one. Because each of those clip areas is contained in the preceding one, the intersection does indeed consist of just those two circles.
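The rule above can be sketched as a small state machine. This is only a minimal illustration in Python, not code from SVGOutputDev or xpdf: the function name `active_clips_per_paint` and the integer clip ids are hypothetical, and the input is assumed to be an already tokenized list of operators (operands, strings and inline images are ignored).

```python
# Minimal sketch: which clip paths are in effect for each painting operator.
# Assumption: `tokens` is a plain list of operator names in stream order.

PAINT_OPS = {"f", "F", "f*", "S", "s", "B", "B*", "b", "b*"}
CLIP_OPS = {"W", "W*"}
END_PATH_OPS = PAINT_OPS | {"n"}


def active_clips_per_paint(tokens):
    """Return (operator, [clip ids in effect]) for every painting operator."""
    stack = []            # one saved clip list per q
    clips = []            # clip ids of the current graphics state
    pending_clip = False  # W or W* seen, path not yet terminated
    next_id = 0
    result = []
    for tok in tokens:
        if tok == "q":
            stack.append(list(clips))
        elif tok == "Q":
            if stack:
                clips = stack.pop()
        elif tok in CLIP_OPS:
            pending_clip = True
        elif tok in END_PATH_OPS:
            if tok in PAINT_OPS:
                # Painting still uses the clip from before the pending W.
                result.append((tok, list(clips)))
            if pending_clip:
                # The new clip only takes effect after the path is terminated.
                clips.append(next_id)
                next_id += 1
                pending_clip = False
    return result


ops = "q W n q W n W n f Q f Q".split()
print(active_clips_per_paint(ops))  # [('f', [0, 1, 2]), ('f', [0])]
```

The first `f` sees all three clips defined so far; the second `f` comes after a `Q` and therefore only sees the clip that was in effect when the matching `q` was executed.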
In the documentation:
The graphics state shall contain a current clipping path that limits the regions of the page affected by painting operators. The closed subpaths of this path shall define the area that can be painted.
The initial clipping path shall include the entire page.
[Clipping Path Operators] modify the current clipping path by intersecting it with the current path, using the [nonzero winding number rule / even-odd rule] to determine which regions lie inside the clipping path.
There is no way to enlarge the current clipping path or to set a new clipping path without reference to the current one. However, since the clipping path is part of the graphics state, its effect can be localized to specific graphics objects by enclosing the modification of the clipping path and the painting of those objects between a pair of q and Q operators (see 8.4.2, "Graphics State Stack"). Execution of the Q operator causes the clipping path to revert to the value that was saved by the q operator before the clipping path was modified.
(section 8.5.4 in the current PDF specification ISO 32000-1)
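Since the target here is SVG, one way to express "intersect with the current clipping path" is to nest one group per active clip path, because nested `clip-path` attributes in SVG clip cumulatively. The following is a hedged sketch only; the helper names `clip_path_def` and `wrap_in_clips` and the ids `c0`, `c1` are made up for illustration and are not part of SVGOutputDev.

```python
# Minimal sketch: emit SVG so that several PDF clip paths intersect.
# Assumption: path data has already been converted to SVG path syntax.

def clip_path_def(clip_id, path_data, even_odd=False):
    """Build a <clipPath> definition; W* maps to clip-rule="evenodd"."""
    rule = ' clip-rule="evenodd"' if even_odd else ""
    return '<clipPath id="{}"><path d="{}"{}/></clipPath>'.format(
        clip_id, path_data, rule)


def wrap_in_clips(element, clip_ids):
    """Wrap an element in one <g> per active clip path, outermost first."""
    for clip_id in reversed(clip_ids):
        element = '<g clip-path="url(#{})">{}</g>'.format(clip_id, element)
    return element


print(clip_path_def("c0", "M0 0H10V10H0Z"))
# <clipPath id="c0"><path d="M0 0H10V10H0Z"/></clipPath>
print(wrap_in_clips("<rect/>", ["c0", "c1"]))
# <g clip-path="url(#c0)"><g clip-path="url(#c1)"><rect/></g></g>
```

Nesting keeps the output close to the q/Q structure of the content stream: leaving a group corresponds to a Q that drops the clip paths added since the matching q.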
In action: Let's look at the content stream of the page of your document (which has a MediaBox of [0, 0, 595, 842]):
q
q
Pushes the graphics state twice.
0 842 m
0 0 l
595 0 l
595 842 l
h
W
n
Defines a clip path equivalent to the whole media box.
1 w
2 J
0 j
10 M
[]0 d
Defines general graphics state properties (line width, line cap style, line join style, miter limit, and dash pattern).
q
Pushes the graphics state again, this time with the explicitly set clip path and those other graphics properties.
0 718.5 m
595 718.5 l
595 123.5 l
0 123.5 l
0 718.5 l
h
W
n
Defines a clip path consisting of a rectangle as wide as the whole media box but cutting off a stripe of 123.5 user space units in height at the top and at the bottom. As this clip path is completely contained in the clip path set before, the intersection equals this clip path. Thus, the currently effective clip area is this smaller rectangle.
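The arithmetic for this step can be checked with a small sketch. It treats both clip paths as axis-aligned rectangles, which is an assumption that happens to hold for these two paths but not for clip paths in general; the helper name `intersect` is hypothetical.

```python
# Minimal sketch: intersect two axis-aligned rectangles (x0, y0, x1, y1).

def intersect(a, b):
    """Return the overlapping rectangle, or None if there is no overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


media_box = (0, 0, 595, 842)
band = (0, 123.5, 595, 718.5)

print(intersect(media_box, band))   # (0, 123.5, 595, 718.5)
print(842 - 718.5, 123.5 - 0)       # 123.5 123.5
```

The intersection is the band itself, and the stripes cut off at the top and at the bottom are each 123.5 units high.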
0 718.5 m
595 718.5 l
595 123.5 l
0 123.5 l
0 718.5 l
h
W
n
Defines a clip path which is identical to the former one. Thus, intersecting them changes nothing.
148.75 668.92 m
93.98 668.92 49.58 624.52 49.58 569.75 c
49.58 514.98 93.98 470.58 148.75 470.58 c
203.52 470.58 247.92 514.98 247.92 569.75 c
247.92 624.52 203.52 668.92 148.75 668.92 c
h
347.08 470.58 m
292.32 470.58 247.92 426.18 247.92 371.42 c
247.92 316.65 292.32 272.25 347.08 272.25 c
401.85 272.25 446.25 316.65 446.25 371.42 c
446.25 426.18 401.85 470.58 347.08 470.58 c
h
W
n
Defines a clip path consisting of two circle subpaths. These two circles don't intersect; thus we don't have to deal with the differences between the "Nonzero Winding Number Rule" and the "Even-Odd Rule". Furthermore, the circles are contained inside the present clip area. Thus, the new clip area consists of these two circles.
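That the two circles do not overlap can be verified numerically. The sketch below treats each four-curve subpath as a true circle derived from its bounding box, which is an approximation (the subpaths are Bézier approximations of circles); the helper name `circle_from_bbox` is hypothetical.

```python
# Minimal sketch: check that the two circular subpaths do not overlap.
import math


def circle_from_bbox(x0, y0, x1, y1):
    """Return (center_x, center_y, radius) of the circle inscribed in a box."""
    return ((x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0) / 2)


# Bounding boxes read off the curve end points in the content stream.
c1 = circle_from_bbox(49.58, 470.58, 247.92, 668.92)
c2 = circle_from_bbox(247.92, 272.25, 446.25, 470.58)

distance = math.hypot(c2[0] - c1[0], c2[1] - c1[1])
disjoint = distance > c1[2] + c2[2]
print(disjoint)  # True
```

The centers are roughly 280 units apart while the radii add up to roughly 198 units, so the interiors do not overlap and both fill rules give the same clip area.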
0 0 0 rg
49.58 668.92 m
545.42 668.92 l
545.42 173.08 l
49.58 173.08 l
49.58 668.92 l
h
f
This draws a filled black rectangle which contains the current clipping area. Thus, the whole clipping area (i.e. the two circles) is painted black.
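The containment claim can be checked the same way. This is a rough sketch that compares bounding boxes only, which is sufficient here because the filled shape is itself an axis-aligned rectangle; the helper name `contains` is hypothetical.

```python
# Minimal sketch: the filled rectangle covers both circles' bounding boxes.

def contains(outer, inner):
    """True if rectangle `inner` lies inside `outer`; both are (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


fill_rect = (49.58, 173.08, 545.42, 668.92)
circle_boxes = [
    (49.58, 470.58, 247.92, 668.92),
    (247.92, 272.25, 446.25, 470.58),
]

covered = all(contains(fill_rect, box) for box in circle_boxes)
print(covered)  # True
```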
Q
q
This restores the graphics state to the last pushed one. I.e. the clipping path for any following operations is the first one which encompassed the whole media box. This graphics state is pushed again.
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Once again the clipping path clipping off bars at the top and the bottom is defined...
Q
q
... and immediately dropped by a restore state operation; the state is then pushed again.
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Q
q
The same again...
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Q
q
... and again.
0 842 m
0 0 l
595 0 l
595 842 l
h
W
n
This once again defines a clipping path encompassing the whole media box. As this is the current clipping path anyhow, intersecting with it changes nothing.
Q
Q
Q
All graphics states formerly pushed onto the stack are removed again.
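As a sanity check when walking a content stream like this one, the depth of the graphics state stack can be tracked from the q and Q operators alone. A minimal sketch, with the hypothetical helper `stack_depths` and the q/Q sequence transcribed by hand from the stream above:

```python
# Minimal sketch: track graphics state stack depth across q / Q operators.

def stack_depths(tokens):
    """Return the stack depth after each q or Q; other tokens are skipped."""
    depth = 0
    depths = []
    for tok in tokens:
        if tok == "q":
            depth += 1
        elif tok == "Q":
            if depth == 0:
                raise ValueError("Q without matching q")
            depth -= 1
        else:
            continue
        depths.append(depth)
    return depths


# q/Q operators of the page above, in order of appearance.
page_ops = "q q q Q q Q q Q q Q q Q Q Q".split()
depths = stack_depths(page_ops)
print(max(depths), depths[-1])  # 3 0
```

The stack never gets deeper than three entries and is empty again at the end of the page, which matches the walkthrough.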