I'm using PDF Clown to analyze a PDF document. In many documents, PDF Clown reports different heights for characters that obviously have the same height. Is there a workaround?
This is the code:
while(_level.moveNext()) {
  ContentObject content = _level.getCurrent();
  if(content instanceof Text) {
    ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
    for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
      List<CharInfo> chars = new ArrayList<>();
      for(TextChar textChar : textString.getTextChars()) {
        chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
      }
    }
  }
  else if(content instanceof XObject) {
    // Scan the external level
    if(((XObject)content).getScanner(_level)!=null){
      getContentLines(((XObject)content).getScanner(_level));
    }
  }
  else if(content instanceof ContainerObject){
    // Scan the inner level
    if(_level.getChildLevel()!=null){
      getContentLines(_level.getChildLevel());
    }
  }
}
Here is an example PDF document:
In this document I marked two text chunks which both contain the word "million". When analyzing the size of each character in both occurrences of "million", the following happens:
Even though all characters of the two text chunks obviously have the same size, PDF Clown reports different sizes.
The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as
begin-marked-content
save-graphics-state
restore-graphics-state
end-marked-content
or
save-graphics-state
begin-marked-content
end-marked-content
restore-graphics-state
but never as
save-graphics-state
begin-marked-content
restore-graphics-state
end-marked-content
or
begin-marked-content
save-graphics-state
end-marked-content
restore-graphics-state.
Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.
E.g. in the document at hand there are sequences like this:
q
[...1...]
/P <</MCID 0 >>BDC
Q
[...2...]
EMC
Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q, and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.
Due to the wrong assumption, though, and the way /P <</MCID 0 >>BDC and Q are arranged, PDF Clown parses the above as a save/restore graphics state block containing [...1...], an empty marked content block, and [...2...].
Thus, if there are changes in the graphics state inside [...2...], PDF Clown assumes their effect is limited to the lines above, while it actually is not.
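The mis-parse described above can be reproduced with a toy recursive parser. This is only a sketch of the strategy, not PDF Clown code: the class NestingDemo, the method parseBlock, and the use of plain string tokens are made up for illustration. It assumes, as the real parser does, that any "end" operator closes whichever block is currently open.

```java
import java.util.Arrays;
import java.util.Iterator;

class NestingDemo {
    // Hypothetical stand-in for the recursive parsing strategy:
    // "q" and "BDC" open a nested block, and ANY end token ("Q" or "EMC")
    // closes the innermost open block, regardless of its type.
    static String parseBlock(Iterator<String> tokens) {
        StringBuilder out = new StringBuilder();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (token.equals("Q") || token.equals("EMC")) {
                break; // the end token's type is never checked against the open block
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (token.equals("q")) {
                out.append("LocalGraphicsState{").append(parseBlock(tokens)).append('}');
            } else if (token.equals("BDC")) {
                out.append("MarkedContent{").append(parseBlock(tokens)).append('}');
            } else {
                out.append(token); // single operation, kept as-is
            }
        }
        return out.toString();
    }

    static String parse(String... tokens) {
        return parseBlock(Arrays.asList(tokens).iterator());
    }

    public static void main(String[] args) {
        // Same shape as the sequence from the document: q [1] BDC Q [2] EMC
        System.out.println(parse("q", "[1]", "BDC", "Q", "[2]", "EMC"));
    }
}
```

Running this prints LocalGraphicsState{[1] MarkedContent{} [2]}: the Q closes the marked content block (leaving it empty), and the EMC closes the graphics state block, so [2] ends up inside a graphics state block it does not belong to.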
The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.
To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:
In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent option:
public List<ContentObject> parseContentObjects(
  )
{
  final List<ContentObject> contentObjects = new ArrayList<ContentObject>();
  while(moveNext())
  {
    ContentObject contentObject = parseContentObject();
    // Multiple-operation graphics object end?
    if(contentObject instanceof EndText // Text.
      || contentObject instanceof RestoreGraphicsState // Local graphics state.
      /* || contentObject instanceof EndMarkedContent // End marked-content sequence. */
      || contentObject instanceof EndInlineImage) // Inline image.
      return contentObjects;
    contentObjects.add(contentObject);
  }
  return contentObjects;
}
In parseContentObject() I removed the if(operation instanceof BeginMarkedContent) branch:
public ContentObject parseContentObject(
  )
{
  final Operation operation = parseOperation();
  if(operation instanceof PaintXObject) // External object.
    return new XObject((PaintXObject)operation);
  else if(operation instanceof PaintShading) // Shading.
    return new Shading((PaintShading)operation);
  else if(operation instanceof BeginSubpath
    || operation instanceof DrawRectangle) // Path.
    return parsePath(operation);
  else if(operation instanceof BeginText) // Text.
    return new Text(
      parseContentObjects()
      );
  else if(operation instanceof SaveGraphicsState) // Local graphics state.
    return new LocalGraphicsState(
      parseContentObjects()
      );
/*  else if(operation instanceof BeginMarkedContent) // Marked-content sequence.
    return new MarkedContent(
      (BeginMarkedContent)operation,
      parseContentObjects()
      );
*/  else if(operation instanceof BeginInlineImage) // Inline image.
    return parseInlineImage();
  else // Single operation.
    return operation;
}
With these changes in place, the character sizes are properly extracted.
As an aside, the returned individual character boxes seem to imply that each box is completely custom to the character in question, but that is not true: only the width of the box is character-specific. The height is calculated from overall font properties (and the current font size), not from the individual character, cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char):
/**
  Gets the unscaled height of the given character.
  @param textChar
    Character whose height has to be calculated.
*/
public final double getHeight(
  char textChar
  )
{
  /*
    TODO: Calculate actual text height through glyph bounding box.
  */
  if(textHeight == -1)
  {textHeight = getAscent() - getDescent();}
  return textHeight;
}
Individual character height calculation still is a TODO.
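To make the consequence concrete, here is a minimal stand-alone sketch of that behaviour. ToyFont is a hypothetical class, not part of PDF Clown, and the scaling by 0.001 times the font size is an assumption based on the usual 1/1000 glyph-space units of PDF fonts. It shows that with a font-wide height, every character of the same font and size gets the same box height, which is why differing heights point to a wrongly tracked graphics state rather than to the glyphs.

```java
class ToyFont {
    private final double ascent;    // glyph-space units above the baseline (assumed 1/1000 em)
    private final double descent;   // glyph-space units below the baseline, negative
    private double textHeight = -1; // cached, font-wide height

    ToyFont(double ascent, double descent) {
        this.ascent = ascent;
        this.descent = descent;
    }

    // Mirrors the quoted method: the character argument is ignored.
    double getHeight(char textChar) {
        if (textHeight == -1) {
            textHeight = ascent - descent;
        }
        return textHeight;
    }

    // Assumed scaling: glyph-space units to text-space units at the given font size.
    double getHeight(char textChar, double size) {
        return getHeight(textChar) * 0.001 * size;
    }

    public static void main(String[] args) {
        ToyFont font = new ToyFont(750, -250);
        // A short 'i' and a tall 'M' get the same height at the same size.
        System.out.println(font.getHeight('i', 12) == font.getHeight('M', 12));
    }
}
```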