I'm using PoDoFo to extract character displacement to update a text matrix correctly. This is a code fragment of mine:
PdfString str, ucode_str;
std::stack<PdfVariant> *stack;
const PdfFontMetrics *f_metrics;
...
/* Convert string to UTF8 */
str = stack->top().GetString();
ucode_str = ts->font->GetEncoding()->ConvertToUnicode(str, ts->font);
stack->pop();
c_str = (char *) ucode_str.GetStringUtf8().c_str();
/* Font metrics to obtain a character displacement */
f_metrics = ts->font->GetFontMetrics();
for (j = 0; j < strlen(c_str); j++) {
str_w = f_metrics->CharWidth(c_str[j]);
/* Adjust text matrix using str_w */
...
}
It works well for some PDF files (str_w contains a useful width), but doesn't work for others. In these cases str_w contains 0.0. I took a look at the PoDoFo 0.9.5 sources and found CharWidth() implemented for all sub-classes of PdfFontMetrics.
Am I missing something important during this string conversion?
Update from 04.08.2017
@mkl did a really good job reviewing PoDoFo's code. However, I realized that I needed a slightly different quantity. To be precise, I needed a glyph width expressed in text space units (see PDF Reference 1.7, 5.1.3 Glyph Positioning and Metrics), but CharWidth() is implemented in PdfFontMetricsObject.cpp like:
double PdfFontMetricsObject::CharWidth(unsigned char c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();
        return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
    }
    if (m_missingWidth != NULL)
        return m_missingWidth->GetReal();
    else
        return m_dDefWidth;
}
The width is calculated using additional multipliers (font size, character spacing, etc.). What I really needed was only dWidth * m_matrix.front().GetReal(). Thus, I decided to implement GetGlyphWidth(int c) in the same file like:
double PdfFontMetricsObject::GetGlyphWidth(int c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();
        return dWidth * m_matrix.front().GetReal();
    }
    return 0.0;
}
and call this one instead of CharWidth() from the first listing.
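For reference, the PDF Reference combines the text-space glyph width with the text state parameters to get the horizontal displacement applied to the text matrix: tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th. The following is only a minimal sketch of that formula, not PoDoFo code. The struct and function names are made up for illustration, and w0 is assumed to be already scaled to text space (for example, the value returned by GetGlyphWidth() above).

```cpp
#include <cassert>

// Hypothetical container for the text state parameters; not a PoDoFo type.
struct TextState {
    double fontSize;    // Tfs
    double charSpace;   // Tc
    double wordSpace;   // Tw
    double horizScale;  // Th as a fraction, i.e. the Tz operand / 100
};

// Horizontal displacement for one glyph in horizontal writing mode.
//   w0             glyph width in text space units
//   tjAdjust       pending number from a TJ array (0 if none)
//   isSingleSpace  true only for the single-byte character code 32,
//                  the only code that word spacing applies to
double horizontalDisplacement(double w0, double tjAdjust,
                              bool isSingleSpace, const TextState &ts)
{
    assert(w0 >= 0.0);  // widths from a /Widths array are not expected to be negative
    const double tw = isSingleSpace ? ts.wordSpace : 0.0;
    return ((w0 - tjAdjust / 1000.0) * ts.fontSize + ts.charSpace + tw)
           * ts.horizScale;
}
```

With a 12 pt font, no extra spacing and 100% horizontal scaling, a glyph of width 0.5 advances the text matrix by 6 units.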
If I understand the PoDoFo code correctly (I'm not really a PoDoFo expert...), the PdfFontMetricsObject class is used to represent the metrics of fonts contained in an already existing PDF:
/** Create a font metrics object based on an existing PdfObject
*
* \param pObject an existing font descriptor object
* \param pEncoding a PdfEncoding which will NOT be owned by PdfFontMetricsObject
*/
PdfFontMetricsObject( PdfObject* pFont, PdfObject* pDescriptor, const PdfEncoding* const pEncoding );
The method CharWidth here is implemented like this:
double PdfFontMetricsObject::CharWidth( unsigned char c ) const
{
    if( c >= m_nFirst && c <= m_nLast
        && c - m_nFirst < static_cast<int>(m_width.GetSize()) )
    {
        double dWidth = m_width[c - m_nFirst].GetReal();
        return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
    }
    if( m_missingWidth != NULL )
        return m_missingWidth->GetReal ();
    else
        return m_dDefWidth;
}
In particular, one sees that the parameter c is not mapped through the font encoding but used as is for the lookup in the widths array. Thus, the expected input of this method does not appear to be an ASCII or ANSI character code but the original glyph ID.
Your code, on the other hand, has already transformed the glyph IDs to Unicode in UTF-8 and, therefore, essentially tries to look up by ANSI character codes.
This would match the example documents. A typical font encoding in a PDF that is processed incorrectly looks like this:
28 0 obj
<<
/Differences[0/B/G/W/a/d/e/f/g 9/i/l/n/o/p/r/space/t/w]
/BaseEncoding/MacRomanEncoding
/Type/Encoding
>>
endobj
with glyph codes from 0 (FirstChar) to 17 (LastChar), or
12 0 obj
<<
/Differences[1/A/B/C/D/F/I/L/M/N/O/P/R/T/U/a/c/d
/degree/e/eight/f/five/four/g/h
27/i/l/m/n/o/one/p/parenleft/parenright
/period/r/registered/s/space
/t/three/two/u/w/zero]
/BaseEncoding/MacRomanEncoding
/Type/Encoding
>>
endobj
with glyph codes from 1 (FirstChar) to 46 (LastChar).
So these encodings assign glyph codes starting from 0 (or 1) to just the glyphs actually required and don't cover many glyphs at all.
Thus, CharWidth will return 0 for all char values above 17 or above 46, which means all (in the former case) or most (in the latter case) ANSI non-control characters.
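To make the effect concrete, here is a small stand-alone model of the range check in CharWidth, applied to the first encoding above, where the glyph t has code 16. It is only a sketch: WidthTable and lookupWidth are hypothetical names, not PoDoFo types, the width values are made up, and the fallback is assumed to be 0 as observed in the question.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the data behind PdfFontMetricsObject::CharWidth().
struct WidthTable {
    int first;                   // FirstChar
    int last;                    // LastChar
    std::vector<double> widths;  // one entry per code in [first, last]
};

// Same range check as CharWidth, without the font size and spacing factors.
double lookupWidth(const WidthTable &t, int code)
{
    assert(t.first <= t.last);
    if (code >= t.first && code <= t.last &&
        code - t.first < static_cast<int>(t.widths.size()))
        return t.widths[code - t.first];
    return 0.0;  // assumed fallback: no usable width outside the range
}

// Table for codes 0..17 as in the first encoding; 500.0 is a made-up width.
WidthTable makeExampleTable()
{
    return WidthTable{0, 17, std::vector<double>(18, 500.0)};
}
```

Looking up the raw code 16 finds an entry, whereas looking up the character 't' (0x74 = 116) taken from the converted string falls outside 0..17 and yields 0.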
On the other hand, a typical font encoding in a PDF that is processed correctly looks like this:
1511 0 obj
<<
/Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[
1/Delta/Theta
8/Phi
11/ff/fi/fl/ffi
39/quoteright
]
>>
endobj
with glyph codes from 1 (FirstChar) to 122 (LastChar).
These encodings basically are WinAnsiEncoding with minor additions in the lower values, in particular the control character values.
What you can do, therefore, is iterate over the glyph codes in str (allowing you to call CharWidth for them) and convert them individually to Unicode when needed, instead of first converting str to Unicode ucode_str and then iterating over ANSI characters in ucode_str.
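A rough, self-contained sketch of that per-code loop follows. FontModel, Glyph and measure are hypothetical stand-ins rather than PoDoFo types: in real code the raw bytes would presumably come from the PdfString itself (its GetString() and GetLength() accessors, as far as I can tell), the width from CharWidth, and the Unicode value from the font's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Glyph {
    char32_t unicode;  // code point for text output (0 if unmapped)
    double   width;    // width looked up by the raw character code
};

// Hypothetical model of the font data needed here.
struct FontModel {
    int first;                              // FirstChar
    std::vector<double> widths;             // indexed by code - first
    std::map<int, char32_t> codeToUnicode;  // derived from the encoding
};

std::vector<Glyph> measure(const std::string &raw, const FontModel &font)
{
    assert(font.first >= 0);
    std::vector<Glyph> out;
    // Use the stored length, not strlen(): code 0 is a valid glyph code.
    for (std::size_t i = 0; i < raw.size(); ++i) {
        // Go through unsigned char so codes above 127 do not become negative.
        const int code = static_cast<unsigned char>(raw[i]);
        const int idx = code - font.first;
        const double w =
            (idx >= 0 && idx < static_cast<int>(font.widths.size()))
                ? font.widths[idx] : 0.0;
        // Convert this single code to Unicode only after the width lookup.
        const auto it = font.codeToUnicode.find(code);
        out.push_back({it != font.codeToUnicode.end() ? it->second : U'\0', w});
    }
    return out;
}
```

Note that with encodings like the first one above, where B has code 0, a loop bounded by strlen() would stop before the first glyph, so the string length has to be taken from the string object.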