As can be seen in the PyMuPDF documentation for get_page_fonts, the returned set of fonts has names like FNUUTH+Calibri-Bold or DOKBTG+Calibri.
What do the string prefixes (FNUUTH+, DOKBTG+) represent?
A font file can be large. This is especially true for Asian scripts (e.g., Chinese) with their thousands of symbols, where font files of several megabytes, or even tens of megabytes, can occur.
Any document, however, uses only a limited number of characters from any font it happens to use.
So the technique of "subsetting" a font has been invented:
A subset of a font only contains relevant, used characters of its parent. In PDF this is indicated with a (unique per font) prefix "ABCDEF+" of 6 uppercase arbitrary ASCII letters followed by the "+" symbol.
So DOKBTG+Calibri is a subset of the Calibri font.
There is no rule on how that prefix has to be built, except that it must be unique for (in this case) Calibri within the given file.
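As a rough illustration of the naming convention described above, a font name can be split into its subset prefix and its base name with a regular expression. This is only a sketch using the standard library: the helper name split_subset_tag is made up for this example and is not part of PyMuPDF.

```python
import re

# Subset prefix as described above: six uppercase ASCII letters, then "+".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(fontname):
    """Return (tag, base_name); tag is None if there is no subset prefix.

    Hypothetical helper for illustration only, not a PyMuPDF function.
    """
    match = SUBSET_TAG.match(fontname)
    if match is None:
        return None, fontname
    return match.group(1), match.group(2)


for name in ("FNUUTH+Calibri-Bold", "DOKBTG+Calibri", "Helvetica"):
    print(split_subset_tag(name))
```

Since the six letters are arbitrary, the tag itself carries no meaning; only the part after the "+" tells you which font the subset was taken from.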
By default, the font information in PyMuPDF's text extraction does not show that subset prefix, but it can be requested by setting a (global) parameter.
Note: I am a maintainer and the original creator of PyMuPDF.