
How can I properly create multilingual metadata in pdftk?


pdftk lets you set the title of a PDF with the following command:

pdftk input.pdf update_info metadata.txt output output.pdf

However, if I use special characters in the metadata.txt file (such as German or Chinese characters), it doesn't seem to work.

Here's an example of changing the title:

InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengefühl is a German term.

However, the PDF ends up with a strange character in place of the ü.

The pdftk documentation says that non-ASCII characters should be encoded as XML numerical entities. I Googled myself silly but couldn't find anything that works.


Solution

  • The best reference I've found is the article on numeric character references, which apply to XML (and XHTML and SGML).

    This is generally used to represent characters that are not directly encodable.

    In your case, the character is ü (U+00FC, decimal 252), which can be written as &#252; (decimal) or &#xFC; (hexadecimal). XML has no octal form of these references.

    Using a decimal reference, your file should be encoded as:

    InfoBegin
    InfoKey: Title
    InfoValue: Fingerspitzengef&#252;hl is a German term.
    

    Note:

    If you're on *nix, you can use recode to encode the file.

    % cat metadata.txt | recode ..xml
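
If recode isn't available, the same substitution can be sketched in a few lines of Python. This is only a minimal sketch, not something pdftk itself provides: it assumes metadata.txt is saved as UTF-8, and the helper names and the output file name metadata_ascii.txt are made up for illustration. It relies on Python's standard `xmlcharrefreplace` error handler, which turns each non-ASCII character into a decimal reference and leaves ASCII characters (including any literal `&`) untouched.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference (&#NNN;)."""
    # 'xmlcharrefreplace' is a standard codec error handler in Python.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def escape_metadata_file(src: str, dst: str) -> None:
    """Read a UTF-8 metadata file and write an ASCII-only copy of it.

    Assumption: the input file is UTF-8 encoded.
    """
    with open(src, encoding="utf-8") as f:
        content = f.read()
    with open(dst, "w", encoding="ascii") as f:
        f.write(to_numeric_refs(content))


if __name__ == "__main__":
    print(to_numeric_refs("InfoValue: Fingerspitzengefühl is a German term."))
    # prints: InfoValue: Fingerspitzengef&#252;hl is a German term.
```

Calling `escape_metadata_file("metadata.txt", "metadata_ascii.txt")` would then give a file that can be passed to update_info in place of the original.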