[SOLVED] Getting strange character translations using unoconv to convert from docx/doc to pdf

I am using unoconv (https://github.com/dagwieers/unoconv) to convert DOCX and DOC files to PDF, but certain characters often come out wrong when they are rendered in the PDF.

One particular problem is that numbers are translated oddly. For example, the section label:

Section 2.3 (http://note.io/1Q33RX6)

Gets turned into a Roman numeral:

Section II.3 (http://note.io/1b6MDs5)

I have a feeling this has to do with the installed character sets, but I have no idea how to debug it.

The setting for the issue is a Django app that calls a Unix shell script to convert a document on disk.

Solution

unoconv simply opens the file programmatically and then saves/exports it to the desired format. I would expect the same result when you open the file in LibreOffice and save it from the GUI.

If this is the case, you may want to test with the latest LibreOffice release, and if that does not solve your issue, report the problem to the LibreOffice bug tracker.
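
One way to check whether the problem lies in unoconv or in LibreOffice itself is to convert the same document with both and compare the PDFs. The following is only a rough sketch in Python (since the caller is a Django app), not a definitive implementation. The function names are made up for illustration, and it assumes the LibreOffice binary is on the PATH as soffice (on some systems it is called libreoffice instead).

```python
import shutil
import subprocess
from pathlib import Path


def unoconv_command(source, out_dir):
    """Build the unoconv call: -f sets the format, -o the output location."""
    return ["unoconv", "-f", "pdf", "-o", str(out_dir), str(source)]


def soffice_command(source, out_dir):
    """Build the headless LibreOffice call (binary name 'soffice' is an assumption)."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_both(source, work_dir):
    """Convert with each tool into its own directory so the PDFs can be compared."""
    source = Path(source)
    results = {}
    for name, build in (("unoconv", unoconv_command), ("soffice", soffice_command)):
        out_dir = Path(work_dir) / name
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = build(source, out_dir)
        if shutil.which(cmd[0]) is None:
            results[name] = None  # tool not installed on this machine
            continue
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        results[name] = out_dir / (source.stem + ".pdf")
    return results
```

If both PDFs show "Section II.3", the issue is most likely in LibreOffice's handling of the document rather than in unoconv, which is the case worth reporting to the LibreOffice bug tracker.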