unicode dxf

What is this encoding: \M+5BCFE\M+5BAC5?


We are seeing some unfamiliar encoded text in DXF/DWG MTEXT entities, for example \M+5BCFE\M+5BAC5, which is supposed to be the Chinese characters 件号. Does anyone recognize this encoding and know how to decode it?

I searched Google and asked ChatGPT, but found nothing.


Solution

  • From Unicode and DXF files:

    Texts (contents of text objects) represented in the DXF interchange format files (a textual version of DWG) were stored in plain ASCII up to AutoCAD version 2006. Special characters - e.g. Czech, Asian, Cyrillic - were expressed using either the MIF (Maker Interchange Format, syntax: \M+nxxyy) or CIF (Common Interchange Format, syntax: \U+nnnn) codes.

    Since version 2007, DXF files are saved and loaded in Unicode (UTF-8), so any special characters can also be stored directly in the DXF. But they can still be interpreted from the MIF and CIF codes. When interpreting the codes, the current setting of the DWGCODEPAGE variable is applied.

    For the meaning of n in the \M+nxxyy syntax, see AdCharFormatter::getMIFCodePage:

    This static method retrieves the code page value for a given index. ch must be in the range from '1' (0x31) to '5' (0x35).

    Returns zero if ch does not represent a valid index. Otherwise, it returns one of the following values:

    • 932 - Shift-JIS (Japanese)
    • 950 - Big 5 (Traditional Chinese)
    • 949 - KS C-5601-1987 (Wansung)
    • 1361 - KS C-5601-1992 (Johab)
    • 936 - GB 2312-80 (Simplified Chinese)

    Let's check this (the example is in Python because it is widely readable):

    >>> '件号'.encode('gb2312-80')
    b'\xbc\xfe\xba\xc5'

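    These are the same hexadecimal values, bc fe ba c5, as in \M+5BCFE\M+5BAC5. In each \M+nxxyy code, the digit 5 is the index n, which selects the fifth entry in the list above (code page 936, Simplified Chinese), and the four hex digits that follow are the two bytes of one character: BCFE is 件 and BAC5 is 号.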
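
The check above can be turned around into a small decoder. The following is only a minimal sketch, not an official AutoCAD routine: the function name decode_dxf_escapes and the choice of Python codec for each code page are assumptions made for illustration (in particular, "gbk" is used for code page 936 because it is a superset of GB 2312). It replaces \M+nxxyy codes using the index-to-code-page list above and \U+nnnn codes using the Unicode code point, and leaves anything it cannot decode untouched.

```python
import re

# Index digit n in \M+nxxyy -> Python codec name.
# The code pages follow the getMIFCodePage list; the codec names are an
# assumption about which Python codec best matches each code page.
MIF_CODECS = {
    "1": "cp932",  # 932, Shift-JIS (Japanese)
    "2": "cp950",  # 950, Big 5 (Traditional Chinese)
    "3": "cp949",  # 949, KS C-5601-1987 (Wansung)
    "4": "johab",  # 1361, KS C-5601-1992 (Johab)
    "5": "gbk",    # 936, GB 2312-80 (Simplified Chinese); gbk is a superset
}

MIF_RE = re.compile(r"\\M\+([1-5])([0-9A-Fa-f]{4})")
CIF_RE = re.compile(r"\\U\+([0-9A-Fa-f]{4})")


def _decode_mif(match):
    """Decode one \\M+nxxyy code: n picks the codec, xxyy are two bytes."""
    codec = MIF_CODECS[match.group(1)]
    raw = bytes.fromhex(match.group(2))
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        # Not a valid character in that code page: keep the original text.
        return match.group(0)


def _decode_cif(match):
    """Decode one \\U+nnnn code: nnnn is a Unicode code point in hex."""
    return chr(int(match.group(1), 16))


def decode_dxf_escapes(text):
    """Replace MIF and CIF escape codes in a DXF text string."""
    text = MIF_RE.sub(_decode_mif, text)
    return CIF_RE.sub(_decode_cif, text)
```

With this sketch, decode_dxf_escapes(r"\M+5BCFE\M+5BAC5") returns '件号', and so does decode_dxf_escapes(r"\U+4EF6\U+53F7"). It does not consult DWGCODEPAGE, so it only covers codes that carry their own index or code point.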