Tags: utf-8, character-encoding, qr-code

Choosing a character encoding for QR Codes


I'm building an application which will have the ability to generate QR Codes including arbitrary text data. However, this poses a challenge: I'm expecting users to include non-ASCII characters such as á or ö.

From what I've gathered, the default for QR Codes is ISO-8859-1, but UTF-8 seems to be a common choice (and covers a wider range of characters, such as Arabic or Hebrew characters that cannot be represented in ISO-8859-1).

However, the question I've linked doesn't answer a vital question for me - can I expect most real world QR code readers (e.g., smartphones or any commonly used tools for QR reading) to reliably read QR codes with UTF-8 encoding? Is it safer to use ISO-8859-1 instead? Or should I just assume that including non-ASCII characters in QR Codes is a recipe for failure?


Solution

  • Most QR code scanners use heuristics to detect character encoding, whether the default encoding (ISO-8859-1) is used or another encoding (like UTF-8) is specified via an ECI extension. These heuristics can fail under certain conditions. You need to test your QR codes with the most widespread scanners to determine which produces fewer errors: ISO-8859-1 or UTF-8 with ECI. Do not use a QR code generator that omits ECI for UTF-8, as the generated QR codes would not comply with the standard.

    Although ISO-8859-1 is the default encoding for QR codes, this has been the case only since the 2005 update to the standard. The earlier version, published in 2000 (ISO/IEC 18004:2000), specified the 8-bit Latin/Kana character set according to JIS X 0201 (also known as JIS8) as the default encoding for 8-bit mode.

    There are four modes for storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji. The QR code standard has no native UTF-8 mode. To use UTF-8 (instead of the default ISO-8859-1 or JIS8) for an 8-bit string, the implementation must insert an ECI (Extended Channel Interpretation) designator before that string. ECI is an optional feature of QR codes that has been part of the standard since at least the 2000 edition. It allows data to be encoded in character sets other than the default, and also supports other data interpretations (e.g., data compressed with a defined scheme) and industry-specific requirements. The ECI protocol is defined in a specification developed by AIM, Inc., which can be purchased for $50 from AIM Global.
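To make the ECI mechanism more concrete, here is a minimal Python sketch of the bits that precede and carry a UTF-8 string in a QR code. It assumes the commonly documented values: mode indicator 0111 for ECI, assignment number 26 for UTF-8, mode indicator 0100 for 8-bit mode, and an 8-bit character count (as used by QR versions 1 to 9). The function name `eci_byte_segment_bits` is made up for illustration, and the sketch leaves out everything else a real generator does (terminator, padding, error correction, masking).

```python
def eci_byte_segment_bits(text: str, eci: int = 26, count_bits: int = 8) -> str:
    """Return the ECI header plus one 8-bit-mode segment as a string of '0'/'1'.

    Illustrative only: handles one-byte ECI designators (0..127) and assumes
    the payload is UTF-8, which matches ECI assignment number 26.
    """
    if not 0 <= eci <= 127:
        raise ValueError("this sketch only handles ECI assignment numbers 0..127")
    data = text.encode("utf-8")
    if len(data) >= 1 << count_bits:
        raise ValueError("payload too long for the chosen character count width")

    bits = "0111"                                  # ECI mode indicator
    bits += format(eci, "08b")                     # ECI assignment number (26 = UTF-8)
    bits += "0100"                                 # 8-bit (byte) mode indicator
    bits += format(len(data), f"0{count_bits}b")   # number of bytes that follow
    bits += "".join(format(b, "08b") for b in data)  # the UTF-8 bytes themselves
    return bits


if __name__ == "__main__":
    print(eci_byte_segment_bits("\u00e1"))  # the single character á
```

A generator that skips the first two fields and writes UTF-8 bytes straight into the 8-bit segment produces exactly the non-compliant code described above: the reader has no declared encoding and can only guess.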

    Unfortunately, not all QR scanners can handle the ECI protocol, even for basic tasks such as changing the default encoding to UTF-8. Most implementations use heuristics (character encoding detection algorithms) to guess the encoding, even if the encoding is explicitly specified in the ECI of the decoded QR code.
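To illustrate why such guessing can go wrong, here is a sketch of one plausible heuristic: try UTF-8 first and fall back to ISO-8859-1. This is an assumption for illustration, not the algorithm of any particular scanner, and `guess_and_decode` is a hypothetical name.

```python
def guess_and_decode(payload: bytes) -> str:
    """Hypothetical scanner heuristic: prefer UTF-8, else assume ISO-8859-1."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this fallback never fails.
        return payload.decode("iso-8859-1")


# A lone 0xE1 byte is not valid UTF-8, so the fallback recovers the intended text.
intended = "\u00e1"                       # á
assert guess_and_decode(intended.encode("iso-8859-1")) == intended

# The two ISO-8859-1 characters U+00C3 U+00A1 ("Ã¡") encode to bytes C3 A1,
# which also happen to be the valid UTF-8 encoding of á, so the guess is wrong.
tricky = "\u00c3\u00a1"
misread = guess_and_decode(tricky.encode("iso-8859-1"))
assert misread != tricky
```

Real heuristics are more elaborate (some also consider Shift JIS, for example), but any guess based only on the bytes can be fooled by a payload that is valid in more than one encoding.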

    You need to test your QR codes with various scanners to determine which option yields better results. There is no universal solution. Some scanners will fail due to errors in their heuristics. Only scanners that do not use heuristics (at least when ECI is provided) will avoid such issues. Personally, I would choose ISO-8859-1 for two reasons. First, it does not require ECI. Second, ISO-8859-1 needs only one byte to encode non-US-ASCII characters such as á or ö, whereas UTF-8 needs two bytes for each of them. QR codes will therefore be smaller with ISO-8859-1: the string itself takes fewer bytes, and omitting the ECI saves about 2 more bytes.
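The size argument is easy to check for a given text. The following sketch compares the payload length in both encodings; `payload_sizes` is a hypothetical helper, and it reports `None` when the text cannot be represented in ISO-8859-1 at all, which is the case for the Arabic or Hebrew text mentioned in the question.

```python
from typing import Dict, Optional


def payload_sizes(text: str) -> Dict[str, Optional[int]]:
    """Return the encoded byte length per encoding, or None if not encodable."""
    sizes: Dict[str, Optional[int]] = {}
    for encoding in ("iso-8859-1", "utf-8"):
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None  # text contains characters outside this charset
    return sizes


if __name__ == "__main__":
    print(payload_sizes("\u00e1\u00f6"))               # the two characters á and ö
    print(payload_sizes("\u05e9\u05dc\u05d5\u05dd"))   # a four-letter Hebrew word
```

These figures cover only the text bytes; the ECI header needed for UTF-8 comes on top of them.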