In my JRuby application, I get input from two sources:
Some of the external data is supposed to be encoded as ISO_8859_1, while I internally process it as UTF_8 and also produce UTF_8 as output.
Unfortunately, there are sometimes encoding errors: the data occasionally contains bytes which are not valid ISO_8859_1, and this isn't going to be fixed. The specification requires simply throwing away those illegal input bytes.
For a file, I'm reading the file using
string = File.new(filename, {external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8, converters: UTF8_CONVERTER})
The converters option takes care that illegal input bytes are skipped.
For a string received from the Java side, I could of course convert it to UTF_8 by doing a
string = iso_string.encode(Encoding::UTF_8)
but how could I catch illegal characters here? From my understanding of the Ruby docs for the encode method, the options which can be stated after the destination encoding don't provide a converters key.
UPDATE
Here is a simple example to demonstrate the problem:
(1) Good case (no error)
s = [49, 67].pack('C*')
puts s
puts s.encoding
u = s.encode(Encoding::UTF_8)
puts u
puts u.encoding
This prints
1C
ASCII-8BIT
1C
UTF-8
(2) Error case
x = [49, 138, 67].pack('C*')
x.encode(Encoding::UTF_8)
raises, as expected, UndefinedConversionError: "\x8A" from ASCII-8BIT to UTF-8
What I tried (though not documented):
t = x.encode(external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8, converters: UTF8_CONVERTER)
Interestingly, this got rid of the exception, but nevertheless the conversion did not succeed. If I do a
t.encoding
I still see ASCII-8BIT. It seems that nothing was converted. I would like the illegal character to be removed, i.e. in this case the offending byte being replaced by the empty string, so that t becomes "1C".
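For reference, the behaviour I'm after would look roughly like this sketch, which uses the documented :invalid, :undef and :replace options of String#encode (whether these cover all my cases is exactly the open question):

```ruby
# Byte string containing 0x8A, which has no mapping from ASCII-8BIT to UTF-8.
x = [49, 138, 67].pack('C*')  # "1\x8AC", encoding ASCII-8BIT

# :undef handles bytes with no mapping in the target encoding, :invalid handles
# bytes invalid in the source encoding; replace: '' drops them instead of
# inserting a replacement character.
t = x.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')

puts t           # => "1C"
puts t.encoding  # => UTF-8
```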
Unfortunately, there are sometimes encoding errors: The data contains occasionally bytes which are not valid ISO_8859_1
This is strange, because there is no such thing. The ISO 8859-1 character encoding covers all the 256 possible 8-bit bytes and none of them are invalid. They can also all be converted to Unicode — trivially so, since the lowest 256 Unicode code points correspond 1:1 to the 256 characters in ISO 8859-1.
(It does have 65 non-printable "control characters" mapped to the bytes 0–31 and 127–159, but those are all also included in Unicode. These control characters include some fairly common ones like tabulators, line feeds and carriage returns, but also many other rarely used ones.)
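This can be checked exhaustively in a few lines (a quick sanity check, not meant as production code):

```ruby
# Every one of the 256 possible byte values is valid ISO 8859-1,
# and each maps 1:1 onto one of the first 256 Unicode code points.
(0..255).each do |b|
  ch = [b].pack('C').force_encoding(Encoding::ISO_8859_1)
  raise "byte #{b} invalid"  unless ch.valid_encoding?
  raise "byte #{b} remapped" unless ch.encode(Encoding::UTF_8).ord == b
end
puts "all 256 bytes round-trip"
```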
Your actual problem seems to be that Ruby has marked your byte string as having the default ASCII_8BIT encoding, not ISO_8859_1. This is a special encoding that allows a string to contain all of the 256 8-bit bytes, but defines Unicode character values only for the first 128 of them, which correspond to the 7-bit ASCII character encoding. To quote the Ruby documentation:
Encoding::ASCII_8BIT
is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.
Anyway, the solution in your case is simply to use the String#force_encoding method (which modifies the string in place, despite lacking the conventional exclamation point for some reason!) to change the encoding of your byte string to what it should be, i.e. in your case Encoding::ISO_8859_1, like this:
x = [49, 138, 67].pack('C*')
puts "x = #{x.inspect} has encoding #{x.encoding}"
x.force_encoding(Encoding::ISO_8859_1)
puts "x = #{x.inspect} now has encoding #{x.encoding}"
u = x.encode(Encoding::UTF_8)
puts "u = #{u.inspect} has encoding #{u.encoding}"
which will print:
x = "1\x8AC" has encoding ASCII-8BIT
x = "1\x8AC" now has encoding ISO-8859-1
u = "1\u008AC" has encoding UTF-8
As you can see, the ISO 8859-1 control character 138 (hex 0x8A, represented as \x8A in the inspect output) has been successfully converted to its Unicode equivalent U+008A (\u008A).
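If those C1 control characters are exactly the bytes your specification considers illegal, a simple post-processing step after the conversion would throw them away. A sketch (the range U+0080..U+009F is an assumption about what counts as illegal in your data):

```ruby
u = "1\u008AC"  # result of the ISO 8859-1 → UTF-8 conversion above

# Strip the C1 control range U+0080..U+009F, which ISO 8859-1 maps
# the bytes 128..159 onto.
clean = u.gsub(/[\u0080-\u009F]/, '')

puts clean  # => "1C"
```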
PS. It's also possible that your input data is actually not in the ISO 8859-1 encoding but in some other related encoding, such as Windows-1252, which differs from ISO 8859-1 only in that it replaces 32 of the 65 non-printable control characters (the C1 block consisting of the bytes from 128 to 159, to be exact) with various additional symbols and accented letters.
If that's the case (which you should be able to test fairly easily by trying to decode some of the data as Windows-1252 and seeing if the results make sense), you should use Encoding::WINDOWS_1252 instead of Encoding::ISO_8859_1. For example:
x = [49, 138, 67].pack('C*')
puts "x = #{x.inspect} has encoding #{x.encoding}"
x.force_encoding(Encoding::WINDOWS_1252)
puts "x = #{x.inspect} now has encoding #{x.encoding}"
u = x.encode(Encoding::UTF_8)
puts "u = #{u.inspect} has encoding #{u.encoding}"
will print:
x = "1\x8AC" has encoding ASCII-8BIT
x = "1\x8AC" now has encoding Windows-1252
u = "1ŠC" has encoding UTF-8
Note how the \x8A byte has now been converted into the accented letter Š, which is what it represents in the Windows-1252 encoding.
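To see the two interpretations side by side, here is a small comparison sketch that decodes the same byte both ways:

```ruby
byte = [0x8A].pack('C')  # one raw byte, encoding ASCII-8BIT

# Decode the identical byte under each source-encoding assumption.
iso = byte.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
win = byte.dup.force_encoding(Encoding::WINDOWS_1252).encode(Encoding::UTF_8)

puts iso.ord.to_s(16)  # => "8a" (the C1 control character U+008A)
puts win               # => "Š"  (U+0160, LATIN CAPITAL LETTER S WITH CARON)
```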