Construct UTF32Strings b and c, where b contains a large value:
using LegacyStrings
a=Char(69058047)
b=UTF32String(a)
c=UTF32String("")
Now concatenate b and c into d:
d=b*c
Now read back b, c, and d. b retains its value, but d has been forcibly converted to UTF8String and its character replaced by a value below 2^16, so the original value is lost?
julia> typeof(d)
UTF8String
julia> typeof(b)
UTF32String
julia> typeof(c)
UTF32String
julia> D=Int(Char(d[1]))
65533
julia> B=Int(Char(b[1]))
69058047
Doing this on Julia 0.4 and 0.6 yielded the same result. Is there a workaround that allows operating on strings with large Chars?
Given that there is a change in Char representation coming in 0.7, the answer to this question depends on the version of Julia you use.
If you want to use Julia 0.7 (and probably this is what you should be targeting, as in the long run you will have to switch to it anyway) you will get:
julia> a=Char(69058047)
ERROR: Base.CodePointError(0x041dbdff)
Stacktrace:
[1] code_point_err(::UInt32) at .\char.jl:10
[2] Type at .\char.jl:42 [inlined]
[3] Char(::Int64) at .\boot.jl:682
[4] top-level scope
In short: you will not be allowed to create such a Char at all.
It is important to know that the border value for this conversion error is 0x001fffff: code points up to 0x001fffff can still be constructed, even though anything above 0x0010ffff (the maximum valid Unicode code point) is already invalid. This is a catch you simply have to remember in 0.7. The reason is that values up to 0x001fffff can still be mapped to UTF-8, although some of those UTF-8 representations are invalid (larger values are impossible to map).
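As a minimal sketch (assuming Julia 0.7; the variable name c_high is only for illustration), you can check this boundary yourself:
# 0x001fffff is the largest code point Char will still accept in 0.7:
c_high = Char(0x001fffff)    # constructs, even though it is not valid Unicode
isvalid(c_high)              # false, because it is above the 0x0010ffff maximum
isvalid(Char(0x0010ffff))    # true, the largest valid Unicode code point
Char(0x00200000)             # throws Base.CodePointError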
In Julia 0.6.2 you can create a, b, and c, but the problem is that b*c is equivalent to string(b, c) (thus it will convert the result to String in the end, no matter what types you pass to it as arguments). If you dig deep enough, you will find that this eventually calls write(s::IO, ch::Char) with ch equal to a, and if you look at the definition of this method you will see that for a it will produce '\ufffd', and this is what you get.
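You can reproduce that last step directly with a small sketch (assuming Julia 0.6.2; the io buffer below is only for illustration):
io = IOBuffer()
write(io, Char(69058047))    # the write(s::IO, ch::Char) fallback described above
String(take!(io))            # "\ufffd", the original value is gone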
Julia 0.6.2 will emit '\ufffd' for all invalid Unicode, i.e. any code point larger than 0x0010ffff.
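This also explains the 65533 printed in the question: it is exactly the code point of the replacement character.
Int('\ufffd')    # 65533, i.e. 0xfffd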