julia, utf-32

Is it possible to concatenate strings with large Chars (UTF32String) in Julia?


Construct two UTF32String values, b and c, where b contains a Char with a large value:

using LegacyStrings
a=Char(69058047)
b=UTF32String(a)
c=UTF32String("")

Now concatenate b and c into d:

d=b*c

Now inspect b, c, and d. b retains its value, but d has type UTF8String and its value is lost: the character has been replaced by 65533, which is below 2^16.

julia> typeof(d)
UTF8String

julia> typeof(b)
UTF32String

julia> typeof(c)
UTF32String

julia> D=Int(Char(d[1]))
65533

julia> B=Int(Char(b[1]))
69058047

Julia 0.4 and 0.6 give the same result. Is there a workaround for operating on strings with large Chars?


Solution

  • Since the Char representation changes in Julia 0.7, the answer depends on the version of Julia you use.

    Julia 0.7

    On Julia 0.7 (which is probably what you should target, since you will have to switch to it in the long run anyway) you get:

    julia> a=Char(69058047)
    ERROR: Base.CodePointError(0x041dbdff)
    Stacktrace:
     [1] code_point_err(::UInt32) at .\char.jl:10
     [2] Type at .\char.jl:42 [inlined]
     [3] Char(::Int64) at .\boot.jl:682
     [4] top-level scope
    

    In short, you are not allowed to create it at all.

    It is important to know that the constructor throws a conversion error only for values above 0x001fffff, even though every value above 0x0010ffff (the maximum valid Unicode code point) is already invalid. This is a catch that you simply have to remember in 0.7.

    The reason is that values up to 0x001fffff can be mapped to the UTF-8 byte layout (a four-byte sequence carries 21 payload bits), although some of those UTF-8 representations are invalid. Larger values are impossible to map.
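
    The threshold can be checked with a short sketch. It assumes Julia 0.7 or later, and the variable names (ok, odd, rejected) are only illustrative:

    ```julia
    # Sketch for Julia 0.7 and later; variable names are illustrative only.
    ok  = Char(0x0010ffff)   # largest valid Unicode code point
    odd = Char(0x001fffff)   # accepted by the constructor, but not valid Unicode

    println(isvalid(ok))     # validity of the largest Unicode code point
    println(isvalid(odd))    # validity of the largest constructible value

    # Anything above 0x001fffff is rejected when the Char is constructed.
    rejected = try
        Char(0x00200000)
        false
    catch err
        err isa Base.CodePointError
    end
    println(rejected)
    ```

    So a Char can exist without being valid, and isvalid is the check to use if you need to tell the two cases apart.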

    Julia 0.6.2

    Here you can create a, b, and c. The problem is that b*c is equivalent to string(b, c), so the result is converted to String no matter what types you pass as arguments. If you dig deep enough, this ends up calling write(s::IO, ch::Char) with ch equal to a, and the definition of that method shows that for a it writes '\ufffd'. This is what you get.

    Julia 0.6.2 emits '\ufffd' for every invalid code point, i.e. any value larger than 0x0010ffff.
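
    This also explains the 65533 in the question, since 0xfffd is 65533 in decimal. The sketch below only mimics the replacement rule described above. It is not the source of Julia's write method, and emitted_code_point and REPLACEMENT_CODE_POINT are hypothetical names introduced for illustration:

    ```julia
    # Hypothetical sketch of the replacement rule described above.
    # It is not the implementation of write(s::IO, ch::Char) in Julia 0.6.
    const REPLACEMENT_CODE_POINT = 0x0000fffd   # U+FFFD

    # Return the code point that would end up in the concatenated string.
    emitted_code_point(cp::Integer) =
        cp > 0x0010ffff ? REPLACEMENT_CODE_POINT : UInt32(cp)

    lost = emitted_code_point(69058047)     # the value from the question
    kept = emitted_code_point(0x0001f600)   # a valid code point passes through

    println(Int(lost))
    println(Int(kept))
    ```

    Under this rule any code point above 0x0010ffff collapses to the same replacement value, so the original value cannot be recovered from d.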