Tags: r, file-io, utf-8, ascii, file-encodings

How to write and read printable ASCII characters to/from a UTF-8 encoded file?


I want to write a UTF-8 encoded file containing the character 10001100 (0x8C), which is Œ, the Latin capital ligature OE, in the extended ASCII table:

zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)

When I open the file in Office (encoding = UTF-8), I can see Œ. Why can't I read it back with readBin?

zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)

Solution

  • There are multiple difficulties here.

    So, to write UTF-8 from a CP1252 byte given as a binary string, first convert the string to a number and then to a "raw" value (the R class for bytes), turn that byte into a character, and change its encoding from CP1252 to UTF-8 (that is, convert its byte value to the byte sequence for the same character in UTF-8). After that, convert it back to raw and write it to the file:

    char_bin_str <- '10001100'
    char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
                  # values, innermost call outwards: '10001100' -> 140 -> 8c -> "\x8c"
                    from="CP1252",
                    to="UTF-8")
    
    test.file <- "~/test-unicode-bytes.txt"
    
    zz <- file(test.file, 'wb')
    writeBin(charToRaw(char_u), zz)
    close(zz)
    

    This should keep things under control, write the correct UTF-8 bytes (c5 92 for Œ), and behave the same on every OS. Hope it helps.
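
    To read the character back, which the question also asks about, something along these lines should work. It is a minimal sketch rather than the only way: the temporary file and the variable names are illustrative assumptions. It reads every byte with readBin() (n defaults to 1, so n is set to the file size) and then marks the result as UTF-8. Note that readBin(zz, character()) expects zero-terminated strings, which explains the "incomplete string" warning on a file that holds only the two UTF-8 bytes.

    ```r
    # Write the UTF-8 bytes for the CP1252 byte 0x8c to a scratch file.
    char_u <- iconv(rawToChar(as.raw(0x8c)), from = "CP1252", to = "UTF-8")
    tmp <- tempfile()                 # illustrative path, not from the question
    con <- file(tmp, "wb")
    writeBin(charToRaw(char_u), con)
    close(con)

    # Read all bytes back; n must cover the whole file (the default is n = 1).
    con <- file(tmp, "rb")
    bytes <- readBin(con, what = raw(), n = file.info(tmp)$size)
    close(con)

    # Turn the bytes into a string and declare its encoding.
    txt <- rawToChar(bytes)
    Encoding(txt) <- "UTF-8"

    print(bytes)            # the raw bytes read from the file
    print(utf8ToInt(txt))   # the Unicode code point of the character
    unlink(tmp)
    ```

    For a text file, readLines(path, encoding = "UTF-8") is another option; the raw route above just makes the bytes visible at each step.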


    PS: I am not exactly sure why x returned c5 in your code; I guess it would have returned c5 92 if you had passed n=2 (or more) to readBin(), since readBin() reads only one element by default. On my machines (Mac OS X 10.7 with R 3.0.2, and Win XP with R 2.15), your code returns 31, the hex ASCII code of '1' (the first character of '10001100'), which makes sense. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there before coming back to R?
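
    The byte values discussed in this PS can be checked without touching a file, because writeBin() and readBin() also accept a raw vector in place of a connection. The following is a small sketch along those lines; the variable names are made up for illustration.

    ```r
    # What writeBin("10001100", zz) puts in the file: the eight ASCII characters
    # of the string followed by a terminating zero byte, not the single byte 0x8c.
    written <- writeBin("10001100", raw())
    print(written)

    # The UTF-8 encoding of U+0152 (the OE ligature) is the two bytes c5 92.
    utf8_bytes <- as.raw(c(0xc5, 0x92))

    # readBin() returns one element unless n says otherwise.
    first_only <- readBin(utf8_bytes, what = raw())
    both <- readBin(utf8_bytes, what = raw(), n = 2)
    print(first_only)
    print(both)
    ```

    The first element of written is 31, which matches the value reported above for the question's code.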