Tags: r, utf-8, byte-order-mark, export-to-csv

Export UTF-8 BOM to .csv in R


I am reading a table from a MySQL database via RJDBC, and all letters display correctly in R (e.g., נווה שאנן). However, even when exporting with write.csv and fileEncoding="UTF-8", the output for Bulgarian, Hebrew, Chinese and so on looks like <U+0436>.<U+043A>. <U+041B><U+043E><U+0437><U+0435><U+043D><U+0435><U+0446> (in this case a Bulgarian string, not the one above). Other special characters like ã, ç, etc. work fine.

I suspect this is caused by the UTF-8 BOM, but I did not find a solution on the net.

My OS is a German Windows 7.

Edit: I tried

con<-file("file.csv",encoding="UTF-8")
write.csv(x,con,row.names=FALSE)

and the (as far as I know) equivalent write.csv(x, file="file.csv", fileEncoding="UTF-8", row.names=FALSE).
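Neither call writes a byte-order mark, which is what programs like Excel on Windows look for before treating a .csv as UTF-8. One possible workaround (my own sketch, not part of the original question; write_csv_bom is a hypothetical helper name) is to write the three BOM bytes yourself and then append the data:

```r
# Sketch: prepend the UTF-8 BOM (EF BB BF) by hand, then append the data.
write_csv_bom <- function(X, path) {
  writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), path)            # BOM first
  write.table(X, path, sep = ",", row.names = FALSE,
              fileEncoding = "UTF-8", append = TRUE)     # data after the BOM
}
```

write.table warns about appending column names to an existing file; the warning is harmless here since the file only contains the BOM at that point.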


Solution

  • On the help page for Encoding (help("Encoding")) you can read about a special encoding: "bytes".

    Using this, I was able to generate a correct CSV file as follows:

    v <- "נווה שאנן"
    X <- data.frame(v1=rep(v,3), v2=LETTERS[1:3], v3=0, stringsAsFactors=FALSE)
    
    Encoding(X$v1) <- "bytes"
    write.csv(X, "test.csv", row.names=FALSE)
    

    Take care about the difference between factor and character columns. The following should handle both:

    id_characters <- which(sapply(X,
        function(x) is.character(x) && any(Encoding(x) == "UTF-8")))
    for (i in id_characters) Encoding(X[[i]]) <- "bytes"
    
    id_factors <- which(sapply(X,
        function(x) is.factor(x) && any(Encoding(levels(x)) == "UTF-8")))
    for (i in id_factors) Encoding(levels(X[[i]])) <- "bytes"
    
    write.csv(X, "test.csv", row.names=FALSE)
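For convenience, the two loops above can be folded into a small helper (mark_utf8_as_bytes is my own hypothetical name, not from the answer) that returns a copy of the data frame with every UTF-8 character or factor column re-marked as "bytes" before writing:

```r
# Sketch of the answer's technique as one function: re-mark UTF-8 columns
# (character and factor) with the "bytes" encoding so write.csv() emits
# the raw UTF-8 bytes instead of <U+xxxx> escapes.
mark_utf8_as_bytes <- function(X) {
  for (i in seq_along(X)) {
    if (is.character(X[[i]]) && any(Encoding(X[[i]]) == "UTF-8")) {
      Encoding(X[[i]]) <- "bytes"
    } else if (is.factor(X[[i]]) && any(Encoding(levels(X[[i]])) == "UTF-8")) {
      Encoding(levels(X[[i]])) <- "bytes"
    }
  }
  X
}
```

Usage: write.csv(mark_utf8_as_bytes(X), "test.csv", row.names = FALSE).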