I have a large Stata file that I think contains French accented characters that have been saved badly.
When I import the file with the encoding argument left blank, it won't read in at all. When I set it to latin1, it reads in, but in one variable (and, I'm certain, in others) French accented characters are not rendered properly. I had a similar problem with another Stata file, and I tried to apply the fix here (it did not actually work in that case, but it seems on point).
To be honest, that seems to be the real problem here somehow: a lot of the garbled characters are "actual" and they match up to what is "expected". But I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding <- Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy <- ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned <- stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()
Assuming a UTF-8 locale, which can be checked with:
Sys.getlocale()
#> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
At first we had this string somewhere, and everything was fine:
utf8 <- "Producteur télé"
Encoding(utf8)
#> [1] "UTF-8"
charToRaw(utf8) # é encoded to c3 a9 as expected for utf-8
#> [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
utf8
#> [1] "Producteur télé"
But something bad happened: the string was treated as a latin1 string, for which c3 and a9 are 2 separate characters ("Ã" and "©"), and was wrongly converted from latin1 to UTF-8. So now, instead of "é" in UTF-8, we have "Ã©" in UTF-8, with "Ã" encoded as c3 83 and "©" as c2 a9:
oops <- iconv(utf8, from = "latin1", to = "UTF-8")
Encoding(oops)
#> [1] "UTF-8"
charToRaw(oops)
#> [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 83 c2 a9 6c c3 83 c2 a9
oops
#> [1] "Producteur tÃ©lÃ©"
This string is no longer a proper (meaningful) UTF-8 or latin1 string: "é" is e9 in latin1, or c3 a9 in UTF-8, but never c3 83 c2 a9!
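In a real dataset not every value will be double-encoded (some entries may be clean), so before undoing anything it helps to flag the suspicious strings. Here is a rough heuristic of my own, not from the original post, and only an assumption: genuine French text rarely contains "Ã" followed by a latin1 symbol, while double-encoded text almost always does ("é" becomes "Ã©", "è" becomes "Ã¨", and so on).

```r
# Heuristic sketch (an assumption, not a guaranteed test): flag strings that
# contain "Ã" (U+00C3) followed by a character from the latin1 symbol range,
# the telltale pattern of latin1 -> UTF-8 double encoding.
looks_double_encoded <- function(x) {
  grepl("\u00c3[\u0080-\u00bf]", x)
}

good <- "Producteur t\u00e9l\u00e9"                 # "Producteur télé"
bad  <- iconv(good, from = "latin1", to = "UTF-8")  # simulate the mangling
looks_double_encoded(c(good, bad))
#> [1] FALSE  TRUE
```

This can produce false positives on unusual but legitimate text, so it is worth eyeballing the flagged rows before converting them.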
We can undo the bad translation though:
proper_utf8_encoding_with_latin1_marking <-
iconv(oops, from = "UTF-8", to = "latin1")
Encoding(proper_utf8_encoding_with_latin1_marking)
#> [1] "latin1"
# c3 a9 is é in utf-8, not in latin1!
charToRaw(proper_utf8_encoding_with_latin1_marking)
#> [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
proper_utf8_encoding_with_latin1_marking
#> [1] "Producteur tÃ©lÃ©"
From there we can build either a proper UTF-8 string (recommended) or a proper latin1 string:
utf8 <- proper_utf8_encoding_with_latin1_marking
Encoding(utf8) <- "UTF-8"
Encoding(utf8)
#> [1] "UTF-8"
charToRaw(utf8)
#> [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
utf8
#> [1] "Producteur télé"
latin1 <-
iconv(proper_utf8_encoding_with_latin1_marking, from = "UTF-8", to = "latin1")
Encoding(latin1)
#> [1] "latin1"
charToRaw(latin1) # e9 is é in latin1
#> [1] 50 72 6f 64 75 63 74 65 75 72 20 74 e9 6c e9
latin1
#> [1] "Producteur télé"
Part of encoding hell is that R sees these two MOSTLY as the same, because most of the time the difference doesn't matter:
identical(utf8, latin1)
#> [1] TRUE
But the truth can be seen with the Encoding() and charToRaw() functions, or when serializing, which records both pieces of information:
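Before even serializing, comparing the raw bytes directly already breaks the illusion. This is a small self-contained sketch of mine mirroring the utf8 and latin1 strings above (same text, same encoding marks, different bytes):

```r
a <- "Producteur t\u00e9l\u00e9"                # UTF-8 string: é is c3 a9
b <- iconv(a, from = "UTF-8", to = "latin1")    # latin1 string: é is e9
identical(a, b)                                 # TRUE: compares the text
identical(charToRaw(a), charToRaw(b))           # FALSE: compares the bytes
```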
waldo::compare(
serialize(utf8, NULL),
serialize(latin1, NULL)
)
#> `old[31:42]`: "01" "00" "00" "80" "09" "00" "00" "00" "11" "50" and 2 more...
#> `new[31:42]`: "01" "00" "00" "40" "09" "00" "00" "00" "0f" "50" ...
#>
#> `old[49:56]`: "72" "20" "74" "c3" "a9" "6c" "c3" "a9"
#> `new[49:54]`: "72" "20" "74" "e9" "6c" "e9"
The 3 differences we see above are the encoding marking (80 for UTF-8, 40 for latin1, 00 for unknown), the length in bytes (hex 11 = 17 in decimal, hex 0f = 15), and the byte values of the "é" characters (c3 a9 vs e9).
Fun fact: if we change the locale to latin1 (here on a Mac), for reasons that I don't understand, oops will actually print "é" properly (and the others won't print well anymore). This proves that we can't always trust print() and identical(), and that charToRaw(), Encoding(), and iconv() are your friends for debugging encoding hell.
Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")
oops
#> [1] "Producteur télé"
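Putting the round trip back into the question's context, a vectorized fix might look like the sketch below. This is a hypothetical helper of my own, not from the original post, and it assumes only some values in a column were double-encoded (it leaves clean strings untouched):

```r
# Hypothetical helper (my sketch): reverse one round of latin1 -> UTF-8
# double encoding, touching only values that show the telltale "Ã" pattern.
fix_double_encoding <- function(x) {
  mangled <- !is.na(x) & grepl("\u00c3[\u0080-\u00bf]", x)
  # Converting back to latin1 recovers the original UTF-8 bytes...
  fixed <- iconv(x[mangled], from = "UTF-8", to = "latin1")
  # ...and re-marking them as UTF-8 yields a proper string again.
  Encoding(fixed) <- "UTF-8"
  x[mangled] <- fixed
  x
}

fix_double_encoding("Producteur tÃ©lÃ©")
#> [1] "Producteur télé"
```

Applied to the question's data, this would be something like `ces19web$pes19_occ_text <- fix_double_encoding(ces19web$pes19_occ_text)`, untested against the full file; any strings mangled more than once would need another pass.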