r utf-8 haven labelled

Modify encodings of accented characters in value labels


I am having a very hard time with accented characters in a Stata file that I have to import into R. I solved one problem over here, but there is another.

After import, any time I use the look_for() function from the labelled package, I get this error:

remotes::install_github("sjkiss/cesdata")
library(cesdata)
data("ces19web")
library(labelled)
look_for(ces19web, "vote")
  invalid multibyte string at '<e9>bec Solidaire'

I can find one variable whose value labels include that text, but there it displays properly, so I don't know what is going on.

val_labels(ces19web$pes19_provvote)

But there are other problematic value labels. For example, the value labels for the 13th variable trigger the error:

library(dplyr)  # for %>% and select()
# This works fine
ces19web %>% 
  select(1:12) %>% 
  look_for(., "[a-z]")
# This chokes

ces19web %>% 
  select(1:13) %>% 
  look_for(., "[a-z]")

# See the accented character
val_labels(ces19web[,13])

I have come up with this way of replacing the accented characters of the second type.

names(val_labels(ces19web$cps19_imp_iss_party))<-iconv(names(val_labels(ces19web$cps19_imp_iss_party)), from="latin1", to="UTF-8")

And this even solves the problem for look_for()

#This now works!
ces19web %>% 
  select(1:13) %>% 
  look_for(., "[a-z]")

But what I need is a way to loop through the names of all the value labels and make this conversion for every bungled accented character.

This is so close, but I don't know how to save the results as the new names for the value labels:

library(purrr)  # for map()
ces19web %>% 
  # map onto all the variables and get the value labels
  map(., val_labels) %>% 
  # map onto each set of value labels
  map(., ~{
    # Skip if there are no value labels
    if (!is.null(.x)) {
      # If not, convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
  }) -> out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]

But how do I make that conversion actually stick in the original dataset?

Thank you!


Solution

  • The problem is with the character data: every string's encoding is marked as either "unknown" (i.e. no non-ASCII characters) or UTF-8, yet some of the strings are really latin1. For instance, 0xe9 is the latin1 encoding of "é".

    Assuming all character variables are actually latin1, you can do this:

    ces19web_corr <- as.data.frame(lapply(ces19web, function(v) {
      if (is.character(v)) {
        Encoding(v) <- "latin1"
        v <- iconv(v, from = "latin1", to = "UTF-8")
      } else if (is.factor(v)) {
        lev <- levels(v)
        Encoding(lev) <- "latin1"
        lev <- iconv(lev, from = "latin1", to = "UTF-8")
        levels(v) <- lev
      }
      v
    }))
    

    Alternatively, if only some of them have the problem, you will have to select which ones to fix.
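
One way to pick out the columns that need fixing is to flag strings that are not valid UTF-8. This is a minimal sketch using base R's `validUTF8()`; the helper name `needs_fix` and the toy data frame `df` are made up for illustration and stand in for `ces19web`. It only looks at character columns, and it cannot catch latin1 bytes that happen to form a valid UTF-8 sequence.

```r
# Toy data: `ok` is already UTF-8, `bad` holds raw latin1 bytes (0xe9 = "é").
df <- data.frame(
  ok  = c("Qu\u00e9bec", "Ontario"),
  bad = c("Qu\xe9bec", "Ontario"),
  num = c(1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if a character column contains any string
# whose bytes are not valid UTF-8.
needs_fix <- function(v) {
  is.character(v) && !all(validUTF8(v))
}

# Names of the columns that were flagged.
to_fix <- names(df)[vapply(df, needs_fix, logical(1))]
to_fix

# Convert only the flagged columns, assuming their bytes are latin1.
df[to_fix] <- lapply(df[to_fix], function(v) {
  iconv(v, from = "latin1", to = "UTF-8")
})
```

Columns that are already clean are left untouched, which avoids double-encoding strings that were valid UTF-8 to begin with.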


    Side comment: you may have applied my fix from the other post to a data file (or to some of its columns) that doesn't have the problem described in your other question. In that case you accidentally forced the wrong encoding, and the code above simply forces the right one back.
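
The code above covers character and factor columns. For the value labels the question asks about, a similar approach may work: haven's labelled vectors keep their value labels in a `labels` attribute whose names are the label text, so the names can be converted and the attribute written back column by column. This is a minimal sketch under that assumption; `fix_label_names` and the toy `df` are hypothetical, and only names that are not already valid UTF-8 are converted.

```r
# Toy stand-in for a labelled column: a numeric vector with a "labels"
# attribute whose names contain raw latin1 bytes (0xe9 = "é").
labs <- c(1, 2)
names(labs) <- c("Qu\xe9bec Solidaire", "Other")
v <- c(1, 2, 1)
attr(v, "labels") <- labs

df <- data.frame(id = 1:3)
df$v <- v

# Hypothetical helper: convert the names of a column's value labels to
# UTF-8 and return the column with the updated attribute.
fix_label_names <- function(col, from = "latin1") {
  col_labs <- attr(col, "labels", exact = TRUE)
  if (!is.null(names(col_labs))) {
    nm <- names(col_labs)
    bad <- !validUTF8(nm)                      # leave valid names alone
    nm[bad] <- iconv(nm[bad], from = from, to = "UTF-8")
    names(col_labs) <- nm
    attr(col, "labels") <- col_labs
  }
  col
}

# Assigning into df[] replaces every column while keeping the data frame.
df[] <- lapply(df, fix_label_names)
names(attr(df$v, "labels"))
```

On the real data the equivalent call would be `ces19web[] <- lapply(ces19web, fix_label_names)`. I have not run it against that file, so check a few columns with `val_labels()` afterwards. Variable labels (the `label` attribute) may need the same treatment.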