Encoding in R script


I am using the foreign package to read in 8 SPSS files. When they are read in, some are re-encoded as UTF-8 and some as CP1252.

In my R script I want to compare an SPSS level with a piece of text. The test fails because of the "wrong" kind of dash.

> "Not working - long term sick or disabled" == "Not working – long term sick or disabled"
[1] FALSE
> "-" == "–"
[1] FALSE

Every time I re-open the R script in RStudio, I have to change the dashes back to the longer versions. Can I save the R script so that the dashes are consistent with the levels in the SPSS file text?

> getOption("encoding")
[1] "native.enc"

Solution

  • Find out which character you are dealing with:

    Unicode::as.u_char(utf8ToInt("-"))
    #[1] U+002D
    Unicode::as.u_char(utf8ToInt("–"))
    #[1] U+2013
    

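    Then use the corresponding Unicode escapes in your script for comparisons, so the result does not depend on how the script file is saved or re-opened: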

    "-" == "\u002D"
    #[1] TRUE
    
    "\u2013" == "–"
    #[1] TRUE
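
Applied to the level from the question, this could look roughly like the sketch below. The names `spss_level`, `target` and `normalise_dash` are made up for illustration, and `spss_level` stands in for a value that would really come from the SPSS file. The second half is an alternative not shown above: replace the en dash with a plain hyphen before comparing.

```r
# Hypothetical stand-in for a level read from the SPSS file;
# it contains an en dash (U+2013), written here as an escape.
spss_level <- "Not working \u2013 long term sick or disabled"

# Write the comparison text with the escape rather than a literal en dash,
# so the script file itself contains only ASCII characters.
target <- "Not working \u2013 long term sick or disabled"

spss_level == target
#[1] TRUE

# Alternative: map en dashes to plain hyphens before comparing
# (normalise_dash is an illustrative helper, not part of any package).
normalise_dash <- function(x) gsub("\u2013", "-", x, fixed = TRUE)

normalise_dash(spss_level) == "Not working - long term sick or disabled"
#[1] TRUE
```

Because the script then holds only ASCII, it should no longer matter which encoding the editor uses when it saves or re-opens the file.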