r text encoding character-encoding

Possible issues with text encoding


I'm having a strange problem when I read a .csv file using read_csv. I'm afraid I can't provide a reproducible example, because the issue may depend on my current R/RStudio session and how it interacts with the file being read, so reproducing it would require both the file and a similar setup. I suspect a text-encoding mismatch, but I know little about this, so I'm asking the Stack Overflow hive mind for advice.

In any case, here's the behavior I have.

The tibble is named `fl` and contains the FIPS codes for all the administrative districts in the U.S. I'm using a 'Census Area' in Alaska as the example, but there are other similar cases in the same tibble.

The following produces sensible output.

> fl %>% filter(str_detect(fl$NAME, 'Hoonah'))
# A tibble: 1 × 6
  FIPS  NAME                      STATEFPn COUNTYFPn STATEFP COUNTYFP
  <chr> <chr>                        <int>     <int> <chr>   <chr>   
1 02105 Hoonah–Angoon Census Area        2       105 02      105     

But, if I do the following, by typing the whole NAME at the console prompt, I get nothing.

> fl %>% filter(NAME=='Hoonah-Angoon Census Area')
# A tibble: 0 × 6
# … with 6 variables: FIPS <chr>, NAME <chr>, STATEFPn <int>, COUNTYFPn <int>, STATEFP <chr>,
#   COUNTYFP <chr>
# ℹ Use `colnames()` to see all variable names

However, if I copy-and-paste from the first output, it works and I get this.

> fl %>% filter(NAME=='Hoonah–Angoon Census Area')
# A tibble: 1 × 6
  FIPS  NAME                      STATEFPn COUNTYFPn STATEFP COUNTYFP
  <chr> <chr>                        <int>     <int> <chr>   <chr>   
1 02105 Hoonah–Angoon Census Area        2       105 02      105     

I suspect this is some sort of character-encoding mismatch between my RStudio session and the file, even though, to the best of my knowledge, both the file (as checked by guess_encoding()) and my session (as set in 'file:save with encoding' and then 'file:reopen with encoding') report 'UTF-8'.

Any ideas about what is happening?


Solution

  • Your issue is not really about encoding. Even within a single encoding, some characters look alike to the eye but are different characters. In your case it is the dash: the name stored in your data contains an en dash (–, U+2013), whereas the string you typed at the console contains a plain hyphen (-, U+002D), so the two strings are not equal.

    fl$NAME <- gsub("–", "-", fl$NAME)
    

    The line above replaces every en dash in NAME with a hyphen, so you can type an ordinary hyphen, as you have been doing, and it should match.
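
If you want to confirm which character is actually in the data, one option is to compare the Unicode code points of the two strings. The following is a minimal sketch using base R only; `from_file` and `typed` are hypothetical stand-ins for a value taken from `fl$NAME` and for the string typed at the console, not names from the original post. It also writes the en dash as the escape `"\u2013"`, which I would assume is a little more robust than pasting the character into a script.

```r
# Hypothetical stand-ins: the value as stored in the data vs. what was typed.
from_file <- "Hoonah\u2013Angoon Census Area"  # contains an en dash (U+2013)
typed     <- "Hoonah-Angoon Census Area"       # contains a hyphen-minus (U+002D)

# The strings look alike when printed but do not compare equal.
same <- from_file == typed

# Convert each string to its integer code points and find where they differ.
cp_file  <- utf8ToInt(from_file)
cp_typed <- utf8ToInt(typed)
diff_pos <- which(cp_file != cp_typed)

# Show the differing characters as U+XXXX code points.
file_char  <- sprintf("U+%04X", cp_file[diff_pos])
typed_char <- sprintf("U+%04X", cp_typed[diff_pos])

# Same replacement as in the answer, with the en dash written as an escape
# and fixed = TRUE so the pattern is treated as a literal string.
cleaned <- gsub("\u2013", "-", from_file, fixed = TRUE)
now_same <- cleaned == typed
```

Printing `file_char` and `typed_char` shows the two code points side by side, which makes it clear that the mismatch is a different character rather than a different encoding of the same character.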