Tags: r, linux, encoding, special-characters, read-csv

How to fix broken German special characters when reading CSV file?


I am trying to load a German CSV file into R on Fedora Linux 42, but the region names containing special characters (like ü, ö, ß) are not rendered correctly.

Here's a simplified version of what I'm doing:

library(readr)

df <- read_delim(
  "APExpose_DE__2003-2022__nogeo.csv",
  delim = ";",
  locale = locale(encoding = "UTF-8")
)

The output contains garbled characters like:

> head(df$kreis, 20)
 [1] "Freiburg im Breisgau"      "Dillingen an der Donau"    "NÃ¼rnberg"                 "Neumarkt in der Oberpfalz"
 [5] "Rosenheim"                 "Wilhelmshaven"             "Wittmund"                  "Recklinghausen"           
 [9] "Steinfurt"                 "Kusel"                     "Neuwied"                   "Neunkirchen"              
[13] "Ostholstein"               "Amberg"                    "FÃ¼rth"                    "GÃ¼nzburg"                
[17] "Regensburg"                "RhÃ¶n-Grabfeld"            "Elbe-Elster"               "Marburg-Biedenkopf"       

The file is known to be encoded in UTF-8, and Sys.getlocale() on my machine returns:

> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

What's the correct way to read this CSV file so that German characters like ü and ö render properly on Linux (Fedora 42), using readr or base::read.csv()?

Edit:

The command below:

hexdump -C APExpose_DE__2003-2022__nogeo.csv | less

Gives the following output:

00000000  6b 72 65 69 73 3b 79 65  61 72 3b 4e 4f 32 5f 61  |kreis;year;NO2_a|
00000010  6e 6e 75 61 6c 4d 65 61  6e 3b 4e 4f 32 5f 68 72  |nnualMean;NO2_hr|
00000020  4f 76 65 72 32 30 30 3b  4e 4f 5f 61 6e 6e 75 61  |Over200;NO_annua|
00000030  6c 4d 65 61 6e 3b 4f 33  5f 61 6e 6e 75 61 6c 4d  |lMean;O3_annualM|
00000040  65 61 6e 3b 4f 33 5f 64  61 79 73 4f 76 65 72 31  |ean;O3_daysOver1|
00000050  32 30 3b 4f 33 5f 64 61  69 6c 79 4d 61 78 41 6e  |20;O3_dailyMaxAn|
00000060  6e 75 61 6c 4d 65 61 6e  3b 4f 33 5f 64 61 69 6c  |nualMean;O3_dail|
00000070  79 48 6f 75 72 6c 79 4d  61 78 3b 4f 33 5f 64 61  |yHourlyMax;O3_da|
00000080  69 6c 79 38 48 72 4d 61  78 3b 50 4d 31 30 5f 61  |ily8HrMax;PM10_a|
00000090  6e 6e 75 61 6c 4d 65 61  6e 3b 50 4d 31 30 5f 64  |nnualMean;PM10_d|
000000a0  61 79 73 4f 76 65 72 35  30 3b 50 4d 32 2e 35 5f  |aysOver50;PM2.5_|
000000b0  61 6e 6e 75 61 6c 4d 65  61 6e 3b 6b 72 65 69 73  |annualMean;kreis|
000000c0  5f 63 6f 64 65 3b 73 63  65 6e 61 72 69 6f 0a 46  |_code;scenario.F|
000000d0  72 65 69 62 75 72 67 20  69 6d 20 42 72 65 69 73  |reiburg im Breis|
000000e0  67 61 75 3b 32 30 31 39  3b 31 36 2e 30 31 37 30  |gau;2019;16.0170|
000000f0  32 3b 30 3b 38 2e 30 32  30 35 37 31 3b 35 35 2e  |2;0;8.020571;55.|
00000100  34 38 35 39 31 3b 36 3b  38 34 2e 31 33 30 36 38  |48591;6;84.13068|
00000110  3b 32 31 38 2e 35 3b 32  30 36 2e 39 32 35 3b 31  |;218.5;206.925;1|
00000120  32 2e 34 37 33 32 33 3b  32 3b 38 2e 33 33 32 37  |2.47323;2;8.3327|
00000130  38 31 3b 31 32 3b 72 65  6d 6f 74 65 0a 44 69 6c  |81;12;remote.Dil|
00000140  6c 69 6e 67 65 6e 20 61  6e 20 64 65 72 20 44 6f  |lingen an der Do|
00000150  6e 61 75 3b 32 30 31 39  3b 31 33 2e 34 38 39 39  |nau;2019;13.4899|
00000160  32 31 37 31 30 37 35 35  35 3b 30 2e 30 30 31 38  |217107555;0.0018|
00000170  31 37 35 38 34 39 31 37  36 35 36 34 32 3b 39 2e  |1758491765642;9.|
00000180  34 34 31 32 33 37 33 35  35 30 32 39 33 31 3b 35  |44123735502931;5|
00000190  37 2e 38 33 30 36 33 36  39 37 36 30 34 33 39 3b  |7.8306369760439;|
000001a0  33 2e 39 34 39 35 36 30  33 30 35 36 34 31 34 32  |3.94956030564142|
000001b0  3b 37 39 2e 37 39 32 30  31 38 31 37 31 39 39 38  |;79.792018171998|
000001c0  34 3b 31 37 35 2e 37 33  35 33 32 30 38 34 38 33  |4;175.7353208483|
000001d0  31 37 3b 31 36 30 2e 33  31 37 30 38 31 30 32 32  |17;160.317081022|
000001e0  30 34 36 3b 31 33 2e 38  34 38 34 38 34 33 32 32  |046;13.848484322|
000001f0  39 34 35 35 3b 31 2e 37  36 33 32 37 35 34 32 32  |9455;1.763275422|
00000200  38 36 38 31 3b 31 30 2e  34 39 35 31 30 32 33 35  |8681;10.49510235|
00000210  37 30 33 30 36 3b 36 38  3b 72 65 6d 6f 74 65 0a  |70306;68;remote.|
00000220  4e c3 83 c2 bc 72 6e 62  65 72 67 3b 32 30 31 39  |N....rnberg;2019|
00000230  3b 32 35 2e 33 38 30 30  33 3b 30 3b 31 32 2e 37  |;25.38003;0;12.7|
00000240  35 36 34 36 3b 34 35 2e  35 38 38 36 31 3b 30 3b  |5646;45.58861;0;|
00000250  37 33 2e 36 39 35 34 36  3b 31 36 30 2e 33 34 34  |73.69546;160.344|
00000260  31 3b 31 35 35 2e 36 37  31 38 3b 31 37 2e 36 31  |1;155.6718;17.61|
00000270  30 30 35 31 35 35 32 30  37 35 38 3b 31 2e 37 34  |00515520758;1.74|
00000280  30 31 37 37 33 35 36 39  39 30 36 33 3b 31 31 2e  |017735699063;11.|
00000290  35 31 37 35 35 3b 31 30  37 3b 72 65 6d 6f 74 65  |51755;107;remote|
000002a0  0a 4e 65 75 6d 61 72 6b  74 20 69 6e 20 64 65 72  |.Neumarkt in der|
000002b0  20 4f 62 65 72 70 66 61  6c 7a 3b 32 30 31 39 3b  | Oberpfalz;2019;|
000002c0  31 33 2e 34 30 30 37 35  32 32 38 39 34 34 36 37  |13.4007522894467|
000002d0  3b 30 2e 30 30 31 36 39  38 33 36 32 39 35 34 36  |;0.0016983629546|
000002e0  31 36 32 35 3b 37 2e 35  39 38 35 38 37 32 31 35  |1625;7.598587215|
000002f0  34 37 32 38 31 3b 35 37  2e 31 39 38 33 31 31 39  |47281;57.1983119|


Solution

  • Looking at the file, it is incorrectly encoded. The bytes do form valid UTF-8, but they don't encode the characters you are expecting.

    For example, in Nürnberg, the "ü" in the file is encoded as c3 83 c2 bc, but it should be c3 bc: the value was double-encoded. The code point for "ü" is 252 (U+00FC). UTF-8 has rules for turning code points into bytes, which you can test with charToRaw(intToUtf8(252)); it shows this character should be encoded as c3 bc. But if you misinterpret those two bytes as code points themselves and encode each of them again, you essentially do charToRaw(intToUtf8(as.numeric(charToRaw(intToUtf8(252))))), which returns the bad value c3 83 c2 bc.
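The round trip can be reproduced directly at the R console, using only the base functions mentioned above:

```r
# Correct UTF-8 encoding of "ü" (code point 252 / U+00FC):
correct <- charToRaw(intToUtf8(252))
correct
# [1] c3 bc

# Misinterpret those two bytes as code points and UTF-8-encode each again,
# which is exactly the double encoding seen in the hexdump:
double <- charToRaw(intToUtf8(as.numeric(correct)))
double
# [1] c3 83 c2 bc
```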

    Because the format is so mangled, I don't know of any easy built-in way to fix it. Here's a somewhat messy way to decode the bytes: we look for bytes > 127, assume these bad values come in clumps of 4 bytes, and reduce each clump of 4 back to the 2 bytes it should have been.

    unencode <- function(p) {
      # Read the whole file as raw bytes.
      xx <- readBin(p, raw(), n = file.info(p)$size)
      out <- raw(length(xx))
      output_idx <- 1
      input_idx <- 1
      while (input_idx <= length(xx)) {  # <= so the final byte isn't dropped
        if (xx[input_idx] > 127) {
          # A double-encoded character: 4 bytes that should have been 2.
          # Mask off the UTF-8 marker bits (0x1F on lead bytes, 0x3F on
          # continuation bytes) and shift each lead-byte payload left by 6
          # to recover the two original bytes.
          r <- bitwShiftL(bitwAnd(as.numeric(xx[input_idx:(input_idx+3)]), c(31, 63)), c(6, 0))
          out[output_idx] <- as.raw(r[1] + r[2])
          output_idx <- output_idx + 1
          out[output_idx] <- as.raw(r[3] + r[4])
          output_idx <- output_idx + 1
          input_idx <- input_idx + 4
        } else {
          # Plain ASCII byte: copy it through unchanged.
          out[output_idx] <- xx[input_idx]
          input_idx <- input_idx + 1
          output_idx <- output_idx + 1
        }
      }
      length(out) <- output_idx - 1
      out
    }
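The mask-and-shift step at the heart of the loop can be checked in isolation on the four bad bytes of "ü" from the hexdump (a minimal self-contained sketch of the same arithmetic used in unencode()):

```r
# The four double-encoded bytes for "ü", as seen in the hexdump:
clump <- as.raw(c(0xc3, 0x83, 0xc2, 0xbc))

# Mask off the UTF-8 marker bits (0x1F for lead bytes, 0x3F for continuation
# bytes); the masks and shifts recycle over the 4 bytes, shifting each
# lead-byte payload left by 6:
r <- bitwShiftL(bitwAnd(as.numeric(clump), c(31, 63)), c(6, 0))

# Recombine the pairs into the two bytes they should have been:
as.raw(c(r[1] + r[2], r[3] + r[4]))
# [1] c3 bc
```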
    

    This function returns the transformed raw bytes, which we can pass to readr::read_delim() to read the data:

    p <- "~/Downloads/APExpose_DE__2003-2022__nogeo.csv"
    dd <- readr::read_delim(unencode(p), delim=";")
    head(dd)
    # A tibble: 6 × 15
    #   kreis             year NO2_annualMean NO2_hrOver200 NO_annualMean O3_annualMean O3_daysOver120 O3_dailyMaxAnnualMean
    #   <chr>            <dbl>          <dbl>         <dbl>         <dbl>         <dbl>          <dbl>                 <dbl>
    # 1 Freiburg im Bre…  2019           16.0       0                8.02          55.5          6                      84.1
    # 2 Dillingen an de…  2019           13.5       0.00182          9.44          57.8          3.95                   79.8
    # 3 Nürnberg          2019           25.4       0               12.8           45.6          0                      73.7
    # 4 Neumarkt in der…  2019           13.4       0.00170          7.60          57.2          3.70                   79.4
    # 5 Rosenheim         2019           21.0       0.0118          13.9           50.1          0.881                  75.2
    

    And now that the UTF-8 encoding is correct, the values look as expected.
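The question also asked about base::read.csv(). Once the bytes have been repaired, they can be handed to base R as well, via rawToChar() and read.table()'s text argument (a small self-contained sketch using a literal repaired header and row as stand-ins for the output of the helper above):

```r
# Repaired UTF-8 bytes for a tiny header plus one data row:
fixed <- charToRaw("kreis;year\nN\u00fcrnberg;2019\n")

# read.csv2() uses ";" as the separator; the text argument avoids a temp file.
df <- read.csv2(text = rawToChar(fixed))
df$kreis
```

In a UTF-8 locale (as shown by Sys.getlocale() above), df$kreis prints as "Nürnberg".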