I am trying to load a German CSV file into R on Fedora Linux 42, but the region names containing special characters (like u-umlaut, o-umlaut, eszett) are not rendered correctly.
Here's a simplified version of what I'm doing:
library(readr)
df <- read_delim(
  "APExpose_DE__2003-2022__nogeo.csv",
  delim = ";",
  locale = locale(encoding = "UTF-8")
)
The output contains garbled characters like:
> head(df$kreis, 20)
[1] "Freiburg im Breisgau" "Dillingen an der Donau" "Nürnberg" "Neumarkt in der Oberpfalz"
[5] "Rosenheim" "Wilhelmshaven" "Wittmund" "Recklinghausen"
[9] "Steinfurt" "Kusel" "Neuwied" "Neunkirchen"
[13] "Ostholstein" "Amberg" "Fürth" "Günzburg"
[17] "Regensburg" "Rhön-Grabfeld" "Elbe-Elster" "Marburg-Biedenkopf"
The file is known to be encoded in UTF-8, and Sys.getlocale() on my machine returns:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
What's the correct way to read this CSV file with proper rendering of German characters like u-umlaut, o-umlaut, etc., on Linux (Fedora 42), using readr or base::read.csv()?
Edit:
The command below:
hexdump -C APExpose_DE__2003-2022__nogeo.csv | less
gives the following output:
00000000 6b 72 65 69 73 3b 79 65 61 72 3b 4e 4f 32 5f 61 |kreis;year;NO2_a|
00000010 6e 6e 75 61 6c 4d 65 61 6e 3b 4e 4f 32 5f 68 72 |nnualMean;NO2_hr|
00000020 4f 76 65 72 32 30 30 3b 4e 4f 5f 61 6e 6e 75 61 |Over200;NO_annua|
00000030 6c 4d 65 61 6e 3b 4f 33 5f 61 6e 6e 75 61 6c 4d |lMean;O3_annualM|
00000040 65 61 6e 3b 4f 33 5f 64 61 79 73 4f 76 65 72 31 |ean;O3_daysOver1|
00000050 32 30 3b 4f 33 5f 64 61 69 6c 79 4d 61 78 41 6e |20;O3_dailyMaxAn|
00000060 6e 75 61 6c 4d 65 61 6e 3b 4f 33 5f 64 61 69 6c |nualMean;O3_dail|
00000070 79 48 6f 75 72 6c 79 4d 61 78 3b 4f 33 5f 64 61 |yHourlyMax;O3_da|
00000080 69 6c 79 38 48 72 4d 61 78 3b 50 4d 31 30 5f 61 |ily8HrMax;PM10_a|
00000090 6e 6e 75 61 6c 4d 65 61 6e 3b 50 4d 31 30 5f 64 |nnualMean;PM10_d|
000000a0 61 79 73 4f 76 65 72 35 30 3b 50 4d 32 2e 35 5f |aysOver50;PM2.5_|
000000b0 61 6e 6e 75 61 6c 4d 65 61 6e 3b 6b 72 65 69 73 |annualMean;kreis|
000000c0 5f 63 6f 64 65 3b 73 63 65 6e 61 72 69 6f 0a 46 |_code;scenario.F|
000000d0 72 65 69 62 75 72 67 20 69 6d 20 42 72 65 69 73 |reiburg im Breis|
000000e0 67 61 75 3b 32 30 31 39 3b 31 36 2e 30 31 37 30 |gau;2019;16.0170|
000000f0 32 3b 30 3b 38 2e 30 32 30 35 37 31 3b 35 35 2e |2;0;8.020571;55.|
00000100 34 38 35 39 31 3b 36 3b 38 34 2e 31 33 30 36 38 |48591;6;84.13068|
00000110 3b 32 31 38 2e 35 3b 32 30 36 2e 39 32 35 3b 31 |;218.5;206.925;1|
00000120 32 2e 34 37 33 32 33 3b 32 3b 38 2e 33 33 32 37 |2.47323;2;8.3327|
00000130 38 31 3b 31 32 3b 72 65 6d 6f 74 65 0a 44 69 6c |81;12;remote.Dil|
00000140 6c 69 6e 67 65 6e 20 61 6e 20 64 65 72 20 44 6f |lingen an der Do|
00000150 6e 61 75 3b 32 30 31 39 3b 31 33 2e 34 38 39 39 |nau;2019;13.4899|
00000160 32 31 37 31 30 37 35 35 35 3b 30 2e 30 30 31 38 |217107555;0.0018|
00000170 31 37 35 38 34 39 31 37 36 35 36 34 32 3b 39 2e |1758491765642;9.|
00000180 34 34 31 32 33 37 33 35 35 30 32 39 33 31 3b 35 |44123735502931;5|
00000190 37 2e 38 33 30 36 33 36 39 37 36 30 34 33 39 3b |7.8306369760439;|
000001a0 33 2e 39 34 39 35 36 30 33 30 35 36 34 31 34 32 |3.94956030564142|
000001b0 3b 37 39 2e 37 39 32 30 31 38 31 37 31 39 39 38 |;79.792018171998|
000001c0 34 3b 31 37 35 2e 37 33 35 33 32 30 38 34 38 33 |4;175.7353208483|
000001d0 31 37 3b 31 36 30 2e 33 31 37 30 38 31 30 32 32 |17;160.317081022|
000001e0 30 34 36 3b 31 33 2e 38 34 38 34 38 34 33 32 32 |046;13.848484322|
000001f0 39 34 35 35 3b 31 2e 37 36 33 32 37 35 34 32 32 |9455;1.763275422|
00000200 38 36 38 31 3b 31 30 2e 34 39 35 31 30 32 33 35 |8681;10.49510235|
00000210 37 30 33 30 36 3b 36 38 3b 72 65 6d 6f 74 65 0a |70306;68;remote.|
00000220 4e c3 83 c2 bc 72 6e 62 65 72 67 3b 32 30 31 39 |N....rnberg;2019|
00000230 3b 32 35 2e 33 38 30 30 33 3b 30 3b 31 32 2e 37 |;25.38003;0;12.7|
00000240 35 36 34 36 3b 34 35 2e 35 38 38 36 31 3b 30 3b |5646;45.58861;0;|
00000250 37 33 2e 36 39 35 34 36 3b 31 36 30 2e 33 34 34 |73.69546;160.344|
00000260 31 3b 31 35 35 2e 36 37 31 38 3b 31 37 2e 36 31 |1;155.6718;17.61|
00000270 30 30 35 31 35 35 32 30 37 35 38 3b 31 2e 37 34 |00515520758;1.74|
00000280 30 31 37 37 33 35 36 39 39 30 36 33 3b 31 31 2e |017735699063;11.|
00000290 35 31 37 35 35 3b 31 30 37 3b 72 65 6d 6f 74 65 |51755;107;remote|
000002a0 0a 4e 65 75 6d 61 72 6b 74 20 69 6e 20 64 65 72 |.Neumarkt in der|
000002b0 20 4f 62 65 72 70 66 61 6c 7a 3b 32 30 31 39 3b | Oberpfalz;2019;|
000002c0 31 33 2e 34 30 30 37 35 32 32 38 39 34 34 36 37 |13.4007522894467|
000002d0 3b 30 2e 30 30 31 36 39 38 33 36 32 39 35 34 36 |;0.0016983629546|
000002e0 31 36 32 35 3b 37 2e 35 39 38 35 38 37 32 31 35 |1625;7.598587215|
000002f0 34 37 32 38 31 3b 35 37 2e 31 39 38 33 31 31 39 |47281;57.1983119|
Looking at the file, it is incorrectly encoded. It satisfies the structural requirements of UTF-8, but it doesn't encode the characters you are expecting.
For example, in Nürnberg the "ü" is encoded in the file as c3 83 c2 bc, but it should be c3 bc. The c3 bc value was incorrectly double encoded. The code point for "ü" is 252 (or 00FC). There are rules for how you turn code points into bytes in UTF-8, which you can test with charToRaw(intToUtf8(252)); that shows this value should be encoded as c3 bc. But if you misinterpret those two bytes as code points themselves and encode each of them again, you essentially do charToRaw(intToUtf8(as.numeric(charToRaw(intToUtf8(252))))), which returns the bad value c3 83 c2 bc.
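You can see both encodings directly at the R console (these are just the two expressions from the previous paragraph, run interactively):
> charToRaw(intToUtf8(252))
[1] c3 bc
> charToRaw(intToUtf8(as.numeric(charToRaw(intToUtf8(252)))))
[1] c3 83 c2 bc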
Because this is such a messed up format, I don't know of any easy, ready-made way to fix it. Here's a somewhat messy way to decode the bytes: we look for bytes greater than 127, assume those bad values come in clumps of 4 bytes (two double-encoded 2-byte sequences), and collapse each clump of 4 bytes back down to 2.
unencode <- function(p) {
  # read the whole file as raw bytes
  xx <- readBin(p, raw(), n = file.info(p)$size)
  out <- raw(length(xx))
  output_idx <- 1
  input_idx <- 1
  while (input_idx <= length(xx)) {
    if (xx[input_idx] > 127) {
      # a non-ASCII byte starts a clump of 4 double-encoded bytes:
      # decode the two 2-byte UTF-8 sequences back to their code points,
      # which are exactly the two bytes of the original UTF-8 character
      r <- bitwShiftL(bitwAnd(as.numeric(xx[input_idx:(input_idx + 3)]), c(31, 63)), c(6, 0))
      out[output_idx] <- as.raw(r[1] + r[2])
      output_idx <- output_idx + 1
      out[output_idx] <- as.raw(r[3] + r[4])
      output_idx <- output_idx + 1
      input_idx <- input_idx + 4
    } else {
      # plain ASCII byte: copy it through unchanged
      out[output_idx] <- xx[input_idx]
      input_idx <- input_idx + 1
      output_idx <- output_idx + 1
    }
  }
  # trim the output to the bytes actually written
  length(out) <- output_idx - 1
  out
}
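As a quick sanity check (my own toy example, not part of the original data), write just the bytes 4e c3 83 c2 bc 72 from the hexdump ("N", the double-encoded "ü", "r") to a temporary file and run them through the function; in a UTF-8 locale the result prints as expected:
> tmp <- tempfile()
> writeBin(as.raw(c(0x4e, 0xc3, 0x83, 0xc2, 0xbc, 0x72)), tmp)
> rawToChar(unencode(tmp))
[1] "Nür"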
This function returns the transformed raw bytes, which we can pass directly to readr::read_delim (its file argument accepts literal data as a raw vector) to read the data:
p <- "~/Downloads/APExpose_DE__2003-2022__nogeo.csv"
dd <- readr::read_delim(unencode(p), delim=";")
head(dd)
# A tibble: 6 × 15
# kreis year NO2_annualMean NO2_hrOver200 NO_annualMean O3_annualMean O3_daysOver120 O3_dailyMaxAnnualMean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Freiburg im Bre… 2019 16.0 0 8.02 55.5 6 84.1
# 2 Dillingen an de… 2019 13.5 0.00182 9.44 57.8 3.95 79.8
# 3 Nürnberg 2019 25.4 0 12.8 45.6 0 73.7
# 4 Neumarkt in der… 2019 13.4 0.00170 7.60 57.2 3.70 79.4
# 5 Rosenheim 2019 21.0 0.0118 13.9 50.1 0.881 75.2
And now that the UTF-8 encoding is correct, the values are rendered as expected.
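Since the question also asks about base::read.csv(): the repaired bytes from unencode() can be converted to a string and passed via read.csv's text argument. This is a sketch I haven't run against the full file, and it assumes the UTF-8 locale shown in Sys.getlocale():
> # convert the repaired raw bytes to a string and parse it with base R
> df <- read.csv(text = rawToChar(unencode(p)), sep = ";")
df$kreis should then contain properly rendered names such as Nürnberg, just like the readr version.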