I have a problem with my dataframe. The missing values are marked with # and I cant find a way to automatically replace them with NA.
Here is my dataframe: https://gofile.io/?c=BfpgbC
This is what I have tried:
library(naniar)
df_new= testframe %>% replace_with_na(replace = list(NO2_Königsplatz = "#"))
testframe[testframe== "#"] <- NA
Both does not work. When I manually replace each value, it works but that is not an option because it takes too long.
After replacing the missing values with NA I want to convert all columns (not the 1st column) into numerics to calculate the means.
Any ideas how to solve this?
EDIT WITH CORRECT DATA
Here is a second approach:
The last step will generate warnings about coerced NA values, which can be ignored. We can use the lubridate
and dplyr
packages:
library(dplyr)
library(lubridate)
dat <- read.table("AUG-2017-Air.dat",
stringsAsFactors=FALSE) %>%
mutate(Zeitpunkt = dmy_hm(Zeitpunkt)) %>%
mutate_if(is.character, as.integer)
Note that the timezone is assumed UTC unless specified otherwise.
Result:
str(dat)
'data.frame': 8760 obs. of 13 variables:
$ Zeitpunkt : POSIXct, format: "2017-01-01 01:00:00" "2017-01-01 02:00:00" "2017-01-01 03:00:00" "2017-01-01 04:00:00" ...
$ NO2_Bourgesplatz : int 31 31 29 30 29 28 27 29 28 25 ...
$ NO2_Karlstraße : int 34 35 31 31 31 32 38 35 33 29 ...
$ NO2_Königsplatz : int 29 29 28 28 27 27 26 28 28 23 ...
$ NO2_LfU : int 31 31 29 28 27 26 25 23 22 24 ...
$ O3_Bourgesplatz : int 6 5 2 2 2 2 2 2 2 8 ...
$ O3_LfU : int 4 3 3 3 3 3 3 3 3 5 ...
$ PM10_Bourgesplatz: int 455 417 106 90 87 93 85 79 91 77 ...
$ PM10_Karlstraße : int 203 75 58 53 55 60 51 45 48 48 ...
$ PM10_Königsplatz : int 215 75 62 51 64 63 69 51 50 64 ...
$ PM10_LfU : int 376 321 62 101 61 112 123 112 118 147 ...
$ PM25_Bourgesplatz: int 267 308 100 87 87 94 88 80 92 79 ...
$ PM25_LfU : int 160 112 48 62 50 65 62 66 65 76 ...
EDIT WITH BASE R SOLUTION
dat <- read.table("AUG-2017-Air.dat",
stringsAsFactors=FALSE)
dat[2:13] <- lapply(dat[2:13], as.numeric)
OLD ANSWER WITH INCORRECT DATA
You can specify which values represent missing values when you import the data to R. In general if unsure about data, it's best to read "as is", explore and then figure out the quirks of that particular dataset, then go back and fix it up.
For the data linked in your question, this should work:
testframe <- read.table("testframe.dat",
sep = "",
na.strings = c(" # ",
" -"),
stringsAsFactors=FALSE)
Result:
str(testframe)
'data.frame': 1095 obs. of 13 variables:
$ Zeitpunkt : chr "01.01.2017 07:00" "01.01.2017 14:00" "01.01.2017 21:00" "02.01.2017 07:00" ...
$ NO2_Bourgesplatz : int 27 22 41 22 20 36 35 24 23 12 ...
$ NO2_Karlstraße : int 38 49 53 38 54 45 47 61 32 19 ...
$ NO2_Königsplatz : int 26 25 46 26 35 30 33 31 19 8 ...
$ NO2_LfU : int 25 16 36 19 13 26 24 17 18 9 ...
$ O3_Bourgesplatz : int 2 22 2 23 32 12 8 41 36 55 ...
$ O3_LfU : int 3 31 4 28 48 22 23 55 47 67 ...
$ PM10_Bourgesplatz: int 85 54 74 32 18 11 18 11 13 6 ...
$ PM10_Karlstraße : int 51 37 69 34 21 15 23 17 17 11 ...
$ PM10_Königsplatz : int 69 36 59 26 21 12 18 15 13 8 ...
$ PM10_LfU : int 123 35 68 30 12 6 16 11 14 6 ...
$ PM25_Bourgesplatz: int 88 61 53 32 19 12 15 11 11 6 ...
$ PM25_LfU : int 62 42 61 24 15 9 15 11 12 5 ...
You might also consider conversion of the Zeitpunkt
column to a datetime class, depending on what you want to do next.