I'm preparing to run Random Forest, but first I need to get my dataset in a form that would be functional for that. As such, I'm trying to fix a few problems with it. I have theoretical solutions, but I don't know how to execute them:
Since the data is a time series that contains half-hours values for over a year, I'm planning on finding the daily average for each day and filling in the NAs with the average for the corresponding day (I realize that this isn't the most statistically sound method, but I just need code that runs at the moment)
I plan on removing the character columns en masse (though I don't know what code would accomplish this). However, I need to keep the date columns, but I'm not sure how to maintain them when I'm taking averages of every column
I'm hoping to use code that takes the averages of all of these columns at once, so that I don't have to manually go through every column.
I have encountered asks about averaging entire columns indiscriminately (instead of grouped by days) and asks about keeping only the numeric variables (but I need to keep the dates as well). These unfortunately won't work.
Below is example data (I know that the TIMESTAMP and dates might be interpreted as characters by R. I don't know how to fix this when giving sample data, but they read as dates in my dataframe):
TIMESTAMP <- c("2019-04-27 17:30:00", "2019-04-27 18:00:00", "2019-04-27 18:30:00", "2019-04-27 19:00:00", "2019-04-27 19:30:00", "2019-04-28 10:00:00", "2019-04-28 10:30:00", "2019-04-28 11:00:00", "2019-04-28 11:30:00", "2019-04-28 12:00:00")
dates<-c("2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28")
ch4_flux <- c(NA, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,NA, 61.54,61.53)
distance <- c(1000,1000,NA,125.35,1000,NA,1000,5.50,NA,1000)
Tau <-c(0.0322000, 0.0495000, 0.1737616, 0.1772567, NA, 0.1246816, 0.1435230, 0.1098670, NA, NA)
filename<- c("2019-04-27T173000_AIU-2079.ghg","2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg","2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",NA, "2019-04-28T100000_AIU-2079.ghg","2019-04-28T103000_AIU-2079.ghg", "2019-04-28T110000_AIU-2079.ghg",NA)
dd<- data.frame(TIMESTAMP, dates, ch4_flux, distance,Tau, filename)
The head of the original raw dataset is below:
>dput(AMPdates2[1:10, 1:8])
structure(list(TIMESTAMP = structure(c(1556386200, 1556388000,
1556389800, 1556391600, 1556393400, 1556395200, 1556397000, 1556398800,
1556400600, 1556402400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Time_period_check = c(30, 30, 30, 30, 30, 30, 30, 30, 30,
30), filename = c("2019-04-27T170000_AIU-2079.ghg", "2019-04-27T173000_AIU-2079.ghg",
"2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg",
"2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",
"2019-04-27T200000_AIU-2079.ghg", "2019-04-27T203000_AIU-2079.ghg",
"2019-04-27T210000_AIU-2079.ghg", "2019-04-27T213000_AIU-2079.ghg"
), date = structure(c(1556323200, 1556323200, 1556323200,
1556323200, 1556323200, 1556323200, 1556323200, 1556323200,
1556323200, 1556323200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
time_adjusted = c("4/27/2019 17:30", "4/27/2019 18:00", "4/27/2019 18:30",
"4/27/2019 19:00", "4/27/2019 19:30", "4/27/2019 20:00",
"4/27/2019 20:30", "4/27/2019 21:00", "4/27/2019 21:30",
"4/27/2019 22:00"), time = c("17:40", "18:10", "18:40", "19:10",
"19:40", "20:10", "20:40", "21:10", "21:40", "22:10"), DOY = c(117.735,
117.756, 117.777, 117.798, 117.819, 117.84, 117.86, 117.881,
117.902, 117.923), daytime = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0)), row.names = c(NA, 10L), class = "data.frame")
library(dplyr)
dd %>%
mutate(DAY_grp = as.Date(TIMESTAMP)) %>%
mutate(across(where(~is.numeric(.)), ~zoo::na.aggregate(.x)),
.by = DAY_grp) %>%
select(-DAY_grp)
#> # A tibble: 10 × 6
#> TIMESTAMP dates ch4_flux distance Tau filename
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 2019-04-27 17:30:00 2019-04-27 64.9 1000 0.0322 2019-04-27T173000_AI…
#> 2 2019-04-27 18:00:00 2019-04-27 66.4 1000 0.0495 2019-04-27T180000_AI…
#> 3 2019-04-27 18:30:00 2019-04-27 65.4 781. 0.174 2019-04-27T183000_AI…
#> 4 2019-04-27 19:00:00 2019-04-27 64.4 125. 0.177 2019-04-27T190000_AI…
#> 5 2019-04-27 19:30:00 2019-04-27 63.5 1000 0.108 2019-04-27T193000_AI…
#> 6 2019-04-28 10:00:00 2019-04-28 62.8 668. 0.125 <NA>
#> 7 2019-04-28 10:30:00 2019-04-28 62.2 1000 0.144 2019-04-28T100000_AI…
#> 8 2019-04-28 11:00:00 2019-04-28 62.0 5.5 0.110 2019-04-28T103000_AI…
#> 9 2019-04-28 11:30:00 2019-04-28 61.5 668. 0.126 2019-04-28T110000_AI…
#> 10 2019-04-28 12:00:00 2019-04-28 61.5 1000 0.126 <NA>
Created on 2025-03-18 with reprex v2.1.1