I have a dataset that I used to make a random forest (it is split into testing and training data). I have already made the random forest and generated predictions (code below), but I don't know how to take those predictions and use them to generate a full data table that includes the gap-filled values.
#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(0.045, -0.002, 0.001, 0.004, 0, -0.013, 0.004,-0.003, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, 29.04, 28.39, 27.87, 26.68, 23.28, 21.16, 19.95, 19.01)
Fe<- c(95.16, 68.95, 68.62, 39.24, 35.04, 27.26, -2.60, 5.09,7.28, 2.08)
dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)
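#Making RF and Prediction
library(caret)         # createDataPartition()
library(randomForest)  # randomForest(), predict()
set.seed(1)
# Split the rows into roughly 65% training and 35% testing
inTraining <- createDataPartition(dd$Fch4, p = 0.65, list = FALSE)
training <- dd[inTraining, ]
testing <- dd[-inTraining, ]
set.seed(1)
pfpfit <- randomForest(Fch4 ~ ., data = training, ntree = 500, type = "regression")
predicted <- predict(pfpfit, newdata = testing)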
So, with the above code, I have the prediction model, but I don't know how to apply it to a dataset I already have (example below) that has gaps. I also don't know if it is a problem to have gaps in variables that aren't the variable I want to gap fill (I want to gap fill Fch4, but I also have gaps in Fe and Ta). An example of the dataset I want to gap fill is below:
#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(NA, -0.002, 0.001, 0.004, NA, -0.013, 0.004,NA, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01)
Fe<- c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA,7.28, 2.08)
dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)
I would like the gap-filled dataset to look like this:
#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(0.045, -0.002, 0.001, 0.004, 0, -0.013, 0.004,-0.003, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, 29.04, 28.39, 27.87, 26.68, 23.28, 21.16, 19.95, 19.01)
Fe<- c(95.16, 68.95, 68.62, 39.24, 35.04, 27.26, -2.60, 5.09,7.28, 2.08)
dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)
(I realize the three datasets are mostly the same and that you shouldn't test on the same data as your training data. This is just for the purposes of having something that runs. I can refine it on my own.)
If the TIMESTAMP column is required by your model, you have to convert it to a factor.
dd <- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe, stringsAsFactors = TRUE)
Assuming you have already built your random forest model pfpfit on a training set, this is how you can fill the gaps in the Fch4 column:
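# Impute missing predictor values by median/mode; column 3 (Fch4) is left out
dd_imputed <- na.roughfix(dd[, -3])
# Predict Fch4 for every row, then keep the prediction only where Fch4 was missing
predicted_Fch4 <- predict(pfpfit, newdata = dd_imputed)
dd$Fch4 <- round(ifelse(is.na(dd$Fch4), predicted_Fch4, dd$Fch4), 3)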
Output:
> dd
TIMESTAMP RH Fch4 distance Ta Fe
1 2019-05-31 17:00:00 38 0.021 1000.00 29.52 NA
2 2019-05-31 17:30:00 40 -0.002 1000.00 29.01 NA
3 2019-05-31 18:00:00 41 0.001 180.00 NA 68.62
4 2019-05-31 18:30:00 42 0.004 125.35 28.39 39.24
5 2019-05-31 19:00:00 44 0.001 1000.00 27.87 35.04
6 2019-05-31 19:30:00 49 -0.013 180.00 26.68 27.26
7 2019-05-31 20:00:00 65 0.004 1000.00 23.28 -2.60
8 2019-05-31 20:30:00 72 -0.002 5.50 NA NA
9 2019-05-31 21:00:00 74 -0.001 180.00 19.95 7.28
10 2019-05-31 21:30:00 77 -0.002 1000.00 19.01 2.08
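The merge step above only overwrites the rows where Fch4 was missing. Here is a minimal base-R sketch of just that step, which also records which rows were filled. The gap_filled column is a hypothetical addition of mine, and stand_in is only a placeholder (the mean of the observed values) for the vector that predict() would return, so the snippet runs without a fitted model.

```r
# Fch4 with gaps, copied from the example data
Fch4 <- c(NA, -0.002, 0.001, 0.004, NA, -0.013, 0.004, NA, -0.001, -0.002)
filled <- data.frame(Fch4)

# Placeholder for predict(pfpfit, newdata = dd_imputed): one value per row.
# This is NOT a model prediction, just the mean of the observed values.
stand_in <- rep(mean(Fch4, na.rm = TRUE), nrow(filled))

# Remember which rows were gaps before overwriting anything
filled$gap_filled <- is.na(filled$Fch4)

# Overwrite only the gaps; observed values are left untouched
filled$Fch4[filled$gap_filled] <- stand_in[filled$gap_filled]

which(filled$gap_filled)  # rows 1, 5 and 8 were gaps
```

Keeping a flag like this makes it easy to separate measured from modelled values later on.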
Furthermore, if you also want to fill the gaps in the predictor columns (Ta and Fe) with their imputed values, use this:
dd$Ta <- ifelse(is.na(dd$Ta), dd_imputed$Ta, dd$Ta)
dd$Fe <- ifelse(is.na(dd$Fe), dd_imputed$Fe, dd$Fe)
> dd
TIMESTAMP RH Fch4 distance Ta Fe
1 2019-05-31 17:00:00 38 0.021 1000.00 29.520 27.26
2 2019-05-31 17:30:00 40 -0.002 1000.00 29.010 27.26
3 2019-05-31 18:00:00 41 0.001 180.00 27.275 68.62
4 2019-05-31 18:30:00 42 0.004 125.35 28.390 39.24
5 2019-05-31 19:00:00 44 0.001 1000.00 27.870 35.04
6 2019-05-31 19:30:00 49 -0.013 180.00 26.680 27.26
7 2019-05-31 20:00:00 65 0.004 1000.00 23.280 -2.60
8 2019-05-31 20:30:00 72 -0.002 5.50 27.275 27.26
9 2019-05-31 21:00:00 74 -0.001 180.00 19.950 7.28
10 2019-05-31 21:30:00 77 -0.002 1000.00 19.010 2.08
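The imputed Ta and Fe values above are column medians, which is what na.roughfix uses for numeric columns. As a rough base-R sketch of that idea (fill_with_median is a hypothetical helper name, not part of randomForest, and it only covers the numeric case, not the mode imputation used for factors):

```r
# Predictor columns with gaps, copied from the example data
Ta <- c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01)
Fe <- c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA, 7.28, 2.08)
predictors <- data.frame(Ta, Fe)

# Hypothetical helper: replace the NAs in a numeric vector with its median
fill_with_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame
imputed <- as.data.frame(lapply(predictors, fill_with_median))

imputed$Ta[3]  # 27.275, the median of the eight observed Ta values
imputed$Fe[1]  # 27.26, the median of the seven observed Fe values
```

Every gap in a column receives the same value, so this is a crude fill and should be treated as such.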
The randomForest package can also handle missing values itself through the na.action = na.roughfix argument; you can read more here: https://stackoverflow.com/a/56936983/12382064.
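A minimal sketch of how that could look, under a few assumptions of mine: the model is trained only on rows where Fch4 is known (na.roughfix would otherwise fill gaps in the response with its median as well), TIMESTAMP is left out for brevity, and the predictors are still imputed by hand before predict(), because na.action here applies to fitting. On a toy sample this small, randomForest may warn that the response has few unique values.

```r
library(randomForest)

# Example data with gaps (TIMESTAMP omitted for brevity)
RH <- c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(NA, -0.002, 0.001, 0.004, NA, -0.013, 0.004, NA, -0.001, -0.002)
distance <- c(1000, 1000, 180, 125.35, 1000, 180, 1000, 5.50, 180, 1000)
Ta <- c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01)
Fe <- c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA, 7.28, 2.08)
dd <- data.frame(RH, Fch4, distance, Ta, Fe)

# Rows whose Fch4 needs to be gap-filled
is_gap <- is.na(dd$Fch4)

# Fit on rows with a known response; gaps in Ta and Fe are
# imputed by na.roughfix during fitting
set.seed(1)
fit <- randomForest(Fch4 ~ ., data = dd[!is_gap, ],
                    ntree = 500, na.action = na.roughfix)

# Impute the predictors over the full table, then predict only the gap rows
predictors <- na.roughfix(dd[, names(dd) != "Fch4"])
gap_pred <- predict(fit, newdata = predictors[is_gap, ])

# Write the predictions into the gaps; observed values stay as they are
dd$Fch4[is_gap] <- gap_pred
```

Imputing over the full predictor table, rather than over the gap rows alone, means the medians are computed from all available observations.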