r machine-learning random-forest

How do I use a random forest to predict gaps in a dataset?


I have a dataset that I used to make a random forest (it is split into testing and training data). I have already made the random forest and generated predictions (code below), but I don't know how to take those predictions and use them to generate a full data table that includes the gap-filled values.

#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(0.045, -0.002, 0.001, 0.004, 0, -0.013, 0.004,-0.003, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, 29.04, 28.39, 27.87, 26.68, 23.28, 21.16, 19.95, 19.01)
Fe<- c(95.16, 68.95, 68.62, 39.24, 35.04, 27.26, -2.60, 5.09,7.28, 2.08)

dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)

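#Making RF and Prediction
library(caret)         # provides createDataPartition()
library(randomForest)  # provides randomForest() and its predict() method
set.seed(1)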
inTraining <- createDataPartition(dd$Fch4, p = 0.65, list=FALSE)
training <- dd[inTraining,]
testing <- dd[-inTraining,]

set.seed(1)
pfpfit <- randomForest(Fch4 ~ ., training, ntree=500, type="regression")
predicted <- predict(pfpfit, newdata = testing)

So, with the above code, I have the prediction model, but I don't know how to apply it to a dataset I already have (example below) that has gaps. I also don't know if it is a problem to have gaps in variables that aren't the variable I want to gap fill (I want to gap fill Fch4, but I also have gaps in Fe and Ta). An example of the dataset I want to gap fill is below:

#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(NA, -0.002, 0.001, 0.004, NA, -0.013, 0.004,NA, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01)
Fe<- c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA,7.28, 2.08)
dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)

I would like the gap-filled dataset to look like this:

#data table head
TIMESTAMP <- c("2019-05-31 17:00:00", "2019-05-31 17:30:00", "2019-05-31 18:00:00", "2019-05-31 18:30:00", "2019-05-31 19:00:00", "2019-05-31 19:30:00", "2019-05-31 20:00:00", "2019-05-31 20:30:00", "2019-05-31 21:00:00", "2019-05-31 21:30:00")
RH<-c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77)
Fch4 <- c(0.045, -0.002, 0.001, 0.004, 0, -0.013, 0.004,-0.003, -0.001,-0.002)
distance <- c(1000,1000,180,125.35,1000,180,1000,5.50,180,1000)
Ta <-c(29.52, 29.01, 29.04, 28.39, 27.87, 26.68, 23.28, 21.16, 19.95, 19.01)
Fe<- c(95.16, 68.95, 68.62, 39.24, 35.04, 27.26, -2.60, 5.09,7.28, 2.08)
dd<- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe)

(I realize the three datasets are mostly the same and that you shouldn't test on the same data as your training data. This is just for the purposes of having something that runs. I can refine it on my own.)


Solution

  • If your model needs the TIMESTAMP column as a predictor, you have to convert it to a factor:

    dd <- data.frame(TIMESTAMP, RH, Fch4, distance, Ta, Fe, stringsAsFactors = TRUE)
    

    Assuming you have already built your random forest model pfpfit on a training set, this is how you can fill the gaps in the Fch4 column:

    dd_imputed <- na.roughfix(dd[,-3]) # Drop Fch4 (column 3), then impute predictor gaps: median for numeric columns, mode for factors
    
    predicted_Fch4 <- predict(pfpfit, newdata = dd_imputed)
    
    dd$Fch4 <- round(ifelse(is.na(dd$Fch4), predicted_Fch4, dd$Fch4),3)
    

    Output:

    > dd
                 TIMESTAMP RH   Fch4 distance    Ta    Fe
    1  2019-05-31 17:00:00 38  0.021  1000.00 29.52    NA
    2  2019-05-31 17:30:00 40 -0.002  1000.00 29.01    NA
    3  2019-05-31 18:00:00 41  0.001   180.00    NA 68.62
    4  2019-05-31 18:30:00 42  0.004   125.35 28.39 39.24
    5  2019-05-31 19:00:00 44  0.001  1000.00 27.87 35.04
    6  2019-05-31 19:30:00 49 -0.013   180.00 26.68 27.26
    7  2019-05-31 20:00:00 65  0.004  1000.00 23.28 -2.60
    8  2019-05-31 20:30:00 72 -0.002     5.50    NA    NA
    9  2019-05-31 21:00:00 74 -0.001   180.00 19.95  7.28
    10 2019-05-31 21:30:00 77 -0.002  1000.00 19.01  2.08
    

    If you also want to fill the gaps in the predictor columns (Ta and Fe) with their imputed values, use this:

    dd$Ta <- ifelse(is.na(dd$Ta), dd_imputed$Ta, dd$Ta)
    dd$Fe <- ifelse(is.na(dd$Fe), dd_imputed$Fe, dd$Fe)
    
    > dd
                 TIMESTAMP RH   Fch4 distance     Ta    Fe
    1  2019-05-31 17:00:00 38  0.021  1000.00 29.520 27.26
    2  2019-05-31 17:30:00 40 -0.002  1000.00 29.010 27.26
    3  2019-05-31 18:00:00 41  0.001   180.00 27.275 68.62
    4  2019-05-31 18:30:00 42  0.004   125.35 28.390 39.24
    5  2019-05-31 19:00:00 44  0.001  1000.00 27.870 35.04
    6  2019-05-31 19:30:00 49 -0.013   180.00 26.680 27.26
    7  2019-05-31 20:00:00 65  0.004  1000.00 23.280 -2.60
    8  2019-05-31 20:30:00 72 -0.002     5.50 27.275 27.26
    9  2019-05-31 21:00:00 74 -0.001   180.00 19.950  7.28
    10 2019-05-31 21:30:00 77 -0.002  1000.00 19.010  2.08
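
    The fill values in the table above (27.275 for Ta, 27.26 for Fe) are the column medians of the observed values, which is what na.roughfix uses for numeric columns. A minimal base R sketch that reproduces them from the vectors in the question; ta_median, fe_median, Ta_filled and Fe_filled are illustrative names, not part of the original code:

    ```r
    # Gappy predictor columns copied from the question
    Ta <- c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01)
    Fe <- c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA, 7.28, 2.08)

    # Median of the observed (non-missing) values in each column
    ta_median <- median(Ta, na.rm = TRUE)
    fe_median <- median(Fe, na.rm = TRUE)

    # Replace each gap with that column's median; measured values stay untouched
    Ta_filled <- ifelse(is.na(Ta), ta_median, Ta)
    Fe_filled <- ifelse(is.na(Fe), fe_median, Fe)

    print(c(Ta = ta_median, Fe = fe_median))
    ```

    Every gap in a column receives the same value, so this kind of fill ignores any trend over time (for example the evening drop in Ta). It is probably best treated as a quick way to make prediction possible rather than as a final gap-filling method for the predictors.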
    

    The randomForest package can also handle missing values itself through the na.action = na.roughfix argument; you can read more here: https://stackoverflow.com/a/56936983/12382064.
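
    For reference, here is one way the na.action = na.roughfix route could look end to end on the gappy data from the question. This is a sketch under assumptions, not the only way to do it: TIMESTAMP is left out because a factor with one level per row carries nothing the model can reuse (and randomForest refuses categorical predictors with more than 53 categories), the model is trained only on rows where Fch4 was measured so that the rough fix never touches the target, and the names fit, observed, predictors, pred, Fch4_filled and is_gapfilled are made up for illustration.

    ```r
    library(randomForest)

    # Gappy data from the question; TIMESTAMP is omitted on purpose
    # (assumption: it is an identifier, not a predictor)
    dd <- data.frame(
      RH       = c(38, 40, 41, 42, 44, 49, 65, 72, 74, 77),
      Fch4     = c(NA, -0.002, 0.001, 0.004, NA, -0.013, 0.004, NA, -0.001, -0.002),
      distance = c(1000, 1000, 180, 125.35, 1000, 180, 1000, 5.50, 180, 1000),
      Ta       = c(29.52, 29.01, NA, 28.39, 27.87, 26.68, 23.28, NA, 19.95, 19.01),
      Fe       = c(NA, NA, 68.62, 39.24, 35.04, 27.26, -2.60, NA, 7.28, 2.08)
    )

    # Train only on rows where the target was measured; otherwise na.roughfix
    # would fill the missing Fch4 values with the median before fitting
    observed <- !is.na(dd$Fch4)

    set.seed(1)
    fit <- randomForest(Fch4 ~ ., data = dd[observed, ], ntree = 500,
                        na.action = na.roughfix)

    # Rough-fix the predictors for all rows, then predict every row
    predictors <- na.roughfix(dd[, names(dd) != "Fch4"])
    pred <- predict(fit, newdata = predictors)

    # Keep measurements, use predictions only in the gaps, and flag filled rows
    dd$Fch4_filled  <- ifelse(observed, dd$Fch4, pred)
    dd$is_gapfilled <- !observed
    ```

    Writing the result to a new column and keeping a flag makes it easy to tell measured values from modelled ones later. On a 10-row toy table like this one, randomForest may warn that the response has very few unique values; that should not be an issue with a real dataset.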