r random-forest r-caret predict

Predict on new data with R random forest when there are missing data


I want to predict on new data that contains rows with NA values. I need to keep these rows so that the prediction output has the same number of rows as the input data. How can I do this with a random forest model trained with caret in R? I tried different values for the na.action argument of predict(), for example:

predictions = predict(RF_model, newdata = newdata, type = "prob", na.action = "na.exclude")

With na.exclude and na.omit, the rows are dropped. With na.pass, I get a "missing values" error.

EDIT: The model has already been trained. We are talking about predictions on completely new data, some of which are bad. I know we can't predict on these bad rows, but I need to keep track of them.


Solution

  • I think I understand what you want. You want to take a trained model and make predictions on new data which may have missing values. Rather than impute the missing values, you want the predicted value to be NA for those rows with missing values.

    Here is one way to do that. I can even maintain the original row order. The assumptions are that your new data is in a data.frame called new_data and your trained random forest model is called my_forest. Replace these with the names of your objects. I'm also assuming a regression model. If this is a classification problem, let me know and I can alter the code.

    Here is a step-by-step method explaining what we are doing.

    library(tibble)  # provides rowid_to_column()
    library(dplyr)
    new_data <- new_data %>% rowid_to_column() # add a rowid column holding the original row number
    new_data_na <- new_data %>%
      filter(!complete.cases(.))  # save the rows containing NA in a separate data.frame
    new_data_complete <- new_data %>%
      filter(complete.cases(.))   # keep only the rows with no NA
    new_data_complete$predicted <- predict(my_forest, newdata = new_data_complete) # make predictions
    new_data_na$predicted <- NA_real_ # typed NA so the column matches the numeric predictions
    new_data_predicted <- rbind(new_data_na, new_data_complete)  # bind rows
    new_data_predicted <- arrange(new_data_predicted, rowid) # restore the original row order
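
Since the question uses type = "prob", here is a rough base R sketch of the same idea that should also cope with matrix-like output such as class probabilities. The helper name predict_keep_rows is made up for illustration and is not part of caret or randomForest. lm() is used only as a stand-in model so that the example runs without extra packages. It assumes that predict() returns one value (or one row) per complete input row.

```r
# Hypothetical helper (not part of caret or randomForest): predict on the
# complete rows only and pad every other row with NA.
predict_keep_rows <- function(model, newdata, ...) {
  n  <- nrow(newdata)
  ok <- complete.cases(newdata)          # TRUE where a row has no NA at all
  if (!any(ok)) stop("newdata has no complete rows")
  pred <- predict(model, newdata = newdata[ok, , drop = FALSE], ...)

  if (is.null(dim(pred))) {
    # Vector output (numeric prediction or predicted class):
    # build an all-NA vector of the same type, then fill the complete rows.
    out <- unname(pred)[rep(NA_integer_, n)]
    out[ok] <- pred
  } else {
    # Matrix-like output (for example class probabilities):
    # build an all-NA matrix with the same columns, then fill the complete rows.
    pred <- as.matrix(pred)
    out <- matrix(NA_real_, nrow = n, ncol = ncol(pred),
                  dimnames = list(NULL, colnames(pred)))
    out[ok, ] <- pred
  }
  out
}

# Stand-in model and data (assumption: replace with your own objects).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_data <- mtcars[1:5, c("wt", "hp")]
new_data$hp[2] <- NA
new_data$wt[4] <- NA

preds <- predict_keep_rows(fit, new_data)
length(preds) == nrow(new_data)   # TRUE
which(is.na(preds))               # 2 4
```

With the objects from the question, the call would presumably look like predict_keep_rows(RF_model, newdata, type = "prob"), which should take the matrix branch. Note that complete.cases() looks at every column, so a row is also skipped when the NA sits in a column the model does not use.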
    

    Here is a more code-efficient pipe method using dplyr. The case_when structure checks each row for NA values with complete.cases(.); the . tells complete.cases to use all columns. Where a row has no NA values, complete.cases(.) is TRUE and the predicted value is kept for that row. Again, newdata = . tells predict() to use all columns. Where a row has one or more NA values, complete.cases(.) is FALSE. The second line of the case_when structure is a catch-all for rows where the first condition is not TRUE; for those we want the predicted value to be NA. Note that case_when is vectorized: predict() is called once on the whole data frame, so this only works if predict() returns one value per row, including the rows with NA. If predict() drops those rows or stops with an error, as described in the question, use the step-by-step method instead. This method does not involve taking the data apart, so no effort is needed to put it back together.

    library(dplyr)
    new_data %>%
      mutate(predicted = case_when(complete.cases(.) ~ predict(my_forest, newdata = .),
                                   TRUE ~ NA_real_))