r dataframe missing-data

Creating Artificial Gaps in R Dataset


I am processing data using Random Forest, and I am trying to create random artificial gaps in my dataset so that I can test how accurate the random forest predictions are.

TIMESTAMP <- 2001:2020
ch4_flux <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16, 61.76, 61.54, 61.53,
              61.70, 62.05, 62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32)
ch4_flux_gaps <- ch4_flux  # identical copy; the gaps will go in this column only
distance <- c(1000, 1000, 1000, 125.35, 1000, 1000, 1000, 5.50, 1000, 1000,
              1000, 1000, 1000, 1000, 179.65, 1000, 1000, 1000, 1000, 1000)
CowNum <- c(0, 0, 0, 30, 0, 0, 0, 81, 0, 0, 0, 0, 0, 0, 127, 0, 0, 0, 0, 0)
dd <- data.frame(TIMESTAMP, ch4_flux, ch4_flux_gaps, distance, CowNum)

In the above example data, ch4_flux and ch4_flux_gaps are identical columns because I will be making gaps in only the ch4_flux_gaps column and then comparing them. I'm hoping to add gaps to 5-10% of the rows. I have seen information about how to add an entire row that is a gap, but not how to target only one column and have the gaps be random.

I am hoping that the ch4_flux_gaps column will look something like this afterwards:

ch4_flux_gaps <- c(67.36, 66.39, 65.39, NA, 63.52, NA, 62.16, 61.76, 61.54, 61.53,
                   61.70, 62.05, NA, 63.09, 63.71, 64.33, 64.92, 65.46, NA, NA)

Solution

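  • The {messy} package, by Nicola Rennie, provides a make_missing() function that randomly replaces values in the chosen columns with NA. The messiness argument controls how much of the column is affected, and the resulting share appears to be approximate rather than exact: in the output below, 8 of the 20 values are missing at messiness = 0.3. For the 5-10% asked for in the question, a value between 0.05 and 0.1 would be the closer fit: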

    dd2 <- dd |> 
      messy::make_missing(cols = "ch4_flux_gaps", messiness = 0.3)
    
    # > dd2$ch4_flux_gaps
    # [1] 67.36 66.39 65.39 NA 63.52 62.76 NA 61.76 61.54 NA NA NA NA 63.09 63.71 NA 64.92 NA 65.93 66.32
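
If an exact number of gaps matters, or an extra package is not wanted, the same effect can be had in base R by drawing row indices with sample() and assigning NA to that one column. The following is only a minimal sketch: the 10% target, the seed, and the names gap_frac, n_gaps and gap_rows are assumptions chosen for illustration, and the data frame is rebuilt from the question's values so the block runs on its own.

```r
# Rebuild the relevant columns from the question so the example is self-contained.
ch4_flux <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16, 61.76, 61.54, 61.53,
              61.70, 62.05, 62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32)
dd <- data.frame(TIMESTAMP = 2001:2020, ch4_flux, ch4_flux_gaps = ch4_flux)

set.seed(42)                            # arbitrary seed, only for reproducibility
gap_frac <- 0.10                        # assumed target share of rows to blank out
n_gaps   <- round(gap_frac * nrow(dd))  # number of rows that will become gaps
gap_rows <- sample(nrow(dd), n_gaps)    # row indices, drawn without replacement

dd$ch4_flux_gaps[gap_rows] <- NA        # only this column is modified

sum(is.na(dd$ch4_flux_gaps))            # number of gaps actually created
```

Keeping gap_rows around is useful afterwards: the true values at the artificial gaps are dd$ch4_flux[gap_rows], so whatever the random forest predicts for those rows can be compared against them directly.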