rdataframemultiple-columnsnaoutliers

Replace certain values in data.frame columns


I have a data as follows:

data<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
                 Wt=c(91,92,85,205,285,43,95,75,76,NA),
                 Ht=c(185,182,173,171,600,650,NA,890,NA,NA))

Wt represents the weight in kilograms and Ht represents the height in centimeters. In this example, I want to treat the values of Wt bigger than 200 as outliers and change to some specific numbers. Also, I want to treat the values of Ht bigger than 250 as outliers and change to NA. In my actual data, there are few outliers in Wt and many outliers in Ht. So, I could find the outliers for Wt by using the code below:

a1<-data$Wt 

a1<-data.frame(a1)
a1<-na.omit(a1)
b1<-a1[a1$a1>200, ]
b1  #205,285

I want to change 205 to 80 and change 285 to 90. (Because, in my actual data, there are few outliers for Wt, so that I can change them individually.) Also, I want to make the values of Ht bigger than 250 as NA. So my expected output is as follows:

data<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
                 Wt=c(91,92,85,80,90,43,95,75,76,NA),
                 Ht=c(185,182,173,171,NA,NA,NA,NA,NA,NA))

Solution

  • Do it by reference using data.table:

    library(data.table)
    setDT(data)
    
    data[Ht > 250, Ht := NA]
    data[Wt == 205, Wt := 80]
    data[Wt == 285, Wt := 90]
    data
        id Wt  Ht
     1:  1 91 185
     2:  2 92 182
     3:  3 85 173
     4:  4 80 171
     5:  5 90  NA
     6:  6 43  NA
     7:  7 95  NA
     8:  8 75  NA
     9:  9 76  NA
    10: 10 NA  NA
    

    For more info, see: Introduction to data.table.


    To implement the ifelse method suggested by Shawn in data.table, you could do something like the following. This is still updating by reference, so there is still no need to write something like data <- data %>% ....

    library(data.table)
    setDT(data)
    
    data[, `:=`(Ht = fifelse(Ht > 250, NA_real_, Ht),
                Wt = fcase(Wt == 205, 80, 
                           Wt == 285, 90,
                           rep(TRUE, .N), Wt))]
    

    Note that fifelse is data.table's fast ifelse, and I use fcase to handle the two Wt conditions together (although the rep(TRUE, .N) trick is a bit hacky, so maybe it would be nicer to just use two fifelse calls for Wt instead of fcase).


    Update 2023-07-17: If you need a vectorised option, I suggest using plyr::mapvalues. Even though plyr is retired, you can just use body(plyr::mapvalues) and do the same thing for it's few dependencies, and hence put the code into your own project.