r lapply desctools

Trim data using lapply to remove outliers


I am trying to use lapply to trim some of my data. I want to trim columns 2:4 (deleting the outliers or extreme values), and whenever a value is removed from one column, I also want the whole row removed across all columns.

Here is some data with outliers in each column. I want to remove the values 100 and -100 in V1 together with the whole rows that contain them. Likewise, I want to remove the values 80 and -80 in column V2, and their rows as well.

    trimdata <- NULL
    trimdata$ID <-  seq.int(102)
    trimdata$V1 <- c(rnorm(100), 100, -100)
    trimdata$V2 <- c(rnorm(100), 80, -80)
    trimdata$V3 <- c(rnorm(100), 120, -120)
    trimdata <- as.data.frame(trimdata)

    library(DescTools)
    trimdata <- lapply(trimdata, function(x) Trim(x, trim = 0.01))
    trimdata <- as.data.frame(trimdata)

The above code applies the function to all the columns, so the extreme values in the ID column are removed as well.

This code:

    trimdata[2:4] <- lapply(trimdata[2:4], function(x) Trim(x, trim = 0.01))

returns the following error:

    Error in `[<-.data.frame`(`*tmp*`, 2:4, value = list(V1 = c(0.424725933773568,  : 
      replacement element 1 has 98 rows, need 100

So I am trying to trim based on columns 2:4, but I also want the corresponding rows removed from column 1.


Solution

  • You can't assign the result back into trimdata, because Trim removes elements: the trimmed vectors are shorter than the data frame's columns, and the replacement requires equal lengths.

    Here is an example:

    library(DescTools)
    x <- rnorm(10)
    length(x)
    # [1] 10
    length(Trim(x, trim = 0.1))
    # [1] 8
    

    Before calling Trim you have 10 elements; afterwards you have only 8.

    In your example, Trim removes 2 elements from each column, which is why the error message says:

    replacement element 1 has 98 rows, need 100

    From the Trim documentation:

    A symmetrically trimmed vector x with a fraction of trim observations (resp. the given number) deleted from each end will be returned.

    In your example, two rows are trimmed out of each column. The rows differ from column to column, as you can see:

    trim_out <- lapply(trimdata[2:4], function(x) Trim(x, trim = 0.01))
    lapply(trim_out, attributes)
    $V1
    $V1$trim
    [1] 56 57
    
    
    $V2
    $V2$trim
    [1] 63 47
    
    
    $V3
    $V3$trim
    [1] 90 74
    

    If you want a cleaned data.frame as output, you can remove all of these rows from your data frame trimdata, like this:

    trimdata[-unique(unlist(lapply(trim_out, attributes))),]
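For comparison, the same idea (collect the extreme rows of each column, then drop their union once) can be sketched in base R without relying on the "trim" attribute. This is only a minimal sketch under some assumptions: extreme_rows is a hypothetical helper, not part of DescTools; k = 1 is chosen because each example column has one planted outlier at each end; and the seed is arbitrary.

```r
# Hypothetical helper (not part of DescTools): row indices of the k smallest
# and the k largest values of a numeric vector.
extreme_rows <- function(x, k = 1) {
  ord <- order(x)                 # positions of x, sorted by value
  c(head(ord, k), tail(ord, k))   # k lowest and k highest positions
}

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
trimdata <- data.frame(
  ID = seq_len(102),
  V1 = c(rnorm(100), 100, -100),
  V2 = c(rnorm(100), 80, -80),
  V3 = c(rnorm(100), 120, -120)
)

# Find the extreme rows in columns 2:4, then remove the union of those rows
# from the whole data frame, so ID stays aligned with the remaining values.
drop_rows <- unique(unlist(lapply(trimdata[2:4], extreme_rows, k = 1)))
trimmed <- trimdata[-drop_rows, ]

nrow(trimmed)
```

Because the rows are removed from the data frame in a single step, no column is ever assigned a shorter vector, so the length-mismatch error cannot occur. In this example data the outliers of all three columns sit in the same two rows; with real data the union can contain up to 2 * k rows per column.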