rreplaceoutliersstdev

Find outliers by Standard Deviation from mean, replace with NA in large dataset (6000+ columns)


I've found a few answers similar to what I'm looking for, but the code has not worked.

I need to replace all outliers (as defined for our purposes as more than 2 SD away from the mean), of each column with NA or NULL. I'm trying to have this iterate across all columns, calculating the outliers for each column and replacing those values.

Disclosure: I also know statistically there are strong views on the removal of outliers, and the use of SD as a measure to exclude them given that SD is calculated including the outliers in this case. These were my instructions from our statistician, so I am working off that for now.

This is a glimpse of my dataset:

data

Group sp.Q13813.SPTN1_HUMAN sp.O14773.TPP1_HUMAN sp.P11137.MTAP2_HUMAN
1 Premutation           10713983468            367492324            2134747097
2 Premutation           10789498495            343303410            2677825476
3 Premutation           11134883489            383589325            2132552280
4 Premutation            9723552595            269965000            2262740921
5 Premutation           11175156282            359864993            1419225650
6 Premutation           10959077349            258095035            3343267633
7 Premutation           10770809133            331554977            2763604046
8 Premutation           11098182537            344384433            2198718886

I've tried code including the scale() function, but when I performed the below functions I found it replaced values that were within 2 SD of the mean of the column, and it left my code in the scaled format. I'm not as familiar with this function and was not sure how to revert it back to the original data points, even if it was replacing the correct outlier-deemed values.

Method attempt #1

# take note of order for column names
data.names_P <- colnames(data)


# scale all numeric columns
data.numeric.Pre <- select_if(data, is.numeric) %>% # subset of numeric columns
  mutate_all(scale)                             # perform scale separately for each column
data.numeric.Pre[data.numeric.Pre > 2] <- 99999 # set values larger than 2 to NA (none in this example)

# combine results with subset data frame of non-numeric columns
data.Pre <- data.frame(select_if(data, function(x) !is.numeric(x)),
                   data.numeric)

# restore columns to original order
data.Pre <- data.Pre[, data.names_P]

Method attempt #2


FindOutliers <- function(data) {
  upper = (2*sd(data) + mean(data)
  lower = (mean - 2*sd(data))
  result <- which(data < upper | data > lower)
}

I know attempt number two does not replace the data with NA.

Any help would be immensely appreciated.


Solution

  • You can use a ifelse function, here an example using dplyr and applying the ifelse function over all columns containing the term HUMAN:

    library(dplyr)
    data %>% mutate_at(.vars = vars(contains("HUMAN")), 
                       .funs= ~ifelse(abs(.)>mean(.)+2*sd(.), NA, .))