
Exclude column in sapply(dat_clean, FUN = function(x){x / sum(x)}) in R


I'm working on a piece of R code to calculate the diversity in a dataset of fungal taxonomies. The tutorial I'm following has a piece of code meant to sum each row, which isn't working because the first column (displaying the OTU numbers) isn't numeric but character. Here is the piece of code:

# make relative abundances
dat_relab = as.data.frame(sapply(dat_clean, FUN = function(x){x / sum(x)}))
rownames(dat_relab) = rownames(dat_clean)

# Remove low abundant ASVs, i.e. for which total relative abundance is lower than 0.01%
dat_relab$relab = rowSums(dat_relab) # adds column with total number of reads per ASV
keep01 = which(dat_relab$relab > 0.0001) # selects which ASVs fit the cut-off
dat_relab01 = dat_relab[keep01, -26] # outputs filtered table, and removes added column with number reads

# Select the ASVs with relative abundance higher than 0.01% from the ASV table with read number for the subsequent analysis
dat_clean2 = dat_clean[rownames(dat_relab01), colnames(dat_relab01)]

The code stops at the 2nd line and provides the following error: Error in sum(x) : invalid 'type' (character) of argument

I understand logically what the problem here is: the first column of the data isn't numeric, and won't be summed. My issue is in understanding how to fix it, and why it isn't going wrong in the tutorial (which has data formatted in the same way).


Solution

  • The code sums each column, not each row: sapply() applied to a data frame works through its columns one at a time.
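
    To check this on made-up data (toy is a hypothetical data frame, not taken from the question), compare what sapply() returns with colSums() and rowSums():

    ```r
    # Hypothetical toy data: two numeric columns, three rows
    toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

    # sapply() hands one column at a time to sum()
    col_totals <- sapply(toy, sum)

    # colSums() gives the same per-column totals
    col_check <- colSums(toy)

    # rowSums() is what actually sums across each row
    row_totals <- rowSums(toy)
    ```

    Here col_totals and col_check hold one value per column, while row_totals holds one value per row.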

    I understand logically what the problem here is: the first column of the data isn't numeric, and won't be summed. My issue is in understanding how to fix it, [...]

    For some toy data

    dat_clean = data.frame(V1=letters[1:5], V2=1:5, V3=6:10)
    

    your attempt

    as.data.frame(sapply(dat_clean, FUN = function(x){x / sum(x)}))
    

    issues

    Error in sum(x) : invalid 'type' (character) of argument

    which is quite informative. To avoid it, we can restrict the operation to the numeric columns:

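    # logical vector marking which columns are numeric
    i = sapply(dat_clean, is.numeric)
    # divide each numeric column by its total; this overwrites the counts (\(x) needs R >= 4.1)
    dat_clean[i] = sapply(dat_clean[i], \(x) x / sum(x))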
    
    > # i is a named vector of Boolean (logical)
    > i 
        V1    V2    V3 
     FALSE  TRUE  TRUE 
    > # how the result looks 
    > dat_clean 
      V1         V2    V3
    1  a 0.06666667 0.150
    2  b 0.13333333 0.175
    3  c 0.20000000 0.200
    4  d 0.26666667 0.225
    5  e 0.33333333 0.250
    

    Matching the tutorial:

    i = sapply(dat_clean, is.numeric)
    dat_relab = as.data.frame(sapply(dat_clean[i], FUN = function(x){x / sum(x)}))
    
    

    or any of the following:

    # (1)
    X = as.data.frame(dat_clean[i] / lapply(dat_clean[i], sum))
    # (2)
    Y = dat_clean[i] / lapply(dat_clean[i], sum)
    # (3)
    Z = Filter(is.numeric, dat_clean)
    Z = Z / lapply(Z, sum)
    
    > Vectorize(identical, 'x')(list(X, Y, Z), dat_relab)
    [1] TRUE TRUE TRUE
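
    Another base R route, shown only as a sketch on the same toy data: convert the numeric columns to a matrix and let prop.table() (or the equivalent sweep() call) divide each column by its total. The names m, relab_pt and relab_sw are made up for this example, and the result is a matrix rather than a data frame.

    ```r
    # Toy data from above (raw counts)
    dat_clean <- data.frame(V1 = letters[1:5], V2 = 1:5, V3 = 6:10)

    # Keep only the numeric columns and convert them to a matrix
    m <- as.matrix(Filter(is.numeric, dat_clean))

    # margin = 2 divides every column by its own total
    relab_pt <- prop.table(m, margin = 2)

    # The same operation spelled out with sweep()
    relab_sw <- sweep(m, 2, colSums(m), "/")

    # Each column of relative abundances should add up to 1
    col_check <- colSums(relab_pt)
    ```

    Wrap the result in as.data.frame() if the later steps expect a data frame.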
    

    [...] and why it isn't going wrong in the tutorial (which has data formatted in the same way).

    This is hard to tell without seeing the "tutorial".
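
    One possibility, and it is only a guess: the tutorial may have loaded the table with the OTU/ASV identifiers as row names rather than as a first column, which the line rownames(dat_relab) = rownames(dat_clean) hints at. A sketch of that setup on the toy data, assuming the identifier column is called V1 as above:

    ```r
    # Toy data with the identifiers in a character column, as in the question
    dat_clean <- data.frame(V1 = letters[1:5], V2 = 1:5, V3 = 6:10)

    # Move the identifiers into the row names and drop the character column
    rownames(dat_clean) <- dat_clean$V1
    dat_clean$V1 <- NULL

    # With only numeric columns left, the tutorial line runs unchanged
    dat_relab <- as.data.frame(sapply(dat_clean, FUN = function(x) x / sum(x)))
    rownames(dat_relab) <- rownames(dat_clean)
    ```

    When the table comes from a file, passing row.names = 1 to read.csv() or read.table() has the same effect at import time, and the later step that indexes dat_clean by rownames(dat_relab01) then finds the identifiers it expects.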