rdata.tableinclusion

R Count Sequentially Columns


    set.seed(0)
data = data.frame(ID = 1:1000, X1=runif(1000), X2=runif(1000), DROP1=sample(0:1,r=T),DROP2=sample(0:1,r=T),DROP3=sample(0:1,r=T))

Say this is my data. I wish to do this: count the number of values of DROP1 that equals to 1; then count the number of values of DROP2 among the cases when DROP1 equals to 1; then count the number of values of DROP3 equals to 1 among the cases when DROP2 equals to 1 and DROP1 equals to 1. I can do this manually but our real data file is large and has 80+ DROP variables. The desirable output is just a print out that looks:

DROP1, #
DROP2 (AFTER DROP1), #
DROP3 (AFTER DROP1 & DROP2), #

Solution

  • Here is an option with base R where we get the 'DROP' column names ('nm1') using grep. Then loop over the sequence of those, get the seq of those, subset the data columns, use Reduce to get a logical vector with & (only TRUE if we have all the columns with 1 for a row, i.e 1=> TRUE, 0 => FALSE), and get the sum of those elements to return the count

    nm1 <- grep('^DROP', names(data), value = TRUE)
    sapply(seq_along(nm1), function(i)  {i1 <- seq(i)
            sum(Reduce(`&`, data[nm1[i1]])) })
    #[1] 503 249 137
    

    Or with data.table

    library(data.table)
    setDT(data)
    lapply(seq_along(nm1), function(i) {
             i1 <- seq(i)
             data[, sum(Reduce(`&`, .SD)), .SDcols = nm1[i1]]
    
        })
    

    data

    set.seed(0)
    data <- data.frame(ID = 1:1000, X1=runif(1000), X2=runif(1000), 
               DROP1=sample(0:1,1000, replace = TRUE),
               DROP2=sample(0:1,1000, replace = TRUE),
               DROP3=sample(0:1,1000,replace = TRUE))