
How to Delete Empty Columns/Columns with a Sum of Zero


I am new to R, so please bear with me. I am working with a CSV file that started off as 95 rows and 10,000+ columns. The first thing I needed to do was delete any rows with a sum of less than 1000. I was able to do that and reduced it to 80 rows, but when I tried to adapt the code to my columns, it did not work.

I need to delete any column with a sum of 0. The columns have headers. I have tried multiple different methods without luck.

The file is set up like so:

            A1     A2     A3     A4     A5     A6
Sample 1   394     28     1839   0      2389   24
Sample 2   4825    0      230    0      239    8349
Sample 3   439     329    932    0      948    348
Sample 4   239     832    3094   0      389    438
Sample 5   2389    13     390    0      5893   3849
Sample 6   98      320    23     0      9340   432

These are the libraries that I have been using. I have been using vroom instead of readr because of the size of my file.

library(vroom)
library(dplyr)
library(tidyverse)

I am trying to remove all of the columns with a sum of 0, like A4. Initially, I tried adapting the code I used on the rows:

df <- vroom("file.csv")

keep <- colSums(df[sapply(df, is.numeric)][,]) > 0
df2 <- df[keep,]

However, it looks like keep might only work on rows? I was receiving an error saying the subscript had to be between 1 and 80, which would correspond to the number of rows in the file. I would appreciate it if anyone could confirm or reject this theory.
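That theory is on the right track: in `df[keep, ]` the index *before* the comma selects rows, so a logical vector built from column sums gets applied against the 80 rows. A minimal sketch of a column-wise version, using made-up sample data in place of the real CSV:

```r
df <- data.frame(
  sample = paste("Sample", 1:6),
  A1 = c(394, 4825, 439, 239, 2389, 98),
  A4 = c(0, 0, 0, 0, 0, 0),
  A5 = c(2389, 239, 948, 389, 5893, 9340)
)

num <- sapply(df, is.numeric)                      # which columns are numeric
zero <- num                                        # start from the numeric mask
zero[num] <- colSums(df[num], na.rm = TRUE) == 0   # flag zero-sum numeric columns
df2 <- df[, !zero]                                 # index *after* the comma = columns

names(df2)  # A4 is gone; the character column survives
```

The two-step mask is needed because `colSums()` only runs on the numeric columns, so its result has to be expanded back out to the full set of columns before subsetting.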

Since that wasn't working, I tried using which to select the zero-sum columns so that I could remove them manually.

unique.cols <- which(colSums(df[sapply(df, is.numeric)]) == 0)

It selected all the columns, which is not what I wanted, and I'm still not sure why. I experimented with some other methods, such as the remove_empty function from the janitor package, but those didn't work either. I tried using a macro in Excel as well, but it crashed my computer and then the file wouldn't reopen...

I found this code while going through other questions on Stack Overflow that seemed similar to mine, but it decimated my data set and took out many more columns than it should have: out of 10,000+, only around 300 were left, even though only a few columns are empty.

df <- Filter(function(x)!all(is.na(x) || is.null(x) || x == "" || x == 0), df)
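The likely culprit in that line is `||`: it is a scalar operator that (in R versions before 4.2, which now raise an error) silently inspects only the first element of each comparison, so the test effectively dropped every column whose *first row* was 0, NA, or empty. The element-wise operator is `|`. A sketch of the corrected test on toy data:

```r
df <- data.frame(
  A1 = c(0, 4825, 439),   # first value is 0, but the column is not all-zero
  A4 = c(0, 0, 0)         # genuinely all-zero
)

# `|` evaluates element-wise, so all() sees the whole column, not just row 1
df2 <- Filter(function(x) !all(is.na(x) | x == 0), df)

names(df2)  # A1 survives, A4 is dropped
```

The `is.null(x)` clause in the original can simply be dropped: a data frame column is never NULL, and mixing a scalar test into a vectorized condition is what forced the `all(...)` wrapper around `||` in the first place.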

Here is an example of what I get when I use the str(df) function:

str(df)
tibble [80 × 10,932] (S3: tbl_df/tbl/data.frame)
 $ ...1    : chr [1:80] "sample 1" "sample 2" "sample 3" "sample 4" ...
 $ A1    : num [1:80] 394 4825 439 239 2389 ...
 $ A2    : num [1:80] 28 0 329 832 13 320 ...
 $ A3    : num [1:80] 1839 230 932 3094 390 ...

Any advice would be appreciated. Thank you!


Solution

  • Using Filter(), we keep the non-numeric columns and any numeric column with a non-zero sum.

    Filter(\(x) !is.numeric(x) || sum(x) != 0, df)
    
    #            A1  A2   A3   A5   A6
    # Sample 1  394  28 1839 2389   24
    # Sample 2 4825   0  230  239 8349
    # Sample 3  439 329  932  948  348
    # Sample 4  239 832 3094  389  438
    # Sample 5 2389  13  390 5893 3849
    # Sample 6   98 320   23 9340  432
    

    Where

    df <- data.frame(
      A1 = c(394L, 4825L, 439L, 239L, 2389L, 98L),
      A2 = c(28L, 0L, 329L, 832L, 13L, 320L),
      A3 = c(1839L, 230L, 932L, 3094L, 390L, 23L),
      A4 = integer(6),
      A5 = c(2389L, 239L, 948L, 389L, 5893L, 9340L),
      A6 = c(24L, 8349L, 348L, 438L, 3849L, 432L),
      row.names = c("Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6")
    )
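Since dplyr is already loaded in the question, the same predicate can also be written with `select(where(...))` (available in dplyr >= 1.0.0). This is an alternative spelling, not the answer's own method:

```r
library(dplyr)

df <- data.frame(
  A1 = c(394, 4825, 439),
  A4 = c(0, 0, 0),
  A5 = c(2389, 239, 948)
)

# where() applies the predicate to each column: non-numeric columns are kept,
# numeric columns only if their sum is non-zero
df2 <- df %>%
  select(where(~ !is.numeric(.x) || sum(.x, na.rm = TRUE) != 0))

names(df2)  # "A1" "A5"
```

Here `||` is safe because `is.numeric()` and `sum()` each return a single value per column; `na.rm = TRUE` guards against any NA cells making a column's sum NA.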