I am new to R, so please bear with me. I am working with a CSV file that started off as 95 rows and 10,000+ columns. The first thing I needed to do was delete any rows with a sum of less than 1000. I was able to do that and reduced it to 80 rows, but when I tried to adapt the code to my columns, it did not work.
I need to delete any column with a sum of 0. The columns have headers. I have tried multiple different methods without luck.
The file is set up like so
A1 A2 A3 A4 A5 A6
Sample 1 394 28 1839 0 2389 24
Sample 2 4825 0 230 0 239 8349
Sample 3 439 329 932 0 948 348
Sample 4 239 832 3094 0 389 438
Sample 5 2389 13 390 0 5893 3849
Sample 6 98 320 23 0 9340 432
These are the libraries that I have been using. I have been using vroom
instead of readr
because of the size of my file.
library(vroom)
library(dplyr)
library(tidyverse)
I am trying to remove all of the columns with a sum of 0, like A4. Initially, I tried adapting the code I used on the rows:
df <- vroom("file.csv")
keep <- colSums(df[sapply(df, is.numeric)][,]) > 0
df2 <- df[keep,]
However, it looks like keep might only work on x values/rows? Was receiving an error saying keep had to be either 1-80, which would correspond to the number of rows in the file. Would appreciate if anyone could confirm/reject this theory.
Since that wasn't working, I tried using which
to select the columns that were zero sum so that I could get rid of those manually.
unique.cols <- which(colSums(df[sapply(df, is.numeric)]) == 0)
It selected all the columns, which is not what I wanted and I'm still not sure why. I tried experimenting with some other methods, such as the janitor package and remove_empty function, but those didn't work either. I tried using a macro in excel as well, but it crashed my computer and then wouldn't reopen...
I found this code while going through other questions on Stack that seemed similar to mine, but it completely decimated my data set and took out many more columns than it should have. Out of 10,000+, only 300 some were left. Only a few columns are empty.
df <- Filter(function(x)!all(is.na(x) || is.null(x) || x == "" || x == 0), df)
Here is an example of what I get when I use the str(df)
function
str(df)
tibble [80 × 10,932] (S3: tbl_df/tbl/data.frame)
$ ...1 : chr [1:80] "sample 1" "sample 2" "sample 3" "sample 4" ...
$ A1 : num [1:80] 394 4825 439 239 2389 ...
$ A2 : num [1:80] 28 0 329 832 13 320 ...
$ A3 : num [1:80] 1839 230 932 3094 390 ...
Any advice would be appreciated. Thank you!
Using Filter() we pick non-numeric columns and numeric columns that have non-zero sum.
Filter(\(x) !is.numeric(x) || sum(x) != 0, df)
# A1 A2 A3 A5 A6
# Sample 1 394 28 1839 2389 24
# Sample 2 4825 0 230 239 8349
# Sample 3 439 329 932 948 348
# Sample 4 239 832 3094 389 438
# Sample 5 2389 13 390 5893 3849
# Sample 6 98 320 23 9340 432
Where
df <- data.frame(
A1 = c(394L, 4825L, 439L, 239L, 2389L, 98L),
A2 = c(28L, 0L, 329L, 832L, 13L, 320L),
A3 = c(1839L, 230L, 932L, 3094L, 390L, 23L),
A4 = integer(6),
A5 = c(2389L, 239L, 948L, 389L, 5893L, 9340L),
A6 = c(24L, 8349L, 348L, 438L, 3849L, 432L),
row.names = c("Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6")
)