I have a CSV file that is too large to be opened in Excel. In this file, I am trying to see how much of each column's sum comes from rows 1, 47:56, and 156:158. This does NOT include the first two columns of the file. There are also headers for both the rows and columns. The file is made up of about 22,000 columns and 210 rows.
I am trying to find which columns get more than 0.1% of their total sum from the mentioned rows, and then delete those columns.
I have been using vroom rather than readr when loading the file due to how large it is.
Example file setup:
Header             A1   A2   A3   A4   A5   A6
1   sample a        1    0    0   13    0    9
2   sample b        4    0    0    8  312   24
3   sample c        0   20    0   49    0   17
4   sample d        2    0  213   18   56    3
5   sample e        5    4    0   10   94   62
6   sample f        9   87    0    2   33   90
Code:
library(dplyr)
library(vroom)
myData <- vroom("File.csv")
myData$newRow <- 100*(colSums(myData[-1, -2])/rowSums(myData[1, 47:56, 156:158]))
I was trying to create a new row holding the percentage (each column's sum, EXCEPT columns 1 and 2)/(sum of the called rows). Here is the latest error message I have received, which I am having trouble understanding:
> myData$newRow <- 100*(colSums(myData[-1, -2])/rowSums(myData[1, 47:56, 156:158]))
Error:
! Assigned data `100 * ...` must be compatible with existing data.
✖ Existing data has 200 rows.
✖ Assigned data has 21941 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In drop && length(xo) == 1L :
'length(x) = 3 > 1' in coercion to 'logical(1)'
Any advice would be appreciated. Thank you.
You're not far off the answer, but there are a few issues with your current approach:
myData$newRow <- will create a new column, not a new row. I would suggest something like myData[nrow(myData)+1, ] <- to add new rows to the data.
[ ]: the different ranges should be wrapped in c(), otherwise R thinks they are different dimensions.
colSums() in both calculations: rowSums will get a total for the rows, which is not what is needed here.
c(-1, -2) should be in both calculations (and on the left-hand side of <-).
So:
myData[nrow(myData) + 1, c(-1, -2)] <- 100 * colSums(myData[c(1, 47:56, 156:158), c(-1, -2)]) / colSums(myData[, c(-1, -2)])
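Since the end goal is to delete columns, it may be simpler to keep the percentages in a separate named vector instead of appending a row. Below is a minimal sketch of that idea on a toy data frame built from the first four value columns of your example. The names id, sample, target_rows, threshold, num, pct, drop_cols and filtered are my own placeholders, the row indices count data rows (not the header line), and the 30% threshold is only there so the toy data drops something; for the real file you would presumably use c(1, 47:56, 156:158) and 0.1.

```r
# Toy data standing in for the real file: two label columns, then numeric columns.
# (Assumption: the real data would come from vroom("File.csv") instead.)
myData <- data.frame(
  id     = 1:6,
  sample = paste("sample", letters[1:6]),
  A1 = c(1, 4, 0, 2, 5, 9),
  A2 = c(0, 0, 20, 0, 4, 87),
  A3 = c(0, 0, 0, 213, 0, 0),
  A4 = c(13, 8, 49, 18, 10, 2)
)

target_rows <- c(1, 3)  # for the real file: c(1, 47:56, 156:158)
threshold   <- 30       # percent; for the real file: 0.1

# Numeric part only: drop the first two (label) columns.
num <- myData[, -(1:2)]

# Share of each column's total that comes from the target rows, in percent.
pct <- 100 * colSums(num[target_rows, ]) / colSums(num)

# Columns over the threshold; columns whose total is 0 give NaN and are kept.
drop_cols <- names(pct)[!is.na(pct) & pct > threshold]

# Remove the flagged columns, keeping the label columns in place.
filtered <- myData[, !(names(myData) %in% drop_cols)]
```

With the toy values, only A4 exceeds the threshold (62 of its total of 100 comes from rows 1 and 3), so filtered keeps id, sample, A1, A2 and A3.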