
How to check what percentage of a column is made up of certain rows


I have a CSV file that is too large to be opened in Excel. In this file, I am trying to see how much of each column's sum is from rows 1, 47:56, and 156:158. This does NOT include the first two columns of the file. There are also headers for both the rows and columns. The file is made up of about 22,000 columns and 210 rows.

I am trying to find which columns get more than 0.1% of their total sum from the rows mentioned above, and then delete those columns.

I have been using vroom rather than readr when loading the file due to how large it is.

Example file setup:

  H e a d e r           A1    A2    A3    A4    A5    A6
H    1    sample a     1     0     0     13    0     9
e    2    sample b     4     0     0     8     312   24
a    3    sample c     0     20    0     49    0     17
d    4    sample d     2     0     213   18    56    3
e    5    sample e     5     4     0     10    94    62
r    6    sample f     9     87    0     2     33    90

Code:

library(dplyr)

library(vroom)


myData <- vroom("File.csv")

myData$newRow <- 100*(colSums(myData[-1, -2])/rowSums(myData[1, 47:56, 156:158]))

I was trying to create a new row holding the percentage (each column's sum, EXCEPT columns 1 and 2)/(sum of the listed rows). Here is the latest error message I received, which I am having trouble understanding:

> myData$newRow <- 100*(colSums(myData[-1, -2])/rowSums(myData[1, 47:56, 156:158]))
Error:
! Assigned data `100 * ...` must be compatible with existing data.
✖ Existing data has 200 rows.
✖ Assigned data has 21941 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In drop && length(xo) == 1L :
  'length(x) = 3 > 1' in coercion to 'logical(1)'

Any advice would be appreciated. Thank you.


Solution

  • You're not far off the answer, but there are a few issues with your current approach:

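    1. myData$newRow <- creates a new column, not a new row, which is why the error compares 21941 assigned values against 200 existing rows. I would suggest something like myData[nrow(myData)+1, ] <- to add a new row to the data.
    2. When selecting rows or columns from a data frame with [ ], the different ranges should be wrapped in c(). Otherwise R treats each comma-separated argument as a separate argument to [ ]: the first as rows, the second as columns, and the third as drop, which is what the warning about drop refers to.
    3. You should use colSums() in both calculations. rowSums() gives a total for each row, which is not what is needed here.
    4. The column subset c(-1, -2) should be used in both calculations (and on the left-hand side of <-).
    5. The division is in the wrong order: the sum of the selected rows should be the numerator and the column total the denominator.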
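
To make points 1 to 3 concrete, here is a minimal sketch on a made-up data frame. The names df, rows, part_sums and total_sums are hypothetical and only serve the illustration; they are not from your file.

```r
# Toy data frame; names and values are made up for illustration.
df <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# Point 2: combine the row ranges with c() so they form one row index.
rows <- c(1, 3:4)

# Point 3: colSums() on the row subset and on the full data (first column excluded).
part_sums  <- colSums(df[rows, -1])  # sums over rows 1, 3 and 4 only
total_sums <- colSums(df[, -1])      # sums over every row

# Point 1: `$<-` adds a column, so the number of rows does not change.
df$flag <- df$a > 2
dim(df)
```
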

    So:

    myData[nrow(myData) + 1, c(-1, -2)] <- 100 * colSums(myData[c(1, 47:56, 156:158), c(-1, -2)]) / colSums(myData[, c(-1, -2)])
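
Since the end goal is to delete the columns above the 0.1% threshold, it may be simpler to keep the percentages in a separate named vector rather than appending them as a row. That also sidesteps any differences in how tibbles and base data frames handle row assignment. Below is a sketch on toy data modelled on the example in the question. The names toy, target_rows, pct, drop_names and filtered are hypothetical, and c(1, 3) is only a stand-in for c(1, 47:56, 156:158).

```r
# Toy data modelled on the example in the question (values copied from it).
toy <- data.frame(
  id     = 1:6,
  sample = paste("sample", letters[1:6]),
  A1 = c(1, 4, 0, 2, 5, 9),
  A2 = c(0, 0, 20, 0, 4, 87),
  A3 = c(0, 0, 0, 213, 0, 0),
  A4 = c(13, 8, 49, 18, 10, 2),
  A5 = c(0, 312, 0, 56, 94, 33),
  A6 = c(9, 24, 17, 3, 62, 90)
)

target_rows <- c(1, 3)  # assumption: stand-in for c(1, 47:56, 156:158)
value_cols  <- -(1:2)   # skip the first two (label) columns

# Percentage of each column's total that comes from the target rows.
pct <- 100 * colSums(toy[target_rows, value_cols]) / colSums(toy[, value_cols])

# Columns above the threshold; a column summing to zero gives NaN and is kept.
drop_names <- names(pct)[!is.na(pct) & pct > 0.1]

# Keep every column whose name is not on the drop list.
filtered <- toy[, !(names(toy) %in% drop_names)]
names(filtered)
```

On your real data you would replace toy with myData and target_rows with the actual row numbers, after checking that those numbers refer to data rows as vroom read them (the header line is not counted as a row).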