rdplyrconditional-statements

Full dataframe operations based on specific subsets of itself (+dplyr)


I want to be able to perform simple operations such as divisions or multiplications on all columns or a specific one, by using a subset of the same data frame of one or more columns based on a condition such as a matching ID on another row.

I have data on Optical Density values (OD) of yeast culture across different treatments, where each column is the optical density value at a different timepoint and could be summarized as this:

DF <- data.frame(Time=c(1,2,3,4,5,6,7),
                 A1=c(40,50,60,70,80,90,100),
                 A2=c(40,50,60,70,80,90,100),
                 A3=c(20,40,50,60,70,80,90),
                 B1=c(20,40,50,60,70,80,90),
                 B2=c(10,20,30,40,50,60,70),
                 B3=c(14,22,37,45,52,65,73),
                 C1=c(25,30,41,49,64,71,83),
                 C2=c(24,30,41,50,63,72,80),
                 C3=c(28,33,47,51,67,72,88))

I also have another data frame with the relevant information for each column:

INFO<- data.frame(Well=c("A1","A2","A3","B1","B2","B3","C1","C2","C3"),
                  Strain=rep(c("L","M","N"),times=3),
                  Treatment=rep(c("0X","X","2X"),each=3))

One of the things that I have to do is normalize the values of the OD by the control treatment. In this example it would be "X". This means that I have to divide each value (including the control treatment), by the corresponding value of the control treatment for the correct time point and Strain.

The way I got around this was by creating a second dataframe with cbind and repeatedly adding the treatment column like this:

normingOD<-DF[,5:7]
normingOD<-cbind(normingOD,normingOD,normingOD)
normOD<-DF[-1]/normingOD
normOD$Time<-DF$Time

This should extract the columns corresponding to the treatment "X",bind them 3 times so that they match the size of the dataset minus the first column and then divide the "DF" by "normingOD". This should work, since the positions should match, for timepoints and strains already.

I think this is quite ineffective and I'd like to see if there is a better way to do this, since my real data set has 96 columns and thousands of observations and the control treatments are 12 of those columns in reality. How can I make this cleaner???

Later down the road I made a long format to work with dplyr like this:

library(reshape2)
library(dplyr)
Ndata <- reshape2::melt(normOD ,  id.vars = 'Time', variable.name = 'Well')
DATA<-merge(Ndata,INFO,by="Well")

At this point I'd also like to add a new column with the difference in growth as a percentage, but compared to the values of the treatment "0X". This implies to somehow tell R that each value has to be multiplied by 100 and then divided by the value that corresponds to the same Time but for Treatment = "X0" and for the same strain. As and example, The value of for the Well==B1, Time==1, Treatment==X (Value==2) and Strain==L I would divide it by the value in Well==A1, Time ==1, Treatment==0X and Strain==L (Value==1). And if I did things correctly it should give perc_growth==50 as it is 50% of the value in the X0 treatment.

I again bruteforced it by making a vector with the correspondent values to divide by like this:

vector<-DATA[rep(1:7,9),3]
DATA<-cbind(DATA,vector)
DATA<-DATA%>%
  mutate(perc_growth=(value*100)/vector)

This extracts a vector with the values for the 'Strain' 'L' and 'Treatment' '0X', and repeats it 9 times to match the size of the dataframe, so that I can again use it as a column to divide by.

Is there an easier and cleaner way of doing this in dplyr or similar??? I'll most likely have to do this in more occasions, so I really want to find the proper way of writing these things in a better way!!!

I thank you in anticipation and I apologize if for some reason I did not find a post already answering this.


Solution

  • I think your attempt to go from wide to long is a good idea - I think you should do that first, and then join INFO to the long data:

    library(dplyr)
    library(tidyr)
    DF_long = pivot_longer(DF, -1, names_to = "Well") |>
      left_join(INFO, by = "Well")
    DF_long
    # # A tibble: 63 × 5
    #     Time Well  value Strain Treatment
    #    <dbl> <chr> <dbl> <chr>  <chr>    
    #  1     1 A1       40 L      0X       
    #  2     1 A2       40 L      X        
    #  3     1 A3       20 L      2X       
    #  4     1 B1       20 M      0X       
    #  5     1 B2       10 M      X        
    #  6     1 B3       14 M      2X       
    #  7     1 C1       25 N      0X       
    #  8     1 C2       24 N      X        
    #  9     1 C3       28 N      2X       
    # 10     2 A1       50 L      0X       
    # # ℹ 53 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    I believe this is what you want for the growth calculation:

    DF_long |>
      ## within each group defined by the same Strain and same Time, 
      ## deivide each value by the value when Treatment == "0X"
      mutate(
        growth = value / value[Treatment == "0X"],
        as_a_percent = scales::percent_format()(growth),
        .by = c(Strain, Time)
      ) |>
      arrange(Strain, Time, Well) |>
      print(n = 20)
    # # A tibble: 63 × 7
    #     Time Well  value Strain Treatment growth as_a_percent
    #    <dbl> <chr> <dbl> <chr>  <chr>      <dbl> <chr>       
    #  1     1 A1       40 L      0X         1     100%        
    #  2     1 A2       40 L      X          1     100%        
    #  3     1 A3       20 L      2X         0.5   50%         
    #  4     2 A1       50 L      0X         1     100%        
    #  5     2 A2       50 L      X          1     100%        
    #  6     2 A3       40 L      2X         0.8   80%         
    #  7     3 A1       60 L      0X         1     100%        
    #  8     3 A2       60 L      X          1     100%        
    #  9     3 A3       50 L      2X         0.833 83%         
    # 10     4 A1       70 L      0X         1     100%        
    # 11     4 A2       70 L      X          1     100%        
    # 12     4 A3       60 L      2X         0.857 86%         
    # 13     5 A1       80 L      0X         1     100%        
    # 14     5 A2       80 L      X          1     100%        
    # 15     5 A3       70 L      2X         0.875 88%         
    # 16     6 A1       90 L      0X         1     100%        
    # 17     6 A2       90 L      X          1     100%        
    # 18     6 A3       80 L      2X         0.889 89%         
    # 19     7 A1      100 L      0X         1     100%        
    # 20     7 A2      100 L      X          1     100%        
    # # ℹ 43 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    At this point, I don't understand the norming calculation, but if you explain it a little more I can edit to show how to do it.