I would like to make a dose response curve from the library(drc)
, and am stuck on how to prepare my dataset properly in order to make the plot. In particular, I’m struggling how to get my y-axis ready.
I made up a dataframe (df)
to help clarify what I would like to do.
df <- read.table("https://pastebin.com/raw/TZdjp2JX", header=T)
Open necessary libraries for today's exercise
library(drc)
library(ggplot2)
Let’s pretend I like humming birds, and do an experiment with different concentrations of sugar with the goal of seeing which concentration is ideal for humming birds. I therefore run an experiment in a closed setting (here column “room”), with 4 different sugar concentrations (column concentration), and 10 individual birds per concentration. I also run each experiment with 4 replicates in parallel, which is why there are 4 “rooms”. After 36 hours (column “time”), I go into the room, and check how many birds survived, creating a “yes/no” variable, or 1 & 0 (here, this is my column “status), where 1==survive, 0==died.
With this dataset, I specifically made it that most survived at concentration 0, 50% survived concentration 1, 25% survived concentration 2, and only 10% survive concentration 3.
My first issue I’m running into is : how can I turn my y-axis , generated from my “status” column into a percentage? I have done this when I do kaplan-meier survival curves, but that does not work here unfortunately. Obviously, this should column should go from 0% - 100% (we could call the column "mortality"). After I succeed at doing this, I would like to make a dose response curve that looks like the following (I found this example online, and will directly copy it here to use an example. It is from the ryegrass dataset included in R)
ryegrass.LL.4 <- drm(rootl ~ conc, data = ryegrass, fct = LL.3())
I must admit, the next steps of code are a little confusing for me.
# new dose levels as support for the line
newdata <- expand.grid(conc=exp(seq(log(0.5), log(100), length=100)))
# predictions and confidence intervals
pm <- predict(ryegrass.LL.4, newdata=newdata, interval="confidence")
# new data with predictions
newdata$p <- pm[,1]
newdata$pmin <- pm[,2]
newdata$pmax <- pm[,3]
# plot curve
# need to shift conc == 0 a bit up, otherwise there are problems with coord_trans
ryegrass$conc0 <- ryegrass$conc
ryegrass$conc0[ryegrass$conc0 == 0] <- 0.5
# plotting the curve
ggplot(ryegrass, aes(x = conc0, y = rootl)) +
geom_point() +
geom_ribbon(data=newdata, aes(x=conc, y=p, ymin=pmin, ymax=pmax), alpha=0.2) +
geom_line(data=newdata, aes(x=conc, y=p)) +
coord_trans(x="log") +
xlab("Ferulic acid (mM)") + ylab("Root length (cm)")
In the end, I would like to generate a similar curve, but with mortality on the y-axis, from 0-100 (starting low, going high) and also display the confidence intervals in a shaded grey area around the regression line. Meaning, my first step of code should like something like the following:
model <- drc(mortality ~ Concentration, data=df, fct = LL.3())
But I'm lost on the "mortality" creation part, and a little bit on the next step with ggplot
Could anyone help me achieve this? From the example from ryegrass
, I'm perplexed how to translate this to be helpful for my pretend dataset. I hope someone here is able to help me solve this issue! Many thanks, and I appreciate any feedback if there are other ways I should have my dataset structured, etc.
-Andy
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(drc)
#> Loading required package: MASS
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
#>
#> 'drc' has been loaded.
#> Please cite R and 'drc' if used for a publication,
#> for references type 'citation()' and 'citation('drc')'.
#>
#> Attaching package: 'drc'
#> The following objects are masked from 'package:stats':
#>
#> gaussian, getInitial
df <- read.table("https://pastebin.com/raw/sH5hCr2J", header=T)
Making the mortality
or as I do here survival
, can be done easily with the dplyr
package. This will help perform many calculations. It seems that you are interested in calcuating the percent survival for each Concentration across your four rooms (or replicates). So the first step is to group the data by these columns and then calculate the statistic we want.
df_calc <- df %>%
group_by(Concentration, room) %>%
summarise(surv = sum(Status)/n())
#> `summarise()` has grouped output by 'Concentration'. You can override using the `.groups` argument.
I don’t know if Concentration represents arbitrary Concentration levels so I’m moving forward with the following assumption:
df_calc <- mutate(df_calc, conc = exp(-Concentration))
Just to be clear, the conc
variable is just my attempt at having something close to the true known concentrations of the experiment. If your data has the true concentrations, then don't mind this calculation.
df_calc
#> # A tibble: 12 x 4
#> # Groups: Concentration [3]
#> Concentration room surv conc
#> <int> <int> <dbl> <dbl>
#> 1 1 1 0.5 0.368
#> 2 1 2 0.4 0.368
#> 3 1 3 0.5 0.368
#> 4 1 4 0.6 0.368
#> 5 2 1 0 0.135
#> 6 2 2 0.4 0.135
#> 7 2 3 0.2 0.135
#> 8 2 4 0.4 0.135
#> 9 3 1 0.2 0.0498
#> 10 3 2 0 0.0498
#> 11 3 3 0 0.0498
#> 12 3 4 0.2 0.0498
mod <- drm(surv ~ conc, data = df_calc, fct = LL.3())
Make new conc
data points
newdata <- data.frame(conc = exp(seq(log(0.01), log(10), length = 100)))
To respond to your comment I'll explain the above code chunk. Again the conc
variable is expected to be the unit concentration. In this hypothetical case, we have three concentration levels c(0.049, 0.135, 0.368)
. For brevity, lets assume the units are mg of sugar/ml of water
. Our model was fit on these three dose levels with 4 data points per dose level. If we wanted, we could have just plotted the curve between these levels of c(0.049, 0.368)
, but in this example, I chose c(0.01, 10) mg/ml
as the domain to plot on. This was just so that we could visualize where the curve would end up based on the model fit. In short, you choose the range that you are interested in most. As I show later - even though we can choose data points outside the range of the experimental data, the confidence intervals are extremely large indicating the model will be unhelpful for those points.
The reason behind casting these values with the log()
function is to ensure that we are sampling points that look evenly distributed on a log10 scale (most does response curves are plotted with this transformation). Once we get the sequence of 100 points, we use exp()
to return back to the linear space (which our model was fit on). These values are then used in the predict
function as the new dose
levels in conjunction with the fitted model.
All this is saved into newdata
variable which allows for the plotting of the line and the confidence intervals.
Use the model and the generated data points to
predict a new surv
value as well as the upper and lower bound
newdata <- cbind(newdata,
suppressWarnings(predict(mod, newdata = newdata, interval="confidence")))
plot with ggplot2
ggplot(df_calc, aes(conc)) +
geom_point(aes(y = surv)) +
geom_ribbon(aes(ymin = Lower, ymax = Upper), data = newdata, alpha = 0.2) +
geom_line(aes(y = Prediction), data = newdata) +
scale_x_log10() +
coord_cartesian(ylim = c(0, 1))
As you may notice, the confidence intervals increase greately when we try and predict ranges that has no data.
Created on 2021-10-27 by the reprex package (v1.0.0)