I have searched the stacks at length for an answer to my question; one existing question comes close to mine, but I have not been able to modify its code to fix my graph.
I have data, reshaped in long format, that looks like this:
ID Var1 GenePosition ContinuousOutcomeVar
1 control X20068492 0.092813611
2 control X20068492 0.001746708
3 case X20068492 0.069251157
4 case X20068492 0.003639304
Each ID has one value for ContinuousOutcomeVar per position, and there are 86 positions and 10 IDs. I want to plot a line graph with position on the x axis and the continuous outcome variable on the y axis. I want two groups, a case group and a control group, so there should be two dots for every position: one at the mean value for cases and one at the mean value for controls. Then I want one line that connects the case means and one line that connects the control means. I know this is easy, but I'm new to R - I've been working at it for 8 hours and I can't quite get it right. Below is what I have; I'd really appreciate some insight. If this exists somewhere in the stacks, I apologize. I honestly looked all over and tried modifying a lot of code, but I still haven't gotten it right.
My code: This code plots all the values for all IDs at each position, and connects them for the two groups. It gives me a black dot at the mean of all 10 values per position (I think):
lineplot <- ggplot(data=seq.long, aes(x=Position, y=PMethyl,
group=CACO, colour=CACO)) +
stat_summary (fun.y=mean, geom="point", aes(group=1), color="black") +
geom_line() + geom_point()
I can't get R to plot only two means per position (one for the case group, one for the control group) instead of all 10 points, with the case means and the control means each connected by a line across the x axis.
First, I adjusted your original sample data to contain more than one unique GenePosition.
dput(seq.long)
structure(list(ID = 1:8, Var1 = structure(c(2L, 2L, 1L, 1L, 2L,
2L, 1L, 1L), .Label = c("case", "control"), class = "factor"),
GenePosition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("X20068492", "X20068493"), class = "factor"),
ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
0.003639304, 0.112813611, 0.002746708, 0.089251157, 0.004639304
)), .Names = c("ID", "Var1", "GenePosition", "ContinuousOutcomeVar"
), class = "data.frame", row.names = c(NA, -8L))
If you just want to show one value for each combination of GenePosition and Var1, it is easier to calculate the mean values before plotting. That can be done with the function ddply() from the plyr package.
library(plyr)
seq.long.sum<-ddply(seq.long,.(Var1,GenePosition),
summarize, value = mean(ContinuousOutcomeVar))
seq.long.sum
Var1 GenePosition value
1 case X20068492 0.03644523
2 case X20068493 0.04694523
3 control X20068492 0.04728016
4 control X20068493 0.05778016
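If plyr is not available, roughly the same summary can be sketched in base R with aggregate(). This is only an illustration: the name seq.long.agg is made up for this example, and the data frame is rebuilt from the dput() output above so the block runs on its own.

```r
# Rebuild the sample data from the dput() output above
seq.long <- data.frame(
  ID = 1:8,
  Var1 = factor(c("control", "control", "case", "case",
                  "control", "control", "case", "case")),
  GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
  ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                           0.003639304, 0.112813611, 0.002746708,
                           0.089251157, 0.004639304)
)

# Mean outcome for every Var1 x GenePosition combination
# (seq.long.agg is a hypothetical name used only in this sketch)
seq.long.agg <- aggregate(ContinuousOutcomeVar ~ Var1 + GenePosition,
                          data = seq.long, FUN = mean)

# Rename the summary column so it matches the ddply() result
names(seq.long.agg)[3] <- "value"
seq.long.agg
```

The rows come back in a different order than with ddply(), but the four means should be the same.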
Now, with this new data frame, you just have to supply the x and y values. Var1 should be used in both colour= and group= so that each group gets a different color and the points within each group are connected by a line.
ggplot(seq.long.sum,aes(x=GenePosition,y=value,colour=Var1,group=Var1))+
geom_point()+geom_line()
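As an alternative sketch, the means can also be computed inside ggplot2 itself, which is closer to the code in the question. The idea is to drop the plain geom_point() and geom_line() layers (they draw every ID) and let stat_summary() draw both the points and the lines, grouped by Var1. This assumes ggplot2 3.3.0 or later, where the argument is called fun; older versions use fun.y, as in the question. The object name p is only for illustration.

```r
library(ggplot2)

# Rebuild the sample data from the dput() output above
seq.long <- data.frame(
  ID = 1:8,
  Var1 = factor(c("control", "control", "case", "case",
                  "control", "control", "case", "case")),
  GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
  ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                           0.003639304, 0.112813611, 0.002746708,
                           0.089251157, 0.004639304)
)

# Plot the raw long-format data; stat_summary() reduces it to one mean
# per GenePosition within each Var1 group before drawing
p <- ggplot(seq.long, aes(x = GenePosition, y = ContinuousOutcomeVar,
                          colour = Var1, group = Var1)) +
  stat_summary(fun = mean, geom = "point") +  # one dot per group per position
  stat_summary(fun = mean, geom = "line")     # connect the means within each group
p
```

Pre-computing the means with ddply() keeps the summarized numbers available for inspection, while this version avoids the extra data frame.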