I am making a simple line plot with ggplot2
with "signal" peaks over time in the x-axis.
For a little bit of background, this specific plot is what is called a "chromatogram", and shows signal intensity, in relative fluorescence units, plotted over time. DNA bases (one of A, C, G, T) are "called" (assigned) at each peak intensity.
I am using the sangerseqR
ab1 file example for my MWE below. I just load the data, create a data frame with the trace data to plot (the intensity values), and define the DNA sequence as each of the bases called at the peaks, at specific time points in the x-axis.
I just plot a small portion of the DNA sequence (trimmed from front and back) for simplicity.
This is all fine, and works as expected:
#read in data from sangerseqR example
seq_obj <- sangerseqR::readsangerseq(system.file("extdata", "heterozygous.ab1", package = "sangerseqR"))
#create data frame with trace data to plot
trace_df <- as.data.frame(seq_obj@traceMatrix) #columns for A, C, G, T
names(trace_df) <- c('A','C','G','T')
trace_df$time <- seq_len(nrow(trace_df))
trace_df <- as.data.frame(tidyr::pivot_longer(trace_df, -time, names_to = "base", values_to = "signal"))
#create data frame with base (letter) calls at the specific times (corresponding with trace peaks)
basecall <- unlist(strsplit(toString(seq_obj@primarySeq), ""))
basepos <- seq_obj@peakPosMatrix[,1] #first column for primary seq
base_df <- data.frame(call=basecall, time=basepos, callnum5=seq_along(basepos), callnum3=rev(seq_along(basepos)))
#join both data frames
trace_df <- dplyr::left_join(trace_df, base_df, by="time")
trace_df$call <- ifelse(trace_df$call==trace_df$base, trace_df$call, NA)
#trim data from 5' (top) and 3' (bottom) to plot only a small section of the sequence
trim5 <- 50
trim3 <- 500
startpos <- min(which(trace_df$callnum5==trim5))
endpos <- max(which(trace_df$callnum3==trim3))
trace_sub <- trace_df[startpos:endpos,]
#define colors, x-axis breaks and labels
basecolors <- c("green","blue","black","red")
xbreaks <- trace_sub$time[which(trace_sub$call==trace_sub$base)]
xlabels <- trace_sub$call[which(trace_sub$call==trace_sub$base)]
#make plot in ggplot2
P <- ggplot2::ggplot(trace_sub, ggplot2::aes(x=time, y=signal, group=base, colour=base)) +
ggplot2::geom_line(linewidth=0.5) +
ggplot2::scale_color_manual(values=basecolors) +
ggplot2::scale_x_continuous(breaks=xbreaks, labels=xlabels) +
ggplot2::theme_light()
grDevices::pdf(file="test.pdf", height=2, width=20)
print(P)
grDevices::dev.off()
This produces the following plot, which looks perfect (identical to the sangerseqR::chromatogram()
function output), but is not exactly as I want it.
Note the following. In a perfect chromatogram, we should see evenly-spaced peaks (and bases called at the peaks), but this is rarely the case, and certainly not the case in this example.
For my purposes (I want to compare a few almost-identical sequences by aligning/stacking their chromatograms), I need the x-axis breaks at the peaks (corresponding with the bases called) to be evenly spaced.
I want to ignore the time variable and have the bases appear at an equal distance from one another. In technical terms, the "basecall windows" need to have the same width.
For this, the peaks in the plot need to be widened or narrowed accordingly. This is something that different commercial software do, but I cannot think of what formula to apply to the data, so I can visualize the plot in such way.
Any help would be greatly appreciated! Many thanks.
Create a data frame containing the times and new x-axis breaks using seq
.
Xbreaks <- data.frame(time=xbreaks,
xbreaks=seq(from=min(xbreaks),
to=max(xbreaks),
length.out=length(xbreaks)))
Then join this to the original data frame on "time" and replace the NAs by linear interpolation.
trace_sub |>
dplyr::full_join(Xbreaks, by="time") |>
dplyr::mutate(time=zoo::na.approx(xbreaks), .by=base) |>
ggplot(aes(x=time, y=signal, group=base, colour=base)) +
geom_line(linewidth=0.5) +
scale_color_manual(values=basecolors) +
scale_x_continuous(breaks=Xbreaks$xbreaks, labels=xlabels) +
theme_light()
Edit:
If the whole dataset (trace_df
) is used instead, which includes NA at the start and end, then approxExtrap
from the Hmisc package can be used to extrapolate these NA after the linear interpolation.
xbreaks <- trace_df$time[which(trace_df$call==trace_df$base)]
xlabels <- trace_df$call[which(trace_df$call==trace_df$base)]
Xbreaks <- data.frame(time=xbreaks,
xbreaks=seq(from=min(xbreaks),
to=max(xbreaks),
length.out=length(xbreaks)))
Linear interpolation
trace_df2 <- trace_df |>
dplyr::full_join(Xbreaks, by="time") |>
dplyr::mutate(time=zoo::na.approx(xbreaks, na.rm=FALSE),
id=row_number(), .by=base)
Linear extrapolation
trace_df2$time[is.na(trace_df2$time)] <- Hmisc::approxExtrap(
x=trace_df2$id[!is.na(trace_df2$time)],
y=trace_df2$time[!is.na(trace_df2$time)],
xout=trace_df2$id[is.na(trace_df2$time)])$y
This gives
time base signal call callnum5 callnum3 xbreaks id
1 -0.04721754 A 0 <NA> NA NA NA 1
2 -0.04721754 C 25 <NA> NA NA NA 1
3 -0.04721754 G 281 <NA> NA NA NA 1
4 -0.04721754 T 0 <NA> NA NA NA 1
5 1.30185497 A 0 <NA> NA NA NA 2
6 1.30185497 C 29 <NA> NA NA NA 2
7 1.30185497 G 284 <NA> NA NA NA 2
8 1.30185497 T 0 <NA> NA NA NA 2
9 2.65092749 A 0 <NA> NA NA NA 3
10 2.65092749 C 35 <NA> NA NA NA 3
11 2.65092749 G 290 <NA> NA NA NA 3
12 2.65092749 T 0 <NA> NA NA NA 3
13 4.00000000 A 1 <NA> 1 605 4 4
14 4.00000000 C 47 <NA> 1 605 4 4
15 4.00000000 G 301 G 1 605 4 4
16 4.00000000 T 0 <NA> 1 605 4 4
17 5.34907251 A 2 <NA> NA NA NA 5
18 5.34907251 C 67 <NA> NA NA NA 5
19 5.34907251 G 321 <NA> NA NA NA 5
20 5.34907251 T 0 <NA> NA NA NA 5
Which can then be plotted as before:
trace_df2 |>
ggplot2::ggplot(ggplot2::aes(x=time, y=signal, group=base, colour=base)) +
ggplot2::geom_line(linewidth=0.5) +
ggplot2::scale_color_manual(values=basecolors) +
ggplot2::scale_x_continuous(breaks=Xbreaks$xbreaks, labels=xlabels) +
ggplot2::theme_light()