I have a series of topographic profile scans that I would like to combine to create a single continuous profile. The only issue is that each scan may or may not have been taken from a different height, so while the different files have a fair amount of overlap in terms of the area covered, the different data may not have a common reference point in terms of absolute height.
Below are 4 different scans. Each scan contains approximately 30 measurements, with the last few measurements representing new data, and the rest being overlap with the previous scan. The first scan contains the only known absolute values, so the first scan is the "gold standard". The second scan happens to be taken from the same height, so the overlap matches up (almost) perfectly and only adds 4 new points to the previous scan. The third and fourth scans are taken from different heights, so while the overlap covers the same area (relatively), I can not simply stitch it onto the previous two scans.
Scan1<-c(5,6,7,8,15,16,18,20,25,23,20,17,15,10,10,9,8,9,11,10,13,16,17,19,20,25,28,30,29,30)
Scan2<-c(15,16,18,20,25,23,20,16,15,10,10,9,8,9,11,10,13,16,17,19,20,25,28,30,29,30,32,35,38,37)
Scan3<-c(28,25,23,18,18,17,16,17,19,18,21,23,25,27,26,33,36,37,37,38,40,43,46,45,43,42,40,38,32,30)
Scan4<-c(27,30,29,36,39,39,40,41,43,46,49,48,46,45,43,41,35,33,30,29,28,30,31,32,35)
Using R, is there a way to stitch these 4 scans together to make a continuous topographic profile? The absolute height would need to be based off of the first scan, with each successive scan being stitched onto the previous scan. IE- Scan2 is stitched onto Scan 1 adding 4 data points, then the new data from Scan 3 is added onto to the combination of Scan1 and Scan2, then the new data from Scan4 is added onto the combination of Scans 1,2, and 3, and so on....
I'm assuming there is a way to normalize all the data by matching up the large overlap between scans, using some sort of pattern recognition to figure out that Scan3 is about 8 units different from Scan1, and that Scan4 is about 11 units off. But please note there is some "noise" in my data and the pattern of overlap will not be a perfect fit.
The end result should contain a complete topographic profile encompassing all 4 scans, with some sort of adjustments for when the actual numbers differ. Something along the lines of:
5,6,7,8,15,16,18,20,25,23,20,16.5,15,10,10,9,8,9,11,10,13,15.5,17,19,19,25,28,29.5,29,30,32,35,38,37,35,34,32,30,24,22,19,18,17,19,20,21,24
You might want to look into sequence alignment - DNA alignment is basically this problem, but with bases rather than numbers.
As a quick once over, here's a quickly writtem function to find the "best" shift, based on finding the lowest standard deviation of differences between the values when sliding the scans. The function takes the given two sequences, and compares them with the shifts given (-15 to 15 by default):
aligner <- function(bestsequence, sequence2, shift = (-15):15){
minsd <- sd(bestsequence[1:min(length(sequence2), length(bestsequence))] - sequence2[1:min(length(sequence2), length(bestsequence))])
bestshift <- 0
avgdiff <- mean(bestsequence[1:min(length(sequence2), length(bestsequence))] - sequence2[1:min(length(sequence2), length(bestsequence))])
for(i in shift){
if(i < 0){
worksequence2 <- sequence2[abs(i):length(sequence2)]
if(sd(bestsequence[1:min(length(worksequence2), length(bestsequence))]
- worksequence2[1:min(length(worksequence2), length(bestsequence))]) < minsd){
minsd <- sd(bestsequence[1:min(length(worksequence2), length(bestsequence))]-
worksequence2[1:min(length(worksequence2), length(bestsequence))])
bestshift <- i
avgdiff <- mean(bestsequence[1:min(length(worksequence2), length(bestsequence))]-
worksequence2[1:min(length(worksequence2), length(bestsequence))])
}
}
if(i > 0){
workbest <- bestsequence[i:length(bestsequence)]
if(sd(workbest[1:min(length(sequence2), length(workbest))]
-sequence2[1:min(length(sequence2), length(workbest))]) < minsd){
minsd <- sd(workbest[1:min(length(sequence2), length(workbest))]-
sequence2[1:min(length(sequence2), length(workbest))])
bestshift <- i
avgdiff <- mean(workbest[1:min(length(sequence2), length(workbest))]-
sequence2[1:min(length(sequence2), length(workbest))])
}
}
}
return(list(bestshift = bestshift, avgdiff = avgdiff, minsd = minsd))
}
So, for your data:
aligner(Scan1, Scan2)
$bestshift
[1] 5
$avgdiff
[1] 0.03846154
$minsd
[1] 0.1961161
So, your Scan2s 5th element is equal to the 1st element of Scan1. From here it should be easy to subset, correct by avgdiff and append the new data points, then rerun.
EDIT: Here's how to get your final sequence. First, we want a wrapper that will output our desired sequence. It basically runs the previous command, then checks if the shift is positive or negative and then outputs the correct sequence:
wrappedaligner <- function(bestseq, seq2){
z <- aligner(bestseq, seq2)
if(z$bestshift==0){
if(length(bestseq) >= length(seq2)){
return(bestseq)
} else {return(c(bestseq, (seq2[(length(bestseq)+1):length(seq2)])-z$avgdiff))}
}
else if(z$bestshift > 0){
if(length(bestseq)-z$bestshift >= length(seq2)){
return(bestseq)
} else {return(c(bestseq, seq2[(length(bestseq) - z$bestshift + 2):length(seq2)] - z$avgdiff))}
}
else if(z$bestshift <0){
if((length(bestseq) - abs(z$bestshift))>= length(seq2)){
return(bestseq)
} else {return(c(seq2[1:abs(z$bestshift) - 1] - z$avgdiff, bestseq))}
}
}
Now we need to run it recursively on your data - luckily we can use Reduce
:
Reduce(wrappedaligner, list(Scan1, Scan2, Scan3, Scan4))
[1] 5.00000 6.00000 7.00000 8.00000 15.00000 16.00000 18.00000 20.00000
[9] 25.00000 23.00000 20.00000 17.00000 15.00000 10.00000 10.00000 9.00000
[17] 8.00000 9.00000 11.00000 10.00000 13.00000 16.00000 17.00000 19.00000
[25] 20.00000 25.00000 28.00000 30.00000 29.00000 30.00000 31.96154 34.96154
[33] 37.96154 36.96154 50.83974 49.83974 47.83974 45.83974 39.83974 37.83974