Problem can be found here: http://rosalind.info/problems/subs/
The question I have has to do with the performance of the two solutions provided below.
1.
def indexOfAppearances(strand: String, subStrand: String): List[Int] = {
def help0(x: String, y: String, accu: List[Int]): List[Int] =
(x contains y) match {
case true => {
val index = (x indexOfSlice y) /* index where substring appears */
val adjust = strand.length - x.length
/* adjustment since the x has all the previous
* nucleotides removed.
*/
val newX = x.drop(index + 1).mkString
/* here's the drop of index + 1 elements */
help0(newX, y, (index + adjust) :: accu) /* tail recursive call */
}
case false => accu
}
help0(strand, subStrand, List()).reverse.toList.map(x => x + 1)
//indexes are not from 0 but from 1
}
2.
val s = "ACGTACGTACGTACGT"
val t = "GTA"
val combs = s.sliding(t.length).zipWithIndex
val locations = combs.collect { case (sub, i) if (sub == t) => i + 1 }
println(locations.mkString(" "))
The second solution is pretty, functional and short.
First solution is a bit large but it's still functional. I could have left out the val's and just work with the values to make it shorter but that wasn't my goal.
After seeing the 2nd solution I was quite disappointed by mine due to the length of code. Checked the scala library to see why the 2nd solution works and then re-implemented it myself. Thought about checking the performance of both solutions and made a huge 30 million DNA strand.
Surprised!
Performance:
First number is the DNA length, and the next two numbers present the execution time of 1st and 2nd solution(in ms).
11,226,096 - 4921 - 14503
33,678,288 - 6448 - 35150
Why is the performance so much different?
I've tried checking the scala library but couldn't find a thing which explains this behaviour.
I assumed that the first solution is creating many objects thus consuming more memory and taking a lot of time to do that but it seems that for some reason it works much faster. I doubt it's the tail recursion and I doubt that zipWithIndex takes a lot of time. Iterators are just iterators?
Thanks!
sliding
isn't efficient with strings. It breaks the string down into characters, boxes them, then reassembles them into a string.
The fastest way to do this which is still not incredibly hard is with the regionMatches
method on String
. (Faster yet with DNA is to convert everything into bytes, faster yet is to convert it into 2-bit nibbles and pack into int arrays.)