[SOLVED] Finding a Motif in DNA

Finding a Motif in DNA

Problem can be found here: http://rosalind.info/problems/subs/

The question I have has to do with the performance of the two solutions provided below.

  def indexOfAppearances(strand: String, subStrand: String): List[Int] = {
    def help0(x: String, y: String, accu: List[Int]): List[Int] =
      (x contains y) match {
        case true => {
          val index = (x indexOfSlice y) /* index where substring appears */
          val adjust = strand.length - x.length
          /* adjustment since the x has all the previous
               * nucleotides removed.
               */

          val newX = x.drop(index + 1).mkString
           /* here's the drop of index + 1 elements */
          help0(newX, y, (index + adjust) :: accu) /* tail recursive call */
        }
        case false => accu
      }

     help0(strand, subStrand, List()).reverse.toList.map(x => x + 1)
         //indexes are not from 0 but from 1
  }

  val s = "ACGTACGTACGTACGT"
  val t = "GTA"

  val combs = s.sliding(t.length).zipWithIndex
  val locations = combs.collect { case (sub, i) if (sub == t) => i + 1 }
  println(locations.mkString(" "))

The second solution is pretty, functional and short.

First solution is a bit large but it's still functional. I could have left out the val's and just work with the values to make it shorter but that wasn't my goal.

After seeing the 2nd solution I was quite disappointed by mine due to the length of code. Checked the scala library to see why the 2nd solution works and then re-implemented it myself. Thought about checking the performance of both solutions and made a huge 30 million DNA strand.

Surprised!

Performance:

First number is the DNA length, and the next two numbers present the execution time of 1st and 2nd solution(in ms).

11,226,096 - 4921 - 14503

33,678,288 - 6448 - 35150

Why is the performance so much different?

I've tried checking the scala library but couldn't find a thing which explains this behaviour.

I assumed that the first solution is creating many objects thus consuming more memory and taking a lot of time to do that but it seems that for some reason it works much faster. I doubt it's the tail recursion and I doubt that zipWithIndex takes a lot of time. Iterators are just iterators?

Thanks!

Solution

sliding isn't efficient with strings. It breaks the string down into characters, boxes them, then reassembles them into a string.

The fastest way to do this which is still not incredibly hard is with the regionMatches method on String. (Faster yet with DNA is to convert everything into bytes, faster yet is to convert it into 2-bit nibbles and pack into int arrays.)