ruby algorithm plagiarism-detection rabin-karp

Rabin Karp Implementation too slow in Ruby

I have been working on a small Plagiarism detection engine which uses Idea from MOSS. I need a Rolling Hash function, I am inspired from Rabin-Karp Algorithm.

Code I wrote -->

#!/usr/bin/env ruby
#Designing a rolling hash function.
#Inspired from the Rabin-Karp Algorithm

module Myth
  module Hasher

    #Defining a Hash Structure
    #A hash is a integer value + line number where the word for this hash existed in the source file
    Struct.new('Hash',:value,:line_number)

    #For hashing a piece of text we ned two sets of parameters
    #k-->For buildinf units of k grams hashes  
    #q-->Prime which lets calculations stay within range
    def calc_hash(text_to_process,k,q)

      text_length=text_to_process.length
      radix=26

      highorder=(radix**(text_length-1))%q

      #Individual hashes for k-grams
      text_hash=0

      #The entire k-grams hashes list for the document
      text_hash_string=""

      #Preprocessing
      for c in 0...k do
        text_hash=(radix*text_hash+text_to_process[c].ord)%q
      end

      text_hash_string << text_hash.to_s << " "

      loop=text_length-k

      for c in 0..loop do        
        puts text_hash
        text_hash=(radix*(text_hash-text_to_process[c].ord*highorder)+(text_hash[c+k].ord))%q
        text_hash_string << text_hash_string << " "
      end
    end

  end
end

I am running it with values --> calc_hash(text,5,101) where text is a String input.

The code is very slow. Where am I going wrong?

Solution

Look at Ruby-Prof, a profiler for Ruby. Use gem install ruby-prof to install it.

Once you have some ideas where the code is lagging, you can use Ruby's Benchmark to try different algorithms to find the fastest.

Nose around on StackOverflow and you'll see lots of places where we'll use Benchmark to test various methods to see which is the fastest. You'll also get an idea of different ways to set up the tests.

For instance, looking at your code, I wasn't sure whether an append, <<, was better than concatenating using + or using string interpolation. Here's the code to test that and the results:

require 'benchmark'
include Benchmark

n = 1_000_000
bm(13) do |x|
  x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
  x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar      } }
  x.report("append")      { n.times { foo = "foo"; bar = "bar"; foo << bar     } }
end

ruby test.rb; ruby test.rb
                   user     system      total        real
interpolate    1.090000   0.000000   1.090000 (  1.093071)
concatenate    0.860000   0.010000   0.870000 (  0.865982)
append         0.750000   0.000000   0.750000 (  0.753016)
                   user     system      total        real
interpolate    1.080000   0.000000   1.080000 (  1.085537)
concatenate    0.870000   0.000000   0.870000 (  0.864697)
append         0.750000   0.000000   0.750000 (  0.750866)

I was wondering about the effects of using fixed versus variables when appending strings based on @Myth17's comment below:

require 'benchmark'
include Benchmark

n = 1_000_000
bm(13) do |x|
  x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
  x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar      } }
  x.report("append")      { n.times { foo = "foo"; bar = "bar"; foo << bar     } }
  x.report("append2")     { n.times { foo = "foo"; bar = "bar"; "foo" << bar   } }
  x.report("append3")     { n.times { foo = "foo"; bar = "bar"; "foo" << "bar" } }
end

Resulting in:

ruby test.rb;ruby test.rb

                   user     system      total        real
interpolate    1.330000   0.000000   1.330000 (  1.326833)
concatenate    1.080000   0.000000   1.080000 (  1.084989)
append         0.940000   0.010000   0.950000 (  0.937635)
append2        1.160000   0.000000   1.160000 (  1.165974)
append3        1.400000   0.000000   1.400000 (  1.397616)

                   user     system      total        real
interpolate    1.320000   0.000000   1.320000 (  1.325286)
concatenate    1.100000   0.000000   1.100000 (  1.090585)
append         0.930000   0.000000   0.930000 (  0.936956)
append2        1.160000   0.000000   1.160000 (  1.157424)
append3        1.390000   0.000000   1.390000 (  1.392742)

The values are different than my previous test because the code is being run on my laptop.

Appending two variables is faster than when a fixed string is involved because there is overhead; Ruby has to create an intermediate variable and then append to it.

The big lesson here is we can make a more informed decision when we're writing code because we know what runs faster. At the same time, the differences are not very big, since most code isn't running 1,000,000 loops. Your mileage might vary.