I have been working on a small Plagiarism detection engine which uses Idea from MOSS. I need a Rolling Hash function, I am inspired from Rabin-Karp Algorithm.
Code I wrote -->
#!/usr/bin/env ruby
#Designing a rolling hash function.
#Inspired from the Rabin-Karp Algorithm
module Myth
module Hasher
#Defining a Hash Structure
#A hash is a integer value + line number where the word for this hash existed in the source file
Struct.new('Hash',:value,:line_number)
#For hashing a piece of text we ned two sets of parameters
#k-->For buildinf units of k grams hashes
#q-->Prime which lets calculations stay within range
def calc_hash(text_to_process,k,q)
text_length=text_to_process.length
radix=26
highorder=(radix**(text_length-1))%q
#Individual hashes for k-grams
text_hash=0
#The entire k-grams hashes list for the document
text_hash_string=""
#Preprocessing
for c in 0...k do
text_hash=(radix*text_hash+text_to_process[c].ord)%q
end
text_hash_string << text_hash.to_s << " "
loop=text_length-k
for c in 0..loop do
puts text_hash
text_hash=(radix*(text_hash-text_to_process[c].ord*highorder)+(text_hash[c+k].ord))%q
text_hash_string << text_hash_string << " "
end
end
end
end
I am running it with values --> calc_hash(text,5,101) where text is a String input.
The code is very slow. Where am I going wrong?
Look at Ruby-Prof, a profiler for Ruby. Use gem install ruby-prof
to install it.
Once you have some ideas where the code is lagging, you can use Ruby's Benchmark to try different algorithms to find the fastest.
Nose around on StackOverflow and you'll see lots of places where we'll use Benchmark to test various methods to see which is the fastest. You'll also get an idea of different ways to set up the tests.
For instance, looking at your code, I wasn't sure whether an append, <<
, was better than concatenating using +
or using string interpolation. Here's the code to test that and the results:
require 'benchmark'
include Benchmark
n = 1_000_000
bm(13) do |x|
x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar } }
x.report("append") { n.times { foo = "foo"; bar = "bar"; foo << bar } }
end
ruby test.rb; ruby test.rb
user system total real
interpolate 1.090000 0.000000 1.090000 ( 1.093071)
concatenate 0.860000 0.010000 0.870000 ( 0.865982)
append 0.750000 0.000000 0.750000 ( 0.753016)
user system total real
interpolate 1.080000 0.000000 1.080000 ( 1.085537)
concatenate 0.870000 0.000000 0.870000 ( 0.864697)
append 0.750000 0.000000 0.750000 ( 0.750866)
I was wondering about the effects of using fixed versus variables when appending strings based on @Myth17's comment below:
require 'benchmark'
include Benchmark
n = 1_000_000
bm(13) do |x|
x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar } }
x.report("append") { n.times { foo = "foo"; bar = "bar"; foo << bar } }
x.report("append2") { n.times { foo = "foo"; bar = "bar"; "foo" << bar } }
x.report("append3") { n.times { foo = "foo"; bar = "bar"; "foo" << "bar" } }
end
Resulting in:
ruby test.rb;ruby test.rb
user system total real
interpolate 1.330000 0.000000 1.330000 ( 1.326833)
concatenate 1.080000 0.000000 1.080000 ( 1.084989)
append 0.940000 0.010000 0.950000 ( 0.937635)
append2 1.160000 0.000000 1.160000 ( 1.165974)
append3 1.400000 0.000000 1.400000 ( 1.397616)
user system total real
interpolate 1.320000 0.000000 1.320000 ( 1.325286)
concatenate 1.100000 0.000000 1.100000 ( 1.090585)
append 0.930000 0.000000 0.930000 ( 0.936956)
append2 1.160000 0.000000 1.160000 ( 1.157424)
append3 1.390000 0.000000 1.390000 ( 1.392742)
The values are different than my previous test because the code is being run on my laptop.
Appending two variables is faster than when a fixed string is involved because there is overhead; Ruby has to create an intermediate variable and then append to it.
The big lesson here is we can make a more informed decision when we're writing code because we know what runs faster. At the same time, the differences are not very big, since most code isn't running 1,000,000 loops. Your mileage might vary.