Tags: ruby, regex, string

Why are String#split("\n") and Array#join(' ') quicker than String#gsub(/\n/, ' ')?


I have to remove all newlines from a large number of strings. Benchmarking string.split("\n").join(' ') against string.gsub(/\n/, ' '), I found that the split-and-join approach is much quicker, but I have a hard time understanding why. I don't see how splitting the string into array elements at every \n and then joining the array back into a new string could possibly be quicker than just scanning the string and replacing each \n with a ' '.

sentence = %q[
  Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium,
  totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae
  dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit,
  sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam
  est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius
  modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
  veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea]

Check to verify the output is indeed the same for both methods:

puts sentence.split("\n").join(' ') == sentence.gsub(/\n/, ' ')
#=> true
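
Worth noting (this caveat is mine, not part of the check above): the two methods are only equivalent because the sample string has no trailing newline. split drops trailing empty fields by default, while gsub replaces every newline:

```ruby
s = "one\ntwo\n"

s.split("\n").join(' ')  # => "one two"   (trailing newline silently dropped)
s.gsub(/\n/, ' ')        # => "one two "  (trailing newline becomes a space)
```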

Script used to benchmark:

def split_join_method(string)
  start = Time.now
  1_000_000.times { string.split("\n").join(' ') }
  puts "split_join: #{Time.now - start} s"
end

def gsub_method(string)
  start = Time.now
  1_000_000.times { string.gsub(/\n/, ' ') }
  puts "gsub: #{Time.now - start} s"
end

5.times do
  split_join_method(sentence)
  gsub_method(sentence)
end

Results:

#=> split_join: 6.753057 s
#=> gsub: 14.938358 s
#=> split_join: 6.16101 s
#=> gsub: 14.166971 s
#=> split_join: 5.946168 s
#=> gsub: 13.490355 s
#=> split_join: 5.781062 s
#=> gsub: 13.436135 s
#=> split_join: 5.903052 s
#=> gsub: 15.670774 s
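
The same comparison can also be sketched with Ruby's standard Benchmark module, which handles the timing and warm-up for you (a minimal sketch with a shorter stand-in string and fewer iterations than the script above):

```ruby
require 'benchmark'

# Hypothetical short sample standing in for `sentence` above.
sentence = "Lorem ipsum\ndolor sit amet\nconsectetur adipiscing elit"

# bmbm runs a rehearsal pass first to reduce warm-up/GC noise.
Benchmark.bmbm do |x|
  x.report('split_join') { 100_000.times { sentence.split("\n").join(' ') } }
  x.report('gsub')       { 100_000.times { sentence.gsub(/\n/, ' ') } }
end
```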

Solution

  • I think gsub takes more time for two reasons:

    The first is that using a regex engine has an initial cost, at least to parse the pattern.

    The second, and probably the more important one here, is that the regex engine walks the string character by character, testing the pattern at each position, whereas split (with a literal string here) uses a fast string-search algorithm (probably the Boyer-Moore algorithm).

    Note that even though the split/join way is faster, it probably uses more memory, since it has to generate new strings.

    Note 2: some regex engines can use this fast string-search algorithm to find candidate positions before the walk, but I have no information on whether the Ruby regex engine does this.

    Note 3: it might be interesting to get a better idea of what happens by testing with few repetitions but larger strings. [edit] After several tests with @spickermann's code, it seems that this doesn't change anything (or nothing very significant), even with very few repetitions. So the initial cost may not be so important.
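
One way to probe how much of the gap is regex overhead (my own sketch, not from the benchmarks above): Ruby's gsub also accepts a plain String pattern, which is matched literally and sidesteps the regex machinery for this kind of fixed-text replacement:

```ruby
require 'benchmark'

# Hypothetical short sample standing in for `sentence`.
sentence = "Lorem ipsum\ndolor sit amet\nconsectetur adipiscing elit"

Benchmark.bm(12) do |x|
  x.report('gsub regex')  { 100_000.times { sentence.gsub(/\n/, ' ') } }
  x.report('gsub string') { 100_000.times { sentence.gsub("\n", ' ') } }
  x.report('split_join')  { 100_000.times { sentence.split("\n").join(' ') } }
end
```

If the string-pattern gsub lands closer to split/join than to the regex gsub, that would support the idea that the regex walk, rather than replacement itself, dominates the cost.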