Today I was looking for a way to do some processing on strings in Python. A more senior programmer than me said not to use += but to use ''.join() instead.
I could also read this in, e.g., http://wiki.python.org/moin/PythonSpeed/#Use_the_best_algorithms_and_fastest_tools .
But I tested this myself and got somewhat strange results (it's not that I'm trying to second-guess them, but I want to understand).
The idea was that, given a string "This is \"an example text\" containing spaces", the string should be converted to Thisis"an example text"containingspaces
The spaces are removed, but only outside the quotes.
I measured the performance of two different versions of my algorithm, one using ''.join(list) and one using +=:
import time
import timeit

# uses the '+=' operator
def strip_spaces(s):
    ret_val = ""
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        # keep spaces only while inside quotes
        if i == ' ' and quote_found:
            ret_val += i
        if i != ' ':
            ret_val += i
    return ret_val

# uses "".join()
def strip_spaces_join(s):
    #ret_val = ""
    ret_val = []
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        # keep spaces only while inside quotes
        if i == ' ' and quote_found:
            #ret_val = ''.join( (ret_val, i) )
            ret_val.append(i)
        if i != ' ':
            #ret_val = ''.join( (ret_val,i) )
            ret_val.append(i)
    return ''.join(ret_val)

# times a single call of function(data) with time.time()
def time_function(function, data):
    time1 = time.time()
    function(data)
    time2 = time.time()
    print "it took about {0} seconds".format(time2 - time1)
On my machine this gave a minor advantage to the algorithm using +=. However, when timed with timeit:
print '#using += yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces as f', number=1000)
print '#using \'\'.join() yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces_join as f', number=1000)
the output was:
#using += yields 0.0130770206451
#using ''.join() yields 0.0108470916748
The difference is really minor. But why is ''.join() not clearly outperforming the function that uses +=? There seems to be only a small advantage for the ''.join() version.
I tested this on Ubuntu 12.04 with Python 2.7.3.
Do use the correct methodology when comparing algorithms; use the timeit module, which runs the code many times and so reduces the effect of fluctuations in CPU utilization and swapping.
Using timeit shows there is little difference between the two approaches, but ''.join() is slightly faster:
>>> s = 1000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
1.3209099769592285
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
1.2893600463867188
>>> s = 10000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
14.545105934143066
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
14.43651008605957
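The same comparison can also be run as a script rather than in the interpreter. The following is only a sketch: best_time and the sample string are names made up for this illustration, and the two functions are condensed restatements of the ones in the question. It uses timeit.repeat and keeps the fastest run, which is usually the least disturbed by other activity on the machine.

```python
import timeit


def strip_spaces(s):
    """Remove spaces outside double quotes, building the result with +=."""
    ret_val = ""
    in_quotes = False
    for ch in s:
        if ch == '"':
            in_quotes = not in_quotes
        # keep everything except spaces that are outside quotes
        if ch != ' ' or in_quotes:
            ret_val += ch
    return ret_val


def strip_spaces_join(s):
    """Same logic, but collect characters in a list and join once at the end."""
    parts = []
    in_quotes = False
    for ch in s:
        if ch == '"':
            in_quotes = not in_quotes
        if ch != ' ' or in_quotes:
            parts.append(ch)
    return ''.join(parts)


def best_time(func, data, repeat=3, number=10):
    """Hypothetical helper: fastest of `repeat` runs of `number` calls each."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings)


# Assumed sample input; the original test string is not shown in the question.
sample = 100 * 'This is "an example text" containing spaces '

for func in (strip_spaces, strip_spaces_join):
    print("{0}: {1:.4f} s".format(func.__name__, best_time(func, sample)))
```

The absolute numbers depend on the machine and the Python version, so only the ratio between the two lines is meaningful.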
Most of the work in your functions is the looping over each and every character and testing for quotes and spaces, not string concatenation itself. Moreover, the ''.join() variant does more work; you are appending the elements to a list first (this replaces the += string concatenation operations), then you are concatenating these values at the end using ''.join(). And that method is still ever so slightly faster.
You may want to strip back the work being done to compare just the concatenation part:
def inplace_add_concatenation(s):
    res = ''
    for c in s:
        res += c

def str_join_concatenation(s):
    ''.join(s)
which shows:
>>> s = list(1000 * string)
>>> timeit.timeit('f(s)', 'from __main__ import s, inplace_add_concatenation as f', number=1000)
6.113742113113403
>>> timeit.timeit('f(s)', 'from __main__ import s, str_join_concatenation as f', number=1000)
0.6616439819335938
This shows ''.join() concatenation is still a heck of a lot faster than +=. The speed difference lies in the loop; s is a list in both cases, but ''.join() loops over the values in C, while the other version has to do all its looping in Python. And that makes all the difference here.
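The same idea can be applied to the original problem: if the per-character loop is what costs the time, moving it into built-in string methods should help. The following is a minimal sketch, not a drop-in replacement from the question; strip_spaces_split is a name invented here. It relies on the observation that after splitting on the quote character, the even-numbered segments are the ones outside quotes.

```python
def strip_spaces_split(s):
    """Hypothetical variant: remove spaces outside double quotes
    using str.split, str.replace and str.join instead of a character loop."""
    segments = s.split('"')
    # Segments at even indexes lie outside quotes; strip spaces only there.
    segments[::2] = [seg.replace(' ', '') for seg in segments[::2]]
    return '"'.join(segments)


text = 'This is "an example text" containing spaces'
print(strip_spaces_split(text))  # Thisis"an example text"containingspaces
```

The Python-level loop now runs once per quoted or unquoted segment rather than once per character. How much faster this is in practice would need to be measured with timeit on representative input.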