This is my first attempt at optimizing code and I am excited about it. I have read some articles, but I still have some questions.
1) First of all, what takes so much time in my code below? I think it is this line: array.append(len(set(line.split()))). I read online that lists in Python are faster, but I don't see how to use lists here. Does anybody know how this can be improved?
2) Are there any other improvements I am missing?
3) Also, I read online that for loops slow down code a lot. Can the loop here be improved? (I guess writing the code in C would be fastest, but :D )
4) Why do people suggest always looking at "ncalls" and "tottime"? To me "percall" makes more sense: it tells you how fast your function or each call is.
5) In the accepted answer to "Fastest way to grow a numpy numeric array", class B supposedly applies lists. Does it? I still see an array and a for loop, which is supposed to slow things down.
Thank you.
NEW cProfile result:
618384 function calls in 9.966 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
19686 3.927 0.000 4.897 0.000 <ipython-input-120-d8351bb3dd17>:14(f)
78744 3.797 0.000 3.797 0.000 {numpy.core.multiarray.array}
19686 0.948 0.000 0.948 0.000 {range}
19686 0.252 0.000 0.252 0.000 {method 'partition' of 'numpy.ndarray' objects}
19686 0.134 0.000 0.930 0.000 function_base.py:2896(_median)
1 0.126 0.126 9.965 9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
19686 0.125 0.000 0.351 0.000 _methods.py:53(_mean)
19686 0.120 0.000 0.120 0.000 {method 'reduce' of 'numpy.ufunc' objects}
19686 0.094 0.000 4.793 0.000 function_base.py:2747(_ureduce)
19686 0.071 0.000 0.071 0.000 {method 'flatten' of 'numpy.ndarray' objects}
19686 0.065 0.000 0.065 0.000 {method 'format' of 'str' objects}
78744 0.055 0.000 3.852 0.000 numeric.py:464(asanyarray)
NEW CODE:
import numpy
import cProfile

pr = cProfile.Profile()
pr.enable()

# paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"

def f(a):
    # insert a into the sorted global list `array`, keeping it sorted
    for i in range(0, len(array)):
        if a <= array[i]:
            array.insert(i, a)
            break
    else:
        # a is larger than every element (or the list is empty): append at the end
        array.append(a)

try:
    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        # for every line (tweet) in the file
        for line in inf:  # ###Loop is bad. Builtin function is good
            # number of unique words in this tweet
            wordCount = len(set(line.split()))
            # print wordCount, array
            f(wordCount)
            # write the current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
except IOError as e:
    print 'Operation failed: %s' % e.strerror

###Service
pr.disable()
pr.print_stats(sort='time')
OLD cProfile result:
551211 function calls in 13.195 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
78744 10.193 0.000 10.193 0.000 {numpy.core.multiarray.array}
OLD CODE:
with open(read_path) as inf, open(write_path, 'a') as outf:
    array = []
    # for every line in the file
    for line in inf:
        # append the number of unique words to the array
        array.append(len(set(line.split())))
        # write the current median of the array to the file
        result = "{:.2f}\n".format(numpy.median(array))
        outf.write(result)
NumPy uses a median-finding algorithm that is O(n log n). You call numpy.median
once per line, so your algorithm ends up being O(n^2 log n).
There are several ways to improve on this. One is to keep the array sorted (i.e. insert each element in the position that maintains the sorted order). Each insert takes O(n) (inserting into an array is a linear time operation), and getting the median of a sorted array is O(1), so this ends up being O(n^2).
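The sorted-insert approach can be sketched with the standard-library bisect module, which does the binary search and insert for you (the function name running_medians is my own, not from your code):

```python
import bisect

def running_medians(counts):
    """Yield the median after each new value, keeping the list sorted."""
    sorted_counts = []
    for c in counts:
        # insort does a binary search (O(log n)) then a list insert (O(n)),
        # so each step is O(n) instead of re-running median on the whole list
        bisect.insort(sorted_counts, c)
        n = len(sorted_counts)
        mid = n // 2
        if n % 2:
            # odd length: the middle element is the median
            yield float(sorted_counts[mid])
        else:
            # even length: mean of the two middle elements
            yield (sorted_counts[mid - 1] + sorted_counts[mid]) / 2.0

print(list(running_medians([5, 3, 8, 1])))
```

Because the list is always sorted, getting the median is just an index lookup, so numpy is not needed at all on this path.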
For profiling, the main thing you want to look at is tottime, since it tells you how much time your program spent inside each function in total. As your example shows, percall is sometimes not very useful: a slow function (high percall) that is only called a few times (low ncalls) can have a tottime that is insignificant compared to the other functions.
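To make that concrete, here is a small self-contained sketch (the functions slow_once and fast_many are hypothetical, not from your program) showing how to sort a profile by tottime, so that the functions where your program actually spends its time come out on top regardless of their percall:

```python
import cProfile
import io
import pstats

def slow_once():
    # one expensive call: high percall
    return sum(i * i for i in range(200000))

def fast_many():
    # cheap call: tiny percall, but executed many times
    return 1 + 1

pr = cProfile.Profile()
pr.enable()
slow_once()
for _ in range(50000):
    fast_many()
pr.disable()

# sort by total time spent inside each function, as recommended above
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('tottime').print_stats()
report = buf.getvalue()
print(report)
```

In the report, fast_many shows a percall near zero, yet its 50000 calls can still add up to a noticeable tottime; sorting by percall alone would hide that.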