This is my first attempt at optimizing code and I am excited about it. I have read some articles, but I still have some questions.
1) First of all, what takes so much time in my code below? I think it is this line: array.append(len(set(line.split()))). I read online that lists in Python are faster, but I don't see how to use lists here. Does anybody know how this can be improved?
2) Are there any other improvements I am missing?
3) Also, I read online that for loops slow down code a lot. Can the loop here be improved? (I guess writing the code in C would be fastest, but :D )
4) Why do people suggest always looking at "ncalls" and "tottime"? To me "percall" makes more sense: it tells you how fast your function or each call is.
5) In the accepted answer to "Fastest way to grow a numpy numeric array", class B supposedly applies lists. Does it? I still see an array and a for loop, which is supposed to slow things down.
Thank you.
NEW cProfile result:
618384 function calls in 9.966 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
19686 3.927 0.000 4.897 0.000 <ipython-input-120-d8351bb3dd17>:14(f)
78744 3.797 0.000 3.797 0.000 {numpy.core.multiarray.array}
19686 0.948 0.000 0.948 0.000 {range}
19686 0.252 0.000 0.252 0.000 {method 'partition' of 'numpy.ndarray' objects}
19686 0.134 0.000 0.930 0.000 function_base.py:2896(_median)
1 0.126 0.126 9.965 9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
19686 0.125 0.000 0.351 0.000 _methods.py:53(_mean)
19686 0.120 0.000 0.120 0.000 {method 'reduce' of 'numpy.ufunc' objects}
19686 0.094 0.000 4.793 0.000 function_base.py:2747(_ureduce)
19686 0.071 0.000 0.071 0.000 {method 'flatten' of 'numpy.ndarray' objects}
19686 0.065 0.000 0.065 0.000 {method 'format' of 'str' objects}
78744 0.055 0.000 3.852 0.000 numeric.py:464(asanyarray)
NEW CODE:
import numpy
import cProfile

pr = cProfile.Profile()
pr.enable()

# paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"

def f(a):
    # insert a into the sorted global list `array`, keeping it sorted
    for i in range(0, len(array)):
        if a <= array[i]:
            array.insert(i, a)
            break
    else:
        # a is larger than every element (or the list is empty): append at the end
        array.append(a)

try:
    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        # for every line (tweet) in the file
        for line in inf:  # ###Loop is bad. Builtin function is good
            # number of unique words in this tweet
            wordCount = len(set(line.split()))
            # print wordCount, array
            f(wordCount)
            # write the current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
except IOError as e:
    print 'Operation failed: %s' % e.strerror

###Service
pr.disable()
pr.print_stats(sort='time')
OLD cProfile result:
551211 function calls in 13.195 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
78744 10.193 0.000 10.193 0.000 {numpy.core.multiarray.array}
OLD CODE:
with open(read_path) as inf, open(write_path, 'a') as outf:
    array = []
    # for every line in the file
    for line in inf:
        # append the number of unique words to the array
        array.append(len(set(line.split())))
        # write the current median of the array to the file
        result = "{:.2f}\n".format(numpy.median(array))
        outf.write(result)
NumPy uses a median-finding algorithm that is O(n log n). You call numpy.median
once per line, so your algorithm ends up being O(n^2 log n).
There are several ways to improve on this. One is to keep the array sorted (i.e. insert each element in the position that maintains the sorted order). Each insert takes O(n) (inserting into an array is a linear time operation), and getting the median of a sorted array is O(1), so this ends up being O(n^2).
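The sorted-insert approach can be sketched with the standard-library bisect module, which does the binary search and insert for you (the function name running_medians is my own, not from your code):

```python
import bisect

def running_medians(counts):
    """Yield the median after each new value, keeping the list sorted."""
    sorted_counts = []
    for c in counts:
        # insort does a binary search (O(log n)) then a list insert (O(n)),
        # so each step is O(n) instead of re-running median on the whole list
        bisect.insort(sorted_counts, c)
        n = len(sorted_counts)
        mid = n // 2
        if n % 2:
            # odd length: the middle element is the median
            yield float(sorted_counts[mid])
        else:
            # even length: mean of the two middle elements
            yield (sorted_counts[mid - 1] + sorted_counts[mid]) / 2.0

print(list(running_medians([5, 3, 8, 1])))
```

Because the list is always sorted, getting the median is just an index lookup, so numpy is not needed at all on this path.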
For profiling, the main thing you want to look at is tottime, since it tells you how much time your program spent inside each function in total. As your example shows, percall is sometimes not very useful: a slow function (high percall) that is only called a few times (low ncalls) can have a tottime that is insignificant compared to the other functions.
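To make that concrete, here is a small self-contained sketch (the functions slow_once and fast_many are hypothetical, not from your program) showing how to sort a profile by tottime, so that the functions where your program actually spends its time come out on top regardless of their percall:

```python
import cProfile
import io
import pstats

def slow_once():
    # one expensive call: high percall
    return sum(i * i for i in range(200000))

def fast_many():
    # cheap call: tiny percall, but executed many times
    return 1 + 1

pr = cProfile.Profile()
pr.enable()
slow_once()
for _ in range(50000):
    fast_many()
pr.disable()

# sort by total time spent inside each function, as recommended above
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('tottime').print_stats()
report = buf.getvalue()
print(report)
```

In the report, fast_many shows a percall near zero, yet its 50000 calls can still add up to a noticeable tottime; sorting by percall alone would hide that.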