
How to assess the progress of a vectorized task (in Python with pandas)


Vectorizing a task speeds up its execution, but I cannot find a way to measure the progress of a vectorized task when it takes a long time to complete. I've seen that tqdm might do the job, but I wonder whether there is a simpler way.

Example with a pandas DataFrame (assume the index is [0...n] and a message is printed every 1000 rows):

for idx in df.index:
    df.loc[idx, 'B'] = a_function(df.loc[idx, 'A'])
    if (idx % 1000) == 0:
        print(idx)

This will show the progress, but can be horribly slow if df has several million rows and a_function() is not trivial.

The alternative is to vectorize the operation:

df['B'] = df['A'].apply(a_function)

which will probably run much faster, but it gives no hint about the progress. Any idea how to get status information for the vectorized task?


Solution

  • tqdm now supports the main generic pandas.core structures through a progress_apply method:

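    import numpy as np
    import pandas as pd
    from tqdm import tqdm
    
    # Register progress_apply on pandas objects.
    tqdm.pandas()
    df = pd.DataFrame(np.random.randint(0, 100, 3_000_000), columns=['A'])
    # Works like apply, but displays a progress bar while it runs.
    df['B'] = df['A'].progress_apply(lambda x: x**2)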
    

    It shows progress without requiring any print statements (though it may not be convenient in all cases).
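
If pulling in tqdm is not an option, one possible alternative is to wrap the function so that it counts its own calls and prints a message every so often. The following is only a minimal sketch: with_progress is a hypothetical helper (it is not part of pandas or tqdm), a_function is a stand-in for the expensive function from the question, and a plain list is used in place of a DataFrame column to keep the example dependency-free.

```python
import sys


def with_progress(func, total, every=1000, out=sys.stderr):
    """Wrap func so that it reports how many calls have completed.

    Hypothetical helper for illustration; not part of pandas or tqdm.
    """
    done = 0

    def wrapper(value):
        nonlocal done
        result = func(value)
        done += 1
        # Report every `every` calls, and once more on the final call.
        if done % every == 0 or done == total:
            print(f"{done}/{total} rows processed", file=out)
        return result

    return wrapper


def a_function(x):
    # Stand-in for the expensive per-row function from the question.
    return x ** 2


values = list(range(2500))
# Progress messages go to stderr after 1000, 2000 and 2500 calls.
squares = list(map(with_progress(a_function, len(values)), values))
```

Because Series.apply calls a plain Python function once per element, the same wrapper should work with the DataFrame from the question:

    df['B'] = df['A'].apply(with_progress(a_function, len(df)))

The counter adds a small overhead to every call, so a larger value of every keeps the output and the cost down for very long columns.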
