speed up my function about build bill of materials with pandas

for me a Python fresh, I m working on a code for building bill of materials(BOM), which can get the customer's desired item id and purchase quantity from the order sheet in the excel file and use another BOM sheet(also in the same excel) to calculate the quantity of all raw materials needed.After subtracting the stock of raw materials from the inventory, the remaining demand is input to the dictionary output like{id_material: quantity}. The format of BOM sheet as follow

item_id	process_id	process_No.	IN/OUT	material_id	quantity_in	quantity_out
A	z420	1	IN	12125	100	Nan
A	z420	1	OUT	A-z512-2	Nan	100
A	z512	2	IN	A-z512-2	100	Nan
A	z512	2	OUT	A-z600-3	Nan	120
A	z600	3	IN	A-z600-3	120	Nan
A	z600	3	OUT	14551	Nan	-20
A	z600	3	OUT	A	Nan	100

attr: processs_id : Process id used

attr: processs_No.: Process order in the process path. Not always consecutive or regular like natural numbers, such as (51, 60, 70, 100)

attr: IN/OUT : Indicate whether the material is a raw material or an output

the pandas.dataframe I use is also like this but two attr columns added: 'count_demand' for indicate the quantity needed and 'flag' for my function to identify the material that needs to be executed.Lets call it 'df_demand'.

I was able to complete the functions that would serve my purpose, but the speed was not satisfactory. I tested with moudles such as timeit and found some operations that took a lot of time, but I couldn't think of a way to optimize them, so I came here for help.

The first thing I found unsatisfactory and the most time-consuming part of the whole process was a function that was used by my code to find the raw material of the current demand material and to calculate the demand quantity of the raw material in proportion to the BOM according to the quantity of the demand material.The code as follow

def calculate_demand_raw(row, df_demand):
    try:
       if np.isnan(row['quantity_out']):
           raise ValueError('To avoid including those recycled materials with negative outputs')

       list_index = list(df_demand['item_id'].isin([row['item_id']]) &
                         df_demand['process_No.'].isin([row['process_No.']]) &
                         df_demand['IN/OUT'].isin(['IN']))
       index = [i for i, x in enumerate(list_index) if x==True]  
       # Search to find the index of the required generation process

       df_demand.loc[index, 'count_demand'] = row['count_demand']/row['quantity_out']*
                                               df_demand.loc[index, 'quantity_in']
       # calculate quantity of raw materials.
       df_demand.loc[index, 'flag'] = 1

    except ValueError:
        pass  # Prevent the query material is the base material, no process generation
    df_demand.loc[row.name, 'flag'] = 0

df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)

timeit told me that, in the function, it takes three times as long to find the index of the eligible rows as it does to calculate the quantity of the raw material, and calculate_demand_raw is also the most time-consuming function in a loop.So could anyone have a deal of search index time reduction?

The second time-consuming function in a loop is a function that fill the aggregated raw material requirements into the df_demand's attr'count_demand' where they are generated for the process.

def fill_demand(row, qty_sum_demand, df_demand):
    df_demand[row.name, 'count_demand'] += qty_sum_demand.loc[
                                qty_sum_demand['IN/OUT'].isin([row['IN/OUT']).tolist(),
                                'count_demand'].tolist()
    df_demand.loc[index, 'flag'] = 1

df_demand.loc[index_generated_process].apply(lambda row: 
                                fill_demand(row, qty_sum_demand, df_demand), axis=1)

Is it the conditional search in the function that makes it take so long, just like calculate_demand_raw? Is it possible to turn this operation into a faster numpy vectorization operation?

Solution

A modified version of your example to show some functionality:

df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
0       A       z420            1     IN       12125        100.0           NaN     0
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0
2       A       z512            2     IN    A-z512-2        100.0           NaN     0
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0
4       A       z600            3     IN    A-z600-2          NaN         400.0     1
5       A       z600            3     IN    A-z600-3        120.0         200.0     1 # <- only valid flag (non-na quantity_in)
6       A       z600            3     IN       14551          NaN         -20.0     0
7       A       z600            3    OUT           A          NaN         100.0     0

Here's one way you may implement calculate_demand_raw without using .apply + .loc lookups.

Generally the you want to .merge in cases like this so you have all the data "side-by-side" which allows you to then work in a "vectorized manner".

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1

A breakdown of the steps:

Find all flag rows.

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

The .dropna() is to emulate the if np.isnan(row['quantity_out']) line in your code.

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
5       A       z600            3     IN    A-z600-3        120.0         200.0     1

left-merge with flags:

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
5       A       z600            3     IN    A-z600-3        120.0         200.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0         z600       IN      A-z600-3          120.0           200.0     1.0

You want to discard OUT rows, it's unclear if you want to compare the flag rows with themselves so I've discarded them here.

You can reset the _y columns on the rows you want to discard back to NaN

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

You can then perform your calculations on the rows with nonna _y values:

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

Toggle the flag values:

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     0          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     0          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN