numpy machine-learning data-processing

np.where: "ValueError: operands could not be broadcast together with shapes (38658637,) (9456,)"


I have two dataframes with two different shapes:

  1. df_rts_1 #Shape: (38658637, 7)
  2. df_crsh_rts #Shape: (9456, 6)

I am trying to use np.where to update a column value (df_rts_1['crash']) to 1 when both of the following conditions hold:

  1. df_rts_1['tmc_code'] == df_crsh_rts['tmc']
  2. df_rts_1['measurement_tstamp'] is between df_crsh_rts['Start_time'] and df_crsh_rts['Closed_time']

My code:

df_rts_1['crash'] = np.where(
    (df_rts_1['tmc_code'].values == df_crsh_rts['tmc'].values) &
    (df_rts_1['measurement_tstamp'].values > df_crsh_rts['Start_time'].values) &
    (df_rts_1['measurement_tstamp'].values > df_crsh_rts['Closed_time'].values),
    1, df_rts_1['crash'])

I am getting the error in the title. I am very new to Python/data science.


Solution
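
The error itself comes from the elementwise comparison in your condition: df_rts_1['tmc_code'].values has 38,658,637 elements while df_crsh_rts['tmc'].values has 9,456, and NumPy cannot broadcast two 1-D arrays of different, incompatible lengths against each other. A minimal reproduction of the same failure, with tiny arrays:

import numpy as np

a = np.arange(5)   # shape (5,)
b = np.arange(3)   # shape (3,)
a == b             # ValueError: operands could not be broadcast together with shapes (5,) (3,)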

  • Assume that both your DataFrames contain:

    1. df_rts_1:

          tmc_code  measurement_tstamp  crash
       0         1 2020-01-03 10:05:00      0
       1         1 2020-01-03 11:00:00      0
       2         1 2020-01-03 12:10:00      0
       3         2 2020-01-03 10:10:00      0
       4         3 2020-01-03 10:05:00      0
      
    2. df_crsh_rts:

          tmc          Start_time         Closed_time
       0    1 2020-01-03 10:00:00 2020-01-03 11:00:00
       1    2 2020-01-03 14:00:00 2020-01-03 15:00:00
       2    4 2020-01-03 16:00:00 2020-01-03 18:00:00
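
    If you want to reproduce the steps below, both sample frames can be built as follows (a sketch; the timestamp columns are assumed to hold datetime64 values):

    import pandas as pd

    df_rts_1 = pd.DataFrame({
        'tmc_code': [1, 1, 1, 2, 3],
        'measurement_tstamp': pd.to_datetime([
            '2020-01-03 10:05:00', '2020-01-03 11:00:00',
            '2020-01-03 12:10:00', '2020-01-03 10:10:00',
            '2020-01-03 10:05:00']),
        'crash': 0,
    })

    df_crsh_rts = pd.DataFrame({
        'tmc': [1, 2, 4],
        'Start_time': pd.to_datetime(['2020-01-03 10:00:00',
            '2020-01-03 14:00:00', '2020-01-03 16:00:00']),
        'Closed_time': pd.to_datetime(['2020-01-03 11:00:00',
            '2020-01-03 15:00:00', '2020-01-03 18:00:00']),
    })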
      

    To facilitate evaluation of the "between" condition, let's create the following IntervalIndex:

    interv = pd.IntervalIndex.from_arrays(df_crsh_rts.Start_time,
        df_crsh_rts.Closed_time, closed='both')
    

    And now, assuming that we have a current row from df_rts_1, let's construct your condition from two parts: interv.contains(row.measurement_tstamp), checking the "between" condition, and df_crsh_rts.tmc.eq(row.tmc_code), checking the tmc match.

    To check how they work, save the first row from df_rts_1 as a row variable:

    row = df_rts_1.iloc[0]
    

    and execute both conditions.

    The first condition, interv.contains(row.measurement_tstamp), generates a Numpy array of bool type:

    array([ True, False, False])
    

    and the second, df_crsh_rts.tmc.eq(row.tmc_code), yields a Series (also of bool type):

    0     True
    1    False
    2    False
    Name: tmc, dtype: bool
    

    So to construct the final (single) bool value, stating whether this row should have its crash column updated, the condition is:

    (interv.contains(row.measurement_tstamp) & df_crsh_rts.tmc.eq(row.tmc_code)).any()
    

    i.e. the logical AND of both conditions above, followed by any(), which checks whether any element of this conjunction is True.

    And the last change, compared to your code: instead of np.where, the condition is evaluated for every row of df_rts_1 and the resulting list of bools is passed to .loc, selecting just the rows to update.

    The code to do it is:

    df_rts_1.loc[[(interv.contains(row.measurement_tstamp) &
                   df_crsh_rts.tmc.eq(row.tmc_code)).any()
                  for row in df_rts_1.itertuples()], 'crash'] = 1
    

    For my sample data, the result is:

       tmc_code  measurement_tstamp  crash
    0         1 2020-01-03 10:05:00      1
    1         1 2020-01-03 11:00:00      1
    2         1 2020-01-03 12:10:00      0
    3         2 2020-01-03 10:10:00      0
    4         3 2020-01-03 10:05:00      0
    

    Edit (following questions in a comment)

    Q1. Are we using the indices from both conditions to access and update the df_rts_1 dataframe?

    Actually, no. Note that:

    [ (interv.contains(row.measurement_tstamp) &
        df_crsh_rts.tmc.eq(row.tmc_code)).any() for row in df_rts_1.itertuples() ]
    

    yields a plain list of bools, which does not even contain the original indices. It is then used in .loc[...], so this is a case of boolean indexing. Consecutive True/False elements of this list correspond to consecutive rows of df_rts_1 and state whether a particular row is to be selected.
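
    For illustration, here is the same mechanism on a tiny, made-up frame:

    import pandas as pd

    df = pd.DataFrame({'x': [10, 20, 30]})
    # The bool list is matched to rows by position, not by index labels
    df.loc[[True, False, True], 'x'] = 0
    # df.x is now: 0, 20, 0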

    Q2 and Q3. What does any() do here? What does it help us achieve?

    Look at the example for the initial row: each of the two conditions yields one bool per row of df_crsh_rts, and so does their conjunction ([True, False, False] in this case).

    We require that both these conditions be met for the same row from df_crsh_rts, hence the & joining them.

    Note however that boolean indexing of df_rts_1 needs a single bool per row, not a whole Series of bools per row of df_crsh_rts.

    And just this transformation (from a bool Series to a single bool) is performed by any().
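
    A quick check of this behaviour in isolation:

    import pandas as pd

    pd.Series([True, False, False]).any()    # True  (at least one element is True)
    pd.Series([False, False, False]).any()   # False (no element is True)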

    Edit 2

    As your data volume is huge, I came up with another, probably faster solution. The idea is to group df_rts_1 by tmc_code and, within each group, check measurement_tstamp only against the intervals belonging to that tmc_code, instead of against all of them.

    To do it, define the following function, to be applied to each group:

    def newCrash(grp):
        # Intervals for the current tmc_code (tmc_code is the first column of grp)
        wrk = intrv[intrv.index == grp.iloc[0, 0]]
        if wrk.empty:
            return grp.crash  # Nothing found - no change
        wrkInd = pd.IntervalIndex(wrk)
        # Set crash to 1 where the timestamp falls into any of these intervals
        return grp.crash.mask([wrkInd.contains(ts).any()
            for ts in grp.measurement_tstamp], 1)
    

    Then create the auxiliary Series:

    intrv = pd.Series(pd.IntervalIndex.from_arrays(df_crsh_rts.Start_time,
        df_crsh_rts.Closed_time, closed='both'), index=df_crsh_rts.tmc)
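
    This Series maps each tmc value to its interval, which is what lets newCrash pick the intervals for a single tmc_code by index. For the sample data, a lookup gives something like:

    intrv.loc[1]
    # Interval('2020-01-03 10:00:00', '2020-01-03 11:00:00', closed='both')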
    

    And finally run the update of crash column:

    df_rts_1.crash = (df_rts_1.groupby('tmc_code', sort=False)
        .apply(newCrash).reset_index(level=0, drop=True))
    

    For your (very small) sample data this solution works slower.

    But after I increased the size of df_rts_1 to 40 rows, this solution works a bit faster.

    If you further increase the size of df_rts_1, the difference in speed should grow (in favour of the second solution).

    Check both solutions (the original and this one) on a sample of e.g. 100,000 rows from df_rts_1 and report how long each of them takes to execute.
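
    A sketch of how such a timing comparison could look (assuming df_rts_1, df_crsh_rts, interv, intrv and newCrash from above are all defined; the sample size and random_state are arbitrary choices):

    import time

    s1 = df_rts_1.sample(100_000, random_state=0).copy()
    s2 = s1.copy()

    # Solution 1: every row checked against all intervals
    t0 = time.perf_counter()
    s1.loc[[(interv.contains(row.measurement_tstamp) &
             df_crsh_rts.tmc.eq(row.tmc_code)).any()
            for row in s1.itertuples()], 'crash'] = 1
    print('solution 1:', time.perf_counter() - t0, 's')

    # Solution 2: rows checked only against intervals of their own tmc_code
    t0 = time.perf_counter()
    s2.crash = (s2.groupby('tmc_code', sort=False)
        .apply(newCrash).reset_index(level=0, drop=True))
    print('solution 2:', time.perf_counter() - t0, 's')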