pythondataframefunctionenumerate

Function not capturing change in values


I have the following dataframe:

state_territory_code    week_nbr
 CA                     WEEK 49
 CA                     WEEK 49
 FL                     WEEK 49
 CA                     WEEK 50
 TN                     WEEK 50
 HI                     WEEK 50
 GA                     WEEK 51
 FL                     WEEK 51

I am trying to essentially perform the following:

The output should be:

state_territory_code    week_nbr          new_cal_date
     CA                     WEEK 49         01/01/2024
     CA                     WEEK 49         01/01/2024
     FL                     WEEK 49         01/01/2024
     CA                     WEEK 50         01/08/2024
     TN                     WEEK 50         01/08/2024
     HI                     WEEK 50         01/08/2024
     GA                     WEEK 51         01/15/2024
     FL                     WEEK 51         01/15/2024

Here is my code:

def cal_week_start(some_df):
    start_dt = datetime.datetime(2024, 1, 1)
    
    for i, wk in enumerate(some_df):
        if i < (len(some_df)-1): ## Keep running as long as we haven't reached end of DF
            next_value = some_df[i+1] ## index pos of next value
            if wk != next_value: ## if value changes from one to the next do something
                new_dt = start_dt + timedelta(days = 7) ## Add 7 days.
                start_dt = new_dt ## reinitiate variable to add 7 days, 14, 21, etc.
                return start_dt 
        else:
            return start_dt

df_new['new_cal_date'] = df_new['week_nbr'].apply(cal_week_start)

It's a little wonky I know/not fool-proof, but I just need it to execute on a large dataset. It works when I test it outside of the function for some reason, not sure why it just repeats 01/08/2024 all the way through. Any help would be greatly appreciated. I am still learning.


Solution

  • I would slightly adjust your assumptions:

    >>> import datetime
    >>> start_dt = datetime.datetime(2024, 1, 1)
    >>> min_week = min(df.week_nbr.str.rsplit(" ").str[-1].astype(int))
    >>> df["new_cal_date"] = datetime.timedelta(days=7)*df.week_nbr.str.rsplit(" ").str[-1].astype(int).sub(min_week) + start_dt
    >>> df
      state_territory_code week_nbr new_cal_date
    0                   CA  WEEK 49   2024-01-01
    1                   CA  WEEK 49   2024-01-01
    2                   FL  WEEK 49   2024-01-01
    3                   CA  WEEK 50   2024-01-08
    4                   TN  WEEK 50   2024-01-08
    5                   HI  WEEK 50   2024-01-08
    6                   GA  WEEK 51   2024-01-15
    7                   FL  WEEK 51   2024-01-15
    

    Explanation: you probably mean to map week, per its number to date - not rely on order (even if it's given). You also will be better off using vectorized computing, apply is usually slower. The only caveat here is, if year marker changes - but you can easily mitigate it with grouping/conditional mapping.