[SOLVED] Creating new columns with Pandas df.apply

I have a huge NetFlow dataset containing a timestamp, source IP, destination IP, protocol, source and destination port numbers, packets exchanged, bytes, and more. I want to create custom attributes based on the current and previous rows.

I want to calculate new columns based on the source IP and timestamp of the current row. Logically, this is what I want to do:

Get the source IP for the current row.
Get the timestamp for the current row.
Based on the source IP and timestamp, get all previous rows in the entire DataFrame that match the source IP and where the communication happened in the last half hour. This is very important.
For the rows (flows, in my example) that match these criteria (same source IP, within the last half hour), compute the sum and mean of the packets and of the bytes.

One row from the dataset

Snippets of relevant code:

df = pd.read_csv(path, header = None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(ts,sa,size):
    global joined
    for (x,y) in zip(df['sa'], df['ts']):
        ...
    return sum

df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'],x['sa'],x['pkt']), axis = 1)

I know that there's probably a better, more efficient way to do this, but I'm sadly not the best programmer.

Thanks.

Solution

from datetime import timedelta

import pandas as pd

df = pd.read_csv(path, header = None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(df, i):
  # Timestamp of the current row
  current = df.loc[i, 'ts']
  # Start of the 30-minute window ending at the current row
  last = current - timedelta(minutes=30)

  # Source address of the current row
  sa = df.loc[i, 'sa']

  # Earlier rows from the same source IP within the last 30 minutes
  # (the current row itself is excluded by the strict comparison)
  new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]

  # Return the sum and mean of packets (use 'byt' instead of 'pkt' for bytes)
  return new_df['pkt'].sum(), new_df['pkt'].mean()


# Compute the (sum, mean) pair for every row, using its sa and timestamp
result = [prev_30_ip_sum(df, i) for i in df.index]

# Create the new columns in the existing DataFrame.
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
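
The loop above rescans the whole DataFrame once per row, which can get slow on a large NetFlow dataset. As a possible alternative, here is a minimal sketch using a time-based rolling window per source IP. It is not part of the original answer. The function name add_prev_30min_stats, the output column names and the sample rows are made up for illustration, and it assumes 'sa' and 'ts' have no missing values. closed='left' is meant to reproduce the same window as the solution (last <= ts < current).

```python
import pandas as pd

def add_prev_30min_stats(df):
    """Hypothetical helper: per-row packet sum/mean over the previous 30 minutes, per source IP."""
    # Sort so each source IP's rows are contiguous and in time order;
    # the groupby result then lines up row-for-row with `out`.
    out = df.sort_values(['sa', 'ts']).copy()

    # closed='left' gives the window [ts - 30min, ts), so the current row is excluded.
    rolled = out.groupby('sa', sort=True).rolling('30min', on='ts', closed='left')['pkt']

    # An empty window yields NaN; fill the sum with 0 to match the loop version.
    out['prev30_pkt_sum'] = rolled.sum().fillna(0).to_numpy()
    out['prev30_pkt_mean'] = rolled.mean().to_numpy()

    # Restore the original row order.
    return out.sort_index()

# Made-up sample flows, only to show the call.
sample = pd.DataFrame({
    'ts': pd.to_datetime(['2021-03-01 10:00', '2021-03-01 10:05', '2021-03-01 10:10']),
    'sa': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'pkt': [5, 7, 3],
})
print(add_prev_30min_stats(sample))
```

The same pattern should work for bytes by selecting 'byt' instead of 'pkt'.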

Refer to the Python documentation for datetime.timedelta to understand how the 30-minute offset is computed.
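
For a quick feel of what timedelta does here, a small sketch (the timestamp is made up): subtracting timedelta(minutes=30) from a pandas Timestamp gives the start of the half-hour window.

```python
from datetime import timedelta

import pandas as pd

# Made-up timestamp standing in for df.loc[i, 'ts'].
current = pd.Timestamp('2021-03-01 10:30:00')

# Start of the 30-minute window that ends at `current`.
window_start = current - timedelta(minutes=30)
print(window_start)

# pandas has its own equivalent, pd.Timedelta, which compares equal.
same_offset = pd.Timedelta(minutes=30) == timedelta(minutes=30)
print(same_offset)
```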