I have two pandas dataframes
flows:
------
sourceIPAddress destinationIPAddress flowStartMicroseconds flowEndMicroseconds
163.193.204.92 40.8.121.226 2021-05-01 07:00:00.113 2021-05-01 07:00:00.113962
104.247.103.181 163.193.124.92 2021-05-01 07:00:00.074 2021-05-01 07:00:00.101026
17.254.170.53 163.193.124.133 2021-05-01 07:00:00.077 2021-05-01 07:00:00.083874
18.179.96.152 203.179.250.96 2021-05-01 07:00:00.112 2021-05-01 07:00:00.098296
133.103.144.34 13.154.212.11 2021-05-01 07:00:00.101 2021-05-01 07:00:00.112013
attacks:
--------
datetime srcIP dstIP
2021-05-01 07:00:00.055210 188.67.130.72 133.92.239.153
2021-05-01 07:00:00.055500 45.100.34.74 203.179.180.153
2021-05-01 07:00:00.055351 103.113.29.26 163.193.242.75
2021-05-01 07:00:00.056209 128.215.229.101 163.193.94.194
2021-05-01 07:00:00.055258 45.111.22.11 163.193.138.139
I want to check for each row of flows if it matches any row of attacks where
attacks[srcIP] == flows[srcIP] || attacks[srcIP] == flows[destIP]
&&
attacks[destIP] == flows[srcIP] || attacks[destIP] == flows[destIP]
&&
attacks[datetime] between flows[flowStartMicroseconds] and flows[flowEndMicroseconds]
Is there any more efficient way to do this than just iterating over it?
EDIT: The dataframes are quite large. I included the head() of each.
flows = {'sourceIPAddress': {510: '163.193.204.92',
564: '104.247.103.181',
590: '17.254.170.53',
599: '18.179.96.152',
1149: '133.103.144.34'},
'destinationIPAddress': {510: '40.8.121.226',
564: '163.193.124.92',
590: '163.193.124.133',
599: '203.179.250.96',
1149: '13.154.212.11'},
'flowStartMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113000'),
564: Timestamp('2021-05-01 07:00:00.074000'),
590: Timestamp('2021-05-01 07:00:00.077000'),
599: Timestamp('2021-05-01 07:00:00.112000'),
1149: Timestamp('2021-05-01 07:00:00.101000')},
'flowEndMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113962'),
564: Timestamp('2021-05-01 07:00:00.083874'),
590: Timestamp('2021-05-01 07:00:00.098296'),
599: Timestamp('2021-05-01 07:00:00.112013'),
1149: Timestamp('2021-05-01 07:00:00.101026')}}
attacks = {'datetime': {0: Timestamp('2021-05-01 07:00:00.055210'),
1: Timestamp('2021-05-01 07:00:00.055500'),
2: Timestamp('2021-05-01 07:00:00.055351'),
3: Timestamp('2021-05-01 07:00:00.056209'),
4: Timestamp('2021-05-01 07:00:00.055258')},
'srcIP': {0: '188.67.130.72',
1: '45.100.34.74',
2: '103.113.29.26',
3: '128.215.229.101',
4: '45.111.22.11'},
'dstIP': {0: '133.92.239.153',
1: '203.179.180.153',
2: '163.193.242.75',
3: '163.193.94.194',
4: '163.193.138.139'}}
Solution: Database
My solution was to import the two dataframes into PostgreSQL and create two new tables for forward and backward IP matches, then UNION ALL them together.
The two single joins are significantly faster than if you would do one giant join.
create table attacks_forward as
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
flows
JOIN attacks
ON flows."sourceIPAddress" = attacks."srcIP"
and flows."destinationIPAddress" = attacks."dstIP"
and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
create table attacks_backward as
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
flows
JOIN attacks
ON flows."sourceIPAddress" = attacks."dstIP"
and flows."destinationIPAddress" = attacks."srcIP"
and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
create table attacks_flows as
SELECT * FROM attacks_forward
UNION ALL
SELECT * FROM attacks_backward;