I have a data frame df:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
2 2015-05-02 18:27:10 2015-06-06 03:52:03 http://example.com/NMqjd1
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://example.com/NMqjd1
I would like to remove the rows that have the same "first_seen" and "uri" values, keeping only the row with the latest last_seen.
Here is an example of the expected dataset:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://example.com/NMqjd1
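For reference, a minimal snippet that reproduces the frame above (assuming the timestamp columns are meant to be parsed as datetimes):
import pandas as pd

df = pd.DataFrame({
    'first_seen': pd.to_datetime(['2015-05-11 23:08:46', '2015-05-11 23:08:46',
                                  '2015-05-02 18:27:10', '2015-05-02 18:27:10']),
    'last_seen': pd.to_datetime(['2015-05-11 23:08:50', '2015-05-11 23:08:46',
                                 '2015-06-06 03:52:03', '2015-06-08 08:44:53']),
    'uri': ['http://11i-ssaintandder.com/', 'http://11i-ssaintandder.com/',
            'http://example.com/NMqjd1', 'http://example.com/NMqjd1'],
})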
Does anybody know how to do this without writing a for loop?
Call drop_duplicates, pass the columns you want to consider for duplicate matching as the subset argument, and set keep='last':
In [295]:
df.drop_duplicates(subset=['first_seen','uri'], keep='last')
Out[295]:
first_seen last_seen uri
1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://example.com/NMqjd1
EDIT
Note that this kept row 1, whose last_seen is not the latest for that pair. In order to keep the latest last_seen, you need to sort the df first on 'first_seen' and 'last_seen':
In [317]:
df = df.sort_values(by=['first_seen','last_seen'], ascending=[False, True])
df.drop_duplicates(subset=['first_seen','uri'], keep='last')
Out[317]:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://example.com/NMqjd1
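As an aside, a sketch of an alternative that avoids sorting entirely, using groupby with idxmax (this assumes last_seen is a datetime or otherwise orderable column; the variable name latest is just for illustration):
# For each (first_seen, uri) pair, find the index label of the row with
# the maximum last_seen, then select those rows from the original frame.
latest = df.loc[df.groupby(['first_seen', 'uri'])['last_seen'].idxmax()]
This returns exactly one row per (first_seen, uri) pair, the one with the largest last_seen, regardless of the frame's row order.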