[SOLVED] Python pandas: Compare rows of dataframe based on some columns and drop row with lowest value

I have a data frame df:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
1   2015-05-11 23:08:46     2015-05-11 23:08:46 http://11i-ssaintandder.com/
2   2015-05-02 18:27:10     2015-06-06 03:52:03 http://example.com/NMqjd1
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://example.com/NMqjd1

I would like to remove the rows that have the same "first_seen" and "uri", keeping only the row with the latest last_seen.

Here is an example of the expected dataset:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://example.com/NMqjd1

Does anybody know how to do it without writing a for loop?

Solution

Call drop_duplicates, pass the columns you want to consider for duplicate matching as subset, and set keep='last' (older pandas releases spelled this take_last=True, which has since been removed):

In [295]:

df.drop_duplicates(subset=['first_seen','uri'], keep='last')
Out[295]:
  index          first_seen            last_seen                           uri
1     1 2015-05-11 23:08:46  2015-05-11 23:08:46  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://example.com/NMqjd1
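
Note that this result keeps row 1 for the first uri rather than row 0: drop_duplicates picks the last row by position within each duplicate group, not the row with the largest last_seen. A minimal sketch of that behaviour, assuming the frame is rebuilt by hand from the question's data (the original construction is not shown) and a recent pandas where the option is spelled keep='last':

```python
import pandas as pd

# Assumed reconstruction of the question's data; only the values come from the post.
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://example.com/NMqjd1", "http://example.com/NMqjd1",
    ],
})

# keep="last" retains the last occurrence in row order within each duplicate group,
# regardless of the values in last_seen.
positional = df.drop_duplicates(subset=["first_seen", "uri"], keep="last")
kept_labels = positional.index.tolist()
print(kept_labels)  # [1, 3]
```

Row 1 survives even though row 0 has the later last_seen, so the frame has to be ordered before dropping duplicates.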

EDIT

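drop_duplicates keeps the last row of each group by position, not by date. To keep the latest date, sort the df on 'first_seen' and 'last_seen' first, so that the row with the latest 'last_seen' comes last within each duplicate group (in current pandas, sort_values and keep='last' replace the older df.sort and take_last):

In [317]:
df = df.sort_values(by=['first_seen','last_seen'], ascending=[False, True])
df.drop_duplicates(subset=['first_seen','uri'], keep='last')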

Out[317]:
  index          first_seen            last_seen                           uri
0     0 2015-05-11 23:08:46  2015-05-11 23:08:50  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://example.com/NMqjd1
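
For reference, the sort-then-drop approach can be wrapped up as a self-contained sketch. The helper name keep_latest and the hand-built frame are assumptions for illustration (neither appears in the original post), and the code assumes a recent pandas with sort_values and keep='last'. It sorts only on last_seen, which is enough because duplicates are matched on the key columns anyway:

```python
import pandas as pd


def keep_latest(frame, keys, latest_col):
    """For each combination of `keys`, keep the row with the largest `latest_col`.

    `keep_latest` is a hypothetical helper name, not part of pandas.
    """
    # Ascending sort puts the latest row of every duplicate group last.
    ordered = frame.sort_values(by=latest_col)
    deduped = ordered.drop_duplicates(subset=keys, keep="last")
    # Restore the original row order by index label.
    return deduped.sort_index()


# Assumed reconstruction of the question's data.
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://example.com/NMqjd1", "http://example.com/NMqjd1",
    ],
})

result = keep_latest(df, keys=["first_seen", "uri"], latest_col="last_seen")
print(result.index.tolist())  # [0, 3]
```

The final sort_index is optional; it only puts the surviving rows back in their original order, which matches the expected dataset in the question.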