
Python pandas: Compare rows of dataframe based on some columns and drop row with lowest value


I have a data frame df:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
1   2015-05-11 23:08:46     2015-05-11 23:08:46 http://11i-ssaintandder.com/
2   2015-05-02 18:27:10     2015-06-06 03:52:03 http://example.com/NMqjd1
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://example.com/NMqjd1

I would like to remove the rows that have the same "first_seen" and "uri", keeping only the row with the latest last_seen.

Here is an example of the expected dataset:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://example.com/NMqjd1

Does anybody know how to do it without writing a for loop?


Solution

  • Call drop_duplicates, pass the columns you want to consider for duplicate matching as subset, and set keep='last' (spelled take_last=True in older pandas versions). This keeps the last row of each duplicate group in the frame's current order:

    In [295]:
    
    df.drop_duplicates(subset=['first_seen','uri'], keep='last')
    Out[295]:
      index          first_seen            last_seen                           uri
    1     1 2015-05-11 23:08:46  2015-05-11 23:08:46  http://11i-ssaintandder.com/
    3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://example.com/NMqjd1
    

    EDIT

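    In order to keep the row with the latest last_seen, you need to sort the df first on 'first_seen' and 'last_seen', so that the latest last_seen comes last within each group of duplicates (sort_values replaces the df.sort(columns=...) call from older pandas versions):

    In [317]:
    df = df.sort_values(by=['first_seen','last_seen'], ascending=[False, True])
    df.drop_duplicates(subset=['first_seen','uri'], keep='last')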
    
    Out[317]:
      index          first_seen            last_seen                           uri
    0     0 2015-05-11 23:08:46  2015-05-11 23:08:50  http://11i-ssaintandder.com/
    3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://example.com/NMqjd1
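
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question by hand (the construction code is my assumption, as the question only shows the printed frame) and assumes a recent pandas where sort_values and keep='last' are available. It also shows one possible alternative that skips the sort, using groupby with idxmax; that variant is not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question (the construction itself is assumed).
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46",
        "2015-05-11 23:08:46",
        "2015-05-02 18:27:10",
        "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50",
        "2015-05-11 23:08:46",
        "2015-06-06 03:52:03",
        "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/",
        "http://11i-ssaintandder.com/",
        "http://example.com/NMqjd1",
        "http://example.com/NMqjd1",
    ],
})

# Sort so that, within each (first_seen, uri) group, the row with the
# latest last_seen comes last, then keep only that last row per group.
result = (
    df.sort_values(by=["first_seen", "last_seen"], ascending=[False, True])
      .drop_duplicates(subset=["first_seen", "uri"], keep="last")
)

# Alternative without sorting: look up the index label of the row with
# the maximum last_seen in each group and select those rows.
alt = df.loc[df.groupby(["first_seen", "uri"])["last_seen"].idxmax()]

print(result)
print(alt)
```

Both variants select the rows labelled 0 and 3 for this sample. The row order differs: the sorted version follows the sort order, while the groupby version follows the order of the group keys.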