python, pandas, date, datetime

How to identify duplicate datetime entries from a .csv file where pandas does not consider time down to the second?


I am working with a pandas DataFrame where one of the columns contains datetime values, and I need to identify duplicate entries in the "Date" column. The datetime values include both the date and the exact time (hours, minutes, and seconds). However, I noticed an issue when I read the data from a .csv file: pandas does not seem to consider the time down to the second when identifying duplicates.

Interestingly, when I create synthetic data directly in pandas (like in the example below), the output is correct, and it identifies the duplicates as I would expect. But when I read the same data from a .csv file, it marks even datetime values that differ by the hour as duplicates, which is not what I want.

Here is an example of my synthetic DataFrame:

import pandas as pd

# Creating synthetic data with random IDs and names
data = {
    'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
    'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
    'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
    'Date': [
        '2022-01-01 12:00:00', '2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00', 
        '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00', 
        '2022-01-03 12:00:00', '2022-01-03 13:00:00'
    ]
}

# Convert to DataFrame
dd_csv = pd.DataFrame(data)

# Ensure 'Date' is in datetime format
dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])

In this dataset, the following rows have exact duplicate datetime values (same date and time):

2022-01-01 12:00:00 for Sensor-A and Sensor-B (these are duplicates). Now, I want to check for duplicates in the "Date" column based on the exact datetime value, including both date and time. This works correctly for the synthetic data above.

duplicates_all = dd_csv['Date'].duplicated(keep=False)
print(dd_csv[duplicates_all])
      ID      Name    Code                Date
0  ID-1001  Sensor-A  330735  2022-01-01 12:00:00
1  ID-1002  Sensor-B  330736  2022-01-01 12:00:00

However, when the data is read from a .csv file (real data), the time does not seem to be recognized down to the second. This results in pandas marking entries with the same date but different times (down to the hour) as duplicates, even when I set the format explicitly:

import pandas as pd

# URL of the CSV file in the GitHub repository
url = 'https://raw.githubusercontent.com/jc-barreto/Data/main/test_data.csv'

# Read the CSV file directly from the URL
real_data = pd.read_csv(url)

# Convert the 'Date' column to datetime format
real_data['Date'] = pd.to_datetime(real_data['Date'], format="%Y-%m-%d %H:%M:%S", errors='coerce')

# Identify rows with duplicate dates
duplicates_all = real_data['Date'].duplicated(keep=False)

# Print the rows with duplicate dates
print(real_data[duplicates_all])

and the output is:


        Unnamed: 0 ID                Date         T
11774        11774  A 2017-05-25 12:00:00  20.55000
11775        11775  A 2017-05-25 13:00:00  20.56000
11776        11776  A 2017-05-25 14:00:00  20.56000
11777        11777  A 2017-05-25 15:00:00  20.57000
11778        11778  A 2017-05-25 16:00:00  20.57000

where the dates are clearly not repeated, since they have different times.
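For reference, `duplicated` on a properly parsed datetime column does compare down to the second. A minimal sketch (with made-up timestamps, not the real CSV) showing that entries differing only by seconds are not flagged:

```python
import pandas as pd

# Hypothetical timestamps: two exact duplicates and one that
# differs only in the seconds field
s = pd.to_datetime(pd.Series([
    '2017-05-25 12:00:00',
    '2017-05-25 12:00:01',  # differs by one second -> not a duplicate
    '2017-05-25 12:00:00',  # exact duplicate of the first row
]))

# duplicated() compares the full timestamp, seconds included
print(s.duplicated(keep=False))
```

So if rows with different hours are being flagged, the parsed values themselves are worth inspecting rather than the comparison.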

I have tried the suggestion from the answer below, but it didn't work either:

real_data['date_only'] = [x.date() for x in real_data['Date']]
real_data['time_only'] = [x.time() for x in real_data['Date']]

duplicates_all2 = real_data[['date_only', 'time_only']].duplicated(keep=False)
print(real_data[duplicates_all2])
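The same split can also be written with the vectorized `.dt` accessor instead of list comprehensions; a sketch on synthetic data (the column names mirror the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Date': pd.to_datetime([
        '2022-01-01 12:00:00',
        '2022-01-01 12:00:00',  # exact duplicate of the first row
        '2022-01-01 13:00:00',
    ]),
})

# Vectorized equivalents of the list comprehensions
df['date_only'] = df['Date'].dt.date
df['time_only'] = df['Date'].dt.time

duplicates = df[['date_only', 'time_only']].duplicated(keep=False)
print(df[duplicates])
```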

How do I fix that? I need to fix this because I'm going to use ID + Date as a key for a database update, to make sure I only update data that is not already in the database.


Solution

  • Your data does have the duplicated date/times shown, but they aren't consecutive. Sort the duplicated rows if you want to see the duplicated dates together.

    Example:

    import pandas as pd
    
    # Synthetic data with non-consecutive duplicated dates.
    data = {
        'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
        'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
        'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
        'Date': [
            '2022-01-01 12:00:00', '2022-01-01 11:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00', 
            '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00', 
            '2022-01-01 12:00:00', '2022-01-02 15:00:00'
        ]
    }
    
    # Convert to DataFrame
    dd_csv = pd.DataFrame(data)
    
    # Ensure 'Date' is in datetime format
    dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])
    
    duplicates_all = dd_csv['Date'].duplicated(keep=False)
    print(dd_csv[duplicates_all])
    print()
    print(dd_csv[duplicates_all].sort_values(by=['Date']))  # sort the Dates
    

    Output below. Note that in the first printout, the duplicates are listed but not together.

            ID      Name    Code                Date
    0  ID-1001  Sensor-A  330735 2022-01-01 12:00:00
    7  ID-1008  Sensor-H  330742 2022-01-02 15:00:00
    8  ID-1009  Sensor-I  330743 2022-01-01 12:00:00
    9  ID-1010  Sensor-J  330744 2022-01-02 15:00:00
    
            ID      Name    Code                Date
    0  ID-1001  Sensor-A  330735 2022-01-01 12:00:00
    8  ID-1009  Sensor-I  330743 2022-01-01 12:00:00
    7  ID-1008  Sensor-H  330742 2022-01-02 15:00:00
    9  ID-1010  Sensor-J  330744 2022-01-02 15:00:00
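  • Since the eventual goal is to use ID + Date as a database key, you can also check for duplicates on both columns at once. A minimal sketch on synthetic data (not the real CSV):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'A'],
    'Date': pd.to_datetime([
        '2022-01-01 12:00:00',
        '2022-01-01 12:00:00',  # same time, different ID -> not a key clash
        '2022-01-01 12:00:00',  # same ID and time as row 0 -> key clash
        '2022-01-01 13:00:00',
    ]),
})

# Duplicates on the composite (ID, Date) key, sorted so pairs appear together
key_dupes = df[df.duplicated(subset=['ID', 'Date'], keep=False)]
print(key_dupes.sort_values(['ID', 'Date']))
```

    Only rows that collide on both columns are flagged, which matches the update-key semantics described in the question.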