I am working with a pandas DataFrame where one of the columns contains datetime values, and I need to identify duplicate entries in the "Date" column. The datetime values include both the date and the exact time (hours, minutes, and seconds). However, I noticed an issue when I read the data from a .csv file: pandas does not seem to consider the time down to the second when identifying duplicates.
Interestingly, when I create synthetic data directly in pandas (as in the example below), the duplicates are identified exactly as I expect. But when I read the same kind of data from a .csv file, rows whose datetime values differ by the hour are marked as duplicates, which is not what I want.
Here is an example of my synthetic DataFrame:
import pandas as pd
# Creating synthetic data with random IDs and names
data = {
    'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
    'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
    'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
    'Date': [
        '2022-01-01 12:00:00', '2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00',
        '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00',
        '2022-01-03 12:00:00', '2022-01-03 13:00:00'
    ]
}
# Convert to DataFrame
dd_csv = pd.DataFrame(data)
# Ensure 'Date' is in datetime format
dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])
In this dataset, the following rows share exactly the same datetime value (same date and time): 2022-01-01 12:00:00 for Sensor-A and Sensor-B. Now, I want to check for duplicates in the "Date" column based on the exact datetime value, including both date and time. This works as expected for the synthetic data above:
duplicates_all = dd_csv['Date'].duplicated(keep=False)
print(dd_csv[duplicates_all])
ID Name Code Date
0 ID-1001 Sensor-A 330735 2022-01-01 12:00:00
1 ID-1002 Sensor-B 330736 2022-01-01 12:00:00
However, when the data is read from a .csv file (my real data), the time does not seem to be recognized down to the second: pandas marks entries with the same date but different times (they differ by the hour) as duplicates, even though I set the format explicitly when parsing:
import pandas as pd
# URL of the CSV file in the GitHub repository
url = 'https://raw.githubusercontent.com/jc-barreto/Data/main/test_data.csv'
# Read the CSV file directly from the URL
real_data = pd.read_csv(url)
# Convert the 'Date' column to datetime format
real_data['Date'] = pd.to_datetime(real_data['Date'], format="%Y-%m-%d %H:%M:%S", errors='coerce')
# Identify rows with duplicate dates
duplicates_all = real_data['Date'].duplicated(keep=False)
# Print the rows with duplicate dates
print(real_data[duplicates_all])
and the output is:
Unnamed: 0 ID Date T
11774 11774 A 2017-05-25 12:00:00 20.55000
11775 11775 A 2017-05-25 13:00:00 20.56000
11776 11776 A 2017-05-25 14:00:00 20.56000
11777 11777 A 2017-05-25 15:00:00 20.57000
11778 11778 A 2017-05-25 16:00:00 20.57000
where the dates are clearly not duplicated, since they have different times.
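To rule out a parsing problem, a quick sanity check along these lines (just a sketch) should confirm whether the full timestamps survived the conversion:
# Sanity checks on the parsed column: confirm the dtype, count rows coerced
# to NaT, and count exact-timestamp repeats.
print(real_data['Date'].dtype)               # should be datetime64[ns]
print(real_data['Date'].isna().sum())        # rows that failed to parse
print(real_data['Date'].duplicated().sum())  # exact repeats in the column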
I also tried the suggestion from the answer below, but it didn't work either:
# Split the timestamp into separate date and time columns; the .dt accessor
# also handles NaT values produced by errors='coerce'.
real_data['date_only'] = real_data['Date'].dt.date
real_data['time_only'] = real_data['Date'].dt.time
duplicates_all2 = real_data[['date_only', 'time_only']].duplicated(keep=False)
print(real_data[duplicates_all2])
How do I fix this? I need exact duplicate detection because I'm going to use ID + Date as the key for a database update, to make sure I only insert rows that are not already in the database.
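For context, the eventual update step would be something like this anti-join, where db_keys is a placeholder for a DataFrame of the (ID, Date) keys already in the database:
# Anti-join sketch: keep only rows whose (ID, Date) key is not already in
# the database. 'db_keys' is a placeholder for the existing key pairs.
merged = real_data.merge(db_keys[['ID', 'Date']], on=['ID', 'Date'],
                         how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')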
Your data really does contain the duplicated date/times shown; the matching rows just aren't consecutive. Sort the flagged rows if you want to see the duplicated dates together.
Example:
import pandas as pd
# Synthetic data with non-consecutive duplicated dates.
data = {
    'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
    'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
    'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
    'Date': [
        '2022-01-01 12:00:00', '2022-01-01 11:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00',
        '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00',
        '2022-01-01 12:00:00', '2022-01-02 15:00:00'
    ]
}
# Convert to DataFrame
dd_csv = pd.DataFrame(data)
# Ensure 'Date' is in datetime format
dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])
duplicates_all = dd_csv['Date'].duplicated(keep=False)
print(dd_csv[duplicates_all])
print()
print(dd_csv[duplicates_all].sort_values(by=['Date'])) # sort the Dates
Output below. Note that in the first listing the duplicates are present but not listed together.
ID Name Code Date
0 ID-1001 Sensor-A 330735 2022-01-01 12:00:00
7 ID-1008 Sensor-H 330742 2022-01-02 15:00:00
8 ID-1009 Sensor-I 330743 2022-01-01 12:00:00
9 ID-1010 Sensor-J 330744 2022-01-02 15:00:00
ID Name Code Date
0 ID-1001 Sensor-A 330735 2022-01-01 12:00:00
8 ID-1009 Sensor-I 330743 2022-01-01 12:00:00
7 ID-1008 Sensor-H 330742 2022-01-02 15:00:00
9 ID-1010 Sensor-J 330744 2022-01-02 15:00:00
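Since your eventual key is ID plus Date, the same pattern extends to duplicates over the pair of columns. A sketch (with this synthetic data every ID is unique, so it prints an empty frame; on your real data it would flag true key collisions):
# Flag rows whose (ID, Date) pair occurs more than once, then sort so the
# duplicate groups sit together.
pair_dupes = dd_csv.duplicated(subset=['ID', 'Date'], keep=False)
print(dd_csv[pair_dupes].sort_values(by=['ID', 'Date']))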