My dataset has two columns named location-id and location-name. Each location name is assigned a unique location id.
location-id location-name
234 SL
456 IN
234 SL
123 EN
Since each location has a unique id, the number of unique values in the location-id column should equal the number of unique values in the location-name column. But there seems to be an error in the df: location-id has 1863 unique values while location-name has only 1800.
Is there a way to spot which entries contain the error?
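A quick way to reproduce the mismatch itself (a minimal sketch, assuming the DataFrame is called df):
print(df['location-id'].nunique())    # e.g. 1863
print(df['location-name'].nunique())  # e.g. 1800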
I thought of one approach: iterate through these two columns and build a dictionary of key-value pairs.
dict_a = {234:"SL", 456:"IN", 123:"EN"}
For each row, get the location-id and check whether it is already a key in the dict. If it is, compare the stored value with the current row's location-name; if they match, move on to the next row. If they differ, add that new name as an additional value under the same key. After iterating through the complete dataset, keys with more than one value point to the errors in the dataset.
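A minimal sketch of this dictionary approach, assuming the DataFrame is called df:
from collections import defaultdict

id_to_names = defaultdict(set)
for loc_id, loc_name in zip(df['location-id'], df['location-name']):
    id_to_names[loc_id].add(loc_name)  # collect every name seen for this id

# ids that were given more than one name are the suspect entries
conflicts = {k: v for k, v in id_to_names.items() if len(v) > 1}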
Is there a more efficient way to do this?
Use groupby and then filter on those location names, something like below:
t = df.groupby('location-name')['location-id'].nunique().reset_index()  # distinct ids per name
df_filtr = df[df['location-name'].isin(t[t['location-id'] > 1]['location-name'])]  # rows whose name maps to more than one id
You can then sort by location-name to compare the conflicting rows and decide which id is the correct one.
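For example, assuming df_filtr from the snippet above:
df_filtr.sort_values(['location-name', 'location-id'])  # conflicting rows grouped together for inspection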