rdplyrdata-cleaningimputation

R aggregate() and distinct() functions only cleaning some of my data


I am currently struggling with trying to impute or remove rows in R that are mostly duplicates...except for three columns. I am working with a mortality dataset where I have countries/age groups/genders/years/other categorical variables that are all the same...but multiple rows still show differing death counts/percentages/per-capita-deaths for what is otherwise a duplicate row. (The data I am trying to use in question is here: https://www.kaggle.com/ds/4597596)

I have tried using dplyr function to try to select specific columns only. I have also tried to use the aggregate function to impute a mean death toll and eliminate the duplicate rows. I still do not seem to be doing it correctly, so I am not sure what I am doing wrong here.

(For example, after doing all of this, I am getting two rows for something like United States of America, Male, Age Group 25-34, year=2020, but one row has a death count of 3465 and the other row has a count of 3417. All of the other information is the same except for that one column. It seems to work well for data closer to the top of the data frame, so I am unsure why this is happening.)

#dplyr    
Death %>% distinct(Death$RegionCode, Death$RegionName, Death$CountryCode, Death$CountryName, Death$Year, Death$Sex, Death$AgeGroup,.keep_all = TRUE)

#aggregate function
unalived1 <- aggregate(Death$SuicideCount,by=list(RegionName=Death$RegionName, CountryName=Death$CountryName, Year=Death$Year, Sex=Death$Sex, AgeGroup=Death$AgeGroup, CauseSpecificDeathPercentage=Death$CauseSpecificDeathPercentage, DeathRatePer100K=Death$DeathRatePer100K, Population=Death$Population, GDP=Death$GDP, GDPPerCapita=Death$GDPPerCapita, GrossNationalIncome=Death$GrossNationalIncome, GNIPerCapita=Death$GNIPerCapita, InflationRate=Death$InflationRate, EmploymentPopulationRatio=Death$EmploymentPopulationRatio),FUN=mean)

unalived2 <- aggregate(unalived1$CauseSpecificDeathPercentage,by=list(RegionName=unalived1$RegionName, CountryName=unalived1$CountryName, Year=unalived1$Year, Sex=unalived1$Sex, AgeGroup=unalived1$AgeGroup, SuicideCount=unalived1$x, DeathRatePer100K=unalived1$DeathRatePer100K, Population=unalived1$Population, GDP=unalived1$GDP, GDPPerCapita=unalived1$GDPPerCapita, GrossNationalIncome=unalived1$GrossNationalIncome, GNIPerCapita=unalived1$GNIPerCapita, InflationRate=unalived1$InflationRate, EmploymentPopulationRatio=unalived1$EmploymentPopulationRatio),FUN=mean)

unalived3 <- aggregate(unalived2$DeathRatePer100K,by=list(RegionName=unalived2$RegionName, CountryName=unalived2$CountryName, Year=unalived2$Year, Sex=unalived2$Sex, AgeGroup=unalived2$AgeGroup, SuicideCount=unalived2$SuicideCount, CauseSpecificDeathPercentage=unalived2$x, Population=unalived2$Population, GDP=unalived2$GDP, GDPPerCapita=unalived2$GDPPerCapita, GrossNationalIncome=unalived2$GrossNationalIncome, GNIPerCapita=unalived2$GNIPerCapita, InflationRate=unalived2$InflationRate, EmploymentPopulationRatio=unalived2$EmploymentPopulationRatio),FUN=mean)

unalived4 <- na.omit(unalived3)

unalived <- unalived4


US_Deaths <- unalived[unalived$CountryName %in% c("United States of America"),]
US_Deaths_Male <- US_Deaths[US_Deaths$Sex %in% c("Male"),]
US_Deaths_Male_2534 <- US_Deaths_Male[US_Deaths_Male$AgeGroup %in% c("25-34 years"),]

After doing all of the above, I still get the duplicate rows where only a couple of columns have different values.

I was wondering if I could get any insight as to how I would go about this correctly. I would rather try to impute the data, but eliminating rows is also fine.


Solution

  • First remove outright duplicates:

    library(dplyr)
    
    Death <- Death %>% distinct()
    

    Then summarise, grouping by similar variables to the ones that you have chosen and taking the mean of the remaining columns. There are a few columns that were not being aggregated in your original approach.

    Death <- Death %>%
      group_by(
        RegionName,
        CountryName,
        Year,
        Sex,
        AgeGroup,
        Population,
        GDP,
        GDPPerCapita,
        GrossNationalIncome,
        GNIPerCapita,
        InflationRate,
        EmploymentPopulationRatio
      ) %>%
      summarise(
        SuicideCount = mean(SuicideCount),
        CauseSpecificDeathPercentage = mean(CauseSpecificDeathPercentage),
        DeathRatePer100K = mean(DeathRatePer100K),
        CauseSpecificDeathPercentage = mean(CauseSpecificDeathPercentage),
        .groups = "drop"
      )
    

    Check on the example you mention:

    Death %>%
      filter(
        CountryName == "United States of America",
        AgeGroup == "25-34 years",
        Year == 2020,
        Sex == "Male"
      ) %>%
      nrow()
    

    The result of this is 1, indicating that there are no duplicates for this specific case.