r cluster-analysis categorical-data exact-match

Finding exact matches in the values of categorical variables


I want to find exact matches in the values across all three columns (rg1, rg2, rg3). Below is my data frame.

[image: screenshot of the data frame]

For instance, the first row has the combination (70, 71, 72). If the same combination appears in any of the remaining rows, for other user IDs, then I want to keep only those users and delete the rest.

To describe it further: if the first row has (70, 71, 72) and, say, row 10 has the same values in columns B, C, and D, then I just want to display row 1 and row 10 (using R).

I tried clustering with k-modes, but I'm not getting the expected results. The current code groups all the rgs, but it seems to consider only the single rg that appears most frequently in the data frame (shown above) and ranks the rows accordingly.

Can someone please guide me on this? Is there a better way to do it?

library(dplyr)  # provides %>% and mutate()

kmodes <- klaR::kmodes(mapped_df, modes = 5, iter.max = 10, weighted = FALSE)
# Add these clusters to the main data frame
final <- mapped_df %>%
  mutate(cluster = kmodes$cluster)

Solution

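  • You can sort the values within each row (across the rg columns), then look for duplicated rows. Because of the sorting, the same values in a different order also count as a match.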

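    library(tibble)
    
    set.seed(1234)
    
    df <- tibble(Userids = 1:20,
                 rg_1 = sample(1:20, 20, TRUE),
                 rg_2 = sample(1:20, 20, TRUE),
                 rg_3 = sample(1:20, 20, TRUE))
    
    # make row 4 a reversed copy of row 15 so the data contains a known match
    df[4, -1] <- rev(df[15, -1])
    
    # sort the values within each row (one row of the matrix per user)
    df_sorted <- t(apply(df[-1], 1, sort))
    
    # return every row whose sorted combination occurs more than once
    df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]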
    

    This returns every row of the original data frame whose combination of values occurs more than once. With the sorted matrix in hand, it should be easy enough to extract whatever else you need.

      Userids  rg_1  rg_2  rg_3
        <int> <int> <int> <int>
    1       4    16    17     6
    2      15     6    17    16
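
The answer above sorts each row before comparing, so it matches combinations regardless of column order. If the match should also respect the order of the columns, the sorting step can be skipped. The following is a minimal sketch of both variants in base R; the toy data and the names `rg`, `exact`, `sorted`, `combo` and `any_order` are made up for illustration and are not from the original post.

```r
# Hypothetical toy data: users 1 and 4 share (70, 71, 72) in the same order;
# user 3 has the same values in a different order.
df <- data.frame(
  Userids = 1:5,
  rg_1 = c(70, 10, 72, 70, 30),
  rg_2 = c(71, 11, 71, 71, 31),
  rg_3 = c(72, 12, 70, 72, 32)
)

rg <- df[-1]  # the three rg columns only

# Order-sensitive match: compare the rows as they are, without sorting
exact <- df[duplicated(rg) | duplicated(rg, fromLast = TRUE), ]

# Order-insensitive match: sort each row, then build one text key per row
sorted <- t(apply(rg, 1, sort))
df$combo <- apply(sorted, 1, paste, collapse = "-")
any_order <- df[duplicated(df$combo) | duplicated(df$combo, fromLast = TRUE), ]

print(exact)      # users 1 and 4
print(any_order)  # users 1, 3 and 4
```

The `combo` key is one possible way to label each combination; it can also be used to split or count users per combination, for example with `table(df$combo)`.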