rgroup-bydatatabledata.tableset-difference

How to find difference/setdiff() between two multi-category columns of data table in R


I have my data in two data tables as below (with many more columns than just shown here) -

DataTable 1 = data_sale

Site Id Country Product ID
1000375476 Canada UG10000-WISD
1000375476 Canada UGD12895
1000706152 Switzerland UG10000-WISD
1000706152 Switzerland UG80000-NTCD-G
1000797366 Italy UG10000-WISD
1000797366 Italy UG12210

DataTable 2 = data_licenses

Site Id Country Product ID
1000375476 Canada UG10000-WISD
1000375476 Canada UGD12895
1000797366 Italy UG12785
1000797366 Italy UG12210

I want to calculate the set difference for unique Product ID for all the Site Id in data_sale, keeping all rows.

Here is what I've done so far -

  1. For both of the data tables, I've created a new column with all unique products in it.
data_sale <-
  data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","), 
              keyby = c("Site Id")]
data_licenses <-
  data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")), 
              keyby = c("Site Id")]
  1. Left Merged data_sale with data_licenses
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)

Now the merged datatable look like this -

Site Id Country Product ID unique_products.data_sale unique_products.data_licenses
1000375476 Canada UG10000-WISD UG10000-WISD,UGD12895 UG10000-WISD,UGD12895
1000375476 Canada UGD12895 UG10000-WISD,UGD12895 UG10000-WISD,UGD12895
1000706152 Switzerland UG10000-WISD UG10000-WISD,UG80000-NTCD-G NA
1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G NA
1000797366 Italy UG10000-WISD UG10000-WISD,UG12210 UG12785,UG12210
1000797366 Italy UG12210 UG10000-WISD,UG12210 UG12785,UG12210

The problem is with my final step where I want a new column showing difference between the products of data_sale and data_licenses, it should look like this -

Site Id Country Product ID unique_products.data_sale unique_products.data_licenses difference
1000375476 Canada UG10000-WISD UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 NA
1000375476 Canada UGD12895 UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 NA
1000706152 Switzerland UG10000-WISD UG10000-WISD,UG80000-NTCD-G NA UG10000-WISD,UG80000-NTCD-G
1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G NA UG10000-WISD,UG80000-NTCD-G
1000797366 Italy UG10000-WISD UG10000-WISD,UG12210 UG12785,UG12210 UG10000-WISD
1000797366 Italy UG12210 UG10000-WISD,UG12210 UG12785,UG12210 UG10000-WISD

Any leads on how it can be achieved will be of great help. Thanks!

Below is the data using dput() for the merged datatable

structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152", 
"1000706152", "1000797366", "1000797366"), Country = c("Canada", 
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD", 
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", 
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895", 
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G", 
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895", 
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)

Solution

  • This will get the products in data_sale that are not in data_license by Site Id. Instead of concatenating the unique product IDs, it's easier to work with the unique columns as character vectors.

    library(data.table)
    
    data_licenses <- data.table(`Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
                                Country = c("Canada", "Canada", "Italy", "Italy"),
                                `Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"))
    data_sale <- data.table(`Site Id` = c("1000375476", "1000375476", "1000706152", "1000706152", "1000797366", "1000797366"),
                            Country = c("Canada", "Canada", "Switzerland", "Switzerland", "Italy", "Italy"),
                            `Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", "UG12210"))
    
    data_unique <- data_sale[
      , .(unique_products.data_sale = .(unique(`Product ID`))), c("Site Id", "Country")
    ][
      data_licenses[, .(unique_products = .(unique(`Product ID`))), "Site Id"],
      unique_products.data_licenses := i.unique_products,
      on = "Site Id"
    ][
      , difference := lapply(.I, function(i) setdiff(unique_products.data_sale[[i]], unique_products.data_licenses[[i]]))
    ]
    print(data_unique)
    #>       Site Id     Country   unique_products.data_sale unique_products.data_licenses                  difference
    #> 1: 1000375476      Canada       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                            
    #> 2: 1000706152 Switzerland UG10000-WISD,UG80000-NTCD-G                               UG10000-WISD,UG80000-NTCD-G
    #> 3: 1000797366       Italy        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD