roptimizationmulticoredomcmclapply

R Checking for duplicates is painfully slow, even with mclapply


I've got some data involving repeated sales for a bunch of of cars with unique Ids. A car can be sold more than once.

Some of the Ids are erroneous however, so I'm checking, for each Id, if the size is recorded as the same over multiple sales. If it isn't, then I know that the Id is erroneous.

I'm trying to do this with the following code:

library("doMC")

Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply( as.list(v[-1]), FUN=function(z) {isTRUE(all.equal(z, v[1]))}))

IsGoodId = function(Id){
  Sub = Data[Data$ID==Id,]
  if (length(Sub[,1]) > 1){
    return(compare(Sub[,"Size"]))
  }else{
    return(TRUE)
  }
}

WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N


Solution

  • Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. We find the duplicate sales, thinking that this is a small subset of the total.

    dups = unique(Data$ID[duplicated(Data$ID)])
    DupData = Data[Data$ID %in% dups,,drop=FALSE]
    

    The %in% operator scales very well. Then split the size column based on id, checking for id's with more than one size

    tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
    

    This gives a named logical vector, with TRUE indicating that there is more than one size per id. This scales approximately linearly with the number of duplicate sales; there are clever ways to make this go fast, so if your duplicated data is itself big...

    Hmm, thinking about this a bit more, I guess

    u = unique(Data)
    u$ID[duplicated(u$ID)]
    

    does the trick.