Tags: r, performance, merge, data.table, processing-efficiency

How to most efficiently repeat the same merge on different columns in R (preferably data.table)


I have a program that needs to repeat a merge which uses the same column from table A over and over, but changes the column from table B. Doing it in a loop that calls data.table's merge() repeatedly is quite slow, so I'm wondering if there's a faster way to do it.

For an example:

Imagine a table "A" of fruits, with two columns, "fruit_name" and "price."

And another table "B" of baskets, with 3 columns: "fruit_1", "fruit_2", and "fruit_3".

I would like to get the total price of the fruits in table B for each row. I could do 3 merges, all using "fruit_name" from the fruit table, with the first using "fruit_1", the second "fruit_2", and the third "fruit_3" from the baskets table.

Merges take a long time, though. Is there any way to do this more efficiently? Below is code laying out the example and getting the desired result, but doing it too slowly.

I generally use data.table, and it's usually efficient, so I'd prefer a data.table solution, but I'm open to others if they are faster than 3 merges.

Also, I imagine I could reshape the data to long format and do a single merge. Ideally I'd avoid that, because my data makes a lot of sense in wide format, comes in wide, and needs to be exported wide. But if that's definitely the best practice, then I guess that's good to know. (A sketch of that long-format approach follows the reprex below.)

Thanks all for your time!

Edit: I chose the answer that I did because for my use case, at least, its limitations are not a problem. And, it is very, very fast.

The more upvoted answer takes a bit longer to do the initial melt step than the chosen one takes to do the whole task. So, for speed alone, I think the chosen answer is best, if it works for you.

If it doesn't, then the most upvoted answer is probably good!

library(data.table)


fruits <- data.table(fruit_name = c('orange', 'apple', 'pear', 'kiwi', 'blueberry')
                     , price = c(1, 1.531, 2.1, 2.25, 3.03)
                     )

baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
                      ,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
                      ,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))

result <- copy(baskets)


# Merge in the price for each fruit column, renaming the new price
# column after each merge so it isn't overwritten by the next one.
result <- merge(result, fruits, by.x = 'fruit_1', by.y = 'fruit_name')
setnames(result, 'price', 'price_1')

result <- merge(result, fruits, by.x = 'fruit_2', by.y = 'fruit_name')
setnames(result, 'price', 'price_2')

result <- merge(result, fruits, by.x = 'fruit_3', by.y = 'fruit_name')
setnames(result, 'price', 'price_3')

# total price of the three fruits in each basket
result[, price_total := price_1 + price_2 + price_3]
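
For context, here is a minimal sketch of the long-format route mentioned above (melt once, do a single join, then aggregate back). It reuses the fruits and baskets objects from the reprex; the basket_id column is added purely for the illustration, and this is not necessarily how the melt-based answer referenced in the edit is written.

b <- copy(baskets)[, basket_id := .I]        # row id to group by later

# one melt: all fruit_* columns become a single fruit_name column
long <- melt(b, id.vars = "basket_id",
             variable.name = "slot", value.name = "fruit_name")

# one join: look up every price at once (update join, by reference)
long[fruits, price := i.price, on = "fruit_name"]

# sum per basket and attach the totals back to the wide table
totals <- long[, .(price_total = sum(price)), by = basket_id]
result_long <- b[totals, on = "basket_id"]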

Solution

  • Using the function chmatch from data.table, you can solve your problem as follows. If it is still slow, replace chmatch with collapse::fmatch and check whether there is a performance improvement.

    # For each fruit_* column, look up its price by position with chmatch(),
    # then sum across the columns row-wise.
    baskets[, price_total := rowSums(sapply(.SD, \(x) fruits$price[chmatch(x, fruits$fruit_name)])),
            .SDcols = patterns("fruit")]
    
    #    fruit_1 fruit_2   fruit_3 price_total
    #     <char>  <char>    <char>       <num>
    # 1:  orange   apple      pear       4.631
    # 2:   apple    pear      kiwi       5.881
    # 3:   apple    kiwi blueberry       6.811
    # 4:    pear    kiwi blueberry       7.380
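
    If chmatch() turns out to be the bottleneck on larger data, the same lookup can use collapse::fmatch() as suggested above. This is just the drop-in swap, not a tested comparison, so benchmark it on your own data to confirm any gain.

    # same expression as above, with collapse::fmatch() in place of chmatch()
    baskets[, price_total := rowSums(sapply(.SD, \(x) fruits$price[collapse::fmatch(x, fruits$fruit_name)])),
            .SDcols = patterns("fruit")]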