rustarray-merge

Merging two Vec<(u32, Vec<u8>)> without duplicate keys


I have 2 vectors which are of type Vec<(u32, Vec<u8>)>. I want to merge these two vectors, and I want the result to have unique keys. In case of same keys, 2nd vector should overwrite first one.

Here is my attempt to solve this problem:

pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
    let mut new_map = HashMap::<u32, Vec<u8>>::from_iter(old_v);

    new_map.extend(new_v);
    new_map.into_iter().collect()
}

This works, but the problem is, these vectors carry quite big data, potentially 500 KB to 1 MB of data, with 1000s of entries (especially old_v). And this method creates quite a lot of memory considering that I call this method quite frequently in my application.

Are there any ways to improve the efficiency of this method? I am OK with doing in-place mutations.


Solution

  • If the inputs are presorted, you could merge the two inputs as iterators, coalesce on the key, and select only the last of the inputs with matched keys. This might do a little better, by replacing a per-element average-case O(1) operation, with moderate constant overhead and poor memory locality, with one that is strictly O(1) with low constant overhead and good memory locality.

    Rough example using the itertools crate to avoid reinventing wheels:

    use itertools::Itertools; // 0.13.0
    
    pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
        old_v.into_iter()
            .merge_by(new_v, |(k1, _), (k2, _)| k1 <= k2)
            .coalesce(|a, b|
                if a.0 == b.0 {
                    Ok(b)
                } else {
                    Err((a, b))
                }
            ).collect()
    }
    

    Rust Playground link

    Like I said, this assumes the inputs are sorted on the provided key, and will misbehave if they aren't. You could receive the Vecs as mutable, and .sort_by_key(|(k, _)| *k) for both of them (Rust Playground demo) before doing the rest of the work (and .sort_by_key explicitly states that if the inputs are already sorted, the work is linear, not standard O(n log n)). But if the inputs aren't sorted, you'll likely do more work than your HashMap solution by performing full O(n log n) sorting.

    Assuming sorted input can't be assumed to be the case, what you've got looks optimal. You take ownership of the inputs, using IntoIterator (implicitly when consuming the inputs, explicitly when converting back to the result), so you're performing pure moves. Your incremental memory overhead is just for additional memory based on what's stored inline in the tuples (a u32 and the small handful of pointers Vec's are implemented in terms of that get emptied as you construct the HashMap from the Vecs before the Vecs are completely emptied and the underlying storage released), you're not copying any of the inner Vecs.

    The best I can see for improving it for non-sorted inputs would be to avoid storing the inner Vecs at all, so you only store the keys for uniqueness checking, and perform fewer (admittedly cheap) moves by keeping the first item seen immediately, discarding the duplicates immediately on identification without moving. itertools helps here too, and the code is very simple, but it does hide a HashSet under-the-hood to perform the uniquification, so it does require some auxiliary memory:

    use itertools::Itertools; // 0.13.0
    use itertools::chain;
    
    pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
        // new_v comes first because first value seen is kept, and the rest are discarded
        chain!(new_v, old_v).unique_by(|(k, _)| *k).collect()
    }
    

    Rust Playground link

    With either solution, you might benefit from returning the iterator itself (using impl Iterator<(u32, Vec<u8>)> to avoid verbosity), rather than a new Vec. If your caller needs a Vec, they can collect it themselves. If they just want to iterate the results, you can avoid the memory cost of the new Vec, and the delay to begin processing, by processing and discarding as you go.

    Similarly, receiving your arguments as generics templated on the IntoIterator trait (so your function could be fed streaming data without collecting it into Vecs first) could reduce memory usage dramatically in many cases, and provide similar benefits in decreasing the time for the first results to appear (thinking in iterators instead of containers is often good for performance in memory-sensitive/demanding tasks).