I have two vectors of type Vec<(u32, Vec<u8>)>. I want to merge them so that the result has unique keys; when both contain the same key, the entry from the second vector should overwrite the one from the first.
Here is my attempt to solve this problem:
use std::collections::HashMap;

pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
    let mut new_map = HashMap::<u32, Vec<u8>>::from_iter(old_v);
    new_map.extend(new_v);
    new_map.into_iter().collect()
}
This works, but the problem is that these vectors carry quite big data, potentially 500 KB to 1 MB, across thousands of entries (especially old_v). This method allocates quite a lot of memory, and I call it quite frequently in my application.
Are there any ways to improve the efficiency of this method? I am OK with doing in-place mutations.
If the inputs are presorted, you could merge the two inputs as iterators, coalesce on the key, and keep only the last of any pair with matching keys. This might do a little better: it replaces a per-element operation that is average-case O(1) with moderate constant overhead and poor memory locality, with one that is strictly O(1) with low constant overhead and good memory locality.
Rough example using the itertools crate to avoid reinventing wheels:
use itertools::Itertools; // 0.13.0

pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
    old_v.into_iter()
        .merge_by(new_v, |(k1, _), (k2, _)| k1 <= k2)
        .coalesce(|a, b| {
            if a.0 == b.0 {
                Ok(b)
            } else {
                Err((a, b))
            }
        })
        .collect()
}
Like I said, this assumes the inputs are sorted on the provided key, and will misbehave if they aren't. You could receive the Vecs as mutable and .sort_by_key(|(k, _)| *k) on both of them (Rust Playground demo) before doing the rest of the work (and .sort_by_key explicitly states that if the inputs are already sorted, the work is linear, not the standard O(n log n)). But if the inputs aren't sorted, you'll likely do more work than your HashMap solution by performing full O(n log n) sorts.
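For illustration, here is a sketch of that sort-then-merge variant using only std (the name merge_unsorted and the hand-rolled merge via peekable iterators, rather than itertools, are my own choices; it assumes keys are unique within each input, and keeps new_v's entry on a collision):

```rust
// Sketch: sort both inputs in place, then do a single linear merge pass.
pub fn merge_unsorted(
    mut old_v: Vec<(u32, Vec<u8>)>,
    mut new_v: Vec<(u32, Vec<u8>)>,
) -> Vec<(u32, Vec<u8>)> {
    old_v.sort_by_key(|(k, _)| *k);
    new_v.sort_by_key(|(k, _)| *k);

    let mut out = Vec::with_capacity(old_v.len() + new_v.len());
    let mut old_it = old_v.into_iter().peekable();
    let mut new_it = new_v.into_iter().peekable();
    loop {
        match (old_it.peek(), new_it.peek()) {
            (Some((ko, _)), Some((kn, _))) => {
                if ko < kn {
                    out.push(old_it.next().unwrap());
                } else if kn < ko {
                    out.push(new_it.next().unwrap());
                } else {
                    // Equal keys: drop the old entry, keep the new one.
                    old_it.next();
                    out.push(new_it.next().unwrap());
                }
            }
            (Some(_), None) => out.push(old_it.next().unwrap()),
            (None, Some(_)) => out.push(new_it.next().unwrap()),
            (None, None) => break,
        }
    }
    out
}
```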
Assuming sorted input can't be assumed to be the case, what you've got looks close to optimal. You take ownership of the inputs and use IntoIterator (implicitly when consuming the inputs, explicitly when converting back to the result), so you're performing pure moves. Your incremental memory overhead is just what's stored inline in the tuples (a u32 plus the small handful of pointers a Vec is implemented in terms of), held briefly while you construct the HashMap from the Vecs before the Vecs are completely emptied and their underlying storage released; you're not copying any of the inner Vecs.
The best improvement I can see for non-sorted inputs would be to avoid storing the inner Vecs at all, so you only store the keys for uniqueness checking, and to perform fewer (admittedly cheap) moves by keeping the first item seen for each key and discarding duplicates immediately on identification, without moving them. itertools helps here too, and the code is very simple, but it does hide a HashSet under the hood to perform the uniquification, so it does require some auxiliary memory:
use itertools::Itertools; // 0.13.0
use itertools::chain;

pub fn merge(old_v: Vec<(u32, Vec<u8>)>, new_v: Vec<(u32, Vec<u8>)>) -> Vec<(u32, Vec<u8>)> {
    // new_v comes first because the first value seen for a key is kept,
    // and the rest are discarded
    chain!(new_v, old_v).unique_by(|(k, _)| *k).collect()
}
With either solution, you might benefit from returning the iterator itself (using impl Iterator<Item = (u32, Vec<u8>)> to avoid verbosity) rather than a new Vec. If your caller needs a Vec, they can collect it themselves. If they just want to iterate the results, you avoid the memory cost of the new Vec, and the delay before processing can begin, by processing and discarding as you go.
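As a sketch, your original HashMap approach with that signature change might look like this (merge_iter is a hypothetical name; I've used .collect() instead of from_iter, which behaves the same here):

```rust
use std::collections::HashMap;

// Sketch: same HashMap-based merge, but returning the iterator directly,
// so callers that only need to iterate skip building a new Vec.
pub fn merge_iter(
    old_v: Vec<(u32, Vec<u8>)>,
    new_v: Vec<(u32, Vec<u8>)>,
) -> impl Iterator<Item = (u32, Vec<u8>)> {
    let mut map: HashMap<u32, Vec<u8>> = old_v.into_iter().collect();
    map.extend(new_v); // entries from new_v overwrite matching keys
    map.into_iter()
}
```

Callers that do need a Vec just write merge_iter(a, b).collect::<Vec<_>>(); note the iteration order is now the HashMap's arbitrary order, same as your original.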
Similarly, receiving your arguments as generics bounded on the IntoIterator trait (so your function could be fed streaming data without collecting it into Vecs first) could reduce memory usage dramatically in many cases, and provides similar benefits by decreasing the time for the first results to appear (thinking in iterators instead of containers is often good for performance in memory-sensitive/demanding tasks).