I need to build an index, where first I check if a string already exists in the store
vector, or if not, add it to the store
and get its position in it.
Once done, I will need to consume the store, and discard the index. I would like to avoid creating each string twice (once for the index and once for the store).
struct StringIndex {
store: Vec<String>,
index: HashMap<&'this str, usize>,
}
impl StringIndex {
pub fn get_or_insert(&mut self, key: &str) -> usize {
match self.index.entry(key) {
Occupied(v) => v.key(),
Vacant(v) => {
let idx = self.store.len();
self.store.push(key.to_string()); // Create a new clone
v.insert(idx);
idx
}
}
}
}
I looked at self_cell and ouroboros, but neither seem to be targetting this use case... or not?
I ended up creating a dup-indexer crate / github / docs. Works with any values (both allocated and copy-able), and mega-performant:
use dup_indexer::DupIndexer;
fn main() {
let mut di = DupIndexer::new();
assert_eq!(0, di.insert("hello".to_string()));
assert_eq!(1, di.insert("world".to_string()));
assert_eq!(0, di.insert("hello".to_string()));
assert_eq!(di.into_vec(), vec!["hello".to_string(), "world".to_string()]);
}
DupIndexer creates a shallow non-droppable copy of the value, and stores it in the hashmap, whereas the original value goes into the vector:
pub struct DupIndexer<T> {
values: Vec<T>,
lookup: HashMap<ManuallyDrop<T>, usize>,
}
// impl
pub fn insert(&mut self, value: T) -> usize {
let dup_value = ManuallyDrop::new(unsafe { ptr::read(&value) });
match self.lookup.entry(dup_value) {
Occupied(entry) => *entry.get(),
Vacant(entry) => {
let index = self.values.len();
entry.insert(index);
self.values.push(value);
index
}
}
}
I believe the above code is safe because the hashmap only keeps the ptr:read-created copy of the original value while we own it, and the value is never modified.
Note that Miri does complain about this code, but only in one case: if I use Box<T>
as the value. I do not know if this is a valid concern or not. See the safety section with the error output