rustself-reference

Rust self-referencing struct with a hashmap and vec indexes


I need to build an index, where first I check if a string already exists in the store vector, or if not, add it to the store and get its position in it.

Once done, I will need to consume the store, and discard the index. I would like to avoid creating each string twice (once for the index and once for the store).

struct StringIndex {
    store: Vec<String>,
    index: HashMap<&'this str, usize>,
}

impl StringIndex {
  pub fn get_or_insert(&mut self, key: &str) -> usize {
    match self.index.entry(key) {
      Occupied(v) => v.key(),
      Vacant(v) => {
        let idx = self.store.len();
        self.store.push(key.to_string());  // Create a new clone
        v.insert(idx);
        idx
      }
    }
  }
}

I looked at self_cell and ouroboros, but neither seem to be targetting this use case... or not?


Solution

  • I ended up creating a dup-indexer crate / github / docs. Works with any values (both allocated and copy-able), and mega-performant:

    use dup_indexer::DupIndexer;
    
    fn main() {
        let mut di = DupIndexer::new();
        assert_eq!(0, di.insert("hello".to_string()));
        assert_eq!(1, di.insert("world".to_string()));
        assert_eq!(0, di.insert("hello".to_string()));
        assert_eq!(di.into_vec(), vec!["hello".to_string(), "world".to_string()]);
    }
    

    DupIndexer creates a shallow non-droppable copy of the value, and stores it in the hashmap, whereas the original value goes into the vector:

    pub struct DupIndexer<T> {
        values: Vec<T>,
        lookup: HashMap<ManuallyDrop<T>, usize>,
    }
    
    // impl
    pub fn insert(&mut self, value: T) -> usize {
        let dup_value = ManuallyDrop::new(unsafe { ptr::read(&value) });
        match self.lookup.entry(dup_value) {
            Occupied(entry) => *entry.get(),
            Vacant(entry) => {
                let index = self.values.len();
                entry.insert(index);
                self.values.push(value);
                index
            }
        }
    }
    

    I believe the above code is safe because the hashmap only keeps the ptr:read-created copy of the original value while we own it, and the value is never modified.

    Note that Miri does complain about this code, but only in one case: if I use Box<T> as the value. I do not know if this is a valid concern or not. See the safety section with the error output