rust memmap

How can I replace a struct field with other immutable references in Rust?


I'm trying to write some server code (non-production) that reads a very large raw file (upwards of 20GB) from the filesystem. I decided to use the mmap crate, which gives me an interface like a slice of u8. I store the mmap in a struct field and call immutable methods on that struct from multiple threads. This all works as intended.

However, some of the very large files I access are located on a fast SSD, while others are on an HDD. The SSD files can be accessed quickly, but the HDD files are slow because of random access. I would like to temporarily copy files from the HDD to the SSD, then read them from there until the file is no longer needed (copying and deleting are easy and not part of the question).

However, I don't want to pause while I copy the file before returning the requested data. I would therefore like to start a thread when the file is first requested and copy the data to the new location using that, while other threads still access the HDD mmap in response to requests. That is also easily implemented.

My problem is when I am finished copying. I now want to replace the HDD mmap in the struct with the new SSD mmap of the same data, and I can't think how to do this. I can obviously use a RwLock, but since the data is frequently accessed, having to lock for reading every time seems very inefficient just so I can replace the value once during the lifetime of the program.

Is there any neat way of doing this using safe Rust? Otherwise, could I wrap the mmap in the struct in an Arc, then replace this Arc once the SSD mmap is ready? I believe this would require unsafe Rust, but still should be safe since the Arc will only drop the mmap once the last Arc is dropped, and any new access will use the new mmap. Since the data in both is identical, this shouldn't be an issue in terms of the data itself suddenly changing (even if it points to a different location). Am I correct or is there some unsafety here I'm not thinking of? If so, what can I do instead?


Solution

  • I can obviously use a RwLock, but since the data is frequently accessed, having to lock for reading every time seems very inefficient just so I can replace the value once during the lifetime of the program.

    This is very much a premature optimisation. An RwLock is not particularly heavy-weight when there is no contention. If you look at the implementation in the std library you'll see that a read first performs a simple check on an atomic 32-bit variable and then a CMPXCHG. Only when there's contention (i.e. a writer) does it fall back to the slower read_contended path.

    What this means is that an RwLock is probably excellent for a use case where you'll have only one write during the entire lifetime of the application, since it'll be using the fast path all the time while still providing you with all the thread safety you need. Moreover, it's hard to imagine a synchronisation scheme for your use case that does not involve some atomic flag and a CMPXCHG, so it seems unlikely you'd be able to write something significantly faster.

    I would recommend implementing your solution using RwLock and then, if performance is an issue, profiling your application to see if this is the bottleneck. Since we're talking about memory or even disk I/O reads, my intuition is the atomic operation is not going to be the issue.
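One rough way to sanity-check this before reaching for a profiler is a micro-benchmark along the following lines. It is only a sketch: the function names (`timed_plain_reads`, `timed_locked_reads`, `compare`) are made up for illustration, a `Vec<u8>` stands in for the mmap, and it runs on a single thread, so it measures only the uncontended lock cost and not cache-line traffic between reader threads. Timings vary by machine and are only meaningful in a release build.

```rust
use std::hint::black_box;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Sums `data` `iters` times without any locking (the baseline).
fn timed_plain_reads(data: &[u8], iters: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut total = 0u64;
    for _ in 0..iters {
        // `black_box` discourages the optimiser from hoisting the sum out of the loop.
        total += black_box(data).iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    (total, start.elapsed())
}

/// Same work, but takes an uncontended read lock on every pass.
fn timed_locked_reads(data: &RwLock<Vec<u8>>, iters: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut total = 0u64;
    for _ in 0..iters {
        let guard = data.read().unwrap();
        total += black_box(guard.as_slice()).iter().map(|&b| u64::from(b)).sum::<u64>();
        // The read guard is released here, at the end of each iteration.
    }
    (total, start.elapsed())
}

/// Runs both variants over the same 256 bytes and prints the timings.
/// Returns both totals so a caller can check that they agree.
fn compare(iters: u32) -> (u64, u64) {
    let bytes: Vec<u8> = (0..=255u8).collect(); // stand-in for the mapped file
    let locked = RwLock::new(bytes.clone());
    let (plain_total, plain_time) = timed_plain_reads(&bytes, iters);
    let (locked_total, locked_time) = timed_locked_reads(&locked, iters);
    println!("plain: {plain_time:?}, locked: {locked_time:?}");
    (plain_total, locked_total)
}
```

Calling `compare(1_000_000)` from `main` prints both timings. Profiling the real application remains the more reliable measurement.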

    Something like this:

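    use std::sync::{Arc, RwLock};

    use memmap2::Mmap; // or the equivalent type from the mmap crate in use

    struct Cache {
        hdd: Arc<Mmap>,
        // `None` until the SSD copy has been mapped.
        ssd: RwLock<Option<Arc<Mmap>>>,
    }

    impl Cache {
        // Never blocks: falls back to the HDD mapping if the SSD mapping is
        // missing or a writer currently holds the lock.
        fn get(&self) -> Arc<Mmap> {
            match self.ssd.try_read().as_deref() {
                Ok(Some(ssd)) => Arc::clone(ssd),
                Ok(None) | Err(_) => Arc::clone(&self.hdd),
            }
        }

        // Call once the copy to the SSD has finished.
        fn cache_ssd(&self) {
            // Create the mapping before taking the lock, so the write lock is
            // held only for the assignment.
            let mmap = Arc::new(create_mmap());
            let mut lock = self.ssd.write().unwrap();
            *lock = Some(mmap);
        }
    }

    fn create_mmap() -> Mmap {
        todo!()
    }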
    

    Note that the hot path here is either try_read -> Ok(None) or try_read -> Ok(Some), and there's no lock contention on either. If a writer holds the lock while it is installing the SSD mapping, try_read fails immediately and get falls back to the HDD mapping.
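The fallback behaviour can be tried out with nothing but the standard library. The sketch below is an illustration under assumptions, not the exact production shape: a `Vec<u8>` (aliased as `Bytes`) stands in for `Mmap`, the names `Cache::new`, `publish_ssd` and `demo` are hypothetical, and the two copies hold different bytes purely so the demo can tell them apart (in the real use case their contents are identical).

```rust
use std::sync::{Arc, RwLock};

// Stand-in for `Mmap` so the sketch needs no external crate (assumption).
type Bytes = Vec<u8>;

struct Cache {
    hdd: Arc<Bytes>,
    ssd: RwLock<Option<Arc<Bytes>>>,
}

impl Cache {
    fn new(hdd: Bytes) -> Self {
        Cache { hdd: Arc::new(hdd), ssd: RwLock::new(None) }
    }

    /// Never blocks: returns the SSD copy if it has been published and the
    /// lock is free, otherwise the HDD copy.
    fn get(&self) -> Arc<Bytes> {
        match self.ssd.try_read().as_deref() {
            Ok(Some(ssd)) => Arc::clone(ssd),
            Ok(None) | Err(_) => Arc::clone(&self.hdd),
        }
    }

    /// Publishes the SSD copy; the write lock is held only for the assignment.
    fn publish_ssd(&self, ssd: Bytes) {
        let ssd = Arc::new(ssd);
        *self.ssd.write().unwrap() = Some(ssd);
    }
}

/// Returns the first byte seen before, during and after publishing the SSD copy.
fn demo() -> (u8, u8, u8) {
    let cache = Cache::new(vec![1; 4]); // "HDD" bytes are all 1
    let before = cache.get()[0];

    let during = {
        // Simulate a writer holding the lock: `try_read` fails without
        // blocking, so `get` falls back to the HDD copy.
        let _writer = cache.ssd.write().unwrap();
        let first = cache.get()[0];
        first
    };

    cache.publish_ssd(vec![2; 4]); // "SSD" bytes are all 2
    let after = cache.get()[0];
    (before, during, after)
}
```

Readers that already hold an `Arc` to the HDD copy keep it alive until they drop it, so the old mapping is released only after the last in-flight request finishes.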

    Aside: the memmap crate is unmaintained; I recommend using memmap2, which has largely the same interface.