rustbenchmarkingmicrobenchmark

Combine known-size slices into an array in rust


Trying to find the optimal way to combine slices into a database key.

I need a slice that contains three things (concatenated):

  1. A prefix byte (u8)
  2. A slice that is passed in of a compile-time known size (20 bytes)
  3. A slice that is passed in of a compile-time known size (32 bytes)

I've tried benchmarking the following 3 functions. I would have expected "slot_key_array" to be the fastest, since its not allocating a vector, but "slot_key_sized" is consistently the fastest. Why?


#[inline]
pub fn slot_key_vec(address: &[u8], slot: &[u8]) -> Vec<u8> {
  let mut key: Vec<u8> = vec![DB_PREFIX_SLOT];
  key.extend_from_slice(address);
  key.extend_from_slice(slot);
  key
}

#[inline]
pub fn slot_key_sized(address: &[u8], slot: &[u8]) -> Vec<u8> {
  let mut key: Vec<u8> = Vec::with_capacity(1 + 20 + 32);
  key.push(DB_PREFIX_SLOT);
  key.extend_from_slice(address);
  key.extend_from_slice(slot);
  key
}

#[inline]
pub fn slot_key_array(address: &[u8], slot: &[u8]) -> [u8; 53] {
  let key: [u8; 53] = [
    DB_PREFIX_SLOT,
    address[0], address[1], address[2], address[3], address[4], address[5], address[6], address[7],
    address[8], address[9], address[10], address[11], address[12], address[13], address[14], address[15],
    address[16], address[17], address[18], address[19],
    slot[0], slot[1], slot[2], slot[3], slot[4], slot[5], slot[6], slot[7],
    slot[8], slot[9], slot[10], slot[11], slot[12], slot[13], slot[14], slot[15],
    slot[16], slot[17], slot[18], slot[19], slot[20], slot[21], slot[22], slot[23],
    slot[24], slot[25], slot[26], slot[27], slot[28], slot[29], slot[30], slot[31],
  ];
  key
}

Solution

  • You are probably on Linux, where the default allocator is good (I benchmarked on Windows, by replacing the allocator to mimalloc I got the same results).

    This is because the array versions need to do bounds checking for each element, which prevents vectorizing and adds overhead. But they don't have to: if we make the arguments arrays instead of slices, the bounds checking will be gone.

    pub fn slot_key_array_arrays(address: &[u8; 20], slot: &[u8; 32]) -> [u8; 53] {
        // ...
    }
    

    If you only have slices, you can convert them to arrays with try_into():

    #[inline]
    pub fn slot_key_copy_array_try_into(address: &[u8], slot: &[u8]) -> [u8; 53] {
        slot_key_array_arrays(address.try_into().unwrap(), slot.try_into().unwrap())
    }
    

    Another alternative is to use copy_from_slice():

    #[inline]
    pub fn slot_key_copy_from_slice(address: &[u8], slot: &[u8]) -> [u8; 53] {
        let mut key: [u8; 53] = [0; 53];
        key[0] = DB_PREFIX_SLOT;
        key[1..][..20].copy_from_slice(address);
        key[21..].copy_from_slice(slot);
        key
    }
    

    Benchmark:

    copy_from_slice         time:   [6.8321 ns 6.8841 ns 6.9611 ns]
    Found 13 outliers among 100 measurements (13.00%)
      4 (4.00%) high mild
      9 (9.00%) high severe
    
    array_try_into          time:   [4.6906 ns 4.7001 ns 4.7119 ns]
    Found 11 outliers among 100 measurements (11.00%)
      3 (3.00%) high mild
      8 (8.00%) high severe
    
    array_arrays            time:   [4.8212 ns 4.8325 ns 4.8473 ns]
    Found 15 outliers among 100 measurements (15.00%)
      4 (4.00%) high mild
      11 (11.00%) high severe
    
    array                   time:   [12.963 ns 12.992 ns 13.029 ns]
    Found 17 outliers among 100 measurements (17.00%)
      5 (5.00%) high mild
      12 (12.00%) high severe
    
    sized                   time:   [10.116 ns 10.218 ns 10.327 ns]
    Found 4 outliers among 100 measurements (4.00%)
      2 (2.00%) high mild
      2 (2.00%) high severe
    
    vec                     time:   [25.602 ns 27.162 ns 29.211 ns]
    Found 7 outliers among 100 measurements (7.00%)
      6 (6.00%) high mild
      1 (1.00%) high severe
    

    So, your version with arrays instead of slice is the fastest together with try_into() (same performance), then copy_from_slice(), and then the others.

    Yet another alternative is to help LLVM eliminate the bound checks. It has problems here because you index in increasing order: we check for index 0, but that means nothing about index 1, so we check it too, then 2, then 3... but if we check the last index first, LLVM knows that all previous indices also exist:

    #[inline]
    pub fn slot_key_array(address: &[u8], slot: &[u8]) -> [u8; 53] {
        address[19];
        slot[31];
    
        // ...
    }
    

    This brings this code on par with the arrays version:

    array                   time:   [4.6107 ns 4.6164 ns 4.6238 ns]
                            change: [-65.325% -64.755% -64.411%] (p = 0.00 < 0.05)
                            Performance has improved.