With the goal of exploring the domain of data persistence, I am trying to write a (very) simple networked service:
In my example, the data is very important and should not be lost. Upon receiving the "ok" signal from the server, the client deletes its local copy of the data, trusting that the server holds a copy. This setting is described in an excellent article that led me to the discovery of fsync. If I understand correctly, fsync instructs the operating system to persist to disk all outstanding data on a given file descriptor. Once fsync returns, the data is safely stored.
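To make the setting concrete, here is a rough sketch of the server-side step I have in mind (written in Rust, which I use for the rest of this post; handle_upload, the file name, and the wire framing are placeholders rather than a real protocol, and error handling is minimal):

use std::fs::File;
use std::io::{self, Read, Write};
use std::net::TcpStream;

// Sketch only: receive a payload, persist it durably, then acknowledge.
fn handle_upload(mut client: TcpStream) -> io::Result<()> {
    // Receive the client's data (here: read until the client closes its side).
    let mut payload = Vec::new();
    client.read_to_end(&mut payload)?;

    // Write it to local storage and ask the OS to persist it (fsync).
    let mut file = File::create("payload.dat")?;
    file.write_all(&payload)?;
    file.sync_all()?; // only after this returns do we treat the data as durable

    // Only now tell the client it is safe to delete its local copy.
    client.write_all(b"ok")
}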
My favorite language is Rust, so that is what I turned to in order to implement this example. To the best of my understanding, fsync's equivalent in Rust is sync_all, so that is what I set out to use. The basic logic of my server would be: write the received data to a File, then call .sync_all on the File and wait for it to return before acknowledging the client. Before even setting out to write my network code, I started with a trivial benchmark:
use rand::RngCore;
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};
use std::thread::sleep;
use std::time::{Duration, Instant};
fn main() {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open("hello.dat")
        .expect("Failed to open file");

    sleep(Duration::from_secs(1)); // Wait for a moment after opening the file

    for i in 0u32..100 {
        let mut buffer = vec![0u8; 1024];
        rand::thread_rng().fill_bytes(&mut buffer);

        let start_time = Instant::now();
        file.seek(SeekFrom::Start(0))
            .expect("Failed to seek to the beginning of the file");
        file.write_all(&buffer).expect("Failed to write to file");
        file.sync_all().expect("Failed to sync to file");
        let elapsed_time = start_time.elapsed();

        println!("Time taken for write {}: {:?}", i, elapsed_time);
    }
}
The benchmark simply opens a File, writes 1 KB, and invokes sync_all, measuring how long the process takes and repeating the measurement 100 times.
I ran the code on my 2024 MacBook Pro M3 Max, which uses an Apple AP2048Z SSD. To my dismay, the benchmark measured approximately 4ms per write. Such a long time defeats my intuition: how can a write to a local, modern SSD take several times longer than, e.g., pinging google.com? Surprised, I checked online to understand what latency one could expect of an SSD. A quick search led me to this article, which reports that the average write latency for even a QLC SSD should hover around 70 microseconds. That is nearly 60 times less than what I am observing.
I am left wondering what is happening. My intuition suggests that writing a small amount of data to disk (I tried everything from a single byte to a few megabytes; the time to return from sync_all is always the same, around 4ms) should take an amount of time comparable to the SSD's write latency. In fact, in my mind, that is what write latency aims to measure!
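For reference, the size sweep was nothing more sophisticated than a loop like the following sketch (the sizes listed are illustrative, not the exact set I tried):

use rand::RngCore;
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};
use std::time::Instant;

fn main() {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open("hello.dat")
        .expect("Failed to open file");

    // Illustrative sizes, from a single byte up to a few megabytes.
    for &size in &[1usize, 1024, 64 * 1024, 1024 * 1024, 4 * 1024 * 1024] {
        let mut buffer = vec![0u8; size];
        rand::thread_rng().fill_bytes(&mut buffer);

        let start = Instant::now();
        file.seek(SeekFrom::Start(0)).expect("Failed to seek");
        file.write_all(&buffer).expect("Failed to write");
        file.sync_all().expect("Failed to sync");
        println!("{:>9} bytes: {:?}", size, start.elapsed());
    }
}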
To get a grasp of what might be going on, I installed and ran fio on my system:
% fio --name=latency_benchmark --fdatasync=1 --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=1G --runtime=1m --time_based
latency_benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.37
Starting 1 process
latency_benchmark: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=134MiB/s][w=34.2k IOPS][eta 00m:00s]
latency_benchmark: (groupid=0, jobs=1): err= 0: pid=96200: Sat Jun 15 14:02:44 2024
write: IOPS=27.5k, BW=108MiB/s (113MB/s)(6455MiB/60001msec); 0 zone resets
slat (nsec): min=0, max=705000, avg=689.65, stdev=791.16
clat (nsec): min=0, max=5391.0k, avg=13040.05, stdev=52330.03
lat (usec): min=4, max=5392, avg=13.73, stdev=52.34
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 6],
| 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 8], 60.00th=[ 9],
| 70.00th=[ 9], 80.00th=[ 9], 90.00th=[ 10], 95.00th=[ 12],
| 99.00th=[ 143], 99.50th=[ 155], 99.90th=[ 229], 99.95th=[ 1500],
| 99.99th=[ 2040]
bw ( KiB/s): min=23675, max=152578, per=100.00%, avg=110181.21, stdev=40310.33, samples=119
iops : min= 5918, max=38144, avg=27544.97, stdev=10077.64, samples=119
lat (nsec) : 2=0.01%
lat (usec) : 2=0.02%, 4=0.01%, 10=90.64%, 20=5.33%, 50=0.03%
lat (usec) : 100=0.31%, 250=3.56%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.06%, 4=0.01%, 10=0.01%
fsync/fdatasync/sync_file_range:
sync (nsec): min=0, max=7069.0k, avg=21486.12, stdev=16450.76
sync percentiles (nsec):
| 1.00th=[17024], 5.00th=[17024], 10.00th=[17024], 20.00th=[18048],
| 30.00th=[18048], 40.00th=[19072], 50.00th=[20096], 60.00th=[21888],
| 70.00th=[22912], 80.00th=[23936], 90.00th=[27008], 95.00th=[30080],
| 99.00th=[34048], 99.50th=[36096], 99.90th=[42752], 99.95th=[50944],
| 99.99th=[82432]
cpu : usr=6.37%, sys=12.88%, ctx=3305893, majf=0, minf=9
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1652467,0,0 short=1652467,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=6455MiB (6769MB), run=60001-60001msec
I have to admit I am not too competent at orienting myself in fio's output (for example, I am not entirely sure what the difference is between slat, clat and lat). But it seems to me that the average write latency when writing 4 KB should be around 14 us, and the time to sync around 21 us.
Could I have run my test incorrectly here? If so, please do point that out. If not, then I am not sure where the very large discrepancy I observe comes from.
I am starting to wonder if there might be something at play here, maybe related to Rust's standard library targeting Apple Silicon. I ran the same fio test on a cheap Linux server I am renting, getting strictly worse write latency (avg. 68 us) and similar fsync timing (avg. 22 us).
However, running the exact same Rust benchmark on that Linux server, compiled with the same version of the Rust compiler and the same optimization flags, the time fluctuates between 25 and 50 us. That is between 80 and 160 times less! I will try to run the same test on the same machine using C, as suggested by @user2407038.
It's not Rust's fault. I tried to translate the above Rust code to C. I hope I got it right:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>
int main() {
    int fd = open("hello.dat", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("Failed to open file");
        return 1;
    }

    sleep(1); // Wait for a moment after opening the file

    unsigned char buffer[1024];
    struct timeval start_time, end_time;

    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 1024; j++) {
            buffer[j] = rand() % 256;
        }

        gettimeofday(&start_time, NULL);
        if (lseek(fd, 0, SEEK_SET) == -1) {
            perror("Failed to seek to the beginning of the file");
            return 1;
        }
        if (write(fd, buffer, sizeof(buffer)) != sizeof(buffer)) {
            perror("Failed to write to file");
            return 1;
        }
        if (fcntl(fd, F_FULLFSYNC) == -1) {
            perror("Failed to sync to file");
            return 1;
        }
        gettimeofday(&end_time, NULL);

        long elapsed_time = ((end_time.tv_sec - start_time.tv_sec) * 1000000L) + end_time.tv_usec - start_time.tv_usec;
        printf("Time taken for write %d: %ld microseconds\n", i, elapsed_time);
    }

    close(fd);
    return 0;
}
Note in particular how I'm using fcntl(fd, F_FULLFSYNC) instead of fsync(fd), as fsync on OSX apparently does not guarantee that the data is actually flushed to disk. I get very similar behavior: around 4ms of measured latency. This is baffling, but also interesting!
Using fsync instead of F_FULLFSYNC
In the above snippet, if I switch fcntl(fd, F_FULLFSYNC) for fsync(fd), the measured time collapses to 30 us, more than 100 times less than with F_FULLFSYNC.
Why is F_FULLFSYNC slower than fsync?
This question is actually fairly simple to answer. As you found in your research, F_FULLFSYNC simply does more work than fsync. Here is a quote from the man page for fsync:
https://www.unix.com/man-page/osx/2/fsync/
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
If we dig into the Rust standard library, we see it uses F_FULLFSYNC, so that explains the results you saw in your C comparison:
pub fn fsync(&self) -> io::Result<()> {
    cvt_r(|| unsafe { os_fsync(self.as_raw_fd()) })?;
    return Ok(());

    #[cfg(any(
        target_os = "macos",
        target_os = "ios",
        target_os = "tvos",
        target_os = "watchos",
    ))]
    unsafe fn os_fsync(fd: c_int) -> c_int {
        libc::fcntl(fd, libc::F_FULLFSYNC)
    }
    #[cfg(not(any(
        target_os = "macos",
        target_os = "ios",
        target_os = "tvos",
        target_os = "watchos",
    )))]
    unsafe fn os_fsync(fd: c_int) -> c_int {
        libc::fsync(fd)
    }
}
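As a side note, if you want to reproduce the cheaper (but weaker) fsync path from Rust for comparison, you can bypass sync_all and issue the plain fsync yourself through the libc crate. A minimal sketch, assuming the libc crate is available as a dependency:

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Plain fsync(2) on the file's descriptor, i.e. what sync_all does on
// non-Apple targets. On macOS this skips the full drive-cache flush that
// F_FULLFSYNC performs, which is exactly why it returns so much faster.
fn fsync_weak(file: &File) -> io::Result<()> {
    // SAFETY: the descriptor stays valid for as long as `file` is alive.
    if unsafe { libc::fsync(file.as_raw_fd()) } == -1 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}

The same caveat from the man page applies: after this returns, the data may only have reached the drive's volatile cache.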
Why is F_FULLFSYNC so slow?
Many modern hard drives (particularly SSDs) will perform journaling of writes before committing them to the disk. This is done in hardware and helps to reduce power consumption and improve overall performance by grouping together writes to the same blocks of memory. My guess is that the overhead you are seeing may be related to the commit interval, due to the small write size.
(See edit below for more information.) Now, this is the point where I wanted to show a nice graph of real-world benchmark data that backs up my claim by letting us see the commit interval in relation to write sizes. I had planned to spin up a Mac EC2 instance to run it, but it turns out those are way more expensive than I had initially thought (>$1/hr, minimum 24h). Instead, here is what it looks like on some other machines. They aren't directly related to your use case, but I think the results are still interesting.
That being said, without the benchmark data or an authoritative source, I cannot be 100% certain this is the primary cause of the performance difference.
x64 Windows 11 Pro 23H2 with 1T NTFS SSD
x64 Ubuntu 24.04 LTS with 20GB ext4 NVMe SSD (Digital Ocean Droplet)
1 Technically Windows uses FlushFileBuffers, not fsync or F_FULLFSYNC.
2 This graph is a bit choppier because the benchmark ran for only ~1/8 the duration of the Windows machine.
After additional searching, I stumbled across Mac OS X Internals: A Systems Approach. This book goes into depth on the implementation of F_FULLFSYNC in HFS+ (the predecessor of APFS that is in use today). If you are interested, consider reading these sections of the book. Based on what I have read so far, I suspect the delay occurs in ioctl(disk, DKIOCSYNCHRONIZECACHE) (and by extension maybe the doSynchronizeCache() function of the disk driver). However, I have not yet been able to confirm this to be the case (I need to figure out how to work around Apple's sandboxing to run this operation directly).
Mac OS X Internals: A Systems Approach, by Amit Singh. Chapter 12, page 1540, Figure 12-23: Processing of the F_FULLFSYNC file control operation.
Additionally, I decided to run the benchmark on a Mac from work and got the following result. This machine should be similar to the one used for the original question.
MacOS Sonoma 14.5 with 1T APFS SSD
What is really surprising to me is just how consistent the timing is. It isn't about 4ms, it is 4ms. Or, to be more precise, about 3955±15μs³ (Raw Benchmark Data). In the graph you can see that, up until around 40 KB, the line corresponding to the sync time is almost perfectly horizontal. This has been extremely consistent across all of the runs. With such reliable timing, I suspect the disk driver might be sleeping for a fixed time period (or using some other clock-dependent approach to waiting for the flush to complete).
3 I later realized that my previous assertion about the consistency of the F_FULLFSYNC durations was flawed. The way I wrote the benchmark, the result for a given run is the median of multiple samples. This helps filter out outliers, but it also reduces the apparent variation. The correct interpretation should have been: "F_FULLFSYNC following writes smaller than ~40KB had a median duration of 3955±15μs, as measured across 125 samples per write size".
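For clarity, here is a rough sketch of what "median of multiple samples per write size" means here (illustrative only; this is not the exact harness that produced the graphs, and the sample count is a parameter):

use std::fs::File;
use std::io::{Seek, SeekFrom, Write};
use std::time::{Duration, Instant};

// Time `samples` iterations of write + sync_all (F_FULLFSYNC on macOS) for a
// given write size and return the median duration of the sync call alone.
fn median_sync_time(file: &mut File, size: usize, samples: usize) -> Duration {
    let buffer = vec![0u8; size];
    let mut times: Vec<Duration> = (0..samples)
        .map(|_| {
            file.seek(SeekFrom::Start(0)).expect("seek failed");
            file.write_all(&buffer).expect("write failed");
            let start = Instant::now();
            file.sync_all().expect("sync failed");
            start.elapsed()
        })
        .collect();
    times.sort();
    times[times.len() / 2]
}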
I went back and put together a new graph showing the distribution of individual samples. It consists of 10,000 samples per write size, sorted in ascending order (Raw Data). I cherry-picked a couple of write sizes to reduce the overall clutter, but there are more in the full data. Of the samples taken, the shortest F_FULLFSYNC took 3256.792μs; however, this was an extreme outlier, with only 1% of samples coming in under 3609μs.
I also wanted to say that this is probably as far as I can go looking into this problem. While I am still curious as to the exact cause, I have no experience profiling kernel drivers and I do not want to mess with the security settings of work laptops to try and figure it out.