linux-kernel kernel disk scsi block-device

When should I use REQ_OP_FLUSH in a kernel blockdev driver? (Do REQ_OP_FLUSH bio's flush dirty RAID controller caches?)


When should I use REQ_OP_FLUSH in my kernel blockdev driver, and what is the expected behavior of the hardware that receives the REQ_OP_FLUSH (or equivalent SCSI cmd)?

In the Linux kernel, when a struct bio flagged with REQ_OP_FLUSH is passed to a RAID controller volume in writeback mode, is the RAID controller supposed to flush its dirty caches?

It seems to me that this is the purpose of REQ_OP_FLUSH, but that is at odds with wanting writeback to be fast: if the cache is battery-backed, shouldn't the controller ignore the flush?

In ext4's super.c, ext4_sync_fs() skips the call to blkdev_issue_flush() when barriers are disabled via the barrier=0 mount option. This seems to imply that RAID controllers will flush their caches when they are told to... but does RAID firmware ever break the rules?


Solution

  • Christoph Hellwig on the linux-block mailing list said:

    Devices with power fail protection will advertise that (using VWC flag in NVMe for example) and [the Linux kernel] will never send flushes.

    Keith Busch at kernel.org:

    You can check the queue attribute, /sys/block/<disk>/queue/write_cache. If the value is "write through", then the device is reporting it doesn't have a volatile cache. If it is "write back", then it has a volatile cache.

    If this sounds backwards, consider a RAID controller cache as an example:

    1. A RAID controller with a non-volatile "writeback" cache (from the controller's perspective, ie, with battery) is a "write through"
      device as far as the kernel is concerned because the controller will return the write as complete as soon as it is in the persistent cache.

    2. A RAID controller with a volatile "writeback" cache (from the controller's perspective, ie without battery) is a "write back"
      device as far as the kernel is concerned because the controller will return the write as complete as soon as it is in the cache, but the cache is not persistent! So in that case flush/FUA is necessary.

    [ Reference: https://lore.kernel.org/all/273d3e7e-4145-cdaf-2f80-dc61823dd6ea@ewheeler.net/ ]
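    Tying this back to Hellwig's point about devices advertising power-fail protection: on the driver side, whether the kernel ever sends flushes follows from what the driver reports for its queue. A minimal sketch using the kernel-internal blk_queue_write_cache() API (circa v5.x; this is not compilable outside a kernel tree, and the function and parameter names here other than blk_queue_write_cache itself are hypothetical):

    ```c
    /* Sketch: advertising cache behavior from a block driver's setup path.
     * mydev_setup_cache and has_bbu are made-up names for illustration. */
    #include <linux/blkdev.h>

    static void mydev_setup_cache(struct request_queue *q, bool has_bbu)
    {
        if (has_bbu)
            /* Non-volatile cache: report "write through"; the block
             * layer will then not send REQ_OP_FLUSH or REQ_FUA. */
            blk_queue_write_cache(q, false, false);
        else
            /* Volatile cache: report "write back" (and FUA support)
             * so the kernel issues flushes when ordering matters. */
            blk_queue_write_cache(q, true, true);
    }
    ```

    The two arguments map directly onto the sysfs attribute Keith describes: the first controls whether queue/write_cache reads "write back" or "write through".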

    From personal experience, not all RAID controllers properly set queue/write_cache as Keith indicates above. If you know your array has a non-volatile cache running in write-back mode, then check that the kernel sees it as "write through" so flushes will be dropped:

    ]# cat /sys/block/<disk>/queue/write_cache
    <cache status>
    

    and fix it if it isn't in the proper mode. The settings below might seem backwards; if they do, re-read #1 and #2 above, because these are correct:

    If you have a non-volatile cache (ie, with BBU):

    ]# echo "write through" > /sys/block/<disk>/queue/write_cache
    

    If you have a volatile cache (ie, without BBU):

    ]# echo "write back" > /sys/block/<disk>/queue/write_cache
    

    So the answer to the question about when to flag REQ_OP_FLUSH in your kernel code is this: whenever you think your code should commit to disk. Since the block layer can re-order any bio request, you must:

    1. Send a WRITE IO, wait for its completion
    2. Send a flush, wait for flush completion

    and then you are guaranteed to have the IO from #1 on disk.
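    The same two-step contract is visible from user space, where fsync() is what ultimately makes the kernel issue the flush (via blkdev_issue_flush) to a write-back device. A minimal C sketch; the helper name write_then_flush is made up for illustration:

    ```c
    #include <fcntl.h>
    #include <unistd.h>

    /* Step 1: write and wait for completion; step 2: flush and wait.
     * Returns 0 on success, -1 on error. After fsync() returns 0, the
     * data has reached stable storage (or the device's persistent cache). */
    static int write_then_flush(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t n = write(fd, p, len);  /* may write fewer bytes */
            if (n < 0)
                return -1;
            p += n;
            len -= (size_t)n;
        }
        return fsync(fd);  /* triggers the flush to the device */
    }
    ```

    Note that on a device reporting "write through", the flush half of this is effectively free, which is exactly why mis-reporting the cache type matters.
    
    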

    However, if the device being written has queue/write_cache set to "write through", then the flush will complete immediately, and it is up to your controller to do its job and keep the non-volatile cache contents intact across a power loss (BBU, supercap, flash-backed cache, etc.).