I am creating a zfs system where each instance of a certain entity in my system has its own dataset in zfs. This is needed because each entity consists of a lot of small files that are really slow to copy or delete. So I decided to try out relying on zfs datasets to either destroy or snapshot/copy an entity in its entirety regardless of its contents.
But now during my benchmarks, with more than 5000 datasets and counting, creating a new dataset using 'zfs create' sometimes takes up to 9 minutes. Although 9 minutes is really slow, it is still acceptable, but I am afraid that it will only get worse as I increase the number of datasets. And 5000 isn't that many yet, in my opinion.
System information:
Does anyone have experience working with large numbers of datasets in zfs and can tell me more about the performance in such a situation? Or am I using zfs in a way it isn't intended to be used?
The way ZFS works internally is using a concept called a txg (transaction group). This concept helps ZFS know what order operations happened in, so there is just a single integer txg that is available at any given time (no parallelism by design). In normal circumstances, a new txg is created every few seconds, to create a reasonably recent recovery point if the system crashes. When this happens it requires some work to be done, mostly flushing any outstanding writes to disk. However, a new txg must be created any time you mutate the ZFS metadata by creating a new dataset, taking a snapshot, etc. which means that those operations are a bit heavier than you might expect.
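If you want to see how heavy a single metadata operation is on your pool, here is a minimal Python sketch for timing one command at a time. The helper name time_command and the dataset name in the comment are made up for illustration, and the run parameter only exists so the function can be tried without ZFS installed.

```python
import subprocess
import time


def time_command(argv, run=subprocess.run):
    """Run one command and return how long it took, in seconds.

    `run` defaults to subprocess.run; it is injectable so the helper can be
    exercised with a stand-in on machines that have no ZFS tools.
    """
    start = time.perf_counter()
    run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


# Example call (hypothetical dataset name, needs ZFS and enough privileges):
# elapsed = time_command(["zfs", "create", "tank/bench/entity-0001"])
# print(f"zfs create took {elapsed:.2f}s")
```

Timing the same command on an idle pool and again while your application is busy should show whether the cost comes from the operation itself or from waiting behind other operations.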
In your case, my guess of what's happening is that your application is doing a ton of these filesystem operations (creations, deletions, snapshots, etc.), and the queue to process your requests is just getting longer and longer because the system isn't able to keep up.
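To make the "queue getting longer and longer" idea concrete, here is a toy model in Python. It assumes the operations are handled strictly one at a time, each taking a fixed service time, and that requests arrive at a fixed interval. These numbers are illustrative assumptions, not measurements of ZFS.

```python
def simulate_backlog(arrival_interval, service_time, n_requests):
    """Return the submit-to-finish latency of each request.

    Toy single-server FIFO model: request i arrives at i * arrival_interval
    and needs service_time seconds of exclusive processing.
    """
    latencies = []
    server_free_at = 0.0
    for i in range(n_requests):
        arrival = i * arrival_interval
        # A request cannot start before it arrives or before the server is free.
        start = max(arrival, server_free_at)
        server_free_at = start + service_time
        latencies.append(server_free_at - arrival)
    return latencies


# Requests arrive every 1s but each needs 2s: latency grows without bound.
print(simulate_backlog(1.0, 2.0, 5))
# Requests arrive every 2s and each needs 1s: latency stays flat.
print(simulate_backlog(2.0, 1.0, 5))
```

As soon as requests arrive faster than they can be processed, every additional request waits longer than the one before it, which would match creation times that keep climbing during a long benchmark.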
There are three possible solutions:
txg using a ZFS channel program, which is basically a Lua script that can be invoked in the middle of a txg creation and can run arbitrary ZFS filesystem operations
I have one last thought to leave you with: deleting a filesystem in ZFS looks immediate, but internally the filesystem is only hidden right away, while its data is freed asynchronously by a background thread, which can take a while. You can see the amount of space waiting to be freed by running the command zpool get freeing <pool>. So this whole design of using ZFS to delete stuff faster might not actually be doing that much for you. If you want to get this behavior without the txg overhead, you could just create a queue inside your application with a background thread that will delete directories that are no longer in use.
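A minimal sketch of such a queue in Python, assuming the entities are plain directories on one filesystem. The class name BackgroundDeleter and the ".deleting-" suffix are invented for this example. The directory is renamed first so that it disappears from its original path right away, and the slow recursive delete happens on a worker thread.

```python
import os
import queue
import shutil
import threading
import uuid


class BackgroundDeleter:
    """Hide directories immediately and delete their contents in the background."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, path):
        """Rename `path` out of the way now; remove it later on the worker."""
        path = os.path.abspath(path)
        # Renaming within the same parent directory stays on one filesystem,
        # so it is a cheap metadata operation rather than a copy.
        hidden = f"{path}.deleting-{uuid.uuid4().hex}"
        os.rename(path, hidden)
        self._queue.put(hidden)

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel queued by close()
                    return
                shutil.rmtree(item, ignore_errors=True)
            finally:
                self._queue.task_done()

    def close(self):
        """Finish all pending deletions, then stop the worker thread."""
        self._queue.put(None)
        self._thread.join()
```

Calling schedule() returns as soon as the rename is done, and close() blocks until everything queued so far has been removed. If the process crashes, leftover ".deleting-" directories stay on disk, so a real version would want to sweep those up on startup.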