With reference to: Is rename() atomic?
I'm asking something similar, but not quite the same, because what I want to know is is it safe to rely on the atomicty of rename()
when using NFS?
Here's a scenario I'm dealing with - I have an 'index' file that must always be present.
So:
Separate client:
This is making the assumption that rename()
being atomic means - there will always be an 'index' file (although, it might be an out of date version, because caching and timing)
However the problem I'm hitting is this - that this is happening on NFS - and working - but several of my NFS clients are occasionally reporting "ENOENT" - no such file or directory. (e.g. in hundreds operations happening at 5m intervals, we get this error every couple of days).
So what I'm hoping is whether someone can enlighten me - should it actually be impossible to get 'ENOENT' in this scenario?
The reason I'm asking is this entry in RFC 3530:
The RENAME operation must be atomic to the client.
I'm wondering if that means just the client issuing the rename, and not the client viewing the directory? (I'm ok with a cached/out of date directory structure, but the point of this operation is that this file will always be 'present' in some form)
Sequence of operations (from the client performing the write operation) is:
21401 14:58:11 open("fleeg.ext", O_RDWR|O_CREAT|O_EXCL, 0666) = -1 EEXIST (File exists) <0.000443>
21401 14:58:11 open("fleeg.ext", O_RDWR) = 3 <0.000547>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000012>
21401 14:58:11 fadvise64(3, 0, 572, POSIX_FADV_RANDOM) = 0 <0.000008>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=1, len=1}) = 0 <0.001994>
21401 14:58:11 open("fleeg.ext.i", O_RDWR|O_CREAT, 0666) = 4 <0.000538>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 close(4) = 0 <0.000011>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000007>
21401 14:58:11 open("fleeg.ext.i", O_RDONLY) = 4 <0.000577>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 read(4, "\3PAX\1\0\0O}\270\370\206\20\225\24\22\t\2\0\203RD\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000552>
21401 14:58:11 close(4) = 0 <0.000013>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=466, len=68}) = 0 <0.001418>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21\"\30\361\241\223\271\256\317\302\363\262F\276]\260\241-x\227b\377\205\356\252\236\211\37\17.\216\364"..., 68, 466) = 68 <0.000010>
21401 14:58:11 pread(3, "\21\302d\344\327O\207C]M\10xxM\377\2340\0319\206k\201N\372\332\265R\242\313S\24H"..., 62, 300) = 62 <0.000011>
21401 14:58:11 pread(3, "\21\362cv'\37\204]\377q\362N\302/\212\255\255\370\200\236\350\2237>7i`\346\271Cy\370"..., 104, 362) = 104 <0.000010>
21401 14:58:11 pwrite(3, "\21\302\3174\252\273.\17\v\247\313\324\267C\222P\303\n~\341F\24oh/\300a\315\n\321\31\256"..., 127, 572) = 127 <0.000012>
21401 14:58:11 pwrite(3, "\21\212Q\325\371\223\235\256\245\247\\WT$\4\227\375[\\\3263\222\0305\0\34\2049A;2U"..., 68, 699) = 68 <0.000009>
21401 14:58:11 pwrite(3, "\21\262\20Kc(!.\350\367i\253hkl~\254\335H\250.d\0036\r\342\v\242\7\255\214\31"..., 38, 767) = 38 <0.000009>
21401 14:58:11 fsync(3) = 0 <0.001007>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=805, ...}) = 0 <0.000009>
21401 14:58:11 open("fleeg.ext.i.tmp", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4 <0.001813>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 0, POSIX_FADV_RANDOM) = 0 <0.000007>
21401 14:58:11 write(4, "\3PAX\1\0\0qT2\225\226\20\225\24\22\t\2\0\205;D\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000012>
21401 14:58:11 stat("fleeg.ext.i", {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000011>
21401 14:58:11 fchmod(4, 0100600) = 0 <0.002517>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 close(4) = 0 <0.000011>
21401 14:58:11 rename("fleeg.ext.i.tmp", "fleeg.pax.i") = 0 <0.001201>
21401 14:58:11 close(3) = 0 <0.000795>
21401 14:58:11 munmap(0x7f1475cce000, 4198400) = 0 <0.000177>
21401 14:58:11 munmap(0x7f14760cf000, 4198400) = 0 <0.000173>
21401 14:58:11 futex(0x7f147cbcb908, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000010>
21401 14:58:11 exit_group(0) = ?
21401 14:58:11 +++ exited with 0 +++
NB - Paths and files renamed in the above for consistency. fleeg.ext
is the data file, and fleeg.ext.i
is the index. During this process - the fleeg.ext.i
file is being overwritten (by the .tmp
file), which is why the belief is that there should always be a file at that path (either the old one, or the new that's just overwritten it).
On the reading client the PCAP looks like LOOKUP
NFS call is what's failing:
124 1.375777 10.10.41.35 -> 10.10.41.9 NFS 226 LOOKUP fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
125 1.375951 10.10.41.9 -> 10.10.41.35 NFS 186 5347 LOOKUP 0775 Directory V3 LOOKUP Reply (Call In 124) Error: NFS3ERR_NOENT
126 1.375975 10.10.41.35 -> 10.10.41.9 NFS 226 LOOKUP fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
127 1.376142 10.10.41.9 -> 10.10.41.35 NFS 186 5347 LOOKUP 0775 Directory V3 LOOKUP Reply (Call In 126) Error: NFS3ERR_NOENT
I think I now have the answer as to what is going on. I'm adding it here, because whilst the others have been very helpful in getting there, the actual root of the issue is this:
Reading host:
79542 10.643148 10.0.0.52 -> 10.0.0.24 NFS 222 ACCESS allowed testfile V3 ACCESS Call, FH: 0x76a9a83d, [Check: RD MD XT XE]
79543 10.643286 10.0.0.24 -> 10.0.0.52 NFS 194 0 ACCESS allowed 0600 Regular File testfile NFS3_OK V3 ACCESS Reply (Call In 79542), [Allowed: RD MD XT XE]
79544 10.643335 10.0.0.52 -> 10.0.0.24 NFS 222 ACCESS allowed V3 ACCESS Call, FH: 0xe0e7db45, [Check: RD LU MD XT DL]
79545 10.643456 10.0.0.24 -> 10.0.0.52 NFS 194 0 ACCESS allowed 0755 Directory NFS3_OK V3 ACCESS Reply (Call In 79544), [Allowed: RD LU MD XT DL]
79546 10.643487 10.0.0.52 -> 10.0.0.24 NFS 230 LOOKUP testfile V3 LOOKUP Call, DH: 0xe0e7db45/testfile
79547 10.643632 10.0.0.24 -> 10.0.0.52 NFS 190 0 LOOKUP 0755 Directory NFS3ERR_NOENT V3 LOOKUP Reply (Call In 79546) Error: NFS3ERR_NOENT
79548 10.643662 10.0.0.52 -> 10.0.0.24 NFS 230 LOOKUP testfile V3 LOOKUP Call, DH: 0xe0e7db45/testfile
79549 10.643814 10.0.0.24 -> 10.0.0.52 NFS 190 0 LOOKUP 0755 Directory NFS3ERR_NOENT V3 LOOKUP Reply (Call In 79548) Error: NFS3ERR_NOENT
Writing host:
203306 13.805489 10.0.0.6 -> 10.0.0.24 NFS 246 LOOKUP .nfs00000000d59701e500001030 V3 LOOKUP Call, DH: 0xe0e7db45/.nfs00000000d59701e500001030
203307 13.805687 10.0.0.24 -> 10.0.0.6 NFS 186 0 LOOKUP 0755 Directory NFS3ERR_NOENT V3 LOOKUP Reply (Call In 203306) Error: NFS3ERR_NOENT
203308 13.805711 10.0.0.6 -> 10.0.0.24 NFS 306 RENAME testfile,.nfs00000000d59701e500001030 V3 RENAME Call, From DH: 0xe0e7db45/testfile To DH: 0xe0e7db45/.nfs00000000d59701e500001030
203309 13.805982 10.0.0.24 -> 10.0.0.6 NFS 330 0,0 RENAME 0755,0755 Directory,Directory NFS3_OK V3 RENAME Reply (Call In 203308)
203310 13.806008 10.0.0.6 -> 10.0.0.24 NFS 294 RENAME testfile_temp,testfile V3 RENAME Call, From DH: 0xe0e7db45/testfile_temp To DH: 0xe0e7db45/testfile
203311 13.806254 10.0.0.24 -> 10.0.0.6 NFS 330 0,0 RENAME 0755,0755 Directory,Directory NFS3_OK V3 RENAME Reply (Call In 203310)
203312 13.806297 10.0.0.6 -> 10.0.0.24 NFS 246 CREATE testfile_temp V3 CREATE Call, DH: 0xe0e7db45/testfile_temp Mode: EXCLUSIVE
203313 13.806538 10.0.0.24 -> 10.0.0.6 NFS 354 0,0 CREATE 0755,0755 Regular File,Directory testfile_temp NFS3_OK V3 CREATE Reply (Call In 203312)
203314 13.806560 10.0.0.6 -> 10.0.0.24 NFS 246 SETATTR 0600 testfile_temp V3 SETATTR Call, FH: 0x4b69a46a
203315 13.806767 10.0.0.24 -> 10.0.0.6 NFS 214 0 SETATTR 0600 Regular File testfile_temp NFS3_OK V3 SETATTR Reply (Call In 203314)
This is only reproducible if you open the same file for reading - so in addition to a trivial C write-rename loop:
#!/usr/bin/env perl
use strict;
use warnings;
while ( 1 ) {
open ( my $input, '<', 'testfile' ) or warn $!;
print ".";
sleep 1;
}
This causes my test case to fail quickly (minutes) rather than not at all, seemingly. It's down to the '.nfsXXX' file that is created when a file handle is open and then deleted (or overwritten by a RENAME
).
Because NFS is stateless, it has to have some persistent for the client, so it can still read/write that file in the same way as it would if it had done an open/unlink on a local filesystem. And to do that - we get a double RENAME
and a very brief (sub millisecond) interval whereby the file we're targeting isn't present for a LOOKUP
NFS RPC to find.