rename() atomicity and NFS?

With reference to: Is rename() atomic?

I'm asking something similar, but not quite the same: what I want to know is whether it is safe to rely on the atomicity of rename() when using NFS.

Here's a scenario I'm dealing with - I have an 'index' file that must always be present.

So:

  • Client creates a new file
  • Client renames new file over 'old' index file.

Separate client:

  • Reads index file
  • Refers to on-disk structures based on the index.

This makes the assumption that rename() being atomic means there will always be an 'index' file (although it might be an out-of-date version, because of caching and timing).
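That assumption can be sketched as a minimal local-filesystem example (the function names are mine, not from the question - and the atomicity of the rename step is exactly what is in doubt over NFS):

```python
import os
import tempfile

def publish_index(index_path: str, data: bytes) -> None:
    """Write a new index to a temp file on the same filesystem, then
    rename() it over the old one. On a local POSIX filesystem the
    rename is atomic, so a reader should always see either the old
    or the new index - never a missing one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(index_path) or ".")
    try:
        os.write(fd, data)
        os.fsync(fd)  # make the data durable before it becomes visible
    finally:
        os.close(fd)
    os.rename(tmp, index_path)  # atomic replace - locally, at least

def read_index(index_path: str) -> bytes:
    """The separate client's side: open and read whatever is current."""
    with open(index_path, "rb") as f:
        return f.read()
```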

However, the problem I'm hitting is that this is happening on NFS - and working - but several of my NFS clients occasionally report "ENOENT" - no such file or directory. (E.g. with hundreds of operations happening at 5-minute intervals, we get this error every couple of days.)

So what I'm hoping is that someone can enlighten me: should it actually be impossible to get ENOENT in this scenario?

The reason I'm asking is this entry in RFC 3530:

The RENAME operation must be atomic to the client.

I'm wondering if that means just the client issuing the rename, and not a client viewing the directory? (I'm OK with a cached/out-of-date directory structure, but the point of this operation is that the file should always be 'present' in some form.)

Sequence of operations (from the client performing the write operation) is:

21401 14:58:11 open("fleeg.ext", O_RDWR|O_CREAT|O_EXCL, 0666) = -1 EEXIST (File exists) <0.000443>
21401 14:58:11 open("fleeg.ext", O_RDWR) = 3 <0.000547>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000012>
21401 14:58:11 fadvise64(3, 0, 572, POSIX_FADV_RANDOM) = 0 <0.000008>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=1, len=1}) = 0 <0.001994>
21401 14:58:11 open("fleeg.ext.i", O_RDWR|O_CREAT, 0666) = 4 <0.000538>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 close(4)                 = 0 <0.000011>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000007>
21401 14:58:11 open("fleeg.ext.i", O_RDONLY) = 4 <0.000577>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 read(4, "\3PAX\1\0\0O}\270\370\206\20\225\24\22\t\2\0\203RD\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000552>
21401 14:58:11 close(4)                 = 0 <0.000013>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=466, len=68}) = 0 <0.001418>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21\"\30\361\241\223\271\256\317\302\363\262F\276]\260\241-x\227b\377\205\356\252\236\211\37\17.\216\364"..., 68, 466) = 68 <0.000010>
21401 14:58:11 pread(3, "\21\302d\344\327O\207C]M\10xxM\377\2340\0319\206k\201N\372\332\265R\242\313S\24H"..., 62, 300) = 62 <0.000011>
21401 14:58:11 pread(3, "\21\362cv'\37\204]\377q\362N\302/\212\255\255\370\200\236\350\2237>7i`\346\271Cy\370"..., 104, 362) = 104 <0.000010>
21401 14:58:11 pwrite(3, "\21\302\3174\252\273.\17\v\247\313\324\267C\222P\303\n~\341F\24oh/\300a\315\n\321\31\256"..., 127, 572) = 127 <0.000012>
21401 14:58:11 pwrite(3, "\21\212Q\325\371\223\235\256\245\247\\WT$\4\227\375[\\\3263\222\0305\0\34\2049A;2U"..., 68, 699) = 68 <0.000009>
21401 14:58:11 pwrite(3, "\21\262\20Kc(!.\350\367i\253hkl~\254\335H\250.d\0036\r\342\v\242\7\255\214\31"..., 38, 767) = 38 <0.000009>
21401 14:58:11 fsync(3)                 = 0 <0.001007>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=805, ...}) = 0 <0.000009>
21401 14:58:11 open("fleeg.ext.i.tmp", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4 <0.001813>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 0, POSIX_FADV_RANDOM) = 0 <0.000007>
21401 14:58:11 write(4, "\3PAX\1\0\0qT2\225\226\20\225\24\22\t\2\0\205;D\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000012>
21401 14:58:11 stat("fleeg.ext.i", {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000011>
21401 14:58:11 fchmod(4, 0100600)       = 0 <0.002517>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 close(4)                 = 0 <0.000011>
21401 14:58:11 rename("fleeg.ext.i.tmp", "fleeg.ext.i") = 0 <0.001201>
21401 14:58:11 close(3)                 = 0 <0.000795>
21401 14:58:11 munmap(0x7f1475cce000, 4198400) = 0 <0.000177>
21401 14:58:11 munmap(0x7f14760cf000, 4198400) = 0 <0.000173>
21401 14:58:11 futex(0x7f147cbcb908, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000010>
21401 14:58:11 exit_group(0)            = ?
21401 14:58:11 +++ exited with 0 +++

NB - Paths and files were renamed in the above for consistency. fleeg.ext is the data file, and fleeg.ext.i is the index. During this process the fleeg.ext.i file is being overwritten (by the .tmp file), which is why the belief is that there should always be a file at that path (either the old one, or the new one that has just replaced it).

On the reading client, the PCAP shows that it is the LOOKUP NFS call that is failing:

124   1.375777  10.10.41.35 -> 10.10.41.9   NFS 226   LOOKUP    fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
125   1.375951   10.10.41.9 -> 10.10.41.35  NFS 186 5347  LOOKUP  0775 Directory  V3 LOOKUP Reply (Call In 124) Error: NFS3ERR_NOENT
126   1.375975  10.10.41.35 -> 10.10.41.9   NFS 226   LOOKUP    fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
127   1.376142   10.10.41.9 -> 10.10.41.35  NFS 186 5347  LOOKUP  0775 Directory  V3 LOOKUP Reply (Call In 126) Error: NFS3ERR_NOENT
Monodic answered 28/12, 2016 at 12:18 Comment(5)
Do you close the file before renaming? It is important that it is closed first.Queenqueena
Hmm, looking at an strace it appears probably not - the 'new' file is created by the open call, and then renamed whilst open. I'll run another trace to confirm.Monodic
@MaximEgorushkin - I have rerun the trace and added it to the question - the file is closed prior to rename.Monodic
What syscall returns ENOENT, is it open or read?Queenqueena
The error message I have indicates it's open - although I'm less certain, as I don't have an strace of a failure (it doesn't happen frequently enough for me to have caught one), just application-layer logging.Monodic

I think I now have the answer as to what is going on. I'm adding it here, because whilst the others have been very helpful in getting there, the actual root of the issue is this:

Reading host:

79542  10.643148 10.0.0.52 -> 10.0.0.24 NFS 222  ACCESS allowed   testfile  V3 ACCESS Call, FH: 0x76a9a83d, [Check: RD MD XT XE]
79543  10.643286 10.0.0.24 -> 10.0.0.52 NFS 194 0 ACCESS allowed 0600 Regular File testfile NFS3_OK V3 ACCESS Reply (Call In 79542), [Allowed: RD MD XT XE]
79544  10.643335 10.0.0.52 -> 10.0.0.24 NFS 222  ACCESS allowed     V3 ACCESS Call, FH: 0xe0e7db45, [Check: RD LU MD XT DL]
79545  10.643456 10.0.0.24 -> 10.0.0.52 NFS 194 0 ACCESS allowed 0755 Directory  NFS3_OK V3 ACCESS Reply (Call In 79544), [Allowed: RD LU MD XT DL]
79546  10.643487 10.0.0.52 -> 10.0.0.24 NFS 230  LOOKUP    testfile  V3 LOOKUP Call, DH: 0xe0e7db45/testfile
79547  10.643632 10.0.0.24 -> 10.0.0.52 NFS 190 0 LOOKUP  0755 Directory  NFS3ERR_NOENT V3 LOOKUP Reply (Call In 79546) Error: NFS3ERR_NOENT
79548  10.643662 10.0.0.52 -> 10.0.0.24 NFS 230  LOOKUP    testfile  V3 LOOKUP Call, DH: 0xe0e7db45/testfile
79549  10.643814 10.0.0.24 -> 10.0.0.52 NFS 190 0 LOOKUP  0755 Directory  NFS3ERR_NOENT V3 LOOKUP Reply (Call In 79548) Error: NFS3ERR_NOENT

Writing host:

203306  13.805489  10.0.0.6 -> 10.0.0.24 NFS 246  LOOKUP    .nfs00000000d59701e500001030  V3 LOOKUP Call, DH: 0xe0e7db45/.nfs00000000d59701e500001030
203307  13.805687 10.0.0.24 -> 10.0.0.6  NFS 186 0 LOOKUP  0755 Directory  NFS3ERR_NOENT V3 LOOKUP Reply (Call In 203306) Error: NFS3ERR_NOENT
203308  13.805711  10.0.0.6 -> 10.0.0.24 NFS 306  RENAME    testfile,.nfs00000000d59701e500001030  V3 RENAME Call, From DH: 0xe0e7db45/testfile To DH: 0xe0e7db45/.nfs00000000d59701e500001030
203309  13.805982 10.0.0.24 -> 10.0.0.6  NFS 330 0,0 RENAME  0755,0755 Directory,Directory  NFS3_OK V3 RENAME Reply (Call In 203308)
203310  13.806008  10.0.0.6 -> 10.0.0.24 NFS 294  RENAME    testfile_temp,testfile  V3 RENAME Call, From DH: 0xe0e7db45/testfile_temp To DH: 0xe0e7db45/testfile
203311  13.806254 10.0.0.24 -> 10.0.0.6  NFS 330 0,0 RENAME  0755,0755 Directory,Directory  NFS3_OK V3 RENAME Reply (Call In 203310)
203312  13.806297  10.0.0.6 -> 10.0.0.24 NFS 246  CREATE    testfile_temp  V3 CREATE Call, DH: 0xe0e7db45/testfile_temp Mode: EXCLUSIVE
203313  13.806538 10.0.0.24 -> 10.0.0.6  NFS 354 0,0 CREATE  0755,0755 Regular File,Directory testfile_temp NFS3_OK V3 CREATE Reply (Call In 203312)
203314  13.806560  10.0.0.6 -> 10.0.0.24 NFS 246  SETATTR  0600  testfile_temp  V3 SETATTR Call, FH: 0x4b69a46a
203315  13.806767 10.0.0.24 -> 10.0.0.6  NFS 214 0 SETATTR  0600 Regular File testfile_temp NFS3_OK V3 SETATTR Reply (Call In 203314)

This is only reproducible if you open the same file for reading - so in addition to a trivial C write-rename loop, I ran this Perl read loop:

#!/usr/bin/env perl

use strict;
use warnings;

while ( 1 ) {
  open ( my $input, '<', 'testfile' ) or warn $!;
  print ".";
  sleep 1;
}

This causes my test case to fail quickly (within minutes) rather than, seemingly, not at all. It's down to the '.nfsXXXX' file that is created when a file handle is open on a file that is then deleted (or overwritten by a RENAME).
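The C write-rename loop itself isn't shown above; a rough Python stand-in (my reconstruction, not the original code) holds a read handle open across each replace-by-rename, which is the condition that triggers the silly-rename:

```python
import os
import tempfile

def write_rename_loop(path: str, iterations: int) -> None:
    """Hold a read handle on `path` open across each replace-by-rename.
    On NFS, replacing a file that a local handle still references forces
    the client to silly-rename the old file to .nfsXXXX first, opening
    the window the reading host observes."""
    for _ in range(iterations):
        with open(path, "rb"):  # handle held open during the rename
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
            os.write(fd, b"abc\n")
            os.close(fd)
            os.rename(tmp, path)  # replaces `path` while the handle is open
```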

Because NFS is stateless, it has to keep something persistent for the client, so it can still read/write that file the same way as if it had done an open/unlink on a local filesystem. And to achieve that, we get a double RENAME and a very brief (sub-millisecond) interval during which the file we're targeting isn't present for a LOOKUP NFS RPC to find.

Monodic answered 3/1, 2017 at 12:13 Comment(9)
There is something weird, or I am missing something. An NFS Filehandle must "persistently" refer to the same object (more or less, an "inode"). The first rename in your trace is "0xe0e7db45/testfile To DH: 0xe0e7db45/.nfs000", where I see that the Filehandle remains the same. After this first rename, the handle still refers to the same object, so I don't understand why TWO renames would be required instead of a single, simple, atomic, rename.Theisen
This only occurs if there's a read filehandle open on the 'writer' host. Rename is atomic to that client, but because two renames occur, the remote client - very briefly, between the two events - sees a directory without the file.Monodic
I still think that the double rename is wrong. Even with an open file ongoing, a single rename, which indeed preserves the Filehandle, should work. I am still missing something.Theisen
Because the NFS server is stateless, it needs to have 'something' for an open filehandle to point to. It can't be the 'new' filename, because you've just renamed something over it, so it has to rename the previous one, out of the way, first.Monodic
Last comment, I promise. The 'something' you say should be the NFS filehandle, which refers (commonly) to an inode. You get a filehandle referring to "myfile.txt", you rename the file, and the filehandle still points to the same object, even when renamed. So, this double renaming seems wrong to me.Theisen
I'm not renaming 'myfile.txt'. I'm renaming 'myfile.tmp' over 'myfile.txt' to replace it (and thus delete it). Normally, that's not a problem on Unix - file handles stay there until the reference count drops to zero, even if the file is deleted. But NFS has to deal with server or client reboots. So in order to keep 'myfile.txt' open - whilst overwriting it - it needs to rename the 'open' copy (and preserve the FH) first.Monodic
The log "Writing host:" refers not to the server, but to the client. This perhaps is even stranger.Theisen
@Monodic what NFS debugging tool did you use to generate the "Reading host:" and "Writing host:" traces? They look more readable than what I have seen so far.Local
@GuillaumePapin That was merely 'tshark' which comes bundled with wireshark but is text only.Monodic

I think the problem is not in the RENAME not being atomic, but in the fact that OPENing a file via NFS is not atomic.

NFS uses Filehandles; in order to do something to a file, a client first obtains a Filehandle through a LOOKUP, then the obtained Filehandle is used to perform the other requests. A minimum of two datagrams is required, and the time between them can, in particular circumstances, be quite "large".

What is happening to you, I suppose, is that a client (client1) performs a LOOKUP; just after that, the looked-up file gets erased as a result of a RENAME (by client2); the Filehandle client1 holds is no longer valid, because it refers to an inode, not to a named path.

The reason for all this is that NFS aims to be stateless. More info in this PDF: http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

Pages 6 and 8 explain this behaviour well.

Theisen answered 30/12, 2016 at 15:10 Comment(0)

Should it actually be impossible to get ENOENT in this scenario?

It is quite possible. The RFC 3530 says:

The operation is required to be atomic to the client.

That most likely means it must be atomic to the client invoking this operation, not all clients.

And further on it says:

If the target directory already contains an entry with the name... the existing target is removed before the rename occurs.

This is the reason other clients get ENOENT sometimes.

In other words, rename is not atomic on NFS.
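Given that, a reading client that wants to be robust can treat ENOENT (and ESTALE) as transient and retry. A sketch of the idea - the retry count and delay are arbitrary, not prescribed by anything:

```python
import errno
import time

def read_with_retry(path: str, attempts: int = 5, delay: float = 0.05) -> bytes:
    """Retry open() on ENOENT/ESTALE, since a replace-by-rename over NFS
    can leave a sub-millisecond window with no entry at `path`."""
    last = None
    for _ in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            if e.errno not in (errno.ENOENT, errno.ESTALE):
                raise  # a real error, not the transient rename window
            last = e
            time.sleep(delay)
    raise last  # still missing after all attempts
```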

Queenqueena answered 30/12, 2016 at 13:10 Comment(0)

As a developer, I was interested in how to properly update the NFS-resident config file of my application. This file is read frequently; however, on application update it is re-written due to schema updates. Importantly, on update the existing content should be preserved, while a "default" config file should be created if one does not exist.

With a truly atomic rename this is simple. On NFS, however, there is a small time slot where the file does not exist, so a reader must not simply create the "default" config file just because it is not found. It appears this problem can be solved on NFS using the script below. The basic procedure is:

  • Updaters atomically create a lock_dir, do the rename, sync, and remove the lock
  • Readers are prepared for non-existing files and stale reads, whereupon they become updaters themselves. Once they get the lock, they try to read the config file again, to differentiate file-in-update from file-does-not-exist.

My C++ implementation of this concept can be found here; for a standalone Python script, see below.

Usage:

# start writer with
$ echo abc > foo; rm tmp*; rmdir foo_LOCK/; ./renametest.py foo 1
# On another machine, start reader with
$ ./renametest.py foo 0

Soon, you'll see messages like

iter 481 stale file handle
iter 16811 file not found
iter 16811 failed to obtain lock. Giving up.

which indicate that some processes starved too long trying to get the lock. Either way, the config file was either successfully read/updated or left untouched - no corruption. Nice.

The script:

#!/usr/bin/env python3

import os
import sys
import tempfile
import errno
import time


def eprint(*args, **kwargs):
    print('iter', g_iter, *args, file=sys.stderr, **kwargs)


def lock_file_name(filename):
    return filename + '_LOCK'

def try_lock(filename):
    try:
        os.mkdir(lock_file_name(filename))
        return True
    except FileExistsError:
        return False


def abc_or_die(filename):
    with open(filename, 'r') as f:
        content = f.read()
    if content != "abc\n":
        eprint("ERROR - bad content:", content)
        exit(1)

def update_it(filename):
    cwd = os.getcwd()
    for i in range(10):
        if not try_lock(filename):
            time.sleep(1)
            continue

        # 'Updating' a cfg file usually means to read it first,
        # which should now be safe:
        abc_or_die(filename)

        tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=cwd).name
        with open(tmp_file, 'w') as f:
            f.write("abc\n")

        # almost-atomic-replace on NFS
        os.rename(tmp_file, filename)
        # sync before releasing the lock. Otherwise there is still a small
        # window where the lock dir is removed while the config-file rename
        # is still in progress
        os.sync()
        os.rmdir(lock_file_name(filename))
        return True

    eprint('failed to obtain lock. Giving up.')


def handle_read_fail(filename):
    for i in range(10):
        if not try_lock(filename):
            time.sleep(1)
            continue
        # got the lock
        if not os.path.exists(filename):
            # TODO: in the real world, we would create the config file now.
            # Here we require it to exist
            eprint('ERROR: got lock but file does not exist')
            exit(1)
        abc_or_die(filename)
        os.rmdir(lock_file_name(filename))
        return True

    eprint('failed to obtain lock. Giving up.')




def read_it(filename):
    try:
        with open(filename, 'r') as f:
            content = f.read()
            if len(content) == 0:
                eprint('file is empty')
                handle_read_fail(filename)
                return

            if content != "abc\n":
                eprint("ERROR - bad content:", content)
                exit(1)
            # eprint('read success on first try!')
            return True
    except OSError as e:
        if e.errno == errno.ENOENT:
            eprint('file not found')
        elif e.errno == errno.ESTALE:
            eprint('stale file handle')
        else:
            eprint("unhandled error", e)
            exit(1)
        handle_read_fail(filename)


def main():
    global g_iter
    filename=sys.argv[1]
    do_update=int(sys.argv[2])

    g_iter = 0
    if do_update == 1:
        while True:
            update_it(filename)
            g_iter += 1
    else:
        while True:
            read_it(filename)
            g_iter += 1

if __name__ == '__main__':
    try:
        main()
    except (BrokenPipeError, KeyboardInterrupt):
        pass
    # avoid additional broken pipe error. s. https://stackoverflow.com/a/26738736
    sys.stderr.close()


As a side note: at first I used advisory locking via flock - a shared lock for reading, an exclusive lock for writing. That way I did not use rename at all, and everything worked "ok" (and the code was simple). However, locking via NFS can be slow when a lot of other traffic is going on, so I looked for a "safe" rename implementation without locks.
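For comparison, the flock-based approach described above looks roughly like this (a sketch of the idea, not the original code):

```python
import fcntl

def read_locked(path: str) -> bytes:
    """Readers take a shared lock, so they never observe a half-written file."""
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def write_locked(path: str, data: bytes) -> None:
    """The writer takes an exclusive lock and rewrites in place - no rename,
    so the directory entry never disappears."""
    with open(path, "r+b") as f:  # file must already exist
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.seek(0)
            f.write(data)
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Note that flock() semantics over NFS depend on the client (modern Linux NFS clients emulate it with byte-range locks on the server), which is part of why it can be slow under heavy lock traffic.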

Montelongo answered 26/9, 2023 at 12:18 Comment(1)
Thanks for the comment, much appreciated. This is still a thing we have to be aware of, but a lot of the issue has gone away by virtue of 'just' stopping code from making assumptions that aren't valid.Monodic
