Is there a good way to detect a stale NFS mount?

Asked 29/10, 2009 at 12:24 Answered 12/10, 2023 at 20:43

I have a procedure I want to initiate only if several tests complete successfully.

One test I need is that all of my NFS mounts are alive and well.

Can I do better than the brute force approach:

mount | sed -n "s/^.* on \(.*\) type nfs .*$/\1/p" |
while read -r mount_point ; do
  timeout 10 ls "$mount_point" >& /dev/null || echo "stale $mount_point" ;
done

Here timeout is a utility that will run the command in the background, and will kill it after a given time, if no SIGCHLD was caught prior to the time limit, returning success/fail in the obvious way.

In English: parse the output of mount and check every NFS mount point, with each check bounded by a timeout. Optionally (not in the code above), break on the first stale mount.
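The same idea can be sketched in Python. This is only a rough sketch under a few assumptions: Linux with /proc/mounts available, a `stat` binary on the PATH, and the function names (`nfs_mount_points`, `is_responsive`, `report_stale`) are made up for illustration.

```python
import subprocess

NFS_TYPES = {"nfs", "nfs3", "nfs4"}

def nfs_mount_points(mount_table):
    """Return the mount points of NFS entries in /proc/mounts-style text."""
    points = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in NFS_TYPES:
            # /proc/mounts writes a space inside a path as the escape \040.
            points.append(fields[1].replace("\\040", " "))
    return points

def is_responsive(path, seconds=10):
    """True if `stat` on path exits successfully within the time limit."""
    try:
        result = subprocess.run(["stat", "--", path], timeout=seconds,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def report_stale(table_path="/proc/mounts"):
    """Print every NFS mount point that fails the bounded check."""
    with open(table_path) as table:
        for point in nfs_mount_points(table.read()):
            if not is_responsive(point):
                print("stale", point)
```

Calling `report_stale()` prints one line per unresponsive mount. It is still the brute force approach, and `subprocess.run` waits for the killed child after a timeout, so a child that cannot be killed could still block it.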

Lubricity answered 29/10, 2009 at 12:24 Comment(0)

You could write a C program and check for ESTALE.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iso646.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    struct stat st;
    int ret;
    ret = stat("/mnt/some_stale", &st);
    if(ret == -1 and errno == ESTALE){
        printf("/mnt/some_stale is stale\n");
        return EXIT_SUCCESS;
    } else {
        return EXIT_FAILURE;
    }
}
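For comparison, the same ESTALE check is a few lines in Python. This is a minimal sketch; `mount_state` is a hypothetical name, and like the C version it calls stat directly, so it can block for as long as the stat call itself blocks.

```python
import errno
import os

def mount_state(path):
    """Classify path as 'ok', 'stale' (ESTALE) or 'error' (any other failure)."""
    try:
        os.stat(path)
    except OSError as exc:
        return "stale" if exc.errno == errno.ESTALE else "error"
    return "ok"
```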

Upturned answered 29/10, 2009 at 12:53 Comment(9)

From man 3 errno: ESTALE Reserved. Does this mean I should look for another solution? – Lubricity 1/11, 2009 at 12:55

I guess it depends on your kernel. – Upturned 2/11, 2009 at 1:11

Yes, you are right: in a later version of my distro, man 3 errno says: "ESTALE Stale file handle (POSIX.1). This error can occur for NFS and for other file systems". And although I went with the brute force approach described in my question, I will accept this answer. – Lubricity 2/11, 2009 at 7:56

@Teddy, I know I am picking dead questions ... can you put a short C program example that returns 1 if /mnt/some_stale is in stale status? – Xylograph 29/8, 2012 at 13:0

@Oz123 Done, but I'm curious why you would need it - it's completely straightforward. – Upturned 14/10, 2012 at 0:51

@Teddy, if you are a Linux C programmer it is straight forward... For me it is always fun to see more C code. Thanks for bothering ! – Xylograph 14/10, 2012 at 7:58

@Teddy, actually, did you try your code? In my Linux box, if there is a stale NFS, the command stat will hang, exactly like your function stat("/mnt/some_stale", &st), so the code never returns... – Xylograph 15/10, 2012 at 8:37

Confirming @Oz123's observation - a hung NFS mount will hang this program indefinitely. – Ic 10/3, 2016 at 3:44

I made a really hacked together code for this answer in python (and managed to test it when I had a stale filesystem): github.com/guysoft/stale_mount_checker – Urnfield 30/3, 2017 at 12:6

A colleague of mine ran into your script. This doesn't avoid a "brute force" approach, but if I may in Bash:

while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || echo "$mount timeout"; 
done < <(mount -t nfs)

mount can list NFS mounts directly. read -t (a shell builtin) can time out a command. stat -t (terse output) still hangs like an ls*. ls yields unnecessary output, risks false positives on huge/slow directory listings, and requires permissions to access - which would also trigger a false positive if it doesn't have them.

while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || lsof -b 2>/dev/null|grep "$mount"; 
done < <(mount -t nfs)

We're using it with lsof -b (non-blocking, so it won't hang too) in order to determine the source of the hangs.

Thanks for the pointer!

test -d (a shell builtin) would work instead of stat (a standard external) as well, but read -t returns success only if it doesn't time out and reads a line of input. Since test -d doesn't use stdout, a (( $? > 128 )) errorlevel check on it would be necessary - not worth the legibility hit, IMO.

Sovereign answered 24/8, 2011 at 17:22 Comment(5)

while the latter example allows the command (without hanging at the stat) the lsof -b 2 just appears to skip all the stat tests and return nothing. – Heterotypic 18/2, 2012 at 9:2

As you know, <(...) is executed in a sub-shell, and if stat(1) hangs due to a stale NFS mount, the sub-shell won't terminate gracefully. See check-nfs.sh for an improvement on this. – Merideth 2/7, 2014 at 2:19

That works really well, except when the mount's source has whitespace in the path. Here's a variation that works with whitespace: while read -r mount; do timeout -k 2 2 stat -t "$mount" > /dev/null || echo "$mount timeout"; done < <(grep nfs /proc/mounts | cut -d' ' -f2) – Dygert 22/3, 2021 at 17:35

I am in awe of the beauty of this piece of shell ingenuity. But I almost fell for a trap hidden in the last line (for my use case): I had to check mount -t nfs4,nfs, as I have both nfs and nfs4 type mounts. My overall solution:

bash -c 'RETURNCODE=0; while read _ _ mount _; do read -t0.5 < <(stat -t "$mount") || { echo "$mount mount is gone. Please reboot."; RETURNCODE=1; }; done < <(mount -t nfs,nfs4); exit $RETURNCODE'

– Jen 12/7, 2023 at 12:18

I also replaced read -t with the coreutils timeout (addressing the issues raised in the comment by @cinsk). My new overall solution:

bash -c 'RETURNCODE=0; while read _ _ mount _; do timeout 0.5 stat -t "$mount" > /dev/null || { echo "$mount mount is gone. Please reboot."; RETURNCODE=1; }; done < <(mount -t nfs,nfs4); exit $RETURNCODE'

– Jen 12/7, 2023 at 12:31

Took me some time, but here is what I found which works in Python:

import signal, subprocess

class Alarm(Exception):
    pass

def alarm_handler(signum, frame):
    raise Alarm

pathToNFSMount = '/mnt/server1/' # or you can implement some function
                                 # to find all the mounts...

signal.signal(signal.SIGALRM, alarm_handler)
signal.alarm(3)  # 3 seconds
try:
    # Popen (not call) returns the process object that communicate() needs
    proc = subprocess.Popen(['stat', pathToNFSMount], stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    stdoutdata, stderrdata = proc.communicate()
    signal.alarm(0)  # reset the alarm
except Alarm:
    print("Oops, taking too long!")

Remarks:

Credit to the answer here.
You could also use an alternative scheme:

os.fork() and os.stat()

Check whether the forked child has finished; if it has timed out, you can kill it. You will need to work with time.time() and so on.
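A rough sketch of that fork-based scheme, assuming a POSIX system. The function name `stat_with_timeout` and its defaults are made up for illustration, and it uses time.monotonic() rather than time.time() so that clock adjustments cannot stretch or shrink the deadline.

```python
import os
import signal
import time

def stat_with_timeout(path, seconds=3.0, poll=0.05):
    """Return 'ok', 'error' or 'timeout' for os.stat(path) run in a forked child."""
    pid = os.fork()
    if pid == 0:
        # Child: the exit status carries the result back to the parent.
        code = 1
        try:
            os.stat(path)
            code = 0
        finally:
            os._exit(code)

    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return "ok" if ok else "error"
        time.sleep(poll)

    # Deadline passed: kill the child. It is not reaped here, because
    # waiting for a process stuck on a dead mount could block the parent too.
    os.kill(pid, signal.SIGKILL)
    return "timeout"
```

For example, `stat_with_timeout('/mnt/server1/')` should come back within about three seconds whether or not the mount answers.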

Xylograph answered 4/9, 2012 at 10:8 Comment(0)

In addition to the previous answers, which hang under some circumstances, this snippet checks all suitable mounts, kills with signal KILL, and is tested with CIFS too:

grep -v tracefs /proc/mounts | cut -d' ' -f2 |
  while read -r m; do
    timeout --signal=KILL 1 ls -d "$m" > /dev/null || echo "$m"
  done

Recitative answered 14/2, 2018 at 12:44 Comment(0)

Writing a C program that checks for ESTALE is a good option if you don't mind waiting for the command to finish because of the stale file system. If you want to implement a "timeout" option the best way I've found to implement it (in a C program) is to fork a child process that tries to open the file. You then check if the child process has finished reading a file successfully in the filesystem within an allocated amount of time.

Here is a small proof of concept C program to do this:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>   /* declares kill() */
#include <sys/types.h>
#include <sys/wait.h>


void readFile();
void waitForChild(int pid);


int main(int argc, char *argv[])
{
  int pid;

  pid = fork();

  if(pid == 0) {
    // Child process.
    readFile();
  }
  else if(pid > 0) {
    // Parent process.
    waitForChild(pid);
  }
  else {
    // Error
    perror("Fork");
    exit(1);
  }

  return 0;
}

void waitForChild(int child_pid)
{
  int timeout = 2; // 2 seconds timeout.
  int status;
  int pid;

  while(timeout != 0) {
    pid = waitpid(child_pid, &status, WNOHANG);
    if(pid == 0) {
      // Still waiting for a child.
      sleep(1);
      timeout--;
    }
    else if(pid == -1) {
      // Error
      perror("waitpid()");
      exit(1);
    }
    else {
      // The child exited.
      if(WIFEXITED(status)) {
        // Child was able to call exit().
        if(WEXITSTATUS(status) == 0) {
          printf("File read successfully!\n");
          return;
        }
      }
      printf("File NOT read successfully.\n");
      return;
    }
  }

  // The child did not finish and the timeout was hit.
  kill(child_pid, 9);
  printf("Timeout reading the file!\n");
}

void readFile()
{
  int fd;

  fd = open("/path/to/a/file", O_RDWR);
  if(fd == -1) {
    // Error
    perror("open()");
    exit(1);
  }
  else {
    close(fd);
    exit(0);
  }
}

Benton answered 3/9, 2013 at 3:20 Comment(0)

I wrote https://github.com/acdha/mountstatus, which uses an approach similar to what UndeadKernel mentioned and which I've found to be the most robust approach. It's a daemon which periodically scans all mounted filesystems by forking a child process that attempts to list the top-level directory, and sends it SIGKILL if it fails to respond within a certain timeout. Both successes and failures are recorded to syslog. That avoids issues with certain client implementations (e.g. older Linux) which never trigger timeouts for certain classes of error, NFS servers which are partially responsive but won't respond to actual calls like listdir, etc.

I don't publish them but the included Makefile uses fpm to build rpm and deb packages with an Upstart script.
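To illustrate the idea (this is not the mountstatus code, just a rough sketch of a single scan pass), the checks can be started for all mounts at once so that one dead mount does not delay the others. The name `scan_once`, its defaults, and the use of `ls -A` are assumptions for the example; a real daemon would also log the results to syslog.

```python
import subprocess
import time

def scan_once(paths, seconds=5.0, poll=0.1):
    """List each path in its own child process; return {path: status}."""
    children = {
        path: subprocess.Popen(["ls", "-A", "--", path],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL)
        for path in paths
    }

    # Wait until every child has exited or the shared deadline passes.
    deadline = time.monotonic() + seconds
    while (time.monotonic() < deadline
           and any(child.poll() is None for child in children.values())):
        time.sleep(poll)

    results = {}
    for path, child in children.items():
        code = child.poll()
        if code is None:
            child.kill()  # sends SIGKILL on POSIX
            results[path] = "timeout"
        else:
            results[path] = "ok" if code == 0 else "failed"
    return results
```

Because all children share one deadline, a whole pass takes roughly `seconds` at worst, however many mounts are unresponsive.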

Ultraviolet answered 5/4, 2016 at 18:3 Comment(1)

Along with porting the project to Rust and adding a few other features there are now deb & rpm packages: github.com/acdha/mountstatus/releases – Ultraviolet 7/9, 2017 at 21:40

Another way, using a shell script. Works well for me:

#!/bin/bash
# Purpose:
# Detect Stale File handle and remove it
# Script created: July 29, 2015 by Birgit Ducarroz
# Last modification: --
#

# Detect stale file handles and write the affected paths into a file
df 2>&1 | grep 'Stale file handle' | awk '{print $2}' > NFS_stales.txt
# Remove : ‘ and ’ characters from the output
sed -r -i 's/://' NFS_stales.txt && sed -r -i 's/‘//' NFS_stales.txt && sed -r -i 's/’//' NFS_stales.txt

# Not used: replace space by a new line
# stales=`cat NFS_stales.txt && sed -r -i ':a;N;$!ba;s/ /\n /g' NFS_stales.txt`

# read NFS_stales.txt output file line by line then unmount stale by stale.
#    IFS='' (or IFS=) prevents leading/trailing whitespace from being trimmed.
#    -r prevents backslash escapes from being interpreted.
#    || [[ -n $line ]] prevents the last line from being ignored if it doesn't end with a \n (since read returns a non-zero exit code when it encounters EOF).

while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Unmounting due to NFS Stale file handle: $line"
    umount -fl "$line"
done < "NFS_stales.txt"
#EOF

Variole answered 29/7, 2015 at 8:4 Comment(1)

thanks, I couldn't get the redirect to work without this for some reason. – Sat 12/5, 2022 at 14:18

I'll just paste a snippet from our Icinga2 NFS stale mount monitoring Bash script here:

MOUNTS="$(mount -t nfs;mount -t nfs3;mount -t nfs4)"
MOUNT_POINTS=$(echo -e "$MOUNTS \n"|grep -v ^$|awk '{print $3}')

if [ -z "$MOUNT_POINTS" ]; then
        OUTPUT="[OK] No nfs mounts"
        set_result 0
else
        for i in $MOUNT_POINTS;do
                timeout 1 stat -t "$i" > /dev/null
                TMP_RESULT=$?
                set_result $TMP_RESULT
                set_output $TMP_RESULT "$i"
        done
fi

Wickman answered 7/9, 2022 at 6:3 Comment(0)

I know this post is old, but I have been struggling with this. I made it dead simple, on my mounts (smbc/nfs/gpfs):

echo mountup > /mnt/s/test/.chkstring

To test:

read -t1 mount < /mnt/s/test/.chkstring

if [[ $mount == "mountup" ]] ; then
echo "do your stuff"
fi

unset mount

Of course you need to be sure that the file with the test string does not exist when there is no mount, but that speaks for itself. Leaves no processes or whatever lingering.

Bathyscaphe answered 12/10, 2023 at 20:43 Comment(0)
