Determine a file's path(s) relative to a directory, including symlinks

I have a directory with thousands of descendants (at least 1,000, probably no more than 20,000). Given a file path (which is guaranteed to exist), I want to know where that file can be found inside that directory -- including via symlinks.

For example, given:

  • The directory path is /base.
  • The real file path is /elsewhere/myfile.
  • /base is a symlink to /realbase.
  • /realbase/foo is a symlink to /elsewhere.
  • /realbase/bar/baz is a symlink to /elsewhere/myfile.

I want to find the paths /base/foo/myfile and /base/bar/baz.

I could do this by recursively checking every symlink in /base, but this would be very slow. I'm hoping that there's a more graceful solution.
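
For reference, the brute-force version would look something like the sketch below (find_aliases is a hypothetical helper; note that followlinks=True will recurse forever if the tree contains a symlink cycle):

import os

def find_aliases(base, target):
    # Walk `base`, descending into symlinked directories, and report
    # every path that resolves to the same real file as `target`.
    real_target = os.path.realpath(target)
    matches = []
    for root, dirs, files in os.walk(base, followlinks=True):
        for name in files + dirs:
            path = os.path.join(root, name)
            if os.path.realpath(path) == real_target:
                matches.append(path)
    return matches

# find_aliases('/base', '/elsewhere/myfile')
# should find ['/base/foo/myfile', '/base/bar/baz'] in the example above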


Motivation

This is for a Sublime Text plugin. When the user saves a file, we want to detect whether it is in the Sublime configuration directory. In particular, we want to do so even if the file is symlinked from inside the config directory and the user is editing the file at its physical path (e.g. inside their Dropbox directory). There may be other applications as well.

Sublime runs on Linux, Windows, and macOS, so ideally the solution should too.

Dot answered 22/2, 2019 at 21:57 Comment(8)
This question is similar: #4532741. The basic finding is that you have to search everywhere. My gut feeling is that you should maintain a data structure which indexes all of a file's links, updating it when new symlinks are created.Monia
Because of the differences in how the requested operating systems and their file systems work, you may need to have OS-specific modules that are selected at runtime.Freewheel
Do you really need to worry about symlinks on Windows? (it's a somewhat obscure feature that average users don't know about)Freewheel
"This is for a Sublime Text plugin." - I'd suggest that this is not your job... and could even potentially be considered malicious... no, no, no. IDEs are already bloated and slow, this "feature" just adds to that.Freewheel
If I didn't care about performance, then I wouldn't have asked this question. If it can't be done within reasonable performance bounds, then I won't do it at all. But what's reasonable depends on the application -- the performance budget for a handler that runs on every save event is much tighter than the budget for a user-invoked command that runs an automated test suite.Dot
You completely missed the first part of my comment... an IDE that scans / watches / monitors / whatever the entire VFS to try and figure out if a configuration file has changed, is a bad idea - made even worse by the edge case requirement of "maybe someone had some of their config symlinked into their Dropbox". I'd argue pretty strongly that as a plugin developer, this is not your job. Perhaps switch it around, to say "reload config if files in these other directories change", and allow the user to add their Dropbox directory to that list.Freewheel
Re: performance... what works well for an interactive application on your system will likely be very different on another user's system. Even triggering "on every save event" is a bad idea (performance / what if some other editor, or even Dropbox changed the file, not Sublime) - use the tools you have been given to monitor these things, like inotify, and keep the scope small.Freewheel
Just to clarify, the only files that need to be checked are the editor config files, not every file in the filesystem. I agree that it wouldn't make sense to monitor the entire filesystem.Dot

This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem; they cannot cross a filesystem boundary.

It is also possible for a path to address a "symbolic link", which points at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path, producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
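
To make the distinction concrete, here is a small Python sketch (the path is hypothetical): os.lstat() describes the directory entry itself, while os.stat() follows symbolic links; the (st_dev, st_ino) pair identifies the underlying inode, and st_nlink is its hard link count.

import os, stat

path = '/tmp/example'          # hypothetical path
st = os.lstat(path)            # lstat() describes the entry itself
if stat.S_ISLNK(st.st_mode):
    print('symlink ->', os.readlink(path))
else:
    # (st_dev, st_ino) identifies the inode; st_nlink counts its hard links
    print('inode:', (st.st_dev, st.st_ino), 'hard links:', st.st_nlink)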


Before we get into this... some comments:

  1. See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so a lower-spec system will take longer...
  2. Given that this is impossible without calling stat() on every file at some point, you're going to struggle to come up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces).

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us (note it uses the errno module, imported in the full listing below):

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                # use errno constants rather than magic numbers - ELOOP in
                # particular is 40 on Linux but 62 on macOS / BSD
                if e.errno == errno.ENOENT:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == errno.ELOOP:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            # lstat() does not follow symlinks: if the entry's own inode
            # matches the stat() result, this path is a hard link to the
            # inode; otherwise it's a symlink pointing at it
            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [  67687]  4 -> 5
├── [  67676]  5 -> 4
├── [  67675]  6 -> dead
├── [  67676]  a
│   └── [  67679]  1
├── [  67677]  b
│   └── [  67679]  2 -> ../a/1
├── [  67678]  c
│   └── [  67679]  3
└── [  67687]  d
    └── [  67688]  4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.

It might also be an idea to collect this data while the tree is being populated (if that's something that would work with your process)... on Linux you can use inotify to deal with this nicely - there's even a Python module.
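
As a sketch of that idea (assuming the third-party pyinotify package, which is Linux-only), something like this could update the map incrementally as entries appear:

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        # re-stat event.pathname and update the entity map here
        print('created:', event.pathname)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
# rec=True watches the whole tree; auto_add=True also watches new directories
wm.add_watch('/base', pyinotify.IN_CREATE | pyinotify.IN_DELETE,
             rec=True, auto_add=True)
notifier.loop()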

#!/usr/bin/env python3

import os, sys, stat, errno
from pprint import pprint

def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                # use errno constants rather than magic numbers - ELOOP in
                # particular is 40 on Linux but 62 on macOS / BSD
                if e.errno == errno.ENOENT:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == errno.ELOOP:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            # lstat() does not follow symlinks: if the entry's own inode
            # matches the stat() result, this path is a hard link to the
            # inode; otherwise it's a symlink pointing at it
            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

entity_map = get_map(os.getcwd())

if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)

I've run this on my system out of curiosity. It's a 6-disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

  • A dataset of ~3.1k files and links in ~850 directories. This runs in less than 3.5 seconds, ~80ms on subsequent runs
  • A dataset of ~30k files and links in ~2.2k directories. This runs in less than 30 seconds, ~300ms on subsequent runs
  • A dataset of ~73.5k files and links in ~8k directories. This runs in approx 60 seconds, ~800ms on subsequent runs

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!

Freewheel answered 25/2, 2019 at 15:51 Comment(4)
Depending on the directory structure it might be more efficient to use os.scandir instead of calling os.stat for each file. It is used by the os.walk implementation to distinguish files from subdirectories with a single system call instead of one os.stat call per directory entry. Ideally there would be a version of os.walk that simply yields the DirEntry objects so you'd avoid having to repeat it...Mixed
@Mixed - indeed... I left optimisation out, but that would be a very sensible first step...Freewheel
os.scandir is not available on Python 3.3.Dot
This could be sped up substantially by relaxing the requirements a little — for example, only checking files with the same name as the modified file. This could fail if the user symlinks a file with a different name, but this may be acceptable in practice.Dot

Symlinks do not admit of shortcuts. You have to know about all relevant FS entries that might point at the file of interest. That corresponds either to creating an empty directory and then listening for all file creation events under it, or to scanning all files currently under it. Run the following.

#! /usr/bin/env python

from pathlib import Path
import collections
import os
import pprint
import stat


class LinkFinder:

    def __init__(self):
        self.target_to_orig = collections.defaultdict(set)

    def scan(self, folder='/tmp'):
        for fspec, target in self._get_links(folder):
            self.target_to_orig[target].add(fspec)

    def _get_links(self, folder):
        for root, dirs, files in os.walk(Path(folder).resolve()):
            for file in files:
                fspec = os.path.join(root, file)
                if stat.S_ISLNK(os.lstat(fspec).st_mode):
                    # readlink() may return a relative target; resolve it
                    # against the symlink's own directory, not the CWD
                    target = os.path.normpath(os.path.join(
                        os.path.dirname(fspec), os.readlink(fspec)))
                    yield fspec, target


if __name__ == '__main__':
    lf = LinkFinder()
    for folder in '/base /realbase'.split():
        lf.scan(folder)
    pprint.pprint(lf.target_to_orig)

You wind up with a mapping from each symlink target to the set of aliases by which it may be accessed.

A symlink target may be a file or a directory, so to properly use the mapping on a given filespec you must repeatedly truncate it, asking whether its parent or some ancestor directory appears in the mapping, as sketched below.
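
A minimal sketch of that lookup (aliases_for is a hypothetical helper built on the target_to_orig mapping above; note it follows only one level of indirection, so chains of symlinks would need the process repeated):

import os

def aliases_for(target_to_orig, fspec):
    # Check fspec itself, then each ancestor directory; whenever one of
    # them is a known symlink target, the alias is that symlink plus the
    # path suffix below the ancestor.
    results = set()
    path, suffix = os.path.abspath(fspec), ''
    while True:
        for link in target_to_orig.get(path, ()):
            results.add(os.path.join(link, suffix) if suffix else link)
        parent = os.path.dirname(path)
        if parent == path:          # reached the filesystem root
            return results
        suffix = (os.path.join(os.path.basename(path), suffix)
                  if suffix else os.path.basename(path))
        path = parent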

Dangling symlinks are not handled specially; they are simply allowed to dangle.

You might choose to serialize the mapping, probably in sorted order. If you repeatedly re-scan a large directory, there is an opportunity to remember directory mod times across runs, and avoid re-scanning files in that directory. Unfortunately, you would still have to recurse down into its descendant directories in case any of them had recent changes. Your subtrees may exhibit enough structure to let you avoid recursing more than K levels deep, or avoid descending into a directory whose name matches some regex.
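
A sketch of the mod-time idea (changed_dirs is illustrative, not part of the code above): persist the previous scan's directory mtimes, and only re-read the entries of directories whose mtime has moved, while still recursing everywhere:

import os

def changed_dirs(root, last_mtimes):
    # Returns (changed, mtimes): the directories whose mtime differs from
    # the previous run, plus a fresh mtime map to persist for next time.
    # We still recurse into every directory, because a descendant may have
    # changed even when its parent's mtime has not.
    changed, mtimes = [], {}
    for dirpath, _dirs, _files in os.walk(root):
        m = os.stat(dirpath).st_mtime
        mtimes[dirpath] = m
        if last_mtimes.get(dirpath) != m:
            changed.append(dirpath)
    return changed, mtimes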

If most FS changes are produced by a handful of programs, such as package managers or a build system, then getting those programs to log their actions could yield a performance win. That is, if you do a full scan each midnight, and then you run make in only two out of a thousand directories, you could choose to re-scan just that pair of subtrees.

Mufi answered 25/2, 2019 at 6:44 Comment(0)

My first instinct is to have the OS or some service inform you when the file-system tree has changed, instead of looking for the changes yourself. Essentially: don't reinvent the wheel.

Maybe:

Windows specific: 5 tools to monitor folder changes
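
For a cross-platform version of the same idea, a minimal sketch using the third-party watchdog package (an assumption on my part - it wraps inotify on Linux, FSEvents on macOS and ReadDirectoryChangesW on Windows):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ConfigHandler(FileSystemEventHandler):
    def on_modified(self, event):
        # react to the change, e.g. re-check whether event.src_path
        # resolves into the configuration directory
        print('changed:', event.src_path)

observer = Observer()
observer.schedule(ConfigHandler(), '/base', recursive=True)  # '/base' is a placeholder
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()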

Concise answered 25/2, 2019 at 16:21 Comment(0)
