How to use sys.path_hooks for customized loading of modules?

I hope the following question is not too long, but otherwise I cannot explain my problem and what I want:

Based on what I learned from What is the difference between `sys.meta_path` and `sys.path_hooks` importer objects? (my question of yesterday), I have written a specific loader for a new file type (.xxx). (In fact, the xxx file is an encrypted version of a pyc, to protect code from being stolen.)

I would like just to add an import hook for the new file type "xxx" without affecting the other types (.py, .pyc, .pyd) in any way.

Now, the loader is ModuleLoader, inheriting from importlib.machinery.SourcelessFileLoader.

Using sys.path_hooks the loader shall be added as a hook:

myFinder = importlib.machinery.FileFinder
loader_details = (ModuleLoader, ['.xxx'])
sys.path_hooks.append(myFinder.path_hook(loader_details))

Note: This is activated once by calling modloader.activateLoader()

Upon loading a module named test (which is a test.xxx) I get:

>>> import modloader
>>> modloader.activateLoader()
>>> import test
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'test'
>>>

However, when I delete the content of sys.path_hooks before adding the hook:

sys.path_hooks = []
sys.path.insert(0, '.') # current directory
sys.path_hooks.append(myFinder.path_hook(loader_details))

it works:

>>> modloader.activateLoader()
>>> import test
using xxx class

in xxxLoader exec_module
in xxxLoader get_code: .\test.xxx
ANALYZING ...

GENERATE CODE OBJECT ...

  2           0 LOAD_CONST               0
              3 LOAD_CONST               1 ('foo2')
              6 MAKE_FUNCTION            0
              9 STORE_NAME               0 (foo2)
             12 LOAD_CONST               2 (None)
             15 RETURN_VALUE
>>> test
<module 'test' from '.\\test.xxx'>

The module is imported correctly after conversion of the file's content to a code object.

However, I cannot load the same module from a package: import pack.test

Note: __init__.py is of course present as an empty file in the pack directory.

>>> import pack.test
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 2218, in _find_and_load_unlocked
AttributeError: 'module' object has no attribute '__path__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pack.test'; 'pack' is not a package
>>>

Worse, I can no longer load plain *.py modules from that package either; I get the same error as above:

>>> import pack.testpy
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 2218, in _find_and_load_unlocked
AttributeError: 'module' object has no attribute '__path__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pack.testpy'; 'pack' is not a package
>>>

As I understand it, sys.path_hooks is traversed until the last entry has been tried. So why does the first variant (without clearing sys.path_hooks) not recognize the new extension "xxx", while the second variant (clearing sys.path_hooks) does? It looks like the machinery throws an exception rather than moving on to the next entry when an entry of sys.path_hooks cannot recognize "xxx".

And why does the second version work for py, pyc and xxx modules in the current directory, but not in the package pack? I would expect py and pyc not to work even in the current directory, because sys.path_hooks contains only a hook for "xxx"...

Leister answered 1/2, 2017 at 21:33 Comment(0)

The short answer is that the default PathFinder in sys.meta_path isn't meant to have new file extensions and importers added in the same paths it already supports. But there's still hope!

Quick Breakdown

sys.path_hooks is consumed by the importlib._bootstrap_external.PathFinder class.

When an import happens, each entry in sys.meta_path is asked to find a matching spec for the requested module. The PathFinder in particular will then take each entry of sys.path (or of the parent package's __path__ when importing a submodule) and pass it to the factory functions in sys.path_hooks. Each factory function has a chance to either raise an ImportError (basically the factory saying "nope, I don't support this path entry") or return a finder instance for that path. The first successfully returned finder is then cached in sys.path_importer_cache. From then on, PathFinder will only ask those cached finder instances whether they can provide the requested module.

If you look at the contents of sys.path_importer_cache, you'll see all of the directory entries from sys.path have been mapped to FileFinder instances. Non-directory entries (zip files, etc) will be mapped to other finders.
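A quick way to check this on your own interpreter is the small inspection sketch below. It only reads the two standard attributes; the exact hooks and finders printed depend on the Python version and on what is in sys.path, and the finder_types name is just a helper for this illustration.

```python
import sys

# The factory functions that PathFinder consults, in order.
for hook in sys.path_hooks:
    print("hook:", hook)

# One cached finder per path entry that has been consulted so far.
# A value of None means no hook accepted that entry.
for entry, finder in sys.path_importer_cache.items():
    print(repr(entry), "->", finder)

# Summary of which finder classes are currently cached.
finder_types = {type(finder).__name__
                for finder in sys.path_importer_cache.values()
                if finder is not None}
print(finder_types)
```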

Thus, if you append a new factory created via FileFinder.path_hook to sys.path_hooks, your factory will only be invoked if the previous FileFinder hook didn't accept the path. This is unlikely, since FileFinder will work on any existing directory.

Alternatively, if you insert your new factory to sys.path_hooks ahead of the existing factories, the default hook will only be used if your new factory doesn't accept the path. And again, since FileFinder is so liberal with what it will accept, this would lead to only your loader being used, as you've already observed.
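The "liberal" behavior is easy to reproduce in isolation. The following is a minimal sketch, assuming a hypothetical ".xxx" suffix and using SourceFileLoader as a stand-in for a real loader: the factory accepts any existing directory, yet the finder it returns only knows about its own suffix.

```python
import os
import tempfile
from importlib.machinery import FileFinder, SourceFileLoader

# Hook for a hypothetical ".xxx" suffix (SourceFileLoader is a placeholder).
xxx_hook = FileFinder.path_hook((SourceFileLoader, ['.xxx']))

with tempfile.TemporaryDirectory() as directory:
    # The factory accepts the directory although it holds no ".xxx" file.
    finder = xxx_hook(directory)
    accepted = isinstance(finder, FileFinder)

    # The finder only knows ".xxx", so an ordinary ".py" module is invisible.
    with open(os.path.join(directory, 'plain.py'), 'w') as f:
        f.write('x = 1\n')
    finder.invalidate_caches()
    spec = finder.find_spec('plain')

    # Only a non-directory makes the factory decline with ImportError.
    try:
        xxx_hook(os.path.join(directory, 'no_such_entry'))
        rejected = False
    except ImportError:
        rejected = True

print(accepted, spec, rejected)
```

So whichever FileFinder hook comes first in sys.path_hooks claims every directory, and the suffixes of any later FileFinder hook are never consulted for it.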

Making it Work

So you can either try to adjust that existing factory to also support your file extension and importer (which is difficult as the importers and extension string tuples are held in a closure), or do what I ended up doing, which is add a new meta path finder.

For example, from my own project:


import sys

from importlib.abc import FileLoader
from importlib.machinery import FileFinder, PathFinder
from os import getcwd
from os.path import basename

from sibilant.module import prep_module, exec_module


SOURCE_SUFFIXES = [".lspy", ".sibilant"]


_path_importer_cache = {}
_path_hooks = []


class SibilantPathFinder(PathFinder):
    """
    An overridden PathFinder which will hunt for sibilant files in
    sys.path. Uses storage in this module to avoid conflicts with the
    original PathFinder
    """


    @classmethod
    def invalidate_caches(cls):
        for finder in _path_importer_cache.values():
            if hasattr(finder, 'invalidate_caches'):
                finder.invalidate_caches()


    @classmethod
    def _path_hooks(cls, path):
        for hook in _path_hooks:
            try:
                return hook(path)
            except ImportError:
                continue
        else:
            return None


    @classmethod
    def _path_importer_cache(cls, path):
        if path == '':
            try:
                path = getcwd()
            except FileNotFoundError:
                # Don't cache the failure as the cwd can easily change to
                # a valid directory later on.
                return None
        try:
            finder = _path_importer_cache[path]
        except KeyError:
            finder = cls._path_hooks(path)
            _path_importer_cache[path] = finder
        return finder


class SibilantSourceFileLoader(FileLoader):


    def create_module(self, spec):
        return None


    def get_source(self, fullname):
        return self.get_data(self.get_filename(fullname)).decode("utf8")


    def exec_module(self, module):
        name = module.__name__
        source = self.get_source(name)
        filename = basename(self.get_filename(name))

        prep_module(module)
        exec_module(module, source, filename=filename)


def _get_lspy_file_loader():
    return (SibilantSourceFileLoader, SOURCE_SUFFIXES)


def _get_lspy_path_hook():
    return FileFinder.path_hook(_get_lspy_file_loader())


def _install():
    done = False

    def install():
        nonlocal done
        if not done:
            _path_hooks.append(_get_lspy_path_hook())
            sys.meta_path.append(SibilantPathFinder)
            done = True

    return install


_install = _install()
_install()

The SibilantPathFinder subclasses PathFinder and replaces only those methods which reference sys.path_hooks and sys.path_importer_cache with similar implementations which instead look in a _path_hooks and a _path_importer_cache that are local to this module.

During import, the existing PathFinder will try to find a matching module. If it cannot, then my injected SibilantPathFinder will re-traverse the sys.path and try to find a match with one of my own file extensions.
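For readers who do not want to depend on sibilant or on PathFinder's private methods, the same idea can be sketched with public APIs only. This is a simplified illustration, not the code from my project: XxxLoader, XxxMetaFinder and the ".xxx" suffix are hypothetical names, and the loader simply treats the file as Python source.

```python
import os
import sys
from importlib.abc import MetaPathFinder
from importlib.machinery import FileFinder, SourceFileLoader

XXX_SUFFIXES = ['.xxx']  # hypothetical suffix used for this illustration


class XxxLoader(SourceFileLoader):
    """Placeholder loader: treats ".xxx" files as plain Python source."""


class XxxMetaFinder(MetaPathFinder):
    """Searches the same path entries as PathFinder, with its own finders."""

    _finders = {}  # private cache, independent of sys.path_importer_cache

    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        # path is None for top-level imports, else the parent's __path__.
        for entry in (sys.path if path is None else path):
            entry = entry or os.getcwd()
            if not os.path.isdir(entry):
                continue
            finder = cls._finders.get(entry)
            if finder is None:
                finder = FileFinder(entry, (XxxLoader, XXX_SUFFIXES))
                cls._finders[entry] = finder
            spec = finder.find_spec(fullname)
            # Skip namespace-package specs (loader is None) from bare dirs.
            if spec is not None and spec.loader is not None:
                return spec
        return None

    @classmethod
    def invalidate_caches(cls):
        for finder in cls._finders.values():
            finder.invalidate_caches()


# Appended last, so it only runs when the standard finders found nothing.
sys.meta_path.append(XxxMetaFinder)
```

Because the finder sits at the end of sys.meta_path, ordinary .py, .pyc and extension modules keep going through the default machinery untouched.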

Figuring More Out

I ended up delving into the source for the _bootstrap_external module: https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py

The _install function and the PathFinder.find_spec method are the best starting points for seeing why things work the way they do.

Bedford answered 18/7, 2017 at 13:39 Comment(0)

@obriencj's analysis of the situation is correct. But I came up with a different solution to this problem that doesn't require putting anything in sys.meta_path. Instead, it installs a special hook in sys.path_hooks that acts as a sort of middleware between the PathFinder in sys.meta_path and the other hooks in sys.path_hooks. Rather than just using the first hook that says "I can handle this path!", it tries all matching hooks in order until it finds one that actually returns a useful ModuleSpec from its find_spec method:

import os
import sys

from importlib.abc import PathEntryFinder


@PathEntryFinder.register
class MetaFileFinder:
    """
    A 'middleware', if you will, between the PathFinder sys.meta_path hook,
    and sys.path_hooks hooks--particularly FileFinder.

    The hook returned by FileFinder.path_hook is rather 'promiscuous' in that
    it will handle *any* directory.  So if one wants to insert another
    FileFinder.path_hook into sys.path_hooks, that will totally take over
    importing for any directory, and previous path hooks will be ignored.

    This class provides its own sys.path_hooks hook as follows: If inserted
    on sys.path_hooks (it should be inserted early so that it can supersede
    anything else).  Its find_spec method then calls each hook on
    sys.path_hooks after itself and, for each hook that can handle the given
    sys.path entry, it calls the hook to create a finder, and calls that
    finder's find_spec.  So each sys.path_hooks entry is tried until a spec is
    found or all finders are exhausted.
    """

    class hook:
        """
        Use this little internal class rather than a function with a closure
        or a classmethod or anything like that so that it's easier to
        identify our hook and skip over it while processing sys.path_hooks.
        """

        def __init__(self, basepath=None):
            # Keep None as "no restriction"; os.path.abspath(None) would raise
            self.basepath = (None if basepath is None
                             else os.path.abspath(basepath))

        def __call__(self, path):
            if not os.path.isdir(path):
                raise ImportError('only directories are supported', path=path)
            elif not self.handles(path):
                raise ImportError(
                    'only directories under {} are supported'.format(
                        self.basepath), path=path)

            return MetaFileFinder(path)

        def handles(self, path):
            """
            Return whether this hook will handle the given path, depending on
            what its basepath is.
            """

            path = os.path.abspath(path)

            return (self.basepath is None or
                    os.path.commonpath([self.basepath, path]) == self.basepath)

    def __init__(self, path):
        self.path = path
        self._finder_cache = {}

    def __repr__(self):
        return '{}({!r})'.format(self.__class__.__name__, self.path)

    def find_spec(self, fullname, target=None):
        if not sys.path_hooks:
            return None

        last = len(sys.path_hooks) - 1

        for idx, hook in enumerate(sys.path_hooks):
            if isinstance(hook, self.__class__.hook):
                continue

            finder = None
            try:
                if hook in self._finder_cache:
                    finder = self._finder_cache[hook]
                    if finder is None:
                        # We've tried this finder before and got an ImportError
                        continue
            except TypeError:
                # The hook is unhashable
                pass

            if finder is None:
                try:
                    finder = hook(self.path)
                except ImportError:
                    pass

            try:
                self._finder_cache[hook] = finder
            except TypeError:
                # The hook is unhashable for some reason so we don't bother
                # caching it
                pass

            if finder is not None:
                spec = finder.find_spec(fullname, target)
                if (spec is not None and
                        (spec.loader is not None or idx == last)):
                    # If no __init__.<suffix> was found by any Finder,
                    # we may be importing a namespace package (which
                    # FileFinder.find_spec returns in this case).  But we
                    # only want to return the namespace ModuleSpec if we've
                    # exhausted every other finder first.
                    return spec

        # Module spec not found through any of the finders
        return None

    def invalidate_caches(self):
        for finder in self._finder_cache.values():
            # None marks a hook that rejected this path; nothing to invalidate
            if finder is not None:
                finder.invalidate_caches()

    @classmethod
    def install(cls, basepath=None):
        """
        Install the MetaFileFinder in the front sys.path_hooks, so that
        it can support any existing sys.path_hooks and any that might
        be appended later.

        If given, only support paths under and including basepath.  In this
        case it's not necessary to invalidate the entire
        sys.path_importer_cache, but only any existing entries under basepath.
        """

        if basepath is not None:
            basepath = os.path.abspath(basepath)

        hook = cls.hook(basepath)
        sys.path_hooks.insert(0, hook)
        if basepath is None:
            sys.path_importer_cache.clear()
        else:
            for path in list(sys.path_importer_cache):
                if hook.handles(path):
                    del sys.path_importer_cache[path]

This is still, depressingly, far more complicated than should be necessary. I feel like on Python 2, before the import system rewrite, it was much simpler to do this, since less of the support for the built-in module types (.py, etc.) was built on top of the import hooks themselves, so it was harder to break importing normal modules by adding hooks to import new module types. I'm going to start a discussion on python-ideas to see if there's any way we can improve this situation.

Tribade answered 7/2, 2018 at 19:40 Comment(4)
Nice! What is the advantage of your solution, @Tribade ? It seems that the heart of the problem is that python caches PathEntryFinders from sys.path_hooks without checking that they can actually load all needed modules. @Bedford solution seems to be avoiding the path-based subsystem, reimplementing it better via meta-path. You instead kind of patch the normal PathFinder, making it circumvent the premature caching once again? Also, can you shed light on the install function? I don't understand who calls that and what is it overriding (something from ABCMeta? or PathEntryFinder?).Tony
@MichelePiccolini I don't really remember the details of what's going on here anymore. I think, if I recall correctly, I wanted to use this to handle importing file types that were not standard Python extensions (e.g. import .pyx files). The install method would be called by any code that needs to use this extension to the import system. So in your own code you would run MetaFileFinder.install() before trying to import any "non-standard" modules.Tribade
It's too bad this is still so complicated to do on the Python end. Maybe I should make an issue about it (I seem to recall making a python-ideas post about this around the same time I wrote this, but nobody seemed interested).Tribade
Thanks @Iguananaut! I played a bit with your code and I was able to fully understand what it does. I agree that it does seem cumbersome to have to do something like this just to work around Python's seemingly over-aggressive caching.Tony

I came up with yet another tweak. I won't say it is beautiful, as it wraps a closure around an already existing one, but at least it is short :)

It adds loaders to the default FileFinder objects through a new hook. The original path_hook_for_FileFinder is wrapped in a closure, and the loaders are injected into the FileFinder objects returned by the original hook.

After the new hook is added, sys.path_importer_cache is cleared, as it is already filled with the original FileFinder objects. Those could also be updated dynamically, but I did not bother for now.

Disclaimer: not extensively tested yet. It does what I need in the easiest possible way I know, but the import system is complicated enough to produce funny side-effects for a tweak like this.

import sys
import importlib.machinery

def extend_path_hook_for_FileFinder(*loader_details):

    orig_hook, orig_pos = None, None

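    for i, hook in enumerate(sys.path_hooks):
        # Not every hook is a plain function, so don't assume __name__ exists
        if getattr(hook, '__name__', None) == 'path_hook_for_FileFinder':
            orig_hook, orig_pos = hook, i
            break

    if orig_hook is None:
        raise RuntimeError('path_hook_for_FileFinder not found in sys.path_hooks')

    sys.path_hooks.remove(orig_hook)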

    def extended_path_hook_for_FileFinder(path):
        orig_finder = orig_hook(path)

        loaders = []
        for loader, suffixes in loader_details:
            loaders.extend((suffix, loader) for suffix in suffixes)
        
        orig_finder._loaders.extend(loaders)

        return orig_finder

    sys.path_hooks.insert(orig_pos, extended_path_hook_for_FileFinder)


MY_SUFFIXES = ['.pymy']

class MySourceFileLoader(importlib.machinery.SourceFileLoader):
    pass

loader_detail = (MySourceFileLoader, MY_SUFFIXES)

extend_path_hook_for_FileFinder(loader_detail)

# empty the cache, as it is already filled with plain FileFinder
# objects for the most common path elements
sys.path_importer_cache.clear()
importlib.invalidate_caches()

Ragg answered 26/4, 2022 at 5:41 Comment(3)
Extremely clever. Although terser than Iguananaut's already clever solution, this approach requires violating privacy encapsulation at least twice: once via the assumption that importlib._bootstrap_external.FileFinder.path_hook() returns a closure named path_hook_for_FileFinder (which is by no means guaranteed in future Python releases) and again by accessing the private FileFinder._loaders list. Anyone who does this should have extensive unit tests validating this does what you think it does.Risner
Relatedly: you definitely don't want to do sys.path_importer_cache = {}. No one expects anyone to do something drastic like that, so you shouldn't. If any other code held a reference to the prior cache object, you've now pulled the rug out from underneath them. Instead, you want to first make a call to sys.path_importer_cache.clear() followed by a call to importlib.invalidate_caches(). Doing so preserves the existing cache object while still achieving the intended effect.Risner
Thanks @CecilCurry, good points. I will try to come up with something more future-proof; currently I am just using this to achieve something that should be better supported by the import system in the first place. Clearing the cache properly is definitely better than my hard delete, I will incorporate that!Dicarlo
