Import vendored dependencies in Python package without modifying sys.path or 3rd party packages
Summary

I am working on a series of add-ons for Anki, an open-source flashcard program. Anki add-ons are shipped as Python packages, with the basic folder structure looking as follows:

anki_addons/
    addon_name_1/
        __init__.py
    addon_name_2/
        __init__.py

anki_addons is appended to sys.path by the base app, which then imports each add-on with import <addon_name>.
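Roughly, that loading scheme can be sketched as follows (a hypothetical illustration, not Anki's actual code):

```python
import importlib
import os
import sys

def load_addons(addons_dir):
    """Sketch of the base app's loader: put the add-on folder on the
    module search path, then import each subfolder as a package."""
    sys.path.insert(0, addons_dir)
    for name in sorted(os.listdir(addons_dir)):
        if os.path.isdir(os.path.join(addons_dir, name)):
            importlib.import_module(name)
```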

The problem I have been trying to solve is to find a reliable way to ship packages and their dependencies with my add-ons while not polluting global state or falling back to manual edits of the vendored packages.

Specifics

Specifically, given an add-on structure like this...

addon_name_1/
    __init__.py
    _vendor/
        __init__.py
        library1
        library2
        dependency_of_library2
        ...

...I would like to be able to import any arbitrary package that is included in the _vendor directory, e.g.:

from ._vendor import library1

The main difficulty with relative imports like this is that they do not work for packages that in turn depend on other packages imported through absolute references (e.g. import dependency_of_library2 in the source code of library2).
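To make the failure mode concrete, using the hypothetical names from above: an absolute import statement is resolved against sys.path, so a copy of the dependency that lives only inside addon_name_1/_vendor/ is never found.

```python
import importlib

# Hypothetical demonstration: library2 does `import dependency_of_library2`.
# Python searches sys.path for that top-level name, not the add-on's
# _vendor/ directory, so the vendored copy is invisible and the import fails.
try:
    importlib.import_module('dependency_of_library2')
except ImportError as exc:
    print('absolute import failed:', exc)
```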

Solution attempts

So far I have explored the following options:

  1. Manually updating the third-party packages, so that their import statements point to the fully qualified module path within my python package (e.g. import addon_name_1._vendor.dependency_of_library2). But this is tedious work that is not scalable to larger dependency trees and not portable to other packages.
  2. Adding _vendor to sys.path via sys.path.insert(1, <path_to_vendor_dir>) in my package init file. This works, but it introduces a global change to the module look-up path which will affect other add-ons and even the base app itself. It just seems like a hack that could result in a pandora's box of issues later down the line (e.g. conflicts between different versions of the same package, etc.).
  3. Temporarily modifying sys.path for my imports, but this fails to work for third-party modules with method-level imports.
  4. Writing a PEP 302-style custom importer based on an example I found in setuptools, but I just couldn't make head or tail of that.
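For reference, attempt #3 can be sketched as a context manager (paths are illustrative). The fatal flaw is that an import statement inside a function body of a vendored package only executes when that function is first called, which may be long after the path has been cleaned up again:

```python
import sys
from contextlib import contextmanager

@contextmanager
def vendor_path(path):
    """Temporarily prepend `path` to the module search path."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

# Illustrative usage:
# with vendor_path('/path/to/addon_name_1/_vendor'):
#     import library2   # works here, but a method-level
#                       # `import dependency_of_library2` inside library2
#                       # runs after the path is removed again, and fails
```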

I've been stuck on this for quite a few hours now and I'm beginning to think that I'm either completely missing an easy way to do this, or that there is something fundamentally wrong with my entire approach.

Is there no way I can ship a dependency tree of third-party packages with my code, without having to resort to sys.path hacks or modifying the packages in question?


Edit:

Just to clarify: I don't have any control over how add-ons are imported from the anki_addons folder. anki_addons is just the directory provided by the base app where all add-ons are installed into. It is added to the sys path, so the add-on packages therein pretty much just behave like any other python package located in Python's module look-up paths.

Gyn answered 27/9, 2018 at 13:31 Comment(0)

First of all, I'd advise against vendoring; a few major packages did use vendoring before but have switched away to avoid the pain of having to handle it. One such example is the requests library. If you are relying on people using pip install to install your package, then just declare dependencies and tell people about virtual environments. Don't assume you need to shoulder the burden of keeping dependencies untangled, or that you need to stop people from installing dependencies in the global Python site-packages location.

At the same time, I appreciate that a plug-in environment for a third-party tool is something different, and if adding dependencies to the Python installation used by that tool is cumbersome or impossible, vendoring may be a viable option. I see that Anki distributes extensions as .zip files without setuptools support, so that's certainly such an environment.

So if you choose to vendor dependencies, then use a script to manage your dependencies and update their imports. This is your option #1, but automated.

This is the path that the pip project has chosen, see their tasks subdirectory for their automation, which builds on the invoke library. See the pip project vendoring README for their policy and rationale (chief among those is that pip needs to bootstrap itself, e.g. have their dependencies available to be able to install anything).

You should not use any of the other options; you already enumerated the issues with #2 and #3.

The issue with option #4, using a custom importer, is that you still need to rewrite imports. Put differently, the custom importer hook used by setuptools doesn't solve the vendorized namespace problem at all, it instead makes it possible to dynamically import top-level packages if the vendorized packages are missing (a problem that pip solves with a manual debundling process). setuptools actually uses option #1, where they rewrite the source code for vendorized packages. See for example these lines in the packaging project in the setuptools vendored subpackage; the setuptools.extern namespace is handled by the custom import hook, which then redirects either to setuptools._vendor or the top-level name if importing from the vendorized package fails.
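For illustration, here is a minimal sketch of such a setuptools-style redirecting import hook, updated to the modern find_spec/exec_module API (all package names are hypothetical). Note that it only aliases whole modules: it does nothing about the absolute imports inside the vendored sources themselves, which is why the import rewriting is still needed.

```python
import importlib
import importlib.util
import sys

class VendorImporter:
    """Meta-path finder that redirects `<root_name>.X` imports to
    `<vendor_pkg>.X`, falling back to a top-level `X` if the vendored
    copy is missing (the setuptools.extern approach)."""

    def __init__(self, root_name, vendored_names, vendor_pkg):
        self.root_name = root_name
        self.vendored_names = set(vendored_names)
        self.vendor_pkg = vendor_pkg

    def find_spec(self, fullname, path=None, target=None):
        root, _, name = fullname.partition(self.root_name + '.')
        if root or name.split('.')[0] not in self.vendored_names:
            return None  # not one of the names we manage
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        name = spec.name.partition(self.root_name + '.')[2]
        # try the vendored copy first, then the global install
        for prefix in (self.vendor_pkg + '.', ''):
            try:
                return importlib.import_module(prefix + name)
            except ImportError:
                continue
        raise ImportError(f'cannot import {spec.name}')

    def exec_module(self, module):
        pass  # the module was already executed by the real import above
```

An add-on would register an instance on sys.meta_path, after which `from myaddon.extern import library1` resolves to `myaddon._vendor.library1`, or to a globally installed library1 as a fallback.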

The pip automation to update vendored packages takes the following steps:

  • Delete everything in the _vendor/ subdirectory except the documentation, the __init__.py file and the requirements text file.
  • Use pip to install all vendored dependencies into that directory, using a dedicated requirements file named vendor.txt, avoiding compilation of .pyc bytecache files and ignoring transitive dependencies (these are assumed to be listed in vendor.txt already); the command used is pip install -t pip/_vendor -r pip/_vendor/vendor.txt --no-compile --no-deps.
  • Delete everything that was installed by pip but not needed in a vendored environment, i.e. *.dist-info, *.egg-info, the bin directory, and a few things from installed dependencies that pip would never use.
  • Collect the names of all installed directories and files, minus their .py extension (so anything not in the whitelist); this is the vendored_libs list.
  • Rewrite imports; this is simply a series of regexes, where every name in vendored_libs is used to replace import <name> occurrences with import pip._vendor.<name> and every from <name>(.*) import occurrence with from pip._vendor.<name>(.*) import.
  • Apply a few patches to mop up the remaining changes needed; from a vendoring perspective, only the pip patch for requests is interesting here in that it updates the requests library backwards compatibility layer for the vendored packages that the requests library had removed; this patch is quite meta!

So in essence, the most important part of the pip approach, the rewriting of vendored package imports, is quite simple; paraphrased to simplify the logic and remove the pip-specific parts, it is the following process:

import shutil
import subprocess
import re

from functools import partial
from itertools import chain
from pathlib import Path

WHITELIST = {'README.txt', '__init__.py', 'vendor.txt'}

def delete_all(*paths, whitelist=frozenset()):
    for item in paths:
        if item.is_dir():
            shutil.rmtree(item, ignore_errors=True)
        elif item.is_file() and item.name not in whitelist:
            item.unlink()

def iter_subtree(path):
    """Recursively yield all files in a subtree, depth-first"""
    if not path.is_dir():
        if path.is_file():
            yield path
        return
    for item in path.iterdir():
        if item.is_dir():
            yield from iter_subtree(item)
        elif item.is_file():
            yield item

def patch_vendor_imports(file, replacements):
    text = file.read_text('utf8')
    for replacement in replacements:
        text = replacement(text)
    file.write_text(text, 'utf8')

def find_vendored_libs(vendor_dir, whitelist):
    vendored_libs = []
    paths = []
    for item in vendor_dir.iterdir():
        if item.is_dir():
            vendored_libs.append(item.name)
        elif item.is_file() and item.name not in whitelist:
            vendored_libs.append(item.stem)  # without extension
        else:  # a whitelisted file, or neither a dir nor a file: skip
            continue
        paths.append(item)
    return vendored_libs, paths

def vendor(vendor_dir):
    # target package is <parent>.<vendor_dir>; foo/_vendor -> foo._vendor
    pkgname = f'{vendor_dir.parent.name}.{vendor_dir.name}'

    # remove everything
    delete_all(*vendor_dir.iterdir(), whitelist=WHITELIST)

    # install with pip
    subprocess.run([
        'pip', 'install', '-t', str(vendor_dir),
        '-r', str(vendor_dir / 'vendor.txt'),
        '--no-compile', '--no-deps'
    ])

    # delete stuff that's not needed
    delete_all(
        *vendor_dir.glob('*.dist-info'),
        *vendor_dir.glob('*.egg-info'),
        vendor_dir / 'bin')

    vendored_libs, paths = find_vendored_libs(vendor_dir, WHITELIST)

    replacements = []
    for lib in vendored_libs:
        replacements += (
            partial(  # import bar -> import foo._vendor.bar
                re.compile(r'(^\s*)import {}\n'.format(lib), flags=re.M).sub,
                r'\1from {} import {}\n'.format(pkgname, lib)
            ),
            partial(  # from bar -> from foo._vendor.bar
                re.compile(r'(^\s*)from {}(\.|\s+)'.format(lib), flags=re.M).sub,
                r'\1from {}.{}\2'.format(pkgname, lib)
            ),
        )

    for file in chain.from_iterable(map(iter_subtree, paths)):
        patch_vendor_imports(file, replacements)

if __name__ == '__main__':
    # assumes this script sits next to the foo/ package that contains foo/_vendor
    here = Path(__file__).resolve().parent
    vendor_dir = here / 'foo' / '_vendor'
    assert (vendor_dir / 'vendor.txt').exists(), '_vendor/vendor.txt file not found'
    assert (vendor_dir / '__init__.py').exists(), '_vendor/__init__.py file not found'
    vendor(vendor_dir)
Ambros answered 6/10, 2018 at 17:42 Comment(16)
Is iter_subtree different than os.walk(topdown=False)?Culicid
@Ben: os.walk() doesn't give you pathlib paths, only strings; iter_subtree() works exclusively with pathlib.Path instances.Ambros
I see. So you're writing it to avoid creating a new Path from each string returned? Thanks for the explanationCulicid
Thank you, Martijn! Your answer was very insightful and helped clear up a lot of the confusion I had on these different approaches. If anything, it saved me from going down even more of a rabbit-hole trying to find the perfect solution I envisioned which obviously does not exist. I also very much appreciate that you even took the time to modify the pip script to get me started with automating my own vendoring. I haven't had a chance to give it a try, yet, but it looks very promising! I just awarded you the bounty as your answer absolutely deserved it. Thanks again!Gyn
@MartijnPieters (1) thanks again for posting the script! I had a chance to try it out recently and it's been working really well. The only issue I keep stumbling on is that it doesn't rewrite absolute imports of the form import foo.bar. I've tried expanding the RegEx replacements to account for that, but have not had much success so far.Gyn
(2) I guess the core issue here is that I can't think of any exact relative import equivalent of these types of import statements. From what I've gathered, import foo.bar will both evaluate foo and foo.bar, but only foo ends up in the namespace. The nearest relative equivalent would be from .foo import bar, but that puts bar into the namespace and not foo. Do you have any idea how I could address this? It definitely feels like I'm missing something obvious.Gyn
@Glutanimate: I fear you’ll have to use multiple imports: from . import foo and from .foo import bar as _. The first adds foo to the namespace, the latter ensures foo.bar exists as an attribute but assigns the nested module to the name _, universally recognised as this name is ignored.Ambros
@Gyn the second from .top import nested as _ may not be needed if the top-level name itself imports the nested name. os.path always exists because os uses it, and the same goes for many other such nested packages. So experiment and decide how much risk it is to assume future releases won’t change that relationship vs the risk of maintaining more complicated replacement rules.Ambros
It is true that requests moved away from vendoring, but can you also shed some light on how that move help solve the problem which was presumably the reason that the vendoring was desirable? In other words, an application can avoid vendoring and ask user to simply use virtual env, but how can a LIBRARY Foo which depends on Bar 1.x allows its downstream app to use Bar==2.0? Any comments on this?Planospore
@RayLuo: pip gained the ability to tell you your dependencies are broken, is what changed. There is no good way of allowing the downstream app to use Bar==2.0, no, that's way way too complex a problem to solve in comments on SO at any rate.Ambros
This script will work if you install the addon manually, but in practice there's a big issue due to how Anki addons are typically distributed: when you upload an addon to Ankiweb it gets assigned an effectively random ID which becomes the name of that addon's folder, and therefore the module name, when people download it through the app. It's possible to update an addon after the fact, meaning you could hypothetically hardcode that ID somewhere, except it's entirely numeric, making it incompatible with import statements -- you would need to also replace them with the functional form.Kitti
For the reasons above, if it's necessary to rewrite all the absolute import statements, it's a much better idea in this context to turn them into relative ones. Which of course means keeping track of subdirectory depth and prepending the right amount of . I couldn't find any existing solutions to help with this, so I'm sure that anyone who wants to have a go would be appreciated.Kitti
@Kitti it shouldn’t be that hard to retool the replacements based on how many elements there are in each filename that’s being edited. Note that the OP never mentioned those additional constraints. I’m sorry that you found that my answer doesn’t work for you but that’s the fault of the question not including an accurate description of the Ankiweb upload system.Ambros
Yeah @Gyn is well known for his published addons so I'm also surprised he didn't bring it up - maybe he didn't actually use this solution in the end. I guess another way around those constraints could be to add a stable alias for your addon in sys.modules to use, like: sys.modules["myuniqueaddonname"] = sys.modules[__name__]. There's still technically a chance of collision but at least you can minimize that. Still I think relative imports are the way to go, just as they are in addon code you write yourself.Kitti
@Inkling: If Anki extensions are installed under a numeric ID, you can still use importlib.import_module() to import them, or access them via sys.modules[identifier]. Instead of myuniqueaddonname, I'd generate a globally unique module name (with f"anki_{uuid.uuid3(uuid.NAMESPACE_DNS, 'ankiweb.net')}" perhaps, which would make collisions next-to-impossible), once, per plugin project.Ambros
@Inkling: Then use that name as the vendorising package parent namespace, use sys.modules[unique_name] = sys.modules[__name__] so the vendorised imports work, and reference .vendor.[some_package] in your own code.Ambros

How about making your anki_addons folder a package and importing the required libraries in __init__.py in the main package folder?

So it'd be something like

anki/
__init__.py

In anki.__init__.py :

from anki_addons import library1

In anki.anki_addons.__init__.py :

from addon_name_1 import *

I'm new at this, so please bear with me here.

Hanging answered 3/10, 2018 at 16:10 Comment(5)
Thanks for the response! Really appreciate it. Just to clarify: I don't have any control over how add-ons are imported from the anki_addons folder. anki_addons is just the directory provided by the base app where all add-ons are installed into. It is added to the sys path, so the add-on packages therein pretty much just behave like any other python package located in Python's module look-up paths.Gyn
But to follow along with your train of thought: The failure point always occurs when Python tries to import a package through an unresolvable absolute import. E.g.: Importing library1 would fail because its dependency, library2, would not be found in the module look-up path. And as Python 3 no longer supports implicit relative imports, even moving library2 into library1's package folder would not work without adjusting library1's source code for explicit relative imports.Gyn
Even worse, a lot of packages will use their own package name for in-package imports (e.g. import library1.module1 within library1/__init__.py). So even without any dependencies, packaging third-party packages can fail.Gyn
Thanks for clarifying. Really appreciate it. If this could be of some help.Hanging
In my experience, we've used conda and sometimes even pipenv, but those were smaller projects. I'm not sure if they could be of help in your case, but just putting it out there.Hanging

To extend the excellent reply from Martijn Pieters: since pip 20.0, pip has been using a dedicated CLI tool for vendoring dependencies. The tool is called vendoring and seems to be mainly focused on pip's needs, but I am hopeful that it can become a great framework for any project with similar needs.

As I'm writing this, they don't have user-facing documentation yet: https://github.com/pradyunsg/vendoring/issues/3

It is configurable via the pyproject.toml file:

[tool.vendoring]
destination = "src/pip/_vendor/"
requirements = "src/pip/_vendor/vendor.txt"
namespace = "pip._vendor"

protected-files = ["__init__.py", "README.rst", "vendor.txt"]
patches-dir = "tools/vendoring/patches"

It can be installed in a virtual environment as follows:

$ pip install vendoring

And it seems to work as follows:

$ vendoring sync /path/to/location    # Install dependencies in destination folder
$ vendoring update /path/to/location  # Update vendoring dependencies

EDIT:

I have been using this tool on a python plugin for a compositor software. More info about it here: https://nomenclator-nuke.readthedocs.io/en/stable/installing.html#managing-external-dependencies

Hentrich answered 11/8, 2021 at 22:16 Comment(0)

The best way to bundle dependencies is to use a virtualenv. The Anki project should at least be able to install inside one.

I think what you are after is namespace packages.

https://packaging.python.org/guides/packaging-namespace-packages/

I would imagine that the main Anki project has a setup.py and every add-on has its own setup.py and can be installed from its own source distribution. Then the add-ons can list their dependencies in their own setup.py and pip will install them in site-packages.

Namespace packages only solves part of the problem and as you said you don't have any control over how add-ons are imported from the anki_addons folder. I think designing how add-ons are imported and packaging them goes hand-in-hand.

The pkgutil module provides a way for the main project to discover the installed add-ons. https://packaging.python.org/guides/creating-and-discovering-plugins/
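That discovery pattern can be sketched as follows (the naming convention is hypothetical):

```python
import importlib
import pkgutil

def discover_addons(prefix='anki_addon_'):
    """Import every top-level module on sys.path whose name starts with
    `prefix` (the prefix convention here is purely illustrative)."""
    return {
        name: importlib.import_module(name)
        for _, name, _ in pkgutil.iter_modules()
        if name.startswith(prefix)
    }
```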

A project that uses this extensively is Zope. http://www.zope.org

Have a look here: https://github.com/zopefoundation/zope.interface/blob/master/setup.py

Reptant answered 6/10, 2018 at 12:11 Comment(3)
No, this is not what the OP is asking for. They are not looking for a toplevel.packagename setup, they are asking how to best bundle a series of dependent packages and not install them in site-packages at the top level.Ambros
@MartijnPieters The best way to bundle dependencies is to use a virtualenv.Reptant
@EddyPronk: yes, that's certainly something I'm advocating in my answer. But what has that got to do with namespace packages?Ambros