python pip priority order with index-url and extra-index-url
33

I searched a bit but could not find a clear answer.
The goal is to have two pip indexes: a private index that takes first priority, and the standard PyPI. The priority is meant to prevent the security risk of dependency confusion, i.e. code injection via a same-named public package.

Say I have a library named lib, and I configure index-url = http://my_private_pypi_repo and extra-index-url = https://pypi.org/simple

If I run pip install lib and lib exists in both indexes, which index gets priority? Which one will it be installed from?

Also, if I run pip install lib==0.0.2 but my private index only has lib at version 0.0.1, will pip look at PyPI as well?

And what is a good way to ensure that certain libraries are fetched only from the private index when they exist there, and are never looked up on PyPI?

Outlying answered 25/4, 2021 at 11:59 Comment(0)
31

The short answer is: there is no prioritization and you probably should avoid using --extra-index-url entirely.


This is asked and answered here: https://github.com/pypa/pip/issues/5045#issuecomment-369521345

Question:

I have this in my pip.conf:

[global]
index-url = https://myregistry-xyz.com
extra-index-url = https://pypi.python.org/pypi

Let's assume packageX exists in both registries and I run pip install packageX.

I expect pip to install packageX from https://myregistry-xyz.com, but pip will use https://pypi.python.org/pypi instead.

If I switch the values for index-url and extra-index-url I get the same result. pypi is always prioritized.

Answer:

Packages are expected to be unique up to name and version, so two wheels with the same package name and version are treated as indistinguishable by pip. This is a deliberate feature of the package metadata, and not likely to change.


I would also recommend reading this discussion: https://discuss.python.org/t/dependency-notation-including-the-index-url/5659

Quite a lot of things are addressed in that discussion, some clearly out of scope for this question, but all of it is informative.

The key takeaway for you is this:

Pip does not really prioritize one index over the other in theory. In practice, because of a coincidence in the way things are implemented in code, it might be that one is always checked first, but it is not a behavior you should rely on.

And what is a good way to ensure that certain libraries are fetched only from the private index when they exist there, and are never looked up on PyPI?

You should set up and curate your own package index (devpi, pydist, JFrog Artifactory, Sonatype Nexus, etc.) and use it exclusively, meaning: never use --extra-index-url. This is the only way to have exact control over what gets downloaded. The custom repository can function mostly as a proxy for the public PyPI, except for a couple of dependencies.
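As a concrete sketch of "use it exclusively": pip can be pinned to a single index through its configuration file. The index URL below is a placeholder for your own curated proxy, and PIP_CONFIG_FILE is just a convenient way to scope the configuration to one shell or CI job instead of the user-wide pip.conf.

```shell
# Write a pip configuration that names exactly one index and no
# extra-index-url line, so pip can never fall back to public PyPI.
# The URL is hypothetical; substitute your own proxying index.
export PIP_CONFIG_FILE="$PWD/pip.private.conf"
cat > "$PIP_CONFIG_FILE" <<'EOF'
[global]
index-url = https://pypi.internal.example.com/simple
EOF

# Every subsequent `pip install <package>` in this shell now resolves
# only against the private index.
```

The same two lines can instead live in the user-wide pip.conf (~/.config/pip/pip.conf on Linux) if you want the restriction to apply everywhere.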


For a potential solution to some of the problems that lead people to ask about index priority order, keep an eye on "PEP 708 – Extending the Repository API to Mitigate Dependency Confusion Attacks"



Grovergroves answered 2/5, 2021 at 15:51 Comment(0)
0

Assume you already have a private pip repository and want to serve a patched version of a package from it, while everything else comes from the public repository.

There is no priority between --index-url and --extra-index-url, but we can handle this at the private repo level.

If we look at the pip repo structure it is something like this:

├── bar
│   ├── bar-0.1.tar.gz
│   ├── bar-0.2.tar.gz
│   └── index.html  ## 1
├── foo
│   ├── foo-1.0.tar.gz
│   ├── foo-2.0.tar.gz
│   └── index.html
└── index.html  ## 2

Each per-package index.html lists all the files in its directory and simply contains a link to each file.

<!DOCTYPE html>
<html>
<head>
 <meta http-equiv="content-type" content="text/html; charset=windows-1252">
 <title>Bar Python Packages</title>
</head>
<body>
<pre>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="./bar-2.0.2-py3-none-any.whl">bar-2.0.2-py3-none-any.whl</a>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="./bar-3.0.0-py3-none-any.whl">bar-3.0.0-py3-none-any.whl</a>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="./bar-4.0.0-py3-none-any.whl">bar-4.0.0-py3-none-any.whl</a>
 ..
</pre>
</body>
</html>

But the links need not be local; we can point one at the public PyPI instead, like this:

<!DOCTYPE html>
<html>
<head>
 <meta http-equiv="content-type" content="text/html; charset=windows-1252">
 <title>Bar Python Packages</title>
</head>
<body>
<pre>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="./bar-2.0.2-py3-none-any.whl">bar-2.0.2-py3-none-any.whl</a>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="./bar-3.0.0-py3-none-any.whl">bar-3.0.0-py3-none-any.whl</a>
 45.63 KiB  2024-06-20T11:22:29Z  <a href="https://files.pythonhosted.org/packages/02/2b/982217eab772d5e969c04614f6b77c158aee9699201a616cc00c1645326e/bar-4.0.0-py3-none-any.whl">bar-4.0.0-py3-none-any.whl</a>
 ..
</pre>
</body>
</html>

Now we take this second index.html and place it at /bar/index.html in the private pip index.

If we now run pip install "bar==2.0.2" --index-url="<path-to-private-repo>", pip uses the .whl file hosted on the private index; if we run pip install "bar==4.0.0" --index-url="<path-to-private-repo>", the link points to public PyPI and the wheel is downloaded from there.

To get the URLs of .whl files on public PyPI, we can query https://pypi.org/pypi/<package>/<version>/json and read the url field of each entry in its urls array (the releases.<version> listing in https://pypi.org/pypi/<package>/json carries the same file data).
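A small sketch of pulling those file URLs out of the JSON API response. The split into a pure helper plus a fetcher, and both function names, are my own arrangement rather than anything from the answer:

```python
import json
from urllib.request import urlopen

# Per-version endpoint of PyPI's JSON API.
PYPI_JSON = "https://pypi.org/pypi/{package}/{version}/json"

def extract_file_urls(data: dict) -> list[str]:
    # The "urls" key lists every file (sdist and wheels) uploaded
    # for the requested version; each entry carries its download URL.
    return [f["url"] for f in data.get("urls", [])]

def release_file_urls(package: str, version: str) -> list[str]:
    # Fetch the metadata for one release and return its file URLs.
    with urlopen(PYPI_JSON.format(package=package, version=version)) as resp:
        return extract_file_urls(json.load(resp))
```

Something like release_file_urls("bar", "4.0.0") would then return the files.pythonhosted.org links to paste into the private index.html.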

One downside of this approach is that we have to keep up with new releases of the package and keep updating the index.html in the private repo.
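That maintenance burden can be softened by regenerating the page from the files on disk. A minimal sketch (a hypothetical helper of my own, not part of the answer; a production index should also carry hash fragments on the links, as the simple-repository spec recommends):

```python
from pathlib import Path

def write_simple_index(pkg_dir: Path) -> str:
    # Link every wheel and sdist found in the package directory,
    # in the one-anchor-per-file shape pip's "simple" API expects.
    links = "\n".join(
        f'<a href="./{f.name}">{f.name}</a>'
        for f in sorted(pkg_dir.iterdir())
        if f.name.endswith((".whl", ".tar.gz"))
    )
    html = f"<!DOCTYPE html>\n<html><body>\n{links}\n</body></html>\n"
    (pkg_dir / "index.html").write_text(html)
    return html
```

Run against each package directory after uploading a new file, this keeps the local links current; the hand-edited PyPI redirect lines would still need to be merged in separately.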

Waikiki answered 12/7 at 6:54 Comment(0)
-3

The title of this question feels a bit like an instance of the XY problem. If you elaborate on what you want to achieve and what your constraints are, we may be able to give you a better answer.

That said, sinoroc's suggestion to curate your own package index and use only that is a good one. A few other ideas also come to mind:

  • Update: It turns out pip may install distributions other than those listed in the constraints file, so this method should probably be considered insecure. Additionally, hashes are somewhat broken on recent releases of pip.

    Using a constraints file with hashes. The file can be generated with pip-tools, e.g. pip-compile --generate-hashes, assuming you have listed your dependencies in a file named requirements.in. You can then install packages with pip install -c requirements.txt some_package.

    • Pro: What may be installed is documented alongside your code in your VCS.
    • Con: Controlling what is downloaded the first time is either tricky or laborious.
    • Con: Hash checking can be slow.
    • Con: You run into issues more frequently than when not using hashes. Some can be worked around, others cannot; it is for instance not possible to combine constraints like -e file:// with hashes.
  • Use an alternative packaging tool like pipenv. It works similarly to the previous suggestion.

    • Pro: Easy to use
    • Con: Harder to integrate into your workflow if it does not fit naturally.
  • Curate packages locally. Packages and dependencies can be downloaded like pip download --dest some_dir some_package and installed like pip install --no-index --find-links some_dir.

    • Pro: What may be installed can be documented alongside your code, if you track the artifacts in VCS e.g. git lfs.
    • Con: Either all packages are downloaded or none are.
  • Use a hermetic build system. I know Bazel advertises this as a feature; not sure about others like Pants and Buck.

    • Pro: May be the ultimate solution if you want control over your builds.
    • Con: Does not integrate well with open source python ecosystem afaik.
    • Con: A lot of overhead.
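The "curate packages locally" bullet above comes down to two commands; some_package is a placeholder from the text, not a real project:

```shell
# Fetch a package plus its entire dependency tree into a local
# directory ("wheelhouse"), typically on a trusted machine:
pip download --dest ./wheelhouse some_package

# Later installs use only that directory; --no-index means no index,
# public or private, is ever contacted:
pip install --no-index --find-links ./wheelhouse some_package
```

Committing the wheelhouse (e.g. via git lfs) is what makes the installable set auditable alongside the code.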

1: https://en.wikipedia.org/wiki/XY_problem

Kutchins answered 6/5, 2021 at 19:57 Comment(4)
I think it is fairly clear. I have my private PyPI index at GitLab. I don't want a pip install command to accidentally download a package from the public PyPI if someone happens to create such a package there. This is obviously a security issue. – Outlying
In that case I definitely think sinoroc's suggestion is the way to go. Out of curiosity: 1. Will pip install be run only as part of automated workloads, or manually as well? 2. Are you concerned about malicious releases of projects? 3. Are you concerned about name squatting? – Kutchins
1. Both manually and automated. 2. Yes, that's the main issue here. 3. What is 'name squatting'? – Outlying
Thanks for entertaining my curiosity. Typo-squatting may be a better term; what I refer to is the practice of malicious actors uploading an altered project to PyPI under a similar name, hoping that someone will download their malicious distribution by accident. They may for instance fork numpy, embed some malware, and publish it as nunpy. It appears to be a mild threat in practice. In fact, this is the only incident I remember hearing of. – Kutchins

© 2022 - 2024 — McMap. All rights reserved.