Installing pandas in docker Alpine
I am having a really hard time installing a stable data science package configuration in Docker. This should be easier with such mainstream, relevant tools.

The following Dockerfile used to work, with a bit of a hack: pandas is removed from the core package list and installed separately, pinned to pandas<0.21.0, because higher versions allegedly conflict with numpy.

    FROM alpine:3.6

    ENV PACKAGES="\
    dumb-init \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    bash \
    git \
    ca-certificates \
    freetype \
    libgfortran \
    libgcc \
    libstdc++ \
    openblas \
    tcl \
    tk \
    libssl1.0 \
    "

    ENV PYTHON_PACKAGES="\
        numpy \
        matplotlib \
        scipy \
        scikit-learn \
        nltk \
        "

    RUN apk add --no-cache --virtual build-dependencies python3 \
        && apk add --virtual build-runtime \
        build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
        && ln -s /usr/include/locale.h /usr/include/xlocale.h \
        && python3 -m ensurepip \
        && rm -r /usr/lib/python*/ensurepip \
        && pip3 install --upgrade pip setuptools \
        && ln -sf /usr/bin/python3 /usr/bin/python \
        && ln -sf pip3 /usr/bin/pip \
        && rm -r /root/.cache \
        && pip install --no-cache-dir $PYTHON_PACKAGES \
        # <---------- PANDAS, installed separately with the version pin
        && pip3 install 'pandas<0.21.0' \
        && apk del build-runtime \
        && apk add --no-cache --virtual build-dependencies $PACKAGES \
        && rm -rf /var/cache/apk/*

    # set working directory
    WORKDIR /usr/src/app

    # add and install requirements (packages other than the data science ones go here)
    COPY ./requirements.txt /usr/src/app/requirements.txt
    RUN pip install -r requirements.txt

    # add entrypoint.sh
    COPY ./entrypoint.sh /usr/src/app/entrypoint.sh

    RUN chmod +x /usr/src/app/entrypoint.sh

    # add app
    COPY . /usr/src/app

    # run server
    CMD ["/usr/src/app/entrypoint.sh"]

The configuration above used to work. What happens now is that the build goes through, but pandas fails at import with the following error:

    ImportError: Missing required dependencies ['numpy']

Since numpy 1.16.1 was installed, I don't know which numpy pandas is trying to find anymore...
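
To narrow this down, a quick check (a sketch; `<image>` is a placeholder for the built image tag) is to ask the container which interpreter it runs and whether numpy itself imports:

    # sketch: check the interpreter and whether numpy imports inside the image
    docker run --rm <image> python -c "import sys; print(sys.executable)"
    docker run --rm <image> python -c "import numpy; print(numpy.__version__, numpy.__file__)"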

Does anyone know how to obtain a stable solution for this?

NOTE: a solution that pulls a turnkey data science Docker image with at least the packages mentioned above into the Dockerfile would also be very welcome.


EDIT 1:

If I move the install of the data science packages into requirements.txt, as suggested in the comments, like so:

requirements.txt

    (...)
    numpy==1.16.1 # or numpy==1.16.0
    scikit-learn==0.20.2
    scipy==1.2.1
    nltk==3.4
    pandas==0.24.1 # or pandas==0.23.4
    matplotlib==3.0.2
    (...)

and Dockerfile:

    # add and install requirements
    COPY ./requirements.txt /usr/src/app/requirements.txt
    RUN pip install -r requirements.txt

It breaks again at pandas, complaining about numpy.

Collecting numpy==1.16.1 (from -r requirements.txt (line 61))
  Downloading https://files.pythonhosted.org/packages/2b/26/07472b0de91851b6656cbc86e2f0d5d3a3128e7580f23295ef58b6862d6c/numpy-1.16.1.zip (5.1MB)
Collecting scikit-learn==0.20.2 (from -r requirements.txt (line 62))
  Downloading https://files.pythonhosted.org/packages/49/0e/8312ac2d7f38537361b943c8cde4b16dadcc9389760bb855323b67bac091/scikit-learn-0.20.2.tar.gz (10.3MB)
Collecting scipy==1.2.1 (from -r requirements.txt (line 63))
  Downloading https://files.pythonhosted.org/packages/a9/b4/5598a706697d1e2929eaf7fe68898ef4bea76e4950b9efbe1ef396b8813a/scipy-1.2.1.tar.gz (23.1MB)
Collecting nltk==3.4 (from -r requirements.txt (line 64))
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
Collecting pandas==0.24.1 (from -r requirements.txt (line 65))
  Downloading https://files.pythonhosted.org/packages/81/fd/b1f17f7dc914047cd1df9d6813b944ee446973baafe8106e4458bfb68884/pandas-0.24.1.tar.gz (11.8MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 359, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 732, in <module>
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 475, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1144, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 361, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-_e5z6o6_/pandas/

EDIT 2:

This seems like an open pandas issue. For more details, please refer to:

pandas-dev github

"Unfortunately, this means that a requirements.txt file is insufficient for setting up a new environment with pandas installed (like in a docker container)".

  **ImportError**:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the multiarray numpy extension module failed.  Most
  likely you are trying to import a failed build of numpy.
  Here is how to proceed:
  - If you're working with a numpy git repository, try `git clean -xdf`
    (removes all files not under version control) and rebuild numpy.
  - If you are simply trying to use the numpy version that you have installed:
    your installation is broken - please reinstall numpy.
  - If you have already reinstalled and that did not fix the problem, then:
    1. Check that you are using the Python you expect (you're using /usr/local/bin/python),
       and that you have no directories in your PATH or PYTHONPATH that can
       interfere with the Python and numpy versions you're trying to use.
    2. If (1) looks fine, you can open a new issue at
       https://github.com/numpy/numpy/issues.  Please include details on:
       - how you installed Python
       - how you installed numpy
       - your operating system
       - whether or not you have multiple versions of Python installed
       - if you built from source, your compiler versions and ideally a build log
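
A workaround that follows from the issue above (a sketch, untested here) is to install numpy in a separate pip invocation before the rest of requirements.txt, so that numpy is already importable by the time pandas' setup.py egg_info runs:

    # sketch: make numpy importable before pandas' build script runs
    RUN pip install numpy==1.16.1 \
        && pip install -r requirements.txt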

EDIT 3:

requirements.txt ---> https://pastebin.com/0icnx0iu


EDIT 4:

As of 01/12/20, the accepted solution stopped working. The build now breaks not at pandas but at scipy (after numpy builds successfully), while building scipy's wheel. This is the log:

  ----------------------------------------
  ERROR: Failed building wheel for scipy
  Running setup.py clean for scipy
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"'; __file__='"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
       cwd: /tmp/pip-install-s6nahssd/scipy
  Complete output (9 lines):

  `setup.py clean` is not supported, use one of the following instead:

    - `git clean -xdf` (cleans all files)
    - `git clean -Xdf` (cleans all versioned files, doesn't touch
                        files that aren't checked into the git repo)

  Add `--force` to your command to use it anyway if you must (unsupported).

  ----------------------------------------
  ERROR: Failed cleaning build dir for scipy
Successfully built numpy
Failed to build scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly

From the error, it seems that the build process is using python3.6, while I use FROM alpine:3.7.
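
A quick sanity check (a sketch) can confirm which interpreter and pip the build is actually bound to:

    # sketch: print the interpreter and pip versions used inside the image
    RUN which python3 && python3 --version && pip3 --version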

Full log here -> https://pastebin.com/Tw4ubxSA

And this is the current Dockerfile:

https://pastebin.com/3SftEufx

Positively answered 26/2, 2019 at 16:45 Comment(13)
You mentioned "specifying pandas<0.21.0, because, allegedly, higher versions conflict with numpy", have you actually experienced issues between pandas 0.24.1 and numpy? I have been using this version since release every day and I have not experienced any conflict issue with numpy.Scarab
well in the context above, if I point to Collecting pandas==0.24.1, I get the error: File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 346, in get_provider module = sys.modules[moduleOrReq] KeyError: 'numpy'Positively
uhm.. Have you tried putting your libraries in a requirements.txt file, COPY the file to your container and RUN pip install -r requirements.txt? That is how I usually install python libraries in my docker projectsScarab
tried, to no avail. please refer to my edit.Positively
Use conda/pipenv/poetry environments to create working dependencies for your project locally. Copy the appropriate files into docker (e.g. Pipfile and Pipfile.lock) with the COPY directive and activate your environment there. You should be able to easily run your code and make the Dockerfile itself more readable.Eladiaelaeoptene
care to answer using the dockerfile above with conda? would be really appreciated and upvoted.Positively
I am unable to run conda inside the alpine image, care if I use ubuntu? As alpine does not provide glibc but uses musl it creates a lot of problems with dumb workarounds like here. What are you trying to achieve, what is your end goal?Eladiaelaeoptene
ok, the more stable, the better. I need a self consistent core data package install, like above, with room for many installs in a requirements.txtPositively
Can you try running pip install numpy --upgrade? Just wondering if an older version of numpy would already be installed and creates a conflict.Scarab
why do you want to build it yourself? You can find tons of already working Containers for datascience applications on Dockerhub, for example an Anaconda container would be sufficient. I think even nltk is in there by default, so you could just use such a 'turnkey' container.Geriatrician
You could answer with an example of such an install and I could accept thatPositively
Or just use the pandas alpine package..?Graniela
https://mcmap.net/q/103397/-why-does-it-take-ages-to-install-pandas-on-alpine-linuxGraniela

If you're not bound to Alpine 3.6, using Alpine 3.7 (or later) should work.

On Alpine 3.6, installing matplotlib failed for me with the following:

Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz (36.6MB)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    No local packages or working download links found for numpy>=1.10.0

However, on Alpine 3.7, it worked. This may be due to a numpy versioning issue (see here), but I'm not able to tell for sure. Past that problem, packages were built and installed successfully - taking a good while, about 30 minutes (since Alpine's musl libc is not compatible with Python's precompiled manylinux wheels, all packages installed with pip have to be built from source).

Note that one important change is needed: you should only remove the build-runtime virtual package (apk del build-runtime) after pip install. Also, if applicable, you could replace numpy 1.16.1 with 1.16.2, which is the shipped version (otherwise 1.16.2 will be uninstalled and 1.16.1 built from source, further increasing the build time) - I haven't tried this, though.
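
To illustrate the ordering point, here is a sketch (adapted from the question's Dockerfile, not a verbatim excerpt): runtime packages are installed up front, and the build-runtime virtual package is deleted only after pip install has completed:

    # sketch: keep build dependencies around until pip install has finished
    RUN apk add --no-cache $PACKAGES \
        && apk add --no-cache --virtual build-runtime \
           build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
        && pip install --no-cache-dir $PYTHON_PACKAGES \
        && apk del build-runtime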

For reference, here's my slightly modified Dockerfile and docker build output.

Note:

Usually Alpine is chosen as the base image to minimize image size (Alpine is otherwise very slick, but has compatibility issues with mainstream Linux apps due to glibc/musl). Having to build Python packages from source rather defeats that purpose, since you get a very bloated image - 900MB before any cleanup - which also takes ages to build. The image could be greatly compacted by removing all intermediate compilation artifacts, build dependencies etc., but still.

If you can't get the Python package versions you need to work on Alpine without building them from source, I would suggest trying other small but more compatible base images, such as debian-slim or even ubuntu.
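
For comparison, a minimal sketch of the debian-slim route (versions unpinned; on a glibc base, pip can use prebuilt wheels, so nothing is compiled):

    FROM python:3.7-slim
    RUN pip install --no-cache-dir numpy scipy pandas matplotlib scikit-learn nltk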

Edit:

Following "Edit 3" with the added requirements, here are the updated Dockerfile and Docker build output. The following packages were added to satisfy build dependencies:

postgresql-dev libffi-dev libressl-dev libxml2 libxml2-dev libxslt libxslt-dev libjpeg-turbo-dev zlib-dev

For packages that failed to build due to missing headers, I used Alpine's package contents search to locate the package providing them. Specifically, for cffi, the ffi.h header was missing, which requires the libffi-dev package: https://pkgs.alpinelinux.org/contents?file=ffi.h&path=&name=&branch=v3.7.

Alternatively, when the cause of a package build failure is not clear, the installation instructions of the specific package can be consulted, for example Pillow's.

The new image size, before any compaction, is 1.04GB. For cutting it down a bit, you could remove the Python and pip caches:

    RUN apk del build-runtime && \
        find -type d -name __pycache__ -prune -exec rm -rf {} \; && \
        rm -rf ~/.cache/pip

This will bring the image size down to 661MB when using docker build --squash.
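
For reference, a hypothetical invocation (the myimage tag is made up; --squash requires the Docker daemon's experimental features to be enabled):

    docker build --squash -t myimage .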

Calie answered 28/2, 2019 at 20:50 Comment(5)
still breaking for me, when it gets to pip install -r requirements.txt (precisely at Collecting pycparser (from cffi>=1.1)). I'll edit the question with a pastebin of my requirements.txt, if you wish to test with the full build.Positively
@data_garden added updated Dockerfile - please see my updated answerCalie
this worked, thanks a lot for your effort and clarity. +1Positively
@data_garden awesome, with pleasure. Please also see my recent edit, for cutting down image size a bit. Cheers!Calie
Solution is not working anymore. Please refer to my edit.Positively

Try adding this to your requirements.txt file:

    numpy==1.16.0
    pandas==0.23.4

I've been facing the same error since yesterday and this change solved it for me.

Crease answered 27/2, 2019 at 8:51 Comment(2)
Yes. This was my base image: FROM openjdk:8-alpineCrease
maybe you could try with FROM alpine:3.6, adding the other data packages above, and see if it still works in your env..Positively

    FROM python:3.8-alpine

    RUN apk --update add gcc build-base freetype-dev libpng-dev openblas-dev

    RUN pip install --no-cache-dir matplotlib pandas
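
A hypothetical build-and-verify step for the image above (the pandas-alpine tag is made up):

    docker build -t pandas-alpine .
    docker run --rm pandas-alpine python -c "import pandas; print(pandas.__version__)"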

Currycomb answered 9/2, 2022 at 17:21 Comment(0)

An older Q&A, Why does it take ages to install Pandas on Alpine Linux, is related.

If your aim is to get a stable solution without knowing the nuts and bolts, for Python 3 you can just build off the following (copy & paste of my answer from https://mcmap.net/q/103397/-why-does-it-take-ages-to-install-pandas-on-alpine-linux):

    FROM python:3.7-alpine
    RUN echo "@testing http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
    RUN apk add --update --no-cache py3-numpy py3-pandas@testing

If your goal is to understand how to achieve a stable build, the discussion there and related images might help too...

Graniela answered 10/12, 2019 at 14:57 Comment(0)

This may not be completely relevant here, but since this is the first answer that pops up when searching for numpy/pandas installation failures on Alpine, I am adding it anyway.

The following fix worked for me (but it makes installing pandas/numpy take longer):

    apk update
    apk --no-cache add curl gcc g++
    ln -s /usr/include/locale.h /usr/include/xlocale.h

Now try installing pandas/numpy
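
For example (a sketch; versions left unpinned):

    pip install --no-cache-dir numpy pandas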

Delorasdelorenzo answered 29/5, 2019 at 10:33 Comment(0)
