How can I mount a GCS bucket in a custom Docker image on AI Platform?
I'm using Google's AI Platform to train machine learning models using a custom Docker image. To run existing code without modifications, I would like to mount a GCS bucket inside the container.

I think one way to achieve this is to install gcloud for authentication and gcsfuse for mounting inside the container. My Dockerfile looks like this:

FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04

WORKDIR /root

# Install system packages.
RUN apt-get update
RUN apt-get install -y curl
# ...

# Install gcsfuse.
RUN echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" | tee /etc/apt/sources.list.d/gcsfuse.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update
RUN apt-get install -y gcsfuse

# Install gcloud.
RUN apt-get install -y apt-transport-https
RUN apt-get install -y ca-certificates
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
RUN apt-get update
RUN apt-get install -y google-cloud-sdk

# ...

ENTRYPOINT ["entrypoint.sh"]

Inside the entry point script, I then try to authenticate with Google cloud and mount the bucket. My entrypoint.sh looks like this:

#!/bin/sh
set -e

gcloud auth login
gcsfuse my-bucket-name /root/output
python3 script.py --logdir /root/output/experiment

I then build the container and run it either locally for testing or remotely on the AI Platform for the full training run:

# Run locally for testing.
nvidia-docker build -t my-image-name .
nvidia-docker run -it --rm my-image-name

# Run on AI Platform for full training run.
nvidia-docker build -t my-image-name .
gcloud auth configure-docker
nvidia-docker push my-image-name
gcloud beta ai-platform jobs submit training --region us-west1 --scale-tier custom --master-machine-type standard_p100 --master-image-uri my-image-name

Both locally and on the AI Platform, the entrypoint.sh script hangs at the line gcloud auth login, probably because it waits for user input. Is there a better way of authenticating with Google Cloud from within the container? If not, how can I automate the line that currently hangs?

Flotage answered 21/10, 2019 at 0:36 Comment(4)
GCS was never designed to be used as a filesystem, and you will get terrible performance if you're modifying files. You should strongly consider using Cloud Filestore instead: cloud.google.com/filestoreWolfe
Then you also won't have to use FUSE, which itself is also slow, and has well-known security issues since you won't be able to permission files properly.Wolfe
I only need to store lightweight log files without sensitive information, so I'm not too worried about it. However, it does seem that my jobs sometimes crash when trying to append to a file (it works 4 out of 5 times). Do you know whether that's unsupported by FUSE, and what the alternative workflow would look like?Flotage
@Flotage Did you manage to successfully mount the storage? How did you run the Docker container with the --privileged flag in the AI Platform training job? When I test it locally I can run docker run -it --privileged img_name and it works, but I don't see an option in the AI Platform.Bierman

Instead of using gcloud auth login, which is primarily meant for human/user authentication, consider using gcloud auth activate-service-account and supplying a key file. See here for details:

https://cloud.google.com/sdk/gcloud/reference/auth/activate-service-account
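
Applied to the question's setup, the entrypoint could activate a service account before mounting, along these lines (a minimal sketch; the key path /root/service-account-key.json and how the key gets into the container are assumptions, see below for ways to provide it externally):

#!/bin/sh
set -e

# Authenticate non-interactively; the key file path is hypothetical and
# should be provided from outside the image (see below).
gcloud auth activate-service-account --key-file=/root/service-account-key.json

gcsfuse my-bucket-name /root/output
python3 script.py --logdir /root/output/experiment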

I would recommend not placing the key file inside the image but instead providing it externally. Another alternative is to recognize that authentication can be implicit via environment variables. Following cloud-native practices, have the environment provide the credentials and don't try to authenticate inside the container at all. If you plan to run your container on GCP Compute Engine or GKE, you can implicitly provide the service account to the container from outside.
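
For local testing, one way to keep the key out of the image is to mount it at run time and point GOOGLE_APPLICATION_CREDENTIALS at it; gcsfuse picks that variable up through application default credentials, while the gcloud CLI itself generally does not and still needs activate-service-account. A sketch, with placeholder paths and image name:

# Mount the key from the host instead of baking it into the image
# (host path, container path, and image name are placeholders).
docker run -it --rm \
    -v /path/on/host/key.json:/etc/gcp/key.json:ro \
    -e GOOGLE_APPLICATION_CREDENTIALS=/etc/gcp/key.json \
    my-image-name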

Sanjuana answered 21/10, 2019 at 0:49 Comment(5)
Thanks! Setting ENV GOOGLE_APPLICATION_CREDENTIALS /root/service-account-key.json worked. I've added the JSON file to .gitignore to avoid checking it into the repository but it's still stored inside the image. Having Google Cloud set the environment variable automatically would be amazing --- is this possible on the AI Platform?Flotage
Can you elaborate further ... maybe with a link ... to what you consider to be the Google AI Platform? I am familiar with GCP Compute Engine for running Docker images and with GCP GKE for Kubernetes ... but I have not come across "Google AI Platform" before. Maybe if you can provide a link to the documentation of this platform that might help us?Sanjuana
AI Platform manages your jobs. I can launch jobs using gcloud beta ai-platform jobs submit training --master-image-uri my-image-name and they get scheduled after a few minutes and the container is deleted once the job finishes. I haven't used Kubernetes but it seems a bit like a Google-managed Kubernetes.Flotage
Ahh ... found it ... reading here ... cloud.google.com/ml-engine/docs/… and this is a guess on my part ... it looks like the container does run GCP authenticated already ... as the user/principal of service-$CMLE_PROJ_NUM@cloud-ml.google.com.iam.gserviceaccount.com. Rather than declaring that you want your container to run as some "other" identity, maybe consider giving authority to this identity to do what you need?Sanjuana
Thanks, that worked. The command is gcloud projects add-iam-policy-binding <project-name> --member serviceAccount:service-<account-number>@cloud-ml.google.com.iam.gserviceaccount.com --role roles/ml.serviceAgent and you can find your AI Platform service account number under credentials. Mounting succeeds now, but the jobs are sometimes crashing when trying to append to a file (works 4 out of 5 times). Maybe that's not supported by buckets?Flotage

If the default service account meets your needs, you can simply let your container run as that identity. You may also be able to give it what it needs by granting it extra permissions.
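
Granting extra permissions to that default account would look roughly like this (a sketch based on the comment thread above; the project ID, service account address, and role are placeholders, and you should pick a role that matches what the job actually needs, e.g. a storage role for gcsfuse):

# Illustrative only: grant a storage role to the project's default
# AI Platform service account (substitute your project and account number).
gcloud projects add-iam-policy-binding my-project \
    --member serviceAccount:service-<account-number>@cloud-ml.google.com.iam.gserviceaccount.com \
    --role roles/storage.objectAdmin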


If you want to use your own service account, you'll need to authenticate as a service account via:

gcloud auth activate-service-account --key-file=somekey.json

That way the container won't hang while asking you to authenticate via a browser. So the obvious next question is:

How do I insert my service account's key into the container?

The Strategy

First, you'll want to generate a key file for whatever service account you do want to use.

It's not a good idea to store credentials in docker images, so I put the key in a script which I then put in a storage bucket. So the container downloads and runs the script, which switches the configured identity to a service account of my choosing.

Entrypoint

#!/bin/sh
set -e

# Runs as the default service account: fetch the run script passed as the
# first argument from the bucket, then execute it.
gsutil cp "$1" /run/cmd
chmod +x /run/cmd
/run/cmd

Run Script (in bucket)

#!/bin/sh
set -e

# Write the service account key to tmpfs so it never touches the image or disk.
cat << EOF!! > /dev/shm/sa_key
THE KEY FILE CONTENTS GO HERE
EOF!!

gcloud auth activate-service-account --key-file=/dev/shm/sa_key

# commands below this line are performed with the specified identity

The default service account has access to the storage buckets in its project, so the script above will have to go in such a bucket. Be sure that bucket is appropriately protected; anyone with access to it can assume the identity of the service account whose key it contains.

Testing Locally

docker run -v "/home/me/.config/gcloud:/root/.config/gcloud" \
    theimagename gs://my-project_job1/run_script

This will use your user's active gcloud credentials to pull down the script, and then it will switch to the service account. When it finishes, your host's gcloud will be configured to use the service account, so you may need to switch back to yourself via gcloud auth login. To avoid this, you can instead mount a copy of that directory; that way the original remains untouched.
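
A sketch of the copy approach (the temporary path is arbitrary):

# Work from a throwaway copy of the gcloud config so the original stays untouched.
cp -r "$HOME/.config/gcloud" /tmp/gcloud-copy
docker run -v "/tmp/gcloud-copy:/root/.config/gcloud" \
    theimagename gs://my-project_job1/run_script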

Running in GCP

gcloud ai-platform jobs submit training job1 \
  --region us-west2 \
  --master-image-uri us.gcr.io/my-project/theimagename:latest \
  -- gs://my-project_job1/run_script

I hacked this up a bit to remove references to parts of my project that are irrelevant here, so this probably won't run as is, but I think this shows the gist of how I've been using it:

https://gist.github.com/MatrixManAtYrService/737cb408e5a27c2aaa19576b0f6ec18a

Unilateral answered 3/1, 2020 at 16:7 Comment(1)
I haven't evaluated it, but I wonder if cloud.google.com/kubernetes-engine/docs/how-to/batch/… would be a better product for this kind of thing.Unilateral
