Kubernetes and MPI
I want to run an MPI job on my Kubernetes cluster. The context is that I'm actually running a modern, nicely containerised app, but part of the workload is a legacy MPI job which isn't going to be rewritten anytime soon, and I'd like to fit it into a Kubernetes "worldview" as much as possible.

One initial question: has anyone had any success in running MPI jobs on a kube cluster? I've seen Christian Kniep's work in getting MPI jobs to run in Docker containers, but he's going down the Docker Swarm path (with peer discovery using Consul running in each container) and I want to stick to Kubernetes (which already knows the details of all the peers) and inject this information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.

I have a couple of ideas about how to proceed:

  1. fat containers containing Slurm and the application code -> populate the slurm.conf with appropriate info about the peers at container startup -> use srun as the container entrypoint to start the jobs

  2. slimmer containers with only OpenMPI (no Slurm) -> populate a rankfile in the container with info from outside (provided by Kubernetes) -> use mpirun as the container entrypoint (a rough sketch of this follows the list)

  3. an even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) -> run the mpicc'd binary directly (where it'll find out about its peers through the env vars)

  4. some other option

  5. give up in despair
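
To make option 2 concrete, here's roughly what I have in mind, as a sketch only: MPI_PEERS, the rankfile path, and the binary path are made-up names, and the peer list would be injected by Kubernetes from the outside (e.g. via an env var set from the pod spec or a downward-API volume):

    #!/bin/sh
    # Sketch for option 2: build an Open MPI rankfile from a peer list
    # injected by Kubernetes. MPI_PEERS is a made-up env var (e.g. the
    # pod hostnames behind a headless Service).
    : "${MPI_PEERS:?space-separated list of peer hostnames or IPs}"

    RANKFILE=/tmp/rankfile
    > "$RANKFILE"
    rank=0
    for peer in $MPI_PEERS; do
        # pin one rank to slot 0 of each pod; adjust for multi-core pods
        echo "rank $rank=$peer slot=0" >> "$RANKFILE"
        rank=$((rank + 1))
    done

    # mpirun places ranks on hosts according to the rankfile
    exec mpirun -np "$rank" --rankfile "$RANKFILE" /app/my_mpi_binary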

I know trying to mix "established" workflows like MPI with the "new hotness" of Kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists I'm happy to hack on some stuff and push it back upstream.

Introit answered 29/6, 2016 at 7:49 Comment(1)
I doubt option 3 would work. Open MPI's orterun (a.k.a. mpirun and mpiexec) does much more than simply launching the executable multiple times. It serves as a central information broker between the ranks. Option 2 seems most reasonable. – Cammiecammy
I tried running MPI jobs on Kubernetes for a few days and solved it by using dnsPolicy: None together with dnsConfig (the CustomPodDNS feature gate needs to be enabled).
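
For reference, the relevant part of a pod spec looks roughly like this (the pod name, nameserver IP, search domain, and image are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: mpi-worker-0                  # placeholder name
    spec:
      dnsPolicy: "None"                   # ignore the cluster DNS settings entirely
      dnsConfig:                          # ...and supply resolver config ourselves
        nameservers:
          - 10.96.0.10                    # placeholder: your cluster DNS service IP
        searches:
          - mpi.default.svc.cluster.local # placeholder search domain
        options:
          - name: ndots
            value: "2"
      containers:
        - name: openmpi
          image: example/openmpi:latest   # placeholder image
          command: ["sleep", "infinity"]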

I pushed my manifests (as a Helm chart) here:

https://github.com/everpeace/kube-openmpi

I hope it helps.

She answered 16/2, 2018 at 13:36 Comment(0)
Assuming you don't want to use a hardware-specific MPI library (for example, anything that uses direct access to the communication fabric), I would go with option 2.

  1. First, implement a wrapper for mpirun which populates the necessary host data using the Kubernetes API, specifically by reading the Endpoints of a Service (using a headless Service might be a good idea); you could also scrape the pods' exposed ports directly. A sketch of this follows the list.

  2. Add some form of checkpoint program that can be used for "rendezvous" synchronization before starting the actual code (I don't know how well MPI deals with ephemeral nodes). This is to ensure that when mpirun starts, it has a stable set of pods to use.

  3. And finally, build a container with the necessary code and, presumably, an SSH service for mpirun to use for starting processes in the other pods.
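
A minimal shell sketch of steps 1 and 2 (the Service name mpi-workers and the EXPECTED_WORKERS variable are made up, and it assumes kubectl is available in the container with permission to read Endpoints):

    #!/bin/sh
    # Discover peers via the Endpoints of a (made-up) headless Service
    # "mpi-workers", wait until every expected pod is listed (the
    # rendezvous from step 2), then hand the list to mpirun.
    : "${EXPECTED_WORKERS:?set to the number of worker pods}"

    while :; do
        PEERS=$(kubectl get endpoints mpi-workers \
            -o jsonpath='{.subsets[*].addresses[*].ip}')
        [ "$(echo "$PEERS" | wc -w)" -eq "$EXPECTED_WORKERS" ] && break
        sleep 2
    done

    # one host per line; Open MPI defaults to one slot per listed host
    printf '%s\n' $PEERS > /tmp/hostfile

    # per step 3, this assumes sshd is running in each worker container,
    # since mpirun launches remote ranks over SSH by default
    exec mpirun --hostfile /tmp/hostfile -np "$EXPECTED_WORKERS" /app/my_mpi_binary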


Another interesting option would be to use StatefulSets, possibly even running Slurm inside, to implement a "virtual" cluster of MPI machines running on Kubernetes.

This provides stable hostnames for each node, which reduces the problem of discovery and of keeping track of state. You could also use statefully-assigned storage for each container's local working filesystem (which, with some work, could be made to always refer to, for example, the same local SSD).
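
Roughly, the pair of manifests could look like this (a sketch only; the names, image, replica count, and storage size are placeholders):

    # A headless Service plus a StatefulSet gives each pod a stable DNS
    # name like mpi-0.mpi.default.svc.cluster.local.
    apiVersion: v1
    kind: Service
    metadata:
      name: mpi
    spec:
      clusterIP: None                     # headless: per-pod DNS records, no VIP
      selector:
        app: mpi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: mpi
    spec:
      serviceName: mpi                    # ties pod DNS names to the headless Service
      replicas: 4
      selector:
        matchLabels:
          app: mpi
      template:
        metadata:
          labels:
            app: mpi
        spec:
          containers:
            - name: worker
              image: example/openmpi-ssh:latest  # placeholder image with sshd
              ports:
                - containerPort: 22              # mpirun launches ranks over SSH
              volumeMounts:
                - name: work
                  mountPath: /work               # per-pod working filesystem
      volumeClaimTemplates:                      # statefully-assigned storage
        - metadata:
            name: work
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi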

Another benefit is that it would probably be the least invasive approach for the actual application.

Athanor answered 1/10, 2017 at 7:43 Comment(4)
Yeah, there are a couple of new ways to do this in Kubernetes since I asked this question. Haven't had a chance to try any of them out, though :) – Introit
@ben can you point me to these new approaches you're talking about in your comment? thank you! – Background
Well, it was kind of an off-hand comment, but I was mostly referring to things like Helm charts and ksonnet for setting up the infrastructure (I never ended up doing it properly, though) – Introit
A student of mine used approach 2 with bare Docker containers in a cluster (no Swarm/Kubernetes; provisioned with Ansible) with some success. See github.com/SeiryuZ/HemeWeb – Urticaceous
