Building an MLOps platform is something companies undertake in order to accelerate and manage the workflow of their data scientists in production. This workflow is reflected in ML pipelines, and includes the three main tasks of feature engineering, training and serving.
Feature engineering and model training are tasks which require a pipeline orchestrator, as each task depends on the outputs of the ones before it, and that makes the whole pipeline prone to errors.
Software building pipelines are different from data pipelines, which are in turn different from ML pipelines.
A software CI/CD flow compiles the code into deployable artifacts and accelerates the software delivery process. So, code in, artifact out. This is achieved by the invocation of compilation tasks, the execution of tests and the deployment of the artifact. Dominant orchestrators for such pipelines are Jenkins, Gitlab-CI, etc.
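As a rough sketch of the code-in, artifact-out flow, a minimal GitLab CI configuration could look like this (the stage names, jobs and `make` targets are made up for illustration):

```yaml
# .gitlab-ci.yml -- hypothetical three-stage flow: compile, test, ship the artifact
stages:
  - build
  - test
  - deploy

build-job:
  stage: build
  script:
    - make build            # compile the code into a deployable artifact
  artifacts:
    paths:
      - dist/               # keep the compiled output for the later stages

test-job:
  stage: test
  script:
    - make test             # run the test suite against the build

deploy-job:
  stage: deploy
  script:
    - make deploy           # push the artifact to its target environment
```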
A data processing flow gets raw data and performs transformations to create features, aggregations, counts, etc. So data in, data out. This is achieved by the invocation of remote distributed tasks, which perform data transformations and store intermediate artifacts in data repositories. Tools for such pipelines are Airflow, Luigi and some Hadoop ecosystem solutions.
In the machine learning flow, the ML engineer writes code to train models, uses the data to evaluate them and then observes how they perform in production in order to improve them. So code and data in, model out. Hence the implementation of such a workflow requires a combination of the orchestration technologies we've discussed above.
TFX presents this pipeline and proposes the use of components that perform these subsequent tasks. It defines a modern, complete ML pipeline, from building the features, to running the training, evaluating the results, and deploying and serving the model in production.
Kubernetes is the most advanced system for orchestrating containers, the de facto tool for running workloads in production, and the cloud-agnostic solution that saves you from cloud vendor lock-in and hence lets you optimize your costs.
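For reference, the basic unit of that orchestration is a plain manifest. A minimal Deployment (image and names chosen only for illustration) looks like this:

```yaml
# A minimal Deployment: Kubernetes keeps two replicas of this container running,
# reschedules them on node failure and rolls out new versions gradually.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.21       # hypothetical workload image
          ports:
            - containerPort: 80
```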
Kubeflow is positioned as the way to do ML on Kubernetes, by implementing TFX. In essence, it handles the code and data in, model out workflow. It provides a coding environment by implementing Jupyter notebooks in the form of Kubernetes resources, called notebooks. All cloud providers are on board with the project and implement their data loading mechanisms across KF's components. The orchestration is implemented via KF pipelines and the serving of the model via KF serving. The metadata across its components is specified in the specs of the Kubernetes resources throughout the platform.
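To give an idea of what such a notebook resource looks like, here is a sketch of a Notebook custom resource; the exact apiVersion and fields depend on the Kubeflow release, and the image, namespace and sizes below are illustrative:

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook
  namespace: kubeflow-user       # hypothetical user namespace
spec:
  template:
    spec:                        # a regular pod spec under the hood
      containers:
        - name: my-notebook
          image: jupyter/tensorflow-notebook:latest   # illustrative Jupyter image
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```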
In Kubeflow, the TFX components exist in the form of reusable tasks, implemented as containers. The lifecycle of these components is managed through Argo, the orchestrator of KF pipelines. Argo implements these workflows as Kubernetes CRDs. In a workflow spec we define the DAG tasks, the TFX components as containers, the metadata which will be written in the metadata store, etc. These workflows execute using standard Kubernetes resources like pods, as well as custom resource definitions like experiments. That makes the implementation of the pipeline and its components language-agnostic, unlike Airflow, which implements its tasks in Python only. The tasks and their lifecycle are then managed natively by Kubernetes, without the need for duct-tape solutions like Airflow's kubernetes-operator. Since everything is implemented as Kubernetes resources, everything is a YAML file, the most Git-friendly configuration you can find. Good luck trying to enforce version control in Airflow's dag directory.
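As a sketch of what such a workflow spec looks like, here is a two-step Argo Workflow with a DAG; the component images and commands are placeholders, not the actual TFX containers:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: transform
            template: transform
          - name: train
            template: train
            dependencies: [transform]    # train only runs after transform succeeds
    - name: transform
      container:
        image: registry.example.com/transform:0.1   # hypothetical feature engineering image
        command: ["python", "transform.py"]
    - name: train
      container:
        image: registry.example.com/train:0.1       # hypothetical training image
        command: ["python", "train.py"]
```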
The deployment and management of the model in production is done via KF serving, using the InferenceService CRD. It utilizes Istio's secure access to the models via its virtual services, serverless resources using Knative Serving's scale-from-zero pods, revisions for versioning, Prometheus metrics for observability, logs in ELK for debugging and more. Running models in production could not be more SRE-friendly than that.
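A sketch of such an InferenceService is below; the apiVersion and spec layout vary across KF serving releases (this follows the v1alpha2-era layout), and the model path is made up:

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: my-model
spec:
  default:
    predictor:
      tensorflow:
        # hypothetical model location; Knative scale-from-zero, revisions and
        # Istio routing are handled by the platform underneath this spec
        storageUri: s3://models/my-model/
```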
On the topic of splitting training/serving between cloud and on-premise, the use of Kubernetes is even more important, as it abstracts away each provider's custom infrastructure implementation and so provides a unified environment to the developer/ML engineer.
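To illustrate, under the same assumptions as the sketch above, moving a model between a cloud object store and an on-premise cluster volume typically comes down to changing the storage URI while the rest of the manifest stays identical:

```yaml
# Same hypothetical InferenceService, this time serving from an on-premise volume:
# only the storage URI changes, from an object store to a PersistentVolumeClaim.
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: my-model
spec:
  default:
    predictor:
      tensorflow:
        storageUri: pvc://models-claim/my-model/   # hypothetical on-prem volume claim
```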