Owning your model deployment

After spending the last few months head down in model deployment, uncovering software engineering concepts I always thought I wouldn’t need, this blog is my attempt to clarify what I learnt and make it available to others who, like me, have always focused on the science aspect of data science but now need to get their hands dirty with MLOps or ML Engineering. The goal of this write-up is not to be an exhaustive tutorial (although I will try to include essential recipes where possible) nor is it to be perfectly right. Any extensive explanation of technologies like Kubernetes would probably be obsolete as soon as it’s finished (there’s always a shiny new tool, a shiny new service). Instead, I want to highlight the concepts behind these, why they are necessary, how to use them together to achieve our goal of deploying and maintaining models, and to that end I will prefer a pedagogical inaccuracy to a mysterious but exact technical explanation.

A lot of what follows also assumes you use one of the big Cloud providers (because that’s what mostly happens in Enterprise where a lot of this work is done). I am personally biased by my experience with Google Cloud Platform (GCP), so whenever a service is provider-specific I name the GCP, AWS and Azure equivalents, in that order.

Introduction

There are many jobs in data, from data analyst, scientist, engineer, to more specialized roles like machine learning engineer or the more recent AI engineer.¹ They all overlap somewhat on various dimensions, each specializing in a corner of that space (data scientists for example are usually more focused on analysing and modeling data whereas data engineers are usually responsible for data processing and quality at scale). In an ideal organization, these roles would have clear scopes and responsibilities, with little overlap.

A data analyst might find something interesting in data and validate a business case with stakeholders, a data scientist would then model whatever pattern is in that data (to achieve some business outcome) as an MVP, which would then be deployed to production by an ML engineer/MLOps team using data pipelines created and maintained by a data engineer.

But the real world is not that simple. Sometimes you’re the only data specialist working on a project, maybe with the support of an engineer or a product owner. Of course, the work still has to be done. In those situations you are essentially full-stack, which is a sort of positive way to say you wear all (or almost all) hats² and are responsible for all (or almost all) of the lifecycle of the project. As a data scientist, I’m already comfortable with the analysis and modeling (well, that’s my job really) and I can hold my own with production-grade data pipelines (with a good knowledge of SQL and some awareness of tools like dbt, or Dataform (GCP), Glue (AWS) or Data Factory (Azure), you can build some resilient data pipelines).

Packaging your model before you ship it

Usually, in the analysis and modeling parts of the model lifecycle, you will spend a lot of time in some interactive environment like a Jupyter notebook (or, my personal favourite, a Marimo notebook). But that’s far from “production-ready”. Notebooks are great tools for exploratory work, but production-grade code must be packaged properly so it can bring some actual value.

Throughout this post, I’ll just assume you have some Python code you’ve painfully built, maybe using scikit-learn or some other library. It could be a classifier, a forecasting model, or just some data transformation with custom logic. It doesn’t really matter. What matters is that all of this lives in a Python package which loads some data (from a database or object storage, whatever works for you), works with it, and writes/sends back the output as a data stream.

No matter how you do it, a key aspect of preparing this package for deployment is to create a Docker image for it. If you’re not familiar with it, a Docker image allows you to create isolated containers of your code, in which all dependencies can be controlled for. This is a great way to create reproducible builds so that your program doesn’t just “work on your machine” but everywhere as you expect it to. Building a Docker image is a two-step process:

first you create a Dockerfile, which is essentially a recipe that describes the state of your container in the image,
then comes the build process, where the recipe is applied and the container’s final state is saved as an image.

For example, with our simple Python project, a typical Dockerfile would be³

# Base image to start from
# This one is a slim Linux image with Python installed
FROM python:3.12-slim-bookworm
# Copy the uv package manager from the official uv image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Copy all the files from the project inside the container
COPY . /app

# Set the workdir of the container
WORKDIR /app
# Install dependencies using the lockfile
RUN uv sync --locked

# Default command run in the container with its arguments
ENTRYPOINT ["uv", "run"]
CMD ["my_package", "--arg", "value"]

If this isn’t clear yet, you can think of this as a bunch of commands you would run on a clean, empty virtual machine on which you would want to run your project, like in CI. There are other useful instructions you can use, among which EXPOSE (to specify which ports your container should expose to your network) and VOLUME (to specify mounting points for data living on the machine where the container is executed so you can access it from within).

Say you have the above file in the root directory of your project: you can build the image with docker build -t my-image . (my-image is just a placeholder name) and execute it using docker run my-image. When you run the image directly like this, the container executes the ENTRYPOINT followed by the default arguments defined with CMD, which here gives uv run my_package --arg value. Anything you pass after the image name replaces CMD entirely. If you need to run a custom command (say for example your package has multiple entrypoints or different arguments available), you can just do

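# my-image is a placeholder for your image name or ID
# Different arguments (everything after the image name replaces CMD)
docker run my-image my_package --other-arg value
# Different binary (--entrypoint must come before the image name)
docker run -it --entrypoint /bin/other-binary my-image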

The last thing you need to know is that if you build an image (either on your computer or in the CI), you still need to somehow publish it on a repository so that your deployment machine can pull it later on.

The most common place to do that would be Docker Hub, or if you’re doing this at work maybe your company has some artifact registry like JFrog. Otherwise your cloud provider will definitely have one: Artifact Registry (GCP), Elastic Container Registry (AWS) or Container Registry (Azure).

Running the model

So far, we’ve seen how to transform a Python package (with all its dependencies, serialized data, etc.) into a Docker image, which provides a standard and universal way to run the program. Any computer, any cluster even, can run your code easily once you’ve done that. But what kind of machine should you choose next? As usual with this kind of question, the answer is “it depends”.

Servers

Once upon a time (and maybe still today, depending on your needs), your deployment environment would just be some remote machine, like an on-prem server or a plugged-in laptop at home, which you can just SSH into to download your code and run it. Of course, if you can run the Docker image on your own computer, you can easily run it on a remote machine in the same way.

This kind of deployment is not really the standard anymore. With the advent of cloud computing (more than a decade ago), the “remote” machine may not even be a machine at all. For example, this blog is hosted on a VPS (a virtual private server), which “feels” like a single machine, but is really distributed over different hardware in a datacenter.⁵ This is great for reliability and scalability (doubling the capacity doesn’t require changing machines, it just means renting more compute) and is generally cheaper (since you don’t have to rent a whole server). By design, these machines are made to be used exactly like a dedicated server, so you can deploy the model in the same way as in the previous case.

Kubernetes clusters

The gold standard for scalability lies with Kubernetes (often abbreviated as k8s). The main idea behind Kubernetes is to have a self-managing cluster. In the previous setups I described, you (or an engineer on your team) were in charge of managing the process of your running program. Starting it, killing it if needed, restarting it if it crashed, checking it’s still running, making sure it has enough memory: all of this is on you. For simple cases, that’s easy enough. But when you have to programmatically manage a lot of such jobs, especially if they are dynamically created, then it may be a full-time job.

Kubernetes is a solution to this problem which also enables scalability and reliability: not only are your jobs managed by the Kubernetes controller but you can define lower- and upper-bounds on resources so your cluster scales as needed. The only downside is that all of this requires a little bit of setup, and it’s complex enough that working with my own cluster at work is half the reason I decided to write this blog.
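To give an idea of what “scales as needed” means in practice, here is a sketch of the scaling rule Kubernetes documents for its Horizontal Pod Autoscaler, which adjusts the number of pod replicas (not the number of nodes) between two bounds. The function name and the numbers are mine, and I left out refinements such as the tolerance band and stabilization windows, so take it as an illustration rather than the real controller logic.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Scale proportionally to metric / target, then clamp to the allowed range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 2 pods at 90% CPU with a 60% target: scale up
print(desired_replicas(2, 90, 60, min_replicas=1, max_replicas=10))
# 4 pods at 20% CPU: scale down, but never below the lower bound
print(desired_replicas(4, 20, 60, min_replicas=2, max_replicas=10))
# 5 pods at 300% CPU: scale up, but never above the upper bound
print(desired_replicas(5, 300, 60, min_replicas=1, max_replicas=8))
```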

Kubernetes can seem complex mostly because it piles various concepts together in a subtle hierarchy, so let me start by defining those:

On the physical side, compute is grouped in a cluster, which is built out of individual nodes,⁶ possibly bundled into node pools. For example, your cluster could have a standard node pool as well as a high-memory node pool whose nodes will be used for jobs requiring a lot of RAM.
On the virtual side, the cluster manages jobs for you, each of which is associated with a Docker image. This image is pulled and run inside a small bubble of resources called a pod. So whereas the container is the virtual environment in which your program runs, the pod is like a wrapper for one or more containers.⁷ In reality, you don’t manage pods yourself (that’s literally what k8s does for you), you manage a deployment, which defines what pods should run along with resource policies.
Because all of this involves some difficult routing (which pod runs on which machine? what’s its IP?), Kubernetes provides a service placed in front of your pods. When you send a query to the cluster, the service is in charge of finding which pod should receive it. It is a router but also a load balancer.

This is a lot of information, so let’s try to summarize this in a diagram:

This diagram represents one deployment within a cluster, with 3 pods running the same image distributed over 2 separate nodes.
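If you prefer code to diagrams, the same setup can be mimicked with a toy object model. To be clear, this is not the Kubernetes API: the class names borrow the k8s vocabulary, but the placement and routing logic (plain round-robin) is a simplification I made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Pod:
    name: str
    image: str
    node: str  # name of the node the pod was placed on


@dataclass
class Deployment:
    name: str
    image: str
    replicas: int
    pods: list[Pod] = field(default_factory=list)

    def schedule(self, nodes: list[str]) -> None:
        """Create `replicas` pods, spreading them over the nodes in turn."""
        self.pods = [
            Pod(f"{self.name}-{i}", self.image, nodes[i % len(nodes)])
            for i in range(self.replicas)
        ]


class Service:
    """Routes each incoming request to one of the deployment's pods (round-robin)."""

    def __init__(self, deployment: Deployment) -> None:
        self._pods = cycle(deployment.pods)

    def route(self, request: str) -> str:
        pod = next(self._pods)
        return f"{request} -> {pod.name} on {pod.node}"


# One deployment, 3 pods running the same image, spread over 2 nodes
deployment = Deployment(name="model", image="my-image:latest", replicas=3)
deployment.schedule(nodes=["node-a", "node-b"])
service = Service(deployment)
for i in range(4):
    print(service.route(f"request-{i}"))
```

The caller only ever talks to the service and never needs to know which pod, or which node, ended up handling the request.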

Why all this complexity?

The first time I saw this setup, I wondered why you’d ever need all of this. But the truth is that when you want to run something in production reliably, at scale, and with little manual management, then whatever you build will eventually look a lot like a k8s deployment. Where this setup shines is that pods are automatically managed for you: when a pod dies, k8s automatically creates a new one with the same image. If a pod is non-responsive or consumes too much memory, it’s gracefully killed and another one is spun up instead.
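The mechanism behind this self-healing is often described as a reconciliation loop: compare the desired state with the actual state, and fix the difference. Here is a toy version of one pass of such a loop. The Pod class and the reconcile function are hypothetical and bear no relation to the real controller code, they only show the idea.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Pod:
    name: str
    image: str
    healthy: bool = True


def reconcile(pods: list[Pod], image: str, replicas: int, ids) -> list[Pod]:
    """Return the pod list after one reconciliation pass."""
    alive = [p for p in pods if p.healthy]  # drop dead or unresponsive pods
    while len(alive) < replicas:  # create replacements for what is missing
        alive.append(Pod(f"pod-{next(ids)}", image))
    return alive[:replicas]  # trim extras if the deployment was scaled down


ids = count()  # source of unique pod suffixes
pods = reconcile([], "my-image:1.0", replicas=3, ids=ids)
print([p.name for p in pods])

pods[1].healthy = False  # simulate a crash of the second pod
pods = reconcile(pods, "my-image:1.0", replicas=3, ids=ids)
print([p.name for p in pods])
```

After the simulated crash, the second pass drops pod-1 and creates pod-3 so that three pods are running again, without anyone having to intervene.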

If you update the image and want to synchronize the deployment with the new code, you can do a rollout restart where new pods are created with the new image, the service now routes all traffic to those pods, and the previous pods are then shut down. If you plan more jobs than your cluster has resources for, pods will just be waiting (marked as pending) until the minimum amount of RAM/CPU/storage is available, at which point they’ll automatically start. All this and more can be done using the kubectl CLI tool directly, or, as the next section will show you, using configurations.

Overall, Kubernetes may look complex at first glance, but it’s because it comes packed with all the features you’d want in a production cluster, which is why it’s a great choice if you need to run a job more than once and care about reliability (i.e. you can’t afford to have your program stay down because of a random error).

Managing your Kubernetes

The second difficult aspect of Kubernetes is setting it up. As I mentioned before, a VPS, virtual machine or dedicated server requires some setup up-front but once that’s done, everything can be done like you would locally via SSH. With Kubernetes, there’s a little bit more work, such as

provisioning: most likely your cloud provider offers a cluster management service like Google Kubernetes Engine (GCP), Amazon Elastic Kubernetes Service (AWS) or Azure Kubernetes Service (Azure),
configuration: setting up your deployment means describing the containers which should be started, the pods to use on startup, the minimal/maximal resources for each pod, etc.

Regarding provisioning, there are a few things you will need to set up in general:

a service account to access whatever Docker registry you ended up choosing,
a SQL routing service if you need a hosted database,
maybe networking if your pods need to communicate reliably with each other or with other services.

This can be a lot of work. Thankfully, a lot of this can be done quite easily using Terraform as infrastructure-as-code (IaC). If you want to read more about this, I recommend this excellent deck of slides from my colleague and partner in crime at IKEA. Summarizing briefly, Terraform is a standard way to define your infrastructure using version-controlled configuration files instead of clicking through endless, confusing UIs (yes I’m thinking of you Google Cloud Console). To illustrate this, here is what a simple Terraform configuration would look like to define a deployment on a Kubernetes cluster (example taken from here)

# Terraform definition and dependencies
terraform {
  required_version = ">= 1.0.0"
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0"
    }
  }
}

# Basic deployment of an nginx image
resource "kubernetes_deployment" "changeme_simple_deployment" {
  metadata {
    name = "changeme-simple-deployment"
    labels = {
      app = "changeme-simple-deployment"
    }
  }

  spec {
    replicas = 1 # Number of pods in deployment
    selector {
      match_labels = {
        app = "changeme-simple-deployment"
      }
    }
    template {
      metadata {
        labels = {
          app = "changeme-simple-deployment"
        }
      }
      spec {
        container {
          image = "nginx"
          name  = "nginx"

          # Resources needed for the pod
          resources {
            requests = {
                cpu    = "250m"
                memory = "50Mi"
            }
          }
        }
      }
    }
  }
}

A priori, you can also handle the configuration aspect of Kubernetes this way. In practice, there are other tools to help streamline things like templating (the part of the configuration which describes the deployment composition). For example, using the k8s package manager Helm, the above deployment would be written as a template, followed by the values used to fill it in:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-deployment
  labels:
    app: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu }}
              memory: {{ .Values.resources.requests.memory }}

# values.yaml: the values injected into the template above
replicaCount: 1

image:
  repository: nginx
  tag: "stable"

resources:
  requests:
    cpu: 250m
    memory: 50Mi

But what’s the point of doing this when it could be done in Terraform?

At the level of this blog post, none really. On a personal note, I’m not a fan of Helm. For more advanced use cases though, Helm brings certain features such as versioning and rollback, as well as a certain level of decoupling from the infrastructure itself that can make your life easier. In general, Terraform is great to create, modify or delete infrastructure using a declarative configuration, while Helm manages what should run in the cluster.

Headless

There is an alternative way to run your image if it’s simple enough (and doesn’t require any complex dependency): the managed, serverless offering of your favourite cloud provider, such as Cloud Run (GCP), Lambda functions (AWS) or Azure Functions (Azure).

The basic idea is that your provider handles the whole infrastructure for you: all you need is to specify an image and/or a runtime (Node, Python, etc.). Your code will be automatically pulled and run as a queued job. This is ideal if you don’t need to handle multiple processes at once or if you don’t depend on complex infrastructure outside your provider’s ecosystem (like a specific database or whatever).

How to talk to your model

So, you built a model, you spent some time packaging it as an image and pushing it somewhere in the cloud. Then you rented some compute, set it up so it would find your image and run it. Unless your model is a one-and-done kind of pipeline, you probably will need to communicate with it. For example, say you have a classification model built to figure out whether a given order by a customer is more likely to be a delivery or a collect order. With this information, you can already pre-allocate resources before your customer is done checking out, which can help with the efficiency of your fulfilment process. It may not sound super useful at first, but this kind of improvement can reduce bottlenecks in your fulfilment chain, and that translates into cost reductions.

So, the prediction model is running somewhere in the cloud, waiting to receive new orders it can assign a label to. But how do you send it new orders to label? This is where HTTP servers and REST APIs come in. Without going into too much detail, an HTTP server is just a server receiving and responding to HTTP requests (the protocol that allows you to see this webpage), while REST APIs are just a standardized way to exchange data and information via such requests. In a REST API, the server just exposes some endpoints⁸ (URLs), each accepting one or more standard methods.

An HTTP request is made of two parts: the headers (the metadata) and the body, which contains the data. The various methods of REST APIs (GET, POST, PUT, PATCH and DELETE) are standardized contracts between the server and the user to clarify intent for each query. If the HTTP protocol is a grammar, and the request is a sentence, the REST method is just the verb carrying the meaning behind the request. A GET and a POST request to the same URL /endpoint will trigger different behaviour (and the server will have different assumptions about them) because of this. The main methods are:

GET: an idempotent, safe method to query data from the server. Any information carried by the request is part of the URL (explicitly in the path, /user/XXX, or within a query parameter, /endpoint?param=value).
POST: meant to create new data or deliver a payload to the server. Data provided by the request should be placed within the body of the request (most likely in JSON).
PUT/PATCH: full replacement (PUT) or partial update (PATCH) of existing data. PUT is idempotent like GET, but not safe since it modifies data; PATCH is not guaranteed to be idempotent.
DELETE: request data deletion from the server.
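To see where the information lives in each case, here is a small sketch using only the Python standard library. It builds a GET and a POST request without sending them. The server address and the /orders endpoint are hypothetical, chosen purely for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical server address

# GET: everything the server needs is in the URL itself
get_request = Request(
    f"{BASE_URL}/orders?{urlencode({'status': 'open'})}",
    method="GET",
)

# POST: the payload travels in the body, here encoded as JSON
payload = json.dumps({"items": [101, 102]}).encode("utf-8")
post_request = Request(
    f"{BASE_URL}/orders",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Nothing is sent over the network, we only inspect what was built
for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

To actually send one of these, you would pass it to urllib.request.urlopen (or use a friendlier client library), provided a server is listening at that address.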

While these are standard conventions and you should abide by them when you build an open API, I have to admit that in practice, if you write your own REST API for your own internal use (to communicate between two of your programs for example), you’ll likely just use GET and POST, and that’s okay. You can always refine it if you ever need to make it public.

There are many ways to do this, some more robust than others (but also more time-consuming). For the sake of simplicity here, I will just use FastAPI within the Python package. This is a great package to build quick APIs. For example, to build the API described in the diagram above, the code is just

from fastapi import FastAPI
from pydantic import BaseModel
from my_package import model
from typing import Any

# Use a Pydantic model to validate
# request data automatically
class Order(BaseModel):
    items: list[int]

app = FastAPI()

@app.get("/health")
def health_check():
    """Checks the server is running"""
    return {"status": "healthy"}

@app.post("/assign")
async def assign_order(order: Order) -> dict[str, Any]:
    label = model(order)
    return {"orderType": label}

This means that when you run uv run fastapi run path/to/endpoints.py --port 80, an HTTP server will start and expose the endpoints defined above on port 80 (without the --port option, fastapi run listens on port 8000 by default). Then, to expose the port you need to add EXPOSE 80 to your Dockerfile, and the option -p HOST:CONTAINER to your docker run command (in this case -p 80:80).⁹

Monitoring your model

Proper model deployment doesn’t stop when the model is running in production. A lot of things can happen. In the case of a data product, the model can fail because of data drift, schema changes, etc. When those things happen, you need to be alerted somehow and be able to investigate. This is called monitoring.
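As a very rough illustration of the kind of check that could trigger such an alert, here is a sketch of a schema check and a crude drift heuristic. Both functions, the field names and the three-sigma threshold are assumptions of mine. A real setup would rely on proper statistical tests and a dedicated monitoring tool.

```python
from statistics import mean, stdev


def missing_fields(record: dict, expected: set[str]) -> set[str]:
    """Schema check: which expected fields are absent from an incoming record?"""
    return expected - set(record)


def has_drifted(reference: list[float], live: list[float], max_sigmas: float = 3.0) -> bool:
    """Crude drift check: is the live mean far from the reference mean?

    "Far" means more than `max_sigmas` reference standard deviations away.
    """
    return abs(mean(live) - mean(reference)) > max_sigmas * stdev(reference)


reference = [10.0, 11.0, 9.0, 10.0, 10.0]  # feature values seen at training time

print(missing_fields({"items": [1, 2]}, {"items", "customer_id"}))
print(has_drifted(reference, [10.2, 9.8, 10.1]))   # similar data
print(has_drifted(reference, [15.0, 16.0, 14.0]))  # shifted data
```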

The industry standard for this is OpenTelemetry (OTel), which is supported by the largest frameworks such as MLflow. OTel relies on three types of signals:

traces are groups of spans, each span recording one timed operation; together they are the structured version of what we all do when we sprinkle print and log statements throughout our code,
metrics are quantitative measures recorded throughout the program (execution time, loss during model training, etc.),
logs are your generic logging messages (but structured, for instance as JSON).

For example, a trace is a great way to track not only why your program crashed due to an out-of-memory error, but also what the exact sequence of events led to that happening. Similarly, by logging metrics during training, you can catch model degradation early on. Maybe the data you use isn’t that fresh anymore, maybe your features are not enough to explain the new pattern, but with metrics logging you will catch this before your production model reaches a critical stage.
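To make the three signals more tangible, here is a toy version written with the standard library only. This is deliberately not the OpenTelemetry API: the emit and span helpers are hypothetical, the records list stands in for a real sink, and the latency value is made up. Real spans also carry identifiers linking them to their parent, which I left out.

```python
import json
import time
from contextlib import contextmanager

records: list[dict] = []  # stand-in for the sink that would receive the signals


def emit(kind: str, **fields) -> None:
    """Record one signal as a structured, JSON-serializable dict."""
    records.append({"kind": kind, **fields})


@contextmanager
def span(name: str, trace_id: str):
    """Time a block of work and record it as a span of the given trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        emit("span", trace_id=trace_id, name=name,
             duration_s=time.perf_counter() - start)


# One trace ("t1") made of three spans, with a log and a metric along the way
with span("handle_request", trace_id="t1"):
    with span("load_features", trace_id="t1"):
        emit("log", level="INFO", message="features loaded")
    with span("predict", trace_id="t1"):
        emit("metric", name="prediction_latency_ms", value=12.5)  # made-up value

for record in records:
    print(json.dumps(record))
```

Since a span is only recorded once its block finishes, the inner spans show up before the outer one, and the shared trace_id is what lets a sink stitch them back into a single timeline.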

Since this is a standard, you can find a lot of tools that handle this, such as logging libraries or even frameworks (like FastAPI or most ML and AI frameworks). With those, your application will be emitting all these signals which need to be caught by a sink, a server set up just for that. A good example of this is MLflow, which can group traces and metrics and display them for you. It’s a great way to monitor and debug your application in development and in production without having to trudge through layers and layers of logs by hand.

Conclusion

This was a much longer post than I intended, and to be honest it could have been even longer if I let myself go into details on each of the technologies I mentioned. There are a lot of aspects to deploying code in a production environment, which is why there are whole jobs dedicated to this, such as ML Engineer.

Still, sometimes you have to roll up your sleeves and take care of your deployment yourself, and that’s what this guide is for. I was envisioning this blog post to be a stepping stone to that goal, to clear up concepts and what steps need to be taken to have a production-grade model running.

If I can leave you with a simple takeaway, deployment is about building the automation and packaging to run the app or model without human interference. You shouldn’t need to build a virtual environment manually, you shouldn’t even need to push the Docker image by hand to the registry (use CI for that!), instead everything from the moment you merge your code to your trunk branch till you query your model should be handled by your automation pipeline. If the model crashes or fails, you should be alerted. It may be a lot of work, but it’s all worth it in the end when you don’t have to take care of every single update manually or deal with yet another “but it works on my laptop!” ticket.

Footnotes

¹ To be honest, AI engineering is a lot closer to traditional software engineering than to the other data roles, but it is a common title claimed by data people.
² If only you also got the pay for all those hats, huh?
³ In the above Dockerfile, I included a command to copy all files of the project, COPY . /app. It is highly recommended to add a .dockerignore file to your project if you want to avoid copying a lot of unnecessary bloat (.venv/, node_modules), or else to copy only the files you need explicitly.
⁴ To be exact, ONNX is a framework for interoperability between ML frameworks. It provides a graph architecture for tensor models like neural networks so you can represent a model trained with e.g., PyTorch, in ONNX format and then load it in another framework or even a native data structure for inference. In practice, this interoperability plus the fact that you can serialize the graph as a binary file makes ONNX a great tool for model serialization.
⁵ As was rightfully pointed out by a friend, this is generally not the case since most servers are large enough that your virtual machine likely fits entirely within a single one. So in general what you rent is a slice of a single machine. But the overall point remains that what you rent is a virtual machine, not a physical one. It’s up to the provider to decide how to allocate that slice of compute, and it may be on a single machine or distributed across several.
⁶ This is even more confusing because, just as in a VPS, a node can be a virtual piece of several machines shared across clusters.
⁷ The extra containers in a pod are usually called sidecars.
⁸ The /health endpoint is quite common and is used by Kubernetes to check that your container is not just running but also healthy and working as intended.
⁹ Just so you know, most cloud-based solutions (VPS, k8s, etc.) have complex pre-set firewall rules, which means you may need some platform-dependent configuration if you plan to access this server from outside the machine. But this is a bit beyond the scope of what I wanted to discuss here.