Blog·06/06/2026

From notebook to an inference API that survives production

A model trained in a notebook is not a product. Turning it into a service that validates its input, responds with predictable latency and can be monitored is the work that separates a demo from a system.

The model that works in the notebook and the model that works in production are the same artifact, but they pose different problems. In the notebook you control the input, run a cell and look at the result. In production, real traffic arrives in parallel and with malformed data, and someone needs to know whether the service is alive and how long it takes to respond.

First, an input contract. Most errors in an inference service come from the input, not the model: a missing field, a type that does not match, an array with the wrong shape. Validating the request before it reaches the model, with an explicit schema, turns a 500 with an incomprehensible traceback into a 4xx with a clear message (FastAPI answers 422 by default when validation fails). Pydantic with FastAPI gives you this almost for free, and it is the difference between a service you can debug and one you cannot.
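As a rough illustration of such a contract, here is a minimal sketch that uses Pydantic on its own, assuming Pydantic v2. The schema (`PredictRequest`, a fixed-length `features` list), the constant `N_FEATURES` and the helper `validate_payload` are hypothetical names chosen for the example, not part of any particular service.

```python
from pydantic import BaseModel, Field, ValidationError

N_FEATURES = 4  # assumption: the model expects exactly four numeric features


class PredictRequest(BaseModel):
    # A missing field, a non-numeric value or a wrong length all fail here,
    # before anything reaches the model.
    features: list[float] = Field(min_length=N_FEATURES, max_length=N_FEATURES)


def validate_payload(payload: dict) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a raw request payload."""
    try:
        request = PredictRequest.model_validate(payload)
    except ValidationError as exc:
        # Keep only the location and message of each problem.
        errors = [
            {"field": ".".join(str(part) for part in err["loc"]), "message": err["msg"]}
            for err in exc.errors()
        ]
        return 422, {"errors": errors}
    return 200, {"features": request.features}
```

Calling `validate_payload({})` yields a 422 whose first error names the `features` field. When the same model is declared as the body parameter of a FastAPI route, the framework performs this validation and builds the error response itself.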

Health and readiness are not optional. A health endpoint that confirms the model is loaded lets an orchestrator know when to route traffic and when to restart. Without it, a deployment that fails to load the model still receives requests and fails them silently. It is a few lines of code that prevent an entire class of incidents.
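The idea can be sketched without committing to a framework. In the example below, `ModelState`, `loader` and `health` are hypothetical names; the function returns a status code and a body that a route handler (in FastAPI or anything else) would turn into the actual response.

```python
from typing import Any, Callable, Optional


class ModelState:
    """Holds the loaded model and remembers why loading failed, if it did."""

    def __init__(self) -> None:
        self.model: Optional[Any] = None
        self.load_error: Optional[str] = None

    def load(self, loader: Callable[[], Any]) -> None:
        # `loader` is a hypothetical zero-argument function that reads the
        # model from disk. A failure is recorded instead of being raised,
        # so the health check can report it.
        try:
            self.model = loader()
            self.load_error = None
        except Exception as exc:
            self.model = None
            self.load_error = f"{type(exc).__name__}: {exc}"


def health(state: ModelState) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a /health request."""
    if state.model is None:
        # 503 signals the orchestrator not to route traffic to this instance.
        return 503, {
            "status": "unavailable",
            "model_loaded": False,
            "error": state.load_error,
        }
    return 200, {"status": "ok", "model_loaded": True}
```

Returning 503 until the model is in memory is one common convention; an orchestrator's readiness check can then hold traffic back from an instance that started but never loaded its model.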

If you do not measure it, it does not exist. An inference service with no metrics is a black box at the worst possible moment: when something is wrong and you do not know whether it is slower, failing more, or the volume spiked. Exposing a Prometheus metrics endpoint with a request counter by outcome, prediction throughput and a latency histogram gives you real percentiles (the p95, not the mean) and a basis to alert before the problem is visible to the user.
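A minimal sketch of that instrumentation with the `prometheus_client` library follows. The metric names, the bucket boundaries and the wrapper `observed_predict` are assumptions made for the example; real bucket edges should bracket the latencies your model actually shows.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps the sketch from touching the process-wide default.
registry = CollectorRegistry()

REQUESTS = Counter(
    "predict_requests",  # exposed as predict_requests_total
    "Prediction requests by outcome",
    ["outcome"],
    registry=registry,
)
LATENCY = Histogram(
    "predict_latency_seconds",
    "Time spent serving a prediction request",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),  # assumed edges
    registry=registry,
)


def observed_predict(predict, features):
    """Call `predict(features)` and record its outcome and duration."""
    start = time.perf_counter()
    try:
        result = predict(features)
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    else:
        REQUESTS.labels(outcome="ok").inc()
        return result
    finally:
        # Runs on success and on failure, so slow errors are measured too.
        LATENCY.observe(time.perf_counter() - start)


def metrics_payload() -> bytes:
    """Body for a /metrics endpoint in the Prometheus text format."""
    return generate_latest(registry)
```

With the histogram in place, a p95 can be estimated on the Prometheus side with a query along the lines of `histogram_quantile(0.95, sum(rate(predict_latency_seconds_bucket[5m])) by (le))`. Throughput falls out of the rate of the request counter.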

Packaging for reproducibility. The service has to start the same way on your machine, in CI and on the server. A container with pinned dependencies and a single start command removes the "it worked on my machine" class of failures. You do not need a full serving platform for this; for many cases, a container with an HTTP server and its metrics is enough and needs little upkeep.

I gathered this minimum (model loading, a validated `/predict`, `/health` and Prometheus `/metrics`, all started with one command) into servectl. It does not replace an industrial serving platform, but it covers the stretch from "I have a model on disk" to "I have an observable endpoint", which is exactly where many projects stall.

