
LLMs in production: a practical guide to deployment at scale
Deploying large language models in production requires careful orchestration of infrastructure, model serving, and monitoring. This guide walks through key patterns we've battle-tested at Euranova.
Choosing a serving framework
The first decision is selecting the right inference engine. For most production workloads, we recommend vLLM for its PagedAttention memory management and continuous batching capabilities.
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
for output in outputs:
    print(output.outputs[0].text)
The key insight is that batching requests intelligently can improve aggregate throughput by up to 10x compared to naive sequential processing, since each forward pass serves many requests at once.
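To build intuition for why batching pays off, here is a deliberately simplified toy model (ours, not vLLM internals): assume one decode step over a batch costs roughly the same as a step over a single request, because the GPU is underutilized at batch size 1.

```python
# Toy cost model: each request needs `n` decode steps; one forward pass
# over a batch costs 1 "step" whether it holds 1 request or many.
def sequential_steps(token_counts):
    # One request at a time: total cost is the sum of all token counts.
    return sum(token_counts)

def batched_steps(token_counts):
    # All requests decoded together; finished requests drop out of the
    # batch, so the run lasts as long as the longest request.
    return max(token_counts)

requests = [512, 256, 128, 64, 480, 300, 200, 100]
print(sequential_steps(requests))  # 2040
print(batched_steps(requests))     # 512
```

In this sketch the batched schedule finishes in about a quarter of the steps; real speedups depend on batch sizes, sequence lengths, and memory pressure, which is exactly what continuous batching and PagedAttention manage for you.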
Infrastructure as Code
We define our GPU clusters using Terraform, ensuring reproducibility across environments:
resource "kubernetes_deployment" "llm_serving" {
  metadata {
    name   = "llm-serving"
    labels = { app = "llm-api" }
  }
  spec {
    replicas = 3
    selector { match_labels = { app = "llm-api" } }
    template {
      # The pod template must carry the labels the selector matches on.
      metadata {
        labels = { app = "llm-api" }
      }
      spec {
        container {
          name  = "vllm"
          image = "vllm/vllm-openai:latest"
          resources {
            limits   = { "nvidia.com/gpu" = "4" }
            requests = { memory = "64Gi", cpu = "16" }
          }
        }
      }
    }
  }
}
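As a sanity check on the four GPUs requested above, a rough back-of-the-envelope sizing shows why a 70B model needs tensor parallelism at all. The helper below is ours, and it assumes fp16 weights at 2 bytes per parameter, ignoring KV cache and activation overhead:

```python
def weight_memory_per_gpu_gb(params_billions, bytes_per_param=2, num_gpus=1):
    """Approximate model weight memory per GPU in GB (KV cache excluded)."""
    total_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return total_gb / num_gpus

# Llama-3-70B in fp16: ~140 GB of weights total, ~35 GB per GPU when
# sharded across 4 GPUs, leaving headroom on 80 GB cards for KV cache.
print(weight_memory_per_gpu_gb(70))              # 140.0
print(weight_memory_per_gpu_gb(70, num_gpus=4))  # 35.0
```

The KV cache grows with batch size and context length, so in practice you should budget well beyond the raw weight footprint.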
Monitoring & observability
Once deployed, tracking token throughput, latency percentiles, and error rates is critical. We expose Prometheus metrics from the serving layer:
import { Counter, Histogram } from "prom-client";

const tokenCounter = new Counter({
  name: "llm_tokens_generated_total",
  help: "Total tokens generated",
  labelNames: ["model", "status"],
});

const latencyHistogram = new Histogram({
  name: "llm_request_duration_seconds",
  help: "Request duration in seconds",
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
});
With these building blocks in place, you can confidently scale LLM workloads from prototype to production.

