
LLMs in production: a practical guide to deployment at scale
Deploying large language models in production requires careful orchestration of infrastructure, model serving, and monitoring. This guide walks through key patterns we've battle-tested at Euranova.
Choosing a serving framework
The first decision is selecting the right inference engine. For most production workloads, we recommend vLLM for its PagedAttention memory management and continuous batching capabilities.
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
for output in outputs:
    print(output.outputs[0].text)
The key insight is that batching requests intelligently can improve aggregate throughput by up to 10x compared to naive sequential processing, since each forward pass serves many requests at once.
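To build intuition for why batching pays off, here is a deliberately simplified toy model (ours, not vLLM internals): assume one decode step over a batch costs roughly the same as a step over a single request, because the GPU is underutilized at batch size 1.

```python
# Toy cost model: each request needs `n` decode steps; one forward pass
# over a batch costs 1 "step" whether it holds 1 request or many.
def sequential_steps(token_counts):
    # One request at a time: total cost is the sum of all token counts.
    return sum(token_counts)

def batched_steps(token_counts):
    # All requests decoded together; finished requests drop out of the
    # batch, so the run lasts as long as the longest request.
    return max(token_counts)

requests = [512, 256, 128, 64, 480, 300, 200, 100]
print(sequential_steps(requests))  # 2040
print(batched_steps(requests))     # 512
```

In this sketch the batched schedule finishes in about a quarter of the steps; real speedups depend on batch sizes, sequence lengths, and memory pressure, which is exactly what continuous batching and PagedAttention manage for you.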
Infrastructure as Code
We define our GPU clusters using Terraform, ensuring reproducibility across environments:
resource "kubernetes_deployment" "llm_serving" {
  metadata {
    name   = "llm-serving"
    labels = { app = "llm-api" }
  }
  spec {
    replicas = 3
    selector { match_labels = { app = "llm-api" } }
    template {
      # The pod template must carry the labels the selector matches on.
      metadata {
        labels = { app = "llm-api" }
      }
      spec {
        container {
          name  = "vllm"
          image = "vllm/vllm-openai:latest"
          resources {
            limits   = { "nvidia.com/gpu" = "4" }
            requests = { memory = "64Gi", cpu = "16" }
          }
        }
      }
    }
  }
}
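As a sanity check on the four GPUs requested above, a rough back-of-the-envelope sizing shows why a 70B model needs tensor parallelism at all. The helper below is ours, and it assumes fp16 weights at 2 bytes per parameter, ignoring KV cache and activation overhead:

```python
def weight_memory_per_gpu_gb(params_billions, bytes_per_param=2, num_gpus=1):
    """Approximate model weight memory per GPU in GB (KV cache excluded)."""
    total_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return total_gb / num_gpus

# Llama-3-70B in fp16: ~140 GB of weights total, ~35 GB per GPU when
# sharded across 4 GPUs, leaving headroom on 80 GB cards for KV cache.
print(weight_memory_per_gpu_gb(70))              # 140.0
print(weight_memory_per_gpu_gb(70, num_gpus=4))  # 35.0
```

The KV cache grows with batch size and context length, so in practice you should budget well beyond the raw weight footprint.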
Monitoring & observability
Once deployed, tracking token throughput, latency percentiles, and error rates is critical. We expose Prometheus metrics from the serving layer:
import { Counter, Histogram } from "prom-client";

const tokenCounter = new Counter({
  name: "llm_tokens_generated_total",
  help: "Total tokens generated",
  labelNames: ["model", "status"],
});

const latencyHistogram = new Histogram({
  name: "llm_request_duration_seconds",
  help: "Request duration in seconds",
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
});
With these building blocks in place, you can confidently scale LLM workloads from prototype to production.

