Skip to main content

Model Serving & Inference

Goals

  • Low latency, predictable throughput, and controlled cost
  • Safe rollouts, quick rollbacks, and reproducible deployments

Patterns

  • Stateless inference pods behind an API gateway (REST/gRPC)
  • Autoscaling on concurrency/queue depth; proactive warm‑up for GPUs
  • Canary and blue‑green for new model versions
  • Multi‑model routing via a policy layer (A/B, shadow traffic)
  • Request batching and dynamic padding
  • Token‑level stream responses for LLMs to improve perceived latency

Hardware & Optimization

  • Choose accelerators based on model size/concurrency, not hype
  • Quantization (INT8/FP8), pruning, distillation to shrink models
  • Use optimized runtimes: TensorRT, ONNX Runtime, vLLM, FasterTransformer
  • Cache repetitive inference or reranking results

Interfaces

  • gRPC for high‑throughput internal traffic; REST for external clients
  • Structured error schemas and rate‑limits; return inference metadata

Rollouts

  • Gate deploys on offline eval + small online canaries
  • Shadow traffic for safety; capture prompts/inputs for replayable analysis

Security

  • AuthN/Z, tenant isolation, and PII handling
  • For LLMs: guardrails, content filters, jailbreak/prompt‑injection detection