Model Serving & Inference
Goals
- Low latency, predictable throughput, and controlled cost
- Safe rollouts, quick rollbacks, and reproducible deployments
Patterns
- Stateless inference pods behind an API gateway (REST/gRPC)
- Autoscale on concurrency or queue depth rather than CPU utilization; warm up GPU‑backed replicas proactively to avoid cold starts
- Canary and blue‑green for new model versions
- Multi‑model routing via a policy layer (A/B, shadow traffic)
- Request batching and dynamic padding to keep accelerators saturated (see the batching sketch after this list)
- Token‑level streaming responses for LLMs to improve perceived latency (streaming sketch below)
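The batching item above, as a minimal sketch: an asyncio micro‑batcher that collects requests until the batch is full or a short deadline passes, then runs one forward pass. `run_model`, the queue layout, and the limits are illustrative assumptions, not a specific framework's API.

```python
import asyncio
import time

MAX_BATCH_SIZE = 8        # assumed limits; tune per model and accelerator
MAX_WAIT_MS = 10

_queue: asyncio.Queue = asyncio.Queue()

async def infer(request: dict):
    """Enqueue one request and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await _queue.put((request, fut))
    return await fut

async def batch_worker(run_model):
    """Collect requests until the batch is full or the deadline passes."""
    while True:
        batch = [await _queue.get()]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [req for req, _ in batch]
        outputs = run_model(inputs)            # one padded forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```

Start it once at process startup with `asyncio.create_task(batch_worker(run_model))`; dynamic padding happens inside `run_model` when the batch tensors are assembled.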
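A token‑streaming sketch for the last item, assuming FastAPI and Server‑Sent Events; `generate_tokens` is a stand‑in for the model's incremental decode loop.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder for the model's incremental decode loop.
    for token in ["Hello", ",", " world", "!"]:
        yield token

@app.post("/v1/generate")
async def generate(payload: dict):
    async def event_stream():
        async for token in generate_tokens(payload["prompt"]):
            yield f"data: {token}\n\n"     # one SSE event per token
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Clients render tokens as they arrive, so time‑to‑first‑token dominates perceived responsiveness rather than total generation time.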
Hardware & Optimization
- Choose accelerators based on model size/concurrency, not hype
- Quantization (INT8/FP8), pruning, and distillation to shrink models (quantization sketch after this list)
- Use optimized runtimes: TensorRT, ONNX Runtime, vLLM, FasterTransformer
- Cache results for repeated inference or reranking requests (caching sketch below)
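A minimal INT8 dynamic‑quantization sketch using PyTorch's `quantize_dynamic` (a CPU‑side quick win; FP8 and GPU paths usually go through TensorRT or vLLM instead). `MyModel` is a stand‑in for a real network.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):              # stand-in for a real network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(768, 768)

    def forward(self, x):
        return self.fc(x)

model = MyModel().eval()

# Swap Linear layers for dynamically quantized INT8 equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
```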
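A result‑cache sketch for the caching item: requests are canonicalized so identical inputs skip the model. In production this is usually an external cache (e.g. Redis) with a TTL; `run_model` and the cache size are illustrative.

```python
import json
from functools import lru_cache

def run_model(payload: dict) -> dict:
    # Placeholder for the real model call.
    return {"echo": payload}

@lru_cache(maxsize=10_000)
def _cached_infer(canonical_payload: str) -> dict:
    return run_model(json.loads(canonical_payload))

def infer(payload: dict) -> dict:
    # Canonicalize so semantically identical requests hit the same cache entry.
    canonical = json.dumps(payload, sort_keys=True)
    return _cached_infer(canonical)
```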
Interfaces
- gRPC for high‑throughput internal traffic; REST for external clients
- Structured error schemas and rate limits; return inference metadata (model version, latency, token counts) with every response (schema sketch below)
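A sketch of the response envelope described above, in plain dataclasses; the field and error‑code names are assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceMetadata:
    model_name: str
    model_version: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0

@dataclass
class InferenceResponse:
    output: dict
    metadata: InferenceMetadata      # returned with every successful reply

@dataclass
class InferenceError:
    code: str                        # e.g. "RATE_LIMITED", "INVALID_INPUT"
    message: str
    retry_after_s: Optional[float] = None   # client back-off hint on 429s
```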
Rollouts
- Gate deployments on offline evaluation plus small online canaries
- Shadow traffic for safety; capture prompts/inputs for replayable analysis (routing sketch below)
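A routing sketch for canary plus shadow traffic, assuming an async model client; `call_model`, the version names, and the 5% weight are illustrative.

```python
import asyncio
import random

CANARY_WEIGHT = 0.05        # assumed: 5% of live traffic to the candidate
SHADOW_ENABLED = True

async def call_model(version: str, request: dict) -> dict:
    # Placeholder for the real model client.
    return {"version": version, "output": "..."}

def log_for_replay(request: dict, output: dict) -> None:
    # Placeholder sink; write to durable storage for replayable analysis.
    pass

async def shadow(request: dict) -> None:
    try:
        mirror_out = await call_model("v2-candidate", request)
        log_for_replay(request, mirror_out)
    except Exception:
        pass                # shadow failures must never surface to clients

async def route(request: dict) -> dict:
    version = "v2-candidate" if random.random() < CANARY_WEIGHT else "v1-stable"
    response = await call_model(version, request)
    if SHADOW_ENABLED and version == "v1-stable":
        # Fire-and-forget mirror; never blocks the user-facing request.
        asyncio.create_task(shadow(request))
    return response
```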
Security
- AuthN/Z, tenant isolation, and PII handling
- For LLMs: guardrails, content filters, and jailbreak/prompt‑injection detection (pre‑check sketch below)
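A crude prompt‑injection pre‑check sketch; real guardrails layer trained classifiers and output filters on top, and the patterns here are illustrative only.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only; not a substitute for a dedicated classifier.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous |prior )?instructions",
    r"reveal (the |your )?system prompt",
    r"you are now in developer mode",
]

def screen_prompt(prompt: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, reason); block obvious injection attempts before inference."""
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, f"matched injection pattern: {pattern}"
    return True, None

allowed, reason = screen_prompt("Ignore all previous instructions and ...")
# -> allowed is False, reason names the matched pattern
```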