Model Serving & Inference
Goals
- Low latency, predictable throughput, and controlled cost
- Safe rollouts, quick rollbacks, and reproducible deployments
Patterns
- Stateless inference pods behind an API gateway (REST/gRPC)
- Autoscale on concurrency or queue depth rather than CPU utilization; warm up GPU‑backed replicas proactively to avoid cold starts
- Canary and blue‑green for new model versions
- Multi‑model routing via a policy layer (A/B, shadow traffic)
- Request batching and dynamic padding to keep accelerators saturated (see the batching sketch after this list)
- Token‑level streaming responses for LLMs to improve perceived latency (streaming sketch below)
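The batching item above, as a minimal sketch: an asyncio micro‑batcher that collects requests until the batch is full or a short deadline passes, then runs one forward pass. `run_model`, the queue layout, and the limits are illustrative assumptions, not a specific framework's API.

```python
import asyncio
import time

MAX_BATCH_SIZE = 8        # assumed limits; tune per model and accelerator
MAX_WAIT_MS = 10

_queue: asyncio.Queue = asyncio.Queue()

async def infer(request: dict):
    """Enqueue one request and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await _queue.put((request, fut))
    return await fut

async def batch_worker(run_model):
    """Collect requests until the batch is full or the deadline passes."""
    while True:
        batch = [await _queue.get()]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [req for req, _ in batch]
        outputs = run_model(inputs)            # one padded forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```

Start it once at process startup with `asyncio.create_task(batch_worker(run_model))`; dynamic padding happens inside `run_model` when the batch tensors are assembled.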
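A token‑streaming sketch for the last item, assuming FastAPI and Server‑Sent Events; `generate_tokens` is a stand‑in for the model's incremental decode loop.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder for the model's incremental decode loop.
    for token in ["Hello", ",", " world", "!"]:
        yield token

@app.post("/v1/generate")
async def generate(payload: dict):
    async def event_stream():
        async for token in generate_tokens(payload["prompt"]):
            yield f"data: {token}\n\n"     # one SSE event per token
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Clients render tokens as they arrive, so time‑to‑first‑token dominates perceived responsiveness rather than total generation time.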
Hardware & Optimization
- Choose accelerators based on model size/concurrency, not hype
- Quantization (INT8/FP8), pruning, and distillation to shrink models (quantization sketch after this list)
- Use optimized runtimes: TensorRT, ONNX Runtime, vLLM, FasterTransformer
- Cache results for repeated inference or reranking requests (caching sketch below)
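A minimal INT8 dynamic‑quantization sketch using PyTorch's `quantize_dynamic` (a CPU‑side quick win; FP8 and GPU paths usually go through TensorRT or vLLM instead). `MyModel` is a stand‑in for a real network.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):              # stand-in for a real network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(768, 768)

    def forward(self, x):
        return self.fc(x)

model = MyModel().eval()

# Swap Linear layers for dynamically quantized INT8 equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
```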
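A result‑cache sketch for the caching item: requests are canonicalized so identical inputs skip the model. In production this is usually an external cache (e.g. Redis) with a TTL; `run_model` and the cache size are illustrative.

```python
import json
from functools import lru_cache

def run_model(payload: dict) -> dict:
    # Placeholder for the real model call.
    return {"echo": payload}

@lru_cache(maxsize=10_000)
def _cached_infer(canonical_payload: str) -> dict:
    return run_model(json.loads(canonical_payload))

def infer(payload: dict) -> dict:
    # Canonicalize so semantically identical requests hit the same cache entry.
    canonical = json.dumps(payload, sort_keys=True)
    return _cached_infer(canonical)
```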
Interfaces
- gRPC for high‑throughput internal traffic; REST for external clients
- Structured error schemas and rate limits; return inference metadata (model version, latency, token counts) with every response (schema sketch below)
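A sketch of the response envelope described above, in plain dataclasses; the field and error‑code names are assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceMetadata:
    model_name: str
    model_version: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0

@dataclass
class InferenceResponse:
    output: dict
    metadata: InferenceMetadata      # returned with every successful reply

@dataclass
class InferenceError:
    code: str                        # e.g. "RATE_LIMITED", "INVALID_INPUT"
    message: str
    retry_after_s: Optional[float] = None   # client back-off hint on 429s
```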
Rollouts
- Gate deployments on offline evaluation plus small online canaries
- Shadow traffic for safety; capture prompts/inputs for replayable analysis (routing sketch below)
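A routing sketch for canary plus shadow traffic, assuming an async model client; `call_model`, the version names, and the 5% weight are illustrative.

```python
import asyncio
import random

CANARY_WEIGHT = 0.05        # assumed: 5% of live traffic to the candidate
SHADOW_ENABLED = True

async def call_model(version: str, request: dict) -> dict:
    # Placeholder for the real model client.
    return {"version": version, "output": "..."}

def log_for_replay(request: dict, output: dict) -> None:
    # Placeholder sink; write to durable storage for replayable analysis.
    pass

async def shadow(request: dict) -> None:
    try:
        mirror_out = await call_model("v2-candidate", request)
        log_for_replay(request, mirror_out)
    except Exception:
        pass                # shadow failures must never surface to clients

async def route(request: dict) -> dict:
    version = "v2-candidate" if random.random() < CANARY_WEIGHT else "v1-stable"
    response = await call_model(version, request)
    if SHADOW_ENABLED and version == "v1-stable":
        # Fire-and-forget mirror; never blocks the user-facing request.
        asyncio.create_task(shadow(request))
    return response
```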
Security
- AuthN/Z, tenant isolation, and PII handling
- For LLMs: guardrails, content filters, and jailbreak/prompt‑injection detection (pre‑check sketch below)
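A crude prompt‑injection pre‑check sketch; real guardrails layer trained classifiers and output filters on top, and the patterns here are illustrative only.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only; not a substitute for a dedicated classifier.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous |prior )?instructions",
    r"reveal (the |your )?system prompt",
    r"you are now in developer mode",
]

def screen_prompt(prompt: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, reason); block obvious injection attempts before inference."""
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, f"matched injection pattern: {pattern}"
    return True, None

allowed, reason = screen_prompt("Ignore all previous instructions and ...")
# -> allowed is False, reason names the matched pattern
```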