
Monitoring, Drift & Evaluation

Observability Pillars

  • Metrics: latency (p50/p95), throughput, errors, GPU/CPU utilization
  • Logs: request inputs and outputs with policy‑aware redaction, model decisions, and guardrail actions
  • Traces: end‑to‑end request path (feature fetch → inference → post‑processing); see the trace sketch after this list
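
A minimal sketch of the traces pillar using the OpenTelemetry Python SDK (`opentelemetry-sdk`). The span names, attributes, and console exporter are illustrative placeholders; a real deployment would use an OTLP exporter pointed at a collector.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer; swap ConsoleSpanExporter for an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-service")  # illustrative service name

def handle_request(request_id: str, payload: dict) -> dict:
    # One parent span per request, one child span per stage of the path.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("request.id", request_id)
        with tracer.start_as_current_span("feature_fetch"):
            features = {"f1": 0.3}                     # stand-in for a feature-store call
        with tracer.start_as_current_span("inference") as inf:
            inf.set_attribute("model.version", "v3")   # illustrative attribute
            prediction = sum(features.values())        # stand-in for the model call
        with tracer.start_as_current_span("post_processing"):
            return {"prediction": prediction}
```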

ML/LLM Specifics

  • Quality: accuracy, BLEU/ROUGE, hallucination rate, safety violations
  • Data Drift: feature distribution shifts; prompt distribution changes for LLMs (see the PSI sketch after this list)
  • Performance: tokens/sec, context length utilization, cache hit rate
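
One common way to quantify drift in a single feature (or a prompt statistic such as length) is the Population Stability Index. The sketch below uses only NumPy; the bin count and the 0.1/0.25 interpretation bands are conventional rules of thumb, not values from this document.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference sample (e.g. training data) and a live sample of one feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    # Clamp live values into the reference range so nothing falls outside the bins.
    live = np.clip(live, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Floor the proportions so empty bins don't produce log(0) or division by zero.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift worth alerting on.
psi = population_stability_index(np.random.normal(0.0, 1.0, 10_000),
                                 np.random.normal(0.3, 1.0, 10_000))
```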

Continuous Evaluation

  • Maintain golden datasets and prompt suites
  • Automate offline eval in CI, with thresholds that block deploys on regression (see the gate sketch after this list)
  • Sample online traffic for periodic eval and regression detection
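
A sketch of a CI gate over a golden dataset. The JSONL schema, threshold values, and `predict` stub are assumptions to be replaced with the real model client and the metrics you actually track.

```python
import json
import sys

# Illustrative gates; tune per model and dataset.
THRESHOLDS = {"accuracy_min": 0.85, "hallucination_rate_max": 0.05}

def run_eval(golden_path: str, predict) -> dict:
    """Score the model over a golden dataset of {"prompt", "expected", "must_not_contain"} records."""
    total = correct = hallucinated = 0
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)
            output = predict(case["prompt"])
            total += 1
            correct += int(output.strip() == case["expected"].strip())
            forbidden = case.get("must_not_contain")
            hallucinated += int(bool(forbidden) and forbidden in output)
    return {"accuracy": correct / total, "hallucination_rate": hallucinated / total}

def gate(metrics: dict) -> int:
    """Return non-zero so the CI job (and the deploy behind it) fails on regression."""
    failed = []
    if metrics["accuracy"] < THRESHOLDS["accuracy_min"]:
        failed.append("accuracy")
    if metrics["hallucination_rate"] > THRESHOLDS["hallucination_rate_max"]:
        failed.append("hallucination_rate")
    print(json.dumps({"metrics": metrics, "failed": failed}))
    return 1 if failed else 0

if __name__ == "__main__":
    metrics = run_eval("golden/evals.jsonl", predict=lambda prompt: "stub")  # wire in the real model
    sys.exit(gate(metrics))
```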

Tooling

  • Export standardized metrics (see the exporter sketch after this list); use OpenTelemetry for traces
  • Build eval pipelines; store results with run metadata
  • Alert on quality regressions, drift, and anomalous inputs/outputs
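
A sketch of exporting standardized metrics with `prometheus_client`. The metric names, label, bucket boundaries, and port are illustrative; alert rules (e.g. on the accuracy gauge or the violation counter) would live in the monitoring backend, not in this code.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Standardized metrics that dashboards and alert rules can key on.
REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "End-to-end request latency",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
GUARDRAIL_VIOLATIONS = Counter(
    "guardrail_violations_total", "Outputs blocked by a guardrail", ["rule"]
)
EVAL_ACCURACY = Gauge(
    "offline_eval_accuracy", "Accuracy from the latest golden-dataset run"
)

def serve_request() -> None:
    with REQUEST_LATENCY.time():               # records one latency observation
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for feature fetch + inference
    if random.random() < 0.02:                 # stand-in for a guardrail decision
        GUARDRAIL_VIOLATIONS.labels(rule="pii").inc()

if __name__ == "__main__":
    start_http_server(9100)    # exposes /metrics for the Prometheus scraper
    EVAL_ACCURACY.set(0.91)    # would be set by the eval pipeline after each run
    while True:
        serve_request()
```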