78% had outages with zero alerts fired — plus decorator crash loops & AI SRE evals
Incidents & Postmortems
- A 3x-reviewed refactor stacking 5 decorators (logging, caching, retry, rate limiter, metrics) caused a 28-minute Kubernetes crash loop when the rate limiter's constructor subscribed to an event bus not yet initialized at pod startup — triggering 47 synchronous retries before the first health check passed.
- The retry decorator masked the real failure: CPU hit 99%, liveness probe timed out, pod restarted, loop repeated. Each decorator was individually correct and passed its own unit tests — the composition broke at the interface.
- Three fixes: lazy initialization (subscribe on first use, not in constructor), init timeout with static config fallback, and a Kubernetes startup probe separated from the liveness probe. The team also added one PR template question: "If this component fails silently during initialization, what is the blast radius?" — which caught two more issues before shipping.
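The first two fixes (lazy initialization and an init timeout with a static fallback) can be sketched as below. This is a minimal illustration, not the team's actual code: the `LazyRateLimiter` class, the `subscribe` callable, and the default values are all hypothetical names chosen for the example.

```python
import threading


class LazyRateLimiter:
    """Hypothetical rate limiter that defers its event-bus subscription."""

    def __init__(self, subscribe, static_limit=100, init_timeout_s=0.5):
        # The constructor only stores configuration; it performs no I/O,
        # so building the decorator stack cannot block pod startup.
        self._subscribe = subscribe          # assumed: returns the limit from the bus
        self._static_limit = static_limit    # assumed static config fallback
        self._init_timeout_s = init_timeout_s
        self._limit = None
        self._lock = threading.Lock()

    def _ensure_initialized(self):
        with self._lock:
            if self._limit is not None:
                return
            result = {}

            def run():
                try:
                    result["limit"] = self._subscribe()
                except Exception:
                    pass  # a failed subscription is handled like a timeout

            # One bounded attempt in a daemon thread, not a synchronous retry loop.
            worker = threading.Thread(target=run, daemon=True)
            worker.start()
            worker.join(self._init_timeout_s)
            # Fall back to static config if the bus did not answer in time.
            self._limit = result.get("limit", self._static_limit)

    def allow(self, current_count):
        """Return True if one more request fits under the limit."""
        self._ensure_initialized()  # subscribe on first use
        return current_count < self._limit
```

In this sketch the fallback is permanent once chosen; a real component would probably re-attempt the subscription in the background. The third fix, a startup probe separate from the liveness probe, is Kubernetes configuration and is not shown here.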
Alert Fatigue Is Now a Measurable Reliability Risk
- NeuBird AI's 2026 State of Production Reliability and AI Adoption Report (n=1,039 SRE, DevOps, and ITOps professionals, February 2026) found that 44% of organizations experienced outages linked to suppressed or ignored alerts, and 78% had incidents where no alerts fired at all — failures discovered by customers first.
- 77% of on-call teams receive at least 10 alerts daily; 57% say fewer than 30% are actionable. 83% of engineers admit to occasionally dismissing alerts. Nearly 40% of organizations report over a quarter of their on-call engineers show burnout symptoms from incident management.
- The exec/practitioner AI perception gap is wide: 74% of executives report AI is actively used for incident management, vs. 39% of practitioners. Executives are 3× more likely than practitioners to say AI has significantly reduced operational toil (35% vs. 12%).
- NeuBird raised $19.3M and launched Falcon, an autonomous production operations agent positioned to prevent issues proactively rather than respond reactively. The report pegs downtime cost at $50K+ per hour for 61% of respondents, $100K+ for 34%.
Evaluating Autonomous SRE Agents in Production
- Datadog published how they built a replayable evaluation platform for Bits AI SRE, their autonomous incident investigation agent. The core problem: new features improved specific scenarios while silently regressing others, with no reliable way to measure agent behavior across diverse production incidents.
- The platform uses "world snapshots" — capturing signal structure and relationships (queries, logs, deployment events) rather than raw telemetry, since telemetry TTLs expire. Evaluation worlds are intentionally noisy: unrelated services, background errors, and tangential signals are included to mirror real production, preventing the agent from looking more accurate than it is.
- Label creation was initially manual and didn't scale. Datadog now uses Bits itself to generate evaluation labels from user feedback and investigation telemetry, increasing label creation by an order of magnitude. Agentic validation then reduced per-label review time by over 95%.
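The regression problem described above (a change helps some scenarios while quietly hurting others) is easiest to see when two agent versions are compared scenario by scenario rather than by aggregate accuracy. The sketch below is a generic illustration under that assumption; `compare_runs` and the scenario names are hypothetical and are not Datadog's platform or API.

```python
from typing import Dict, List, NamedTuple


class Comparison(NamedTuple):
    improved: List[str]
    regressed: List[str]
    baseline_accuracy: float
    candidate_accuracy: float


def compare_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Comparison:
    """Compare per-scenario pass/fail results for two agent versions.

    Each dict maps a scenario ID (for example, one replayed snapshot) to
    whether the agent reached the labeled root cause.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no scenarios in common")
    improved = [s for s in shared if candidate[s] and not baseline[s]]
    regressed = [s for s in shared if baseline[s] and not candidate[s]]
    base_acc = sum(baseline[s] for s in shared) / len(shared)
    cand_acc = sum(candidate[s] for s in shared) / len(shared)
    return Comparison(improved, regressed, base_acc, cand_acc)


# Hypothetical results: the aggregate improves while one scenario regresses.
baseline = {"db-failover": True, "bad-deploy": False, "dns-flap": True, "oom-kill": False}
candidate = {"db-failover": False, "bad-deploy": True, "dns-flap": True, "oom-kill": True}
report = compare_runs(baseline, candidate)
print(report.regressed, report.baseline_accuracy, report.candidate_accuracy)
```

Here accuracy rises from 0.5 to 0.75, yet `db-failover` regresses; an aggregate-only dashboard would hide that.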
OTel for Agentic Systems: Standards vs. Reality
- Red Hat published a practical guide to distributed tracing for multi-agent workflows using OpenTelemetry, based on their it-self-service-agent quickstart. A key finding: Llama Stack does not automatically propagate span context to MCP servers during Responses API calls, so the parent trace context must be injected manually into HTTP headers.
- Sentry's developer guide to AI agent observability recommends using OTel gen_ai semantic conventions as the structured foundation: gen_ai.invoke_agent for full run lifecycle, gen_ai.execute_tool for individual tool calls. They recommend 100% sampling for AI traces, because aggregate metrics alone can't surface where multi-turn reasoning fails.
- A 2026 Observability Survey found 77% of teams call open standards strategically essential, but only 25% actually prioritize them when selecting tools; cost and ease of adoption win at decision time despite stated OTel/Prometheus preference.
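To make "manual injection of parent trace context into HTTP headers" concrete, here is a minimal standard-library sketch of the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. It is not Red Hat's code: the `SpanContext` class and helper names are hypothetical, and a real service would normally let the OpenTelemetry SDK's propagator build the header instead of formatting it by hand.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str         # 32 lowercase hex characters
    span_id: str          # 16 lowercase hex characters
    sampled: bool = True


def new_root_span() -> SpanContext:
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanContext) -> SpanContext:
    # A child keeps the trace ID and gets a fresh span ID.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject_traceparent(ctx: SpanContext, headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of headers carrying ctx as a traceparent header."""
    flags = "01" if ctx.sampled else "00"
    out = dict(headers)
    out["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
    return out


def extract_traceparent(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or parts[0] != "00":
        return None
    _, trace_id, span_id, flags = parts
    if len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        return None
    if not set(trace_id + span_id + flags) <= HEX:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


# The caller injects the current span before the outbound tool-server request.
outbound = inject_traceparent(child_of(new_root_span()), {"content-type": "application/json"})
```

The receiving server would call something like `extract_traceparent` on the incoming headers and start its own spans as children, so that the tool call appears inside the agent's trace instead of as a disconnected one.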