43% of AI-generated fixes need a human anyway
Incidents & Postmortems
GitHub published the architectural root cause behind its Feb–March outage wave: tight service coupling and missing backpressure mechanisms allowed localized faults to cascade, with the Feb 9 incident tracing to config changes that triggered excessive database background processing; OpenAI is reportedly exploring GitHub alternatives following repeated engineering workflow disruptions. (Previously covered April 16: four structurally distinct incident classes — now the shared architectural explanation is public.)
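For readers less familiar with the term, backpressure means a service refuses or sheds work once a bounded buffer fills, rather than queueing without limit and passing the overload downstream. The sketch below is a minimal illustration of that idea only. The `BoundedWorkQueue` class and its method names are hypothetical and are not taken from GitHub's write-up.

```python
import queue
from typing import Optional


class BoundedWorkQueue:
    """Hypothetical bounded queue that rejects work instead of buffering without limit."""

    def __init__(self, max_pending: int) -> None:
        # maxsize caps how much work can pile up in front of a slow dependency.
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, job: str) -> bool:
        """Return True if the job was accepted, False if it was shed."""
        try:
            self._pending.put_nowait(job)
            return True
        except queue.Full:
            # Signal the caller to back off (for example, HTTP 429 or a retry-after).
            self.rejected += 1
            return False

    def drain_one(self) -> Optional[str]:
        """Take the oldest pending job, or None if nothing is waiting."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


work = BoundedWorkQueue(max_pending=2)
accepted = [work.submit(job) for job in ("a", "b", "c")]
```

With a limit of two, the third submission is refused and counted, so the caller sees the overload immediately instead of the fault spreading to whatever sits behind the queue.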
A missing null check in a logging statement caused a P1 cascade: a NullPointerException in what reviewers treated as harmless logging code brought down a production service. The postmortem's finding: shallow system knowledge and the assumption that logging paths are safe are the exact conditions that bypass code review scrutiny.
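The failure mode is easy to reproduce. The incident involved a NullPointerException. The sketch below shows the Python analog, an `AttributeError` on `None`, under the assumption that the log line dereferenced an optional field. The `Order` and `Customer` types and both function names are hypothetical, not from the postmortem.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("orders")


@dataclass
class Customer:
    name: str


@dataclass
class Order:
    order_id: str
    customer: Optional[Customer] = None  # may legitimately be missing


def log_order_unsafe(order: Order) -> None:
    # order.customer.name is evaluated before logger.info is even called,
    # so a missing customer raises AttributeError whatever the log level is.
    logger.info("processing order %s for %s", order.order_id, order.customer.name)


def log_order_safe(order: Order) -> None:
    # Guard the optional field so the logging path cannot fail the request.
    customer_name = order.customer.name if order.customer is not None else "<unknown>"
    logger.info("processing order %s for %s", order.order_id, customer_name)
```

The point for reviewers: the arguments to a log call are ordinary expressions and can throw like any other code, so they deserve the same scrutiny as the business logic around them.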
eth.limo's DNS was hijacked via social engineering: an attacker impersonated a team member to initiate EasyDNS account recovery, gained control of the NS records, and repointed DNS to attacker-controlled servers. DNSSEC limited the blast radius: the forged records lacked valid cryptographic signatures, so validating resolvers dropped them. EasyDNS's CEO called it the first successful social engineering attack in 28 years and is migrating high-value domains to a no-account-recovery tier.
GrafanaCON 2026: AI Observability and Loki Redesign
Grafana 13 adds AI Observability to Grafana Cloud (public preview) — real-time AI agent monitoring, output evaluation with alerts, and correlation of agent sessions with application telemetry — with Grafana Assistant now available for OSS and Enterprise self-managed deployments.
grafana/o11y-bench was open-sourced as a benchmark for evaluating AI agents on observability workflows; one GrafanaCON winner reduced incident response time from 3 days to 1 minute using Grafana Assistant for automated investigation.
Loki's redesigned architecture ships Kafka-backed ingestion that separates reads from writes, columnar storage, and a new query engine claiming up to 10× faster performance on aggregated data and 20× less data scanned. Grafana Labs also acquired Logline to improve high-cardinality log indexing; OTel integration gains single-command installs with improved Kubernetes support.
OTel in Production: Scale and Scrutiny
Skyscanner manages OTel across 24 production clusters with a 6-person team — central DNS endpoint + Istio-based routing to nearest collector, gateway collectors (replica set) for bulk OTLP, DaemonSet agents for legacy Prometheus scraping. Key pattern: Istio service mesh spans converted to platform metrics without application-level instrumentation, solving Istio's native cardinality problem; practical advice: implement memory limiters and filter processors from day one.
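As a rough illustration of the spans-to-metrics pattern, the sketch below counts requests from mesh spans while keeping only a small, fixed label set. It is a toy model and not Skyscanner's pipeline: the `Span` fields, the choice of labels, and the function name are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Span:
    source_service: str
    dest_service: str
    status_code: int
    pod: str       # high-cardinality attribute, deliberately dropped below
    url_path: str  # high-cardinality attribute, deliberately dropped below


def spans_to_request_counts(spans: Iterable[Span]) -> Counter:
    """Aggregate spans into request counts keyed by a low-cardinality label set."""
    counts: Counter = Counter()
    for span in spans:
        # Collapse exact status codes into classes (2xx, 5xx, ...) to bound series count.
        status_class = f"{span.status_code // 100}xx"
        key: Tuple[str, str, str] = (span.source_service, span.dest_service, status_class)
        counts[key] += 1
    return counts


sample = [
    Span("search", "pricing", 200, "pricing-7f9c", "/v1/quote/123"),
    Span("search", "pricing", 204, "pricing-a1b2", "/v1/quote/456"),
    Span("search", "pricing", 503, "pricing-7f9c", "/v1/quote/789"),
]
request_counts = spans_to_request_counts(sample)
```

Because pod names and URL paths never become labels, the number of metric series grows with the number of service pairs and not with traffic shape.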
Dash0 argues "supports OTel" is no longer a differentiator: with ~75% of organizations using or evaluating OTel in production, buyers should now audit implementation depth — semantic consistency, end-to-end context propagation, and downstream pipeline complexity are the real separators. (Previously covered April 9: 77% of teams called open standards essential, only 25% prioritized them at tool selection — the checkbox is checked; implementation maturity is the new evaluation criterion.)
AIOps & Reliability Patterns
AWS DevOps Agent reached general availability (March 31), autonomously investigating incidents the moment an alert fires — no human prompting required — with preview metrics of 75% lower MTTR and 94% root cause accuracy. GA adds Azure and on-prem coverage, custom agent skills via MCP, and webhook triggers from CloudWatch, PagerDuty, Dynatrace, ServiceNow, Splunk, and Grafana; pricing shifts to per-second of agent operational time.
"Organizational Second Hit Syndrome" — coined by Dr. Richard Cook and John Allspaw — describes a post-incident vulnerability window where a second failure generates destructive organizational overreaction, analogous to neurological second-impact syndrome. SRE Weekly #513 also flags: uncapped autoscaling accelerates cascades under pathological load rather than absorbing them, and ML systems need error budgets for model accuracy, data freshness, and fairness — not just uptime — since ML degrades gradually rather than failing suddenly.
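One way to read the ML error-budget point is to apply the usual SLO arithmetic to a quality signal instead of uptime. The sketch below assumes each evaluation window is classed as good or bad against a threshold (accuracy, data freshness, or a fairness metric). The function name and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, bad_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    slo_target is the required share of good events, for example 0.99.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # nothing measured yet, so nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Hypothetical signals: (SLO target, bad windows, total windows).
signals = {
    "accuracy": (0.99, 4, 1000),
    "freshness": (0.95, 100, 1000),
}
remaining = {
    name: error_budget_remaining(target, bad, total)
    for name, (target, bad, total) in signals.items()
}
```

Tracking the budget per signal makes gradual degradation visible: a slowly shrinking accuracy budget can trigger a review long before anything looks like an outage.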