43% of AI-generated fixes need a human anyway

Incidents & Postmortems

  • GitHub published the architectural root cause behind its Feb–March outage wave: tight service coupling and missing backpressure mechanisms allowed localized faults to cascade, with the Feb 9 incident tracing to config changes that triggered excessive database background processing; OpenAI is reportedly exploring GitHub alternatives following repeated engineering workflow disruptions. (Previously covered April 16: four structurally distinct incident classes — now the shared architectural explanation is public.)

  • A missing null check in a logging statement caused a P1 cascade: a NullPointerException in what reviewers treated as harmless logging code brought down a production service. The postmortem's finding: shallow system knowledge and the assumption that logging paths are safe are the exact conditions that bypass code review scrutiny.

  • eth.limo's DNS was hijacked via social engineering: an attacker impersonated a team member to initiate EasyDNS account recovery, gained NS record control, and flipped DNS to attacker-controlled servers. DNSSEC limited blast radius — forged records lacked valid cryptographic signatures, so resolvers dropped them; EasyDNS CEO called it the first successful social engineering attack in 28 years and is migrating high-value domains to a no-account-recovery tier.


GrafanaCON 2026: AI Observability and Loki Redesign


OTel in Production: Scale and Scrutiny


AIOps & Reliability Patterns

  • AWS DevOps Agent reached general availability (March 31), autonomously investigating incidents the moment an alert fires — no human prompting required — with preview metrics of 75% lower MTTR and 94% root cause accuracy. GA adds Azure and on-prem coverage, custom agent skills via MCP, and webhook triggers from CloudWatch, PagerDuty, Dynatrace, ServiceNow, Splunk, and Grafana; pricing shifts to per-second of agent operational time.

  • "Organizational Second Hit Syndrome" — coined by Dr. Richard Cook and John Allspaw — describes a post-incident vulnerability window where a second failure generates destructive organizational overreaction, analogous to neurological second-impact syndrome. SRE Weekly #513 also flags: uncapped autoscaling accelerates cascades under pathological load rather than absorbing them, and ML systems need error budgets for model accuracy, data freshness, and fairness — not just uptime — since ML degrades gradually rather than failing suddenly.

Get SRE Briefing in your inbox

Subscribe to receive new issues as they're published.