The five stages of grief were never about the bereaved, and the linear postmortem pipeline is that same myth in a Jira board. Recovery is an oscillation, not a march.
Almost everything you think you know about grief is a myth, and the myth has a body count.
The "five stages of grief" (denial, anger, bargaining, depression, acceptance) are so culturally embedded that people apologize for grieving in the wrong order. But Elisabeth Kübler-Ross never studied the bereaved. Her 1969 stages came from conversations with dying patients confronting their own mortality, not people who had lost someone. And when researchers went looking for evidence that the bereaved actually move through those stages, in that order, toward "closure," they didn't find it. The stages "have not been empirically demonstrated," as the clinical literature puts it bluntly; worse, the model actively harms grievers who don't fit it, who feel they're failing at grief because they're angry in month six, or fine on Tuesday and shattered on Wednesday, when the script says they should have reached acceptance and moved on.
I want you to hold that discomfort, because your engineering org runs the exact same broken model on its outages. The linear pipeline (incident → postmortem → action items → ticket closed → move on) is the five-stages fallacy in a Jira board. It promises closure. It assumes recovery is a march to a finish line. And it leaves the same wreckage: engineers who think they're failing because they're still rattled three days after the page cleared, when the system told them the incident was "resolved" at 2:47 a.m. and resolution was supposed to mean over.
There is a better model of how humans actually recover, it's the most empirically supported one we have, and it turns out to be a blueprint for on-call culture.
In 1999, two grief researchers, Margaret Stroebe and Henk Schut, published the Dual Process Model of coping with bereavement. It's now the dominant framework in the field, and its core claim is deceptively simple: healthy grieving is not a sequence of stages but a flexible oscillation between two kinds of coping.
The first is loss-oriented coping: confronting the loss directly. Sitting with the grief, replaying what happened, feeling the absence, processing the event itself. The second is restoration-oriented coping: dealing with everything the loss disrupted: rebuilding routines, taking on new roles, re-engaging with ordinary life, and, importantly, resting from the grief. And here is the part that makes the model a design principle rather than a platitude: Stroebe and Schut argue that you cannot attend to both at once. You oscillate. You confront, then you retreat and rebuild, then you confront again, dosing your exposure to the loss because constant exposure is unsurvivable and constant avoidance leaves the loss unmetabolized.
When researchers describe this model to grieving people, they tend to recognize it instantly (yes, that's exactly what it's like, I can't do the grief all day, I have to come up for air), and it holds across cultures. But its real power is in the two warnings it issues, because both of them describe a specific way that engineering teams break.
The model says: constant immersion in the loss is maladaptive. And: constant avoidance is maladaptive. Neither pole, held alone, is recovery. Most on-call cultures pick a pole.
The rumination-only team treats every outage as an open wound that never gets to close. The retro runs long and circular; old incidents get dragged back into every new postmortem; the same near-miss is re-litigated quarter after quarter; "we need to be more careful" becomes the org's entire emotional posture. This feels like diligence, like taking reliability seriously. In Dual Process terms, it's pathological loss-orientation: the team is stuck immersed, unable to oscillate out into restoration, and the result is exactly what the grief model predicts for someone who can't stop confronting the loss. They freeze. They become risk-paralyzed. They burn out, not from the incidents, but from never being allowed to look away from them.
The restoration-only team is the far more common failure, and the more invisible one, because it's disguised as health. This is the "ship forward" culture: an outage gets mitigated, a perfunctory doc gets filed, and the team is immediately back to roadmap velocity. No real processing. No sanctioned pause. "We don't dwell on failures, we learn and move on", except the learning is the part they skipped. This is the trap the Dual Process Model names precisely: restoration is not avoidance, but avoidance masquerading as restoration is the maladaptive thing. Moving on without metabolizing isn't restoration; it's suppression wearing restoration's clothes.
And it has measurable costs, in two registers. The human one: post-incident stress is commonplace among site reliability engineers (mood, sleep, and concentration disruptions that linger for a day or two after the page clears), and the 2025 Catchpoint SRE Report found that roughly 70% of SREs cite on-call stress as a direct contributor to burnout, with engineers now spending a median 30% of their week on operational work, up from 25% the year before. The "resolved" status was a lie about the human; the loss didn't end when the incident did.
The system register is just as stark. Unprocessed signal doesn't disappear; it accumulates into the next outage. Splunk's State of Observability 2025 found that roughly three-quarters of UK IT teams had suffered outages tied to missing or ignored alerts, that 54% said false alerts were actively demoralizing their staff, and that 15% admitted to deliberately ignoring or suppressing alerts. That's the org-level version of restoration-only coping: a team so starved of real processing time that it copes by not looking, and the unattended signal becomes the next incident. Avoidance at the system level is just deferred loss.
The instinct of a conscientious leader, reading this, is to lean toward the loss pole: process more, postmortem harder, never let an incident go unexamined. The Dual Process Model, and a sobering body of trauma research, says that instinct is wrong in both directions.
Restoration is not the absence of coping; it's half the mechanism. Rebuilding your routine, taking respite, getting a clean win, re-engaging with normal work: in the grief model these are active, necessary, health-producing acts, not indulgences you earn after the "real" work of mourning. You literally cannot confront the loss twenty-four hours a day; the oscillation out is what makes the oscillation in survivable. A protected recovery day after a brutal on-call week, or a no-incident sprint of pure building after a bad month, is not the team being soft. It is the restoration phase the recovery actually requires.
And the "process harder" instinct doesn't just under-rate restoration; it can do real harm, which we know because emergency services ran the experiment. Critical Incident Stress Debriefing (CISD) is a formal ritual where responders are gathered shortly after a traumatic event for a single, structured, mandatory session to process it together. It sounds obviously good, the same way "always do a deep emotional postmortem immediately" sounds obviously good. The evidence says otherwise. A Cochrane review found no evidence that single-session psychological debriefing reduces psychological morbidity, and recommended that compulsory debriefing of trauma victims cease. In some studies it made things worse: one trial of burn-trauma survivors found 26% of the debriefed group had PTSD at 13-month follow-up, versus 9% of the controls who got no debrief. Forced, single-shot, immersive processing, the thing that looks like maximum care, backfired.
The lesson isn't "don't process." It's the Dual Process lesson exactly: you dose, and you oscillate. Mandatory immersion right after the wound, with no respite arm, is not the loss pole done well; it's the rumination failure mode formalized into a calendar invite. The thing that works is bounded processing followed by genuine restoration, alternating, over time.
Here's the part you can actually build, on Monday, with no budget. Because neither pole self-corrects (a rumination culture won't spontaneously rest, and a ship-forward culture won't spontaneously reflect), you have to engineer the alternation deliberately. Resilient incident culture is not the right amount of postmortem; it's the rhythm between two phases you design on purpose.
Phase one, a bounded loss-orientation. The blameless postmortem is the loss pole done safely: identify the contributing causes "without indicting any individual," as the Google SRE practice frames it, because removing blame is what gives people the confidence to confront the failure honestly instead of flinching from it. But two design constraints matter as much as blamelessness. Time-box it: a postmortem is a phase, not a permanent state, and an unbounded one becomes the rumination trap. And explicitly close it: name the moment the loss-orientation ends. "We have processed this. The causes are understood, the follow-ups are owned, and this incident is now closed." Without an explicit close, the loss stays ambient, re-litigated at every future retro, and you've built the freeze culture by accident.
Phase two, a sanctioned restoration. This is the phase almost everyone omits, and it has to be sanctioned, explicitly blessed by leadership, or it won't happen, because restoration always looks like slacking to a velocity-obsessed org. Protected recovery time after a heavy rotation. Real respite from the pager, with a handoff that actually holds. A deliberate clean win, a stretch of no-incident project work that lets people re-engage with building rather than firefighting. The grief model says re-engagement with normal life is curative; the engineering translation is let people go build something that works for a while. Some teams make it policy: a mandatory comp day after a Sev-1, a rotation that ends in a guaranteed no-pager week, a standing rule that whoever ran the incident doesn't also draw the next sprint's hardest story. The exact ritual matters less than its existence, a sanctioned, defended phase where the nervous system is actually allowed to stand down.
And the structural supports that make the oscillation possible, drawn straight from the reliability data: a blameless culture (so the loss phase is safe to enter), a cap on load so the restoration phase is real, the SRE guidance of roughly two to three actionable incidents per on-call shift as the sustainable ceiling, adequate rotation size so no one is permanently immersed, and genuinely protected project time. And one piece the grief researchers don't need but you do: an explicit hand-off between the phases. "The postmortem is closed, now recover." Because the failure that quietly destroys people is being left stuck (stuck immersed in a loss that never closes, or stuck avoiding a loss that never got processed), and a clear transition is what keeps the oscillation moving.
The deepest thing the Dual Process Model has to teach an engineering leader is also the hardest to put on a status page: there is no closure, and chasing it is the bug. You do not "get over" a death. You do not "get over" a bad outage. The healthy outcome was never a finish line where the loss is resolved and filed away; it's a sustainable rhythm in which you confront the thing in bounded doses, then genuinely rebuild and rest, then confront it again, until, over time, the confronting takes less and the rebuilding takes more, and the loss becomes something the system has metabolized rather than something it's still bleeding from or pretending didn't happen.
So stop optimizing for the speed to "incident closed," and start designing the oscillation. After the next bad one, do the bounded, blameless, explicitly-closed postmortem, and then, with the same intentionality, give the team the restoration phase the recovery actually requires. The most empirically supported model of how humans survive loss is telling you something your incident tooling never will: you metabolize it in doses, and then you rebuild. Build the rhythm, not the resolution: your people, and your reliability, both depend on the same thing the bereaved depend on, which is permission to look away before they have to look again.
"Resolved" at 2:47 a.m. was a lie about the human. A green checkmark is a lie about the work.
The same way a status page can't tell you a team has actually recovered, a dashboard can't tell you an agent's output is actually sound, only that something marked it done. If you now run on-call for a fleet of agents as well as people, you need the durable substrate under the status: provenance you can verify and reputation earned from real track record. The Agent Trust Stack is that substrate, so "done" means metabolized, not just marked.
pip install agent-trust-stack · npm install agent-trust-stack
vibeagentmaking.com → · See it in action