There is a ritual in almost every IT operations team that repeats with irritating regularity. The service goes down. The team fixes it. The ticket is closed in Jira. The team lead breathes a sigh of relief. And everyone gets on with their life.
Two weeks later, the same service goes down again for the same reason.
The gap between what is said and what is done
Share of IT leaders who say post-incident learning is "vital" for the organisation (State of AI-First Operations 2026)
Share that actually achieves structured learning after an incident (State of AI-First Operations 2026)
That 52% gap is not a willpower problem. It is a definition problem. Most teams consider the post-mortem finished when the report is written. Resilient organisations consider the post-mortem finished when the root cause is permanently eliminated.
These are two completely different worlds.
When the ticket is truly closed
In the teams that genuinely learn from their incidents, there is an unwritten but systematically applied rule: the ticket for a serious incident is not closed when the service comes back online. It is closed when at least one of these three conditions is met:
- A proactive alert has been automated that would have detected the problem before it affected users. The next similar incident is detected in minutes, not hours.
- The architecture has been modified to eliminate the failure point. Not a patch. A structural change that means the same failure cannot recur in the same way.
- The L1 runbook has been updated with the exact steps to resolve the problem in less than half the original time. The tacit knowledge of the senior engineer who resolved it becomes explicit team knowledge.
If none of the three is met, the ticket is technically closed but the problem is still open.
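The closing rule above can be sketched as a simple gate in code. This is a minimal illustration, not a real Jira integration: the `IncidentTicket` class, its field names and the `can_close` function are all hypothetical, chosen only to mirror the three conditions in the list.

```python
from dataclasses import dataclass


@dataclass
class IncidentTicket:
    """Hypothetical ticket record; the field names are illustrative only."""
    service_restored: bool = False
    proactive_alert_added: bool = False  # condition 1: automated early detection
    architecture_changed: bool = False   # condition 2: failure point eliminated
    runbook_updated: bool = False        # condition 3: L1 runbook updated


def can_close(ticket: IncidentTicket) -> bool:
    """Return True only if the service is back AND at least one
    of the three structural conditions has been met."""
    structural_fix = (
        ticket.proactive_alert_added
        or ticket.architecture_changed
        or ticket.runbook_updated
    )
    return ticket.service_restored and structural_fix


# Service is back online, but nothing structural has changed yet.
print(can_close(IncidentTicket(service_restored=True)))

# Service is back online and the runbook has been updated.
print(can_close(IncidentTicket(service_restored=True, runbook_updated=True)))
```

The point of the sketch is that "service restored" is a necessary condition for closing, not a sufficient one.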
Why write the report that everyone knows nobody will read
System goes down → incident resolved → someone writes a 4-page document in Confluence → the document is shared in the Slack channel → three people read it → nobody changes anything → 6 weeks later a variant of the same problem occurs → the cycle repeats.
The worst part is not that the report goes unread. The worst part is that everyone in the room knows nothing is going to change and yet they write it, present it and file it. It is collective energy burned in a performance.
Why does this happen? Because the incentive is badly designed. Teams are evaluated on resolution time (the MTTR we discussed in the first dose of this series). Not on incidents avoided. If you resolve fast, you are the hero. If you invest three hours in a post-mortem that prevents the next incident, nobody sees you.
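One way to make the missing half of the incentive visible is to report repeat incidents next to MTTR. The sketch below assumes each incident has been tagged with a root cause. The `Incident` class, its fields and both function names are hypothetical, and counting repeats by identical root-cause label is a simplification that would miss variants of the same problem.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    root_cause: str
    minutes_to_resolve: float


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve, in minutes (raises on an empty list)."""
    return mean(i.minutes_to_resolve for i in incidents)


def repeat_incidents(incidents: list[Incident]) -> int:
    """Count incidents whose root cause had already occurred before,
    i.e. every occurrence of a root cause beyond the first."""
    counts = Counter(i.root_cause for i in incidents)
    return sum(n - 1 for n in counts.values())


# Made-up sample data, for illustration only.
history = [
    Incident("disk-full", 30),
    Incident("disk-full", 20),
    Incident("cert-expired", 40),
]
print(mttr_minutes(history))      # how fast the team resolves
print(repeat_incidents(history))  # how often the same cause came back
```

A team can look excellent on the first number while the second keeps growing, which is exactly the blind spot described above.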
The mindset shift that distinguishes resilient teams
Teams that genuinely do not repeat incidents have one thing in common: they treat every serious incident as priority technical debt, not as a one-off event to document and forget.
Resolving the incident is the emergency response. The post-mortem with real impact is the structural response. Without the second, the first is just a patch.
Learning from the fix saves the year.
If your team does not have time for post-mortems with real impact, it is not a time problem. It is a priorities problem. Recurring incidents are not bad luck. They are the direct consequence of a system that rewards fire-fighting and penalises building fireproof buildings.
The question you should ask yourself when closing the next incident is not "is the service running?". It should be "what needs to change so this does not happen again?"
Series: IT Operations without the smoke
Do your post-mortems generate real changes or end up in Confluence? How many of your incidents from last year were repeated?
We help you turn your post-mortems into concrete actions that eliminate problems at their root.
