An incident response playbook is a written, repeatable process for handling production incidents. It defines severity levels, names the roles — incident commander, comms lead, scribe — and lays out every step from detection to resolution and a blameless post-mortem. The point is to remove improvisation from the worst ten minutes of your week.
Most teams already have the pieces scattered around: a runbook here, an on-call rotation there, a half-remembered Slack thread about who declares a SEV1. A playbook puts them in one ordered list that a half-awake engineer can follow at 3am without thinking. This guide is the version we’d hand a new on-call engineer: the severity scale, the roles, the timeline, the status-update cadence, and the post-mortem — concrete enough to copy.
What is an incident response playbook?
An incident is any unplanned event that degrades a service your users depend on — an outage, a data bug, a security event, a third-party dependency falling over. The playbook is the agreed answer to four questions you do not want to be debating while the site is down: how bad is it (severity), who is in charge (roles), what happens next (the timeline), and who needs to know (communication). Write those down once and every future incident starts from the same line.
A playbook is not a runbook. A runbook is service-specific — how to fail over the payments database, how to drain a bad node. The playbook is the meta-process that decides when you reach for a runbook and who calls the shots while you do. Keep them linked: the playbook is the front door, the runbooks are the rooms behind it.
What are the incident severity levels?
Severity is the first decision because it drives everything after it: who gets paged, how often you communicate, and whether anyone wakes up. Keep the scale short: three levels are enough for most teams, with an optional fourth (SEV4) for “noticed, not urgent”. Define them by customer impact, not by how interesting the bug is.
| Severity | Impact | Example | Response |
|---|---|---|---|
| SEV1 | Critical — full outage or data loss | Checkout down; customers cannot pay | Page the IC now, all-hands, updates every 30 min |
| SEV2 | Major — degraded, a workaround exists | Elevated error rate; one region slow | Page on-call, updates every 60 min |
| SEV3 | Minor — limited or internal impact | Background job lagging; admin tool glitch | Handle in business hours, track as a ticket |
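If you want the scale to live next to your tooling rather than only in a document, the table can be encoded as data. The sketch below is one possible shape, not a prescribed implementation: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `classify` are hypothetical, and the two-question triage in `classify` is an assumption that simply mirrors the "Impact" column.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePolicy:
    page_immediately: bool
    update_interval_minutes: Optional[int]  # None means no recurring updates
    summary: str


# Values copied from the severity table; adjust to your own scale.
POLICIES = {
    Severity.SEV1: ResponsePolicy(True, 30, "Page the IC now, all-hands"),
    Severity.SEV2: ResponsePolicy(True, 60, "Page on-call"),
    Severity.SEV3: ResponsePolicy(False, None, "Handle in business hours, track as a ticket"),
}


def classify(full_outage_or_data_loss: bool, customers_degraded: bool) -> Severity:
    """Pick a severity from customer impact only (hypothetical triage questions)."""
    if full_outage_or_data_loss:
        return Severity.SEV1
    if customers_degraded:
        return Severity.SEV2
    return Severity.SEV3
```

Keeping the triage to questions about customer impact is deliberate: it leaves no room to argue severity from how interesting the bug looks.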
The roles: incident commander, comms lead, scribe
The single biggest failure mode in an incident is five smart engineers all debugging the same thing in silence while no one talks to customers and no one decides anything. Roles fix that. For a SEV1 you want three named people; for a SEV2 one person often wears all three hats. The rule that matters: say the role out loud when you take it, so everyone knows who is who.
Incident commander (IC)
The IC owns the incident, not the fix. They keep the timeline moving, decide on mitigations, delegate investigation, and call when to escalate or stand down. Crucially the IC should keep their hands off the keyboard — the moment the commander is deep in a stack trace, no one is steering. Their job is to ask “what’s our fastest path to mitigation?” and “who is doing it?” over and over.
Comms lead
The comms lead owns everyone outside the incident channel: the status page, the support team, leadership, and affected customers. They translate engineer-speak into plain impact statements and they protect the responders from a stream of “any update?” pings. On a smaller incident the IC covers this, but as soon as customers are visibly affected, split it out — talking to people is a full-time job during a SEV1.
Scribe
The scribe keeps the running timeline: when the alert fired, when each action was taken, what was tried, what changed. This is not bureaucracy — it is the raw material for the post-mortem, and it stops the team from re-trying something that already failed an hour ago. A pinned doc or a board on the incident’s home is enough; timestamp every entry.
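A pinned doc is enough, but if the scribe's log is kept in a script or bot, a minimal sketch might look like the following. The `Timeline` and `TimelineEntry` classes and the `already_tried` helper are hypothetical names chosen for illustration; the sketch assumes all timestamps are timezone-aware UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, author: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        # Timestamp every entry; default to "now" in UTC when no time is given.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def already_tried(self, keyword: str) -> List[TimelineEntry]:
        # Case-insensitive search, so the team can check before retrying an action.
        needle = keyword.lower()
        return [e for e in self.entries if needle in e.note.lower()]

    def render(self) -> str:
        # Chronological, one line per entry, ready to paste into the post-mortem.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M} UTC  {e.author}: {e.note}" for e in ordered)
```

For example, calling `timeline.already_tried("rolled back")` before a second rollback shows whether someone already attempted it and when.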
The incident timeline: detection to resolution
Here is the core of the playbook — the ordered checklist a responder runs from the moment an alert fires. Print it, pin it to your b/oncall board, and don’t skip steps under pressure; the order is what keeps a stressful hour from becoming a chaotic one.
- Acknowledge and declare. Ack the page so it stops escalating, then declare an incident and set an initial severity. Saying “this is a SEV2” out loud is what starts the clock for everyone else.
- Open the incident home. Spin up the dedicated channel and open the b/oncall board so the timeline, dashboards, and runbook links live in one place instead of scattered DMs.
- Assign roles. Name the IC first; for a SEV1, name the comms lead and scribe too. If no one volunteers, the person who declared is IC until relieved.
- Stabilise before you diagnose. Stop the bleeding first — roll back the last deploy, fail over, flip the feature flag, scale up. You can find the root cause after users are safe; do not let a debugging rabbit hole delay mitigation.
- Communicate on cadence. Post the first status update within minutes, then on the interval your severity demands — even “no change, still investigating” is a valuable update.
- Diagnose and fix. With the impact contained, work the root cause. The IC delegates investigation threads and keeps everyone reporting back to the channel, not working in private.
- Verify recovery. Confirm the fix with the same signal that detected the problem — error rate back to baseline, the failing transaction now succeeding. Trust the graphs, not a hopeful “looks better.”
- Declare resolved. Post a final update, stand the team down, and thank them. State clearly that the incident is closed so no one keeps firefighting a fixed problem.
- Schedule the post-mortem. Before everyone disperses, put the review on the calendar — within a few business days, while memory is fresh.
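The checklist above can also be enforced by a small helper that refuses to skip ahead. This is a sketch under assumptions: `STEPS`, `Checklist`, and `pick_commander` are hypothetical names, and treating every step as strictly sequential is a simplification, since communicating on cadence continues in parallel with diagnosis in practice.

```python
from typing import List, Optional

# Step names mirror the checklist above, in the same order.
STEPS = [
    "Acknowledge and declare",
    "Open the incident home",
    "Assign roles",
    "Stabilise before you diagnose",
    "Communicate on cadence",
    "Diagnose and fix",
    "Verify recovery",
    "Declare resolved",
    "Schedule the post-mortem",
]


class Checklist:
    def __init__(self) -> None:
        self.done: List[str] = []

    def next_step(self) -> Optional[str]:
        # None once every step has been completed.
        return STEPS[len(self.done)] if len(self.done) < len(STEPS) else None

    def complete(self, step: str) -> None:
        # Reject any step other than the next one, so nothing is skipped under pressure.
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"Out of order: expected {expected!r}, got {step!r}")
        self.done.append(step)


def pick_commander(declared_by: str, volunteer: Optional[str] = None) -> str:
    # If no one volunteers, the person who declared is IC until relieved.
    return volunteer or declared_by
```

Trying to mark "Diagnose and fix" before "Stabilise before you diagnose" raises an error, which is the point: mitigation comes first.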
Status-update cadence and templates
During an incident, silence reads as “it’s getting worse.” A predictable update cadence is the cheapest way to keep stakeholders calm and out of the responders’ way. Tie the interval to severity, and always end every update by naming the time of the next one — “next update at 14:30”, never “soon”.
- SEV1: update every 30 minutes, internal and public status page.
- SEV2: update every 60 minutes, internal channel plus support.
- SEV3: update at start and resolution; a ticket is enough.
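To make "name the time of the next update" mechanical, the intervals above can be turned into a clock time. The following is a minimal sketch; `UPDATE_INTERVAL_MINUTES`, `next_update_at`, and `next_update_line` are hypothetical names, and the wording for SEV3 is an assumption based on "update at start and resolution".

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals from the cadence list above; None means no recurring updates.
UPDATE_INTERVAL_MINUTES = {"SEV1": 30, "SEV2": 60, "SEV3": None}


def next_update_at(severity: str, last_update: datetime) -> Optional[datetime]:
    # Raises KeyError for a severity that is not on the scale.
    interval = UPDATE_INTERVAL_MINUTES[severity]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


def next_update_line(severity: str, last_update: datetime) -> str:
    # Always a specific clock time, never "soon".
    due = next_update_at(severity, last_update)
    return f"Next update at {due:%H:%M}" if due else "Next update at resolution"
```

A SEV1 update posted at 14:00 yields "Next update at 14:30", matching the example in the paragraph above.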
Use one template for every update so readers can scan it without re-learning the format. Keep it short and lead with impact in plain language — the CEO and a customer should both understand it:
- Title: [SEV1] Checkout unavailable
- Status: Investigating / Identified / Monitoring / Resolved
- Impact: who and what is affected, in one sentence, no jargon
- Since: start time and elapsed duration
- Current action: what we are doing right now
- Next update: a specific clock time
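If updates are posted by a bot or script, the template can be rendered from structured fields so that every update has the same shape. This is an illustrative sketch only: the `StatusUpdate` class and its field names are hypothetical, and the exact line format is an assumption modelled on the template fields above.

```python
from dataclasses import dataclass
from datetime import datetime

STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


@dataclass
class StatusUpdate:
    severity: str
    title: str
    status: str
    impact: str
    started_at: datetime
    current_action: str
    next_update_at: datetime

    def render(self, now: datetime) -> str:
        # Reject free-form statuses so readers always see one of the four known values.
        if self.status not in STATUSES:
            raise ValueError(f"Unknown status: {self.status!r}")
        elapsed_min = int((now - self.started_at).total_seconds() // 60)
        return "\n".join([
            f"Title: [{self.severity}] {self.title}",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Since: {self.started_at:%H:%M} ({elapsed_min} min elapsed)",
            f"Current action: {self.current_action}",
            f"Next update: {self.next_update_at:%H:%M}",
        ])
```

Computing the elapsed duration from the start time, rather than typing it by hand, removes one thing the comms lead can get wrong at 3am.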
The blameless post-mortem
The outage ends when service is restored; the incident is not finished until the post-mortem is done. “Blameless” is the load-bearing word: you analyse the system and the conditions that let a mistake cause damage, not the individual who pushed the button. People who fear being named stop telling you what really happened, and then you fix the wrong thing. Assume everyone acted reasonably given what they knew at the time, and ask why the system made the wrong action so easy.
Write the document within a few days, while the timeline is still sharp. A workable structure:
- Summary: what happened, severity, and total customer impact in a few lines.
- Timeline: the scribe’s timestamped log, from detection to resolution.
- Root cause: the technical and contributing causes — usually more than one.
- What went well / what hurt: the detection that worked, the runbook that was missing.
- Action items: concrete, owned, and dated — each with a name and a due date, tracked to done.
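The "owned and dated" rule for action items is easy to check automatically. The sketch below assumes action items are kept as simple records; `ActionItem`, `unowned_or_undated`, and `overdue` are hypothetical names, not part of any particular tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def unowned_or_undated(items: List[ActionItem]) -> List[ActionItem]:
    # Items that fail the "name and due date" rule and should block sign-off.
    return [i for i in items if not i.owner or i.due is None]


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    # Open items whose due date has passed, for tracking to done.
    return [i for i in items if not i.done and i.due is not None and i.due < today]
```

Running `unowned_or_undated` before a post-mortem is marked complete is one way to catch follow-ups that nobody has agreed to own.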
A post-mortem with no owned action items is a diary entry. The whole value is the follow-up: the alert you add, the runbook you write, the guardrail that makes this class of incident impossible next time. Keep the finished documents somewhere the whole team can find them — a slug like b/postmortems beats a folder no one remembers the path to.
Keep the playbook in one place
A playbook only works if a stressed responder can reach it in one move. The pattern we recommend: keep a single b/oncall board as the incident home, and make it the muscle-memory destination for everyone on call. A slug resolves from the address bar in under 50ms, so b/oncall is faster to reach than finding the right bookmark — and because BookSlash boards are a multi-tool canvas, that one board can hold the live severity scale, the role assignments, the running timeline, the status-update template, and embeds of your dashboards and runbooks side by side.
Lock the high-value slug so it cannot be repointed by accident, and lean on workspace roles and the audit log (90 days on Pro, 365 on Enterprise) so you know who changed the playbook and when. If the term b/ is new to you, start with what are go links, then see the full incident response setup and the companion runbooks pattern. Detection-to-resolution is a process you practise; the playbook is just the version of it you wrote down before you needed it.