What are the standard incident severity levels?

Most teams use three or four levels defined by customer impact. SEV1 is critical: a full outage or data loss that needs an all-hands response and frequent updates. SEV2 is major degradation with a workaround, paged to on-call. SEV3 is minor or internal impact handled in business hours. When unsure between two levels, round up; you can always downgrade once you understand the blast radius.

What does an incident commander do?

The incident commander (IC) owns the incident, not the fix. They keep the timeline moving, decide on mitigations, delegate investigation to other engineers, and call when to escalate or stand down. The IC should stay off the keyboard so someone is always steering: their job is to keep asking what the fastest path to mitigation is and who is doing it.

How often should you send incident status updates?

Tie the cadence to severity: roughly every 30 minutes for a SEV1, every 60 minutes for a SEV2, and at start and resolution for a SEV3. Post the first update within minutes of declaring, and always end each update by naming a specific clock time for the next one rather than saying 'soon.' Even a 'no change, still investigating' update is valuable because silence reads as the incident getting worse.

What is a blameless post-mortem?

A blameless post-mortem reviews an incident by analysing the system and conditions that let a mistake cause damage, rather than blaming the person involved. The assumption is that everyone acted reasonably given what they knew at the time. People who are not afraid of being named give honest accounts, and that honesty is what surfaces the real root cause. The document should produce concrete, owned, dated action items, not just a narrative.

Incident Response Playbook: Severity, Roles, Timeline

An incident response playbook is a written, repeatable process for handling production incidents. It defines severity levels, names the roles (incident commander, comms lead, scribe), and lays out every step from detection through resolution to a blameless post-mortem. The point is to remove improvisation from the worst ten minutes of your week.

Most teams already have the pieces scattered around: a runbook here, an on-call rotation there, a half-remembered Slack thread about who declares a SEV1. A playbook puts them in one ordered list that a half-awake engineer can follow at 3am without thinking. This guide is the version we’d hand a new on-call engineer: the severity scale, the roles, the timeline, the status-update cadence, and the post-mortem — concrete enough to copy.

What is an incident response playbook?

An incident is any unplanned event that degrades a service your users depend on — an outage, a data bug, a security event, a third-party dependency falling over. The playbook is the agreed answer to four questions you do not want to be debating while the site is down: how bad is it (severity), who is in charge (roles), what happens next (the timeline), and who needs to know (communication). Write those down once and every future incident starts from the same line.

A playbook is not a runbook. A runbook is service-specific — how to fail over the payments database, how to drain a bad node. The playbook is the meta-process that decides when you reach for a runbook and who calls the shots while you do. Keep them linked: the playbook is the front door, the runbooks are the rooms behind it.

What are the incident severity levels?

Severity is the first decision because it drives everything after it: who gets paged, how often you communicate, and whether anyone wakes up. Keep the scale short: three levels are enough for most teams, with an optional fourth (SEV4) for “noticed, not urgent”. Define them by customer impact, not by how interesting the bug is.

Severity	Impact	Example	Response
SEV1	Critical — full outage or data loss	Checkout down; customers cannot pay	Page the IC now, all-hands, updates every 30 min
SEV2	Major — degraded, a workaround exists	Elevated error rate; one region slow	Page on-call, updates every 60 min
SEV3	Minor — limited or internal impact	Background job lagging; admin tool glitch	Handle in business hours, track as a ticket

Tip

When you are unsure between two levels, round up. It is cheap to downgrade a SEV1 ten minutes in once you understand the blast radius; it is expensive to discover at minute forty that a “SEV3” was quietly losing orders the whole time. Make the on-call engineer the person who declares — they can always escalate, and waiting for a manager to bless the severity just burns minutes.
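One way to keep the scale from drifting is to store it as data rather than prose. The sketch below is a minimal illustration, not a prescribed schema: the values come from the severity table, while the names `Severity`, `ResponsePolicy`, `POLICIES`, and `round_up` are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Lower number = more severe, matching the SEV1..SEV3 naming.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePolicy:
    page: str                           # who gets paged
    update_interval_min: Optional[int]  # None = update at start and resolution only


# Values copied from the severity table; field names are illustrative.
POLICIES = {
    Severity.SEV1: ResponsePolicy(page="IC, all-hands", update_interval_min=30),
    Severity.SEV2: ResponsePolicy(page="on-call", update_interval_min=60),
    Severity.SEV3: ResponsePolicy(page="nobody (ticket, business hours)", update_interval_min=None),
}


def round_up(a: Severity, b: Severity) -> Severity:
    """When torn between two levels, pick the more severe one."""
    return min(a, b)
```

Because a lower number means a worse incident, "rounding up" is simply `min`; for example, `round_up(Severity.SEV2, Severity.SEV3)` yields `Severity.SEV2`, and the policy lookup then tells the declarer who to page.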

The roles: incident commander, comms lead, scribe

The single biggest failure mode in an incident is five smart engineers all debugging the same thing in silence while no one talks to customers and no one decides anything. Roles fix that. For a SEV1 you want three named people; for a SEV2 one person often wears all three hats. The rule that matters: say the role out loud when you take it, so everyone knows who is who.
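If your incident tooling records roles, the rule can be enforced rather than remembered. The following is a rough sketch under stated assumptions: `fill_roles` and the role keys are hypothetical names, the declarer defaults to IC until relieved, and refusing a SEV1 with an unnamed comms lead or scribe is one possible policy, not the only one.

```python
ROLES = ("incident_commander", "comms_lead", "scribe")


def fill_roles(severity: str, declarer: str, volunteers: dict[str, str]) -> dict[str, str]:
    """Return a role -> person mapping with no role left unnamed.

    Assumed defaults: the declarer is IC until relieved, and below SEV1
    the IC also covers comms and scribe unless someone steps up.
    """
    assigned = dict(volunteers)
    ic = assigned.setdefault("incident_commander", declarer)
    for role in ROLES[1:]:
        if role not in assigned:
            if severity == "SEV1":
                # A SEV1 wants three named people, so do not silently double up.
                raise ValueError(f"SEV1 needs a named {role}")
            assigned[role] = ic
    return assigned
```

Calling `fill_roles("SEV2", "ana", {})` gives one person all three hats, while a SEV1 with no volunteers raises until the comms lead and scribe are named.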

Incident commander (IC)

The IC owns the incident, not the fix. They keep the timeline moving, decide on mitigations, delegate investigation, and call when to escalate or stand down. Crucially the IC should keep their hands off the keyboard — the moment the commander is deep in a stack trace, no one is steering. Their job is to ask “what’s our fastest path to mitigation?” and “who is doing it?” over and over.

Comms lead

The comms lead owns everyone outside the incident channel: the status page, the support team, leadership, and affected customers. They translate engineer-speak into plain impact statements and they protect the responders from a stream of “any update?” pings. On a smaller incident the IC covers this, but as soon as customers are visibly affected, split it out — talking to people is a full-time job during a SEV1.

Scribe

The scribe keeps the running timeline: when the alert fired, when each action was taken, what was tried, what changed. This is not bureaucracy — it is the raw material for the post-mortem, and it stops the team from re-trying something that already failed an hour ago. A pinned doc or a board on the incident’s home is enough; timestamp every entry.
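A pinned doc works, but the same idea is easy to sketch in code if the scribe's log lives in a bot or script. This is a minimal illustration only: `Timeline`, its entry kinds (`alert`, `action`), and `already_tried` are hypothetical names, and the keyword match is deliberately naive.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Timeline:
    # Each entry is (timestamp, kind, note).
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, kind: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry; defaults to the current UTC time."""
        self.entries.append((at or datetime.now(timezone.utc), kind, note))

    def already_tried(self, keyword: str) -> list[str]:
        """List earlier actions mentioning keyword, so nobody repeats a failed attempt."""
        return [
            f"{ts:%H:%M} {note}"
            for ts, kind, note in self.entries
            if kind == "action" and keyword.lower() in note.lower()
        ]
```

Before someone proposes a rollback an hour in, `timeline.already_tried("rolled back")` shows whether it was attempted and when.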

The incident timeline: detection to resolution

Here is the core of the playbook — the ordered checklist a responder runs from the moment an alert fires. Print it, pin it to your b/oncall board, and don’t skip steps under pressure; the order is what keeps a stressful hour from becoming a chaotic one.

1. Acknowledge and declare. Ack the page so it stops escalating, then declare an incident and set an initial severity. Saying “this is a SEV2” out loud is what starts the clock for everyone else.
2. Open the incident home. Spin up the dedicated channel and open the b/oncall board so the timeline, dashboards, and runbook links live in one place instead of scattered DMs.
3. Assign roles. Name the IC first; for a SEV1, name the comms lead and scribe too. If no one volunteers, the person who declared is IC until relieved.
4. Stabilise before you diagnose. Stop the bleeding first: roll back the last deploy, fail over, flip the feature flag, scale up. You can find the root cause after users are safe; do not let a debugging rabbit hole delay mitigation.
5. Communicate on cadence. Post the first status update within minutes, then on the interval your severity demands; even “no change, still investigating” is a valuable update.
6. Diagnose and fix. With the impact contained, work the root cause. The IC delegates investigation threads and keeps everyone reporting back to the channel, not working in private.
7. Verify recovery. Confirm the fix with the same signal that detected the problem: error rate back to baseline, the failing transaction now succeeding. Trust the graphs, not a hopeful “looks better.”
8. Declare resolved. Post a final update, stand the team down, and thank them. State clearly that the incident is closed so no one keeps firefighting a fixed problem.
9. Schedule the post-mortem. Before everyone disperses, put the review on the calendar, within a few business days, while memory is fresh.
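The checklist above can be sketched as a small state tracker that refuses out-of-order steps. This is an illustration under assumptions, not a required tool: the step identifiers and the `Checklist` class are hypothetical, and treating each step as a one-shot event is a simplification, since communication in practice continues until resolution.

```python
from typing import Optional

# Step names mirror the checklist order; the identifiers are illustrative.
STEPS = (
    "acknowledge_and_declare",
    "open_incident_home",
    "assign_roles",
    "stabilise",
    "communicate_on_cadence",
    "diagnose_and_fix",
    "verify_recovery",
    "declare_resolved",
    "schedule_post_mortem",
)


class Checklist:
    def __init__(self) -> None:
        self.completed: list[str] = []

    def next_step(self) -> Optional[str]:
        """Return the next step in order, or None once everything is done."""
        if len(self.completed) < len(STEPS):
            return STEPS[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step done, rejecting anything other than the next step in order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)
```

The useful property is the refusal: jumping to `diagnose_and_fix` before `stabilise` raises, which is exactly the rabbit hole the ordering is meant to prevent.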

Status-update cadence and templates

During an incident, silence reads as “it’s getting worse.” A predictable update cadence is the cheapest way to keep stakeholders calm and out of the responders’ way. Tie the interval to severity, and always end every update by naming the time of the next one — “next update at 14:30”, never “soon”.

SEV1: update every 30 minutes, internal and public status page.
SEV2: update every 60 minutes, internal channel plus support.
SEV3: update at start and resolution; a ticket is enough.
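Computing the next update time mechanically removes the temptation to write "soon". A minimal sketch, assuming the intervals listed above; `UPDATE_INTERVAL_MIN`, `next_update_at`, and `next_update_line` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minutes between updates per severity; None means start and resolution only.
UPDATE_INTERVAL_MIN = {"SEV1": 30, "SEV2": 60, "SEV3": None}


def next_update_at(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return the clock time of the next update, or None if none is scheduled."""
    interval = UPDATE_INTERVAL_MIN[severity]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


def next_update_line(severity: str, last_update: datetime) -> str:
    """Format the closing line of a status update with a specific time."""
    due = next_update_at(severity, last_update)
    if due is None:
        return "Next update: at resolution"
    return f"Next update at {due:%H:%M}"
```

For a SEV1 update posted at 14:00, `next_update_line` produces "Next update at 14:30".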

Use one template for every update so readers can scan it without re-learning the format. Keep it short and lead with impact in plain language — the CEO and a customer should both understand it:

Title: [SEV1] Checkout unavailable
Status: Investigating / Identified / Monitoring / Resolved
Impact: who and what is affected, in one sentence, no jargon
Since: start time and elapsed duration
Current action: what we are doing right now
Next update: a specific clock time
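If updates are posted by a bot or script, the template can be rendered from structured fields so every update has the same shape. The sketch below is one possible rendering, assuming 24-hour clock times; `StatusUpdate` and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


@dataclass
class StatusUpdate:
    severity: str
    title: str
    status: str
    impact: str
    started_at: datetime
    current_action: str
    next_update_at: datetime

    def render(self, now: datetime) -> str:
        """Render the update in the fixed template order."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        elapsed_min = int((now - self.started_at).total_seconds() // 60)
        return "\n".join([
            f"Title: [{self.severity}] {self.title}",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Since: {self.started_at:%H:%M} ({elapsed_min} min elapsed)",
            f"Current action: {self.current_action}",
            f"Next update: {self.next_update_at:%H:%M}",
        ])
```

Rejecting statuses outside the four allowed values keeps readers from having to interpret ad hoc labels mid-incident.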

Note

Separate internal and external updates. Internal can carry hypotheses and half-confirmed detail; the public status page should only state confirmed impact and the next update time. Mixing them is how an unconfirmed guess becomes a customer-facing promise you have to walk back.
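One way to make the separation hard to get wrong is to keep hypotheses in their own field and never read that field when building the public text. A minimal sketch with hypothetical names (`Update`, `internal`, `public`):

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    confirmed_impact: str
    next_update: str
    hypotheses: list[str] = field(default_factory=list)

    def internal(self) -> str:
        """Internal view: confirmed impact plus clearly labelled hypotheses."""
        lines = [f"Impact: {self.confirmed_impact}"]
        lines += [f"Hypothesis (unconfirmed): {h}" for h in self.hypotheses]
        lines.append(f"Next update: {self.next_update}")
        return "\n".join(lines)

    def public(self) -> str:
        """Public view: confirmed impact and next update time only."""
        # self.hypotheses is deliberately never read here.
        return f"Impact: {self.confirmed_impact}\nNext update: {self.next_update}"
```

Both views come from the same object, so the public status page cannot drift from the internal channel on the confirmed facts, and an unconfirmed guess has no path to customers.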

The blameless post-mortem

The incident ends when service is restored; the learning ends with the post-mortem. “Blameless” is the load-bearing word: you analyse the system and the conditions that let a mistake cause damage, not the individual who pushed the button. People who fear being named stop telling you what really happened, and then you fix the wrong thing. Assume everyone acted reasonably given what they knew at the time, and ask why the system made the wrong action so easy.

Write the document within a few days, while the timeline is still sharp. A workable structure:

Summary: what happened, severity, and total customer impact in a few lines.
Timeline: the scribe’s timestamped log, from detection to resolution.
Root cause: the technical and contributing causes — usually more than one.
What went well / what hurt: the detection that worked, the runbook that was missing.
Action items: concrete, owned, and dated — each with a name and a due date, tracked to done.

A post-mortem with no owned action items is a diary entry. The whole value is the follow-up: the alert you add, the runbook you write, the guardrail that makes this class of incident impossible next time. Keep the finished documents somewhere the whole team can find them — a slug like b/postmortems beats a folder no one remembers the path to.
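The "owned and dated" rule is simple enough to check automatically before a post-mortem is marked complete. The following is a sketch under assumptions, not a standard: `ActionItem` and `problems` are hypothetical names, and the wording of the findings is illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def problems(items: list[ActionItem], today: date) -> list[str]:
    """Return reasons the action list is not yet concrete, owned, and dated."""
    found = []
    if not items:
        found.append("no action items")
    for item in items:
        if not item.owner:
            found.append(f"unowned: {item.description}")
        if item.due is None:
            found.append(f"undated: {item.description}")
        elif not item.done and item.due < today:
            found.append(f"overdue: {item.description}")
    return found
```

An empty result means every item has a name and a due date and nothing open has slipped past it; running the same check weekly is one way to track items to done.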

Keep the playbook in one place

A playbook only works if a stressed responder can reach it in one move. The pattern we recommend: keep a single b/oncall board as the incident home, and make it the muscle-memory destination for everyone on call. A slug resolves from the address bar in under 50ms, so b/oncall is faster to reach than finding the right bookmark — and because BookSlash boards are a multi-tool canvas, that one board can hold the live severity scale, the role assignments, the running timeline, the status-update template, and embeds of your dashboards and runbooks side by side.

Lock the high-value slug so it cannot be repointed by accident, and lean on workspace roles and the audit log (90 days on Pro, 365 on Enterprise) so you know who changed the playbook and when. If the term b/ is new to you, start with what are go links, then see the full incident response setup and the companion runbooks pattern. Detection-to-resolution is a process you practise; the playbook is just the version of it you wrote down before you needed it.
