What is the difference between a runbook and a playbook?

A runbook is the mechanical procedure for a known situation, such as draining a full disk on a queue worker. A playbook is the broader response process for a class of incidents, including roles, severity levels, and communication. In short, the runbook tells you which commands to run; the playbook tells you how the whole response is coordinated.

What should a runbook include?

At minimum: a one-line service summary and its blast radius, a map of alerts to symptoms, first checks with direct dashboard and log links, copy-pasteable mitigation commands with the metric to watch, escalation rules with who to page and when, rollback and verification steps, dependencies, and a "last validated" date with a name.

How long should a runbook be?

Short enough that an on-call engineer can find the right step in under a minute. Put the recovery commands at the top and push architecture and background to the bottom or a linked wiki page. If a runbook reads like a novel before the first command, it will get skipped in favour of paging the author, which defeats its purpose.

How often should you update a runbook?

Update it as part of every postmortem, since each incident is a free correctness test, and re-validate it on a schedule with game days where someone who did not write it follows it against a staging failure. Stamp each section with a "last validated" date; anything older than a quarter should be treated as suspect until re-checked.

Where should runbooks be stored?

Behind a stable, memorable address the on-call engineer can reach in seconds, ideally one go-link slug per service such as b/runbook-checkout, with the slug placed directly in the alert payload. When the document moves, you repoint the slug once and every alert and bookmark keeps working. Storing the runbook behind a slug also lets it live on a board alongside live dashboards and an incident timeline.

How to Write a Runbook: Template, Sections, Examples

To write a runbook, document the exact steps an on-call engineer follows to operate or recover one service: how to tell it is broken, how to confirm the blast radius, the commands to mitigate, and when to escalate. A good runbook is a checklist for a tired human at 3am, not a design doc. Write it for the person who has been paged, not the person who built the system.

Most runbooks fail the same way: they are written once, during a calm sprint, by the person who knows the system best — and then never opened during the incident they were meant for. The author already holds the context in their head, so the document quietly assumes it too. At 3am the on-call engineer does not have that context. They have a pager, a half-awake brain, and ninety seconds before the next escalation. This guide is about writing the other kind of runbook: the one that actually gets used.

What is a runbook (and what it is not)?

A runbook is a service-scoped operational document: a set of concrete, ordered procedures for keeping one system healthy and recovering it when it is not. It answers “the alert fired — now what?” with specific commands, dashboard links, and decision points. It is deliberately narrow. One runbook covers one service or one failure mode, not your whole platform.

It helps to separate three things people lump together. A runbook is the mechanical procedure for a known situation (“disk is full on the queue worker — here is how to drain it”). A playbook is the broader response process for a class of incidents, including roles and communication. A wiki page is background knowledge — how the system is designed and why. The runbook borrows from both but is the thing you keep open while the incident is live.

Tip

The fastest test of a runbook: hand it to an engineer who has never touched the service and ask them to follow it without interrupting you. Every place they stop and ask a question is a missing step. Fix those before you call the runbook done.

What sections does a good runbook have?

A runbook that works under pressure has a predictable shape, so the on-call engineer can jump straight to the section they need instead of reading top to bottom. These are the sections that earn their place, and what each one is for at 3am.

Section	What it answers	Why it matters at 3am
Service summary	What this service does and what breaks if it is down	Sets blast radius in one sentence so you size the response
Alerts & symptoms	Which alert maps to which failure	Turns a cryptic pager line into a known problem
First checks	Dashboards and queries to confirm what is actually wrong	Stops you fixing the wrong thing
Mitigations	The exact commands to stop the bleeding	Copy-paste recovery, no improvisation
Escalation	Who to wake and when	Removes the “am I allowed to call them?” hesitation
Rollback & verify	How to undo a change and confirm recovery	Tells you when you are actually done

Order matters as much as content. The sections a paged engineer needs first — symptoms, first checks, mitigations — belong at the top, and everything that explains the system sits below them. A reader should reach a runnable command within a screen or two of scrolling. If your template forces them past a paragraph of architecture history to get there, the history has won an argument it should have lost.
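The ordering rule can be checked mechanically. The following is a minimal Python sketch, not part of any existing tool: `sections_out_of_order` is a hypothetical helper that takes a runbook's headings in document order and reports the response sections that are missing or sit below some other heading. The heading names come from the table above; allowing the one-sentence service summary at the top is an assumption.

```python
from typing import List, Sequence

# Sections a paged engineer needs first, per the table above.
RESPONSE_FIRST = ("Alerts & symptoms", "First checks", "Mitigations")


def sections_out_of_order(headings: Sequence[str],
                          response_first: Sequence[str] = RESPONSE_FIRST) -> List[str]:
    """Return response sections that are missing or sit below another heading.

    Assumption: "Service summary" may precede them because it is one sentence.
    """
    problems = []
    seen_other = False
    for heading in headings:
        if heading in response_first:
            if seen_other:
                problems.append(heading)
        elif heading != "Service summary":
            seen_other = True
    # A response section that never appears is also a problem.
    problems.extend(h for h in response_first if h not in headings)
    return problems
```

Run against a draft's headings, an empty list means a responder reaches the response sections before anything that explains the system.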

Write mitigations as commands, not prose

The single biggest difference between a runbook that gets used and one that gets skipped is the mitigation section. “Restart the affected workers” is prose — it makes the reader translate intent into commands while the graph is on fire. Instead, give the literal command: kubectl rollout restart deploy/queue-worker -n payments, followed by the exact dashboard link to watch and the metric that should drop. Assume the reader knows the tool but not this service.

Make the contrast concrete. “Scale up the workers if the queue is backing up” forces three decisions onto a half-awake reader: which workers, how much, and how to tell it worked. The runbook version removes all three — kubectl scale deploy/queue-worker -n payments --replicas=8, watch the queue_depth panel on the payments dashboard, and expect it to fall below 1,000 within five minutes. Same intent, zero translation.
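One way to stop a mitigation drifting back into prose is to store it as structured data: the literal command, the panel to watch, and the expected result. The sketch below is illustrative only. The `Mitigation` record and its field names are assumptions; the command, panel, and threshold are the ones from the example above.

```python
import shlex
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Mitigation:
    """One runbook step: what to run, what to watch, what 'worked' looks like."""
    title: str
    argv: Tuple[str, ...]   # literal command, one token per element
    watch_panel: str        # dashboard panel the responder should watch
    expect: str             # plain statement of the expected result

    def render(self) -> str:
        """Return the step as copy-pasteable text for the runbook."""
        return "\n".join([
            self.title,
            "  run:    " + shlex.join(self.argv),
            "  watch:  " + self.watch_panel,
            "  expect: " + self.expect,
        ])


scale_workers = Mitigation(
    title="Scale up queue workers",
    argv=("kubectl", "scale", "deploy/queue-worker", "-n", "payments", "--replicas=8"),
    watch_panel="queue_depth panel on the payments dashboard",
    expect="queue_depth below 1,000 within five minutes",
)

if __name__ == "__main__":
    print(scale_workers.render())
```

Because every field is required, a step cannot be saved with a command but no signal to watch, which is the gap that forces improvisation.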

Encode the decision points, not just the happy path

Real incidents branch, so a runbook that only describes the path where everything works will strand the responder the moment it does not. Write the forks explicitly: “if queue depth keeps climbing after the restart, the worker is not the bottleneck — move to the database section.” Pair every mitigation with the signal that confirms it worked and the next step if it did not. Severity belongs here too: state the threshold that turns a quiet degradation into a page-everyone incident, so the on-call engineer is not negotiating severity with themselves at 3am.
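A fork like the one quoted above can be written down as a small decision function, which makes it obvious when a branch has no stated outcome. This is a sketch under assumptions: the function name, the enum, and the order of the checks are hypothetical, and the 5% error-rate threshold is an example value rather than a recommendation.

```python
from enum import Enum


class NextStep(Enum):
    DONE = "mitigation worked: move to Rollback & verify"
    DATABASE = "worker is not the bottleneck: move to the database section"
    ESCALATE = "page the escalation contact"


def after_restart(depth_before: int, depth_after: int, error_rate: float,
                  escalate_error_rate: float = 0.05) -> NextStep:
    """Pick the next runbook branch from signals read after the restart."""
    # Assumption: a high error rate outranks everything else.
    if error_rate > escalate_error_rate:
        return NextStep.ESCALATE
    # The queue is not draining, so restarting workers did not help.
    if depth_after >= depth_before:
        return NextStep.DATABASE
    return NextStep.DONE
```

For instance, `after_restart(5000, 7000, 0.01)` returns `NextStep.DATABASE`, matching the "queue depth keeps climbing" branch. The runbook text should state the same conditions in words, so the responder never has to run code to decide.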

A copyable runbook template

Start every new runbook from the same skeleton so engineers always know where to look. Copy the structure below, name the doc after the service, and fill each section with concrete specifics. Keep the headings even when a section is short — an empty “Escalation” heading is a useful prompt to go find the answer.

Title & owner. Runbook: <service>, the owning team, and the slug it lives at (for example b/runbook-<service>).
Summary. One or two sentences: what the service does, what depends on it, and the user-visible impact when it is down.
Alerts & symptoms. A short table mapping each alert name to the likely cause and the section that fixes it.
First checks. The two or three dashboards, log queries, and health endpoints to confirm what is broken — with direct links, not “check the dashboard.”
Mitigations. Numbered procedures, each as copy-pasteable commands, the metric to watch, and the expected result.
Escalation. Who to page after which threshold or elapsed time, and the slug to the on-call schedule (b/oncall).
Rollback & verify. How to revert the last change and the explicit signal that recovery is complete.
Dependencies & links. Upstream and downstream services, the status page, and the architecture doc for anyone who needs the “why.”
Last validated. A date and a name. If it is older than a quarter, treat every command as suspect until re-checked.

That last line does more work than it looks. A runbook with no validation date is a runbook nobody trusts, and an untrusted runbook gets ignored in favour of paging the author — defeating the entire point.
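If new runbooks are created often, the skeleton can be generated so that no heading is ever dropped. A minimal Python sketch, assuming a hypothetical `new_runbook` helper and plain-text output; the section names and the b/runbook-<service> slug pattern are taken from the template above, and the "TODO" placeholder is an illustrative choice.

```python
TEMPLATE_SECTIONS = (
    "Title & owner",
    "Summary",
    "Alerts & symptoms",
    "First checks",
    "Mitigations",
    "Escalation",
    "Rollback & verify",
    "Dependencies & links",
    "Last validated",
)


def new_runbook(service: str, team: str) -> str:
    """Return a plain-text skeleton with every heading present, even if empty."""
    # The first template item (title, owner, slug) is rendered as three lines.
    lines = [f"Runbook: {service}", f"Owner: {team}", f"Slug: b/runbook-{service}", ""]
    for section in TEMPLATE_SECTIONS[1:]:
        lines += [section, "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(new_runbook("checkout", "payments-oncall"))
```

Every section starts as a visible "TODO", so an unanswered "Escalation" or "Last validated" heading stays on the page as a prompt instead of disappearing.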

How do you keep a runbook current?

A runbook is only as good as its last test. Code changes, dashboards move, and the mitigation that worked last quarter now points at a renamed deployment. The fix is to make staleness visible and to attach updates to events that already happen, rather than relying on a review that is always scheduled for someday.

Update it in the postmortem. Every incident is a free correctness test. If the runbook was wrong or missing a step, fixing it is an action item before the incident is closed — not a backlog ticket.
Run game days. Periodically have someone who did not write the runbook follow it against a staging failure. The gaps surface fast and cheaply.
Date and sign each section. A visible “last validated” stamp turns trust into something measurable. Stale stamps are a queue of work.
Keep it next to the work. A runbook buried three wiki clicks from the alert will rot. One that lives behind a memorable slug, linked from the alert itself, gets opened — and edited — far more often.
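The "last validated" stamps in the list above become a work queue once something reads them. The sketch below assumes the stamps have already been collected into a dictionary of section name to date; how they are extracted from the document is left open, and treating a quarter as 90 days is an approximation.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

QUARTER = timedelta(days=90)  # assumption: "a quarter" approximated as 90 days


def stale_sections(validated: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return section names with no stamp or a stamp older than a quarter."""
    return [
        name for name, stamp in validated.items()
        if stamp is None or today - stamp > QUARTER
    ]


stamps = {
    "Mitigations": date(2024, 5, 2),
    "Escalation": date(2024, 1, 15),
    "Rollback & verify": None,  # never validated
}
print(stale_sections(stamps, today=date(2024, 6, 1)))
# prints ['Escalation', 'Rollback & verify']
```

A section with no stamp is reported alongside the expired ones, which matches the rule of treating an undated runbook as stale until someone proves otherwise.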

Ownership is what turns these habits into something that actually happens. Put a named team, not an individual, on each runbook, ideally the same team that holds the pager for the service, and review the set on a fixed cadence rather than waiting for the next outage to expose the rot. A workspace with roles and an audit log helps here: you can see who last edited a high-value runbook and when, and you can lock the runbooks an incident response depends on so that a careless edit cannot quietly break them.

Runbook anti-patterns to avoid

Most broken runbooks share a handful of failure modes. If you recognise these in your own docs, you have your edit list.

The novel. Three thousand words of architecture before the first command. Put the recovery steps at the top; move the background to the bottom or a linked wiki page.
Assumed context. “Just fail over the usual way” is useless to the person who has never done it. Spell out the usual way.
Prose instead of commands. Anything you would type, write as a code block. The reader should copy, not translate.
No decision points. Real incidents branch. State the conditions: “if error rate is still above 5% after the restart, escalate” beats a single happy path.
Findable only by the author. If locating the runbook requires knowing it exists, it does not exist. A predictable address — one slug per service — solves this.
Stale by default. A runbook whose commands silently stopped working is worse than none — it sends the responder confidently down a dead end. Treat an un-dated runbook as stale until someone proves otherwise.

Note

A runbook is not a substitute for a system that fails gracefully. If a service needs a twelve-step manual recovery every week, the runbook is a signal, not a solution — the real fix is automation or a design change. Write the runbook, then use how often it is opened as evidence for the work that removes it.

Where should a runbook live?

The best-written runbook is worthless if the on-call engineer cannot find it in the first thirty seconds of an incident. It needs a stable, memorable, typeable address that does not change when the underlying doc moves. This is exactly what team go links are for: a short keyword that redirects to a destination, owned by the team rather than one person’s bookmarks.

The convention that holds up is one slug per service: b/runbook-checkout, b/runbook-payments, b/runbook-queue. Put the slug directly in the alert payload and the pager description, so the link to the runbook arrives with the page itself. When the document moves, you repoint the slug once and every alert, Slack thread, and bookmark keeps working. Lock the high-value runbook slugs to admins so a typo cannot quietly repoint b/runbook-payments at the wrong page.
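The one-slug-per-service convention is easy to enforce where alerts are defined. The following Python sketch is hypothetical: the `service` and `runbook` field names are illustrative and do not follow any particular alerting tool's schema, and the character rule for service names is an assumption chosen to keep slugs typeable.

```python
import re

SLUG_PREFIX = "b/runbook-"  # convention from the text: one slug per service


def runbook_slug(service: str) -> str:
    """Build the slug for a service, rejecting names that are awkward to type."""
    # Assumption: lowercase letters, digits, and single hyphens only.
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", service):
        raise ValueError(f"service name is not slug-safe: {service!r}")
    return SLUG_PREFIX + service


def with_runbook(alert: dict) -> dict:
    """Return a copy of the alert with the runbook slug added to it."""
    enriched = dict(alert)
    enriched["runbook"] = runbook_slug(alert["service"])
    return enriched
```

For example, `with_runbook({"name": "QueueDepthHigh", "service": "payments"})` adds `"runbook": "b/runbook-payments"` to the payload. Deriving the slug from the service name, rather than typing it per alert, means a new alert cannot ship without a runbook address.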

Practically, that address has to resolve from wherever the page drops the responder. A browser extension — Chrome, Firefox, Edge, Safari, or Brave — turns b/runbook-checkout typed in the address bar into a redirect in roughly 40-50ms, and the same slug resolves from Raycast or Spotlight when the responder is in a launcher rather than a browser. For the cases an extension cannot cover — a phone, a fresh laptop, a contractor mid-incident — a custom slug domain such as b.your-co.com gives the link a normal URL that works in any browser.

Because a slug can resolve to a full canvas rather than just a redirect, the runbook itself can live behind it. A board can hold the procedure as code blocks, the live dashboards as embeds, a checklist for the current incident, and notes the responders add in real time — all behind b/runbook-checkout. That keeps the static procedure and the live response in one place instead of scattered across a wiki, a chat thread, and three browser tabs.

Putting it together

Writing a runbook that survives 3am comes down to three habits: write for the tired stranger, not the author; lead with copy-pasteable commands and clear decision points; and give it an address the whole team can reach without thinking. Start from the template above, validate it by having someone else follow it, and keep it honest by updating it inside every postmortem.

For the patterns specific to operating services this way, see how teams structure runbooks behind slugs, and how the same approach extends to a live incident response board with the on-call schedule, status page, and a running timeline all one keyword away.
