To write a runbook, document the exact steps an on-call engineer follows to operate or recover one service: how to tell it is broken, how to confirm the blast radius, the commands to mitigate, and when to escalate. A good runbook is a checklist for a tired human at 3am, not a design doc. Write it for the person who has been paged, not the person who built the system.
Most runbooks fail the same way: they are written once, during a calm sprint, by the person who knows the system best — and then never opened during the incident they were meant for. The author already holds the context in their head, so the document quietly assumes it too. At 3am the on-call engineer does not have that context. They have a pager, a half-awake brain, and ninety seconds before the next escalation. This guide is about writing the other kind of runbook: the one that actually gets used.
What is a runbook (and what it is not)?
A runbook is a service-scoped operational document: a set of concrete, ordered procedures for keeping one system healthy and recovering it when it is not. It answers “the alert fired — now what?” with specific commands, dashboard links, and decision points. It is deliberately narrow. One runbook covers one service or one failure mode, not your whole platform.
It helps to separate three things people lump together. A runbook is the mechanical procedure for a known situation (“disk is full on the queue worker — here is how to drain it”). A playbook is the broader response process for a class of incidents, including roles and communication. A wiki page is background knowledge — how the system is designed and why. The runbook borrows from both but is the thing you keep open while the incident is live.
What sections does a good runbook have?
A runbook that works under pressure has a predictable shape, so the on-call engineer can jump straight to the section they need instead of reading top to bottom. These are the sections that earn their place, and what each one is for at 3am.
| Section | What it answers | Why it matters at 3am |
|---|---|---|
| Service summary | What this service does and what breaks if it is down | Sets blast radius in one sentence so you size the response |
| Alerts & symptoms | Which alert maps to which failure | Turns a cryptic pager line into a known problem |
| First checks | Dashboards and queries to confirm what is actually wrong | Stops you fixing the wrong thing |
| Mitigations | The exact commands to stop the bleeding | Copy-paste recovery, no improvisation |
| Escalation | Who to wake and when | Removes the “am I allowed to call them?” hesitation |
| Rollback & verify | How to undo a change and confirm recovery | Tells you when you are actually done |
Order matters as much as content. The sections a paged engineer needs first — symptoms, first checks, mitigations — belong at the top, and everything that explains the system sits below them. A reader should reach a runnable command within a screen or two of scrolling. If your template forces them past a paragraph of architecture history to get there, the history has won an argument it should have lost.
Write mitigations as commands, not prose
The single biggest difference between a runbook that gets used and one that gets skipped is the mitigation section. “Restart the affected workers” is prose: it makes the reader translate intent into commands while the graph is on fire. Instead, give the literal command, `kubectl rollout restart deploy/queue-worker -n payments`, followed by the exact dashboard link to watch and the metric that should drop. Assume the reader knows the tool but not this service.
Make the contrast concrete. “Scale up the workers if the queue is backing up” forces three decisions onto a half-awake reader: which workers, how much, and how to tell it worked. The runbook version removes all three: run `kubectl scale deploy/queue-worker -n payments --replicas=8`, watch the queue_depth panel on the payments dashboard, and expect it to fall below 1,000 within five minutes. Same intent, zero translation.
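One way to hold yourself to that standard is to treat each mitigation as a small record with no optional fields. The sketch below is illustrative only: the `Mitigation` class, its field names, and the `missing_fields` helper are assumptions made for this example, not part of any runbook tool. The command and metric are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    """One runbook mitigation: what to run, what to watch, what to expect."""
    title: str
    command: str  # the literal command the responder pastes
    watch: str    # the dashboard panel or metric to keep open
    expect: str   # the signal that confirms the mitigation worked


def render(step: Mitigation) -> str:
    """Render a mitigation as a short plain-text runbook entry."""
    return "\n".join([
        step.title,
        f"  Run:    {step.command}",
        f"  Watch:  {step.watch}",
        f"  Expect: {step.expect}",
    ])


def missing_fields(step: Mitigation) -> list[str]:
    """Return the names of blank fields so incomplete steps are caught in review."""
    return [name for name in ("title", "command", "watch", "expect")
            if not getattr(step, name).strip()]


scale_workers = Mitigation(
    title="Scale up queue workers",
    command="kubectl scale deploy/queue-worker -n payments --replicas=8",
    watch="queue_depth panel on the payments dashboard",
    expect="queue_depth falls below 1,000 within five minutes",
)

print(render(scale_workers))
```

A step that cannot fill all four fields is still prose, and `missing_fields` makes that visible before the runbook is merged rather than during the incident.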
Encode the decision points, not just the happy path
Real incidents branch, so a runbook that only describes the path where everything works will strand the responder the moment it does not. Write the forks explicitly: “if queue depth keeps climbing after the restart, the worker is not the bottleneck — move to the database section.” Pair every mitigation with the signal that confirms it worked and the next step if it did not. Severity belongs here too: state the threshold that turns a quiet degradation into a page-everyone incident, so the on-call engineer is not negotiating severity with themselves at 3am.
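Writing a fork down as a tiny function is a useful test of whether it is explicit enough. The following is a minimal sketch, assuming the responder reads queue depth before and after the restart plus the current error rate. The function name, the `NextStep` labels, and the 5% escalation threshold are illustrative assumptions, not values taken from a real service.

```python
from enum import Enum


class NextStep(Enum):
    DONE = "queue depth is no longer climbing; move to verify"
    DATABASE_SECTION = "worker is not the bottleneck; move to the database section"
    ESCALATE = "page the escalation contact"


def after_restart(depth_before: int, depth_after: int, error_rate: float,
                  escalate_above: float = 0.05) -> NextStep:
    """Pick the next runbook step from the signals read after a worker restart.

    escalate_above is an assumed severity threshold (5% of requests failing).
    """
    if error_rate > escalate_above:
        # Severity threshold crossed: stop mitigating alone.
        return NextStep.ESCALATE
    if depth_after > depth_before:
        # Queue kept climbing after the restart, so the worker was not the cause.
        return NextStep.DATABASE_SECTION
    return NextStep.DONE


print(after_restart(depth_before=5000, depth_after=7000, error_rate=0.01).value)
```

If a branch in the runbook cannot be expressed as a condition on a number the responder can actually read, it is not yet a decision point, only a hint.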
A copyable runbook template
Start every new runbook from the same skeleton so engineers always know where to look. Copy the structure below, name the doc after the service, and fill each section with concrete specifics. Keep the headings even when a section is short — an empty “Escalation” heading is a useful prompt to go find the answer.
- Title & owner. Runbook: <service>, the owning team, and the slug it lives at (for example b/runbook-<service>).
- Summary. One or two sentences: what the service does, what depends on it, and the user-visible impact when it is down.
- Alerts & symptoms. A short table mapping each alert name to the likely cause and the section that fixes it.
- First checks. The two or three dashboards, log queries, and health endpoints to confirm what is broken — with direct links, not “check the dashboard.”
- Mitigations. Numbered procedures, each as copy-pasteable commands, the metric to watch, and the expected result.
- Escalation. Who to page after which threshold or elapsed time, and the slug to the on-call schedule (b/oncall).
- Rollback & verify. How to revert the last change and the explicit signal that recovery is complete.
- Dependencies & links. Upstream and downstream services, the status page, and the architecture doc for anyone who needs the “why.”
- Last validated. A date and a name. If it is older than a quarter, treat every command as suspect until re-checked.
That last line does more work than it looks. A runbook with no validation date is a runbook nobody trusts, and an untrusted runbook gets ignored in favour of paging the author — defeating the entire point.
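The "older than a quarter" rule is simple enough to check mechanically. Here is a minimal sketch, assuming a quarter is approximated as 90 days and that a runbook with no recorded date counts as stale. The function name and the 90-day constant are choices made for this example.

```python
from datetime import date, timedelta
from typing import Optional

QUARTER = timedelta(days=90)  # assumption: "a quarter" approximated as 90 days


def is_stale(last_validated: Optional[date], today: date) -> bool:
    """Return True when a runbook's commands should be treated as suspect."""
    if last_validated is None:
        # No validation date recorded: treat as stale until re-checked.
        return True
    return today - last_validated > QUARTER


print(is_stale(date(2024, 1, 1), today=date(2024, 6, 1)))
```

Passing `today` in explicitly keeps the check deterministic, which makes it easy to run as a scheduled job that lists every runbook due for re-validation.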
How do you keep a runbook current?
A runbook is only as good as its last test. Code changes, dashboards move, and the mitigation that worked last quarter now points at a renamed deployment. The fix is to make staleness visible and to attach updates to events that already happen, rather than relying on a someday review.
- Update it in the postmortem. Every incident is a free correctness test. If the runbook was wrong or missing a step, fixing it is an action item before the incident is closed — not a backlog ticket.
- Run game days. Periodically have someone who did not write the runbook follow it against a staging failure. The gaps surface fast and cheaply.
- Date and sign each section. A visible “last validated” stamp turns trust into something measurable. Stale stamps are a queue of work.
- Keep it next to the work. A runbook buried three wiki clicks from the alert will rot. One that lives behind a memorable slug, linked from the alert itself, gets opened — and edited — far more often.
Ownership is what turns these habits into something that actually happens. Put a named team — not an individual — on each runbook, ideally the same team that holds the pager for the service, and review the set on a fixed cadence rather than waiting for the next outage to expose the rot. A workspace with roles and an audit log helps here: you can see who last edited a high-value runbook and when, and lock the ones whose accuracy people bet the incident on so a careless edit cannot quietly break them.
Runbook anti-patterns to avoid
Most broken runbooks share a handful of failure modes. If you recognise these in your own docs, you have your edit list.
- The novel. Three thousand words of architecture before the first command. Put the recovery steps at the top; move the background to the bottom or a linked wiki page.
- Assumed context. “Just fail over the usual way” is useless to the person who has never done it. Spell out the usual way.
- Prose instead of commands. Anything you would type, write as a code block. The reader should copy, not translate.
- No decision points. Real incidents branch. State the conditions: “if error rate is still above 5% after the restart, escalate” beats a single happy path.
- Findable only by the author. If locating the runbook requires knowing it exists, it does not exist. A predictable address — one slug per service — solves this.
- Stale by default. A runbook whose commands silently stopped working is worse than none — it sends the responder confidently down a dead end. Treat an un-dated runbook as stale until someone proves otherwise.
Where should a runbook live?
The best-written runbook is worthless if the on-call engineer cannot find it in the first thirty seconds of an incident. It needs a stable, memorable, typeable address that does not change when the underlying doc moves. This is exactly what team go links are for: a short keyword that redirects to a destination, owned by the team rather than one person’s bookmarks.
The convention that holds up is one slug per service: b/runbook-checkout, b/runbook-payments, b/runbook-queue. Put the slug directly in the alert payload and the pager description, so the link to the runbook arrives with the page itself. When the document moves, you repoint the slug once and every alert, Slack thread, and bookmark keeps working. Lock the high-value runbook slugs to admins so a typo cannot quietly repoint b/runbook-payments at the wrong page.
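Attaching the slug to the page can be a few lines in whatever builds the alert. The sketch below is hypothetical: the payload is a plain dictionary and the `runbook` and `description` field names are assumptions for illustration, not the schema of any particular paging or alerting product.

```python
def runbook_slug(service: str) -> str:
    """Build the one-slug-per-service address, e.g. b/runbook-payments."""
    return f"b/runbook-{service}"


def with_runbook(alert: dict, service: str) -> dict:
    """Return a copy of the alert carrying the runbook slug.

    The slug goes in a dedicated field and at the end of the description,
    so it is visible even where only the description is shown.
    """
    slug = runbook_slug(service)
    enriched = dict(alert)  # copy, so the caller's payload is not mutated
    enriched["runbook"] = slug
    enriched["description"] = f"{alert.get('description', '')} Runbook: {slug}".strip()
    return enriched


page = with_runbook(
    {"name": "QueueDepthHigh", "description": "queue_depth above threshold."},
    service="payments",
)
print(page["description"])
```

Because the payload carries the slug rather than the document URL, moving the runbook means repointing the slug and leaving every alert definition untouched.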
Practically, that address has to resolve from wherever the page drops the responder. A browser extension — Chrome, Firefox, Edge, Safari, or Brave — turns b/runbook-checkout typed in the address bar into a redirect in roughly 40-50ms, and the same slug resolves from Raycast or Spotlight when the responder is in a launcher rather than a browser. For the cases an extension cannot cover — a phone, a fresh laptop, a contractor mid-incident — a custom slug domain such as b.your-co.com gives the link a normal URL that works in any browser.
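For places where only a full URL will do, such as a ticket or an email, the slug can be expanded ahead of time. This is a rough sketch under a loud assumption: that the custom slug domain serves each keyword at the path of the same name, so b/runbook-checkout maps to https://b.your-co.com/runbook-checkout. That mapping and the `fallback_url` helper are illustrative, so check how your own go-link setup builds URLs before relying on it.

```python
def fallback_url(slug: str, slug_domain: str = "b.your-co.com") -> str:
    """Expand a b/<keyword> slug into a plain URL on the custom slug domain.

    Assumes the domain serves each keyword at the path of the same name.
    """
    prefix, _, keyword = slug.partition("/")
    if prefix != "b" or not keyword:
        raise ValueError(f"not a b/<keyword> slug: {slug!r}")
    return f"https://{slug_domain}/{keyword}"


print(fallback_url("b/runbook-checkout"))
```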
Because a slug can resolve to a full canvas rather than just a redirect, the runbook itself can live behind it. A board can hold the procedure as code blocks, the live dashboards as embeds, a checklist for the current incident, and notes the responders add in real time — all behind b/runbook-checkout. That keeps the static procedure and the live response in one place instead of scattered across a wiki, a chat thread, and three browser tabs.
Putting it together
Writing a runbook that survives 3am comes down to three habits: write for the tired stranger, not the author; lead with copy-pasteable commands and clear decision points; and give it an address the whole team can reach without thinking. Start from the template above, validate it by having someone else follow it, and keep it honest by updating it inside every postmortem.
For the patterns specific to operating services this way, see how teams structure runbooks behind slugs, and how the same approach extends to a live incident response board with the on-call schedule, status page, and a running timeline all one keyword away.