Observability Incident Triage

Triage production incidents by choosing useful metrics, logs, traces, queue signals, database evidence, request IDs, and alerting strategies.

Concept: Production observability, incident triage, metrics, logs, traces, and alerts
Difficulty: Intermediate
Play time: 6-9 minutes
Path: Production Reliability

Concept explanation

Incident response is a race against uncertainty. This game gives you a pager, a symptom, and a handful of possible signals so you can practice finding the truth without drowning in noise.

Learning objectives

• Choose high-signal telemetry for common backend incidents.
• Use metrics, logs, traces, deploy markers, and request IDs together.
• Distinguish actionable alerts from noisy operational trivia.

How to play

1. Read the production symptom and available context.
2. Choose the telemetry move that would reduce uncertainty fastest.
3. Use explanations to build an incident response mental model.

Scoring

• High-signal triage choices add points and streak bonuses.
• Low-signal detours trigger an explanation of why they waste time or hide user impact.
• Completion saves local progress and best triage score.

Backend concept notes

Observability is the ability to ask useful questions about a running system. During incidents, the best signals connect user symptoms, recent changes, failing dependencies, and concrete request paths.

Each signal type answers a different question. Metrics show the shape and impact of a problem, traces show where time went, and logs provide event detail. Request IDs connect user reports to specific log lines and traces. Alerts should be tied to actionable, user-facing risk.
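To make "traces show where time went" concrete, here is a minimal sketch that computes per-span self time (a span's duration minus the time spent in its direct children) for one request. The `Span` class, the span names, and the timings are hypothetical, and the sketch assumes child spans run sequentially without overlapping. Real tracing backends compute this for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span: one timed operation in a trace."""
    name: str
    parent: Optional[str]  # name of the parent span, None for the root
    start_ms: float
    end_ms: float


def self_times(spans):
    """Return {span name: duration minus time spent in direct children}.

    Assumes span names are unique within the trace and that sibling
    spans do not overlap in time.
    """
    duration = {s.name: s.end_ms - s.start_ms for s in spans}
    child_time = {s.name: 0.0 for s in spans}
    for s in spans:
        if s.parent is not None:
            child_time[s.parent] += duration[s.name]
    return {name: duration[name] - child_time[name] for name in duration}


# Illustrative trace for a slow checkout request (made-up numbers).
trace = [
    Span("GET /checkout", None, 0, 1200),
    Span("auth", "GET /checkout", 10, 60),
    Span("db.query", "GET /checkout", 70, 1100),
    Span("render", "GET /checkout", 1110, 1190),
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # db.query 1030
```

In this made-up trace the root span takes 1200 ms, but almost all of it is spent inside the database call, which is the kind of answer a latency metric alone cannot give.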

Common mistakes

• Relying only on average latency while users at p95 or p99 suffer.
• Diving into random logs before scoping by service, route, deploy, or request ID.
• Alerting on noisy resource blips instead of sustained symptoms or SLO burn.
• Watching queue depth without message age, retry rate, or worker errors.
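The first mistake above is easy to see with numbers. The following sketch uses made-up latency samples and a simple nearest-rank percentile helper (`percentile` is a hypothetical function written for this example, not a library call) to show how a mean can look tolerable while tail users wait seconds.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up sample: 94 fast requests and 6 very slow ones.
latencies_ms = [100] * 94 + [5000] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms, p50_ms, p95_ms, p99_ms)  # 394.0 100 5000 5000
```

The mean and the median both suggest a healthy service, while p95 and p99 show that some users are waiting five seconds.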

FAQ

Short answers on how this game fits into backend interview and study practice.

Are logs enough for observability?

Logs are useful, but incidents usually need metrics for impact, traces for path timing, and correlation IDs to connect events across services.
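One way to get correlation IDs into every log line is sketched below using Python's standard `logging` and `contextvars` modules. The logger name `checkout`, the `request_id_var` variable, the `handle_request` function, and the log message are all hypothetical. A real service would usually read the ID from an incoming request header and forward it to downstream calls.

```python
import contextvars
import io
import logging

# Hypothetical context variable holding the current request's ID.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Build a logger whose lines always include request_id=..."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, request_id):
    """Simulate handling one request with its ID set in context."""
    token = request_id_var.set(request_id)
    try:
        logger.info("payment provider timed out")
    finally:
        request_id_var.reset(token)


stream = io.StringIO()
logger = build_logger(stream)
handle_request(logger, "req-123")
print(stream.getvalue().strip())
# INFO request_id=req-123 payment provider timed out
```

With the ID on every line, a user report that includes `req-123` can be turned into a single scoped log search instead of a scan through unrelated events.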

What makes an alert good?

A good alert is actionable, has a clear owner and a clear expected response, and usually represents user impact or a durable risk to user impact.
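A common way to tie alerts to user impact is to page on SLO error-budget burn rate rather than on raw resource metrics. The sketch below is one possible shape for that check, not a definitive policy. The 99.9% SLO target, the 14.4 threshold, and the long and short windows are assumptions chosen for illustration (14.4 is a commonly cited value that corresponds to spending about 2% of a 30-day budget in one hour), and the request counts are made up.

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast.

    Each window is an (errors, total) pair, for example the last hour
    and the last five minutes. Requiring both filters out short blips
    and incidents that have already recovered.
    """
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Sustained failure: about 2% errors over the long window, 3% right now.
sustained = should_page(long_window=(200, 10000), short_window=(30, 1000))

# Brief blip: the short window is bad, but the long window is healthy.
blip = should_page(long_window=(5, 10000), short_window=(30, 1000))

print(sustained, blip)  # True False
```

The blip case is the "noisy resource blip" from the common mistakes list: it looks alarming for a few minutes but does not threaten the budget, so it does not page.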