Observability Incident Triage

Concept explanation

Incident response is a race against uncertainty. This game gives you a pager, a symptom, and a handful of possible signals so you can practice finding the truth without drowning in noise.

Playable game area

Use the controls below. Feedback appears immediately, and final scores are stored locally.

Leaderboard

Top 10 submitted scores. No account required.

Learning objectives

Choose high-signal telemetry for common backend incidents.
Use metrics, logs, traces, deploy markers, and request IDs together.
Distinguish actionable alerts from noisy operational trivia.

How to play

Read the production symptom and available context.
Choose the telemetry move that would reduce uncertainty fastest.
Use the explanation after each choice to build a mental model of incident response.

Scoring

High-signal triage choices add points and streak bonuses.
Low-signal detours come with an explanation of why they waste time or hide user impact.
Completing the game saves your progress and best triage score locally.

Backend concept notes

Observability is the ability to ask useful questions about a running system. During incidents, the best signals connect user symptoms, recent changes, failing dependencies, and concrete request paths.

Each signal answers a different question. Metrics show the shape and impact of a problem, traces show where time went, logs provide event detail, and request IDs connect a user report to the specific requests behind it. Alerts should be tied to actionable, user-facing risk.
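
To make "traces show where time went" concrete, here is a minimal sketch that computes how much time each span spent in its own work rather than waiting on children. The Span class, the span names, and the timings are hypothetical, and the sketch assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, None for the root
    start_ms: float
    end_ms: float


def self_times(spans):
    """Return {span name: duration minus the durations of its direct children}.

    Assumes span names are unique and children do not overlap in time.
    """
    child_total = {s.name: 0.0 for s in spans}
    for s in spans:
        if s.parent is not None:
            child_total[s.parent] += s.end_ms - s.start_ms
    return {s.name: (s.end_ms - s.start_ms) - child_total[s.name] for s in spans}


# Hypothetical trace for one slow checkout request.
trace = [
    Span("POST /checkout", None, 0, 900),
    Span("auth", "POST /checkout", 10, 40),
    Span("db.query", "POST /checkout", 50, 120),
    Span("payment-api", "POST /checkout", 130, 880),
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

In this made-up trace the root span is 900 ms long, but almost all of that is spent waiting on the payment dependency. A latency metric alone would show the slowdown without showing where it came from.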

Common mistakes

Relying only on average latency while p95 or p99 users suffer.
Diving into random logs before scoping by service, route, deploy, or request ID.
Alerting on noisy resource blips instead of sustained symptoms or SLO burn.
Watching queue depth without message age, retry rate, or worker errors.
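
The first mistake above is easy to see with numbers. This sketch uses a made-up latency sample and a simple nearest-rank percentile helper (one of several common percentile definitions) to show how an average can hide a slow tail.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests and 10 very slow ones, in milliseconds.
latencies_ms = [100] * 90 + [4000] * 10

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"avg={avg} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

Here the average is under half a second and the median is 100 ms, yet one request in ten takes four seconds. Only the p95 and p99 values expose that tail.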

Keep practicing nearby backend concepts while the mental model is fresh.

Reliability Intermediate

Circuit Breaker Clinic

Diagnose dependency failures and choose circuit breaker, timeout, fallback, retry, half-open, and bulkhead strategies that reduce blast radius.

Time: 6-9 minutes
Concept: Circuit breakers, timeouts, retries, fallbacks, and dependency isolation

Play Circuit Breaker Clinic

Queues Intermediate

Message Queue Simulator

Tune workers, retries, and dead-letter behavior while jobs move through an async queue with failures and poison messages.

Time: 7-11 minutes
Concept: Async jobs, retries, visibility timeout, and dead-letter queues

Play Message Queue Simulator

Scaling Intermediate

Load Balancer Challenge

Route simulated traffic across backend servers using round robin, weighted round robin, least connections, and random strategies.

Time: 6-10 minutes
Concept: Load balancing strategies

Play Load Balancer Challenge

FAQ

Short answers on how this game fits into backend interview and study practice.

Are logs enough for observability?

Logs are useful, but incidents usually need metrics for impact, traces for path timing, and correlation IDs to connect events across services.
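
As a rough illustration of what a correlation ID buys you, the sketch below groups structured log events from several services by request ID and orders each group by time. The event fields (ts, service, request_id, msg) and all values are hypothetical, not a real logging schema.

```python
from collections import defaultdict

# Hypothetical structured log events collected from three services, out of order.
events = [
    {"ts": 3, "service": "payments", "request_id": "req-42", "msg": "card processor timeout"},
    {"ts": 1, "service": "gateway", "request_id": "req-42", "msg": "POST /checkout received"},
    {"ts": 2, "service": "orders", "request_id": "req-7", "msg": "order created"},
    {"ts": 2, "service": "orders", "request_id": "req-42", "msg": "calling payments"},
    {"ts": 4, "service": "gateway", "request_id": "req-42", "msg": "responded 504"},
]


def timeline_by_request(events):
    """Group events by request ID and sort each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["request_id"]].append(event)
    return {rid: sorted(group, key=lambda e: e["ts"]) for rid, group in grouped.items()}


timelines = timeline_by_request(events)
path = [e["service"] for e in timelines["req-42"]]
print(" -> ".join(path))
```

Without the shared request ID, the timeout in the payments service and the 504 at the gateway would be two unrelated lines in two different log streams.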

What makes an alert good?

A good alert is actionable, makes clear who owns it and what they are expected to do, and usually represents user impact or a durable risk of user impact.
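
One common way to express "user impact or a durable risk" is error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below pages only when both a long and a short window burn fast, so a brief blip or an already-recovered spike does not wake anyone. The function names, window choices, and the 14.4 threshold are assumptions for illustration. That threshold is often quoted for a one-hour window against a 30-day SLO, but the right value depends on your own SLO policy.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than budgeted the error budget is being spent."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold):
    """Page only if both the long and the short window burn at or above the threshold."""
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )


SLO = 0.999       # hypothetical availability target
THRESHOLD = 14.4  # assumed paging threshold, see the note above

# Sustained problem: both windows show elevated errors.
sustained = should_page(0.02, 0.03, SLO, THRESHOLD)
# Short blip: the last few minutes are bad, but the hour as a whole is fine.
blip = should_page(0.001, 0.05, SLO, THRESHOLD)
# Already recovered: the hour looks bad, but errors have stopped.
recovered = should_page(0.02, 0.0005, SLO, THRESHOLD)

print(sustained, blip, recovered)
```

Only the sustained case pages. The other two are the kind of noisy blips that the common mistakes list warns against alerting on.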