Two engineers enter the same live incident. One broken system. Real-time coordination under pressure. Practice the communication, task splitting, and shared situational awareness that make on-call rotations work.
engineer-a$ kubectl describe pod api-gateway -n edge
Observation: readiness failing after dependency timeout
engineer-b$ kubectl get svc,endpoints -n edge
Thread: endpoints missing for auth backend
engineer-a$ kubectl describe networkpolicy -n edge
Decision: isolate policy drift before restart
Both engineers see the same system. The report shows who did what.
Real outages are team events. The best SRE in the world is useless if they can't coordinate with another engineer under pressure.
The room model mirrors real incident management, not a chat room with a shared terminal bolted on.
Pick a scenario: Kubernetes cascading failure, Azure networking, GPU diagnostics. The system provisions a shared container and generates paired handoff tokens for Engineer A and Engineer B.
Each engineer enters the lobby with their token. Presence indicators show who's connected. Engineer A starts the session when both are ready. It's a deliberate readiness gate, not an accidental race.
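The lobby flow can be pictured with a minimal sketch. Everything here is an assumption made for illustration (the `Lobby` class, the role names, the token format); it is not the platform's actual implementation. It only shows the idea of paired tokens plus a start gate that Engineer A can open once both engineers are present.

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical role labels, chosen to mirror "Engineer A" and "Engineer B".
ROLES = ("engineer-a", "engineer-b")


@dataclass
class Lobby:
    scenario: str
    tokens: dict = field(default_factory=dict)  # token -> role
    present: set = field(default_factory=set)   # roles currently connected
    started: bool = False

    @classmethod
    def provision(cls, scenario):
        """Create a lobby and one random handoff token per role."""
        lobby = cls(scenario)
        for role in ROLES:
            lobby.tokens[secrets.token_urlsafe(16)] = role
        return lobby

    def token_for(self, role):
        """Look up the token issued for a role (used when handing it out)."""
        return next(t for t, r in self.tokens.items() if r == role)

    def join(self, token):
        """Mark the token's role as present; reject unknown tokens."""
        role = self.tokens.get(token)
        if role is None:
            raise PermissionError("unknown handoff token")
        self.present.add(role)
        return role

    def start(self, token):
        """Readiness gate: only Engineer A may start, and only when both are in."""
        if self.tokens.get(token) != "engineer-a":
            raise PermissionError("only Engineer A can start the session")
        if self.present != set(ROLES):
            raise RuntimeError("both engineers must be connected")
        self.started = True
```

In this sketch, a start attempt before Engineer B has joined raises an error rather than silently starting a one-person session, which is the point of an explicit gate.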
Both engineers work the same live system. The platform tracks each engineer's commands independently. After the session, managers get per-engineer replays showing decision patterns, not just commands.
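As a rough illustration of independent tracking, the sketch below splits one shared stream of command events into a time-ordered list per engineer. The `CommandEvent` shape and the `per_engineer_replay` helper are hypothetical names, not the platform's data model; the commands are the ones shown in the example session above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandEvent:
    ts: float       # seconds since session start (assumed unit)
    engineer: str   # "engineer-a" or "engineer-b"
    command: str


def per_engineer_replay(events):
    """Group a shared event stream into time-ordered per-engineer lists."""
    replay = defaultdict(list)
    for event in sorted(events, key=lambda e: e.ts):
        replay[event.engineer].append(event)
    return dict(replay)


# Events may arrive out of order; sorting by timestamp restores the sequence.
events = [
    CommandEvent(12.0, "engineer-b", "kubectl get svc,endpoints -n edge"),
    CommandEvent(4.5, "engineer-a", "kubectl describe pod api-gateway -n edge"),
    CommandEvent(30.2, "engineer-a", "kubectl describe networkpolicy -n edge"),
]
replay = per_engineer_replay(events)
```

Keeping the raw events in one stream and deriving per-engineer views from it means the same data can serve both the individual replay and a combined timeline.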
Chaos Mode exposes collaboration quality, the one dimension most technical assessments completely ignore.
Do they split the problem sensibly? One engineer narrows blast radius while the other validates dependencies. Or do they both run the same commands?
Who takes ownership of the incident? Who proposes a plan, assigns threads, and keeps the room focused? Or does nobody step up?
Do they verify changes before moving on? Do they test the user path, not just the symptom? Or do they declare victory at the first green light?
How well do they narrate what they're finding? Can Engineer B understand what Engineer A discovered without asking? Silence is data too.
"You take the network policy, I'll trace the service mesh." Clean delegation under pressure is a signal you cannot get from a multiple-choice quiz.
In cascading failures, the order of fixes matters. Do they understand dependency chains? Or do they fix the loudest symptom first?
Put your top two candidates in the same incident room. See who actually leads under pressure instead of who interviews better. One 20-minute session replaces hours of panel interviews.
Your team just migrated to Kubernetes. Instead of hoping they'll learn from runbooks, throw them into a cascading pod failure together. They'll learn faster under pressure, and a colleague is there to catch mistakes.
A new SRE joins the team. Pair them with a senior engineer in a Chaos Mode session. The senior watches methodology, the new hire builds confidence, and managers get a real read on readiness.
Before the next change freeze or on-call rotation, run your team through a shared incident. Same pressure, none of the customer impact.
Our most advanced scenario doesn't end when you fix the first problem. Each resolution triggers the next hidden failure, just like production.
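One way to picture this chaining is a simple state machine in which only the current failure is visible and resolving it reveals the next. The class and the phase names below are invented for illustration and are not the scenario's real phases or engine.

```python
class CascadingScenario:
    """Sketch of a scenario whose failures surface one at a time."""

    def __init__(self, phase_names):
        self.phases = list(phase_names)
        self.active = 0  # index of the only failure currently visible

    def visible_failure(self):
        """Return the failure engineers can currently observe, if any."""
        if self.active >= len(self.phases):
            return None
        return self.phases[self.active]

    def resolve(self, name):
        """Resolve the visible failure; hidden or unknown ones are ignored."""
        if name != self.visible_failure():
            return False
        self.active += 1  # the next hidden failure now becomes visible
        return True

    @property
    def done(self):
        return self.active >= len(self.phases)


# Hypothetical phase names, used only to exercise the sketch.
scenario = CascadingScenario(
    ["readiness-probe-failing", "endpoints-missing", "network-policy-drift"]
)
```

Under this model, attempting to fix a later phase before the current one does nothing, which mirrors the idea that fix order matters in a cascade.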
Both engineers can join through the browser or the CLI. Each person picks whichever interface they work fastest in, and both connect to the same shared container over the same WebSocket. Commands, presence, and replay all work identically regardless of surface.
PARIUM / war room
SCENARIO  K8s Cascading Failure
ROOM      chaos-room-42
STATUS    Waiting for start
────────────────────────────────
● Engineer A connected (you)
● Engineer B connected
────────────────────────────────
Press S to start the session
Press Tab to cycle themes
Press Ctrl+C to leave
Screen sharing lets one person drive while others watch. Chaos Mode gives both engineers full terminal access to the same live system. Both can run commands, both get tracked independently, and the report shows exactly who contributed what. It's the difference between watching someone cook and both being in the kitchen.
Chaos Mode also works for team training, and it's one of the best fits. Pair a new SRE with a senior engineer. Run your platform team through a Kubernetes failure before the next migration. Use it for on-call readiness checks. Same incident engine, different purpose.
Chaos Mode is currently optimised for two engineers, labelled Engineer A and Engineer B. This keeps the signal clean: you can clearly see who led, who investigated, and who verified. Two is enough to expose collaboration patterns without the noise of a large group.
Chaos Mode works best with scenarios that have multiple root causes or branching failure paths. The Kubernetes cascading failure (6 phases) is designed for exactly this. It rewards engineers who split threads and coordinate. Simple single-fix scenarios work better as solo assessments.
The report includes per-engineer command history with timestamps: who ran what, when, and in what order. It also includes presence data (connection times and disconnects), AI analysis of each engineer's approach, and the full session replay, so you can watch the collaboration unfold like a recording.
Because production is a team sport, and your interviews should be too.