New K8s cascading failure scenario

The flight simulator for
production incidents

Real terminal-based incident simulations. See how candidates actually troubleshoot before you trust them on-call.

Kubernetes · GPU Clusters · Cloud · Docker · Linux
Talk to Sales
Scenario Simulation
Incident: INC-7234
Severity: SEV-1
State: Active
System: k8s-prod-03
Issue: Pod CrashLoop
Impact: API degraded
Duration: 3m 6s
candidate@gpu-node-01
Active
07:12
candidate@gpu-node-01 - parium assessment
# Candidate investigating GPU driver failure
root@gpu-node-01:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
 
root@gpu-node-01:~$ lsmod | grep -E 'nvidia|nouveau'
nouveau 2093056 1
 
root@gpu-node-01:~$ modprobe -r nouveau && modprobe nvidia
Loading nvidia driver...
 
root@gpu-node-01:~$ curl -s localhost:8080/health | jq
{ "status": "healthy", "gpus": 2 }
 
# ✓ Incident resolved in 08:42 - 0 hints used
No sandbox: Real terminals, real tools, real incidents
15 minutes: From link sent to report delivered
Zero engineer time: Scored and analysed without your team reviewing
Full evidence: Commands, timing, investigation path, replay

What take-home tests miss

Respect your candidates' time - and your engineers' too.

The take-home test

  • 3-hour time commitment - the best candidates might not find the time
  • Another hour for your team to review each submission
  • Artificial tasks that don't test real incident response
  • Non-deterministic - two reviewers, two different scores
  • Hard to know if LLMs have been used

With Parium

  • 15 minutes. A real broken server. A real terminal.
  • AI analysis reads the session so your team doesn't have to
  • Tests exactly what they'll do on day one: debug production
  • Same scenario, every candidate. Clear pass/fail with data.
  • Built-in paste detection and tab-switch monitoring
  • Full behavioral picture: session replay shows pastes, tab switches, and every command

Up and running in 3 simple steps

Use our ready-made scenarios or let us build custom assessments for your stack.

01

Tell us what you need

Pick from our ready-made scenarios (GPU debugging, server performance, Kubernetes) or tell us your stack and we'll build custom assessments.

02

Send to candidates

Share a link. Candidates enter their details and drop straight into a live terminal. No downloads, no accounts, no friction.

03

Review the report

See exactly how they debug: time to resolution, commands used, investigation path. Your hiring manager gets a scored report without reviewing a single line of output.

Built for serious technical hiring

The tools your team needs to assess real engineering skills

Real Terminal Environments

Full Linux VMs via the browser, or connect through our CLI for chaos room sessions. Not a sandbox. A real system to debug.

Time-to-Resolution Tracking

Automatic timing from first command to incident resolution. Compare candidates against your team's benchmarks or against each other.
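
As a rough illustration of timing from first command to resolution, the sketch below works out the elapsed time between two MM:SS session offsets. The function names and the offsets are assumptions made for this example, not Parium's actual implementation.

```python
def parse_offset(stamp: str) -> int:
    """Convert an 'MM:SS' session offset into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def time_to_resolution(first_command: str, resolved: str) -> str:
    """Return the elapsed time between two 'MM:SS' offsets as 'MM:SS'."""
    elapsed = parse_offset(resolved) - parse_offset(first_command)
    if elapsed < 0:
        raise ValueError("resolution cannot precede the first command")
    return f"{elapsed // 60:02d}:{elapsed % 60:02d}"


# Hypothetical offsets: first command at 00:30, health check passed at 08:05.
print(time_to_resolution("00:30", "08:05"))  # prints 07:35
```

Working in seconds since session start keeps comparisons between candidates simple, since every session can be reduced to a single integer.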

Runbook & Hint System

Real SOPs like your team uses. Track whether candidates follow procedures independently or need guidance, and how much.

LLM Detection

Paste events, tab switches, and timing patterns are captured and surfaced in the report. Your team decides what matters.

Full Session Replay

Every command and keystroke recorded with timestamps. Replay the entire session or export the full log for review.

Multiple Scenarios

Azure networking, K8s cascading failures, GPU driver conflicts, and more. Match the scenario to the role you're hiring for.

From first application to production ready

Four products on one incident engine. Each one feeds the next.

01
For hiring at scale

Screening Links

One branded link per role. Drop it into your ATS, job spec, or recruiter outreach. Candidates self-serve. You get a scored shortlist.

See Screening Links
02
For deeper evaluation

Assessments

Role-matched incident scenarios with configurable difficulty. Send a link, get a scored report with full evidence and session replay.

See Assessments
03
For final-stage hiring

Chaos Mode

Two engineers in the same live incident. Tests coordination, communication, and leadership under pressure. Per-engineer replay.

See Chaos Mode
04
For your existing team

Team Drills

Run your engineers through practice incidents. Onboarding exercises, on-call readiness checks, and team calibration before you hire.

See Team Drills
All surfaces available via the browser or our CLI. Install with npm install -g @parium.ai/cli

See what candidates actually face

Real logs, real configs, real system state. The tools are what your team already uses: dmesg, journalctl, kubectl, nvidia-smi. Health check endpoints validate the fix.

SEV-1 · Mid–Expert · $8,200/hr impact

Production Edge API unreachable through Azure Load Balancer

Three independent root causes have drifted into a cascading failure. Candidates must trace the request path through Azure networking layers and fix each misconfiguration in the right order.

Root causes to identify
01 NSG rule blocking port 8080 traffic
02 Load Balancer health probe pointing to wrong endpoint
03 Application bound to loopback instead of 0.0.0.0
Simulated tools
az network, systemctl, journalctl, ss, curl, ssh
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT - SEV 1
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-0315-LB503
SEVERITY:     Critical - Production
AFFECTED:     edge-api.parium.internal
IMPACT:       $8,200/hr revenue at risk

───────────────────────────────────────────────

Production Edge API is returning HTTP 503 through
the Azure Load Balancer. The VM appears to be
running, but zero backend health probes succeed.

Active escalations:  3 customer tickets
Executive visibility: Yes - CTO notified

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Investigate why the LB returns 503
2. Identify all root causes (there may be more
   than one)
3. Apply fixes using approved remediation tools
4. Verify health check returns 200 OK
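
One of the root causes listed for this scenario is an application bound to loopback instead of 0.0.0.0. As a minimal sketch of how that shows up, the snippet below scans `ss -ltn` style output for listeners on a given port that are reachable only from the machine itself. The helper name and the sample output are illustrative assumptions, not taken from the real scenario.

```python
LOOPBACK_HOSTS = {"127.0.0.1", "[::1]"}


def loopback_only_listeners(ss_output: str, port: int) -> list[str]:
    """Return local addresses listening on `port` that are bound to loopback."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Listening TCP rows look like: LISTEN <recv-q> <send-q> <local> <peer>
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        host, _, local_port = fields[3].rpartition(":")
        if local_port == str(port) and host in LOOPBACK_HOSTS:
            found.append(fields[3])
    return found


# Illustrative output, not captured from a real scenario.
SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     127.0.0.1:8080      0.0.0.0:*
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
"""

print(loopback_only_listeners(SAMPLE, 8080))  # prints ['127.0.0.1:8080']
```

A service bound this way answers a local curl but never a load balancer probe, which is why the VM can look healthy while every backend probe fails.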
SEV-1 · Expert · Up to $500K/hr

Cascading cluster failure across 6 progressive phases

A multi-engineer scenario where every fix triggers the next hidden failure. Starts with a pod crash-loop and escalates to etcd split-brain and cascading drain storms. Tests crisis management, not just Kubernetes knowledge.

Cascade progression
01 Pod CrashLoopBackOff → fix liveness probe
02 Worker node goes NotReady → diagnose kubelet
03 DNS network policy breaks cluster-wide
04 Memory surge from backed-up traffic
05 Etcd split-brain from clock skew
06 Cascading cordon/drain storm
Simulated tools
kubectl, crictl, systemctl, journalctl, etcdctl, timedatectl
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT - SEV 1
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-WAR-ROOM
SEVERITY:     Critical - Cascading
CLUSTER:      prod-us-east-1 (18 nodes)
IMPACT:       $15K/hr → escalating

───────────────────────────────────────────────

api-gateway pods are in CrashLoopBackOff.
Customer-facing traffic is failing. SLA budget
is burning. This incident has executive
visibility.

WARNING: This incident will escalate.
Each fix you apply may reveal the next failure.
Prioritise methodically.

SLA budget remaining: 47 minutes
Oncall team:          Platform Engineering
Incident room:        Active - you are IC

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Restore api-gateway service availability
2. Investigate and resolve cascading failures
3. Validate cluster health at each phase
4. Maintain SLA budget - time matters
SEV-2 · Mid-Level · $4,200/hr impact

GPU has fallen off the bus - 7 of 8 GPUs visible

An NVIDIA A100 GPU is reporting Xid 79 errors and has disappeared from the PCIe bus. ML training jobs expecting 8 GPUs are failing. Tests hardware diagnosis skills and, critically, whether candidates know when to escalate vs. fix.

Diagnostic path
01 Verify GPU count and identify missing device
02 Check kernel logs for Xid errors and PCIe faults
03 Run DCGM diagnostics to rule out ECC errors
04 Apply ASPM power management fix or escalate
Simulated tools
nvidia-smi, dcgmi, lspci, dmesg, lsmod, ipmitool
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT - SEV 2
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-0119-GPU
SEVERITY:     High - Production ML
AFFECTED:     gpu-node-01.neocloud.internal
IMPACT:       $4,200/hr compute waste

───────────────────────────────────────────────

GPU compute jobs are failing on gpu-node-01.
The node has 8x NVIDIA A100 80GB GPUs but only
7 of 8 devices are detected by monitoring.

Queued jobs:     3 LLM fine-tuning runs
Last healthy:    08:00 UTC today
Kernel log:      Xid 79 - GPU fallen off bus

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Investigate why nvidia-smi shows fewer GPUs
2. Identify the root cause (driver vs hardware)
3. Restore GPU functionality if possible
4. Escalate to hardware team if necessary

From L1 support to senior SRE

Scenarios matched to every role on your team

Site Reliability Engineers

Incident response, system debugging, and production troubleshooting. The scenarios they'd actually face on-call.

GPU Driver Failure · Kubernetes · Performance Issues · Service Outages

DevOps & Platform Engineers

Configuration errors, container failures, API gateway issues, and log-driven debugging.

API Gateway Config · Container Issues · Log Analysis · CI/CD Pipelines

Data Center Engineers

Hardware diagnostics, bare metal troubleshooting, GPU driver issues, and knowing when to escalate.

GPU Diagnostics · IPMI/BMC · Driver Conflicts · Hardware Failures

Linux System Administrators

Runaway processes, disk issues, service recovery, and the fundamentals that senior hires still get wrong.

Runaway Process · Disk Management · Service Recovery · System Boot

Drop into incidents from your terminal

The Parium CLI connects you directly to shared incident sessions from your own terminal. No browser, no context switching. Just run parium open and you're in.

$ npm install -g @parium.ai/cli@latest
  • WebSocket terminal attach for real SSH-like sessions
  • Collaborative incident rooms for team sessions
  • Dark, light, and mono themes that auto-detect your terminal
  • Browser-to-terminal handoff with secure tokens
Terminal - parium
Preview
$ parium open

  █▀█ ▄▀█ █▀█ █ █ █ █▀▄▀█
  █▀▀ █▀█ █▀▄ █ █▄█ █ ▀ █
  Chaos Terminal Client v0.1.0-alpha.2

Paste handoff token: ••••••••••••

 Token validated
 Session resolved - k8s-chaos-war-room
 Attaching to terminal...

──────────────────────────────────────
  SESSION  K8s Cascading Failure
  STATUS   ● LIVE
  PHASE    3 of 6 - DNS network policy
  IMPACT   $120K/hr
──────────────────────────────────────

candidate@prod-worker-07:~$ 

An assessment that respects engineers' time.

No unfamiliar IDEs. No artificial puzzles. Just a terminal and a real incident - the environment they work in every day.

  • Finish in under 20 minutes - not days
  • Real tools, real terminal
  • Reflects how your team actually works
  • Your engineers focus on building, not reviewing take-homes
  • AI reads the session so your team doesn't have to
  • Results ready to share with the hiring panel
Assessment Results: Passed
Feb 15, 2025 · 14:32 UTC
Candidate: Sarah Chen
Scenario: GPU Failure
Resolution: 07:38
Time Limit: 20:00
Commands: 14
Hints Used: 0
LLM Risk: Low
Outcome
  • Root cause correctly identified
  • Production-safe fix applied
  • Service health verified
Timeline
00:00 Session started
01:12 Checked GPU state
03:44 Identified driver conflict
05:21 Applied fix
07:38 Health check passed
Behaviour
Time to root cause: 3:44
Confidence: High
Command Log
00:12 $ nvidia-smi
NVIDIA-SMI has failed - driver not loaded
00:45 $ lsmod | grep nouveau
nouveau 2461696 1
01:12 $ dmesg | grep -i gpu
[10:14:32] NVRM: GPU has fallen off the bus
02:34 $ modprobe -r nouveau
03:44 $ modprobe nvidia
Loading nvidia driver...
05:21 $ nvidia-smi
GPU 0: NVIDIA A100 | 45C | 32W
07:38 $ curl -s localhost:8080/health
{"status":"healthy"}

Frequently asked questions

How Parium works, what your team sees, and what candidates experience.

Candidates connect to a real, isolated Linux environment - not a browser simulation or multiple-choice sandbox. Each assessment spins up a fresh system with the incident pre-configured. They get full terminal access with real bash, real logs, and real system tools. It's the same experience as SSH'ing into a production server.

Parium is built for any role that requires hands-on Linux troubleshooting: Site Reliability Engineers (SRE), DevOps Engineers, Platform Engineers, Data Center Technicians, Linux System Administrators, Cloud Engineers, and Infrastructure Engineers. Our scenarios range from L1 support tasks (config errors, disk space) to L4 senior-level incidents (GPU driver conflicts, kernel modules, PCIe issues).

We monitor for patterns that suggest external help - things like leaving the terminal for extended periods, large paste events, and unusual command timing. Suspicious activity gets flagged in the hiring manager report with enough context for you to make an informed judgment. We can't catch everything, but the patterns are usually pretty obvious.
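
To make the idea concrete, here is a minimal sketch of threshold-based flagging over session events. The event names, the thresholds, and the `flag_events` helper are all hypothetical and chosen for illustration; they do not describe how Parium actually scores sessions.

```python
from dataclasses import dataclass


@dataclass
class SessionEvent:
    kind: str   # "paste" or "tab_away" (hypothetical event names)
    value: int  # pasted characters, or seconds spent away from the terminal


# Illustrative thresholds only; real values would need tuning.
MAX_PASTE_CHARS = 200
MAX_AWAY_SECONDS = 60


def flag_events(events: list[SessionEvent]) -> list[str]:
    """Return human-readable flags for events that exceed the thresholds."""
    flags = []
    for event in events:
        if event.kind == "paste" and event.value > MAX_PASTE_CHARS:
            flags.append(f"large paste ({event.value} chars)")
        elif event.kind == "tab_away" and event.value > MAX_AWAY_SECONDS:
            flags.append(f"away from terminal for {event.value}s")
    return flags


events = [
    SessionEvent("paste", 12),
    SessionEvent("tab_away", 95),
    SessionEvent("paste", 640),
]
print(flag_events(events))
```

Returning flags with their raw values, rather than a single verdict, matches the intent above: the reviewer gets context and makes the judgment.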

When the candidate clicks "Verify Fix," we run a health check against the scenario's success criteria (e.g., curl the API endpoint, check nvidia-smi output). If it passes, we record their time-to-resolution. The hiring manager gets a report with every command, timestamps, hints used, suspicious activity flags, and an analysis of how the candidate approached the problem.
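
A check of this kind could look something like the sketch below, which evaluates the JSON body returned by a health endpoint such as the `{"status": "healthy", "gpus": 2}` response shown in the terminal example. The `fix_verified` function and its `expected_gpus` parameter are assumptions for illustration, not Parium's real success criteria.

```python
import json
from typing import Optional


def fix_verified(health_body: str, expected_gpus: Optional[int] = None) -> bool:
    """Return True if a health endpoint's JSON body meets the success criteria."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        return False  # a non-JSON response counts as a failed check
    if not isinstance(payload, dict) or payload.get("status") != "healthy":
        return False
    if expected_gpus is not None and payload.get("gpus") != expected_gpus:
        return False
    return True


print(fix_verified('{"status": "healthy", "gpus": 2}', expected_gpus=2))  # True
print(fix_verified('{"status": "degraded"}'))                             # False
```

Checking the observable outcome rather than the commands used means any valid fix passes, whichever path the candidate took to reach it.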

HackerRank, Codility, and similar platforms test algorithmic coding in sandboxed editors. Parium tests operational skills in real Linux environments. Your SRE candidates don't need to reverse a linked list - they need to figure out why nginx won't start or why the GPU driver isn't loading. We measure how they investigate, not whether they memorised the answer.

Yes. We can build scenarios that mirror your actual production environment - your monitoring tools, your deployment setup, your common failure modes. Whether it's Kubernetes on EKS, GPU clusters with SLURM, or legacy systems with custom daemons, we'll create assessments that test exactly what your team deals with day-to-day. Get in touch to discuss.

Beyond pass/fail, we give you session replay - watch exactly how candidates approached the problem. You'll see every command they ran, when they pasted content (and what they pasted), when they switched tabs, how long they were away, and when they used hints. It's like watching over their shoulder, but asynchronously. You see how they think, not just whether they got the answer.

Consistent by design

Every candidate gets the same scenario, the same environment, the same success criteria. No more "it depends on who reviewed it."

01

Same scenario, every time

No variation between candidates. Everyone faces the same incident with the same tools available.

02

Objective criteria

Clear pass/fail based on whether the fix works, not on how well someone writes a README or formats their code.

03

Data-driven decisions

Time-to-resolution, commands used, hints requested. Compare candidates on the metrics that matter.

Get started with Parium

Whether you need a custom scenario for your stack, want to discuss enterprise pricing, or just have questions, we'd love to hear from you.

Talk to us

We'll get back to you within a working day.

Ready to hire engineers you'd trust on call?

See real incident performance before you hire.

Contact Sales