Data Centre Assessments

Scenarios

Data centre scenarios available today

Each scenario includes an operational runbook or telemetry view. Junior candidates get guided procedures. Senior candidates get the incident brief and drive the investigation independently.

Read-Only Storage Controller

A storage system has gone read-only. Applications can read data but writes are failing. The candidate investigates the storage controller state, RAID health, drive SMART data, and filesystem mount options to identify the cause and restore write access without data loss.

15-25 min · Senior / Mid / Junior levels
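As an illustration of the mount-options check, here is a minimal Python sketch that scans text in /proc/mounts format for filesystems currently mounted read-only. The function name and the sample data are hypothetical, and a real investigation would also cover the controller state, RAID health, and SMART data.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point, fstype) for every entry mounted read-only."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        # Match the exact "ro" flag, not options such as errors=remount-ro
        if "ro" in options.split(","):
            found.append((device, mount_point, fstype))
    return found


# Hypothetical sample in /proc/mounts format
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 ro,relatime,errors=remount-ro 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    for entry in read_only_mounts(SAMPLE_MOUNTS):
        print(entry)
```

On a live Linux host the same function could be fed the contents of /proc/mounts. A read-only flag only shows the symptom; the cause still has to be traced through the layers below the filesystem.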

RAID Degradation and Drive Replacement

A RAID array is running in degraded mode. One drive has failed and another is showing predictive failure indicators. The candidate identifies the failed and failing drives, assesses rebuild risk, follows the replacement procedure, and initiates rebuild while the system is live.

20-30 min · Senior / Mid levels
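To make the first step concrete, the sketch below assumes Linux software RAID (md) and parses text in /proc/mdstat format to find degraded arrays and members flagged as failed. The scenario itself may run on a hardware controller with vendor tooling instead; the function name and sample data are hypothetical.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")   # e.g. "md1 : active raid5 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")    # e.g. "[3/2] [U_U]"
FAILED_RE = re.compile(r"(\w+)\[\d+\]\(F\)")              # e.g. "sdd1[1](F)"


def degraded_arrays(mdstat_text):
    """Return {array: {"failed": [...], "degraded": True}} for degraded arrays."""
    results = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            failed = FAILED_RE.findall(header.group(4))
            results[current] = {"failed": failed, "degraded": False}
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            results[current]["degraded"] = active < expected
    return {name: info for name, info in results.items() if info["degraded"]}


# Hypothetical sample in /proc/mdstat format
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      1048576 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1](F) sde1[2]
      2097152 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

This only identifies the failed member. Spotting the second, predictively failing drive would need SMART data, which is outside this sketch.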

Thermal Event Response

Environmental sensors have triggered alerts. Inlet temperatures are rising across a row of racks. The candidate reviews IPMI sensor data, identifies the affected zone, checks CRAC unit status, and determines whether to start migrating workloads or wait for facilities to respond.

15-20 min · Mid / Junior levels
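The zone-identification step could be sketched as follows, assuming sensor text in the pipe-separated layout printed by ipmitool's sdr listing. The sensor name "Inlet Temp", the host names, and the 27 °C limit are illustrative assumptions; sensor naming varies by vendor and the real limit comes from the site's own procedure.

```python
def inlet_temperature(sdr_text, sensor_name="Inlet Temp"):
    """Return the named sensor's reading in degrees C, or None if unavailable."""
    for line in sdr_text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) == 5 and cols[0] == sensor_name and cols[4].endswith("degrees C"):
            return float(cols[4].split()[0])
    return None


def hosts_over_limit(inlet_by_host, limit_c):
    """Return the sorted hosts whose inlet reading exceeds limit_c."""
    return sorted(
        host for host, temp in inlet_by_host.items()
        if temp is not None and temp > limit_c
    )


# Hypothetical sensor listing for one host
SAMPLE_SDR = """\
Inlet Temp       | 04h | ok  |  7.1 | 31 degrees C
Exhaust Temp     | 01h | ok  |  7.1 | 45 degrees C
CPU1 Temp        | 0Eh | ns  |  3.1 | No Reading
"""

if __name__ == "__main__":
    # Hypothetical readings across a row; None means the sensor gave no value
    row = {
        "r12-u07": inlet_temperature(SAMPLE_SDR),
        "r12-u09": 24.0,
        "r13-u02": None,
    }
    print(hosts_over_limit(row, limit_c=27.0))
```

Mapping the flagged hosts back to rack positions is what shows whether the problem is one cooling zone or the whole row.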

Power Event: PDU Failure

A PDU has lost a phase. Half the servers in a rack have lost redundant power. UPS is holding but runtime is limited. The candidate must assess the impact, identify which servers need immediate attention (single-corded vs dual-corded), and execute the emergency power procedure safely.

15-25 min · Senior / Mid levels
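The single-corded versus dual-corded triage can be sketched as a simple classification over a power inventory. The inventory layout, the feed labels, and the server names below are hypothetical; a real site would pull this mapping from its DCIM or asset records.

```python
def classify_power_impact(feeds_by_server, failed_feed):
    """Group servers by what the loss of failed_feed means for them."""
    impact = {"no_power": [], "lost_redundancy": [], "unaffected": []}
    for server, feeds in sorted(feeds_by_server.items()):
        surviving = [feed for feed in feeds if feed != failed_feed]
        if failed_feed not in feeds:
            impact["unaffected"].append(server)
        elif surviving:
            # Dual-corded: still running, but on a single feed
            impact["lost_redundancy"].append(server)
        else:
            # Single-corded on the failed feed: needs immediate attention
            impact["no_power"].append(server)
    return impact


# Hypothetical inventory: server -> power feeds it is cabled to
INVENTORY = {
    "db-01": ["pdu-a/L1", "pdu-b/L1"],
    "app-02": ["pdu-a/L1"],
    "app-03": ["pdu-a/L2", "pdu-b/L2"],
    "kvm-04": ["pdu-b/L1"],
}

if __name__ == "__main__":
    print(classify_power_impact(INVENTORY, failed_feed="pdu-a/L1"))
```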

BMC/IPMI Remote Recovery

A remote server is unresponsive. SSH is down but the BMC is reachable. The candidate uses IPMI tools to check the power state, review the System Event Log (SEL), inspect sensor readings, and determine whether a graceful reboot, a hard reset, or a console session is needed. Tests remote-hands decision-making.

10-15 min · Mid / Junior levels
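For readers unfamiliar with the tooling, the sketch below builds the ipmitool command lines such an investigation might use, without executing them. The host name, user name, and function names are hypothetical. The `-E` flag tells ipmitool to read the password from the IPMI_PASSWORD environment variable. Note that `chassis power soft` requests an ACPI shutdown rather than a reboot, so a graceful restart would need a power-on afterwards.

```python
def ipmi_base(host, user):
    """Common ipmitool arguments for an IPMI v2.0 (lanplus) session."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E"]


def triage_commands(host, user):
    """Read-only commands: power state, event log, sensor readings."""
    base = ipmi_base(host, user)
    return [
        base + ["chassis", "power", "status"],
        base + ["sel", "elist"],
        base + ["sdr", "elist"],
    ]


def recovery_command(host, user, action):
    """Command for the chosen recovery action; raises KeyError if unknown."""
    actions = {
        "console": ["sol", "activate"],              # serial-over-LAN session
        "graceful": ["chassis", "power", "soft"],    # ACPI soft shutdown
        "hard_reset": ["chassis", "power", "reset"],
    }
    return ipmi_base(host, user) + actions[action]


if __name__ == "__main__":
    # Hypothetical BMC address and account
    for cmd in triage_commands("bmc-r12-u07.example.net", "operator"):
        print(" ".join(cmd))
    print(" ".join(recovery_command("bmc-r12-u07.example.net", "operator", "console")))
```

Keeping the read-only commands separate from the recovery actions mirrors the decision the scenario tests: gather evidence first, then pick the least disruptive action.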

GPU Xid 79 with Escalation Decision

A GPU in a DC rack has reported Xid 79 errors. The candidate investigates using the DC runbook, checks dmesg, nvidia-smi, and PCIe state, and must decide whether to attempt a driver reset, drain the node, or escalate for hardware RMA. Tests the boundary between operational fix and hardware escalation.

20-30 min · Senior / Mid / Junior levels
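The dmesg check can be illustrated with a short Python sketch that pulls NVIDIA Xid events out of kernel log text and lists the PCI addresses reporting a given code. The log lines below are illustrative samples rather than captured output, and the function names are hypothetical.

```python
import re

# Matches kernel log lines such as "NVRM: Xid (PCI:0000:3b:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xid_events(dmesg_text):
    """Return (pci_address, xid_code) for each Xid line, in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def gpus_with_xid(dmesg_text, code=79):
    """Return the sorted PCI addresses that reported the given Xid code."""
    return sorted({pci for pci, xid in xid_events(dmesg_text) if xid == code})


# Illustrative sample log text
SAMPLE_DMESG = """\
[  812.101] NVRM: Xid (PCI:0000:3b:00): 79, pid=4211, GPU has fallen off the bus.
[  812.102] NVRM: GPU 0000:3b:00.0: GPU has fallen off the bus.
[  815.440] pcieport 0000:3a:00.0: AER: Corrected error received
[  902.007] NVRM: Xid (PCI:0000:3b:00): 79, pid=4211, GPU has fallen off the bus.
"""

if __name__ == "__main__":
    print(gpus_with_xid(SAMPLE_DMESG))
```

A repeated Xid 79 on the same PCI address, as in the sample, is the kind of evidence that feeds the reset, drain, or RMA decision; the code only surfaces it and does not make that call.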

Custom data centre scenarios

Every data centre has its own procedures, ticketing systems, escalation paths, and hardware configurations. If your team relies on its own runbooks, particular hardware vendors (Dell, HPE, Supermicro, Lenovo), or particular monitoring tools (Nagios, Zabbix, Prometheus, Datadog), we can build scenarios that match your operational environment.

Assess the people who keep the lights on.

The tools and procedures your DC team uses