Web Server Down — 502s Site-Wide
SaaS Startup · Production · ~1,200 users affected
A recent deployment broke the reverse proxy. Every request is returning 502. You're paged.
CloudDrill puts you on call for production P0/P1 outages — cascading failures, AWS networking disasters, compliance breaches — so you build the judgment that certifications don't test and interviews can't fake.
Most platforms prepare you to answer questions about cloud.
CloudDrill prepares you to fix it when it breaks.
Every scenario starts with an alert. You triage, investigate, and restore production — the same way you will on the job, without the blast radius.
E-Commerce · Black Friday · Checkout unavailable
API pods are stuck in a restart loop after a config push. Customers can't complete purchases. Leadership is on the bridge.
FinTech · Production · 100% of transactions failing
The app lost database connectivity after a VPC change. No transactions are processing. Trace the network path and restore access.
Enterprise SaaS · Pre-deploy review · Release blocked
State drift means the next apply would recreate the production database. Stop the plan, fix the drift, unblock the release.
Engineering Team · CD Pipeline · Deploy impossible
The container exits with code 1 before the app initializes. The team is blocked on a Friday deploy. Find the cause and unblock the release.
Defense Contractor · FedRAMP Audit · Examiner waiting
The auditor is on-site and has requested AU-2 Audit Events evidence. Collect, format, and submit compliant CloudTrail evidence before the window closes.
Modeled on real production incidents and senior-level interview scenarios — not textbook examples.
Requires networking, containers, Linux, and Kubernetes internals — simultaneously.
Many outages happen because one small networking rule is wrong. Engineers must trace the full path.
Many engineers can write Terraform. Fewer can recover from broken Terraform.
Companies often care more about troubleshooting broken pipelines than building them.
A large share of cloud troubleshooting ends up as Linux troubleshooting.
Engineers often spend hours troubleshooting what turns out to be DNS.
Companies want engineers who can find root cause quickly — not just read dashboards.
These skills are rarely taught but heavily used. The simulator can stand out here.
Very few platforms cover this. Appeals strongly to government, defense, and regulated-industry engineers.
The most realistic incidents aren't single-service problems. A deployment fails. Terraform changed a Security Group. Kubernetes lost database connectivity. Pods started crashing. CI/CD rolled back. Monitoring generated hundreds of alerts. The engineer must trace the full chain — exactly what happens at 2am in production.
Follow your progress in every track, pick up challenges where you left off, and drill the exact skills interviewers test.
Every challenge you complete moves your readiness score. When you walk into an interview, you'll know exactly where you stand — and so will your interviewer.
Join the waitlist. Shape what gets built. Your answers directly determine which incidents we simulate first.
We'll be in touch when beta access opens.
If you said yes to an interview, expect a calendar link within the week.