Beta · Now accepting signups

Train on Real
Incidents Before
You're Responsible.

CloudDrill puts you on call for production P0/P1 outages — cascading failures, AWS networking disasters, compliance breaches — so you build the judgment that certifications don't test and interviews can't fake.

Free beta access · No spam · Unsubscribe anytime

View Incident Queue ↓

25+ Incidents · P0–P3 Severity Levels · Free Beta Access

The Difference

Courses teach.
CloudDrill trains.

Most platforms prepare you to answer questions about cloud.
CloudDrill prepares you to fix it when it breaks.

                 Courses & Certs     CloudDrill
Format           Watch videos        Fix live outages
Assessment       Multiple choice     Real terminals
Scenario         Build demos         Restore production
Skill tested     Memorize concepts   Troubleshoot under pressure
Interview prep   Hope for the best   Arrive with a score
What you learn   Cloud concepts      How to think under pressure

Active Incident Queue

You're paged.
Restore service.

Every scenario starts with an alert. You triage, investigate, and restore production — the same way you will on the job, without the blast radius.

03:42 AM P1

Web Server Down — 502s Site-Wide

SaaS Startup · Production · ~1,200 users affected

A recent deployment broke the reverse proxy. Every request is returning 502. You're paged.

Linux · NGINX · systemd
You are on call. →
~15 min

02:13 AM P1

Checkout API Down — CrashLoopBackOff

E-Commerce · Black Friday · Checkout unavailable

API pods are stuck in a restart loop after a config push. Customers can't complete purchases. Leadership is on the bridge.

$18,000/hr revenue impact
Kubernetes · Secrets · kubectl
You are on call. →
~25 min

11:57 PM P0

Database Unreachable — Transactions Failing

FinTech · Production · 100% of transactions failing

The app lost database connectivity after a VPC change. No transactions are processing. Trace the network path and restore access.

AWS · VPC · Security Groups · RDS
You are on call. →
~30 min

09:15 AM P1

Terraform Will Destroy the Production DB

Enterprise SaaS · Pre-deploy review · Release blocked

State drift means the next apply would recreate the production database. Stop the plan, fix the drift, unblock the release.

Terraform · State · AWS · RDS
You are on call. →
~40 min

04:31 PM P2

New Release Won't Start — Pipeline Blocked

Engineering Team · CD Pipeline · Deploy impossible

The container exits with code 1 before the app initializes. The team is blocked on a Friday deploy. Find the cause and unblock.

Docker · CI/CD · Debugging
You are on call. →
~20 min

Exclusive Track
08:00 AM P1

Auditor On-Site — AU-2 Evidence Due in 2hrs

Defense Contractor · FedRAMP Audit · Examiner waiting

The auditor is on-site and has requested AU-2 Audit Events evidence. Collect, format, and submit compliant CloudTrail evidence before the window closes.

FedRAMP · CloudTrail · NIST · AU-2
You are on call. →
~60 min

Full Incident Catalog

Every scenario that
wakes engineers at 2am.

Modeled on real production incidents and senior-level interview scenarios — not textbook examples.

☸️ Coming soon
Kubernetes Production Outages
  • Pods stuck in CrashLoopBackOff
  • ImagePullBackOff — registry auth failures
  • DNS failures inside clusters
  • OOMKilled containers / memory leaks
  • Ingress misconfigurations

Requires networking, containers, Linux, and Kubernetes internals — simultaneously.
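As a rough illustration of the first triage step for the failure states listed above, here is a minimal Python sketch. It assumes you have already captured the output of `kubectl get pods -o json` and parsed it into a dict; the function name `triage_pods` and the sample pod names are hypothetical, not part of CloudDrill.

```python
# Waiting reasons taken from the list above. OOMKilled is reported as a
# termination reason on the container's previous run, not as a waiting reason.
WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def triage_pods(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (pod, container, reason) for containers in a known failure state."""
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        # Pods that were never scheduled may have no containerStatuses at all.
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in WAITING_REASONS:
                findings.append((pod_name, cs["name"], waiting["reason"]))
            last = cs.get("lastState", {}).get("terminated") or {}
            if last.get("reason") == "OOMKilled":
                findings.append((pod_name, cs["name"], "OOMKilled"))
    return findings


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "checkout-api-7d9f"},
         "status": {"containerStatuses": [
             {"name": "api",
              "state": {"waiting": {"reason": "CrashLoopBackOff"}},
              "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}}]}},
        {"metadata": {"name": "worker-5c2a"},
         "status": {"containerStatuses": [
             {"name": "worker",
              "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
              "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
        {"metadata": {"name": "web-1a2b"},
         "status": {"containerStatuses": [
             {"name": "web", "state": {"running": {}}, "lastState": {}}]}},
    ]
}

findings = triage_pods(sample)
for pod_name, container, reason in findings:
    print(f"{pod_name}/{container}: {reason}")
```

A filter like this only tells you where to look. The cause still has to come from logs, events, and the config that changed.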

☁️ Coming soon
AWS Networking Problems
  • EC2 instance unreachable
  • Security Group misconfigurations
  • NAT Gateway and route table mistakes
  • VPC peering failures
  • Private subnet internet access

Many outages happen because one small networking rule is wrong. Engineers must trace the full path.
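To make "one small networking rule" concrete, the sketch below checks whether a Security Group's ingress rules admit a given source IP and port. It assumes the rules are in the `IpPermissions` shape returned by `aws ec2 describe-security-groups`; the function name `ingress_allows`, the addresses, and the port are illustrative assumptions. It looks only at IPv4 CIDR rules and ignores security-group references and IPv6 ranges, so treat it as one hop of the path, not the whole trace.

```python
import ipaddress


def ingress_allows(ip_permissions: list[dict], source_ip: str, port: int,
                   protocol: str = "tcp") -> bool:
    """Return True if any IPv4 CIDR ingress rule admits source_ip on port."""
    src = ipaddress.ip_address(source_ip)
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        # "-1" means all protocols and all ports; such rules carry no port range.
        if proto != "-1":
            if proto != protocol:
                continue
            if not (perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)):
                continue
        for rng in perm.get("IpRanges", []):
            if src in ipaddress.ip_network(rng["CidrIp"], strict=False):
                return True
    return False


# Hypothetical rule set: the database accepts PostgreSQL traffic from one subnet.
db_rules = [
    {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
     "IpRanges": [{"CidrIp": "10.0.1.0/24"}]},
]

print(ingress_allows(db_rules, "10.0.1.20", 5432))  # app in the allowed subnet
print(ingress_allows(db_rules, "10.0.2.15", 5432))  # app moved to another subnet
```

In this made-up case the rule itself never changed. The application moved to a subnet the rule does not cover, which is why tracing the full path matters.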

🏗️ Coming soon
Terraform Problems
  • State file corruption
  • Resource drift and failed applies
  • Importing existing resources
  • Dependency cycles
  • Accidentally replacing production

Many engineers can write Terraform. Fewer can recover from broken Terraform.
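One way to catch "accidentally replacing production" before an apply is to inspect the plan in machine-readable form. The sketch below assumes a plan exported with `terraform show -json` and parsed into a dict; it reads the `resource_changes[].change.actions` field of that format. The function name `destructive_changes` and the resource addresses are hypothetical.

```python
def destructive_changes(plan: dict) -> list[tuple[str, str]]:
    """Return (address, kind) for every resource the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replacement appears as both "delete" and "create" in the actions list.
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((rc["address"], kind))
    return flagged


# Hypothetical excerpt shaped like `terraform show -json` plan output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.production",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.app",
         "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["no-op"]}},
    ]
}

flagged = destructive_changes(sample_plan)
for address, kind in flagged:
    print(f"{kind.upper()}: {address}")
```

A check like this could gate a pipeline, but it only detects the problem. Fixing the drift is still a manual investigation.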

⚙️ Coming soon
CI/CD Pipeline Failures
  • Build and test failures
  • Secret injection issues
  • Container image push failures
  • Deployment rollbacks
  • Failed GitHub Actions workflows

Companies often care more about troubleshooting broken pipelines than building them.

🐧 Coming soon
Linux Incidents
  • Disk full / CPU spikes / memory exhaustion
  • Zombie processes and service crashes
  • Log explosion
  • Permission and SSH failures

Much of cloud troubleshooting ultimately comes down to Linux troubleshooting.

🌐 Coming soon
DNS Failures
  • Internal DNS resolution failures
  • Route53 misconfigurations
  • Split-horizon DNS problems
  • Wrong CNAMEs / expired records

Engineers often spend hours troubleshooting what turns out to be DNS.

📊 Coming soon
Monitoring & Alerting Incidents
  • Missing and false-positive alerts
  • Prometheus scraping failures
  • Broken Grafana dashboards
  • OpenTelemetry configuration issues

Companies want engineers who can find root cause quickly — not just read dashboards.

🚨 Advanced
Incident Response Scenarios
  • Major service outage — analyze alerts, create timeline
  • Production deployment failure
  • API latency spike — identify root cause
  • Multi-region failure
  • Write the postmortem

These skills are rarely taught but heavily used.

🛡️ Exclusive
FedRAMP / Compliance Investigations
  • Public S3 bucket discovered
  • IAM privilege escalation risk
  • Secrets committed to Git
  • Missing audit logs — gather evidence
  • Encryption compliance failures

Very few platforms cover this. Built for engineers in government, defense, and regulated industries.

🔗 Multi-Layer Failures — Senior-Level Scenarios

The most realistic incidents aren't single-service problems. A deployment failed. Terraform changed a Security Group. Kubernetes lost database connectivity. Pods started crashing. CI/CD rolled back. Monitoring generated hundreds of alerts. The engineer must trace the full chain, exactly as it happens at 2am in production.

Platform Preview

A real environment.
Not a quiz.

Follow your progress across tracks, pick up challenges where you left off, and drill the exact skills interviewers test.

Career Outcomes

Stop preparing.
Arrive ready.

Every challenge you complete moves your readiness score. When you walk into an interview, you'll know exactly where you stand — and so will your interviewer.

Ready for:
✓ DevOps Engineer ✓ Platform Engineer ✓ SRE ✓ Cloud Engineer

Beta Access

Get early access

Join the waitlist. Shape what gets built. Your answers directly determine which incidents we simulate first.

Your Info
Help Us Build the Right Thing
Be specific. This becomes our roadmap.

We'll only use your email for CloudDrill beta updates.

You're on the list.

We'll be in touch when beta access opens.
If you said yes to an interview, expect a calendar link within the week.