CloudDrill — Production Incident Simulation for Engineers

Train on Real Incidents Before You're Responsible.

CloudDrill puts you on call for production P0/P1 outages — cascading failures, AWS networking disasters, compliance breaches — so you build the judgment that certifications don't test and interviews can't fake.


View Incident Queue ↓

25+ Incidents · P0–P3 Severity Levels · Free Beta Access

You're paged.
Restore service.

Every scenario starts with an alert. You triage, investigate, and restore production — the same way you will on the job, without the blast radius.

03:42 AM P1

Web Server Down — 502s Site-Wide

SaaS Startup · Production · ~1,200 users affected

A recent deployment broke the reverse proxy. Every request is returning 502. You're paged.

Linux · NGINX · systemd

You are on call. →

~15 min

02:13 AM P1

Checkout API Down — CrashLoopBackOff

E-Commerce · Black Friday · Checkout unavailable

API pods are stuck in a restart loop after a config push. Customers can't complete purchases. Leadership is on the bridge.

$18,000/hr revenue impact

Kubernetes · Secrets · kubectl

You are on call. →

~25 min

11:57 PM P0

Database Unreachable — Transactions Failing

FinTech · Production · 100% of transactions failing

The app lost database connectivity after a VPC change. No transactions are processing. Trace the network path and restore access.

AWS · VPC · Security Groups · RDS

You are on call. →

~30 min

09:15 AM P1

Terraform Will Destroy the Production DB

Enterprise SaaS · Pre-deploy review · Release blocked

State drift means the next apply would recreate the production database. Stop the plan, fix the drift, unblock the release.

Terraform · State · AWS · RDS

You are on call. →

~40 min

04:31 PM P2

New Release Won't Start — Pipeline Blocked

Engineering Team · CD Pipeline · Deploy impossible

The container exits with code 1 before the app initializes. The team is blocked on a Friday deploy. Find the cause and unblock.

Docker · CI/CD · Debugging

You are on call. →

~20 min

08:00 AM P1

Auditor On-Site — AU-2 Evidence Due in 2hrs

Defense Contractor · FedRAMP Audit · Examiner waiting

The auditor is on-site and has requested AU-2 Audit Events evidence. Collect, format, and submit compliant CloudTrail evidence before the window closes.

FedRAMP · CloudTrail · NIST · AU-2

You are on call. →

~60 min

Every scenario that wakes engineers at 2am.

Modeled on real production incidents and senior-level interview scenarios — not textbook examples.

☸️ Coming soon

Kubernetes Production Outages

Pods stuck in CrashLoopBackOff
ImagePullBackOff — registry auth failures
DNS failures inside clusters
OOMKilled containers / memory leaks
Ingress misconfigurations

Requires networking, containers, Linux, and Kubernetes internals — simultaneously.

☁️ Coming soon

AWS Networking Problems

EC2 instance unreachable
Security Group misconfigurations
NAT Gateway and route table mistakes
VPC peering failures
Private subnet internet access

Many outages happen because one small networking rule is wrong. Engineers must trace the full path.

🏗️ Coming soon

Terraform Problems

State file corruption
Resource drift and failed applies
Importing existing resources
Dependency cycles
Accidentally replacing production

Many engineers can write Terraform. Fewer can recover from broken Terraform.

⚙️ Coming soon

CI/CD Pipeline Failures

Build and test failures
Secret injection issues
Container image push failures
Deployment rollbacks
Failed GitHub Actions workflows

Companies often care more about troubleshooting broken pipelines than building them.

🐧 Coming soon

Linux Incidents

Disk full / CPU spikes / memory exhaustion
Zombie processes and service crashes
Log explosion
Permission and SSH failures

A huge percentage of cloud troubleshooting ultimately becomes Linux troubleshooting.

🌐 Coming soon

DNS Failures

Internal DNS resolution failures
Route53 misconfigurations
Split-horizon DNS problems
Wrong CNAMEs / expired records

Engineers often spend hours troubleshooting what turns out to be DNS.

📊 Coming soon

Monitoring & Alerting Incidents

Missing and false-positive alerts
Prometheus scraping failures
Broken Grafana dashboards
OpenTelemetry configuration issues

Companies want engineers who can find root cause quickly — not just read dashboards.

🚨 Advanced

Incident Response Scenarios

Major service outage — analyze alerts, create timeline
Production deployment failure
API latency spike — identify root cause
Multi-region failure
Write the postmortem

These skills are rarely taught but heavily used. This is where the simulator stands out.

🛡️ Exclusive

FedRAMP / Compliance Investigations

Public S3 bucket discovered
IAM privilege escalation risk
Secrets committed to Git
Missing audit logs — gather evidence
Encryption compliance failures

Very few platforms cover this. Built for government, defense, and regulated-industry engineers.

🔗 Multi-Layer Failures — Senior-Level Scenarios

The most realistic incidents aren't single-service problems. A deployment fails. Terraform changed a Security Group. Kubernetes lost database connectivity. Pods started crashing. CI/CD rolled back. Monitoring generated hundreds of alerts. The engineer must trace the full chain — exactly what happens at 2am in production.

Stop preparing.
Arrive ready.

Every challenge you complete moves your readiness score. When you walk into an interview, you'll know exactly where you stand — and so will your interviewer.

Ready for:

✓ DevOps Engineer ✓ Platform Engineer ✓ SRE ✓ Cloud Engineer

Courses teach.
CloudDrill trains.

A real environment.
Not a quiz.

Get early access
