58% of organizations that experience a significant IT outage have never tested their DR plan before the incident (Ponemon Institute 2023).
RTO vs. RPO: Recovery Time Objective (how long before systems are back) and Recovery Point Objective (how much data loss is acceptable) are the two metrics that define whether a DR test passed or failed.
4 test types, in increasing order of confidence: tabletop (discussion), simulation (partial execution), parallel test (failover while primary stays up), full cutover (primary down, secondary is production).

A DR test validates that your recovery procedures actually work and that your team can execute them under pressure. The four test types build in complexity: tabletop tests verify the plan exists and the team knows it; simulation tests execute the runbook in a non-production environment; parallel tests bring up DR infrastructure while production runs; and full cutover tests put DR infrastructure in production and verify everything works end-to-end. Each type answers a different question and fits a different risk tolerance.

Test Type 1: Tabletop Exercise (Discovery)

What it validates: does the team know the plan? Are the runbooks complete? Who is responsible for each step?

Format: 2-hour workshop with IT, security, operations, and business stakeholders. No systems are touched.

Scenario example: "Our primary data center lost power at 2:00 AM on a Friday. The backup generator failed after 4 hours. All servers are offline. Walk us through what happens next."

Facilitated questions to drive discussion:

  • Who declares a disaster and how is that communicated to the team and to customers?
  • Which systems are restored first and in what order? Why?
  • Where is the DR runbook and who has access to it right now (not on the primary systems)?
  • What is our RTO and are we confident we can meet it based on the current runbook?
  • Who contacts vendors, partners, and customers, and with what message?
  • What happens if the primary on-call person is unavailable?

Pass criteria: a complete runbook exists, team members know their roles, critical vendor contacts are accessible outside the primary systems, and dependencies between systems are documented.

Output: a list of gaps, an updated runbook with assigned owners, and scheduled follow-up actions.

Test Type 2: Simulation (Functional Validation)

What it validates: can team members actually execute the runbook steps?

Scope: execute individual recovery procedures against non-production systems, or with production in read-only mode.

Example simulations:

  • Restore a production database from backup to a staging environment and verify data integrity
  • Bring up the application tier in the DR VPC and confirm it connects to the restored database
  • Validate DNS failover routing in a staging environment by triggering Route 53 health check failure

Pre-test checklist:

  • Runbook reviewed and confirmed current within the past 30 days
  • DR credentials accessible from a location that doesn't depend on primary systems (password manager, printed credentials in a secure location)
  • Team members have read through the relevant runbook steps at least once before test day
  • Scope and pass/fail criteria defined in advance and agreed upon by all participants

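Timing: time each step with a stopwatch. Sum the times of the steps that must run in sequence to calculate the projected RTO; steps that run in parallel count once, at the duration of the longest.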
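
The timing arithmetic can be scripted so the projection is recorded the same way after every simulation. The following is a minimal sketch, assuming the steps run strictly one after another and are timed in whole minutes; the step names, durations, 240-minute target, and function names are all hypothetical.

```shell
# Sum per-step stopwatch times (whole minutes) and compare with the RTO target.
# Assumption: steps run sequentially. Step names and numbers are illustrative.
projected_rto_minutes() {
  # Reads "step-name minutes" lines on stdin and prints the total minutes.
  awk '{ total += $NF } END { print total + 0 }'
}

rto_verdict() {
  # $1 = projected minutes, $2 = RTO target in minutes
  if [ "$1" -le "$2" ]; then echo PASS; else echo FAIL; fi
}

total=$(projected_rto_minutes <<'EOF'
restore-database 95
start-app-tier 12
dns-failover 8
EOF
)
echo "Projected RTO: ${total} min -> $(rto_verdict "$total" 240)"
```

Keeping the per-step lines in a file alongside the runbook makes it easy to see which step dominates the total from one test to the next.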

Common findings in simulations:

  • Expired SSL certificates in the DR environment that block HTTPS traffic
  • Missing firewall rules in the DR VPC preventing inter-service communication
  • Database restore time 3x longer than the RTO allows because the backup is larger than expected

Test Type 3: Parallel Test (Warm Standby Validation)

What it validates: does the complete DR environment work while production remains safe?

Pattern: bring up the DR environment fully, route test traffic to it, validate all critical functions work, then optionally execute a full DNS cutover.

AWS cross-region parallel test:

# Promote RDS read replica to a standalone instance
# (primary DB is unaffected, but promotion is irreversible: replication is severed
# and a new read replica must be created after the test)
aws rds promote-read-replica \
  --db-instance-identifier dr-replica \
  --region us-west-2

# Bring up application tier in DR region
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --min-size 2 --max-size 10 --desired-capacity 2 \
  --region us-west-2

# Validate application health in DR region (-f makes curl exit non-zero on HTTP errors)
curl -fsS https://dr-lb.us-west-2.elb.amazonaws.com/health

# Measure actual RTO:
# Time from replica promotion command to successful health check response
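
The RTO measurement described above can be taken by polling the health endpoint and recording the elapsed time. This is a minimal sketch, assuming it is started at the moment the promotion command is issued; the function name, timeout, and polling interval are hypothetical, and the probe is passed in as a command so any check can be used.

```shell
# Poll a probe command until it succeeds, then print the elapsed seconds.
# Usage: measure_recovery_seconds <timeout_s> <interval_s> <probe command...>
measure_recovery_seconds() {
  timeout_s=$1; interval_s=$2; shift 2
  start=$(date +%s)
  while :; do
    # Success: report how long recovery took and stop.
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - $start ))
      return 0
    fi
    # Give up once the timeout has elapsed.
    if [ $(( $(date +%s) - $start )) -ge "$timeout_s" ]; then
      echo "probe still failing after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Example (start right after issuing promote-read-replica):
# measure_recovery_seconds 3600 10 curl -fsS https://dr-lb.us-west-2.elb.amazonaws.com/health
```

The printed number is the value to record as actual RTO for this test; a non-zero exit means the health check never passed within the timeout.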

DNS failover validation:

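# Before: confirm DNS points to primary LB
dig +short yourdomain.com @8.8.8.8

# Simulate Route 53 health check failure by inverting the primary's health check
# ([health-check-id] is a placeholder; revert with --no-inverted after the test)
aws route53 update-health-check \
  --health-check-id [health-check-id] \
  --inverted

# After propagation (health check interval plus record TTL): confirm DNS fails over to DR LB
dig +short yourdomain.com @8.8.8.8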

Pass criteria: the DR environment passes all functional tests within the RTO, and production traffic is unaffected throughout the test.

Test Type 4: Full Cutover (Maximum Confidence)

What it validates: the complete DR scenario -- primary offline, DR becomes production.

Risk level: highest. Schedule during a planned maintenance window and notify all stakeholders in advance.

Pre-test requirements:

  • Successful parallel test completed within the past 90 days
  • Explicit change management approval with stakeholder sign-off
  • Rollback procedure defined and tested (how to restore primary and cut traffic back)

Full cutover steps (AWS cross-region example):

# Step 1: Announce maintenance window to users (30 minutes before)
# Step 2: Set application to read-only to stop writes to primary

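# Step 3: Verify replication lag is zero
# (ReplicaLag is published as a CloudWatch metric in the replica's Region;
# the date syntax below assumes GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=dr-replica \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Maximum --region us-west-2
# Every datapoint must report 0 seconds before proceeding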

# Step 4: Promote RDS read replica (same instance checked in Step 3)
aws rds promote-read-replica \
  --db-instance-identifier dr-replica \
  --region us-west-2

# Step 5: Update application config to DR database endpoint
# Step 6: Scale up DR application tier
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name dr-app-asg \
  --desired-capacity 10 \
  --region us-west-2

# Step 7: Update Route 53 to point to DR load balancer
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-dns-changeset.json

# Step 8: Validate -- run full functional test suite against DR environment
# Step 9: Record actual RTO (time from maintenance start to "system operational")
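
Step 7 reads its changes from dr-dns-changeset.json, which is not shown. The following is a minimal sketch of what that file might contain, assuming a simple alias A record at the zone apex pointing at the DR load balancer; the comment text and the [dr-lb-hosted-zone-id] placeholder (the load balancer's own hosted zone ID, not the domain's) are assumptions to replace with your values. A zone that uses failover routing records would need a different change batch.

```shell
# Write an example Route 53 change batch; values in [brackets] are placeholders.
cat > dr-dns-changeset.json <<'EOF'
{
  "Comment": "DR cutover: point the site at the DR load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "yourdomain.com.",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "[dr-lb-hosted-zone-id]",
          "DNSName": "dr-lb.us-west-2.elb.amazonaws.com.",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}
EOF
```

Preparing a matching rollback change batch that points back at the primary load balancer before the test keeps the cut-back step to a single command.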

Pass criteria: RTO met, RPO met (verify by comparing the timestamp of the last transaction committed on the primary before cutover with the last transaction present in the promoted DR database), and all critical business functions validated by the business team.
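
The RPO part of the pass criteria comes down to a subtraction: the data loss window is the gap between the last commit on the primary and the last commit that made it into the DR database. This is a minimal sketch, assuming both timestamps have already been converted to Unix epoch seconds; the function names, sample timestamps, and the 4-hour (14400-second) target are hypothetical.

```shell
# Data loss window = last commit on the primary minus last commit present in DR.
# Usage: observed_rpo_seconds <primary_last_commit_epoch> <dr_last_commit_epoch>
observed_rpo_seconds() {
  loss=$(( $1 - $2 ))
  # A DR copy that is fully caught up has lost nothing.
  if [ "$loss" -lt 0 ]; then loss=0; fi
  echo "$loss"
}

rpo_verdict() {
  # $1 = observed loss in seconds, $2 = RPO target in seconds
  if [ "$1" -le "$2" ]; then echo PASS; else echo FAIL; fi
}

loss=$(observed_rpo_seconds 1700000300 1700000000)
echo "Observed data loss: ${loss}s -> $(rpo_verdict "$loss" 14400)"
```

In a planned cutover with writes stopped and replication lag at zero, the observed loss should be 0; anything higher is worth recording as an issue even when it falls inside the target.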

Post-Test Report Template

Use this template for every DR test regardless of type:

DR TEST REPORT -- [Date]

Test Type: [Tabletop / Simulation / Parallel / Full Cutover]
Systems in Scope: [list each system tested]
Test Duration: [start time] to [end time]
Test Lead: [name]

Objectives
RTO Target: [X hours]   Actual RTO: [X hours]   Result: [PASS / FAIL]
RPO Target: [X hours]   Actual RPO: [X hours]   Result: [PASS / FAIL]

Steps Completed Successfully:
- [list each step and the time it took]

Issues Found:
#  | Step          | Description                              | Severity | Owner  | Due Date
1  | DB restore    | Restore took 2h vs 45min estimate        | High     | [name] | [date]
2  | DNS failover  | Propagation took 8 min vs 2 min target   | Medium   | [name] | [date]

Overall Result: [PASS / PASS WITH ISSUES / FAIL]

Remediation Plan: [list action items with owners and due dates]

Next Test: [date and type]

Sign-off:
[IT Lead signature]       [Security Lead signature]       [Business Owner signature]

Distribute the report to IT leadership, security, and business continuity stakeholders within 5 business days of the test. Unresolved issues become inputs to the next test planning cycle.

The bottom line

DR tests fail in predictable ways: the runbook references systems that no longer exist, credentials have rotated and weren't updated, the database restore takes 3x longer than the RTO allows, and the team has never actually run the failover procedure under time pressure. All of these are better discovered during a scheduled test than during an actual incident. Run a tabletop annually, a simulation semi-annually, and a full cutover at least once before you need it.

Frequently asked questions

What is the difference between RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a disaster. RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time (e.g., 4-hour RPO means you can lose up to 4 hours of data). DR test success means both objectives were met.

How often should you test your DR plan?

Run a tabletop at least annually; semi-annually is better. Full cutover tests are typically run annually or when significant infrastructure changes occur. After any major architecture change, run at least a simulation before the next scheduled test date.

What is the difference between a DR test and a business continuity test?

DR tests validate that technical systems can be recovered and meet RTO/RPO. Business continuity tests validate that business operations can continue during an outage (e.g., manual processes, alternate suppliers, communication plans). A complete resilience program tests both.

What is a parallel test and when should I use it?

A parallel test (also called a warm standby test) brings up the DR environment while the primary is still running, validates that it works, and then optionally fails traffic over. It is lower risk than a full cutover because the primary can take traffic back immediately if DR fails. Use it for production systems where you cannot afford downtime during the test.

What should be in a DR test report?

Include the test type, date, scope (which systems), defined RTO/RPO targets, actual recovery time achieved, actual data loss observed, steps that succeeded, steps that failed or took longer than expected, issues discovered, and action items with owners and due dates.

How do I test cloud-to-cloud DR in AWS?

For cross-region failover: test Route 53 health check failover, validate RDS read replica promotion time, confirm S3 cross-region replication lag, test application startup in the DR region, and validate DNS propagation time against RTO. AWS Fault Injection Service (formerly Fault Injection Simulator) can inject faults, such as cross-Region connectivity disruptions, for testing.

What is the most common finding in a DR test?

The runbook is out of date. The most common test outcome is that documented steps no longer match the current environment: credentials that have rotated, servers that have been renamed, dependencies that have changed. This is why you test -- to find the gap between the documentation and reality.

Sources & references

  1. NIST SP 800-34 Contingency Planning Guide
  2. AWS Disaster Recovery Whitepaper
  3. ISO 22301 Business Continuity
  4. Azure Site Recovery Documentation
  5. FEMA Continuity Guidance

Eric Bang
Author

Founder & Cybersecurity Evangelist, Decryption Digest

Cybersecurity professional with expertise in threat intelligence, vulnerability research, and enterprise security. Covers zero-days, ransomware, and nation-state operations for 50,000+ security professionals weekly.
