ISA provides technical resources and standards to help industrial automation professionals advance their careers and the field. We enable automation professionals worldwide to solve problems and enhance their skills by bringing people together to create new technologies and share best practices with future automation professionals.


Cloud Computing and Disaster Recovery: The Hidden Risks of Hyperscalers

By: Catalin Voicu

17 November, 2025

5 min read


Recent outages from cloud service providers reveal systemic risks related to cloud computing and shatter some long-held myths about what “resilience” in the cloud really means.

In the middle of the night on October 20, AWS’s most critical region, us-east-1, went dark, taking much of the internet down with it. What began as a DNS race condition in DynamoDB spiraled into failed EC2 instances, inaccessible S3 buckets, and widespread service outages. Apps like Snapchat and Asana, and even some banking and government systems, were at a standstill for nearly 15 hours. Businesses large and small lost productivity, revenue, and customer trust.

AWS eventually “fixed” the issue, but the damage was already done. For IT, automation and business continuity teams, this wasn’t just a blip. It was a wake-up call about systemic risk.

Nine days later, Azure, the world’s second-largest cloud provider, suffered a significant disruption of its own. This time the culprit was human error, in the form of a configuration change, which knocked out services including Microsoft 365, Xbox, and major airline check-in portals.

If AWS and Azure, the most mature global hyperscalers, can have outages that last for hours at a time, what does that mean for the rest of us who rely on them?

The reality: Outages will happen again

Take a look at AWS’s own Service Level Agreement. They openly acknowledge they can’t guarantee constant uptime. Their stated 99.5% availability leaves plenty of wiggle room for what they know will eventually happen.

Even “five-nines” (99.999%) availability only applies if you’re deploying across multiple Availability Zones (AZs).

But as the us-east-1 outage proved, when an entire region goes down, every AZ inside it goes with it.
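To put those percentages in perspective, here is a minimal sketch that converts an availability target into the downtime it permits. The 30-day month and the function name are assumptions made for illustration, not terms taken from any provider's SLA.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Return the downtime, in minutes, that an availability target permits over a period."""
    return (1 - availability_pct / 100) * period_hours * 60


HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year

for pct in (99.5, 99.99, 99.999):
    per_month = allowed_downtime_minutes(pct, HOURS_PER_MONTH)
    per_year = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
    print(f"{pct}% -> {per_month:.1f} min/month, {per_year:.1f} min/year")
```

Under these assumptions, 99.5% works out to about 3.6 hours of permitted downtime in a 30-day month, while five-nines permits a little over five minutes in a year. A single 15-hour outage exceeds both.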

And the AWS post-mortem? Instead of offering reassurance, it quietly confirmed our worst fears. AWS admitted that the EC2 failure that cascaded across services had “no established operational recovery procedure.” The same architectural interdependencies and regional limitations that caused this outage still exist today.

Unless AWS completely re-engineers its core infrastructure, the best we can expect are more temporary fixes, essentially bandages on a much deeper problem.

The systemic risks hidden in the cloud

None of this is hypothetical. The us-east-1 region has now suffered five major disruptions in just the past few years. Each one looked a little different, but every single time, it exposed the same truth: when a core AWS service fails, others fall with it. That’s the very definition of systemic risk.
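One way to reason about this kind of cascade is to walk a dependency map and list everything downstream of a failed service. The sketch below is illustrative only: the service names and dependency edges are hypothetical and do not describe AWS's actual internal architecture.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "compute": ["database"],
    "load_balancer": ["compute"],
    "object_storage": ["dns"],
    "app_frontend": ["load_balancer", "object_storage"],
    "on_prem_historian": [],
}


def impacted_services(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can look up "who depends on this service?"
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk from the failed service through its dependents.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_services(DEPENDS_ON, "dns")))
```

Running the same walk against your own architecture diagram shows which workloads share a single upstream dependency and which, like the hypothetical on-premises historian here, sit outside the blast radius.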

The fallout isn’t just downtime. You lose time, data, and customer trust while waiting on AWS to restore services. Even after the root cause was fixed, some customers waited hours for dependent systems to recover. The deeper your integration with AWS APIs, the longer your tail of recovery.

Still, this outage wasn’t without lessons. It shattered some long-held myths about what “resilience” in the cloud really means.

Busted! The myths that didn’t survive Oct. 20

If one good thing came out of all this, it’s that several long-standing disaster recovery myths finally met their end.

Myth #1: Multi-AZ is 'good enough'

AWS snapshots and S3 backups are designed for high durability because they are replicated across multiple Availability Zones (AZs); S3 itself is designed for 99.999999999% (eleven nines) durability. That’s great, but many organizations assumed that simply spreading workloads across AZs within a single region was enough for resilience.

The us-east-1 outage proved that’s not true. When an entire region fails, every AZ inside it goes dark, leaving even the best-architected deployments unreachable.

Using snapshots and S3 backups across multiple AZs is still best practice, but it’s only part of the story. True protection requires cross-region and cross-account replication to safeguard against regional disasters.
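A quick audit can show which workloads still lack that protection. The following is a minimal sketch that assumes you can export a backup inventory as (resource, region, account) records; the resource names, account labels, and the `coverage_gaps` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    resource: str
    region: str
    account: str


def coverage_gaps(primary, copies):
    """Map each resource to the protections it lacks.

    `primary` maps resource -> (home region, home account).
    A resource needs at least one copy outside its home region
    and at least one copy outside its home account.
    """
    gaps = {}
    for resource, (home_region, home_account) in primary.items():
        mine = [c for c in copies if c.resource == resource]
        missing = []
        if not any(c.region != home_region for c in mine):
            missing.append("cross-region")
        if not any(c.account != home_account for c in mine):
            missing.append("cross-account")
        if missing:
            gaps[resource] = missing
    return gaps


# Hypothetical inventory for illustration.
PRIMARY = {
    "orders-db": ("us-east-1", "prod"),
    "plant-historian": ("us-east-1", "prod"),
}
COPIES = [
    BackupCopy("orders-db", "us-east-1", "prod"),
    BackupCopy("orders-db", "us-west-2", "dr-vault"),
    BackupCopy("plant-historian", "us-east-1", "prod"),
]

print(coverage_gaps(PRIMARY, COPIES))
```

In this made-up inventory, only `plant-historian` is flagged: its single copy lives in the same region and account as the workload, so a regional outage or a compromised account would take the backup down with it.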

Myth #2: Time for a DR overhaul—Multicloud is the answer

After every major outage, the knee-jerk reaction is the same: “Let’s rebuild everything. Time to go Multicloud.”

And sure, it’s tempting. About 80% of restores are simple, recovering a few deleted files or a key instance. Most teams handle those manually or with third-party tools. But October 20th exposed how fragile manual recovery becomes when everything fails at once.

The truth is, Multicloud backup and cross-cloud DR aren’t a quick fix. They introduce massive complexity: multiple consoles, steep learning curves, new compliance audits, and unpredictable data egress costs.

A smarter, more practical approach is to stay within your current cloud but strengthen your strategy with cross-region replication (for region-wide failures) and cross-account DR (for malware or ransomware protection). It’s simpler, more controlled, and far more cost-effective than most teams realize, while still keeping your data and workloads available, even during a full-scale regional outage.

Myth #3: Cross-region and cross-account DR is too expensive

The biggest reason IT teams shy away from adding another layer of replication? Cost.

But cross-region DR doesn’t mean doubling your cloud bill. IT teams need to communicate this clearly to CFOs and executives focused on the bottom line. With incremental backups and tiered storage (e.g., Amazon S3 Glacier, S3 Intelligent-Tiering, or Azure Blob Storage access tiers), the cost of cross-region DR usually runs just 5–15% of your total cloud spend.

That’s a rounding error compared with the price of even a single hour of downtime for a global SaaS app, banking system, or critical business service.
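The comparison is easy to run with your own numbers. This sketch uses the 5–15% range quoted above; the monthly spend and the hourly downtime cost are hypothetical placeholders, and both function names are made up for the example.

```python
def dr_cost_range(monthly_cloud_spend, low_share=0.05, high_share=0.15):
    """Return the (low, high) monthly DR cost for the quoted 5-15% range."""
    return monthly_cloud_spend * low_share, monthly_cloud_spend * high_share


def break_even_hours(monthly_dr_cost, downtime_cost_per_hour):
    """Hours of avoided downtime per year that pay for a full year of DR."""
    return (monthly_dr_cost * 12) / downtime_cost_per_hour


# Hypothetical figures for illustration only.
MONTHLY_CLOUD_SPEND = 100_000
DOWNTIME_COST_PER_HOUR = 250_000

low, high = dr_cost_range(MONTHLY_CLOUD_SPEND)
print(f"Monthly DR cost: {low:,.0f} to {high:,.0f}")
print(f"Break-even at the high end: "
      f"{break_even_hours(high, DOWNTIME_COST_PER_HOUR):.2f} hours of downtime per year")
```

With these assumed figures, even the high end of the range pays for itself after roughly 43 minutes of avoided downtime per year. Substituting your own spend and downtime cost gives a number a CFO can evaluate directly.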

Myth #4: Your SaaS DR tool will save you

When the AWS outage hit, many teams instinctively logged into their SaaS disaster recovery tools to start restoring systems. But the problem? Their SaaS tool was down too. When your DR solution relies on the same external infrastructure, a regional outage can render it useless.

This is why Infrastructure-as-a-Service (IaaS) DR tools are essential. When your backup server sits in your own environment and has direct access to all your backups, there’s no dependency on external services. You can restore anything, anytime, without waiting for someone else’s system to come back online. Your recovery tool is available and ready when you are.

Myth #5: Building a DR plan takes months

You don’t need to reinvent your entire infrastructure. Some companies panicked during the outage, but many didn’t. The difference? Those who stayed calm had regional recovery playbooks ready. Their critical data was automatically replicated across regions and accounts, and when the outage hit, they simply executed the plan.

Here are four practical steps you can implement immediately:

Use IaaS-based DR tools, not SaaS. That way, your recovery infrastructure isn’t dependent on the same failing region.
Recover your full network stack. Your servers are only as good as the virtual cables that connect them. Subnets, routing tables, security groups, and load balancers (essentially all network settings) should be cloned and ready so failover doesn’t stall on configuration errors.
Automate testing. Use DR drills and testing tools that prioritize resources, schedule drills, and generate reports for full visibility of success.
Test until it’s boring. The companies that recovered in 15 minutes weren’t necessarily the most sophisticated—they were just the most prepared. They recovered fast because they had the muscle memory in place.
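The drill-and-report loop described in these steps can be sketched in a few lines. Everything here is an assumption made for illustration: the target names, the recovery time objectives, and the `restore` callable, which in practice would invoke your DR tooling and return the minutes the restore took. The example substitutes canned timings so it runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillTarget:
    name: str
    priority: int        # 1 = restore first
    rto_minutes: float   # recovery time objective for this target


def run_drill(targets, restore: Callable[[DrillTarget], float]):
    """Restore targets in priority order and report each result against its RTO."""
    report = []
    for target in sorted(targets, key=lambda t: t.priority):
        minutes = restore(target)
        report.append({
            "name": target.name,
            "minutes": minutes,
            "rto_minutes": target.rto_minutes,
            "passed": minutes <= target.rto_minutes,
        })
    return report


# Hypothetical targets; the network stack is restored first, as recommended above.
TARGETS = [
    DrillTarget("reporting", priority=3, rto_minutes=60),
    DrillTarget("network-stack", priority=1, rto_minutes=10),
    DrillTarget("orders-db", priority=2, rto_minutes=15),
]

# Stand-in for a real restore: canned timings in minutes.
SIMULATED_MINUTES = {"network-stack": 4, "orders-db": 22, "reporting": 30}

for row in run_drill(TARGETS, lambda t: SIMULATED_MINUTES[t.name]):
    status = "PASS" if row["passed"] else "FAIL"
    print(f"{row['name']}: {row['minutes']} min (RTO {row['rto_minutes']} min) {status}")
```

In this simulated run the database restore misses its objective, which is exactly the kind of finding a scheduled drill should surface before a real outage does.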

The practical takeaway after the outage

Think of your cloud backup and disaster recovery like your neighborhood road being repaved. If construction temporarily blocks your street, do you need to build a whole new detour road? Or do you just need a clear path for your vehicle and assurance that traffic can flow again quickly?

The AWS outage wasn’t a fluke. It was a reminder that hyperscaler convenience comes with systemic fragility. Ultimately, the burden of resilience falls on you, not the provider. The question isn’t whether AWS, Azure, or GCP will fail; it’s whether your business can afford to wait 15 hours when it does.

Catalin Voicu

Catalin Voicu is a Cloud Solutions Engineer at N2W (formerly N2WS), a Multicloud disaster recovery provider serving over 1,000 organizations worldwide. He helps organizations in industrial, financial and public sectors design affordable, automated DR strategies that withstand real-world outages.


International Society of Automation
PO Box 12277 
Research Triangle Park, NC 27709
