Monday, July 1, 2024

Redundancy, Disaster Recovery and The Cloud

If you pay attention to tech news, you know that cyber attacks are commonplace these days.  Attack vectors vary widely, from malicious software finding its way onto a computer because someone opened a link in an email they shouldn't, to recent cases where social engineering was used to access a network-attached device and plant malware on it.  Regardless of the method, once an attacker has a beachhead in your network, you are likely in for outages and lost production or income.  Your defenses today include a wide array of active measures: firewalls, anti-virus, VPNs and device policies.  All of these help prevent an attack - but what do you do once you've been taken offline by an intrusion?  In this post, we'll look at how some of the biggest companies harden their networks for specific levels of redundancy and disaster recovery.


High Redundancy

In 2005 I worked for one of the largest privately owned insurance agencies in the nation.  You might say their footprint was nationwide.  To ensure uptime and recovery, they used a redundancy scheme that was truly impressive.  They have likely improved on the plan since, but 20 years ago, this was big, rock-solid stuff.

It started with an A/B power system in the data center.  With redundant power, the primary supply could go down and the backup would take over in seconds.  The help desk firm I worked for in the late 90's had a failover response of half a second on their standby generators.  The lights barely flickered.  A/B power schemes usually carry enough backup capacity to weather an outage of a certain duration, but that duration is limited by the fuel supply and how long it lasts.
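As a back-of-the-envelope illustration, you can estimate how long a generator bridges an outage from its tank size and burn rate.  Here's a quick sketch; all of the numbers are hypothetical, and real fuel curves are non-linear:

    # Rough estimate of standby generator runtime (hypothetical numbers).
    tank_gallons = 500.0          # on-site diesel fuel tank
    burn_gph_at_full_load = 18.0  # gallons per hour at 100% load
    load_fraction = 0.6           # data centers typically run below full load

    # Fuel burn scales roughly with load for this estimate.
    est_burn_gph = burn_gph_at_full_load * load_fraction
    runtime_hours = tank_gallons / est_burn_gph
    print(f"Estimated runtime: {runtime_hours:.1f} hours")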

The second layer of redundancy was a North/South data center clone in their home city.  This ensured that an outage compromising an entire site, like a local natural disaster or other unforeseeable interruption, wouldn't stop business, because the secondary location could pick up the slack.

But what if we had an earthquake?  That's where the East/West redundancy came in.  In a Southwestern state, they had another data center that formed the South component of a two-site redundancy pair.  In another state in the North Central plains, they had the North location.

This provided regional failover capability.  The glue that held it all together, at the time, was Cisco smart switches.  A demo conducted one weekend for the executives involved tracking packet loss on the network while an entire data center was shut down.  Thanks to the high-speed routing and error checking of the smart switches, zero packets were lost, and there was no measurable downtime.
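The vendor's routing logic isn't something I can reproduce here, but the core idea - detect a dead site and shift traffic to its peer - fits in a few lines.  This is a minimal sketch of the concept, not how the Cisco gear actually worked; the host names and thresholds are made up:

    import socket

    # Hypothetical site pairs: primary -> failover peer.
    SITE_PAIRS = {"dc-north.example.com": "dc-south.example.com",
                  "dc-east.example.com": "dc-west.example.com"}

    def is_up(host, port=443, timeout=2.0):
        """Crude liveness probe: can we open a TCP connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_site(primary):
        """Return the primary if healthy, otherwise its peer."""
        if is_up(primary):
            return primary
        return SITE_PAIRS[primary]

    print(pick_site("dc-north.example.com"))

Real gear does this at the routing layer in milliseconds; the point is only that failover is a health check plus a second answer.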

On top of this highly redundant network, they used clustering and cloning.  IBM mainframes provided a high-performance computational platform on which servers were cloned in real time across mirrored Linux partitions.  Every transaction was relayed to 7 other clones of the same server in real time.
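I don't know the exact mechanism the mainframe used, but the contract - a write isn't acknowledged until every clone has it - looks something like this toy sketch (the replica names and transaction shape are invented):

    # Toy model of synchronous replication: a transaction is only
    # committed once all clones have acknowledged it.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.log = []

        def apply(self, txn):
            self.log.append(txn)
            return True  # acknowledgment

    # Seven clones of the same server, as in the setup described above.
    replicas = [Replica(f"clone-{i}") for i in range(7)]

    def commit(txn):
        acks = [r.apply(txn) for r in replicas]
        if all(acks):
            return "committed"
        raise RuntimeError("replication failed; transaction rolled back")

    print(commit({"policy": 42, "op": "update"}))

The cost of that guarantee is latency: the slowest clone sets the pace for every transaction, which is part of why the mainframe's raw speed mattered.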


Disaster Recovery

The same company used what was, at the time, the best you could get in disaster recovery.  Nightly backups were taken from every tier-one server, all of the business-critical systems, and written to optical disc or magnetic tape, depending on the system.  These were shipped every day to cold storage - literally, an old salt mine converted to secure storage, deep underground.  In the event of a disaster, the nightly backups could be pulled and restored within hours.  This ensured minimal downtime in the event a network-wide failure of some sort, like a cyberattack, managed to affect all systems and all locations.
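A bare-bones version of that nightly job - archive the data, checksum it, and record a manifest for the courier - might look like this sketch; the paths and naming scheme are placeholders:

    import datetime, hashlib, json, pathlib, tarfile

    SOURCE = pathlib.Path("/srv/tier1-data")  # placeholder data path
    DEST = pathlib.Path("/mnt/staging")       # where the tape/disc writer picks up

    def nightly_backup():
        stamp = datetime.date.today().isoformat()
        archive = DEST / f"tier1-{stamp}.tar.gz"

        # Write the compressed archive.
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(str(SOURCE), arcname=SOURCE.name)

        # Checksum so the restore team can verify media integrity.
        digest = hashlib.sha256(archive.read_bytes()).hexdigest()

        # The manifest travels with the media to cold storage.
        manifest = {"archive": archive.name, "sha256": digest, "date": stamp}
        (DEST / f"tier1-{stamp}.manifest.json").write_text(json.dumps(manifest))

    if __name__ == "__main__":
        nightly_backup()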

Many companies still use this strategy today.  The business of securing and transporting backup media was, until recently, a lucrative one.  But network bandwidth improvements have made it an almost antiquated approach.  The cloud has distributed not only the networks and compute resources companies build their infrastructure on; the backup itself can now run from the cloud provider's data center rather than from your on-premises systems.
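Today the same nightly job usually ends with an upload to object storage instead of a courier run.  A minimal sketch using the AWS boto3 SDK - the bucket and key names are assumptions, and credentials are taken from the environment:

    import boto3

    # Versioning and lifecycle rules on the bucket handle retention,
    # replacing the salt-mine rotation schedule.
    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="/mnt/staging/tier1-2024-07-01.tar.gz",  # placeholder path
        Bucket="example-corp-backups",                    # hypothetical bucket
        Key="tier1/2024-07-01.tar.gz",
    )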

Restoring Service

While I was working for a major automotive manufacturer, the second of my career, they got hacked.  Worse, it was ransomware - more than 80 percent of the computers connected to the network had their user directories encrypted.  Rather than pay the hackers, the company opted to reimage all of the machines... every server, workstation and laptop.  I participated directly in this effort, and it took several days to complete.  I pulled at least two 14-hour shifts.  The entire outage lasted approximately 5 days.  The speed of recovery owed a lot to the skill of the IT department, and to the fact that the flaw exploited on a Microsoft domain controller was relatively small.  It should have been patched, but was a couple of days behind a zero-day bug announcement.  The real culprit was the patch evaluation process - certifying a patch in the lab took too long to approve, and that left the window of vulnerability open.
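One cheap guardrail is to track how long a released patch has been waiting on lab certification and escalate anything that crosses a threshold.  A sketch of the idea - the patch queue, KB numbers, and the three-day window are all invented:

    import datetime

    # Hypothetical patch queue: patch name -> vendor release date.
    pending = {
        "KB5001337": datetime.date(2024, 6, 24),
        "KB5002020": datetime.date(2024, 6, 29),
    }

    MAX_WAIT = datetime.timedelta(days=3)  # assumed certification SLA

    today = datetime.date.today()
    for patch, released in pending.items():
        waiting = today - released
        if waiting > MAX_WAIT:
            print(f"ESCALATE: {patch} waiting {waiting.days} days on certification")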

A more recent story saw a regional manufacturer taken offline for upwards of two weeks.  Without knowing all of the internal details, we can infer that they either had trouble isolating the problem or had insufficient redundancy or DR planning in place.  One thing that could have helped is any of the many network scanning tools now on the market.  Running up-to-date commercial anti-virus on all of their devices might have quarantined the malware quickly.  Reports indicate that their VPN was out of commission for a long while, leaving remote workers incapacitated.  A potential workaround would have been a Microsoft hybrid virtual network, combining the network gateway SaaS offering with a secure virtual network switch to their on-premises network.  Microsoft two-factor authentication would have helped ensure only their employees could access the VPN, and all they would have needed to connect was the VPN client built into Windows.
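Even without a commercial product, a basic port inventory is a few lines of standard library code.  A toy sketch for spotting unexpected open services - the address range and port list are examples, and you should only scan networks you own or have permission to test:

    import socket

    # Example targets on a hypothetical internal subnet.
    HOSTS = [f"10.0.0.{i}" for i in range(1, 11)]
    PORTS = [22, 80, 443, 445, 3389]

    def open_ports(host, ports, timeout=0.5):
        """Return the subset of ports that accept a TCP connection."""
        found = []
        for port in ports:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    found.append(port)
            except OSError:
                pass
        return found

    for host in HOSTS:
        ports = open_ports(host, PORTS)
        if ports:
            print(host, ports)

Comparing a run like this against a known-good baseline is a crude but fast way to spot machines exposing services they shouldn't.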

Getting Better

Many companies today still retain a heavy on-premises component due to security or cost concerns.  There is a fair amount of distrust surrounding cloud computing, which is hard to argue with when big-name providers regularly deprecate and turn off functions.  They are also not immune to failures of their own.  Auth0, a provider of OAuth services that enable SSO across applications, recently had a mid-day outage that took all of their customers offline for 45 minutes.  Centralized services like this offer cost savings and a distributed service that can serve both your internal and external apps, but a poorly architected offering is a single point of failure that can cost you and your business a lot of money.
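One way to blunt that single point of failure is to cache what the identity provider gives you, so a brief outage doesn't fail every request.  A sketch of the caching idea with invented names; a production system would lean on a proper JWT/JWKS library rather than this:

    import time

    class KeyCache:
        """Cache the identity provider's signing keys so a short IdP
        outage doesn't take every dependent app down with it."""

        def __init__(self, fetch, ttl=3600):
            self.fetch = fetch   # callable that hits the IdP, e.g. a JWKS URL
            self.ttl = ttl
            self.keys = None
            self.fetched_at = 0.0

        def get(self):
            fresh = time.time() - self.fetched_at < self.ttl
            if self.keys is not None and fresh:
                return self.keys
            try:
                self.keys = self.fetch()
                self.fetched_at = time.time()
            except Exception:
                # IdP unreachable: serve stale keys rather than failing closed.
                if self.keys is None:
                    raise
            return self.keys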

Having the right mix, with properly vetted service providers that have hardened their systems and applied architectural best practices for redundancy and resiliency, is a step you can't afford to skip when designing your network and infrastructure.  Striking the right balance between low-priority and mission-critical systems, and giving the critical ones layers of redundancy, can prevent embarrassing outages that turn customers away or delay the delivery of goods and services - which in turn costs you money and business.

The old adage of not keeping all of your eggs in one basket still applies.  And while the cloud is in fact just someone else's computer, properly applying the design patterns that it affords is the key to keeping your costs low and your uptime -- up. If all you do is move a server from your in-office rack to a virtual clone on the cloud, you're only turning a capital expense into an operational expense.  By properly designing your applications and services, you can take advantage of distributed architectures, high redundancy, and high availability, while realizing lower capital costs with tightly controlled and monitored operational expenses.  

The key to a successful migration to, or integration with, a cloud service provider begins with engaging an enterprise architect with broad exposure to service providers and current offerings.  Be wary of flashy consulting firms with name-brand recognition; they don't always deliver.  Watch out too for consultants who pad their resumes with a never-ending stream of certificates but have little actual experience.  The long-time veterans who have seen some things and survived some of these scenarios remain your best shot at fortifying your systems and redesigning your network and service layers for optimal security and performance.