While traditional IT infrastructure methods for disaster recovery have been the norm, the promise and development of vendor-agnostic services for core infrastructure and data are giving rise to cloud native models that will significantly increase operational resiliency.
Matthew Leybold
Matthew Leybold is an Associate Director with Boston Consulting Group out of New York City and leads the Cloud and IT Infrastructure topic, serving Financial Institutions and Public Sector industry verticals for BCG Platinion North America.
Historically, enterprise technology resiliency has been rooted in the enterprise data center, co-location facilities, and classic Managed Service Providers (MSPs). Best practices for Business Continuity Management (BCM) and Disaster Recovery (DR) have typically been implemented in those facilities by a combination of enterprise IT organizations, third-party vendors, and DR solutions.
However, while data center and traditional infrastructure methods for disaster recovery have been the norm, the promise and development of vendor-agnostic services for core infrastructure and data are giving rise to cloud native models that will significantly increase operational resiliency.
The rise in public cloud adoption has resulted in significantly increased usage of low-cost, low-risk solutions for Backup and Restore models. The past few years have even seen cloud native resiliency for some applications “born in the cloud.” However, until recently, leveraging true hybrid and multicloud as a mechanism for operational resiliency has not been possible, both because of historical enterprise cloud strategy approaches and because Cloud Service Providers (CSPs) lacked tooling to reduce the friction to enablement. Enterprise cloud customers are now adopting multicloud as a strategic imperative, and CSPs now provide capabilities that make this dream a reality.
Cloud native resiliency improves upon traditional recovery models with the following features:
- Providing uniform templates and building blocks: if you look at the key components of architecting for resiliency, all of them are cloud native and many of them are standard, commercial cloud services used for production operations that are resilient by design (e.g. AWS Auto Scaling for preventing failure events; AWS Direct Connect/ELB for network-based resiliency and workload distribution supporting critical workloads; EBS/S3/Glacier for long-term data retention and storage)
- Automating the full lifecycle to include failure recovery: from automation of the backup process itself, to preventative measures such as horizontal scaling, to the failover triggered action handlers, cloud native resiliency features increased automation of your implementation and the failover model it is designed for
- Using “operational tooling” in favor of “backup tooling”: new CSP-native offerings for multicloud, namely GCP Anthos, are significantly reducing the friction to multicloud deployments, which will eventually reduce the barrier to entry for cross-CSP deployments and new resiliency models
What Are the Deployment Models We Can Choose from?
Three key resiliency models have emerged when enabling resiliency methods with Cloud Service Providers (CSP):
Single CSP in Active-Passive Configuration. This is the most common model for resiliency. For legacy applications, it generally involves a single CSP that acts as the failover site for a Production application running in an on-premises data center environment.
In a cloud native model, and most often for applications that are born in the cloud, this is also the most popular resiliency model for applications whose Recovery Time Objectives (RTOs) do not necessitate an Active-Active replication model. The key models here include Warm Standby, Pilot Light and Backup and Restore.
In the Warm Standby model, all essential services for an application are running in a minimum viable manner, almost in a “Production Lite” version of your full production environment. In the event of a failure or other scenario that triggers recovery, this standby environment can be easily scaled up to handle the production load, and networking changes can effectively be switched to route all traffic to the Warm Standby environment.
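The Warm Standby cutover can be sketched in a few lines of Python. This is a minimal, in-memory illustration and not a CSP API: the `Environment` and `FailoverController` names, the three-failure threshold, and the instance counts are all assumptions made for the example. In a real deployment, the scale-up and traffic switch would be calls to your CSP's scaling and DNS or load-balancing services.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """A hypothetical application environment with a simple capacity count."""
    name: str
    instances: int


@dataclass
class FailoverController:
    """Fails over to the standby after several consecutive failed health checks."""
    primary: Environment
    standby: Environment
    production_capacity: int          # instances needed to carry full production load
    failure_threshold: int = 3        # assumed value, tune to your own tolerance
    active: str = field(default="", init=False)
    consecutive_failures: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.active = self.primary.name

    def record_health_check(self, healthy: bool) -> None:
        if self.active != self.primary.name:
            return  # already running on the standby
        # A healthy check resets the counter so transient blips do not trigger failover.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        if self.consecutive_failures >= self.failure_threshold:
            self._fail_over()

    def _fail_over(self) -> None:
        # Step 1: scale the "Production Lite" standby up to full production capacity.
        self.standby.instances = max(self.standby.instances, self.production_capacity)
        # Step 2: switch routing so all traffic goes to the standby.
        self.active = self.standby.name


controller = FailoverController(
    primary=Environment("primary", instances=10),
    standby=Environment("standby", instances=2),
    production_capacity=10,
)
for healthy in [False, False, True, False, False, False]:
    controller.record_health_check(healthy)
print(controller.active, controller.standby.instances)
```

Requiring consecutive failures, rather than a single failed check, is one simple way to avoid failing over on a transient network blip.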
In the Pilot Light model, the data from the production application is replicated and the application environment is stored as a template, which can be spun up in the event of a recovery scenario. The time to return to normal production operations is longer than Warm Standby, but this can also be a very cost-effective method for recovery if the application has a relatively higher tolerance for outage timeframes.
Finally, many enterprise cloud customers have enabled basic storage and archival models with CSPs (e.g. Amazon Web Services’ Backup and Restore, S3 Glacier) for low-cost, long-term retention of data from enterprise, on-premises workloads. This was one of the first viable use cases for the early adoption of public cloud.
Evaluation and Key Considerations: This model is generally a standard, reliable resiliency method when a recovery time of minutes to hours after a failure event is tolerable.
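One way to reason about the choice between these models is to compare an application's required RTO against the recovery time each model can typically deliver. The sketch below does that in Python. The recovery times in `MODELS` are illustrative assumptions, not vendor figures or numbers from this article, and should be replaced with values measured in your own recovery tests.

```python
from datetime import timedelta

# Ordered from least to most aggressive (and typically least to most costly).
# The recovery times are placeholder assumptions for illustration only.
MODELS = [
    ("Backup and Restore", timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4)),
    ("Warm Standby", timedelta(minutes=30)),
    ("Active-Active", timedelta(0)),
]


def choose_model(required_rto: timedelta) -> str:
    """Return the least aggressive model whose assumed recovery time fits the RTO."""
    for name, typical_recovery in MODELS:
        if typical_recovery <= required_rto:
            return name
    # Unreachable for non-negative RTOs, because Active-Active is assumed to be instant.
    return "Active-Active"


for rto in (timedelta(days=2), timedelta(hours=6), timedelta(minutes=45), timedelta(seconds=5)):
    print(rto, "->", choose_model(rto))
```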
(Note: read more on Recovery Time Objectives and Recovery Point Objectives here).
Single CSP in Active-Active Configuration. Mission-critical applications often require Active-Active failover that is either cross-Availability Zone (AZ) or, in some cases, cross-Availability Region (AR), depending on criticality, regulatory requirements for Out-of-Region Recovery (OOR), and latency tolerance.
Typically, this model features a single CSP with Active-Active configuration; in AWS, this is called the Multisite model, which runs identical Production workloads intra-region or OOR, with network traffic cutover and rules established in DNS. The Recovery Point Objective (RPO) is generally designated as the last asynchronous or synchronous database write. Some customers have attempted to design their DR by using a multiregion Active-Active design pattern across AWS US East-West regions.
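To make the RPO point concrete, the following Python sketch estimates the data loss window at the moment of failure from the last write that reached the second site. It is a simplified illustration under stated assumptions: the function names are hypothetical, and it treats synchronous replication as losing no acknowledged writes.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time: datetime,
                     last_replicated_write: datetime,
                     synchronous: bool) -> timedelta:
    """Estimate how much recent data would be lost if the primary failed now."""
    if synchronous:
        # Assumption: a write is only acknowledged once both sites have it.
        return timedelta(0)
    # Asynchronous: everything written after the last replicated write is at risk.
    return max(failure_time - last_replicated_write, timedelta(0))


def meets_rpo(window: timedelta, rpo: timedelta) -> bool:
    """True when the estimated loss window is within the Recovery Point Objective."""
    return window <= rpo


failure = datetime(2021, 1, 1, 12, 0, 0)
last_write = datetime(2021, 1, 1, 11, 59, 30)

window = data_loss_window(failure, last_write, synchronous=False)
print(window, meets_rpo(window, rpo=timedelta(minutes=1)))
```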
Evaluation and Key Considerations: If you have extremely mission-critical, time-sensitive, ultra-low latency production applications where “milliseconds have million-dollar implications,” this is often the appropriate resiliency model. Latency-sensitive applications can often opt for intra-region models, or even global deployments with multiregion data center and cloud deployments that keep data closer to the edge and user.
Multi-CSP in Active-Passive or Active-Active Configuration. This model is similar to the Single CSP in Active-Active configuration, with the exception that a cross-AZ or cross-AR scenario is replaced with a different CSP vendor region. This can either be in the same geographic region for each CSP (e.g. AWS US East in Northern Virginia and GCP us-east4 zones a/b/c), or cross-region for disparate CSP ecosystems.
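The difference between Active-Active and Active-Passive across two CSPs can be expressed as routing weights. The Python sketch below is a hedged illustration, not a DNS provider's API: the `Endpoint` type, the endpoint names, and the convention that a passive site carries weight 0 until it is the only healthy option are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A hypothetical deployment of the same application in one CSP region."""
    name: str
    provider: str
    weight: int       # relative share of traffic; 0 marks a passive site
    healthy: bool = True


def traffic_split(endpoints: list[Endpoint]) -> dict[str, float]:
    """Return the fraction of traffic each healthy endpoint should receive."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")
    total = sum(e.weight for e in healthy)
    if total == 0:
        # Only passive sites are left: promote them and share traffic equally.
        return {e.name: 1 / len(healthy) for e in healthy}
    return {e.name: e.weight / total for e in healthy}


# Active-Active: both CSPs serve production traffic.
active_active = [Endpoint("site-a", "aws", 50), Endpoint("site-b", "gcp", 50)]
print(traffic_split(active_active))

# Active-Passive: the second CSP only takes traffic when the first is unhealthy.
active_passive = [Endpoint("site-a", "aws", 100, healthy=False), Endpoint("site-b", "gcp", 0)]
print(traffic_split(active_passive))
```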
Evaluation and Key Considerations: Numerous organizations attempted this model 3-5 years ago, before robust, mature CSP tooling and service offerings were available. Each attempted it with its own custom solution, as these initiatives predated the emergence of multicloud tooling such as GCP Anthos. In modern application builds, going multi-CSP is often reserved for use cases where a multicloud deployment is either warranted or required by necessity (e.g. availability limitations of certain CSPs cross-region, reducing the risk of single CSP vendor reliance). This is also an excellent option for multinationals and global organizations that need global deployments for technical reasons, as well as for serving customers in different countries that have varying laws and governance around the use of data.
Why Wasn’t This Possible Before, and What Blockers Have Been Removed?
Years prior, many organizations had tried, and failed, to achieve the multicloud architecture and operating model. The new wave of CSP-native tooling and services was not yet available, and several other factors inhibited the adoption of multicloud, including:
Lack of available tooling. Today, the friction to adoption of multicloud is significantly lower with container-based offerings, orchestration tools, and each CSP now providing services and tooling that promote a more open and modular ecosystem. Many organizations attempted, failed at multicloud, and rolled back to hybrid cloud models and even “multiple cloud” models where disparate environments were managed separately. In fact, a number of major Financial Institutions tried to implement homegrown, cross-CSP models for reducing the risk of a single-vendor environment, and eventually fell back into hybrid cloud models without the support of robust, mature CSP offerings to accomplish the same.
Multicloud requiring unrealistic talent and skill requirements at enterprise scale. At the time, the CSP landscape was so nascent that it was hard enough to go all-in with a single CSP, much less multiple vendors, without the enterprise talent model and skills in place. Now, most organizations operate in multiple CSP infrastructure and SaaS environments and have more of the skills required to build and operate a cloud native ecosystem, although this remains a constant challenge.
“The juice wasn’t worth the squeeze.” Given the lack of tooling, the bespoke solutions attempted by many organizations showed that the uplift in resiliency, disaster recovery planning, and workload mobility was offset by the cost and maintenance overhead of the full stack and its cross-environment interdependencies.
How Is Service and Tooling Maturity Enabling Cloud Native Resiliency?
Each of the Big 3 CSPs has developed and released hybrid and multicloud services that open up a world of new possibilities with multicloud operating models as well as new approaches to operational resiliency:
Google Cloud Platform (GCP). GCP Anthos was first launched in 2019, initially as a hybrid cloud solution, and is generally considered first to market in expanding to support multicloud models. Per Anthos technical documentation, the stack is intended to be environment agnostic but is currently geared primarily to run on AWS and on on-premises Anthos clusters on VMware infrastructure. The strategic shift toward multicloud significantly widens the aperture for cross-vendor and cross-CSP resiliency models and is creating a lot of fast followers among other CSPs seeking to achieve the same capabilities.
Microsoft Azure. Microsoft was early to the hybrid cloud trend with the release of Azure Stack, enabling hybrid cloud in a Microsoft ecosystem-consistent model. Since then, the offering has expanded to Azure hybrid and multicloud solutions, which include Azure Arc for enabling a single control plane cross-environment, Azure IoT for extending workloads to the edge, and numerous supporting services that underpin the multicloud ecosystem (e.g. Security, Data, Identity, Network).
Amazon Web Services (AWS). For a number of years, AWS was primarily public cloud native, and it eventually evolved into hybrid cloud with the AWS Outposts service offering. More recently, AWS has announced that its native container services will be expanded to support multiple environments and even other CSP vendors. It is not yet apparent whether Amazon ECS Anywhere and EKS Anywhere, in their 2021 release, will enable multicloud to as high a degree as the Azure and GCP product sets. However, it is a big step in the same direction and an embrace of the demand signal from many customers for multicloud support and eventual cross-CSP container portability, which increases resiliency for those with multicloud operating models.
What Are the Actions We Should Take to Enable and Improve Cloud Native Resiliency?
Action #1: Choose the right cloud native “anchor” for your enterprise technology ecosystem and build a resiliency model around it.
For many organizations, this anchor remains the enterprise data center for mission-critical business and enterprise workloads and services (e.g. identity, encryption and key management), which are then extended or federated to public cloud environments. For startups and digital and cloud natives, public CSP ecosystems are often where greenfield environments are born, and they also support both mission- and business-critical workloads as well as enterprise IT services.
Disaster recovery is not necessarily easier when you choose to anchor in the public cloud. It is more complicated to fail back post-recovery. With easier availability of cloud infrastructure resources, many companies have DR infrastructure set up and ready to go (e.g. Warm Standby, Pilot Light models). However, many organizations do not practice cloud native BC/DR, or alternative models in multi-CSP or data center environments. BC/DR should be a top priority and baked into cloud product design, engineering and operations. Automating your recovery processes is one of the best investments you can make to protect your business from unscheduled events. Automation reduces DR testing hesitation, reduces risk, and increases the frequency of tests. Chaos engineering comes up often as a technique to test resiliency; we have yet to see this applied in practice outside of big techs.
It is also critical to understand the key risk areas with each model and how to mitigate them, such as latency, data residency and security policies. Physical proximity of data center and cloud regions, the data residency and encryption implications of hybrid and multicloud tooling, and security orchestration of the same are all often high barriers to compliance and entry for many organizations.
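As a rough illustration of what automating recovery tests can look like, the Python sketch below runs an ordered list of recovery steps, times each one, and reports whether the whole drill finished within the RTO. The `run_dr_drill` function and the step names are hypothetical; the placeholder steps stand in for whatever your own runbook does (restoring data, scaling the standby, switching traffic).

```python
import time
from typing import Callable


def run_dr_drill(steps: list[tuple[str, Callable[[], None]]],
                 rto_seconds: float) -> dict:
    """Run recovery steps in order and compare total elapsed time with the RTO."""
    timings: dict[str, float] = {}
    drill_start = time.monotonic()
    for name, step in steps:
        step_start = time.monotonic()
        step()  # an exception here aborts the drill, which is itself a useful finding
        timings[name] = time.monotonic() - step_start
    total = time.monotonic() - drill_start
    return {
        "timings": timings,
        "total_seconds": total,
        "within_rto": total <= rto_seconds,
    }


# Placeholder steps; real ones would call your backup, scaling and DNS tooling.
executed: list[str] = []
report = run_dr_drill(
    [
        ("restore data", lambda: executed.append("restore data")),
        ("scale standby", lambda: executed.append("scale standby")),
        ("switch traffic", lambda: executed.append("switch traffic")),
    ],
    rto_seconds=60,
)
print(report["within_rto"], list(report["timings"]))
```

Because the drill is just a function call, it can be scheduled to run regularly, which is what makes frequent testing practical.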
Read more here to better understand maintaining operational resiliency in a hybrid and multicloud world.
Action #2: Understand what is commoditized for cross-CSP vendor services against what is differentiated and how it impacts resiliency planning.
When designing mission-critical systems, one must not only look at features but also critically review service selection and architecture against its implications on resiliency. Two critical considerations should be factored in:
- CSP-native service selection: Which services should be selected for intra-CSP and what is their default failover profile (e.g. zone/region)? If the vendor requires additional design to achieve it, then the appropriate resiliency must be built in. It is key to note that, in the spirit of the well-known cloud best practices of the Shared Responsibility Model and Well-Architected Framework, the onus is on the CSP vendor to adhere to SLA commitments, including service uptime and durability. The customer, in turn, is responsible for the architecture, usage and configuration of those services, and must understand the failover and resiliency model of each service and how it supports the broader application architecture necessary for that system’s failover requirements (e.g. Active-Active, Active-Passive, Pilot Light).
- Cross-CSP or environment selection: Which services (or entire app/data sets) should be candidates for cross-CSP or environment failover? If CSP-native service selection will not meet all of the needs of your application architecture, consider your options with cross-CSP vendor environments and tooling, as well as the enterprise data center in a hybrid arrangement.
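A simple way to apply these two considerations is to list each service's default failover profile and flag the ones that fall short of what the application requires. The Python sketch below does this. The `FailoverScope` levels, the service names, and the default profiles assigned to them are hypothetical placeholders; the real defaults must come from each CSP's service documentation.

```python
from enum import IntEnum


class FailoverScope(IntEnum):
    """Widest failure a service is assumed to survive without extra design."""
    NONE = 0     # single zone, no built-in failover
    ZONE = 1     # survives loss of an availability zone
    REGION = 2   # survives loss of a region
    CSP = 3      # survives loss of an entire CSP


def resiliency_gaps(default_scopes: dict[str, FailoverScope],
                    required: FailoverScope) -> list[str]:
    """Return services whose default failover profile is below the requirement."""
    return sorted(name for name, scope in default_scopes.items() if scope < required)


# Hypothetical services and defaults, for illustration only.
services = {
    "object-storage": FailoverScope.ZONE,
    "managed-database": FailoverScope.NONE,
    "global-dns": FailoverScope.REGION,
}

# Services listed here need additional design (or a cross-CSP option) to meet
# a requirement of surviving a regional outage.
print(resiliency_gaps(services, required=FailoverScope.REGION))
```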
Action #3: Make balanced cloud architecture choices that maximize control while leveraging the benefits of cloud resiliency.
Both cloud native and cloud-agnostic tooling and services will be imperative, and with recent CSP developments, commercial and public sector organizations now have the tools to succeed.
There exists a continuum of adoption that spans everything from an extreme desire for CSP agnosticism, control, and workload portability, to a full embrace of one or more CSP vendors and everything their environments have to offer. The recommendation is to carefully assess the posture of each adoption model against the broader technology strategy and make balanced cloud architecture and CSP service choices that underpin the vision for your environment and solution strategy.
Many organizations have arrived at hybrid or multicloud ecosystems “by accident,” rather than by deliberate strategy, and are going through the process of understanding their current data center and cloud footprint to assess the best model for operational resiliency, among other business demands. In the figure below, a wide spectrum of choice spans fully cloud native, CSP-embedded ecosystems as well as cloud-agnostic solutions. Each model can fully enable hybrid and multicloud architectures, but they vary in vendor engagement as well as in the tool and service selection used to build a service fabric that is disaster tolerant.
Observations in the market show a wide range of adoption patterns across each architectural model, with respective technical and operational tradeoffs. Cloud-agnostic models are generally designed with bespoke DR solutions across CSPs and data centers. While they use public cloud services heavily in the IaaS stack, they often require open standards and vendor-agnostic platform tools that can reside in any cloud environment.
Restoration of data systems is complicated. When thinking of cloud services and their resiliency, keep your data architecture in focus. Data consistency, data traversal cost, and latency between your primary and backup sites are key considerations when designing highly available mission-critical applications in the cloud.
In the CSP-native models, organizations generally go “all in” with the cloud and instantiate the appropriate cross-region, and even cross-CSP, controls to ensure operational resiliency.
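Two of these data considerations lend themselves to quick back-of-the-envelope arithmetic: what it costs to move changed data to the backup site, and how long that transfer takes on a given link. The Python sketch below shows both. The price per GB and the link speed are placeholder assumptions, not published CSP rates, and the transfer estimate assumes decimal gigabytes and a fully utilized link.

```python
def monthly_replication_cost(daily_change_gb: float,
                             price_per_gb: float,
                             days: int = 30) -> float:
    """Estimated monthly data traversal cost for replicating changed data."""
    return daily_change_gb * days * price_per_gb


def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to move data_gb over a link, ignoring protocol overhead."""
    megabits = data_gb * 8000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / link_mbps / 3600


# Placeholder inputs: 500 GB of daily change, an assumed 0.02 per GB transfer
# price, and an assumed 1 Gbps link between the primary and backup sites.
print(round(monthly_replication_cost(500, 0.02), 2))
print(round(transfer_hours(500, 1000), 2))
```

If the daily transfer time approaches 24 hours, asynchronous replication cannot keep up, and the achievable RPO grows over time.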