Cloud Migration Challenges
Cloud Sprawl
Cloud sprawl occurs when every application team buys its own cloud environment, configures and manages it itself, and connects it piecemeal to the enterprise internal WAN. In the absence of a central policy and architecture, each application team that wants to migrate to the cloud will obtain its own CSP service, build a haphazardly managed environment according to the convenience of its developers, and pursue its own connection back to the enterprise network.
The result is an environment with no standardization of OS images, no common architecture, and no centralized management, one in which enterprise security and manageability are generally afterthoughts. The CIO lacks the visibility necessary to enforce policy, ensure security, or manage cost.
The answer to cloud sprawl is to stay ahead of the rush into the cloud by creating a shared enterprise environment, architected by SMEs in network engineering, security, DevOps, and other disciplines, rather than assembled piecemeal by application teams focused on their own narrow requirements. Once this environment is in place, applications can be moved in on an orderly basis and allocated the resources they need, while teams that moved to the cloud ahead of the shared environment can be mandated to migrate into it and shut down their own cloud environments on a firm schedule.
Legacy Lift-and-Shift
One of the challenges of migration to the cloud is that there may not be time or resources to modernize existing datacenter applications into a cloud-native form. Therefore, they will still be built on an architecture which expects higher reliability in network and server infrastructure than the cloud provides.
A cloud-native application will generally be built with the capability to distribute its workload across multiple instances, giving the ability both to scale its resource usage to the current workload, and to respond to instance or network failures by dynamically spinning up new instances to replace lost capacity.
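To make this elasticity concrete, here is a minimal sketch of the AWS side of that behavior, assuming boto3 and a pre-existing launch template; the group name, template ID, subnet IDs, region, and target value are hypothetical placeholders, not details from the project.

```python
# Hedged sketch: a cloud-native tier expressed as an AWS Auto Scaling group.
# Failed instances are replaced automatically, and capacity tracks load.
# All identifiers below are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-gov-west-1")

# Spread instances across two subnets (and thus two availability zones);
# instances that fail health checks are terminated and replaced.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="app-web-tier",
    LaunchTemplate={"LaunchTemplateId": "lt-0abc1234example", "Version": "$Latest"},
    MinSize=2,
    MaxSize=8,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0aaaexample,subnet-0bbbexample",
    HealthCheckType="EC2",
)

# Scale the fleet to the current workload by tracking average CPU utilization.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-web-tier",
    PolicyName="target-cpu-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```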
A legacy application lacks the flexibility to adapt to the vagaries of a cloud environment, where individual instances or network infrastructure may fail or be migrated unexpectedly, and the application is expected to adapt. Because of this, building a cloud environment for migrated legacy applications requires implementing a network which replicates the reliability of a data center environment. However, because the major cloud providers do not support multicast or broadcast traffic, which is the foundation of traditional network redundancy mechanisms such as VRRP or HSRP, the same goals must be accomplished in new ways.
Avoiding CSP Lock-In
While many of the fundamental principles of cloud computing are similar across Cloud Service Providers, each CSP naturally implements them differently and offers its own proprietary tool set to its customers. It is therefore all too easy to implement a solution so tightly bound to one provider's proprietary offerings that migrating to another cloud provider becomes impossible without rebuilding from scratch.
The answer, then, is to design an architecture which can be adapted to the individual nuances of each CSP’s environment, and take advantage of the capabilities they offer, without being wedded to a proprietary offering.
Secure Connectivity to the Enterprise Network
Since network traffic must traverse third-party networks (whether a telco carrier's or the CSP's internal network) before reaching the secure enclave within the cloud, this traffic must be protected by encryption while in transit. However, most CSPs have historically offered a choice for WAN connectivity between encrypted tunnels over the public Internet and unencrypted private connections via technologies such as AWS Direct Connect or Azure ExpressRoute.
AWS and Azure have very recently begun to offer native options for building IPSEC tunnels across Direct Connect or ExpressRoute, but at the time the project started, the only choice was to build upon the limited native capabilities and add third-party devices to enable secure connectivity across private channels.
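As an illustration of the newer native option on the AWS side, the hedged sketch below (assuming boto3, and hypothetical gateway and attachment IDs) requests a Site-to-Site VPN whose IPSEC tunnels use private addressing and ride a Direct Connect path via a transit gateway attachment rather than the public Internet; this is a sketch of the current API shape, not the approach used when the project began.

```python
# Hedged sketch of AWS' native "private IP" Site-to-Site VPN over a Direct
# Connect path. The customer gateway, transit gateway, and the transit
# gateway's Direct Connect attachment IDs are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

response = ec2.create_vpn_connection(
    CustomerGatewayId="cgw-0abc1234example",   # enterprise edge router
    Type="ipsec.1",
    TransitGatewayId="tgw-0abc1234example",
    Options={
        "StaticRoutesOnly": False,              # exchange routes via BGP
        "OutsideIpAddressType": "PrivateIpv4",  # tunnel endpoints use private IPs
        # Binding to the Direct Connect attachment keeps the encrypted tunnels
        # on the private circuit instead of the public Internet.
        "TransportTransitGatewayAttachmentId": "tgw-attach-0abc1234example",
    },
)
print(response["VpnConnection"]["VpnConnectionId"])
```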
The Architecture
LTG’s approach to these requirements was to build a network architecture which permitted centralized control and sharing of resources, yet was scalable and extensible enough to allow expansion of the cloud environment as demand for it increased. The underlying principle was to aggregate WAN connectivity to and from the cloud into shared connections, through the use of a transit environment (a separate AWS VPC or Azure VNet) which securely terminates connections with the enterprise WAN and connects to all the individual server enclaves within the shared cloud infrastructure.
This aggregation avoids the problem of every individual application enclave having a separate connection back to the enterprise. All traffic external to the cloud environment passes through the transit, where it can be inspected and monitored. At the same time, the transit can be built robustly enough to provide redundancy across multiple network devices and paths, and with sufficient bandwidth to avoid becoming a bottleneck.
Transit Environment
The transit environment contains multiple virtual routers (such as Cisco CSR 1000v or Juniper vSRX) which terminate the IPSEC VPN tunnels that carry traffic across the WAN from the enterprise. This provides the flexibility to encrypt traffic across any path, whether the public Internet or the private connections (Direct Connect/ExpressRoute) which lack native support for traffic encryption. The use of VPN tunnels also allows the enterprise to fully control its own private routing domain, instead of having to advertise routes for internal IP ranges into carrier networks.
The components of the transit (virtual routers, subnets, etc.) can be built to avoid single points of failure using the availability tools provided by the CSP, such as availability zones in AWS or availability sets in Azure, which ensure that redundant virtual devices do not run on the same physical hardware. At the network layer, redundancy is provided by routing protocols such as BGP, which allow traffic to be rerouted in the event of a device failure, and by BFD to speed detection of link failures.
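For example, a minimal sketch of launching a redundant pair of virtual routers into subnets in two different AWS availability zones might look like the following; the AMI, subnet IDs, instance type, and region are hypothetical, and in practice a CSR 1000v or vSRX image from the AWS Marketplace would be used.

```python
# Hedged sketch: launch two virtual routers into subnets that reside in
# different availability zones, so the redundant pair never shares physical
# hardware. All identifiers are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

router_subnets = {
    "transit-router-a": "subnet-0zoneaexample",   # subnet in us-gov-west-1a
    "transit-router-b": "subnet-0zonebexample",   # subnet in us-gov-west-1b
}

for name, subnet_id in router_subnets.items():
    # Because each subnet lives in a distinct availability zone, each router
    # lands on separate physical infrastructure. Since routers forward traffic
    # on behalf of other hosts, the EC2 source/destination check would be
    # disabled on their interfaces after launch.
    ec2.run_instances(
        ImageId="ami-0router00example",
        InstanceType="c5.xlarge",
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
        TagSpecifications=[
            {"ResourceType": "instance",
             "Tags": [{"Key": "Name", "Value": name}]},
        ],
    )
```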
Multiple forms of WAN connectivity can be routed through the transit environment, including Internet VPN connectivity, private carrier connections to the CSP (e.g., Verizon SCI or AT&T NetBond), or connections from interchanges such as Equinix Cloud Exchange (ECX). Depending on the routing policies of the carrier, it may be possible to aggregate multiple connections into a single transit, or multiple transit environments may be required. Tunnels are built to routers at the enterprise edge, either internally or within a TIC environment, depending on whether or not the applications in the cloud need to face the public Internet.
The tunnel design can be adapted to the specifics of the existing enterprise environment. For instance, an existing Cisco-proprietary network domain using DMVPN and EIGRP can be extended into the cloud, or a standards-based multivendor environment can use static IPSEC tunnels and BGP to provide connectivity.
Server Environments
Every individual server hosting environment within the shared cloud (an AWS VPC or Azure VNet) is connected back to the transit environment, or environments where more than one is required. These connections generally run across VPC or VNet peerings configured between the server environment and the transit, but can be adapted to the particular CSP. In some environments, such as AWS, which limits the traffic that can be routed across peering connections, it is necessary to deploy additional virtual routers into the server VPCs and build tunnels between the transit routers and the server VPC routers. In others, such as Azure, traffic can be routed from server VNets directly to the transit routers across VNet peerings.
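As a minimal illustration of the AWS case, the sketch below (assuming boto3, with hypothetical VPC, route table, and CIDR values) peers a server VPC with the transit VPC and routes the transit VPC's address range across the peering, so that routers inside the server VPC can reach the transit routers' tunnel endpoints.

```python
# Hedged sketch: peer a server VPC with the transit VPC and route the transit
# VPC's CIDR across the peering. All identifiers and CIDRs are hypothetical;
# acceptance is immediate because both VPCs are assumed to be in one account.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0server00example",
    PeerVpcId="vpc-0transit0example",
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# AWS only delivers peering traffic addressed within the peer VPC itself, so
# this route covers the transit VPC's CIDR; enterprise-bound traffic then
# rides tunnels built between the server VPC routers and the transit routers.
ec2.create_route(
    RouteTableId="rtb-0server00example",
    DestinationCidrBlock="10.255.0.0/16",   # hypothetical transit VPC CIDR
    VpcPeeringConnectionId=peering_id,
)
```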
The real challenges arise when virtual routers are required in the server hosting environments. Naturally, gateway redundancy is required, so that if one virtual router fails the other can take over the traffic. However, the traditional methods for achieving gateway redundancy, such as HSRP, VRRP, or GLBP, rely on multicast in order to function, and most CSP networks do not support multicast or broadcast. Therefore, an entirely different approach is required to perform the same function.
In order to route traffic from hosts to a virtual router, static routes must be added to the VPC/VNet route table via the CSP’s web console or API calls. However, each route can point to only a single gateway IP or interface, so failing over to another device means modifying the route table. LTG’s approach was to work with Cisco and Juniper on scripting automatic failover behaviors, such that when redundant virtual routers are deployed in a server environment, the standby router can detect that its active partner has failed, connect to the CSP’s API gateway, and modify the route table to send traffic to itself instead. This is an entirely different model of failover from the familiar tools used in physical networks, but it is necessary in order to provide the datacenter-level reliability expected by legacy applications, which cannot simply spawn new instances in a different subnet if they lose connectivity.
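The shape of this failover model can be sketched in a few lines. The example below is a minimal illustration only, assuming AWS, boto3, and hypothetical route table, interface, and address identifiers; the production scripts developed with the router vendors are considerably more involved.

```python
# Hedged sketch of API-driven gateway failover: when the active router stops
# responding, the standby rewrites the VPC route table so the default route
# points at its own network interface. All identifiers are hypothetical.
import socket
import time

import boto3

ROUTE_TABLE_ID = "rtb-0abc1234example"
DEST_CIDR = "0.0.0.0/0"
STANDBY_ENI = "eni-0standby0example"   # this (standby) router's own interface
ACTIVE_ROUTER_IP = "10.1.0.10"         # active peer, probed for liveness

ec2 = boto3.client("ec2", region_name="us-gov-west-1")


def active_router_healthy() -> bool:
    """Simple reachability probe; production deployments typically rely on
    BFD or the router vendor's high-availability agent instead."""
    try:
        with socket.create_connection((ACTIVE_ROUTER_IP, 22), timeout=2):
            return True
    except OSError:
        return False


while True:
    if not active_router_healthy():
        # Repoint the default route at the standby router's interface so host
        # traffic immediately flows through the surviving device.
        ec2.replace_route(
            RouteTableId=ROUTE_TABLE_ID,
            DestinationCidrBlock=DEST_CIDR,
            NetworkInterfaceId=STANDBY_ENI,
        )
        break
    time.sleep(5)
```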
The Implementation
LTG has built and continuously improved ICE’s cloud network environment since the day the requirement for a shared, centrally managed environment was handed down. From the beginning, LTG designed the environment to be flexible and modular, and as a result ICE has been able to expand its AWS and Azure presence to meet new requirements, upgrade or replace components of the network, and take advantage of new technologies offered by the CSPs which were not available when the environment was first built.
ICE’s major cloud presence is in AWS GovCloud and Azure Government, so the architecture delineated above was adapted to both environments to efficiently use the native capabilities of each cloud while maintaining a common design philosophy.
ICE’s CIO has mandated that all cloud usage be restricted to the dedicated government cloud regions offered by AWS and Azure (GovCloud and Azure Government, respectively). This presents additional challenges: CSP features and software must be certified for FedRAMP High before they can be made available in the government clouds, so these environments often lack, or receive only after long delays, features that engineers accustomed to the commercial cloud regions take for granted.
For instance, the need for multiple virtual routers in every AWS server VPC has been eliminated by AWS finally introducing the Transit Gateway (TGW) feature into the FedRAMP High GovCloud regions. By building out parallel connectivity structures and conducting a phased cutover via routing policy changes, LTG was able to transition all of ICE’s production and non-production environments to a TGW architecture without extended outages or disruption of production traffic. This allowed ICE to retire dozens of virtual routers, saving license and maintenance fees while vastly simplifying the management of the AWS network environment and lightening the burden on Operations.
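A single step of such a cutover can be sketched as follows, assuming boto3 and hypothetical TGW, VPC, subnet, route table, and CIDR values: attach a server VPC to the Transit Gateway, then repoint the VPC's enterprise route from the legacy virtual-router interface to the TGW.

```python
# Hedged sketch of one cutover step: attach a server VPC to an existing
# Transit Gateway, then replace the enterprise route so it targets the TGW
# instead of a legacy virtual router. All identifiers are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

# Attach the server VPC to the Transit Gateway via two of its subnets.
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId="tgw-0abc1234example",
    VpcId="vpc-0server00example",
    SubnetIds=["subnet-0aaaexample", "subnet-0bbbexample"],
)

# Once the attachment is available and routing has been verified, replace the
# existing enterprise route (previously pointing at a virtual router's
# interface) with one pointing at the Transit Gateway.
ec2.replace_route(
    RouteTableId="rtb-0server00example",
    DestinationCidrBlock="10.0.0.0/8",      # hypothetical enterprise range
    TransitGatewayId="tgw-0abc1234example",
)
```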
While multiple upgrades and redesigns have been necessary over the years as capabilities and requirements changed, the underlying principles have remained solid. As the cloud environments settle into a more stable long-term design, the focus shifts to streamlining and foolproofing the network as much as possible to ensure ease of management and maintenance by the Operations team.
Conclusion
By applying consistent and rigorous design principles, while continuously improving and upgrading ICE’s cloud network to take advantage of new technologies and new capabilities, LTG has met ICE’s requirement of building a highly robust and manageable shared environment. The network meets the requirements not just of modern cloud-native applications, but of the lengthy roster of legacy applications which had to be migrated into the cloud without the benefit of redesign.