April 17, 2023

Cloud Bursting with Tanzu Service Mesh

Antonio Gallego and Andreas Marqvardsen show how to automatically burst cloud services with Tanzu Service Mesh, NSX Advanced Load Balancer and Tanzu Kubernetes Grid.

Watch the demo here: https://via.vmw.com/tchza28ff6f0no5432

Introduction

Cloud bursting is a very interesting topic to explore, as it concerns using computing resources effectively at the exact moment they are needed. In this post we explore how to do it automatically using VMware technology, maximizing the benefits and outcomes of the approach.

Cloud bursting and why to automate it

With cloud bursting we can use cloud computing resources at the moments when on-premises resources are at peak utilization and cannot provide more capacity to meet the demand.

 

But what typically happens when this peak is reached and we want to activate cloud bursting? Well, it's not easy: networks must connect the on-premises environment with the public cloud environment where the burst must happen; the security posture must be implemented and tested, effectively and continuously; monitoring must be activated; and resources must be configured in the public cloud to cover the additional workload demanded at the peak.

 

Therefore, implementing cloud bursting can be a complex activity, and it doesn't make a lot of sense to configure it every time a peak is reached, as its benefits can be eroded by the operational effort required to set it up.

 

What if we could implement automatic cloud bursting, so the infrastructure provisions what's needed in the cloud at peak moments and deprovisions those resources when demand returns to normal?

 

Automatic cloud bursting could then maximize the promised benefits while minimizing the effort to implement and configure the cloud infrastructure.

 

Combining this method with Kubernetes could bring even more benefits thanks to its auto-scaling capabilities, either at the pod level or at the worker node level. In other words, we could have a minimal cluster deployed in the cloud, waiting to be activated and dynamically scaled at peak moments.
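For illustration, pod-level autoscaling in vanilla Kubernetes is typically expressed with a HorizontalPodAutoscaler; a minimal sketch (the names and thresholds below are placeholders, not taken from the demo) looks like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend              # placeholder deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # scale up above 70% average CPU

Later in this post we use the Tanzu Service Mesh Service Autoscaler instead, which plays an equivalent role for services onboarded to the mesh.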

 

Then, which components do we need to implement automatic cloud bursting?

First, we would need a solution to connect Kubernetes clusters deployed in different places, such as a private datacenter and any cloud infrastructure. We would also need a solution to balance the load between the sites during peak moments, intelligent enough to send traffic only to the on-premises infrastructure when demand is normal. And we would need a utility to watch the local status of the application, detect when the peak is reached, and calculate the excess capacity to provision dynamically in the cloud.

 

So, let’s dive into the details of this exciting automatic cloud bursting solution!

 

The solution

Architecture

How can an automatic cloud-bursting solution be implemented between an on-premises data center and the public cloud? We picked the acme fitness shop demo app and used its frontend application to demonstrate it.

 

The solution is built with two Kubernetes clusters running the same acme fitness shop frontend: one placed in a local datacenter (in our case in the US), starting with 1 replica (1 pod), and the other in the Azure public cloud (in the EU), starting with 0 replicas (0 pods). To connect both clusters, Tanzu Service Mesh provides Global Namespaces and integrates with NSX Advanced Load Balancer (Avi) for GSLB and DNS services. This load balancer is installed in both places; the one installed in the US plays the leader role, and DNS resolution is restricted to the local datacenter only.
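To make that starting point concrete, here is a hedged sketch of the Azure-side frontend Deployment (name, namespace, and image are placeholders, not taken from the demo); the only special thing about it is that it starts dormant with zero replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: shopping                   # hypothetical name for the acme frontend
  namespace: frontend-eu           # hypothetical namespace on the Azure cluster
spec:
  replicas: 0                      # dormant until a cloud-bursting event scales it up
  selector:
    matchLabels:
      app: shopping
  template:
    metadata:
      labels:
        app: shopping
    spec:
      containers:
      - name: shopping
        image: <acme-frontend-image>   # placeholder image reference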

 

This diagram shows the high-level building blocks of the solution:

 

Local Data Center details

The local datacenter is running Tanzu on vSphere 8 with vSAN, NSX and NSX Advanced Load Balancer (Avi) for GSLB and DNS services.
For connectivity between our on-prem data center and our VNET in Azure, we are using VMware SD-WAN (VeloCloud) with an IPsec VPN.

 

The illustration above is a very high-level description of how the different components are involved in publishing the acmeshopping frontend. vSphere is the foundation for all the compute and storage (vSAN). We have enabled WCP on it, with NSX as the underlying network, including the NSX built-in load balancer. NSX is responsible for all the routing in and out and is configured with BGP to our upstream router. The load balancer is responsible for exposing the TKC API endpoint as well as all services of type LoadBalancer, which is how we expose acmeshopping. NSX ALB is responsible for the acmeshopping.gslb.cloudburst.somecooldomain.net DNS record, served through its built-in DNS service, and our internal DNS server is configured to forward all requests for the .gslb.cloudburst.somecooldomain.net domain to it. The VeloCloud Edge is responsible for establishing an IPsec tunnel between our on-prem networks and the Azure VNETs. We need this because the on-prem NSX ALB controller requires connectivity for the health checks between the GSLB pool members, which happen in the "backend" through the VeloCloud IPsec tunnel.
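As a minimal sketch of that exposure (port numbers and labels are assumptions, not taken from the demo), the frontend only needs a Service of type LoadBalancer for the NSX load balancer to allocate a VIP:

apiVersion: v1
kind: Service
metadata:
  name: shopping            # hypothetical service name for the acme frontend
spec:
  type: LoadBalancer        # the NSX built-in load balancer allocates the VIP
  selector:
    app: shopping           # assumed pod label
  ports:
  - port: 80                # external port (assumed)
    targetPort: 3000        # assumed container port for the frontend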

So the following components have been installed and configured:

  1. vSphere
  2. NSX
  3. NSX Advanced Load Balancer controller and Service Engines
  4. VMware SD-WAN Edge
  5. IPsec VPN
  6. vSphere with Tanzu (WCP): Supervisor Cluster and a workload cluster
  7. The acmeshopping application deployed in the workload cluster

 

The important component here is the GSLB feature in NSX Advanced Load Balancer. It is responsible for load balancing our acmeshopping application between the on-prem and public cloud instances of the application. It allows us to use the same DNS record while the load is balanced with different load-balancing algorithms; we are just using round robin in this example, but there are other ways to load balance with GSLB, some examples below:
 

 

   

Public cloud details

In Azure we deployed the Kubernetes infrastructure to be used only during cloud-bursting events, as well as all the networking components needed to connect to the VeloCloud endpoint and to the leader AVI load balancer.

 

The Kubernetes cluster is based on the Tanzu Kubernetes Grid distribution supported on Azure infrastructure and was deployed following the steps from the public documentation.
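As a hedged reference (all values are placeholders; the authoritative variable list is in the TKG documentation), the Azure cluster configuration is a YAML file of settings along these lines:

INFRASTRUCTURE_PROVIDER: azure
CLUSTER_NAME: tkg-azure-eu                 # placeholder name
CLUSTER_PLAN: dev
AZURE_LOCATION: westeurope                 # EU region for the burst cluster
AZURE_SUBSCRIPTION_ID: <subscription-id>
AZURE_TENANT_ID: <tenant-id>
AZURE_CLIENT_ID: <client-id>               # service principal used by TKG
AZURE_CLIENT_SECRET: <client-secret>
AZURE_SSH_PUBLIC_KEY_B64: <base64-ssh-public-key>
AZURE_CONTROL_PLANE_MACHINE_TYPE: Standard_D2s_v3
AZURE_NODE_MACHINE_TYPE: Standard_D2s_v3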

 

Regarding the connectivity to the VeloCloud Edge endpoint placed in the local DC, we configured an Azure connection with these elements (a CLI sketch follows the list):

  • An Azure Site-to-site connection, type IPsec

  • A Virtual network gateway (called vngw) with VPN type Route-based, and with an assigned public IP address

  • A Local network gateway also with an assigned public IP address and with the Azure and remote address spaces added to its configuration
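Assuming the resource group, names, and address ranges shown (all placeholders, with the VNET and public IP already created), the equivalent Azure CLI steps would look roughly like this:

az network vnet-gateway create --resource-group burst-rg --name vngw \
  --vnet burst-vnet --public-ip-address vngw-pip \
  --gateway-type Vpn --vpn-type RouteBased --sku VpnGw1

az network local-gateway create --resource-group burst-rg --name onprem-lgw \
  --gateway-ip-address <velocloud-public-ip> \
  --local-address-prefixes 10.10.0.0/16   # on-prem address space (placeholder)

az network vpn-connection create --resource-group burst-rg --name s2s-onprem \
  --vnet-gateway1 vngw --local-gateway2 onprem-lgw --shared-key <pre-shared-key>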

 

Finally, the NSX Advanced Load Balancer (AVI) was deployed using the Marketplace offering and following the deployment wizard. Then an Azure cloud was set up in the controller so it could deploy the Service Engines dynamically on Azure infrastructure. The GSLB site configuration was also completed, showing the site as Active and depending on the leader placed in the local DC:

Tanzu Service Mesh details

Integration with AVI LB

First, the integration between Tanzu Service Mesh (TSM) and the AVI LB running in the local DC must be set up. For this, the local TKG cluster was first labeled with “Proxy Location=uswa”:

 

Then the AVI Integration used this label to connect to the local DC AVI LB using the TKG cluster as proxy:

 

 

 

Finally, a DNS account was created:

 

With all these components in place, the AVI Integration showed as connected:

 

TSM Global Namespace setup

A global namespace is a unique concept in Tanzu Service Mesh, as it defines an application boundary. A global namespace connects the resources and workloads that make up the application into one virtual unit to provide consistent traffic routing, connectivity, resiliency, and security for applications across multiple clusters and clouds. Each global namespace is an isolated domain that provides automatic service discovery and manages service identities within that global namespace.

 

For this setup, we created a Global Namespace called acme-shopping with the domain acmegns.tanzu (this controls how one resource can locate another and provides a registry for the virtual services onboarded inside):

Then we onboarded the three namespaces: two with the frontends and one with the backend:

 

The public services setup makes use of the AVI Integration to configure access to the application, automatically configuring the public URL in the DNS service of the AVI controller deployed in the local DC:

 

For the GSLB schema, we selected Round Robin; other available schemas are Weighted and Active-Passive. For the health checks, the defaults were selected:

In the final step of the Global Namespace setup, the different services were recognized successfully, including the frontend and backend services:

 

With this, we tested access to the application, and it showed the connection between the frontend running in the US and the backend running in the EU:

 

Regarding the frontend service (called shopping), it was in a yellow state, as there weren't any replicas running on the cloud-based Azure cluster. This is a normal condition, since no cloud-bursting situation was happening yet, and therefore the GSLB is in a red status for the Azure part:

 

Automatic cloud burster details

Autoscaling is the ability of a service to automatically scale up or down to efficiently handle changes in service demand. With the Tanzu Service Mesh Service Autoscaler, developers and operators get automatic scaling of microservices to meet changing levels of demand based on metrics such as CPU or memory usage. These metrics are available to Tanzu Service Mesh without additional code changes or metrics plugins. But this autoscaler works at the cluster level, so we devised a solution to extend the scaling across clusters onboarded to a TSM Global Namespace.

 

The first step to configure extended autoscaling between the local DC cluster in the US and the remote Azure cluster in the EU was to add a local TSM autoscaler, using the CRD that TSM provides to any onboarded cluster to configure an Autoscaler Definition (asd):

 

 

apiVersion: autoscaling.tsm.tanzu.vmware.com/v1alpha1
kind: Definition
metadata:
  name: frontend-asd
  labels:
    app: shopping
spec:
  scaleTargetRef:
    kubernetes:
      apiVersion: apps/v1
      kind: Deployment
      name: shopping
  scaleRule:
    mode: EFFICIENCY
    enabled: true
    instances:
      min: 1
      max: 4
      default: 1
      stepsDown: 1
      stepsUp: 1
    trigger:
      gracePeriodSeconds: 60
      metric:
        name: CPUUsageMillicores
        scaleUp: 30
        scaleDown: 15
        windowSeconds: 60

When the autoscaler is deployed, it reports its status through the K8s API:
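A hedged sketch of the relevant part of that status (the Reason value comes from the description below; the other field names are assumptions made for illustration):

status:
  desiredInstances: 6      # assumed field: replicas the current demand would require
  currentInstances: 4      # assumed field: capped at the configured max of 4
  reason: EnforcingLimit   # reported when the autoscaler hits its configured limit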

 

 

We developed a K8s solution (called tsm-cloud-burster) to monitor this deployed asd: it queries the status section, detects when the autoscaler has reached its maximum number of replicas (the Reason field reports “EnforcingLimit”), and calculates the excess number of replicas that would be needed to meet the demand at that moment.

 

This excess is what triggers the cloud-bursting event: the tsm-cloud-burster pod then scales the remote deployment accordingly.
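A minimal sketch of that control loop using the Python Kubernetes client (the context names, namespaces, CRD plural, and status field names are assumptions; the real implementation lives in the GitHub repo linked below):

from kubernetes import client, config
import time

# Load clients for both clusters (context names are placeholders)
local = config.new_client_from_config(context="local-us")
remote = config.new_client_from_config(context="azure-eu")

custom = client.CustomObjectsApi(local)
remote_apps = client.AppsV1Api(remote)

LOCAL_MAX = 4  # must match spec.scaleRule.instances.max in the asd above

while True:
    # Read the Autoscaler Definition status (plural "definitions" assumed)
    asd = custom.get_namespaced_custom_object(
        group="autoscaling.tsm.tanzu.vmware.com",
        version="v1alpha1",
        namespace="frontend-us",      # placeholder namespace
        plural="definitions",
        name="frontend-asd",
    )
    status = asd.get("status", {})
    if status.get("reason") == "EnforcingLimit":
        # Demand exceeds the local max: burst the excess to the Azure cluster
        excess = max(status.get("desiredInstances", LOCAL_MAX) - LOCAL_MAX, 0)
    else:
        excess = 0  # no peak: scale the remote frontend back down to zero
    remote_apps.patch_namespaced_deployment_scale(
        name="shopping",
        namespace="frontend-eu",      # placeholder namespace
        body={"spec": {"replicas": excess}},
    )
    time.sleep(30)                    # poll interval (arbitrary)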

 

During a cloud-bursting event, the frontend service in TSM shows the number of replicas contributing to the cloud-based part of the GSLB:

 

When the cloud-bursting event is over, the replicas in the Azure cluster disappear automatically and the service goes back to being served by local DC instances only, with no resources running in the cloud.

 

The instructions and code related to the tsm-cloud-burster solution are in this GitHub repo.

 

Automatic cloud bursting is entirely feasible with the combination of Tanzu Service Mesh and its integration with NSX Advanced Load Balancer for DNS and GSLB services, plus a simple K8s-based application (like tsm-cloud-burster) that periodically monitors the local autoscaler and instructs the remote cluster to provide the excess number of replicas whenever they are really needed.

Benefits

Cloud bursting in Kubernetes refers to the process of extending an application's capacity to run in a public cloud when the resources of the private cloud or on-premises infrastructure are not sufficient to handle the increased demand. This approach offers several advantages that are boosted when full automation is configured:

 

  • Scalability: expanding and contracting resources on demand helps optimize infrastructure costs and maintain high performance.
  • Cost optimization: By leveraging cloud resources only when required, organizations can save on infrastructure costs. They can maintain a smaller on-premises infrastructure and pay for additional cloud resources only when needed, thus optimizing their investments. Adding solutions like VMware Aria Cost powered by CloudHealth can help calculate precisely the spend on cloud resources when cloud bursting happens.
  • High availability: Cloud bursting helps maintain high availability by distributing workloads across multiple environments. In case of an issue with the on-premises infrastructure, the public cloud can take over the workload, ensuring minimal disruption. This can be met through the Active-Passive GSLB schema available in the integration between VMware Tanzu Service Mesh and VMware NSX Advanced Load Balancer.
  • Flexibility: Kubernetes provides a uniform platform for managing and deploying applications across different environments, making it easier for organizations to adopt and implement cloud bursting. This flexibility enables businesses to choose the most appropriate and cost-effective cloud provider for their needs.
  • Disaster recovery: In the event of a disaster or failure in the on-premises infrastructure, the application can continue to run in the public cloud, ensuring continuity of operations and minimizing downtime. This can also be met through the Active-Passive GSLB schema mentioned above.
  • Faster time-to-market: By leveraging cloud resources to handle additional workloads, development and testing can be expedited, which helps organizations bring new features and services to market more quickly.
  • Global reach: Cloud providers have data centers located in different regions across the globe. Cloud bursting allows organizations to deploy applications closer to their end users, resulting in reduced latency and improved user experience. Combining this with VMware Tanzu Service Mesh security policy management can help establish access-control and geofencing policies to protect access to the different parts of the application.

 

Finally, we have demonstrated that the potential challenges associated with cloud bursting, such as data security, compliance, and latency, can be addressed through an automated cloud-bursting implementation.

 

 

Where to learn more about it

 

 Antonio Gallego

Andreas Marqvardsen

 

 

 

 
