Tanzu Kubernetes Grid and Portworx Asynchronous Disaster Recovery Solution
Author: Nithin Krishnan, Ranjith B Purvachari, K Puneeth Keerthy
As enterprises increasingly adopt containerization and microservices architecture, the need for robust disaster recovery (DR) solutions has become more apparent. Portworx, a cloud native storage and data management company, provides a DR solution for VMware Tanzu multi-cloud environments that is easy to implement and manage.
In this blog, we will discuss Portworx DR on VMware Tanzu and how customers can leverage Portworx DR capabilities on VMware Tanzu Kubernetes Grid. This blog covers the following three sections:
- Introduction to Portworx and asynchronous DR
- Installing Portworx on Tanzu Kubernetes Grid
- Configuring Portworx Async DR on Tanzu Kubernetes Grid
Introduction to Portworx and Async DR
Portworx Enterprise, a certified Partner Ready for VMware Tanzu solution, is a software-defined container storage platform built from the ground up for Kubernetes. It provides scale-out software-defined container storage, data availability, data security, and backup for Kubernetes-based applications.
Portworx DR (PX-DR) on Tanzu Kubernetes Grid integration provides a disaster recovery solution for Kubernetes clusters. It replicates the data and configuration of the apps present in a primary cluster to a secondary cluster located in a different availability zone or region. The secondary cluster is therefore up to date with the primary cluster. In the event of a disaster, the secondary cluster can be promoted to the primary cluster and take over the workload seamlessly.
Figure 1: High-level PX-DR solution diagram on VMware Tanzu Kubernetes Grid
Understanding Portworx asynchronous disaster recovery solution
When it comes to disaster recovery planning, organizations need to have a solid strategy in place to ensure business continuity in the event of a disaster. With PX-DR, organizations can build a robust disaster recovery plan for applications running on Tanzu Kubernetes Grid clusters.
Since our clusters are spanned across multiple regions and the network latency between the nodes is high, we will be setting up an asynchronous DR solution offered by Portworx. Here are a few of the advantages of Portworx Async DR:
- Portworx Async DR is a disaster recovery solution that helps organizations protect their critical data in the event of a disaster or outage.
- The solution is asynchronous, which means that data is replicated in the background without impacting production workloads.
- With Portworx Async DR, organizations can replicate their data to a secondary site, which can be in a different geographic location or cloud provider, for added resiliency.
- The solution provides a low recovery point objective (RPO), bounded by the replication interval, and a low recovery time objective (RTO), which means that data can be recovered quickly and with minimal data loss.
- The solution is designed to be cloud native, which means that it can be easily integrated with Kubernetes and other cloud native technologies.
- Portworx Async DR is highly scalable, allowing organizations to protect their data as their workloads grow and evolve.
Defining RPO and RTO requirements for each application is crucial in disaster recovery planning. RPO refers to the maximum acceptable amount of time since the last data recovery point, which essentially means the amount of data loss that a business can tolerate.
On the other hand, RTO refers to the maximum acceptable delay between the interruption of service and restoration of service, which is the amount of time it takes to fully recover from a disaster. Organizations must find the right balance between application availability needs and the cost associated with designing and implementing a solution that meets these requirements.
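As a rough worked example of this trade-off, the worst-case RPO with async DR is bounded by the replication interval plus the time a single incremental migration takes. The numbers below are illustrative assumptions, not measured values:

```shell
# Illustrative only: estimate the worst-case RPO for an async DR setup.
INTERVAL_MIN=15      # assumed SchedulePolicy interval (minutes)
MIGRATION_MIN=5      # assumed duration of one incremental migration (minutes)
WORST_CASE_RPO=$((INTERVAL_MIN + MIGRATION_MIN))
echo "Worst-case RPO: ${WORST_CASE_RPO} minutes"
```

If the business can only tolerate losing, say, 10 minutes of data, the replication interval must shrink accordingly, or a synchronous DR topology should be considered instead.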
Installing Portworx on Tanzu Kubernetes Grid
Now that we have learned about Portworx and why it is used, let’s take a look at how to install it on Tanzu workload clusters. The following steps must be run on both the source and destination clusters in order to install a Portworx cluster on each.
Prerequisites
- VMware software-defined data center (SDDC)
- Two Tanzu workload clusters deployed across multiple regions
- Portworx Enterprise license activated on both the clusters
- Object storage that should be accessible from both clusters
Steps to install the Portworx cluster
- Export the following environment variables:
  - Kubernetes version of the Tanzu Kubernetes cluster
  - Portworx version to be installed
  - Namespace where Portworx needs to be installed
  - A name for the Portworx cluster
export KBVER=$(kubectl version --short | awk -F'[v+_-]' '/Server Version: / {print $3}')
export PXVER="2.13.0"
export NAMESPACE="portworx"
export PX_CLUSTERNAME="<provide a name for portworx cluster>"
- Label the worker nodes where the Portworx cluster nodes need to be installed. For proof-of-concept purposes, all worker nodes where Portworx nodes should be created are labeled. This is needed to make sure the Portworx pods are deployed on these selected worker nodes. For each worker node add the label px/enabled=true.
# List all the K8s cluster nodes
kubectl get nodes

# Label the nodes
kubectl label node <Worker Node Name> px/enabled=true --overwrite=true
- Portworx needs a key-value database (KVdb) to store the metadata information. It creates an internal KVdb on each of the worker nodes during the installation. Add the label px/metadata-node=true to all the worker nodes in order to install KVdb on the worker nodes.
kubectl label node <Worker Node Name> px/metadata-node=true --overwrite=true
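On clusters with many workers, both labels can be applied in a single loop. This is a sketch with hypothetical node names; the kubectl stub on the first line only echoes the commands so the loop can be traced offline, and must be removed to label real nodes:

```shell
# Stub kubectl so the loop can be dry-run offline; delete this line on a real cluster.
kubectl() { echo "kubectl $*"; }

# Hypothetical worker node names; on a real cluster, derive them from 'kubectl get nodes'.
for node in worker-node-1 worker-node-2 worker-node-3; do
  kubectl label node "$node" px/enabled=true px/metadata-node=true --overwrite=true
done
```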
- Installation of Portworx components can be done in multiple ways, including Helm chart or operator-based installation. For the current setup, an operator-based installation is used. Install the operator and the necessary custom resource definitions (CRDs) using the following command.
kubectl apply -f "https://install.portworx.com/${PXVER}?comp=pxoperator&kbver=${KBVER}&ns=${NAMESPACE}"
- Once the operator is installed and running, verify the status of the operator pod.
kubectl get pods -n portworx
- The Portworx cluster configuration is specified by a Kubernetes CRD called StorageCluster, which is an object that acts as the definition of the Portworx cluster. When modifying the StorageCluster object, the operator will update the Portworx cluster in the background. There are multiple ways to create the spec file. We can make use of the PX-Central console to generate the StorageCluster spec based on custom user inputs.
- Below is the StorageCluster spec obtained from the PX-Central console. Only a few of the fields in the StorageCluster need to be manually added—such as annotations according to the solution, which we will be implementing. For more details on the StorageCluster CRD spec and how to add additional fields to the existing spec, refer to the Portworx documentation.
kind: StorageCluster
apiVersion: core.libopenstorage.org/v1
metadata:
  name: px-workload-cluster
  namespace: portworx
  annotations:
    portworx.io/service-type: portworx-api:LoadBalancer  # (1)
spec:
  image: portworx/oci-monitor:2.13.0
  imagePullPolicy: IfNotPresent
  kvdb:
    internal: true
  cloudStorage:
    deviceSpecs:
    - sc=default,size=150  # (2)
    journalDeviceSpec: auto  # (3)
    kvdbDeviceSpec: sc=default,size=44  # (4)
  secretsProvider: k8s
  stork:
    enabled: true
    args:
      webhook-controller: "true"
  autopilot:
    enabled: true
  csi:
    enabled: true
  monitoring:
    telemetry:
      enabled: true
    prometheus:
      enabled: true
      exportMetrics: true
      alertManager:
        enabled: true
  env:
  - name: ENABLE_CSI_DRIVE
    value: "true"
  - name: "PRE-EXEC"
    value: "iptables -A INPUT -p tcp --match multiport --dports 9001:9020 -j ACCEPT"
In the above specification:
1. We set the annotation in the StorageCluster object to make sure the Portworx API service is exposed through a LoadBalancer.
2. The storage class name and the storage size for each Portworx cluster node are defined. For example, for a three-node Portworx cluster, a total storage pool of 450 GB will be created.
3. journalDeviceSpec is set to "auto"; this parameter specifies the device or partition where Portworx stores its journal data. The journal is a critical component of Portworx and is used to ensure the consistency of data written to the storage cluster.
4. This field sets the StorageClass name and storage size for the Portworx KVdb, which stores metadata and configuration information related to the StorageCluster. Since .spec.kvdb.internal is set to true, Portworx creates a KVdb server during the installation of the cluster. Alternatively, an external KVdb can be used to store this metadata and configuration information.
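As a quick sanity check of item 2 above, the aggregate storage pool is simply the per-node device size multiplied by the number of storage nodes:

```shell
# Pool-size arithmetic for the spec above: 3 nodes, each with a 150 GB device.
NODES=3
SIZE_GB=150
TOTAL_GB=$((NODES * SIZE_GB))
echo "Total storage pool: ${TOTAL_GB} GB"
```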
- Apply the manifest file and wait until the StorageCluster status shows as Online. It might take 25–30 minutes for the installation to be completed.
kubectl -n portworx get storagecluster
NAME                  CLUSTER UUID     STATUS   VERSION   AGE
px-workload-cluster   <Cluster UUID>   Online   2.13.0    13d
- Check the Portworx cluster status, including the StoragePool size and other details. If the cluster setup completed successfully, the status for all the Portworx storage nodes will show as Online.
export NAMESPACE="portworx"
export PX_POD=$(kubectl get pods -l name=portworx -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
kubectl exec $PX_POD -n $NAMESPACE -- /opt/pwx/bin/pxctl status
- Activate the PX Enterprise License by running the following:
kubectl exec -n portworx <portworx-pod> -- /opt/pwx/bin/pxctl license activate "<enterprise activation code>"
- Repeat the same steps in the destination cluster (portworx-dr-cluster). Make sure you provide a different name for the StorageCluster object to differentiate the objects created in the two clusters.
Portworx Central
Portworx Central (PX-Central) is an on-premises management solution for Portworx Enterprise that provides simplified monitoring, enhanced security, and data insights, enabling administrators to manage and monitor Portworx clusters and their resources deployed across multiple clusters from a single pane of glass. Please refer to the documentation to set up PX-Central.
By installing PX-Central on the Tanzu shared services cluster, it is easy to manage the Portworx clusters deployed in Site A and Site B. The Tanzu shared services cluster is used to deploy shared services such as Harbor and Contour; monitoring and logging tools can also be installed there, and these centralized services can monitor the workload clusters.
Configuring Portworx Async DR on Tanzu Kubernetes Grid
Before getting started, ensure that all the prerequisites are met and the Portworx installation is complete. Please refer to the previous section for the setup.
The following terms are used frequently when configuring Portworx PX-DR for asynchronous disaster recovery. Each is described in detail before it is created:
- Cluster-wide-encryption key
- Object store credentials
- Stork
- ClusterPair
- SchedulePolicy, MigrationSchedule, Migrations
In the below configurations, the two Tanzu workload clusters present in Site A and Site B will be referred to as:
- Source Cluster – (portworx-workload-cluster in Site A)
- Destination Cluster – (portworx-dr-cluster in Site B)
Let’s get started!
- Create a cluster-wide encryption key in both clusters. This is a common key that is used to encrypt all volumes created using a Portworx StorageClass.
# Source Cluster and Destination Cluster
export encryptionKey="encryption*key*1234"
kubectl -n portworx create secret generic px-vol-encryption \
  --from-literal=cluster-wide-secret-key=$encryptionKey
secret/px-vol-encryption created
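The key above is hard-coded for demo purposes. For real deployments, a randomly generated key is preferable; below is one possible sketch, assuming a Linux-like environment where /dev/urandom and od are available:

```shell
# Generate a random 32-character hex key instead of hard-coding one.
encryptionKey=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "Generated a ${#encryptionKey}-character key"
```

The resulting value would then be stored in the px-vol-encryption secret exactly as shown above.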
- Set the cluster key on both the clusters.
# Source Cluster and Destination Cluster
kubectl exec -n portworx <portworx-pod> -- /opt/pwx/bin/pxctl secrets set-cluster-key \
  --secret cluster-wide-secret-key
- In the destination cluster (portworx-dr-cluster), obtain the cluster UUID, which is required to create object store credentials.
# Destination Cluster
kubectl exec <portworx-pod> -n portworx -- /opt/pwx/bin/pxctl status | grep UUID | awk '{print $3}'
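To show what that pipeline extracts, here is the same grep/awk logic run offline against a sample line mimicking pxctl status output (the UUID shown is made up):

```shell
# 'Cluster UUID: <value>' -- the third whitespace-separated field is the UUID.
SAMPLE="Cluster UUID: 1b2c3d4e-5f60-7a8b-9c0d-1e2f3a4b5c6d"
echo "$SAMPLE" | grep UUID | awk '{print $3}'
```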
Creation of object store credentials
Object store credentials in Portworx are used to authenticate and authorize access to object storage providers, such as Amazon Simple Storage Service (Amazon S3), Google Cloud Storage, or Microsoft Azure Blob Storage. Object store credentials consist of an endpoint URL, an access key, and a secret key, which are used to establish a secure connection between Portworx and the object storage provider.
- Create object store credentials in both the source and destination clusters after obtaining the destination cluster UUID above. Run the same command inside the Portworx pod in both clusters. The options for creating object store credentials differ between object store types. Please refer to the documentation for more information.
Note: The below step could be used to create the credentials if the object store is Amazon S3 or Minio.
# Source Cluster and Destination Cluster
kubectl exec <px-workload-cluster-pod-name> -n portworx -- /opt/pwx/bin/pxctl credentials create \
  --provider s3 \
  --s3-access-key <access-key> \
  --s3-secret-key <secret-key> \
  --s3-region us-east-1 \
  --s3-endpoint <object store endpoint> \
  --s3-storage-class STANDARD \
  ClusterPair_<UUID of Destination Cluster>

Output:
Defaulted container "portworx" out of: portworx, csi-node-driver-registrar
Credentials created successfully, UUID: <UUID will be shown>
- Verify that the credentials were successfully created in both clusters using the below command.
# Source Cluster and Destination Cluster
kubectl exec <portworx-pod-name> -n portworx -- /opt/pwx/bin/pxctl credentials list
Defaulted container "portworx" out of: portworx, csi-node-driver-registrar
- Switch to the second cluster context to generate the ClusterPair.
Generating ClusterPair spec on Destination Cluster and applying it in Source Cluster
ClusterPair is an object which is used to create trust between source and destination Cluster for communication. The ClusterPair object pairs the Portworx storage driver with the Kubernetes scheduler, allowing volumes and resources to be migrated between clusters.
In order to generate the ClusterPair, we need to download the storkctl binary. It generates a ClusterPair spec based on details of the cluster where Portworx is installed.
- Copy the storkctl binary from any one of the Stork pods. Stork is the storage scheduler used by Portworx for Kubernetes that helps achieve even tighter integration. It allows users to co-locate pods with their data, provides seamless migration of pods in case of storage errors, and makes it easier to create and restore snapshots of Portworx volumes.
kubectl cp -n portworx <stork-pod-name>:/storkctl/linux/storkctl /usr/local/bin/storkctl
- Change the permissions for storkctl binary:
chmod +x /usr/local/bin/storkctl
- Verify if storkctl is correctly installed by running:
storkctl version
storkctl Version: 23.2.0-1f0d6530
- Generate ClusterPair on Destination cluster by running:
# Destination Cluster
storkctl generate clusterpair dr-workload-cluster-cp -n dr-enabled-namespace -o yaml > dr-cluster-clusterpair.yaml
- Edit the YAML file to add the below key values under options. Note that all the values need to be mentioned in double quotes in order for the ClusterPair object to be validated.
- IP – The Portworx API LoadBalancer IP exposed in the destination cluster. Portworx in the source cluster uses this IP to communicate with the Portworx API in the destination cluster.
- Port – The port exposed by the Portworx API service in the destination cluster.
- Token – The token used to establish an authenticated connection with the destination cluster.
- Mode – By default, every seventh migration is a full migration. If you specify mode: DisasterRecovery, then every migration is incremental. When doing a one-time migration (and not DR), skip this option.
  options:
    ip: "<Portworx API LoadBalancer IP of dr-workload cluster>"
    port: "9001"
    token: "<Obtain the cluster token of dr-workload-cluster>"
    mode: "DisasterRecovery"
To obtain the destination cluster token, run the below command in the destination cluster (portworx-dr-cluster).
# Destination Cluster
kubectl -n portworx exec -it <portworx-pod-name> -- /opt/pwx/bin/pxctl cluster token show
Token is <Destination Token will be shown in the output>
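Since unquoted values are a common cause of ClusterPair validation failures, the options block can be generated from shell variables so every value lands in double quotes. The IP and token below are placeholders, not real cluster values:

```shell
# Placeholder inputs; substitute your destination cluster's real values.
PX_LB_IP="203.0.113.10"
PX_PORT="9001"
PX_TOKEN="example-cluster-token"

# Emit the options block with every value double-quoted, as the ClusterPair requires.
OPTIONS=$(cat <<EOF
  options:
    ip: "${PX_LB_IP}"
    port: "${PX_PORT}"
    token: "${PX_TOKEN}"
    mode: "DisasterRecovery"
EOF
)
printf '%s\n' "$OPTIONS"
```

The emitted block can then be pasted under the spec of the generated ClusterPair YAML.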
- Validate the YAML to verify if all the fields are correctly added.
- Apply the ClusterPair object in the source cluster only.
# Source Cluster
kubectl -n dr-enabled-namespace apply -f dr-cluster-clusterpair.yaml
- Check the status of the ClusterPair; it should show as Ready. The ClusterPair might fail to reach the Ready state if:
- The external IP of the Portworx API service on the destination cluster is invalid or is not reachable due to a firewall.
- The token is invalid.
- The Portworx node port is not mentioned correctly.
- The values under the options field in the ClusterPair object are not enclosed in double quotes.
# Source Cluster
kubectl apply -f dr-cluster-clusterpair.yaml
clusterpair.stork.libopenstorage.org/dr-workload-cluster-cp created

storkctl get clusterpair -n dr-enabled-namespace
NAME                     STORAGE-STATUS   SCHEDULER-STATUS   CREATED
dr-workload-cluster-cp   Ready            Ready              25 Apr 23 06:48 UTC
Creation of SchedulePolicy and MigrationSchedules
Once communication is established between the source and destination clusters, we need to create a SchedulePolicy and a MigrationSchedule in the source cluster (px-workload-cluster).
- A SchedulePolicy is an object that describes the intervals at which migrations will be scheduled. The interval depends on the amount of data and on the network speed between the clusters. The SchedulePolicy is a cluster-scoped object.
- A MigrationSchedule is an object used to schedule migrations that migrate Kubernetes resources to the destination cluster based on the SchedulePolicy.
- The MigrationSchedule creates an object called Migration, which tracks the status of every migration triggered based on the SchedulePolicy. Portworx will not try to create another migration if an existing migration is in progress.
Create a SchedulePolicy using the below spec. Here we set an interval of 5 minutes for demo purposes only; Portworx recommends using an interval of at least 15 minutes.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: portwx-workload-cluster-schedule-policy
policy:
  interval:
    intervalMinutes: 5  # For demo purpose only.
EOF
schedulepolicy.stork.libopenstorage.org/portwx-workload-cluster-schedule-policy created

# verify the object
storkctl get schedulepolicy
- In the below MigrationSchedule object, please note:
- Multiple namespace names can be mentioned under the namespaces field; currently we are enabling migration for dr-enabled-namespace.
- The ClusterPair we created in the dr-enabled-namespace is referenced.
- The includeResources and includeVolumes fields include all the Kubernetes resources and volumes present in the namespace.
- startApplications should be set to false so that the workloads are not scaled up in the destination cluster after migration. Note: This is important, since subsequent migrations will fail if the applications are already running in the destination cluster.
- Portworx provides some default SchedulePolicies that can be used, or we can use the custom SchedulePolicy we created above.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: migrationschedule
  namespace: dr-enabled-namespace
spec:
  template:
    spec:
      clusterPair: dr-workload-cluster-cp
      includeResources: true
      startApplications: false
      includeVolumes: true
      namespaces:
      - dr-enabled-namespace
  schedulePolicyName: portwx-workload-cluster-schedule-policy
  suspend: false
  autoSuspend: true
EOF
migrationschedule.stork.libopenstorage.org/migrationschedule created

# verify the object installation
storkctl get migrationschedule -n dr-enabled-namespace

Deploying a sample application in the DR-enabled namespace
- Now create a sample stateful application in the source cluster.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgresql
  namespace: dr-enabled-namespace
  labels:
    app: postgresql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:latest
        env:
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          value: postgres123
        - name: POSTGRES_DB
          value: px-dr-demo
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgresql-data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: postgresql-data
        persistentVolumeClaim:
          claimName: postgresql-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresql-pvc
  namespace: dr-enabled-namespace
spec:
  storageClassName: px-db
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: dr-enabled-namespace
spec:
  selector:
    app: postgresql
  ports:
  - name: postgresql
    port: 5432
    targetPort: 5432
EOF
deployment.apps/postgresql created
persistentvolumeclaim/postgresql-pvc created
service/postgresql created
- Here in the PVC object spec, px-db is the StorageClass used to create Portworx volumes. Use kubectl get sc to list all the available Portworx StorageClasses and choose one based on your requirements.
- Wait until the pod reaches the Running state.
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
postgresql-6796db6bb8-bw2v4   1/1     Running   0          64s
- Insert sample data into the database.
kubectl exec -it <postgresql-pod-name> -- psql -U postgres px-dr-demo
psql (15.2 (Debian 15.2-1.pgdg110+1))

px-dr-demo=# CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(50),
  email VARCHAR(50)
);
INSERT INTO users (name, email) VALUES
  ('John', 'john@example.com'),
  ('Jane', 'jane@example.com'),
  ('Bob', 'bob@example.com');
CREATE TABLE
INSERT 0 3

px-dr-demo=# SELECT * FROM users;
 id | name |      email
----+------+------------------
  1 | John | john@example.com
  2 | Jane | jane@example.com
  3 | Bob  | bob@example.com
(3 rows)
- Wait for the async replication to complete. The migration is initiated based on the interval defined in the SchedulePolicy; the Migration objects can be checked in the source cluster.
# Source Cluster
kubectl get migrations -n dr-enabled-namespace
NAME                                           AGE
migrationschedule-interval-2023-04-25-094456   20s
- As you can see, a migration has been initiated by stork based on MigrationSchedule and SchedulePolicy.
# Source Cluster
storkctl get migrations -n dr-enabled-namespace
NAME                                           CLUSTERPAIR              STAGE     STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                                  TOTAL BYTES TRANSFERRED
migrationschedule-interval-2023-04-25-094456   dr-workload-cluster-cp   Volumes   InProgress   0/1       0/0         25 Apr 23 09:44 UTC   Volumes (53.50854835s) Resources (NA)   0
- Wait for the migration status to show as Successful.
storkctl get migrations -n dr-enabled-namespace
NAME                                           CLUSTERPAIR              STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                        TOTAL BYTES TRANSFERRED
migrationschedule-interval-2023-04-25-101113   dr-workload-cluster-cp   Final   Successful   1/1       5/5         25 Apr 23 10:11 UTC   Volumes (39s) Resources (6s)   43601920
- As shown in the above status, the migration is successful and all the respective objects, along with the PVCs, have been asynchronously migrated to the destination cluster (portworx-dr-cluster).
- Remember that once the migration is completed, the applications remain at zero replicas in the destination cluster; only upon activating the migration on the destination cluster are the replicas increased to the replica count of the workload as defined in the source cluster.
Validating Async DR failover operation
Consider a scenario where the application in the source cluster is inaccessible; we need to activate the migrations in the destination namespace. Activating the migration scales the workloads in the destination cluster up to the exact number of replicas of the application in the source cluster. Post failover, we will verify the data that we inserted into the PostgreSQL database.
In order to perform a failover operation from the source workload cluster to the destination DR cluster, use the following steps:
1. Suspend MigrationSchedule and scale down your application.
# Source Cluster
storkctl suspend migrationschedule migrationschedule -n dr-enabled-namespace
MigrationSchedule migrationschedule suspended successfully

storkctl get migrationschedule migrationschedule -n dr-enabled-namespace
# Suspend should show as true in the output
- The above output indicates the MigrationSchedule has been suspended.
- Verify the objects created in the destination cluster.
# Destination Cluster
kubectl get all
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/postgresql   ClusterIP   100.70.219.188   <none>        5432/TCP   97s

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/postgresql   0/0     0            0           97s

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/postgresql-6d9d8d47f8   0         0         0       97s
- As you can see, the resources have been migrated to the destination cluster’s dr-enabled-namespace. Now scale down the application in the source cluster.
# Source Cluster
kubectl -n dr-enabled-namespace scale deploy postgresql --replicas=0
deployment.apps/postgresql scaled
- Switch to destination cluster context and activate the migration.
# Destination Cluster
storkctl activate migration -n dr-enabled-namespace
Set the ApplicationActivated status in the MigrationSchedule dr-enabled-namespace/migrationschedule to true
Updated replicas for deployment dr-enabled-namespace/postgresql to 1
- Verify the data in the database that we created in the source cluster (portworx-workload-cluster).
# Destination Cluster
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
postgresql-6d9d8d47f8-z7bzj   1/1     Running   0          40s

kubectl exec -it postgresql-6d9d8d47f8-z7bzj -- psql -U postgres px-dr-demo
psql (15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.

px-dr-demo=# SELECT * FROM users;
 id | name |      email
----+------+------------------
  1 | John | john@example.com
  2 | Jane | jane@example.com
  3 | Bob  | bob@example.com
(3 rows)

px-dr-demo=#
The data we created in portworx-workload-cluster in Site A has been successfully replicated to portworx-dr-cluster in Site B. The Portworx Async DR failover is complete.
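For repeatability, the failover sequence above can be captured as a small runbook script. kubectl and storkctl are stubbed here so the order of operations can be traced offline; remove the stubs and run each half against the correct cluster context:

```shell
# Stubs for offline tracing; delete these two lines to run the real commands.
kubectl()  { echo "kubectl $*"; }
storkctl() { echo "storkctl $*"; }

NS="dr-enabled-namespace"

# 1. On the source cluster: stop further migrations, then scale the app down.
storkctl suspend migrationschedule migrationschedule -n "$NS"
kubectl -n "$NS" scale deploy postgresql --replicas=0

# 2. On the destination cluster: activate the migrated applications.
storkctl activate migration -n "$NS"
```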
In conclusion, Portworx DR on VMware Tanzu provides a disaster recovery solution for Kubernetes clusters that is easy to implement and manage. By replicating data and configuration to a secondary cluster located in a different availability zone or region, organizations can ensure business continuity in the face of disasters.