Tanzu Kubernetes Grid and Portworx Asynchronous Disaster Recovery Solution
Author: Nithin Krishnan, Ranjith B Purvachari, K Puneeth Keerthy
As enterprises increasingly adopt containerization and microservices architecture, the need for robust disaster recovery (DR) solutions has become more apparent. Portworx, a cloud native storage and data management company, provides a DR solution for VMware Tanzu multi-cloud environments that is easy to implement and manage.
In this blog, we will discuss Portworx DR on VMware Tanzu and how customers can leverage Portworx DR capabilities on VMware Tanzu Kubernetes Grid. This blog covers the following three sections:
- Introduction to Portworx and asynchronous DR
- Installing Portworx on Tanzu Kubernetes Grid
- Configuring Portworx Async DR on Tanzu Kubernetes Grid
Introduction to Portworx and Async DR
Portworx Enterprise, a certified Partner Ready for VMware Tanzu solution, is a software-defined container storage platform built from the ground up for Kubernetes. It provides scale-out software-defined container storage, data availability, data security, and backup for Kubernetes-based applications.
Portworx DR (PX-DR) on Tanzu Kubernetes Grid integration provides a disaster recovery solution for Kubernetes clusters. It replicates the data and configuration of the apps present in a primary cluster to a secondary cluster located in a different availability zone or region. The secondary cluster is therefore up to date with the primary cluster. In the event of a disaster, the secondary cluster can be promoted to the primary cluster and take over the workload seamlessly.
Figure 1: High-level PX-DR solution diagram on VMware Tanzu Kubernetes Grid
Understanding Portworx asynchronous disaster recovery solution
When it comes to disaster recovery planning, organizations need to have a solid strategy in place to ensure business continuity in the event of a disaster. With PX-DR, organizations can build a robust disaster recovery plan for applications running on Tanzu Kubernetes Grid clusters.
Since our clusters are spanned across multiple regions and the network latency between the nodes is high, we will be setting up an asynchronous DR solution offered by Portworx. Here are a few of the advantages of Portworx Async DR:
- Portworx Async DR is a disaster recovery solution that helps organizations protect their critical data in the event of a disaster or outage.
- The solution is asynchronous, which means that data is replicated in the background without impacting production workloads.
- With Portworx Async DR, organizations can replicate their data to a secondary site, which can be in a different geographic location or cloud provider, for added resiliency.
- The solution provides a low recovery point objective (RPO), bounded by the replication interval, and a low recovery time objective (RTO), which means that data can be recovered quickly and with minimal data loss.
- The solution is designed to be cloud native, which means that it can be easily integrated with Kubernetes and other cloud native technologies.
- Portworx Async DR is highly scalable, allowing organizations to protect their data as their workloads grow and evolve.
Defining RPO and RTO requirements for each application is crucial in disaster recovery planning. RPO refers to the maximum acceptable amount of time since the last data recovery point, which essentially means the amount of data loss that a business can tolerate.
On the other hand, RTO refers to the maximum acceptable delay between the interruption of service and restoration of service, which is the amount of time it takes to fully recover from a disaster. Organizations must find the right balance between application availability needs and the cost associated with designing and implementing a solution that meets these requirements.
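As a rough worked example of this trade-off, the worst-case RPO with async DR is bounded by the replication interval plus the time a single incremental migration takes. The numbers below are illustrative assumptions, not measured values:

```shell
# Illustrative only: estimate the worst-case RPO for an async DR setup.
INTERVAL_MIN=15      # assumed SchedulePolicy interval (minutes)
MIGRATION_MIN=5      # assumed duration of one incremental migration (minutes)
WORST_CASE_RPO=$((INTERVAL_MIN + MIGRATION_MIN))
echo "Worst-case RPO: ${WORST_CASE_RPO} minutes"
```

If the business can only tolerate losing, say, 10 minutes of data, the replication interval must shrink accordingly, or a synchronous DR topology should be considered instead.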
Installing Portworx on Tanzu Kubernetes Grid
Now that we have learned about Portworx and why it is used, let’s take a look at how to install it on Tanzu workload clusters. The following steps must be run on both the source and destination clusters in order to install a Portworx cluster on each.
Prerequisites
- VMware software-defined data center (SDDC)
- Two Tanzu workload clusters deployed across multiple regions
- Portworx Enterprise license activated on both the clusters
- Object storage that should be accessible from both clusters
Steps to install the Portworx cluster
- Export the following environment variables:
  - Kubernetes version of the Tanzu Kubernetes cluster
  - Portworx version to be installed
  - Namespace where Portworx needs to be installed
  - A name for the Portworx cluster
export KBVER=$(kubectl version --short | awk -F'[v+_-]' '/Server Version: / {print $3}')
export PXVER="2.13.0"
export NAMESPACE="portworx"
export PX_CLUSTERNAME="<provide a name for portworx cluster>"
- Label the worker nodes where the Portworx cluster nodes need to be installed. For proof-of-concept purposes, all worker nodes where Portworx nodes should be created are labeled. This is needed to make sure the Portworx pods are deployed on these selected worker nodes. For each worker node add the label px/enabled=true.
# List all the K8s cluster nodes
kubectl get nodes

# Label the nodes
kubectl label node <Worker Node Name> px/enabled=true --overwrite=true
- Portworx needs a key-value database (KVdb) to store the metadata information. It creates an internal KVdb on each of the worker nodes during the installation. Add the label px/metadata-node=true to all the worker nodes in order to install KVdb on the worker nodes.
kubectl label node <Worker Node Name> px/metadata-node=true --overwrite=true
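On clusters with many workers, both labels can be applied in a single loop. This is a sketch with hypothetical node names; the kubectl stub on the first line only echoes the commands so the loop can be traced offline, and must be removed to label real nodes:

```shell
# Stub kubectl so the loop can be dry-run offline; delete this line on a real cluster.
kubectl() { echo "kubectl $*"; }

# Hypothetical worker node names; on a real cluster, derive them from 'kubectl get nodes'.
for node in worker-node-1 worker-node-2 worker-node-3; do
  kubectl label node "$node" px/enabled=true px/metadata-node=true --overwrite=true
done
```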
- Installation of Portworx components can be done in multiple ways, including Helm chart or operator-based installation. For the current setup, an operator-based installation is used. Install the operator and the necessary custom resource definitions (CRDs) using the following command.
kubectl apply -f "https://install.portworx.com/${PXVER}?comp=pxoperator&kbver=${KBVER}&ns=${NAMESPACE}"
- Once the operator is installed and running, verify the status of the operator pod.
kubectl get pods -n portworx
- The Portworx cluster configuration is specified by a Kubernetes CRD called StorageCluster, which is an object that acts as the definition of the Portworx cluster. When modifying the StorageCluster object, the operator will update the Portworx cluster in the background. There are multiple ways to create the spec file. We can make use of the PX-Central console to generate the StorageCluster spec based on custom user inputs.
- Below is the StorageCluster spec obtained from the PX-Central console. Only a few of the fields in the StorageCluster need to be manually added—such as annotations according to the solution, which we will be implementing. For more details on the StorageCluster CRD spec and how to add additional fields to the existing spec, refer to the Portworx documentation.
kind: StorageCluster
apiVersion: core.libopenstorage.org/v1
metadata:
  name: px-workload-cluster
  namespace: portworx
  annotations:
    portworx.io/service-type: portworx-api:LoadBalancer  # (1)
spec:
  image: portworx/oci-monitor:2.13.0
  imagePullPolicy: IfNotPresent
  kvdb:
    internal: true
  cloudStorage:
    deviceSpecs:
    - sc=default,size=150  # (2)
    journalDeviceSpec: auto  # (3)
    kvdbDeviceSpec: sc=default,size=44  # (4)
  secretsProvider: k8s
  stork:
    enabled: true
    args:
      webhook-controller: "true"
  autopilot:
    enabled: true
  csi:
    enabled: true
  monitoring:
    telemetry:
      enabled: true
    prometheus:
      enabled: true
      exportMetrics: true
      alertManager:
        enabled: true
  env:
  - name: ENABLE_CSI_DRIVE
    value: "true"
  - name: "PRE-EXEC"
    value: "iptables -A INPUT -p tcp --match multiport --dports 9001:9020 -j ACCEPT"
In the above specification:
1. We set the annotation in the StorageCluster object to make sure the Portworx API service is exposed through a LoadBalancer.
2. The storage class name and the storage size for each Portworx cluster node are defined. For example, for a three-node Portworx cluster, a total storage pool of 450 GB will be created.
3. journalDeviceSpec is set to "auto"; this parameter specifies the device or partition where Portworx stores its journal data. The journal is a critical component of Portworx and is used to ensure the consistency of data written to the storage cluster.
4. This field sets the StorageClass name and storage size for the Portworx KVdb, which stores metadata and configuration information related to the StorageCluster. Since .spec.kvdb.internal is set to true, Portworx creates a KVdb server during the installation of the cluster. Alternatively, an external KVdb can be used to store this metadata and configuration information.
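As a quick sanity check of item 2 above, the aggregate storage pool is simply the per-node device size multiplied by the number of storage nodes:

```shell
# Pool-size arithmetic for the spec above: 3 nodes, each with a 150 GB device.
NODES=3
SIZE_GB=150
TOTAL_GB=$((NODES * SIZE_GB))
echo "Total storage pool: ${TOTAL_GB} GB"
```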
- Apply the manifest file and wait until the StorageCluster status shows as Online. It might take 25–30 minutes for the installation to be completed.
kubectl -n portworx get storagecluster
NAME                  CLUSTER UUID     STATUS   VERSION   AGE
px-workload-cluster   <Cluster UUID>   Online   2.13.0    13d
- Check the Portworx cluster status, including the StoragePool size and other details. If the cluster setup completed successfully, the status for all the Portworx storage nodes will show as Online.
export NAMESPACE="portworx"
export PX_POD=$(kubectl get pods -l name=portworx -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
kubectl exec $PX_POD -n $NAMESPACE -- /opt/pwx/bin/pxctl status
- Activate the PX Enterprise License by running the following:
kubectl exec -n portworx <portworx-pod> -- /opt/pwx/bin/pxctl license activate "<enterprise activation code>"
- Repeat the same steps in the destination cluster (portworx-dr-cluster). Make sure you provide a different name for the StorageCluster object to differentiate the objects created in the two clusters.
Portworx Central
Portworx Central (PX-Central) is an on-premises management solution for Portworx Enterprise that provides simplified monitoring, enhanced security, and data insights, enabling administrators to manage and monitor Portworx clusters and their resources deployed across multiple clusters from a single pane of glass. Please refer to the documentation to set up PX-Central.
By installing PX-Central on the Tanzu shared services cluster, it is easy to manage the Portworx clusters deployed in Site A and Site B. The Tanzu shared services cluster is used to deploy shared services such as Harbor and Contour; monitoring and logging tools can also be installed there, and these centralized services can monitor the workload clusters.
Configuring Portworx Async DR on Tanzu Kubernetes Grid
Before getting started, ensure that all the prerequisites are met and the Portworx installation is complete. Please refer to the previous section for the setup.
The following terms are used frequently when configuring Portworx PX-DR for asynchronous disaster recovery. Each is described in detail before it is created:
- Cluster-wide-encryption key
- Object store credentials
- Stork
- ClusterPair
- SchedulePolicy, MigrationSchedule, Migrations
In the below configurations, the two Tanzu workload clusters present in Site A and Site B will be referred to as:
- Source Cluster – (portworx-workload-cluster in Site A)
- Destination Cluster – (portworx-dr-cluster in Site B)
Let’s get started!
- Create a cluster-wide encryption key in both clusters. This is a common key that is used to encrypt all volumes created using a Portworx StorageClass.
# Source Cluster and Destination Cluster
export encryptionKey="encryption*key*1234"
kubectl -n portworx create secret generic px-vol-encryption \
  --from-literal=cluster-wide-secret-key=$encryptionKey
secret/px-vol-encryption created
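The key above is hard-coded for demo purposes. For real deployments, a randomly generated key is preferable; below is one possible sketch, assuming a Linux-like environment where /dev/urandom and od are available:

```shell
# Generate a random 32-character hex key instead of hard-coding one.
encryptionKey=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "Generated a ${#encryptionKey}-character key"
```

The resulting value would then be stored in the px-vol-encryption secret exactly as shown above.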
- Set the cluster key on both the clusters.
# Source Cluster and Destination Cluster
kubectl exec -n portworx <portworx-pod> -- /opt/pwx/bin/pxctl secrets set-cluster-key \
  --secret cluster-wide-secret-key
- In the destination cluster (portworx-dr-cluster), obtain the cluster UUID, which is required to create object store credentials.
# Destination Cluster
kubectl exec <portworx-pod> -n portworx -- /opt/pwx/bin/pxctl status | grep UUID | awk '{print $3}'
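To show what that pipeline extracts, here is the same grep/awk logic run offline against a sample line mimicking pxctl status output (the UUID shown is made up):

```shell
# 'Cluster UUID: <value>' -- the third whitespace-separated field is the UUID.
SAMPLE="Cluster UUID: 1b2c3d4e-5f60-7a8b-9c0d-1e2f3a4b5c6d"
echo "$SAMPLE" | grep UUID | awk '{print $3}'
```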
Creation of object store credentials
Object store credentials in Portworx are used to authenticate and authorize access to object storage providers, such as Amazon Simple Storage Service (Amazon S3), Google Cloud Storage, or Microsoft Azure Blob Storage. Object store credentials consist of an endpoint URL, an access key, and a secret key, which are used to establish a secure connection between Portworx and the object storage provider.
- Create object store credentials in both the source and destination clusters after obtaining the destination cluster UUID above. Run the same command inside the Portworx pod in both clusters. The options for creating object store credentials differ between object store types. Please refer to the documentation for more information.
Note: The below step could be used to create the credentials if the object store is Amazon S3 or Minio.
# Source Cluster and Destination Cluster
kubectl exec <px-workload-cluster-pod-name> -n portworx -- /opt/pwx/bin/pxctl credentials create \
  --provider s3 \
  --s3-access-key <access-key> \
  --s3-secret-key <secret-key> \
  --s3-region us-east-1 \
  --s3-endpoint <object store endpoint> \
  --s3-storage-class STANDARD \
  ClusterPair_<UUID of Destination Cluster>

Output:
Defaulted container "portworx" out of: portworx, csi-node-driver-registrar
Credentials created successfully, UUID: <UUID will be shown>
- Verify that the credentials were successfully created in both clusters using the below command.
# Source Cluster and Destination Cluster
kubectl exec <portworx-pod-name> -n portworx -- /opt/pwx/bin/pxctl credentials list
Defaulted container "portworx" out of: portworx, csi-node-driver-registrar
- Switch to the second cluster context to generate the ClusterPair.
Generating ClusterPair spec on Destination Cluster and applying it in Source Cluster
ClusterPair is an object which is used to create trust between source and destination Cluster for communication. The ClusterPair object pairs the Portworx storage driver with the Kubernetes scheduler, allowing volumes and resources to be migrated between clusters.
In order to generate the ClusterPair, we need to download the storkctl binary. It generates a ClusterPair spec based on details of the cluster where Portworx is installed.
- Copy the storkctl binary from any one of the Stork pods. Stork is the storage scheduler used by Portworx for Kubernetes that helps achieve even tighter integration. It allows users to co-locate pods with their data, provides seamless migration of pods in case of storage errors, and makes it easier to create and restore snapshots of Portworx volumes.
kubectl cp -n portworx <stork-pod-name>:/storkctl/linux/storkctl /usr/local/bin/storkctl
- Change the permissions for storkctl binary:
chmod +x /usr/local/bin/storkctl
- Verify if storkctl is correctly installed by running:
storkctl version
storkctl Version: 23.2.0-1f0d6530
- Generate ClusterPair on Destination cluster by running:
# Destination Cluster
storkctl generate clusterpair dr-workload-cluster-cp -n dr-enabled-namespace -o yaml > dr-cluster-clusterpair.yaml
- Edit the YAML file to add the below key values under options. Note that all the values need to be mentioned in double quotes in order for the ClusterPair object to be validated.
- IP – The Portworx API LoadBalancer IP exposed in the destination cluster. Portworx in the source cluster uses this IP to communicate with the Portworx API in the destination cluster.
- Port – The port exposed by the Portworx API service in the destination cluster.
- Token – The token used to establish an authenticated connection with the destination cluster.
- Mode – By default, every seventh migration is a full migration. If you specify mode: DisasterRecovery, then every migration is incremental. When doing a one-time migration (and not DR), skip this option.
  options:
    ip: "<Portworx API LoadBalancer IP of dr-workload cluster>"
    port: "9001"
    token: "<Obtain the cluster token of dr-workload-cluster>"
    mode: "DisasterRecovery"
To obtain the destination cluster token, run the below command in the destination cluster (portworx-dr-cluster).
# Destination Cluster
kubectl -n portworx exec -it <portworx-pod-name> -- /opt/pwx/bin/pxctl cluster token show
Token is <Destination Token will be shown in the output>
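Since unquoted values are a common cause of ClusterPair validation failures, the options block can be generated from shell variables so every value lands in double quotes. The IP and token below are placeholders, not real cluster values:

```shell
# Placeholder inputs; substitute your destination cluster's real values.
PX_LB_IP="203.0.113.10"
PX_PORT="9001"
PX_TOKEN="example-cluster-token"

# Emit the options block with every value double-quoted, as the ClusterPair requires.
OPTIONS=$(cat <<EOF
  options:
    ip: "${PX_LB_IP}"
    port: "${PX_PORT}"
    token: "${PX_TOKEN}"
    mode: "DisasterRecovery"
EOF
)
printf '%s\n' "$OPTIONS"
```

The emitted block can then be pasted under the spec of the generated ClusterPair YAML.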
- Validate the YAML to verify if all the fields are correctly added.
- Apply the ClusterPair object in the source cluster only.
# Source Cluster
kubectl -n dr-enabled-namespace apply -f dr-cluster-clusterpair.yaml
- Check the status of the ClusterPair; it should show as Ready. The ClusterPair might fail to reach the Ready state if:
- The external IP of the Portworx API service on the destination cluster is invalid or is not reachable due to a firewall.
- The token is invalid.
- The Portworx node port is not mentioned correctly.
- The values under the options field in the ClusterPair object are not enclosed in double quotes.
# Source Cluster
kubectl apply -f dr-cluster-clusterpair.yaml
clusterpair.stork.libopenstorage.org/dr-workload-cluster-cp created

storkctl get clusterpair -n dr-enabled-namespace
NAME                     STORAGE-STATUS   SCHEDULER-STATUS   CREATED
dr-workload-cluster-cp   Ready            Ready              25 Apr 23 06:48 UTC
Creation of SchedulePolicy and MigrationSchedules
Once communication is established between the source and destination clusters, we need to create a SchedulePolicy and a MigrationSchedule in the source cluster (px-workload-cluster).
- A SchedulePolicy is an object that describes the intervals at which migrations will be scheduled. The interval depends on the amount of data and on the network speed between the clusters. The SchedulePolicy is a cluster-scoped object.
- A MigrationSchedule is an object used to schedule migrations that migrate Kubernetes resources to the destination cluster based on the SchedulePolicy.
- The MigrationSchedule creates an object called Migration, which tracks the status of every migration triggered based on the SchedulePolicy. Portworx will not try to create another migration if an existing migration is in progress.
Create a SchedulePolicy using the below spec. Here we set an interval of 5 minutes for demo purposes only; Portworx recommends using an interval of at least 15 minutes.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: portwx-workload-cluster-schedule-policy
policy:
  interval:
    intervalMinutes: 5  # For demo purpose only.
EOF
schedulepolicy.stork.libopenstorage.org/portwx-workload-cluster-schedule-policy created

# verify the object
storkctl get schedulepolicy
- In the below MigrationSchedule object, please note:
- Multiple namespace names can be mentioned under the namespaces field; currently we are enabling migration for dr-enabled-namespace.
- The ClusterPair we created in the dr-enabled-namespace is referenced.
- The includeResources and includeVolumes fields include all the Kubernetes resources and volumes present in the namespace.
- startApplications should be set to false so that the workloads are not scaled up in the destination cluster after migration. Note: This is important, since subsequent migrations will fail if the applications are already running in the destination cluster.
- Portworx provides some default SchedulePolicies that can be used, or we can use the custom SchedulePolicy we created above.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: migrationschedule
  namespace: dr-enabled-namespace
spec:
  template:
    spec:
      clusterPair: dr-workload-cluster-cp
      includeResources: true
      startApplications: false
      includeVolumes: true
      namespaces:
      - dr-enabled-namespace
  schedulePolicyName: portwx-workload-cluster-schedule-policy
  suspend: false
  autoSuspend: true
EOF
migrationschedule.stork.libopenstorage.org/migrationschedule created

# verify the object installation
storkctl get migrationschedule -n dr-enabled-namespace

Deploying a sample application in the DR-enabled namespace
- Now create a sample stateful application in the source cluster.
# Source Cluster
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgresql
  namespace: dr-enabled-namespace
  labels:
    app: postgresql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:latest
        env:
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          value: postgres123
        - name: POSTGRES_DB
          value: px-dr-demo
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgresql-data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: postgresql-data
        persistentVolumeClaim:
          claimName: postgresql-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresql-pvc
  namespace: dr-enabled-namespace
spec:
  storageClassName: px-db
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: dr-enabled-namespace
spec:
  selector:
    app: postgresql
  ports:
  - name: postgresql
    port: 5432
    targetPort: 5432
EOF
deployment.apps/postgresql created
persistentvolumeclaim/postgresql-pvc created
service/postgresql created
- Here in the PVC object spec, px-db is the StorageClass used to create Portworx volumes. Use kubectl get sc to list all the available Portworx StorageClasses and choose one based on your requirements.
- Wait until the pod reaches the Running state.
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
postgresql-6796db6bb8-bw2v4   1/1     Running   0          64s
- Insert sample data into the database.
kubectl exec -it <postgresql-pod-name> -- psql -U postgres px-dr-demo
psql (15.2 (Debian 15.2-1.pgdg110+1))

px-dr-demo=# CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(50),
  email VARCHAR(50)
);
INSERT INTO users (name, email) VALUES
  ('John', 'john@example.com'),
  ('Jane', 'jane@example.com'),
  ('Bob', 'bob@example.com');
CREATE TABLE
INSERT 0 3

px-dr-demo=# SELECT * FROM users;
 id | name |      email
----+------+------------------
  1 | John | john@example.com
  2 | Jane | jane@example.com
  3 | Bob  | bob@example.com
(3 rows)
- Wait for the async replication to complete. The migration is initiated based on the interval defined in the SchedulePolicy; the Migration objects can be checked in the source cluster.
# Source Cluster
kubectl get migrations -n dr-enabled-namespace
NAME                                           AGE
migrationschedule-interval-2023-04-25-094456   20s
- As you can see, a migration has been initiated by stork based on MigrationSchedule and SchedulePolicy.
# Source Cluster
storkctl get migrations -n dr-enabled-namespace
NAME                                           CLUSTERPAIR              STAGE     STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                                  TOTAL BYTES TRANSFERRED
migrationschedule-interval-2023-04-25-094456   dr-workload-cluster-cp   Volumes   InProgress   0/1       0/0         25 Apr 23 09:44 UTC   Volumes (53.50854835s) Resources (NA)   0
- Wait for the migration status to show as Successful.
storkctl get migrations -n dr-enabled-namespace
NAME                                           CLUSTERPAIR              STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                        TOTAL BYTES TRANSFERRED
migrationschedule-interval-2023-04-25-101113   dr-workload-cluster-cp   Final   Successful   1/1       5/5         25 Apr 23 10:11 UTC   Volumes (39s) Resources (6s)   43601920
- As shown in the above status, the migration is successful and all the respective objects, along with the PVCs, have been asynchronously migrated to the destination cluster (portworx-dr-cluster).
- Remember that once the migration is completed, the applications remain at zero replicas in the destination cluster; only upon activating the migration on the destination cluster are the replicas increased to the replica count of the workload as defined in the source cluster.
Validating Async DR failover operation
Consider a scenario where the application in the source cluster is inaccessible; we need to activate the migrations in the destination namespace. Activating the migration scales the workloads in the destination cluster up to the exact number of replicas of the application in the source cluster. Post failover, we will verify the data that we inserted into the PostgreSQL database.
In order to perform a failover operation from the source workload cluster to the destination DR cluster, use the following steps:
1. Suspend MigrationSchedule and scale down your application.
# Source Cluster
storkctl suspend migrationschedule migrationschedule -n dr-enabled-namespace
MigrationSchedule migrationschedule suspended successfully

storkctl get migrationschedule migrationschedule -n dr-enabled-namespace
# Suspend should show as true in the output
- The above output indicates the MigrationSchedule has been suspended.
- Verify the objects created in the destination cluster.
# Destination Cluster
kubectl get all
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/postgresql   ClusterIP   100.70.219.188   <none>        5432/TCP   97s

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/postgresql   0/0     0            0           97s

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/postgresql-6d9d8d47f8   0         0         0       97s
- As you can see, the resources have been migrated to the destination cluster’s dr-enabled-namespace. Now scale down the application in the source cluster.
# Source Cluster
kubectl -n dr-enabled-namespace scale deploy postgresql --replicas=0
deployment.apps/postgresql scaled
- Switch to destination cluster context and activate the migration.
# Destination Cluster
storkctl activate migration -n dr-enabled-namespace
Set the ApplicationActivated status in the MigrationSchedule dr-enabled-namespace/migrationschedule to true
Updated replicas for deployment dr-enabled-namespace/postgresql to 1
- Verify the data in the database that we created in the source cluster (portworx-workload-cluster).
# Destination Cluster
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
postgresql-6d9d8d47f8-z7bzj   1/1     Running   0          40s

kubectl exec -it postgresql-6d9d8d47f8-z7bzj -- psql -U postgres px-dr-demo
psql (15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.

px-dr-demo=# SELECT * FROM users;
 id | name |      email
----+------+------------------
  1 | John | john@example.com
  2 | Jane | jane@example.com
  3 | Bob  | bob@example.com
(3 rows)

px-dr-demo=#
The data we created in portworx-workload-cluster in Site A has been successfully replicated to portworx-dr-cluster in Site B. The Portworx Async DR failover is complete.
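For repeatability, the failover sequence above can be captured as a small runbook script. kubectl and storkctl are stubbed here so the order of operations can be traced offline; remove the stubs and run each half against the correct cluster context:

```shell
# Stubs for offline tracing; delete these two lines to run the real commands.
kubectl()  { echo "kubectl $*"; }
storkctl() { echo "storkctl $*"; }

NS="dr-enabled-namespace"

# 1. On the source cluster: stop further migrations, then scale the app down.
storkctl suspend migrationschedule migrationschedule -n "$NS"
kubectl -n "$NS" scale deploy postgresql --replicas=0

# 2. On the destination cluster: activate the migrated applications.
storkctl activate migration -n "$NS"
```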
In conclusion, Portworx DR on VMware Tanzu provides a disaster recovery solution for Kubernetes clusters that is easy to implement and manage. By replicating data and configuration to a secondary cluster located in a different availability zone or region, organizations can ensure business continuity in the face of disasters.