Storage and stateful applications
Overview
KCP provides a control plane that implements the concept of Transparent Multi Cluster (TMC) for compute, network, and storage. To give the illusion of transparent storage, KCP exposes the same Kubernetes storage APIs (PVC/PV), so users and workloads do not need to be aware of the coordination performed by the control plane behind the scenes.
Placement for storage in KCP uses the same concepts used for compute: "`SyncTargets` in a `Location` are transparent to the user, and workloads should be able to seamlessly move from one `SyncTarget` to another within a `Location`, based on operational concerns of the compute service provider, like decommissioning a cluster, rebalancing capacity, or due to an outage of a cluster. It is the compute service's responsibility to ensure that for workloads in a location, to the user it looks like ONE cluster."
KCP will provide the basic controllers and coordination logic for moving volumes, as efficiently as possible, using the underlying storage topology and capabilities. It will use the `SyncTargets`' storage APIs to manage volumes, and will not require direct access from the control plane to the storage itself. For more advanced or custom solutions, KCP will allow external coordinators to take over.
Main concepts
- Transparent multi-cluster - describes the TMC concepts.
- Placement, Locations and Scheduling - describes the KCP APIs and mechanisms used to control compute placement, which will be used for storage as well. Refer to the concepts of `SyncTarget`, `Location`, and `Placement`.
- Kubernetes storage concepts - documentation of the storage APIs in Kubernetes.
- Persistent Volumes - PVCs are the main storage APIs used to request storage resources for applications. PVs are invisible to users, are used by administrators or privileged controllers to provision storage for user claims, and will be coordinated by KCP to support transparent multi-cluster storage.
- Kubernetes CSI - The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads. The list of drivers provides a "menu" of storage systems integrated with Kubernetes and their properties.
- StatefulSets volumeClaimTemplates - workload definition used to manage “sharded” stateful applications. Specifying `volumeClaimTemplates` in the StatefulSet spec will provide stable storage by creating a PVC per instance (see the sketch after this list).
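To make the last item concrete, here is a minimal `StatefulSet` sketch with `volumeClaimTemplates`; the names, image, and sizes are hypothetical placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                     # hypothetical application name
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: registry.example.com/db:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:        # one PVC is created per replica: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```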
Volume types
Each physical-cluster (aka "pcluster") brings its own storage to multi-cluster environments, and in order to make efficient coordination decisions, KCP will identify the following types:
Shared network-volumes
These volumes are provisioned from an external storage system that is available to all/some of the pclusters over an infrastructure network. These volumes are typically provided by a shared-filesystem (aka NAS), with access-mode of ReadWriteMany (RWX) or ReadOnlyMany (ROX). A shared volume can be used by any pod from any pcluster (that can reach it) at the same time. The application is responsible for the consistency of its data (for example with eventual consistency semantics, or stronger synchronization services like zookeeper). Examples of such storage are generic-NFS/SMB, AWS-EFS, Azure-File, GCP-Filestore, CephFS, GlusterFS, NetApp, GPFS, etc.
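For example, a workload would request such a shared volume with a claim like the following; the class name is a hypothetical placeholder for a class backed by a shared filesystem:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany              # RWX: mountable by pods on any pcluster that can reach the storage
  storageClassName: shared-nfs   # hypothetical class backed by a shared filesystem (e.g. NFS)
  resources:
    requests:
      storage: 50Gi
```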
Owned network-volumes
These volumes are provisioned from an external storage system that is available to all/some of the pclusters over an infrastructure network. However, unlike shared volumes, owned volumes require that only a single node/pod mounts the volume at a time. These volumes are typically provided by a block-level storage system, with an access-mode of ReadWriteOnce (RWO) or ReadWriteOncePod (RWOP). It is possible to move the ownership between pclusters (that have access to that storage) by detaching from the current owner and then attaching to the new owner. However, this must guarantee a single owner to prevent data inconsistencies or corruption, and must work even if the owning pcluster is offline (see forcing detach with “fencing” below). Examples of such storage are AWS-EBS, Azure-Disk, Ceph-RBD, etc.
Internal volumes
These volumes are provisioned inside the pcluster itself, and rely on its internal resources (aka hyper-converged or software-defined storage). This means that the availability of the pcluster also determines the availability of the volume. In some systems these volumes are bound to a single node in the pcluster, because the storage is physically attached to a host. However, advanced clustered/distributed systems make efforts to overcome temporary and permanent node failures by adding data redundancy over multiple nodes. These volumes can have any type of access-mode (RWO/RWOP/RWX/ROX), but their strong dependency on the pcluster itself is the key difference from network volumes. Examples of such storage are host-path/local-drives, TopoLVM, Ceph-rook, Portworx, OpenEBS, etc.
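As a sketch of how such storage is typically exposed on a pcluster, here is a statically provisioned local volume with delayed binding; the class name, host path, and node name are hypothetical, and software-defined systems (e.g. TopoLVM, Rook) would use their own provisioners instead:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-internal                         # hypothetical class name
provisioner: kubernetes.io/no-provisioner      # static local volumes; SDS systems use their own CSI provisioner
volumeBindingMode: WaitForFirstConsumer        # delay binding until a pod is scheduled to a node
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-internal
  local:
    path: /mnt/disks/ssd0                      # placeholder host path
  nodeAffinity:                                # the volume only exists on this node of the pcluster
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["worker-1"]             # placeholder node name
```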
Topology and locations
Regular topology
A regular storage topology is one where every `Location` is defined so that all of its `SyncTargets` are connected to the same storage system. This makes it trivial to move network volumes transparently between `SyncTargets` inside the same location.
Multi-zone cluster
A more complex topology is one where pclusters contain nodes from several availability zones, for the sake of being resilient to a zone failure. Since volumes are bound to a single zone (the one where they were provisioned), a volume will not be able to move to a `SyncTarget` that has no nodes in that zone. This is fine if all the `SyncTargets` of the `Location` span the same set of zones, but if the zones are different, or the capacity per zone is too limited, copying to another zone might be necessary.
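To illustrate the zone binding, a dynamically provisioned PV usually carries node affinity for the zone it was created in; the driver, volume handle, and zone below are hypothetical placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-5f2c1a7e                 # hypothetical generated PV name
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: block.csi.example.com    # placeholder block-storage driver
    volumeHandle: vol-0abc123        # placeholder volume ID
  nodeAffinity:                      # pods using this PV must run on nodes in this zone
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
```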
Internal volumes
Internal volumes are always confined to one pcluster, which means the data has to be copied outside of the pcluster continuously to keep the application available even in the case where the pcluster fails entirely (network split, region issue, etc.). This is similar to how DR solutions work between locations.
Disaster recover between locations
A regular Disaster Recovery (DR) topology will create pairs of `Locations` so that one is “primary” and the other is “secondary” (sometimes this relation is mutual). For volumes to be able to move between these locations, their storage systems would need to be configured to mirror/replicate/backup/snapshot (whichever approach is most appropriate for the case) every volume to its secondary. With such a setup, KCP would need to be able to map between the volumes on the primary and the secondary, so that it could fail over, move workloads to the secondary, and reconnect them to the last copied volume state. See more in the DR section below.
Provisioning volumes
Volume provisioning in Kubernetes involves the CSI controllers and sidecars, as well as a custom storage driver. It reconciles PVCs by dynamically creating a PV for each PVC and binding them together. This process depends on the CSI driver running on the `SyncTarget` compute resources, and would not be able to run in KCP workspaces. Instead, KCP will pick a designated `SyncTarget` for the workload placement, which will include the storage claims (PVCs), and the CSI driver on that `SyncTarget` will perform the storage provisioning.

In order to support changing workload placement over time, even if the provisioning `SyncTarget` is offline, KCP will have to retrieve the volume information from that `SyncTarget` and keep it in the KCP workspace for future coordination. The volume information inside the PV is expected to be transferable between `SyncTargets` that connect to the same storage system and drivers, although some transformations will be required.

To retrieve the volume information and maintain it in KCP, a special sync state is required that will sync UP the PV from a `SyncTarget` to KCP. This state is referred to as `Upsync` - see Resource Upsyncing.

The provisioning flow includes: (A) the PVC is synced to the `SyncTarget`, (B) CSI provisioning runs on the pcluster, (C) the Syncer detects the PVC binding and initiates the PV `Upsync`. Transformations are applied in the KCP virtual workspace to make sure that the PVC and PV appear bound in KCP, just as they would in a single cluster. Once provisioning itself is complete, the coordination logic will switch to the normal `Sync` state, to allow multiple `SyncTargets` to share the same volume, and for owned volumes to move ownership to another `SyncTarget`.
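A hedged illustration of the end state, assuming a shared-filesystem driver: after `Upsync`, the KCP workspace holds a PV carrying the CSI volume information discovered on the provisioning `SyncTarget`, appearing bound to the user's PVC. Driver, handle, and names are hypothetical, and KCP-specific labels/annotations are omitted:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-9b1d4c2e                      # hypothetical name generated by the downstream provisioner
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain   # kept as Retain while KCP coordinates placement
  storageClassName: shared-nfs            # hypothetical class
  csi:
    driver: nfs.csi.k8s.io                # example shared-filesystem driver
    volumeHandle: nfs-server.example.com/export/pvc-9b1d4c2e   # placeholder handle
  claimRef:                               # rewritten so the PV appears bound to the workspace PVC
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: default
    name: shared-data
status:
  phase: Bound                            # how the upsynced object would appear in the workspace
```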
Moving shared volumes
A shared volume can easily move to any `SyncTarget` in the same `Location` by syncing the PVC and PV together, so that they bind only to each other on the pcluster. Syncing will transform their mutual references so that `PVC.volumeName = PV.name` and `PV.claimRef = { PVC.name, PVC.namespace }` are set appropriately for the `SyncTarget`, since the downstream `PVC.namespace` and `PV.name` will not be the same as upstream.

Moving volumes will always set the volume's `reclaimPolicy` to `Retain`, to avoid unintended deletion by any one of the `SyncTargets` while others use it. Once deletion of the upstream PVC is initiated, the coordination controller will transform the `reclaimPolicy` to `Delete` for one of the `SyncTargets`. See more in the section on deleting volumes.
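A hedged sketch of what the synced pair could look like on a particular `SyncTarget` after these transformations; the downstream namespace and PV name are hypothetical placeholders, since the exact naming scheme is up to the syncer:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
  namespace: kcp-0123456789ab             # placeholder downstream namespace (differs from upstream)
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: shared-nfs            # hypothetical class
  volumeName: kcp-shared-data-pv          # PVC.volumeName = downstream PV.name
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kcp-shared-data-pv                # placeholder downstream PV name (differs from upstream)
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain   # Retain while other SyncTargets may still use the volume
  storageClassName: shared-nfs
  csi:
    driver: nfs.csi.k8s.io                # example driver
    volumeHandle: nfs-server.example.com/export/pvc-9b1d4c2e   # same underlying volume handle
  claimRef:                               # PV.claimRef = { downstream PVC namespace, name }
    namespace: kcp-0123456789ab
    name: shared-data
```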
Moving owned volumes
TBD - this section is a work in progress...
Detach from owner
Owned volumes require that at most one pcluster uses them at any given time. As placement changes, the coordination controller is responsible for serializing the state changes of the volume so that its ownership moves safely. First, it will detach the volume from the current owner and wait for it to acknowledge that the volume was successfully removed, and only then will it sync the volume to the new target.
Forcing detach with fencing
However, in case the owner is not able to acknowledge that it detached the volume, a forced-detach flow might be possible. The storage system has to support a CSI extension for network fencing, effectively blocking an entire pcluster from accessing the storage until fencing is removed. Once the failed pcluster recovers, and can acknowledge that it detached from the moved volumes, fencing is removed from the storage and that pcluster can recover the rest of its workloads (a hedged sketch follows the references below).
- kubernetes-csi-addons
- NetworkFence (currently implemented only by ceph-csi).
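A hedged sketch of a fence request using the kubernetes-csi-addons `NetworkFence` API; the API version, driver, CIDRs, secret, and parameters below are assumptions and may differ across csi-addons versions and drivers:

```yaml
apiVersion: csiaddons.openshift.io/v1alpha1   # assumed csi-addons API group/version
kind: NetworkFence
metadata:
  name: fence-failed-pcluster                 # hypothetical name
spec:
  driver: rbd.csi.ceph.com                    # fencing is currently implemented by ceph-csi
  fenceState: Fenced                          # set to Unfenced once the pcluster recovers
  cidrs:
    - 10.0.8.0/24                             # placeholder: node network of the failed pcluster
  secret:
    name: csi-fence-secret                    # placeholder credentials for the storage system
    namespace: csi-addons-system
  parameters:
    clusterID: ceph-cluster-1                 # placeholder driver-specific parameters
```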
Storage classes
TBD - this section is a work in progress...
Storage classes can be thought of as templates for PVs, which allow pclusters to support multiple storage providers, or to configure different policies for the same provider. Just like PVs, storage classes are invisible to users. However, users may choose a storage class by name when specifying their PVCs. When the storage class field is left unspecified (which is common), the pcluster will use its default storage class. However, the default storage class is somewhat limited for multi-tenancy, because there is only one default class for the entire pcluster.
Matching storage classes between `SyncTargets` in the same `Location` would be a simple way to ensure that storage can be moved transparently. However, KCP should be able to verify that the storage classes match across the `Location`, and warn when this is not the case, to prevent future issues.
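For instance, a class like the following would need to exist with the same name and an equivalent provisioner/parameters on every `SyncTarget` in the `Location` for claims referencing it to move transparently; the names and parameters are hypothetical:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-nfs                    # must match across SyncTargets in the Location
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # per-pcluster default; only one per pcluster
provisioner: nfs.csi.k8s.io           # example shared-filesystem driver
reclaimPolicy: Delete                 # resulting PVs are overridden to Retain while KCP coordinates
parameters:
  server: nfs-server.example.com      # placeholder driver-specific parameters
  share: /export
```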
Open questions
- How to match classes and make sure the same storage system is used in the location?
- How to support multiple classes per pcluster (e.g. RWO + RWX)?
  - Maybe a separate `SyncTarget` per class?
- Can we have a separate default class per workspace?
Deleting volumes
TBD - this section is a work in progress...
Persistent-volume reclaiming allows configuring how volumes behave when they are reclaimed. By default, storage classes will apply `reclaimPolicy: Delete` to dynamically provisioned PVs, unless `Retain` is explicitly specified. This means that volumes that were provisioned will also get de-provisioned, and their storage will be deleted. However, admins can modify the class to `Retain` volumes, and invoke cleanup on their own schedule.
While moving volumes, either shared or owned, the volume's `reclaimPolicy` will be set to `Retain` to prevent any `SyncTarget` from releasing the volume storage on scheduling changes.
Once the PVC is marked for deletion on KCP, the coordination controller will first pick one `SyncTarget` as owner (or use the current owner for owned volumes), make sure to remove all sharers, and wait for their sync state to be cleared. Then it will set the owner's volume `reclaimPolicy` to `Delete` so that it will release the volume storage.
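A minimal sketch of that final step, assuming the coordination controller simply patches the chosen owner's copy of the PV (the patch mechanism itself is an assumption):

```yaml
# merge patch applied to the owner SyncTarget's PV once all sharers are removed;
# deleting the downstream PVC then de-provisions the underlying storage
spec:
  persistentVolumeReclaimPolicy: Delete
```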
Setting a PV to `Retain` on KCP itself should also be respected by the controllers, allowing manual cleanup of the volume in KCP instead of automatic cleanup together with the PVC.
Copying volumes
TBD - this section is a work in progress...
Disaster recovery
TBD - this section is a work in progress...
- Pairing locations so that storage is continuously replicated between them.
- KCP would have to be able to map primary volumes to secondary volumes to fail over workloads between locations.
Examples
TBD - this section is a work in progress...
Shared NFS storage
- NFS server running in every location, external to the `SyncTargets`, but available over the network.
- Note that high-availability and data-protection of the storage itself are out of scope and would be handled by the storage admin or provided by enterprise products.
- Workloads use volumes with RWX access-mode.
- KCP picks one `SyncTarget` to be the provisioner and syncs up the volume information.
- After provisioning completes, sync down to any `SyncTarget` in the `Location` where the workload is placed, to allow moving transparently as needed when clusters become offline or drained.
- Once the PVC is deleted, the deletion of the volume itself is performed by one of the `SyncTargets`.
Roadmap
- Moving owned volumes
- Fencing
- Copy-on-demand
- Copy-continuous
- DR-location-pairing and primary->secondary volume mapping
- StatefulSets
- COSI Bucket + BucketAccess