# Placement, Locations, and Scheduling
kcp implements Compute as a Service via the concept of Transparent Multi Cluster (TMC). TMC means
that Kubernetes clusters are attached to a kcp installation to execute workload objects from the
users' workspaces, by syncing these workload objects down to those clusters and the objects' status
back up. This gives the illusion of native compute in kcp.

We call it Compute as a Service because the registered `SyncTarget`s live in workspaces that are
(normally) invisible to the users, and because the teams operating the compute can be different
from the compute consumers.
The APIs used for Compute as a Service are:

- `scheduling.kcp.io/v1alpha1` – we call the outcome of this the *placement* of namespaces.
- `workload.kcp.io/v1alpha1` – responsible for the syncer component of TMC.
## Main Concepts
- `SyncTarget` in `workload.kcp.io/v1alpha1` – representations of Kubernetes clusters that are
  attached to a kcp installation to execute workload objects from the users' workspaces. On a
  Kubernetes cluster, there is one syncer process for each `SyncTarget` object.

  Sync targets are invisible to users, and (medium term) at most identified via a UID. (A sketch of
  a `SyncTarget`, and of a `Location` selecting it, follows this list.)
- `Location` in `scheduling.kcp.io/v1alpha1` – represents a collection of `SyncTarget` objects
  selected via instance labels, and exposes labels (potentially different from the instance labels)
  that users can use to describe, identify, and select locations for placement of their namespaces
  onto sync targets.

  Locations are visible to users, but owned by the compute service team, i.e. read-only to the
  users and only projected into their workspaces for visibility. A placement decision references a
  location by name.

  `SyncTarget`s in a `Location` are transparent to the user. Workloads should be able to move
  seamlessly from one `SyncTarget` to another within a `Location`, based on operational concerns of
  the compute service provider, such as decommissioning a cluster, rebalancing capacity, or an
  outage of a cluster. It is the compute service's responsibility to ensure that, to the user, the
  workloads in a location look like they run on ONE cluster.
- `Placement` in `scheduling.kcp.io/v1alpha1` – represents a selection rule that chooses ONE
  `Location` via location labels and binds the selected location to MULTIPLE namespaces in a user
  workspace. For workspaces with multiple namespaces, users can create multiple `Placement`s to
  assign specific namespaces to specific `Location`s.

  `Placement`s are visible and writable by users. A default `Placement` is automatically created
  when a workload `APIBinding` is created in the user workspace; it randomly selects a `Location`
  and binds it to all namespaces in the workspace. The user can mutate or delete the default
  `Placement`. The corresponding `APIBinding` will be annotated with
  `workload.kcp.io/skip-default-object-creation`, so that the default `Placement` will not be
  recreated upon deletion.
- Compute Service Workspace (previously: Negotiation Workspace) – the workspace owned by the
  compute service team that holds the `APIExport` (named `kubernetes` today) with the synced
  resources, plus the `SyncTarget` and `Location` objects.
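To make the relationship concrete, here is a minimal, hypothetical sketch of a `SyncTarget` and a
`Location` that selects it. The names, labels, and `spec` fields shown are assumptions for
illustration; consult the `workload.kcp.io/v1alpha1` and `scheduling.kcp.io/v1alpha1` APIs of your
kcp version for the authoritative schema:

```yaml
apiVersion: workload.kcp.io/v1alpha1
kind: SyncTarget
metadata:
  name: cluster-1a2b3c       # hypothetical; medium term, sync targets are at most identified via a UID
  labels:
    region: us-east1         # instance label, visible to the compute service team only
spec:
  unschedulable: false       # assumed field: when true, no new namespaces are placed on this target
---
apiVersion: scheduling.kcp.io/v1alpha1
kind: Location
metadata:
  name: aws-us-east
  labels:
    cloud: aws               # label exposed to users for selection via Placement objects
spec:
  resource:                  # assumed field: the resource kind this Location groups
    group: workload.kcp.io
    resource: synctargets
    version: v1alpha1
  instanceSelector:          # selects the SyncTarget above via its instance labels
    matchLabels:
      region: us-east1
```

The `cloud: aws` label on the `Location` is what the placement examples further below match on; the
`region` instance label never leaves the compute service workspace.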
The user binds to the `APIExport` called `kubernetes` using an `APIBinding`. From this moment on,
the user's workspaces are subject to placement.
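For illustration, such an `APIBinding` might look like the following sketch. The reference format
has changed across kcp versions and the workspace path is hypothetical, so treat this as a shape,
not an exact manifest:

```yaml
apiVersion: apis.kcp.io/v1alpha1
kind: APIBinding
metadata:
  name: kubernetes
  annotations:
    # set once the default Placement has been created, so that it is not
    # recreated after the user deletes it (the value shown is an assumption)
    workload.kcp.io/skip-default-object-creation: "true"
spec:
  reference:
    export:
      path: root:compute-service   # hypothetical path of the compute service workspace
      name: kubernetes             # the APIExport holding the synced resources
```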
**Note:** Binding to a compute service is a permanent decision. Unbinding (i.e. deleting the
`APIBinding` object) means deletion of the workload objects.
It is planned to allow multiple location workspaces for the same compute service, even with different owners.
## Placement and resource scheduling
The placement state is one of the following (see the status sketch after this list):

- `Pending` – the placement controller is waiting for a valid `Location` to select.
- `Bound` – at least one namespace is bound to the placement. When the user updates the spec of the
  `Placement`, the selected location of the placement will be changed while in the `Bound` state.
- `Unbound` – a location is selected by the placement, but no namespace is bound to the placement.
  When the user updates the spec of the `Placement`, the selected location of the placement will be
  changed while in the `Unbound` state.
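As a rough sketch, this state is reflected in the `Placement` status along the following lines; the
field names are assumptions based on the `scheduling.kcp.io/v1alpha1` API and may differ between
kcp versions:

```yaml
status:
  phase: Bound                   # Pending, Unbound, or Bound
  selectedLocation:              # set once a valid Location has been chosen
    path: root:compute-service   # hypothetical workspace path of the Location
    locationName: aws-us-east
  conditions:
  - type: Ready
    status: "True"
```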
**Note:** Sync targets from different locations can be bound at the same time, while each location
can have only one sync target bound to a given namespace.
The user interface to influence the placement decisions is the `Placement` object. For example, a
user can create a placement that binds namespaces with the label `app=foo` to a location with the
label `cloud=aws`, as below:
```yaml
apiVersion: scheduling.kcp.io/v1alpha1
kind: Placement
metadata:
  name: aws
spec:
  locationSelectors:
  - matchLabels:
      cloud: aws
  namespaceSelector:
    matchLabels:
      app: foo
  locationWorkspace: root:default:location-ws
```
A matching location will be selected for this `Placement` first, which turns the `Placement` from
`Pending` to `Unbound`. Then, if there is at least one matching namespace, the namespace is
annotated with `scheduling.kcp.io/placement` and the placement turns from `Unbound` to `Bound`.
After this, a `SyncTarget` will be selected from the location picked by the placement. If a valid
`SyncTarget` is selected, a `state.workload.kcp.io/<cluster-id>` label with the value `Sync` is set
on the namespace.
The user can create another placement targeting a different location for this namespace, e.g.
```yaml
apiVersion: scheduling.kcp.io/v1alpha1
kind: Placement
metadata:
  name: gce
spec:
  locationSelectors:
  - matchLabels:
      cloud: gce
  namespaceSelector:
    matchLabels:
      app: foo
  locationWorkspace: root:default:location-ws
```
which will result in another `state.workload.kcp.io/<cluster-id>` label being added to the
namespace, leaving the namespace with two different `state.workload.kcp.io/<cluster-id>` labels.
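Assuming the two placements selected sync targets with the (hypothetical) cluster IDs `1a2b3c` and
`4d5e6f`, the namespace would carry labels along these lines:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    app: foo
    state.workload.kcp.io/1a2b3c: Sync   # via the "aws" placement's location
    state.workload.kcp.io/4d5e6f: Sync   # via the "gce" placement's location
  annotations:
    scheduling.kcp.io/placement: ""      # placement marker; the value format is internal
```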
The placement is in the `Ready` status condition when:

- the selected location matches the `Placement` spec, and
- the selected location exists in the location workspace.
## Sync target removal
A sync target will be removed when:

- the corresponding `Placement` is deleted,
- the corresponding `Placement` is not in the `Ready` condition, or
- the corresponding `SyncTarget` is evicting, not ready, or deleted.
All of the above cases make the `SyncTarget` recorded in the `state.workload.kcp.io/<cluster-id>`
label invalid, which causes a `deletion.internal.workload.kcp.io/<cluster-id>` annotation, carrying
the removal time in RFC-3339 format, to be added to the namespace.
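For example, a namespace being drained from a (hypothetical) sync target `1a2b3c` would be
annotated roughly like this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    state.workload.kcp.io/1a2b3c: Sync                                # still set until removal completes
  annotations:
    deletion.internal.workload.kcp.io/1a2b3c: "2023-01-02T15:04:05Z"  # removal time, RFC-3339
```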
## Resource Syncing
As soon as the `state.workload.kcp.io/<cluster-id>` label is set on the namespace, the workload
resource controller copies that label to the resources in the namespace.
**Note:** In the future, the label on the resources will first be set to the empty string `""`, and
a coordination controller will be able to apply changes before syncing starts. This includes the
ability to add per-location finalizers through the `finalizers.workload.kcp.io/<cluster-id>`
annotation, such that the coordination controller gets full control over the downstream life-cycle
of the objects per location (imagine an ingress that blocks downstream removal until the new
replicas have been launched on another sync target). Finally, the coordination controller will
replace the empty string with `Sync`, so that the state machine continues.
With the state label set to `Sync`, the syncer starts seeing the resources in the namespace and
begins syncing them downstream, first by creating the namespace. Before syncing an object, it also
sets the finalizer `workload.kcp.io/syncer-<cluster-id>` on the upstream object, in order to delay
upstream deletion until the downstream object has been deleted.
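Putting the pieces together, an upstream object being synced to a (hypothetical) sync target
`1a2b3c` would look roughly like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  namespace: my-app
  labels:
    state.workload.kcp.io/1a2b3c: Sync   # copied from the namespace by the workload resource controller
  finalizers:
  - workload.kcp.io/syncer-1a2b3c        # set by the syncer to delay upstream deletion
spec: {}                                 # elided; the user's deployment spec goes here
```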
When the `deletion.internal.workload.kcp.io/<cluster-id>` annotation is added to the namespace, the
virtual workspace apiserver translates that annotation into a deletion timestamp on the object the
syncer sees. The syncer treats this as the start of a deletion flow. As soon as no coordination
controller finalizers are registered via the `finalizers.workload.kcp.io/<cluster-id>` annotation
anymore, the syncer starts deleting the downstream object.
When the downstream deletion is complete, the syncer removes the finalizer from the upstream
object, and the `state.workload.kcp.io/<cluster-id>` label is deleted as well. The syncer stops
seeing the object in the virtual workspace.
**Note:** There is a missing bit in the implementation (as of v0.5) around removal of the
`state.workload.kcp.io/<cluster-id>` label from namespaces: the syncer does not yet participate in
the namespace deletion state machine, but it has to, and it must signal finished downstream
namespace deletion by removing the `state.workload.kcp.io/<cluster-id>` label.
For more information on the upsync use case for storage, refer to the storage doc.
## Resource Upsyncing
In most cases kcp is the source for resources synced to a `SyncTarget`. In some cases, however, kcp
needs to receive a resource that was provisioned by a controller on the `SyncTarget`. This is the
case for storage PersistentVolumes (PVs), which are created on the `SyncTarget` by a CSI driver.
Unlike the `Sync` state, the `Upsync` state is exclusive: only a single `SyncTarget` can be the
source of truth for an upsynced resource. In addition, other `SyncTarget`s cannot sync the resource
down while it is being upsynced.
A resource coordination controller is responsible for changing the
`state.workload.kcp.io/<cluster-id>` label to drive the different flows on the resource. A resource
can be changed from `Upsync` to `Sync` in order to share it across `SyncTarget`s. The coordination
controller applies this change when needed; the original syncer detects the change and stops
upsyncing the resource, and all sync targets involved end up in the `Sync` state.
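As a sketch, an upsynced PersistentVolume provisioned by a CSI driver on a (hypothetical) sync
target `1a2b3c` would carry the exclusive `Upsync` state:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-1a2b3c-0001                    # hypothetical; created on the sync target, upsynced into kcp
  labels:
    state.workload.kcp.io/1a2b3c: Upsync  # exactly one sync target may be the source of truth
spec: {}                                  # elided; capacity, volume source, etc.
```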