Cluster Failover Process Analysis
Let's analyze the Karmada failover feature.
Add taints on the faulty cluster
After a cluster's status becomes unhealthy, a taint with effect NoSchedule will be added to the cluster as follows (a sketch of the resulting Cluster object is shown after the list):
- When the cluster's Ready condition changes to False, the Karmada controller will add the following taint to the target Cluster object:
  key: cluster.karmada.io/not-ready
  effect: NoSchedule
- When the cluster's Ready condition changes to Unknown, the Karmada controller will add the following taint to the target Cluster object:
  key: cluster.karmada.io/unreachable
  effect: NoSchedule
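For illustration, a Cluster object whose Ready condition has turned False might look roughly like this (a minimal sketch; unrelated fields are omitted):

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member2
spec:
  # Taint added by the Karmada controller once the Ready condition becomes False.
  taints:
  - key: cluster.karmada.io/not-ready
    effect: NoSchedule
status:
  conditions:
  - type: Ready
    status: "False"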
In addition, the Karmada controller does not actively add NoExecute taints to Cluster objects. Users can manage taints on Cluster objects themselves, including NoExecute taints, through the cluster taint management feature.
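If users want a failure to actually trigger eviction, they can set a NoExecute taint on the Cluster object themselves. A minimal sketch, where the taint key workload-evict is purely illustrative and not a taint Karmada adds on its own:

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member2
spec:
  taints:
  # Hypothetical user-managed taint; resources whose policy does not tolerate it will be evicted.
  - key: workload-evict
    effect: NoExecute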
Failover
When the Karmada controller detects that a cluster has been tainted with a NoExecute taint, and the taint is not tolerated by the tolerations defined in the affected PropagationPolicy/ClusterPropagationPolicy, the controller will remove the cluster from the scheduling results of the resources matched by those policies. Afterward, the Karmada scheduler will reschedule all affected resources.
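Whether such a taint is tolerated is expressed through cluster tolerations in the policy's placement. A minimal sketch, reusing the illustrative workload-evict taint from above (the toleration values are examples, not defaults):

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterTolerations:
    # Tolerate the illustrative NoExecute taint for 5 minutes before eviction kicks in.
    - key: workload-evict
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300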
There are several constraints:
- For each rescheduled application, it still needs to satisfy the restrictions of its PropagationPolicy/ClusterPropagationPolicy, such as ClusterAffinity or SpreadConstraints.
- Applications already distributed to the clusters that remain ready after the initial scheduling are kept in place during failover rescheduling.
Duplicated schedule type
For resources whose scheduling type is set to Duplicated, rescheduling after a cluster failure will only proceed if the number of candidate clusters that satisfy the propagation policy constraints is greater than or equal to the number of failed clusters; otherwise, rescheduling will not be performed.
Here, candidate clusters refer to the newly calculated cluster scheduling results in the current scheduling process, which are different from the already scheduled clusters.
Take a Deployment as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
      - member3
      - member5
    spreadConstraints:
    - maxGroups: 2
      minGroups: 2
    replicaScheduling:
      replicaSchedulingType: Duplicated
Suppose a Karmada instance manages 5 clusters: member1, member2, member3, member4, member5, and the initial scheduling result of Deployment default/nginx is clusters member1 and member2.
When the member2 cluster fails, the Karmada scheduler will reschedule this workload.
It should be noted that rescheduling will not delete the application on the ready cluster member1. Among the remaining three clusters (member3, member4, and member5), only member3 and member5 match the clusterAffinity policy.
Due to the limitations of spreadConstraints, the final result can be [member1, member3] or [member1, member5].
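The rescheduling result is reflected in the ResourceBinding derived from the Deployment. A rough sketch of one possible outcome (excerpt only; the binding name follows the usual name-kind convention):

# kubectl get resourcebinding nginx-deployment -o yaml (excerpt)
spec:
  clusters:
  - name: member1   # kept from the initial scheduling result
  - name: member3   # newly selected; member5 is equally possible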
Divided schedule type
For resources with the scheduling type set to Divided, the Karmada scheduler will try to migrate the replicas to the remaining healthy clusters.
Take a Deployment as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - member1
          weight: 1
        - targetCluster:
            clusterNames:
            - member2
          weight: 2
The Karmada scheduler will divide the replicas according to the weightPreference. The initial scheduling result is member1 with 1 replica and member2 with 2 replicas.
When member1 fails, rescheduling is triggered. The Karmada scheduler will try to migrate the replicas to the remaining healthy clusters, so the final result will be member2 with 3 replicas.
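Again, the change can be observed on the ResourceBinding's scheduling result. A rough sketch of the relevant excerpt before and after failover:

# Before member1 fails (excerpt):
spec:
  clusters:
  - name: member1
    replicas: 1
  - name: member2
    replicas: 2
# After member1 fails and rescheduling completes (excerpt):
spec:
  clusters:
  - name: member2
    replicas: 3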
Graceful eviction feature
To prevent service interruption during cluster failover, Karmada needs to ensure that the removal of evicted workloads is delayed until the workloads become available on the new clusters.
The GracefulEvictionTasks field is added to ResourceBinding/ClusterResourceBinding to indicate the eviction task queue.
When a faulty cluster is removed from a resource's scheduling result by the taint-manager, an eviction task for that cluster is added to the eviction task queue.
The gracefulEviction controller is responsible for processing tasks in the eviction task queue. It evaluates the tasks one by one to decide whether each can be removed from the queue. The judgment conditions are as follows:
- Check the health status of the current resource scheduling result. If the resource health status is healthy, the condition is met.
- Check whether the waiting duration of the current task exceeds the timeout, which can be configured via the graceful-eviction-timeout flag (default: 10 minutes). If it does, the condition is met.
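Putting this together, while an eviction is pending the ResourceBinding carries both the new scheduling result and the eviction task. A rough sketch (field values are illustrative; the exact reason string depends on the Karmada version):

# Excerpt of a ResourceBinding during graceful eviction.
spec:
  clusters:
  - name: member2                # new target; the task is cleared once the workload is healthy here
    replicas: 3
  gracefulEvictionTasks:
  - fromCluster: member1         # faulty cluster pending removal
    replicas: 1
    reason: ClusterNotReady      # illustrative value
    creationTimestamp: "2025-01-01T00:00:00Z"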