Cluster Failover Process Analysis
Let's analyze the Karmada failover feature.
Add taints on the faulty cluster
After a cluster's status becomes unhealthy, a taint with effect NoSchedule will be added to the cluster as follows (a sketch of the resulting Cluster object is shown after the list):
- When the cluster's Ready condition changes to False, the Karmada controller will add the following taint to the target Cluster object:
  key: cluster.karmada.io/not-ready
  effect: NoSchedule
- When the cluster's Ready condition changes to Unknown, the Karmada controller will add the following taint to the target Cluster object:
  key: cluster.karmada.io/unreachable
  effect: NoSchedule
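For illustration, a Cluster object whose Ready condition has turned False might look roughly like this (a minimal sketch; unrelated fields are omitted):

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member2
spec:
  # Taint added by the Karmada controller once the Ready condition becomes False.
  taints:
  - key: cluster.karmada.io/not-ready
    effect: NoSchedule
status:
  conditions:
  - type: Ready
    status: "False"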
In addition, the Karmada controller does not actively add NoExecute taints to Cluster objects. Users can manage taints on Cluster objects themselves, including NoExecute taints, through the cluster taint management feature.
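If users want a failure to actually trigger eviction, they can set a NoExecute taint on the Cluster object themselves. A minimal sketch, where the taint key workload-evict is purely illustrative and not a taint Karmada adds on its own:

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member2
spec:
  taints:
  # Hypothetical user-managed taint; resources whose policy does not tolerate it will be evicted.
  - key: workload-evict
    effect: NoExecute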
Failover
When the Karmada controller detects that a cluster has been tainted with a NoExecute taint, and the taint is not tolerated by the tolerations defined in the affected PropagationPolicy/ClusterPropagationPolicy, the controller will remove the cluster from the scheduling results of the resources matched by those policies. Afterward, the Karmada scheduler will reschedule all affected resources.
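Whether such a taint is tolerated is expressed through cluster tolerations in the policy's placement. A minimal sketch, reusing the illustrative workload-evict taint from above (the toleration values are examples, not defaults):

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterTolerations:
    # Tolerate the illustrative NoExecute taint for 5 minutes before eviction kicks in.
    - key: workload-evict
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300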
There are several constraints:
- For each rescheduled application, it still needs to satisfy the restrictions of its PropagationPolicy/ClusterPropagationPolicy, such as ClusterAffinity or SpreadConstraints.
- Applications already distributed to the clusters that remain ready after the initial scheduling are kept in place during failover rescheduling.
Duplicated schedule type
For resources whose scheduling type is set to Duplicated, rescheduling after a cluster failure will only proceed if the number of candidate clusters that satisfy the propagation policy constraints is greater than or equal to the number of failed clusters; otherwise, rescheduling will not be performed.
Here, candidate clusters refer to the newly calculated cluster scheduling results in the current scheduling process, which are different from the already scheduled clusters.
Take a Deployment as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
      - member3
      - member5
    spreadConstraints:
    - maxGroups: 2
      minGroups: 2
    replicaScheduling:
      replicaSchedulingType: Duplicated
Suppose a Karmada instance manages 5 clusters: member1, member2, member3, member4, member5, and the initial scheduling result of Deployment default/nginx is clusters member1 and member2.
When the member2 cluster fails, the Karmada scheduler will reschedule this workload.
It should be noted that rescheduling will not delete the application on the ready cluster member1. Among the remaining three clusters (member3, member4, and member5), only member3 and member5 match the clusterAffinity policy.
Due to the limitations of spreadConstraints, the final result can be [member1, member3] or [member1, member5].
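The rescheduling result is reflected in the ResourceBinding derived from the Deployment. A rough sketch of one possible outcome (excerpt only; the binding name follows the usual name-kind convention):

# kubectl get resourcebinding nginx-deployment -o yaml (excerpt)
spec:
  clusters:
  - name: member1   # kept from the initial scheduling result
  - name: member3   # newly selected; member5 is equally possible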
Divided schedule type
For resources with the scheduling type set to Divided, the Karmada scheduler will try to migrate the replicas to the remaining healthy clusters.
Take a Deployment as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - member1
          weight: 1
        - targetCluster:
            clusterNames:
            - member2
          weight: 2
The Karmada scheduler will divide the replicas according to the weightPreference. The initial scheduling result is member1 with 1 replica and member2 with 2 replicas.
When member1 fails, rescheduling is triggered. The Karmada scheduler will try to migrate the replicas to the remaining healthy clusters, so the final result will be member2 with 3 replicas.
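Again, the change can be observed on the ResourceBinding's scheduling result. A rough sketch of the relevant excerpt before and after failover:

# Before member1 fails (excerpt):
spec:
  clusters:
  - name: member1
    replicas: 1
  - name: member2
    replicas: 2
# After member1 fails and rescheduling completes (excerpt):
spec:
  clusters:
  - name: member2
    replicas: 3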
Graceful eviction feature
To prevent service interruption during cluster failover, Karmada needs to ensure that the removal of evicted workloads is delayed until the workloads become available on the new clusters.
The GracefulEvictionTasks field is added to ResourceBinding/ClusterResourceBinding to indicate the eviction task queue.
When a faulty cluster is removed from a resource's scheduling result by the taint-manager, an eviction task for that cluster is added to the eviction task queue.
The gracefulEviction controller is responsible for processing tasks in the eviction task queue. It evaluates the tasks one by one to decide whether each can be removed from the queue. The judgment conditions are as follows:
- Check the health status of the current resource scheduling result. If the resource health status is healthy, the condition is met.
- Check whether the waiting duration of the current task exceeds the timeout, which can be configured via the graceful-eviction-timeout flag (default: 10 minutes). If it does, the condition is met.
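Putting this together, while an eviction is pending the ResourceBinding carries both the new scheduling result and the eviction task. A rough sketch (field values are illustrative; the exact reason string depends on the Karmada version):

# Excerpt of a ResourceBinding during graceful eviction.
spec:
  clusters:
  - name: member2                # new target; the task is cleared once the workload is healthy here
    replicas: 3
  gracefulEvictionTasks:
  - fromCluster: member1         # faulty cluster pending removal
    replicas: 1
    reason: ClusterNotReady      # illustrative value
    creationTimestamp: "2025-01-01T00:00:00Z"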