Version: 1.15

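Failover Analysis

Let's walk through the cluster failover process in Karmada.

Add taints to the cluster

Once a cluster is determined to be unhealthy, a taint with Effect NoSchedule is added to it, as follows:

When the cluster's Ready condition is False, the following taint is added: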

key: cluster.karmada.io/not-ready
effect: NoSchedule

When the cluster's Ready condition is Unknown, the following taint is added:

key: cluster.karmada.io/unreachable
effect: NoSchedule

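If the cluster remains unhealthy for a period of time without recovering (the period is configurable via the --failover-eviction-timeout flag and defaults to 5 minutes), a taint with Effect NoExecute is added to it, as follows:

When the cluster's Ready condition is False, the following taint is added: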

key: cluster.karmada.io/not-ready
effect: NoExecute

When the cluster's Ready condition is Unknown, the following taint is added:

key: cluster.karmada.io/unreachable
effect: NoExecute
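
The taint rules above can be summarized in a small sketch. This is not Karmada's implementation; the function name `taints_for_cluster`, the `Taint` type, and the choice to keep the NoSchedule taint once NoExecute is added are assumptions made for illustration. The 300-second default mirrors the 5-minute default of --failover-eviction-timeout.

```python
from dataclasses import dataclass
from typing import List

NOT_READY_KEY = "cluster.karmada.io/not-ready"
UNREACHABLE_KEY = "cluster.karmada.io/unreachable"


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str


def taints_for_cluster(ready_status: str,
                       unhealthy_seconds: float,
                       eviction_timeout_seconds: float = 300.0) -> List[Taint]:
    """Return the taints expected on a cluster (hypothetical helper).

    ready_status is the status of the Ready condition: "True", "False" or "Unknown".
    """
    if ready_status == "True":
        return []  # healthy cluster: no failover taints
    if ready_status == "False":
        key = NOT_READY_KEY
    elif ready_status == "Unknown":
        key = UNREACHABLE_KEY
    else:
        raise ValueError(f"unexpected Ready status: {ready_status!r}")

    # The NoSchedule taint is added as soon as the cluster is unhealthy.
    taints = [Taint(key, "NoSchedule")]
    # Assumption: the NoExecute taint is added in addition once the timeout has elapsed.
    if unhealthy_seconds >= eviction_timeout_seconds:
        taints.append(Taint(key, "NoExecute"))
    return taints
```

For example, `taints_for_cluster("Unknown", 400)` yields both the NoSchedule and NoExecute taints with the unreachable key, while `taints_for_cluster("False", 10)` yields only the NoSchedule taint with the not-ready key.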

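Tolerate cluster taints

After a user creates a PropagationPolicy/ClusterPropagationPolicy resource, Karmada automatically adds the following cluster taint tolerations to it through a webhook: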

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  placement:
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  ...

The tolerationSeconds values can be configured via the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags; both default to 300.
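
To make the effect of tolerationSeconds concrete, here is a minimal sketch of how long a NoExecute taint stays tolerated. It assumes Kubernetes-style toleration matching (the `Exists` operator matches on key and effect, and the smallest tolerationSeconds among matching tolerations wins); the helper names `matches` and `remaining_tolerated_seconds` are hypothetical and not part of Karmada.

```python
from typing import Optional

# The tolerations injected by the webhook, as shown above.
DEFAULT_TOLERATIONS = [
    {"key": "cluster.karmada.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "cluster.karmada.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]


def matches(toleration: dict, taint: dict) -> bool:
    """Kubernetes-style matching (assumed to apply to cluster taints as well)."""
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An Exists toleration without a key matches every key.
        return toleration.get("key") in (None, taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def remaining_tolerated_seconds(tolerations: list, taint: dict,
                                elapsed: float) -> Optional[float]:
    """Seconds left before the taint stops being tolerated.

    Returns None when the taint is tolerated indefinitely and 0.0 when it is
    not (or no longer) tolerated.
    """
    matching = [t for t in tolerations if matches(t, taint)]
    if not matching:
        return 0.0
    limits = [t.get("tolerationSeconds") for t in matching]
    if any(limit is None for limit in limits):
        return None  # no tolerationSeconds: tolerated forever
    return max(0.0, min(limits) - elapsed)
```

With the default tolerations, a `cluster.karmada.io/not-ready` NoExecute taint that has been present for 120 seconds is tolerated for another 180 seconds; after 300 seconds it is no longer tolerated.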

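Failover

When Karmada detects that a faulty cluster is no longer tolerated by the PropagationPolicy/ClusterPropagationPolicy, the cluster is removed from the resource scheduling result, and the Karmada scheduler then reschedules the affected workloads.

Rescheduling is subject to the following constraints:

Each rescheduled workload must still satisfy the constraints of its PropagationPolicy/ClusterPropagationPolicy, such as ClusterAffinity or SpreadConstraints.
Clusters in the initial scheduling result that are still healthy are retained during rescheduling.

Duplicated schedule type

For the Duplicated schedule type, rescheduling after a cluster failure proceeds only when the number of candidate clusters that satisfy the propagation policy is greater than or equal to the number of failed clusters; otherwise it is not performed. Here, candidate clusters are the clusters newly selected in the current scheduling round, as distinct from the clusters already in the scheduling result.

Take a Deployment as an example: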


apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
        - member5
    spreadConstraints:
      - maxGroups: 2
        minGroups: 2
    replicaScheduling:
      replicaSchedulingType: Duplicated

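Suppose there are 5 member clusters and the initial scheduling result places the workload on member1 and member2. When member2 fails, the scheduler is triggered to reschedule.

Note that rescheduling does not delete the workload on member1, which is still Ready. Of the remaining 3 clusters, only member3 and member5 match the clusterAffinity policy.

Because of the spread constraints, the final scheduling result is either [member1, member3] or [member1, member5].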
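
The reasoning in this example can be reproduced with a short sketch. It is only an illustration of the rules stated above, not the scheduler's algorithm; `reschedule_duplicated` and its parameters are hypothetical names, and the spread constraint is reduced to a simple cap on the number of clusters.

```python
from itertools import combinations
from typing import List, Set


def reschedule_duplicated(scheduled: List[str], failed: Set[str],
                          affinity: List[str], max_groups: int) -> List[List[str]]:
    """List the possible scheduling results after a failure (hypothetical helper).

    Returns an empty list when rescheduling is not performed.
    """
    # Healthy clusters from the initial result are always retained.
    kept = [c for c in scheduled if c not in failed]
    failed_count = len(scheduled) - len(kept)

    # Candidates: healthy clusters allowed by clusterAffinity that are not already scheduled.
    candidates = [c for c in affinity if c not in failed and c not in scheduled]
    if len(candidates) < failed_count:
        return []  # not enough candidates: rescheduling is skipped

    # Replace the failed clusters without exceeding maxGroups.
    slots = max(0, min(failed_count, max_groups - len(kept)))
    return [kept + list(extra) for extra in combinations(candidates, slots)]


results = reschedule_duplicated(
    scheduled=["member1", "member2"],
    failed={"member2"},
    affinity=["member1", "member2", "member3", "member5"],
    max_groups=2,
)
print(results)
```

For the policy above this prints the two outcomes named in the text, [member1, member3] and [member1, member5]; which one the scheduler actually picks is not determined by this sketch.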

Divided schedule type

For the Divided schedule type, the Karmada scheduler tries to migrate the replicas to other healthy clusters.

Take a Deployment as an example:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - member1
            weight: 1
          - targetCluster:
              clusterNames:
                - member2
            weight: 2

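The Karmada scheduler divides the replicas according to the weightPreference weight list. In the initial scheduling result, member1 has 1 replica and member2 has 2 replicas.

After member1 fails, rescheduling is triggered, and the final scheduling result is 3 replicas on member2.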
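
The replica counts in this example follow from a proportional split by static weight. The sketch below is one simple way to compute such a split; `divide_by_static_weight` is a hypothetical helper, and the way it hands out leftover replicas (heaviest cluster first) is an assumption rather than Karmada's documented behavior.

```python
from typing import Dict, FrozenSet


def divide_by_static_weight(replicas: int, weights: Dict[str, int],
                            failed: FrozenSet[str] = frozenset()) -> Dict[str, int]:
    """Split replicas across healthy clusters in proportion to their weights."""
    healthy = {c: w for c, w in weights.items() if c not in failed and w > 0}
    total = sum(healthy.values())
    if total == 0:
        return {}  # no healthy cluster left to receive replicas

    # Floor of the proportional share for each cluster.
    result = {c: replicas * w // total for c, w in healthy.items()}

    # Assumption: leftover replicas go to the heaviest clusters first (ties by name).
    remainder = replicas - sum(result.values())
    for cluster in sorted(healthy, key=lambda c: (-healthy[c], c))[:remainder]:
        result[cluster] += 1
    return result


weights = {"member1": 1, "member2": 2}
initial = divide_by_static_weight(3, weights)
after_failure = divide_by_static_weight(3, weights, failed=frozenset({"member1"}))
print(initial, after_failure)
```

With 3 replicas and weights 1:2, the initial split is 1 replica on member1 and 2 on member2; once member1 is excluded, all 3 replicas land on member2, matching the result described above.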

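Graceful failover

To prevent service interruption during cluster failover, Karmada must ensure that replicas in the faulty cluster are deleted only after the replicas on the new cluster have become available.

A GracefulEvictionTasks field has been added to ResourceBinding/ClusterResourceBinding to represent the queue of graceful eviction tasks.

When the taint-manager removes a faulty cluster from the resource scheduling result, the cluster is added to the graceful eviction task queue.

The gracefulEviction controller processes the tasks in the graceful eviction task queue. It evaluates the tasks one by one to decide whether each can be removed from the queue. The conditions checked are as follows:

Check the health status of the resource in the current scheduling result. If the resource is healthy, the condition is met.
Check whether the task has been waiting longer than the timeout, which can be configured via the --graceful-eviction-timeout flag (10 minutes by default). If it has, the condition is met.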
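
The two checks can be sketched as a simple queue filter. This is an illustration only: `EvictionTask`, `can_remove`, and `process_queue` are hypothetical names, and the sketch assumes that meeting either condition is enough to remove a task, with the timeout acting as a fallback when the resource never becomes healthy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvictionTask:
    from_cluster: str   # the faulty cluster being evicted
    created_at: float   # time the task was enqueued, in seconds


def can_remove(task: EvictionTask, now: float, resource_healthy: bool,
               timeout_seconds: float = 600.0) -> bool:
    """A task may leave the queue if the resource is healthy or the wait timed out."""
    if resource_healthy:
        return True
    return now - task.created_at > timeout_seconds


def process_queue(tasks: List[EvictionTask], now: float, resource_healthy: bool,
                  timeout_seconds: float = 600.0) -> List[EvictionTask]:
    """Return the tasks that must stay in the queue after one evaluation pass."""
    return [t for t in tasks
            if not can_remove(t, now, resource_healthy, timeout_seconds)]


queue = [EvictionTask("member1", created_at=0.0)]
still_waiting = process_queue(queue, now=100.0, resource_healthy=False)
drained = process_queue(queue, now=100.0, resource_healthy=True)
```

Here `still_waiting` keeps the member1 task because the resource is not yet healthy and only 100 seconds have passed, while `drained` is empty because the resource is healthy in the new scheduling result.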
