
Work to Workload Sync Availability SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than is sustainable. Page alerts indicate urgent issues requiring immediate action, while ticket alerts should be addressed during business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
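
For reference, a typical multiwindow burn-rate condition looks like the sketch below. The recording-rule names follow Sloth's conventions, and the sloth_service/sloth_slo label values and the 99.9% objective are placeholder assumptions; adjust them to match your generated rules:

# Page-severity sketch: burn rate exceeds 14.4x over both a long (1h)
# and a short (5m) window, for an assumed 99.9% objective (0.001 budget).
(
    slo:sli_error:ratio_rate1h{sloth_service="karmada", sloth_slo="work-sync-availability"} > (14.4 * 0.001)
  and
    slo:sli_error:ratio_rate5m{sloth_service="karmada", sloth_slo="work-sync-availability"} > (14.4 * 0.001)
)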

What This Alert Means

Karmada is failing to deploy workloads to your member clusters (Stage 5: Work Execution). This is the final and most critical step where your resources actually get created in the target clusters.
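
Under the hood, Karmada materializes each propagated workload as a Work object in the execution namespace for the target cluster (named karmada-es-<cluster-name>). A quick way to inspect them, with <cluster-name> and <work-name> as placeholders:

# List Work objects destined for a given member cluster, then inspect
# a failing one; its Applied condition records why the apply failed.
kubectl get work -n karmada-es-<cluster-name>
kubectl describe work <work-name> -n karmada-es-<cluster-name>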

Impact

  • Workloads don't run - Despite successful scheduling, applications fail to start in member clusters
  • Most visible failures - Your applications won't be available
  • Direct end-user impact - Services depending on these workloads will be down
  • Deployment pipeline broken at the final step - Everything works until the last mile

This is a critical issue with direct impact on application availability. Users can see their resources in Karmada but not in member clusters.
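
To confirm this divergence for a specific workload, compare the control-plane view with the member cluster directly. This sketch assumes your current context points at the karmada-apiserver; <member-context>, <name>, and <namespace> are placeholders, and a Deployment is just an example:

# Present in the Karmada control plane...
kubectl get deployment <name> -n <namespace>
# ...but missing (or stale) in the member cluster.
kubectl --context <member-context> get deployment <name> -n <namespace>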

Possible Causes

  • Member cluster API server issues or unreachable clusters
  • Admission control rejections due to policy misconfigurations
  • Conflicts with pre-existing resources in the member cluster
  • Missing CRDs in member clusters
  • Authentication/authorization failures to member clusters
  • Network connectivity issues between control plane and member clusters

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
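
If you prefer querying Prometheus directly, Sloth also emits metadata recording rules; a sketch for the remaining error budget, assuming Sloth's standard rule naming (label values are placeholders):

# Fraction of this period's error budget still remaining (1 = untouched, 0 = exhausted).
slo:period_error_budget_remaining:ratio{sloth_service="karmada", sloth_slo="work-sync-availability"}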

2. Check Kubernetes events for work sync failures:

kubectl get events -n karmada-system --field-selector reason=SyncFailed --sort-by='.lastTimestamp'

Review the event messages to understand why the sync failed and determine the fix.

3. Identify affected clusters:

sum by (member_cluster) (
    rate(create_resource_to_cluster{result="error"}[5m])
  + rate(update_resource_to_cluster{result="error"}[5m])
  + rate(delete_resource_from_cluster{result="error"}[5m])
)

4. Check member cluster connectivity. Work sync errors most commonly occur when member clusters are unreachable.

kubectl get clusters
kubectl describe cluster <cluster-name>

In Prometheus, clusters that are not ready report:

cluster_ready_state == 0
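
To pull just the readiness signal from the Cluster object, a jsonpath sketch (field names per the Karmada Cluster API; <cluster-name> is a placeholder):

# "True" means the cluster is reachable and healthy from Karmada's perspective.
kubectl get cluster <cluster-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'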

5. Check the execution controller logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "sync work\|failed"

6. Check for resource-specific errors. Specific resource kinds failing may indicate missing CRDs, RBAC restrictions, or resource conflicts in the member cluster.

sum by (kind) (
    rate(create_resource_to_cluster{result="error"}[5m])
  + rate(update_resource_to_cluster{result="error"}[5m])
)
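
If a particular kind dominates the errors, verify that its CRD is installed in the affected member cluster; <member-context> and <kind> are placeholders:

# A missing CRD causes every create/update for that kind to fail in the member cluster.
kubectl --context <member-context> get crd | grep -i <kind>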

7. Check authentication/authorization to member clusters:

kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "unauthorized\|forbidden\|certificate"
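
If the logs point at credential problems, the Cluster object references the secret Karmada uses to reach the member cluster (push-mode clusters). A sketch for locating and inspecting it; the secret name and namespace come from the first command's output:

# Find which secret holds the member cluster's credentials, then inspect it.
kubectl get cluster <cluster-name> -o jsonpath='{.spec.secretRef}'
kubectl get secret <secret-name> -n <secret-namespace> -o yaml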

8. Check network connectivity to member clusters:

kubectl run connectivity-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
curl -k https://<member-cluster-api-server>:6443/healthz

9. Check for recent changes. Were member cluster credentials rotated? Were admission policies changed in member clusters? Were new CRD-based resources added?

Mitigation

Symptom                            Action
Member cluster unreachable         Restore network connectivity to the member cluster
Member cluster API server errors   Address member cluster API server health
Authentication failures            Rotate or renew member cluster credentials
Missing CRDs in member cluster     Install required CRDs in affected member clusters
Resource conflicts                 Resolve conflicts; check for namespace/name collisions
Network partition                  Work with the network team to restore connectivity
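
For the missing-CRD case, a minimal sketch of installing the CRD directly in the affected member cluster (<member-context> and the manifest path are placeholders); alternatively, Karmada can propagate CRDs from the control plane with a ClusterPropagationPolicy:

# Apply the CRD manifest directly to the member cluster.
kubectl --context <member-context> apply -f <crd-manifest>.yaml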