Cluster Status Sync Latency SLO Error Budget Burn Rate Exceeded
Understanding This Alert
This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses ticket-level alerts only — investigate and address during normal business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
What This Alert Means
Syncing status information from Ready member clusters is taking longer than expected. Karmada periodically pulls cluster status (node count, resource capacity, conditions) from each member cluster. This SLO only tracks syncs for clusters that are in a Ready state.
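For orientation, the SLI behind this SLO is a latency ratio computed from the sync duration histogram. The query below is a sketch only; the authoritative threshold and recording rules live in the Sloth SLO definition, and the 1-second bucket (le="1") is an assumption here:
# fraction of status syncs completing within the assumed 1s objective
sum(rate(cluster_sync_status_duration_seconds_bucket{le="1"}[5m]))
/
sum(rate(cluster_sync_status_duration_seconds_count[5m]))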
Impact
- Stale cluster information - The scheduler may use outdated capacity data when making placement decisions
- Suboptimal scheduling - Workloads may be placed on clusters that appear to have capacity but don't
- Delayed health detection - Changes in cluster health take longer to be reflected in the control plane
- Delayed failover - Failed clusters continue receiving new workload placements
Possible Causes
- Member cluster API server slowness
- Network latency between the Karmada control plane and member clusters (especially cross-region)
- Large clusters with many nodes requiring more status data to collect
- Resource pressure on the controller manager
- High number of registered member clusters
Remediation
1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:
- SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
- SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
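You can also query the Sloth-generated recording rules directly in Prometheus. This is a sketch that assumes Sloth's default metadata rule names and an illustrative sloth_slo label value; substitute the actual sloth_service and sloth_slo labels from your Sloth spec:
# how fast the error budget is currently burning (1 = exactly sustainable)
slo:current_burn_rate:ratio{sloth_slo="cluster-status-sync-latency"}
# fraction of this period's error budget still remaining
slo:period_error_budget_remaining:ratio{sloth_slo="cluster-status-sync-latency"}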
2. Identify which specific clusters are slow:
histogram_quantile(0.95,
  sum by (le, member_cluster) (rate(cluster_sync_status_duration_seconds_bucket[5m])))
> 1
3. Check member cluster API server latency. The cluster sync controller fetches status from member cluster API servers; their latency is the primary driver.
kubectl describe cluster <slow-cluster-name>
kubectl run conn-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
  curl -w "%{time_total}\n" -s -o /dev/null -k https://<cluster-api-server>:6443/healthz
4. Check network latency to member clusters. Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the threshold.
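To separate network round-trip time from API server processing time, reuse the curl probe from step 3 and compare connect time with total time (the endpoint placeholder must be replaced as before):
kubectl run conn-test --rm -it --restart=Never --namespace=karmada-system --image=curlimages/curl -- \
  curl -w "connect: %{time_connect}s  total: %{time_total}s\n" -s -o /dev/null -k \
  https://<cluster-api-server>:6443/healthz
A consistently high connect time points at the network path; a low connect time with a high total time points back at the member API server.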
5. Check the cluster status controller workqueue:
workqueue_depth{name="cluster-status-controller"}
histogram_quantile(0.95,
  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
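If queue time looks healthy, check processing time and retries for the same workqueue (standard client-go workqueue metrics, using the controller name from above) to see whether individual syncs are slow or being retried:
# p95 time spent actually processing an item
histogram_quantile(0.95,
  sum by (le) (rate(workqueue_work_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
# rate of items being requeued after failures
rate(workqueue_retries_total{name="cluster-status-controller"}[5m])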
6. Check controller-manager resource usage:
kubectl top pod -n karmada-system -l app=karmada-controller-manager
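To compare observed usage against what the pod is actually allowed, dump the configured requests and limits with the same label selector (a sketch; the first container is assumed to be the controller):
kubectl get pod -n karmada-system -l app=karmada-controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'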
7. Check total cluster count. The more member clusters are registered, the more status syncs the cluster status controller must perform.
kubectl get clusters --no-headers | wc -l
8. Check controller-manager logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -iE "cluster|failed|sync"
9. Check for recent changes. Were new member clusters added (especially cross-region)? Were there network path changes?
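Sorting clusters by creation time makes recently joined members easy to spot:
kubectl get clusters --sort-by=.metadata.creationTimestamp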
Mitigation
| Root Cause | Action |
|---|---|
| Member cluster API server slow | Address member cluster health |
| High network latency (cross-region) | Adjust the latency threshold to match actual latency; consider karmada-agent (pull mode) so status is reported from within the member cluster |
| Controller CPU-constrained | Increase CPU limits for controller-manager |
| Many clusters causing contention | Consider increasing controller-manager worker thread count |
| Network partition to specific clusters | Restore network connectivity; investigate network path |
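For the CPU-constrained case, raising the controller-manager's limits is usually a one-line change. A sketch, assuming the control plane runs as the karmada-controller-manager Deployment in karmada-system; adjust the name, namespace, and values to your installation:
kubectl -n karmada-system set resources deployment karmada-controller-manager \
  --requests=cpu=500m,memory=512Mi --limits=cpu=2,memory=2Gi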