
Work to Workload Sync Latency SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses ticket-level alerts only — investigate and address during normal business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
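As a quick intuition for burn rate: it is the observed error rate divided by the error budget rate (1 minus the SLO target). A minimal sketch, using illustrative numbers (a 99.9% SLO target and a 0.5% observed error ratio over the alert window are assumptions, not values from this SLO):

```shell
# Burn rate = observed error ratio / error budget ratio.
# With a 99.9% SLO the budget is 0.1%; a 0.5% error ratio burns budget
# 5x faster than sustainable (all values here are illustrative).
awk 'BEGIN { slo = 0.999; err = 0.005; printf "%.1f\n", err / (1 - slo) }'
```

A burn rate of 1 means the budget is consumed exactly at the sustainable pace; values above 1 mean it will be exhausted before the SLO window ends.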

What This Alert Means

The final step of resource propagation (Stage 5: Work Execution), deploying workloads to member clusters, is taking longer than expected. Workloads are still being deployed successfully, but an elevated fraction of sync operations is exceeding the configured latency threshold.

Impact

  • Slower deployments - Applications take longer to start running in member clusters
  • Delayed updates - Changes to existing workloads propagate more slowly
  • Increased end-to-end deployment time - The overall time from resource creation to running workload increases
  • Reduced sync throughput - High latency can cause a backlog of pending Work objects

Every millisecond of work sync latency adds directly to how long users wait between submitting a resource and seeing it running in member clusters.

Possible Causes

  • Member cluster API server slowness
  • Large or complex resource manifests
  • High volume of resources being synced simultaneously
  • Network latency between the Karmada control plane and member clusters (especially cross-region)
  • Slow admission webhooks in member clusters
  • Execution controller under resource pressure

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.

2. Check if latency is isolated to specific clusters. Cross-region clusters will naturally have higher latency.

kubectl run latency-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
curl -w "%{time_total}" -s -o /dev/null -k https://<member-cluster-api-server>:6443/healthz

3. Check member cluster API server load:

kubectl --context=<member-cluster-context> top pods -n kube-system | grep apiserver

4. Check for large resource manifests. Large Work objects take longer to transmit and apply.

kubectl get works -A -o json | jq '.items | map({name: .metadata.name, namespace: .metadata.namespace, size: (.spec | tostring | length)}) | sort_by(.size) | reverse | .[0:10]'

5. Check member cluster admission webhooks. Admission webhooks can significantly increase request latency.

kubectl --context=<member-cluster-context> get validatingwebhookconfigurations
kubectl --context=<member-cluster-context> get mutatingwebhookconfigurations
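Webhook latency is bounded by each webhook's `timeoutSeconds`, which defaults to 10 seconds when unset. A sketch of a jq filter that lists each webhook's name and effective timeout, demonstrated here on a minimal sample document (the names in the sample are illustrative); in practice, pipe `kubectl get validatingwebhookconfigurations -o json` (and the mutating equivalent) into the same filter:

```shell
# Print each webhook's name and timeoutSeconds, defaulting to 10 when the
# field is unset (the API server default). Sample data is illustrative.
cat <<'EOF' | jq -r '.items[].webhooks[] | "\(.name) \(.timeoutSeconds // 10)s"'
{"items":[{"webhooks":[{"name":"validate.example.io","timeoutSeconds":30},{"name":"default.example.io"}]}]}
EOF
```

Webhooks with long timeouts and a `failurePolicy` of `Fail` are the most likely to inflate apply latency.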

6. Check work sync duration breakdown:

histogram_quantile(0.95,
sum by (le, result) (rate(work_sync_workload_duration_seconds_bucket[5m])))
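To see the raw signal behind the SLO, you can also compute the fraction of syncs exceeding the latency threshold. A sketch, assuming the threshold aligns with a 1-second histogram bucket (adjust `le` to match your SLO's configured threshold):

```promql
1 - (
  sum(rate(work_sync_workload_duration_seconds_bucket{le="1"}[5m]))
  /
  sum(rate(work_sync_workload_duration_seconds_count[5m]))
)
```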

7. Check execution controller workqueue:

workqueue_depth{name="execution-controller"}

histogram_quantile(0.95,
sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="execution-controller"}[5m])))

8. Check controller-manager logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "sync work\|failed"

9. Check for recent changes. Were new cross-region clusters added? Were admission webhooks installed in member clusters? Did resource sizes increase?

Mitigation

| Root Cause | Action |
| --- | --- |
| High network latency to cross-region clusters | Adjust the latency threshold; co-locate controllers if possible |
| Member cluster API server overloaded | Reduce request rate; scale member cluster API server |
| Large Work objects | Reduce resource size; split large ConfigMaps/Secrets |
| Slow admission webhooks in member cluster | Optimize or temporarily disable non-critical webhooks |
| Execution controller CPU-constrained | Increase CPU limits for controller-manager |
| Backlog of Work objects | Address root cause; consider increasing controller workers |
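If a backlog persists and the controller has CPU headroom, raising sync concurrency can help. A sketch, assuming your Karmada release's karmada-controller-manager supports the `--concurrent-work-syncs` flag (verify against `--help` for your version before applying):

```shell
# Inspect the current flags, then raise concurrency by editing the Deployment.
# --concurrent-work-syncs sets how many Work objects sync in parallel
# (flag availability and default are assumptions; confirm for your release).
kubectl -n karmada-system get deployment karmada-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].command}'
kubectl -n karmada-system edit deployment karmada-controller-manager
# add or update: --concurrent-work-syncs=10
```

Watch `workqueue_depth{name="execution-controller"}` afterwards to confirm the backlog is draining rather than merely shifting load onto member cluster API servers.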