
Work to Workload Sync Latency SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses ticket-level alerts only — investigate and address during normal business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
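As a quick intuition for burn rate: it is the observed error rate divided by the error budget rate (1 minus the SLO target). A minimal sketch, using illustrative numbers (a 99.9% SLO target and a 0.5% observed error ratio over the alert window are assumptions, not values from this SLO):

```shell
# Burn rate = observed error ratio / error budget ratio.
# With a 99.9% SLO the budget is 0.1%; a 0.5% error ratio burns budget
# 5x faster than sustainable (all values here are illustrative).
awk 'BEGIN { slo = 0.999; err = 0.005; printf "%.1f\n", err / (1 - slo) }'
```

A burn rate of 1 means the budget is consumed exactly at the sustainable pace; values above 1 mean it will be exhausted before the SLO window ends.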

What This Alert Means

The final step of resource propagation (Stage 5: Work Execution), deploying workloads to member clusters, is taking longer than expected. Workloads are still being deployed successfully, but an elevated fraction of sync operations is exceeding the configured latency threshold.

Impact

  • Slower deployments - Applications take longer to start running in member clusters
  • Delayed updates - Changes to existing workloads propagate more slowly
  • Increased end-to-end deployment time - The overall time from resource creation to running workload increases
  • Reduced sync throughput - High latency can cause a backlog of pending Work objects

Every millisecond of work sync latency adds directly to how long users wait between submitting a resource and seeing it running in member clusters.

Possible Causes

  • Member cluster API server slowness
  • Large or complex resource manifests
  • High volume of resources being synced simultaneously
  • Network latency between the Karmada control plane and member clusters (especially cross-region)
  • Slow admission webhooks in member clusters
  • Execution controller under resource pressure

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.

2. Check if latency is isolated to specific clusters. Cross-region clusters will naturally have higher latency.

kubectl run latency-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
curl -w "%{time_total}" -s -o /dev/null -k https://<member-cluster-api-server>:6443/healthz

3. Check member cluster API server load:

kubectl --context=<member-cluster-context> top pods -n kube-system | grep apiserver

4. Check for large resource manifests. Large Work objects take longer to transmit and apply.

kubectl get works -A -o json | jq '.items | map({name: .metadata.name, namespace: .metadata.namespace, size: (.spec | tostring | length)}) | sort_by(.size) | reverse | .[0:10]'

5. Check member cluster admission webhooks. Admission webhooks can significantly increase request latency.

kubectl --context=<member-cluster-context> get validatingwebhookconfigurations
kubectl --context=<member-cluster-context> get mutatingwebhookconfigurations
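Webhook latency is bounded by each webhook's `timeoutSeconds`, which defaults to 10 seconds when unset. A sketch of a jq filter that lists each webhook's name and effective timeout, demonstrated here on a minimal sample document (the names in the sample are illustrative); in practice, pipe `kubectl get validatingwebhookconfigurations -o json` (and the mutating equivalent) into the same filter:

```shell
# Print each webhook's name and timeoutSeconds, defaulting to 10 when the
# field is unset (the API server default). Sample data is illustrative.
cat <<'EOF' | jq -r '.items[].webhooks[] | "\(.name) \(.timeoutSeconds // 10)s"'
{"items":[{"webhooks":[{"name":"validate.example.io","timeoutSeconds":30},{"name":"default.example.io"}]}]}
EOF
```

Webhooks with long timeouts and a `failurePolicy` of `Fail` are the most likely to inflate apply latency.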

6. Check work sync duration breakdown:

histogram_quantile(0.95,
sum by (le, result) (rate(work_sync_workload_duration_seconds_bucket[5m])))
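To see the raw signal behind the SLO, you can also compute the fraction of syncs exceeding the latency threshold. A sketch, assuming the threshold aligns with a 1-second histogram bucket (adjust `le` to match your SLO's configured threshold):

```promql
1 - (
  sum(rate(work_sync_workload_duration_seconds_bucket{le="1"}[5m]))
  /
  sum(rate(work_sync_workload_duration_seconds_count[5m]))
)
```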

7. Check execution controller workqueue:

workqueue_depth{name="execution-controller"}

histogram_quantile(0.95,
sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="execution-controller"}[5m])))

8. Check controller-manager logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "sync work\|failed"

9. Check for recent changes. Were new cross-region clusters added? Were admission webhooks installed in member clusters? Did resource sizes increase?

Mitigation

| Root Cause | Action |
| --- | --- |
| High network latency to cross-region clusters | Adjust the latency threshold; co-locate controllers if possible |
| Member cluster API server overloaded | Reduce request rate; scale member cluster API server |
| Large Work objects | Reduce resource size; split large ConfigMaps/Secrets |
| Slow admission webhooks in member cluster | Optimize or temporarily disable non-critical webhooks |
| Execution controller CPU-constrained | Increase CPU limits for controller-manager |
| Backlog of Work objects | Address root cause; consider increasing controller workers |
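If a backlog persists and the controller has CPU headroom, raising sync concurrency can help. A sketch, assuming your Karmada release's karmada-controller-manager supports the `--concurrent-work-syncs` flag (verify against `--help` for your version before applying):

```shell
# Inspect the current flags, then raise concurrency by editing the Deployment.
# --concurrent-work-syncs sets how many Work objects sync in parallel
# (flag availability and default are assumptions; confirm for your release).
kubectl -n karmada-system get deployment karmada-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].command}'
kubectl -n karmada-system edit deployment karmada-controller-manager
# add or update: --concurrent-work-syncs=10
```

Watch `workqueue_depth{name="execution-controller"}` afterwards to confirm the backlog is draining rather than merely shifting load onto member cluster API servers.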