
Binding to Work Sync Latency SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses ticket-level alerts only — investigate and address during normal business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.

What This Alert Means

The process of converting ResourceBindings into Work resources (Stage 4: Work Creation & Override Application) is taking longer than expected. While operations are succeeding, they are exceeding the configured latency threshold at an elevated rate, which delays the overall resource propagation pipeline.
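As a quick sanity check, the underlying SLI can be approximated directly from the sync-duration histogram. A sketch (the le="1" threshold bucket is an assumption; the authoritative threshold lives in the Sloth SLO definition):

```
# Fraction of binding→Work syncs completing within the latency threshold (assumed 1s here)
sum(rate(binding_sync_work_duration_seconds_bucket{le="1"}[5m]))
/
sum(rate(binding_sync_work_duration_seconds_count[5m]))
```

A falling ratio here is what drives the burn-rate alert.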

Impact

  • Slower deployments - Resources take longer to move from scheduling decisions to propagation tasks
  • Delayed updates - Changes to workloads propagate more slowly to member clusters
  • End-to-end pipeline slowdown - Downstream steps (actual workload deployment) are delayed

Binding sync latency determines how quickly scheduling decisions translate into deployable Work objects.

Possible Causes

  • High volume of resources being propagated simultaneously
  • API server slowness affecting Work object creation
  • Complex resource templates requiring more processing time
  • Complex override policies with many rules
  • Large resource manifests taking longer to process
  • Controller-manager under resource pressure
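Some of these causes can be checked directly in Prometheus before working through the steps below. For the API-server-slowness cause, for example (assuming the karmada-apiserver exposes the standard apiserver_request_duration_seconds histogram):

```
# p99 write latency against the API server; sustained elevation here points at
# the API server or etcd rather than the binding controller itself
histogram_quantile(0.99,
  sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH"}[5m]))
)
```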

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.

2. Check binding sync duration breakdown:

histogram_quantile(0.95,
  sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m]))
)

3. Check for complex override policies. Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.

kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"   # cluster-scoped policies
kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"       # namespaced policies

Note that grep -c counts matching lines, so this gives a rough count of policies defining override rules, not the number of rules inside each policy.

4. Check the binding controller workqueue:

workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}

histogram_quantile(0.95,
  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m]))
)

histogram_quantile(0.95,
  sum by (le) (rate(workqueue_work_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m]))
)

5. Check binding volume. A high number of bindings being processed simultaneously increases controller load.

kubectl get resourcebindings -A --no-headers | wc -l
kubectl get clusterresourcebindings --no-headers | wc -l
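If the totals are high, it helps to know which namespaces contribute most. A small sketch (the helper name is ours; feed it the real output of `kubectl get resourcebindings -A --no-headers` instead of the sample lines):

```shell
# Count ResourceBindings per namespace, busiest namespace first.
count_by_namespace() {
  awk '{print $1}' | sort | uniq -c | sort -rn
}

# Sample input standing in for: kubectl get resourcebindings -A --no-headers
count_by_namespace <<'EOF'
prod     app-a-deployment   True   True   4d
prod     app-b-deployment   True   True   2d
staging  app-a-deployment   True   True   4d
EOF
# prints "2 prod" then "1 staging" (with leading spaces from uniq -c)
```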

6. Check controller-manager resource usage:

kubectl top pod -n karmada-system -l app=karmada-controller-manager

7. Check controller-manager logs:

Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "Sync work\|failed"

8. Check for recent changes. Were override policies added or modified? Did the number of propagated resources increase significantly?

Mitigation

| Root Cause | Action |
| --- | --- |
| Complex override policies | Simplify override rules; reduce number of per-resource overrides |
| Large manifests | Review and reduce ConfigMap/Secret sizes; split large resources |
| Controller CPU-constrained | Increase CPU limits |
| API server slow writes | Address API server/etcd performance |
| Backlog in queue | Check for and resolve any errors causing retries |
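For the CPU-constrained case, the change is a larger resources stanza on the karmada-controller-manager Deployment. A sketch (values are illustrative; size them against the kubectl top output from step 6):

```
# Deployment spec.template.spec.containers[].resources (illustrative values)
resources:
  requests:
    cpu: "1"
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 1Gi
```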