# Karmada Scheduler Availability SLO Error Budget Burn Rate Exceeded
## Understanding This Alert
This alert fires when the SLO's error budget is being consumed faster than sustainable — page alerts indicate urgent issues requiring immediate action, while ticket alerts should be addressed during business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
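As a rough worked example (illustrative numbers, not necessarily this SLO's actual thresholds): with a 99.9% availability target over 30 days, the error budget is 0.1% of scheduling attempts. A burn rate of 1 would spend exactly that budget over the full window; a burn rate of 14.4 (a common page threshold) would exhaust the entire 30-day budget in 30d / 14.4 ≈ 2 days if sustained, which is why high burn rates page immediately.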
## What This Alert Means
The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks result="error" on the karmada_scheduler_schedule_attempts_total metric — note that this includes both system-level scheduler failures and unschedulable outcomes (e.g., no clusters matching placement constraints).
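As a rough sketch, the SLI behind this alert is essentially the following error ratio (the production recording rules are generated by Sloth and may differ in window and labels):

```promql
sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
/
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
```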
## Impact
- Workloads don't propagate - Resources remain in the control plane but don't reach member clusters
- Deployments are stuck - New applications can't be distributed
- No rescheduling on failure - When clusters fail, workloads can't be moved to healthy clusters
- Scheduling decisions fail - The scheduler can't determine where to place resources
This is a critical issue that blocks workload distribution across your clusters.
## Possible Causes
- Scheduler pod is unhealthy, crash-looping, or OOMKilled
- No matching clusters for the placement rules
- Insufficient cluster capacity
- Scheduler estimator failures or unreachable estimators
- Scheduler plugin execution failures
- Member clusters in NotReady state
## Remediation
1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:
- SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
- SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
2. Check Kubernetes events for scheduling failures:
   ```bash
   kubectl get events -n karmada-system --field-selector reason=ScheduleBindingFailed --sort-by='.lastTimestamp'
   ```
Review the event messages to understand why the scheduling failed and determine the fix.
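   If an event names a specific binding, the ResourceBinding's status conditions usually carry the scheduler's full error message (the binding name and namespace below are placeholders):

   ```bash
   kubectl get resourcebindings -A
   kubectl describe resourcebinding <binding-name> -n <namespace>
   ```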
3. Check the scheduling error rate and breakdown by type:
   ```promql
   sum by (result, schedule_type) (rate(karmada_scheduler_schedule_attempts_total[5m]))
   ```
This metric reports result="scheduled" for successes and result="error" for failures. The schedule_type label helps distinguish first-time scheduling from rescheduling. Note that result="error" includes both system-level failures (e.g., estimator unreachable) and constraint mismatches (e.g., no eligible clusters) — check events and logs to distinguish them.
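   To see which scheduling path is failing, the same metric can be turned into a per-type error ratio (a hand-written sketch using only the labels shown above):

   ```promql
   sum by (schedule_type) (rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
   /
   sum by (schedule_type) (rate(karmada_scheduler_schedule_attempts_total[5m]))
   ```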
4. Check the scheduler pod health:
   ```bash
   kubectl get pods -n karmada-system -l app=karmada-scheduler
   kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error\|failed\|panic"
   ```
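   To check specifically for OOM kills and crash loops, restart counts and the last termination state are usually the quickest signal (a generic sketch, not Karmada-specific):

   ```bash
   # Restart counts per scheduler pod
   kubectl get pods -n karmada-system -l app=karmada-scheduler \
     -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
   # Last termination reason (e.g. OOMKilled) appears under "Last State"
   kubectl describe pods -n karmada-system -l app=karmada-scheduler | grep -A 4 "Last State"
   ```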
5. Check for slow plugins, which can cause scheduling timeouts that surface as errors:
   ```promql
   histogram_quantile(0.99,
     sum by (le, plugin) (rate(karmada_scheduler_plugin_execution_duration_seconds_bucket[5m])))
   ```
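   The mean per-plugin latency is a useful complement to the p99; this uses only the standard _sum/_count series of the histogram above:

   ```promql
   sum by (plugin) (rate(karmada_scheduler_plugin_execution_duration_seconds_sum[5m]))
   /
   sum by (plugin) (rate(karmada_scheduler_plugin_execution_duration_seconds_count[5m]))
   ```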
6. Check estimator connectivity. Scheduler errors can occur when the scheduler cannot reach estimators.
   ```promql
   sum(rate(karmada_scheduler_estimator_estimating_request_total{result="error"}[5m]))
   /
   sum(rate(karmada_scheduler_estimator_estimating_request_total[5m]))
   ```

   ```bash
   kubectl get pods -n karmada-system -l app=karmada-scheduler-estimator
   ```
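   If the error ratio is elevated, the estimator logs usually state the cause (same pattern as the scheduler log check above):

   ```bash
   kubectl logs -n karmada-system -l app=karmada-scheduler-estimator --tail=200 | grep -i "error\|failed"
   ```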
7. Check member cluster connectivity:
   ```bash
   kubectl get clusters
   ```

   ```promql
   cluster_ready_state == 0
   ```
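   For any cluster reporting NotReady, its status conditions carry the reason (the cluster name below is a placeholder):

   ```bash
   kubectl describe cluster <member-cluster-name>
   # Or pull the conditions directly
   kubectl get cluster <member-cluster-name> -o jsonpath='{.status.conditions}'
   ```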
8. Check scheduler queue health:
   ```promql
   # Bindings currently waiting in the scheduling queue
   scheduler_pending_bindings
   # Rate of bindings entering the queue, broken down by triggering event
   sum by (event) (rate(karmada_scheduler_queue_incoming_bindings_total[5m]))
   ```
9. Check for recent changes. Were new member clusters added, PropagationPolicies modified, or scheduler plugins changed?
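   A quick way to surface recent policy changes from the control plane itself (creation-time ordering only; for in-place edits, check your GitOps history or audit logs):

   ```bash
   kubectl get propagationpolicies -A --sort-by=.metadata.creationTimestamp
   kubectl get clusterpropagationpolicies --sort-by=.metadata.creationTimestamp
   ```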
## Mitigation
| Symptom | Action |
|---|---|
| Scheduler pod crash-looping | Check logs for panics; describe pod for OOM events |
| Plugin failures in logs | Identify failing plugin; check plugin configuration |
| Estimator unreachable | Check estimator pod health and network connectivity |
| All clusters NotReady | Address cluster connectivity issues first |
| No matching clusters for placement rules | Review PropagationPolicy clusterAffinity and spreadConstraints; check cluster labels |