
Karmada Scheduler Latency SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than the sustainable rate. This SLO uses ticket-level alerts only: investigate and address during normal business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.

What This Alert Means

End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate.
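
To see what the SLI is measuring, check the end-to-end scheduling latency directly in Prometheus. A sketch, assuming the karmada_scheduler_e2e_scheduling_duration_seconds histogram exposed by the scheduler:

# P95 end-to-end scheduling latency over the last 5 minutes; compare against the SLO's latency threshold
histogram_quantile(0.95,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))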

Impact

  • Delayed workload placement - Resources take longer to be assigned to member clusters
  • Slower deployments - End-to-end deployment time increases
  • Pipeline bottleneck - Downstream propagation steps are delayed waiting on scheduling decisions
  • Slower failure recovery - Rescheduling speed is reduced when clusters fail

Possible Causes

  • Scheduler pod under resource pressure (CPU or memory)
  • Large number of clusters or resources increasing scheduling complexity
  • Scheduler estimator latency contributing to overall scheduling time
  • API server latency slowing the scheduler's reads and writes
  • Scheduler queue backlog causing queuing delays

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
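
If you want to confirm the numbers outside Grafana, the Sloth-generated recording rules can be queried directly in Prometheus. A sketch assuming Sloth's default rule and label names; the sloth_service and sloth_slo values below are placeholders for this SLO's identifiers:

# Current burn rate (values above 1 mean the budget is burning faster than allowed)
slo:current_burn_rate:ratio{sloth_service="karmada", sloth_slo="scheduler-scheduling-latency"}

# Fraction of the error budget remaining in the current period
slo:period_error_budget_remaining:ratio{sloth_service="karmada", sloth_slo="scheduler-scheduling-latency"}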

2. Identify the slow scheduling stage:

# P95 scheduling algorithm latency, broken down by scheduling step
histogram_quantile(0.95,
  sum by (le, schedule_step) (rate(karmada_scheduler_scheduling_algorithm_duration_seconds_bucket[5m])))

# P95 latency per framework extension point (e.g. Filter, Score)
histogram_quantile(0.95,
  sum by (le, extension_point) (rate(karmada_scheduler_framework_extension_point_duration_seconds_bucket[5m])))

3. Identify slow plugins:

# Top 10 plugins by P95 execution time
topk(10,
  histogram_quantile(0.95,
    sum by (le, plugin) (rate(karmada_scheduler_plugin_execution_duration_seconds_bucket[5m]))))

4. Check estimator latency. Estimator calls are often the longest part of scheduling.

# Overall P95 estimator algorithm latency
histogram_quantile(0.95,
  sum by (le) (rate(karmada_scheduler_estimator_estimating_algorithm_duration_seconds_bucket[5m])))

# The same latency broken down by estimating step, to locate the slow phase
histogram_quantile(0.95,
  sum by (le, step) (rate(karmada_scheduler_estimator_estimating_algorithm_duration_seconds_bucket[5m])))

5. Check the number of clusters being evaluated. More clusters means more work per scheduling operation.

kubectl get clusters --no-headers | wc -l

6. Check scheduler resource usage:

kubectl top pod -n karmada-system -l app=karmada-scheduler
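
Compare actual usage against the configured requests and limits. A sketch, assuming the scheduler runs as the karmada-scheduler Deployment in karmada-system (names may differ in your installation):

kubectl get deployment karmada-scheduler -n karmada-system \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'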

7. Check scheduler logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error"

8. Check for recent changes. Were new clusters registered, new PropagationPolicies added, or scheduler configuration modified? The commands below can help surface recent additions.
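
A sketch for listing recently created objects, sorted by creation time (run against the Karmada control plane; add ClusterPropagationPolicies if you use them):

# Member clusters, newest last
kubectl get clusters --sort-by=.metadata.creationTimestamp

# PropagationPolicies in all namespaces, newest last
kubectl get propagationpolicies -A --sort-by=.metadata.creationTimestamp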

Mitigation

Root Cause                 | Action
Slow estimators            | Check estimator pod health; see Scheduler Estimator Latency
Slow plugins               | Review plugin configuration; disable non-essential plugins if possible
Many clusters to evaluate  | Use clusterAffinity in PropagationPolicies to pre-filter clusters (see the sketch below)
Scheduler CPU-constrained  | Increase CPU limits
High scheduling throughput | Add scheduler replicas (leader election is supported and enabled by default)
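
For the clusterAffinity mitigation, a minimal PropagationPolicy sketch that pre-filters candidate clusters; the policy name, resource selector, and cluster names are illustrative placeholders:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation          # hypothetical policy name
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx                  # hypothetical workload
  placement:
    clusterAffinity:
      clusterNames:                # only these member clusters are considered by the scheduler
        - member1
        - member2

To add scheduler replicas, a sketch assuming the default karmada-scheduler Deployment in karmada-system:

kubectl scale deployment karmada-scheduler -n karmada-system --replicas=2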