Karmada API Server Availability SLO Error Budget Burn Rate Exceeded

Understanding This Alert

This alert fires when the SLO's error budget is being consumed faster than sustainable — page alerts indicate urgent issues requiring immediate action, while ticket alerts should be addressed during business hours. See the Reliability Engineering Guide for details on burn rates, time windows, and severity thresholds.
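For orientation, a page-severity condition is typically a multiwindow burn-rate check along these lines (a sketch only: the recording-rule names follow Sloth's conventions, but the sloth_slo value, the 99.9% objective, and the burn-rate factor are assumptions; the real thresholds come from the rules Sloth generates for this SLO):

(
  slo:sli_error:ratio_rate5m{sloth_slo="karmada-apiserver-availability"} > (14.4 * (1 - 0.999))
and
  slo:sli_error:ratio_rate1h{sloth_slo="karmada-apiserver-availability"} > (14.4 * (1 - 0.999))
)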

What This Alert Means

The Karmada API Server is returning an elevated rate of server errors (5xx) or throttling responses (429). The API server is the front door for all Karmada control plane operations.
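A minimal sketch of the error ratio this SLO measures, assuming apiserver_request_total is scraped from the Karmada API Server (the job label value is an assumption for your setup):

sum(rate(apiserver_request_total{job="karmada-apiserver", code=~"5..|429"}[5m]))
/
sum(rate(apiserver_request_total{job="karmada-apiserver"}[5m]))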

Impact

  • All control plane operations are affected - No resources can be created, updated, or deleted
  • Workload propagation is blocked - Nothing can be scheduled or deployed to member clusters
  • CLI and API access is degraded - Users cannot interact with the Karmada control plane

This is a critical issue that blocks all Karmada control plane operations.

Possible Causes

  • Karmada API Server pod is unhealthy, crash-looping, or OOMKilled
  • etcd backend is degraded or unreachable
  • High request volume causing throttling (429s)
  • Network issues between clients and the API server
  • TLS certificate issues

Remediation

1. Review the Sloth SLO dashboards in Grafana. These dashboards are your primary tool for understanding the scope and timeline of the issue:

  • SLO Details Dashboard — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
  • SLO Overview Dashboard — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
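If you prefer to query Prometheus directly, Sloth also exposes metadata recording rules; a sketch for the remaining error budget (the sloth_service and sloth_slo label values are assumptions and depend on how the SLO is defined):

slo:period_error_budget_remaining:ratio{sloth_service="karmada", sloth_slo="karmada-apiserver-availability"}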

2. Check API Server pod health:

kubectl get pods -n karmada-system -l app=karmada-apiserver
kubectl describe pod -n karmada-system <pod-name>

Look for OOMKilled, CrashLoopBackOff, or other abnormal pod states.
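To see restart counts and the last termination reason at a glance, a sketch (the column paths assume a single container per pod):

kubectl get pods -n karmada-system -l app=karmada-apiserver \
  -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_STATE:.status.containerStatuses[0].lastState.terminated.reason'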

3. Check API Server logs:

Check the Karmada API Server logs for errors using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):

kubectl logs -n karmada-system -l app=karmada-apiserver --tail=100
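If the pod is crash-looping, the current logs may be empty; the previous container's logs usually hold the crash reason (replace <pod-name> with a pod from step 2):

kubectl logs -n karmada-system <pod-name> --previous --tail=100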

4. Check the error breakdown by status code to identify the types of failures:

sum by (code) (rate(apiserver_request_total{code=~"5..|429"}[5m]))

5. Break down errors by resource and verb to identify the failing operation:

sum by (resource, verb) (rate(apiserver_request_total{code=~"5..|429"}[5m]))

6. Check in-flight request saturation:

apiserver_current_inflight_requests
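The gauge is labeled by request kind (read-only vs. mutating); comparing each against the server's --max-requests-inflight and --max-mutating-requests-inflight flags gives a rough sense of how close the server is to rejecting requests with 429s. A sketch:

sum by (request_kind) (apiserver_current_inflight_requests)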

7. Check etcd health. etcd issues are a common cause of API server 5xx errors.

kubectl get pods -n karmada-system -l component=etcd
kubectl logs -n karmada-system <etcd-pod> --tail=50

Check etcd request latency:

histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))
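Slow disk I/O is a frequent cause of etcd latency. If etcd's own metrics are scraped, WAL fsync latency is a good indicator (p99 values consistently above roughly 10ms are a warning sign):

histogram_quantile(0.99, sum by (le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))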

8. If 429s are dominant, identify top request sources:

topk(10, sum by (user_agent) (rate(apiserver_request_total[5m])))
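If your metrics do not carry a client or user-agent label, another way to see who is being throttled (assuming API Priority and Fairness is enabled) is to break down rejected requests by flow schema and priority level:

sum by (flow_schema, priority_level, reason) (rate(apiserver_flowcontrol_rejected_requests_total[5m]))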

9. Check for recent changes. Was there a recent Karmada upgrade, configuration change, new member clusters, or new PropagationPolicies?
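A couple of quick checks, as a sketch (the policy listings go through the Karmada API with your Karmada kubeconfig, so they may fail while the API server is degraded):

kubectl get events -n karmada-system --sort-by=.lastTimestamp | tail -n 20
kubectl get propagationpolicies -A --sort-by=.metadata.creationTimestamp
kubectl get clusterpropagationpolicies --sort-by=.metadata.creationTimestamp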

Mitigation

Symptom | Action
API server pod crash-looping | Check logs; describe the pod for OOM or config errors
High memory / OOM kills | Increase memory limits or reduce client QPS
High CPU saturation | Scale up API server replicas or increase CPU limits
etcd latency or errors | Investigate etcd cluster health; check disk I/O
High 429 rate | Identify and throttle abusive clients; tune --max-requests-inflight
Network issues | Check network policies and connectivity between components