Metrics-based Observability Guide
This guide explains how to use the Prometheus metrics exported by Karmada components to gain observability into your Karmada deployment. It follows observability best practices to help you monitor the health, performance, and reliability of your multi-cluster environment.
Overview
Karmada exports rich Prometheus metrics across all its components, enabling you to monitor:
- Multi-cluster scheduling performance and reliability
- Resource propagation and synchronization health
- Cluster resource utilization and availability
- Failover and eviction operations
- Autoscaling behavior
All Karmada components expose metrics endpoints that can be scraped by Prometheus. This guide demonstrates how to leverage these metrics for effective observability.
Prerequisites
Before you begin, ensure you have:
- A running Karmada instance
- Prometheus installed and configured (see Use Prometheus to monitor Karmada control plane)
- Basic familiarity with Prometheus query language (PromQL)
- (Optional) Grafana for visualization
Quick Start
Get Karmada monitoring running in 5 minutes:
- Install Prometheus - Follow the setup guide to configure Prometheus scraping
- Import Grafana dashboards - Download and import the pre-built dashboards below for instant visibility
- Set up critical alerts - Copy the essential alerting rules to get notified of issues
- Verify monitoring works - Run the validation queries to confirm metrics are flowing
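As a rough illustration of the verification step, the sketch below builds a URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`) so that a metric such as `cluster_ready_state` can be checked from a script or with any HTTP client. The server address `http://localhost:9090` and the helper name `instant_query_url` are assumptions made for this example; substitute the address of your own Prometheus server.

```python
from urllib.parse import urlencode

# Assumed Prometheus address (hypothetical); replace with your own server.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    # urlencode percent-encodes characters such as '=' and spaces in the query.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Fetching this URL returns a JSON document; a non-empty "result" list
# suggests that the metric is being scraped.
print(instant_query_url("cluster_ready_state"))
```

If the response contains an empty result list, the scrape configuration for the corresponding component is the first thing to check.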
For comprehensive monitoring guidance, continue reading the sections below.
Karmada Metrics Architecture
Karmada exports Prometheus metrics from the following components:
- karmada-apiserver
- karmada-controller-manager
- karmada-scheduler
- karmada-scheduler-estimator
- karmada-agent
- karmada-webhook
- karmada-descheduler
For a complete list of available metrics, see the Karmada Metrics Reference.
Critical Metrics and Health Signals
This section covers the most important metrics for monitoring Karmada health.
Priority Levels:
- ⚡ Critical - Monitor these first; essential for production operations
- ⚠️ Important - Add after critical metrics are in place
- 💡 Optional - Advanced monitoring and optimization
⚡ Critical Metrics (Monitor First)
cluster_ready_state
- Type: Gauge
- Labels: member_cluster
- Description: Indicates whether each member cluster is ready (1) or not ready (0).
- Why it matters: A cluster with value 0 cannot accept workload scheduling - this is the most critical health indicator.
- Example query:
# Show not-ready clusters
cluster_ready_state == 0
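To act on this signal outside the Prometheus UI, a script can read the JSON returned by an instant query for `cluster_ready_state` and list the clusters reporting 0. The following is a minimal sketch that assumes the standard instant-query response shape (`status`, `data.result`, and a `[timestamp, "value"]` pair per sample). The cluster names `member1` and `member2` and the function name are made up for illustration.

```python
import json


def not_ready_clusters(response_body: str) -> list[str]:
    """Return the member clusters whose cluster_ready_state sample is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    names = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0:
            names.append(sample["metric"].get("member_cluster", "<unknown>"))
    return sorted(names)


# Hypothetical response body for the query `cluster_ready_state`.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"member_cluster": "member1"}, "value": [1700000000, "1"]},
            {"metric": {"member_cluster": "member2"}, "value": [1700000000, "0"]},
        ],
    },
})
print(not_ready_clusters(sample_body))
```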
karmada_scheduler_schedule_attempts_total
- Type: Counter
- Labels: result (success/error), schedule_type
- Description: Count of scheduling attempts by result.
- Why it matters: High error rates mean workloads cannot be placed, blocking deployments.
- Example query:
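# Scheduling success rate, aggregated across all label values.
# Both sides are summed first: without aggregation, the division matches
# series on identical labels, so only the result="success" series would
# pair up and the ratio would always be 1.
sum(rate(karmada_scheduler_schedule_attempts_total{result="success"}[5m]))
/
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))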
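The arithmetic behind a success rate can be checked by hand from two snapshots of the counter. The sketch below computes the increase of each series between the snapshots, sums over all label combinations, and then divides. The label value `type-a` and the sample counts are hypothetical, and counter resets are ignored for simplicity (Prometheus's `rate()` compensates for them).

```python
from typing import Optional

# A counter snapshot keyed by the (result, schedule_type) label pair.
Snapshot = dict[tuple[str, str], float]


def success_ratio(earlier: Snapshot, later: Snapshot) -> Optional[float]:
    """Fraction of scheduling attempts between two snapshots that succeeded."""
    increases = {key: later[key] - earlier.get(key, 0.0) for key in later}
    total = sum(increases.values())
    if total == 0:
        return None  # no attempts in the window, so the ratio is undefined
    succeeded = sum(v for (result, _), v in increases.items() if result == "success")
    return succeeded / total


# Hypothetical counter values taken five minutes apart.
earlier = {("success", "type-a"): 100.0, ("error", "type-a"): 5.0}
later = {("success", "type-a"): 190.0, ("error", "type-a"): 15.0}
print(success_ratio(earlier, later))  # 90 successes out of 100 attempts
```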
karmada_scheduler_e2e_scheduling_duration_seconds
- Type: Histogram
- Labels: result, schedule_type
- Description: End-to-end time to schedule a resource.
- Why it matters: High latency delays application deployments.
- Example query:
# P95 scheduling latency
histogram_quantile(0.95,
rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
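It helps to remember that a histogram quantile is an estimate: Prometheus finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below is a simplified re-implementation of that idea for classic histograms, not the Prometheus source code. The bucket boundaries and counts are invented for illustration, and the lowest bucket is assumed to start at 0.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts over a 5-minute window (bounds in seconds).
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, buckets))  # falls between 0.5 s and 1.0 s
```

Because of the interpolation, the reported P95 can never be more precise than the bucket boundaries allow, which is worth keeping in mind when setting latency thresholds.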
cluster_cpu_allocated_number
- Type: Gauge
- Labels: member_cluster
- Description: Number of CPU cores currently allocated (requested) in the cluster.
- Why it matters: Used with cluster_cpu_allocatable_number to calculate CPU utilization. High utilization (above 85%) can cause scheduling failures.
- Example query:
# CPU utilization per cluster
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100
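The same calculation can be reproduced offline to sanity-check a dashboard or an alert threshold. This sketch divides allocated by allocatable cores per cluster and flags clusters above the 85% level mentioned above. The cluster names and core counts are hypothetical.

```python
def cpu_utilization_percent(
    allocated: dict[str, float], allocatable: dict[str, float]
) -> dict[str, float]:
    """Per-cluster CPU utilization as a percentage of allocatable cores."""
    utilization = {}
    for cluster, capacity in allocatable.items():
        # Skip clusters with no allocated sample or zero capacity,
        # mirroring how the PromQL division drops unmatched series.
        if cluster in allocated and capacity > 0:
            utilization[cluster] = allocated[cluster] / capacity * 100
    return utilization


# Hypothetical gauge values, in CPU cores.
allocated = {"member1": 36.0, "member2": 12.0}
allocatable = {"member1": 40.0, "member2": 48.0}

utilization = cpu_utilization_percent(allocated, allocatable)
hot_clusters = sorted(name for name, pct in utilization.items() if pct > 85)
print(utilization, hot_clusters)
```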
cluster_cpu_allocatable_number
- Type: Gauge
- Labels: member_cluster
- Description: Total number of CPU cores available for allocation in the cluster.
- Why it matters: Represents cluster CPU capacity. Used with cluster_cpu_allocated_number to calculate utilization and available capacity.
- Example query:
# Available CPU capacity by cluster
cluster_cpu_allocatable_number - cluster_cpu_allocated_number