Metrics-based Observability Guide
This guide explains how to use the Prometheus metrics exported by Karmada components to gain observability into your Karmada deployment. Following observability best practices, it will help you monitor the health, performance, and reliability of your multi-cluster environment.
Overview
Karmada exports rich Prometheus metrics across all its components, enabling you to monitor:
- Multi-cluster scheduling performance and reliability
- Resource propagation and synchronization health
- Cluster resource utilization and availability
- Failover and eviction operations
- Autoscaling behavior
All Karmada components expose metrics endpoints that can be scraped by Prometheus. This guide demonstrates how to leverage these metrics for effective observability.
Prerequisites
Before you begin, ensure you have:
- A running Karmada instance
- Prometheus installed and configured (see Use Prometheus to monitor Karmada control plane)
- Basic familiarity with Prometheus query language (PromQL)
- (Optional) Grafana for visualization
Quick Start
Get Karmada monitoring running in 5 minutes:
- Install Prometheus - Follow the setup guide to configure Prometheus scraping
- Import Grafana dashboards - Download and import the pre-built dashboards below for instant visibility
- Set up critical alerts - Copy the essential alerting rules to get notified of issues
- Verify monitoring works - Run a validation query (see the sketch after this list) to confirm metrics are flowing
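For the last step, a minimal sanity check is Prometheus's built-in up metric. This is a sketch; the karmada-.* job pattern is an assumption, so match it to your scrape configuration:

```promql
# Each Karmada scrape target should report 1 (up); 0 means scraping is failing.
# The job name pattern is hypothetical; adjust it to your Prometheus config.
up{job=~"karmada-.*"}
```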
For comprehensive monitoring guidance, continue reading the sections below.
Karmada Metrics Architecture
Karmada exports Prometheus metrics from the following components:
- karmada-apiserver
- karmada-controller-manager
- karmada-scheduler
- karmada-scheduler-estimator
- karmada-agent
- karmada-webhook
- karmada-descheduler
For a complete list of available metrics, see the Karmada Metrics Reference.
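Once scraping is configured, you can confirm which of these components are actually being collected. This sketch assumes one Prometheus job per component with a karmada- prefix; adapt the pattern to your setup:

```promql
# Number of healthy (up) scrape targets per Karmada component job.
# Job naming is an assumption; align the regex with your scrape configuration.
sum by (job) (up{job=~"karmada-.*"})
```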
Critical Metrics and Health Signals
This section covers the most important metrics for monitoring Karmada health. For the complete list of available metrics, see the Karmada Metrics Reference.
Priority Levels:
- ⚡ Critical - Monitor these first; essential for production operations
- ⚠️ Important - Add after critical metrics are in place
- 💡 Optional - Advanced monitoring and optimization
⚡ Critical Metrics (Monitor First)
cluster_ready_state
- Type: Gauge
- Labels: member_cluster
- Description: Indicates whether each member cluster is ready (1) or not ready (0).
- Why it matters: A cluster with value 0 cannot accept workload scheduling; this is the most critical health indicator.
- Example query:
```promql
# Show not-ready clusters
cluster_ready_state == 0
```
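Beyond listing unhealthy clusters, the same gauge can drive a fleet-level health ratio for a dashboard stat panel or alert. A sketch using only this metric:

```promql
# Fraction of member clusters currently ready (1.0 = all healthy).
sum(cluster_ready_state) / count(cluster_ready_state)
```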
karmada_scheduler_schedule_attempts_total
- Type: Counter
- Labels: result (success/error), schedule_type
- Description: Count of scheduling attempts by result.
- Why it matters: High error rates mean workloads cannot be placed, blocking deployments.
- Example query:
```promql
# Scheduling success rate
sum(rate(karmada_scheduler_schedule_attempts_total{result="success"}[5m]))
/
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
```
karmada_scheduler_e2e_scheduling_duration_seconds
- Type: Histogram
- Labels: result, schedule_type
- Description: End-to-end time to schedule a resource.
- Why it matters: High latency delays application deployments.
- Example query:
```promql
# P95 scheduling latency
histogram_quantile(0.95,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```
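Since this is a standard Prometheus histogram, its _sum and _count series also yield the mean latency, which complements the P95 view:

```promql
# Mean scheduling latency per result and schedule type over the last 5 minutes.
rate(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
/
rate(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])
```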
cluster_cpu_allocated_number
- Type: Gauge
- Labels: member_cluster
- Description: Number of CPU cores currently allocated (requested) in the cluster.
- Why it matters: Used with cluster_cpu_allocatable_number to calculate CPU utilization. High utilization (>85%) causes scheduling failures.
- Example query:
```promql
# CPU utilization per cluster
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100
```
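The >85% guidance above translates directly into an alert condition; the threshold here is illustrative, so tune it to your capacity planning policy:

```promql
# Clusters whose CPU allocation exceeds 85% of allocatable capacity.
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100 > 85
```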
cluster_cpu_allocatable_number
- Type: Gauge
- Labels: member_cluster
- Description: Total number of CPU cores available for allocation in the cluster.
- Why it matters: Represents cluster CPU capacity. Used with cluster_cpu_allocated_number to calculate utilization and available capacity.
- Example query:
```promql
# Available CPU capacity by cluster
cluster_cpu_allocatable_number - cluster_cpu_allocated_number
```
cluster_memory_allocated_bytes
- Type: Gauge
- Labels: member_cluster
- Description: Amount of memory in bytes currently allocated (requested) in the cluster.
- Why it matters: Used with cluster_memory_allocatable_bytes to calculate memory utilization. Memory exhaustion prevents new pod scheduling.
- Example query:
```promql
# Memory utilization per cluster
(cluster_memory_allocated_bytes / cluster_memory_allocatable_bytes) * 100
```
cluster_memory_allocatable_bytes
- Type: Gauge
- Labels: member_cluster
- Description: Total amount of memory in bytes available for allocation in the cluster.
- Why it matters: Represents cluster memory capacity. Used with cluster_memory_allocated_bytes to calculate utilization and available capacity.
- Example query:
```promql
# Available memory in gigabytes by cluster
(cluster_memory_allocatable_bytes - cluster_memory_allocated_bytes) / 1024 / 1024 / 1024
```
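To prioritize capacity work, it can help to rank clusters by remaining memory headroom. A sketch; sort() orders the instant vector ascending:

```promql
# Remaining memory headroom per cluster in GiB, smallest first.
sort((cluster_memory_allocatable_bytes - cluster_memory_allocated_bytes) / 2^30)
```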
karmada_work_sync_workload_duration_seconds
- Type: Histogram
- Labels: result
- Description: Time to sync Work objects to member clusters.
- Why it matters: Critical for end-to-end propagation latency; delays here hold up all workload deployments.
- Example query:
```promql
# P95 work sync latency
histogram_quantile(0.95,
  sum by (le) (rate(karmada_work_sync_workload_duration_seconds_bucket[5m])))
```
controller_runtime_webhook_requests_total
- Type: Counter
- Labels: code (HTTP status), webhook
- Description: Webhook request count by HTTP status code.
- Why it matters: Webhook failures block all API operations (creates, updates, deletes).
- Example query:
```promql
# Webhook error rate
sum(rate(controller_runtime_webhook_requests_total{code!~"2.."}[5m]))
/
sum(rate(controller_runtime_webhook_requests_total[5m]))
```
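When the aggregate error rate rises, a per-webhook breakdown using the webhook label listed above helps isolate the failing endpoint (only webhooks with recent errors appear in the result):

```promql
# Error ratio per webhook endpoint.
sum by (webhook) (rate(controller_runtime_webhook_requests_total{code!~"2.."}[5m]))
/
sum by (webhook) (rate(controller_runtime_webhook_requests_total[5m]))
```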
⚠️ Important Metrics (Add After Critical)
scheduler_pending_bindings
- Type: Gauge
- Labels: queue (active/backoff/unschedulable)
- Description: Number of bindings in each queue.
- Why it matters: Bindings stuck in the unschedulable queue indicate placement problems.
- Example query:
```promql
# Unschedulable bindings
scheduler_pending_bindings{queue="unschedulable"}
```
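Because the gauge can dip briefly between scheduling retries, smoothing it over a window gives a steadier alert signal. A sketch; pair it with a for: clause in an alerting rule:

```promql
# Bindings that were unschedulable at any point in the last 30 minutes.
max_over_time(scheduler_pending_bindings{queue="unschedulable"}[30m]) > 0
```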
workqueue_depth
- Type: Gauge
- Labels: name (controller name)
- Description: Number of items in each controller's work queue.
- Why it matters: A high or growing depth means controllers are falling behind.
- Example query:
```promql
# Controllers with deep queues
workqueue_depth > 100
```
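A deep queue that is also growing is more urgent than one that is merely deep; combining the gauge with its derivative captures that. The thresholds here are illustrative:

```promql
# Queues that are both deep and trending upward over the last 15 minutes.
workqueue_depth > 100 and deriv(workqueue_depth[15m]) > 0
```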
karmada_eviction_queue_depth
- Type: Gauge
- Labels: name
- Description: Number of resources awaiting eviction during failover.
- Why it matters: A growing queue means failover is delayed.
- Example query:
```promql
# Eviction queue depth
karmada_eviction_queue_depth
```
resource_match_policy_duration_seconds
- Type: Histogram
- Description: Time to match resources to PropagationPolicies.
- Why it matters: This is the first step in the propagation pipeline; delays here cascade downstream.
- Example query:
```promql
# P95 policy matching latency
histogram_quantile(0.95,
  sum by (le) (rate(resource_match_policy_duration_seconds_bucket[5m])))
```
karmada_policy_apply_attempts_total
- Type: Counter
- Labels: result (success/error)
- Description: Count of policy application attempts by result.
- Why it matters: Errors prevent resources from propagating.
- Example query:
```promql
# Policy application error rate
sum(rate(karmada_policy_apply_attempts_total{result="error"}[5m]))
/
sum(rate(karmada_policy_apply_attempts_total[5m]))
```
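Rate-based views can miss rare, intermittent failures; an absolute count over a wider window (a sketch) makes them visible:

```promql
# Total policy application errors over the past hour.
sum(increase(karmada_policy_apply_attempts_total{result="error"}[1h]))
```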
create_resource_to_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource creation operations to member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when creating resources in member clusters.
- Example query:
```promql
# Resource creation error rate by cluster
sum by (member_cluster) (
  rate(create_resource_to_cluster{result="error"}[5m])
)
```
update_resource_to_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource update operations to member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when updating resources in member clusters.
- Example query:
```promql
# Resource update error rate by kind
sum by (kind) (
  rate(update_resource_to_cluster{result="error"}[5m])
)
```
delete_resource_from_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource deletion operations from member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when deleting resources from member clusters.
- Example query:
```promql
# Resource deletion error rate by cluster
sum by (member_cluster) (
  rate(delete_resource_from_cluster{result="error"}[5m])
)
```
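Since the three operation metrics share the member_cluster label, they can be combined into a single per-cluster error view. A sketch; note that clusters missing any one of the three series are dropped by the + matching:

```promql
# Combined create/update/delete error rate per member cluster.
sum by (member_cluster) (rate(create_resource_to_cluster{result="error"}[5m]))
+ sum by (member_cluster) (rate(update_resource_to_cluster{result="error"}[5m]))
+ sum by (member_cluster) (rate(delete_resource_from_cluster{result="error"}[5m]))
```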
workqueue_retries_total
- Type: Counter
- Labels: name (controller name)
- Description: Count of retries per controller.
- Why it matters: High retry rates indicate controller failures or transient issues.
- Example query:
```promql
# Retry ratio by controller
rate(workqueue_retries_total[5m])
/
rate(workqueue_adds_total[5m])
```
💡 Optional Metrics (Advanced Monitoring)
cluster_ready_node_number
- Type: Gauge
- Labels: member_cluster
- Description: Number of nodes in Ready state in the cluster.
- Why it matters: Used with cluster_node_number to calculate the node readiness ratio. A low ratio suggests node health issues that could impact workload capacity.
- Example query:
```promql
# Clusters with node readiness below 80%
(cluster_ready_node_number / cluster_node_number) < 0.8
```
cluster_node_number
- Type: Gauge
- Labels: member_cluster
- Description: Total number of nodes in the cluster.
- Why it matters: Represents total cluster node count. Used with cluster_ready_node_number to calculate the node readiness ratio and track cluster size.
- Example query:
```promql
# Total nodes across all clusters
sum(cluster_node_number)
```
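Together, the two node gauges also support a fleet-wide readiness figure. A sketch using only these metrics:

```promql
# Overall fraction of ready nodes across all member clusters.
sum(cluster_ready_node_number) / sum(cluster_node_number)
```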
karmada_federatedhpa_process_duration_seconds
- Type: Histogram
- Labels: result
- Description: FederatedHPA processing time.
- Why it matters: Indicates how quickly HPA decisions are made. High latency can delay scaling actions.
- Example query:
```promql
# P95 FederatedHPA processing latency
histogram_quantile(0.95,
  sum by (le) (rate(karmada_federatedhpa_process_duration_seconds_bucket[5m])))
```
karmada_scheduler_estimator_estimating_request_total
- Type: Counter
- Labels: result, type
- Description: Scheduler estimator requests by result.
- Why it matters: Errors indicate the scheduler cannot accurately estimate cluster capacity, leading to poor placement decisions.
- Example query:
```promql
# Estimator error rate
sum(rate(karmada_scheduler_estimator_estimating_request_total{result="error"}[5m]))
/
sum(rate(karmada_scheduler_estimator_estimating_request_total[5m]))
```
karmada_build_info
- Type: Gauge (constant value of 1)
- Labels: git_version, git_commit, build_date, go_version, compiler, platform
- Description: Component version information.
- Why it matters: Essential for tracking component versions and correlating issues with specific releases.
- Example query:
```promql
# List all component versions
karmada_build_info

# Check for version mismatches
count(karmada_build_info) by (git_version)
```
See the Karmada Metrics Reference for detailed information on all available metrics.
Grafana Dashboards
We provide production-ready Grafana dashboards that you can download and import immediately. These dashboards provide comprehensive observability coverage for all Karmada components and operations.
Dashboard Summary
| Dashboard | Purpose | Key Metrics | Download |
|---|---|---|---|
| API Server Insights | Monitor API server performance and resource usage | Request latency, error rates, etcd performance | 📥 JSON |
| Controller Manager Insights | Track controller reconciliation and workqueue health | Reconciliation errors, queue depth, worker utilization | 📥 JSON |
| Member Cluster Insights | Monitor cluster health and capacity | Cluster readiness, CPU/memory/pod utilization | 📥 JSON |
| Scheduler Insights | Track scheduling performance and plugin behavior | Scheduling latency, success rate, queue status | 📥 JSON |
| Resource Propagation Insights | Monitor resource propagation pipeline | Policy application, work sync duration, propagation errors | 📥 JSON |
Detailed Dashboard Descriptions
1. API Server Insights
📥 Download
Purpose: Monitor Karmada API server performance, request patterns, and resource usage.
Note: Karmada is kube-native. As such, the Karmada API server is the standard Kubernetes API server. For detailed information about the metrics exposed, see the Kubernetes API Server Metrics documentation.
Dashboard Sections:
- Overview: Version, running replicas, uptime, in-flight requests, error rates (4xx/5xx)
- Request Latency: P50/P90/P99 latency overall and by verb, mutating vs read-only breakdown
- Request Mix & Hotspots: Top resources and API groups by QPS, request patterns
- Payload Sizes: Response size analysis and largest responses by resource
- Admission, Audit, Watches: Controller/webhook latency, audit events, watch events
- etcd: Request rates and latency by operation
- Go Runtime: CPU, memory, heap, GC, goroutines, threads, file descriptors
Use Case: API server performance tuning, capacity planning, troubleshooting slow requests, identifying resource hotspots
Recommended for: Platform engineering, performance optimization, API server capacity planning
2. Controller Manager Insights
📥 Download
Purpose: Deep dive into controller-manager reconciliation, workqueue health, and runtime performance.
Dashboard Sections:
- Reconciliation Overview: Error counts and percentages (30m/1h windows), reconciliations per second by result, duration percentiles (P50/P90/P99), active workers, and worker utilization
- Workqueue Health: Queue depth, items added rate, queue wait time and work duration percentiles, unfinished work, retry rates
- Admission Webhooks: Webhook request rates and latency percentiles
- Go Runtime & Process: CPU, memory, heap, goroutines, threads, file descriptors
Features:
- Controller filtering: Filter by specific controllers to focus on individual controller performance
- Multi-time window analysis: View error rates over both 30-minute and 1-hour windows
- Workqueue health monitoring: Track queue depth, latency, and retry rates for early detection of issues
Use Case: Controller debugging, identifying reconciliation bottlenecks, webhook troubleshooting, detecting stuck controllers
Recommended for: Development, performance optimization, debugging controller issues
3. Member Cluster Insights
📥 Download
Purpose: Monitor health, capacity, and utilization of member clusters in your Karmada federation.
Dashboard Sections:
- Overview: Total/Ready/Not Ready cluster counts, health percentage, readiness status grid
- Capacity (Allocatable vs Allocated): CPU, Memory, and Pod capacity trends showing allocatable vs allocated resources over time
- Utilization % (Allocated / Allocatable): CPU, Memory, and Pod utilization percentages with color-coded thresholds
- Cluster Sync Status: Average and P95 cluster sync duration tracking
Features:
- Multi-cluster filtering: View all clusters or filter to specific ones
- Color-coded utilization: Green (<70%), Yellow (70-85%), Orange (85-95%), Red (>95%)
- Real-time capacity tracking: Compare allocatable vs allocated resources for capacity planning
- Health at a glance: Quickly identify unhealthy clusters and resource constraints
Use Case: Member cluster health monitoring, capacity planning, identifying resource bottlenecks, tracking cluster sync performance
Recommended for: Multi-cluster operations, capacity planning, cluster health checks, SRE teams
4. Scheduler Insights
📥 Download
Purpose: Monitor Karmada scheduler performance, scheduling latency, algorithm efficiency, and plugin behavior.
Dashboard Sections:
- Overview: Total scheduling attempts, success/error counts, and success rate tracking
- Scheduling Throughput & Results: Scheduling attempts per second by result (success/error/unschedulable)
- E2E Scheduling Latency: End-to-end scheduling duration percentiles (P50/P90/P99) for overall performance tracking
- Algorithm & Queue: Scheduling algorithm latency, pending bindings by queue type, and queue incoming rate
- Framework Extension Points: Latency tracking for each extension point in the scheduling framework
- Plugin Execution: Plugin execution duration percentiles to identify slow plugins
- Breakdowns by Schedule Type: Detailed analysis segmented by schedule type (scale schedule, lazy activation, etc.)
Features:
- Multi-dimensional filtering: Filter by schedule type, result, event, extension point, and plugin
- Comprehensive latency analysis: Track latency at multiple stages (E2E, algorithm, framework, plugins)
- Queue health monitoring: Monitor pending bindings and queue incoming rates
- Plugin performance tracking: Identify slow or problematic scheduler plugins
Use Case: Scheduler performance optimization, troubleshooting slow scheduling, plugin development and debugging, capacity planning
Recommended for: Scheduler tuning, plugin developers, performance engineers, troubleshooting scheduling delays
5. Resource Propagation Insights
📥 Download
Purpose: Monitor the resource propagation pipeline from policy application to work synchronization across member clusters.
Dashboard Sections:
- Overview: Policy apply success/error rates, total apply rate, and overall propagation health
- Policy Matching & Applying: Average, P50, P90, and P99 latency for resource-to-policy matching operations
- Work Sync Durations: Latency percentiles for work synchronization to member clusters
Features:
- End-to-end propagation monitoring: Track the complete pipeline from policy application through work sync
- Multi-percentile latency tracking: P50/P90/P99 latency for detailed performance analysis
- Success vs error breakdown: Monitor both successful operations and error rates
Use Case: Troubleshooting propagation delays, identifying policy matching bottlenecks, monitoring cross-cluster deployment performance
Recommended for: Multi-cluster operations, policy debugging, performance optimization, propagation SLA monitoring
Installation
Prerequisites
- Grafana 8.0+ installed
- Prometheus datasource configured in Grafana
- Prometheus scraping Karmada component metrics (see setup guide)