Metrics-based Observability Guide
This guide explains how to use the Prometheus metrics exposed by Karmada components to monitor the health, performance, and reliability of your multi-cluster environment, following observability best practices.
Overview
Karmada exports rich Prometheus metrics across all its components, enabling you to monitor:
- Multi-cluster scheduling performance and reliability
- Resource propagation and synchronization health
- Cluster resource utilization and availability
- Failover and eviction operations
- Autoscaling behavior
All Karmada components expose metrics endpoints that can be scraped by Prometheus. This guide demonstrates how to leverage these metrics for effective observability.
Prerequisites
Before you begin, ensure you have:
- A running Karmada instance
- Prometheus installed and configured (see Use Prometheus to monitor Karmada control plane)
- Basic familiarity with Prometheus query language (PromQL)
- (Optional) Grafana for visualization
Quick Start
Get Karmada monitoring running in 5 minutes:
- Install Prometheus - Follow the setup guide to configure Prometheus scraping
- Import Grafana dashboards - Download and import the pre-built dashboards below for instant visibility
- Set up critical alerts - Copy the essential alerting rules to get notified of issues
- Verify monitoring works - Run the validation queries to confirm metrics are flowing
For comprehensive monitoring guidance, continue reading the sections below.
Karmada Metrics Architecture
Karmada exports Prometheus metrics from the following components:
- karmada-apiserver
- karmada-controller-manager
- karmada-scheduler
- karmada-scheduler-estimator
- karmada-agent
- karmada-webhook
- karmada-descheduler
For a complete list of available metrics, see the Karmada Metrics Reference.
Critical Metrics and Health Signals
This section covers the most important metrics for monitoring Karmada health. For the complete list of available metrics, see the Karmada Metrics Reference.
Priority Levels:
- ⚡ Critical - Monitor these first; essential for production operations
- ⚠️ Important - Add after critical metrics are in place
- 💡 Optional - Advanced monitoring and optimization
⚡ Critical Metrics (Monitor First)
cluster_ready_state
- Type: Gauge
- Labels: member_cluster
- Description: Indicates whether each member cluster is ready (1) or not ready (0).
- Why it matters: A cluster with value 0 cannot accept workload scheduling - this is the most critical health indicator.
- Example query:
# Show not-ready clusters
cluster_ready_state == 0
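To run this check from a script instead of the Prometheus UI, the query can be sent to the Prometheus HTTP API (`/api/v1/query`) and the JSON result filtered. The sketch below is illustrative only and not part of Karmada: the Prometheus address, the helper names `build_query_url` and `not_ready_clusters`, and the sample payload are all assumptions made for the example.

```python
import json
from typing import List
from urllib.parse import urlencode

# Assumed Prometheus address; replace with your own.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def not_ready_clusters(response_body: str) -> List[str]:
    """Return member_cluster names whose cluster_ready_state sample is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    clusters = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [timestamp, "value-as-string"].
        _, value = series["value"]
        if float(value) == 0:
            clusters.append(series["metric"].get("member_cluster", "<unknown>"))
    return sorted(clusters)


# Made-up payload in the shape the API returns for an instant vector.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "cluster_ready_state", "member_cluster": "member1"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "cluster_ready_state", "member_cluster": "member2"},
             "value": [1700000000.0, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(build_query_url("cluster_ready_state == 0"))
    print(not_ready_clusters(SAMPLE_RESPONSE))
```

In a real check, the body passed to `not_ready_clusters` would come from an HTTP GET on the URL built by `build_query_url`. The parsing step is kept separate so it can be tested without a running Prometheus.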
karmada_scheduler_schedule_attempts_total
- Type: Counter
- Labels: result (success/error), schedule_type
- Description: Count of scheduling attempts by result.
- Why it matters: High error rates mean workloads cannot be placed, blocking deployments.
- Example query:
# Scheduling success rate (both sides are aggregated so their label sets match)
sum(rate(karmada_scheduler_schedule_attempts_total{result="success"}[5m]))
/
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
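The ratio behind this query can also be reasoned about outside PromQL: take the increase of every counter series over the window, add up all increases, and divide the successful share by the total. The following sketch shows that arithmetic under simplifying assumptions. The snapshot values and the `example-type` label value are invented, `success_ratio` is a hypothetical helper, and counter resets (which `rate()` handles) are ignored.

```python
from typing import Dict, Optional, Tuple

# Counter values keyed by (result, schedule_type).
Snapshot = Dict[Tuple[str, str], float]


def success_ratio(before: Snapshot, after: Snapshot) -> Optional[float]:
    """Fraction of scheduling attempts that succeeded between two snapshots.

    Increases are summed across every series before dividing.
    Returns None when no attempts happened in the window.
    """
    success = 0.0
    total = 0.0
    for key, end_value in after.items():
        result, _schedule_type = key
        # A series missing from `before` is treated as starting at 0.
        increase = end_value - before.get(key, 0.0)
        total += increase
        if result == "success":
            success += increase
    if total == 0:
        return None
    return success / total


# Invented snapshots taken at the start and end of a window.
before = {("success", "example-type"): 100.0, ("error", "example-type"): 4.0}
after = {("success", "example-type"): 190.0, ("error", "example-type"): 14.0}

if __name__ == "__main__":
    print(success_ratio(before, after))
```

With these numbers, 90 of the 100 attempts in the window succeeded, so the ratio is 0.9. Returning `None` for an idle window avoids reporting a misleading 0% or 100%.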
karmada_scheduler_e2e_scheduling_duration_seconds
- Type: Histogram
- Labels: result, schedule_type
- Description: End-to-end time to schedule a resource.
- Why it matters: High latency delays application deployments.
- Example query:
# P95 scheduling latency
histogram_quantile(0.95,
rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
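When reading P95 values, keep in mind that `histogram_quantile` estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration, not the Prometheus source. The bucket boundaries and counts are invented, and details such as negative bucket bounds are left out.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must be the +Inf bucket, as in a classic
    Prometheus histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    lower, prev_count = (0.0, 0.0) if index == 0 else buckets[index - 1]
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / in_bucket


# Invented cumulative counts for bounds of 0.1s, 0.5s, 1s and +Inf.
example_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, example_buckets))
```

For the example buckets, the 95th-percentile rank lands in the 0.5s to 1s bucket, so the estimate is 0.8125 seconds. The result can never be more precise than the bucket layout, which is why a quantile that falls in the +Inf bucket is reported as the highest finite bound.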
cluster_cpu_allocated_number
- Type: Gauge
- Labels: member_cluster
- Description: Number of CPU cores currently allocated (requested) in the cluster.
- Why it matters: Used with cluster_cpu_allocatable_number to calculate CPU utilization. High utilization (>85%) causes scheduling failures.
- Example query:
# CPU utilization per cluster
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100
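As a worked example of the same arithmetic, the sketch below computes utilization per cluster from allocated and allocatable values and flags clusters above the 85% mark mentioned above. The cluster names and core counts are invented, and `clusters_over_threshold` is a hypothetical helper rather than anything Karmada provides.

```python
from typing import Dict


def clusters_over_threshold(
    allocated: Dict[str, float],
    allocatable: Dict[str, float],
    threshold_percent: float = 85.0,
) -> Dict[str, float]:
    """Return utilization (in percent) for clusters above the threshold."""
    flagged = {}
    for cluster, capacity in allocatable.items():
        if capacity <= 0:
            # No allocatable capacity reported; utilization is undefined.
            continue
        used = allocated.get(cluster, 0.0)
        percent = used / capacity * 100
        if percent > threshold_percent:
            flagged[cluster] = round(percent, 1)
    return flagged


# Invented values standing in for the two gauges, keyed by member_cluster.
cpu_allocated = {"member1": 28.0, "member2": 12.0}
cpu_allocatable = {"member1": 32.0, "member2": 48.0}

if __name__ == "__main__":
    print(clusters_over_threshold(cpu_allocated, cpu_allocatable))
```

Here member1 is at 87.5% and is flagged, while member2 at 25% is not. The same function works for the memory gauges, because only the ratio matters.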
cluster_cpu_allocatable_number
- Type: Gauge
- Labels: member_cluster
- Description: Total number of CPU cores available for allocation in the cluster.
- Why it matters: Represents cluster CPU capacity. Used with cluster_cpu_allocated_number to calculate utilization and available capacity.
- Example query:
# Available CPU capacity by cluster
cluster_cpu_allocatable_number - cluster_cpu_allocated_number
cluster_memory_allocated_bytes
- Type: Gauge
- Labels: member_cluster
- Description: Amount of memory in bytes currently allocated (requested) in the cluster.
- Why it matters: Used with cluster_memory_allocatable_bytes to calculate memory utilization. Memory exhaustion prevents new pod scheduling.
- Example query:
# Memory utilization per cluster
(cluster_memory_allocated_bytes / cluster_memory_allocatable_bytes) * 100
cluster_memory_allocatable_bytes
- Type: Gauge
- Labels: member_cluster
- Description: Total amount of memory in bytes available for allocation in the cluster.
- Why it matters: Represents cluster memory capacity. Used with cluster_memory_allocated_bytes to calculate utilization and available capacity.
- Example query:
# Available memory in GiB by cluster (bytes divided by 1024^3)
(cluster_memory_allocatable_bytes - cluster_memory_allocated_bytes) / 1024 / 1024 / 1024
karmada_work_sync_workload_duration_seconds
- Type: Histogram
- Labels: result
- Description: Time to sync Work objects to member clusters.
- Why it matters: Critical for end-to-end propagation latency - delays here delay all workload deployments.
- Example query:
# P95 work sync latency
histogram_quantile(0.95,
rate(karmada_work_sync_workload_duration_seconds_bucket[5m]))
controller_runtime_webhook_requests_total
- Type: Counter
- Labels: code (HTTP status), webhook
- Description: Webhook request count by HTTP status code.
- Why it matters: Webhook failures block all API operations (creates, updates, deletes).
- Example query:
# Webhook error rate
sum(rate(controller_runtime_webhook_requests_total{code!~"2.."}[5m]))
/
sum(rate(controller_runtime_webhook_requests_total[5m]))
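The `code!~"2.."` matcher treats every status code that is not exactly three characters starting with 2 as an error, because PromQL regular expressions are fully anchored. A minimal sketch of the same classification is shown below. The request counts are invented and `webhook_error_ratio` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Equivalent of the PromQL matcher "2..": the whole code must match.
SUCCESS_CODE = re.compile(r"2..")


def webhook_error_ratio(requests_by_code: Dict[str, float]) -> Optional[float]:
    """Share of webhook requests whose status code is not 2xx.

    Returns None when there were no requests at all.
    """
    total = sum(requests_by_code.values())
    if total == 0:
        return None
    errors = sum(
        count
        for code, count in requests_by_code.items()
        if not SUCCESS_CODE.fullmatch(code)
    )
    return errors / total


# Invented request counts per HTTP status code over some window.
example_requests = {"200": 970.0, "400": 20.0, "500": 10.0}

if __name__ == "__main__":
    print(webhook_error_ratio(example_requests))
```

With the example counts, 30 of 1000 requests are non-2xx, giving a ratio of 0.03. Note that 4xx responses count as errors here, just as they do in the query.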
⚠️ Important Metrics (Add After Critical)
scheduler_pending_bindings
- Type: Gauge
- Labels: queue (active/backoff/unschedulable)
- Description: Number of bindings in each queue.
- Why it matters: Bindings stuck in the unschedulable queue indicate placement problems.
- Example query:
# Unschedulable bindings
scheduler_pending_bindings{queue="unschedulable"}
workqueue_depth
- Type: Gauge
- Labels: name (controller name)
- Description: Number of items in each controller's work queue.
- Why it matters: A high or growing depth means controllers are falling behind.
- Example query:
# Controllers with deep queues
workqueue_depth > 100
karmada_eviction_queue_depth
- Type: Gauge
- Labels: name
- Description: Number of resources awaiting eviction during failover.
- Why it matters: Growing queue means failover is delayed.
- Example query:
# Eviction queue depth
karmada_eviction_queue_depth
resource_match_policy_duration_seconds
- Type: Histogram
- Description: Time to match resources to PropagationPolicies.
- Why it matters: First step in propagation pipeline - delays here cascade downstream.
- Example query:
# P95 policy matching latency
histogram_quantile(0.95,
rate(resource_match_policy_duration_seconds_bucket[5m]))
karmada_policy_apply_attempts_total
- Type: Counter
- Labels: result (success/error)
- Description: Count of policy application attempts by result.
- Why it matters: Errors prevent resources from propagating.
- Example query:
# Policy application error rate (both sides are aggregated so their label sets match)
sum(rate(karmada_policy_apply_attempts_total{result="error"}[5m]))
/
sum(rate(karmada_policy_apply_attempts_total[5m]))
create_resource_to_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource creation operations to member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when creating resources in member clusters.
- Example query:
# Resource creation error rate by cluster
sum by (member_cluster) (
rate(create_resource_to_cluster{result="error"}[5m])
)
update_resource_to_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource update operations to member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when updating resources in member clusters.
- Example query:
# Resource update error rate by kind
sum by (kind) (
rate(update_resource_to_cluster{result="error"}[5m])
)
delete_resource_from_cluster
- Type: Counter
- Labels: result, apiversion, kind, member_cluster
- Description: Count of resource deletion operations from member clusters by result.
- Why it matters: Errors indicate connectivity or permission issues when deleting resources from member clusters.
- Example query:
# Resource deletion error rate by cluster
sum by (member_cluster) (
rate(delete_resource_from_cluster{result="error"}[5m])
)
workqueue_retries_total
- Type: Counter
- Labels: name (controller name)
- Description: Count of retries per controller.
- Why it matters: High retry rates indicate controller failures or transient issues.
- Example query:
# Retry ratio by controller
rate(workqueue_retries_total[5m])
/
rate(workqueue_adds_total[5m])