Karmada Metrics Reference
This document provides a comprehensive reference of all Prometheus metrics exported by Karmada components. For guidance on using these metrics for monitoring and observability, see the Karmada Observability Guide.
Karmada Metrics Architecture
Karmada exports Prometheus metrics from the following components:
- karmada-apiserver
- karmada-controller-manager
- karmada-scheduler
- karmada-scheduler-estimator
- karmada-agent
- karmada-webhook
- karmada-descheduler
Metric Conventions
Naming
- Karmada-specific metrics: Prefixed with karmada_ (e.g., karmada_scheduler_schedule_attempts_total)
- Subsystem metrics: Include the subsystem name (e.g., karmada_scheduler_estimator_*)
- Standard controller-runtime metrics: Use standard prefixes (e.g., workqueue_*, controller_runtime_*)
- Units: Time metrics use _seconds, byte metrics use _bytes
- Base units: CPU in cores, memory in bytes, duration in seconds
Types
- Counter: Monotonically increasing value (e.g., total requests, errors). Use rate() or increase() in queries.
- Gauge: Value that can go up or down (e.g., queue depth, resource utilization). Use directly in queries.
- Histogram: Observations bucketed by value (e.g., latency). Use histogram_quantile() for percentiles.
Stability Levels
Karmada follows Kubernetes metric stability conventions:
- STABLE: No explicit stability annotation in source code. Guaranteed consistent naming and semantics across versions. Breaking changes require a minimum of two release cycles' notice. Safe for production use.
- ALPHA: Explicitly marked with StabilityLevel: metrics.ALPHA in source code. May change or be removed without notice. Use with caution in production.
Metrics by Category
Scheduler Metrics
These metrics track scheduling operations and performance.
karmada_scheduler_schedule_attempts_total
- Component: karmada-scheduler
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
schedule_type: ResourceBinding or ClusterResourceBinding
- Description: Number of attempts to schedule a ResourceBinding or ClusterResourceBinding. High error rate indicates scheduling failures preventing workload placement.
- Example queries:
# Scheduling success rate
rate(karmada_scheduler_schedule_attempts_total{result="success"}[5m])
/
rate(karmada_scheduler_schedule_attempts_total[5m])
# Scheduling error rate
rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
/
rate(karmada_scheduler_schedule_attempts_total[5m])
karmada_scheduler_e2e_scheduling_duration_seconds
- Component: karmada-scheduler
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
schedule_type: ResourceBinding or ClusterResourceBinding
- Description: End-to-end scheduling latency from when binding enters queue to placement decision. Use to monitor overall scheduling performance.
- Buckets: Exponential from 0.001s to ~32s
- Example queries:
# P95 scheduling latency
histogram_quantile(0.95, rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
# P99 scheduling latency
histogram_quantile(0.99, rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
karmada_scheduler_scheduling_algorithm_duration_seconds
- Component: karmada-scheduler
- Type: Histogram
- Stability: STABLE
- Labels:
schedule_step: Scheduling step (Filter, Score, Select, AssignReplicas, etc.)
- Description: Latency for each step of the scheduling algorithm. Use to identify slow steps in the scheduling pipeline.
- Buckets: Exponential from 0.001s to ~32s
- Example query:
# Latency by step (P95)
histogram_quantile(0.95, sum by (le, schedule_step) (rate(karmada_scheduler_scheduling_algorithm_duration_seconds_bucket[5m])))
karmada_scheduler_queue_incoming_bindings_total
- Component: karmada-scheduler
- Type: Counter
- Stability: STABLE
- Labels:
event: Add, Update, or Delete
- Description: Number of bindings added to scheduling queues by event type. Use to understand workload churn and queue activity.
- Example query:
# Queue incoming rate by event
sum by (event) (rate(karmada_scheduler_queue_incoming_bindings_total[5m]))
karmada_scheduler_framework_extension_point_duration_seconds
- Component: karmada-scheduler
- Type: Histogram
- Stability: STABLE
- Labels:
extension_point: Extension point name (Filter, Score, etc.)
result: success or error
- Description: Latency for running all plugins at a specific extension point. Use to identify slow extension points.
- Buckets: Exponential from 0.0001s to ~4s
- Example query:
# Extension point latency (P95)
histogram_quantile(0.95, sum by (le, extension_point) (rate(karmada_scheduler_framework_extension_point_duration_seconds_bucket[5m])))
karmada_scheduler_plugin_execution_duration_seconds
- Component: karmada-scheduler
- Type: Histogram
- Stability: STABLE
- Labels:
plugin: Plugin name
extension_point: Extension point name
result: success or error
- Description: Duration for executing a specific plugin at an extension point. Use to identify slow plugins.
- Buckets: Exponential from 0.00001s to ~0.3s
- Example query:
# Top 10 slowest plugins (P95)
topk(10, histogram_quantile(0.95, sum by (le, plugin) (rate(karmada_scheduler_plugin_execution_duration_seconds_bucket[5m]))))
scheduler_pending_bindings
- Component: karmada-scheduler
- Type: Gauge
- Stability: ALPHA
- Labels:
queue: active, backoff, or unschedulable
- Description: Number of pending bindings in each scheduling queue. Bindings in the unschedulable queue indicate resources that cannot be placed due to constraints.
- Example query:
# Unschedulable bindings
scheduler_pending_bindings{queue="unschedulable"}
Scheduler Estimator Metrics
These metrics track cluster capacity estimation operations.
karmada_scheduler_estimator_estimating_request_total
- Component: karmada-scheduler-estimator
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
type: Estimation type
- Description: Number of scheduler estimator requests. High error rate indicates inability to assess cluster capacity, leading to poor placement decisions.
- Example query:
# Estimator error rate
rate(karmada_scheduler_estimator_estimating_request_total{result="error"}[5m])
/
rate(karmada_scheduler_estimator_estimating_request_total[5m])
karmada_scheduler_estimator_estimating_algorithm_duration_seconds
- Component: karmada-scheduler-estimator
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
type: Estimation type
step: Algorithm step
- Description: Latency for each step of the estimation algorithm. Impacts overall scheduling latency.
- Buckets: Exponential from 0.001s to ~1s
- Example query:
# P95 estimation latency by step
histogram_quantile(0.95, sum by (le, step) (rate(karmada_scheduler_estimator_estimating_algorithm_duration_seconds_bucket[5m])))
karmada_scheduler_estimator_estimating_plugin_extension_point_duration_seconds
- Component: karmada-scheduler-estimator
- Type: Histogram
- Stability: STABLE
- Labels:
estimating_plugin_extension_point: Extension point name
- Description: Latency for running all plugins at a specific estimator extension point.
- Buckets: Exponential from 0.0001s to ~4s
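- Example query (illustrative; uses only the label documented above):
# P95 estimator extension point latency
histogram_quantile(0.95, sum by (le, estimating_plugin_extension_point) (rate(karmada_scheduler_estimator_estimating_plugin_extension_point_duration_seconds_bucket[5m])))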
karmada_scheduler_estimator_estimating_plugin_execution_duration_seconds
- Component: karmada-scheduler-estimator
- Type: Histogram
- Stability: STABLE
- Labels:
estimating_plugin: Plugin name
estimating_plugin_extension_point: Extension point name
- Description: Duration for executing a specific plugin at an estimator extension point.
- Buckets: Exponential from 0.00001s to ~0.3s
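- Example query (illustrative; uses only the labels documented above):
# Top 5 slowest estimator plugins (P95)
topk(5, histogram_quantile(0.95, sum by (le, estimating_plugin) (rate(karmada_scheduler_estimator_estimating_plugin_execution_duration_seconds_bucket[5m]))))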
Cluster Health Metrics
These metrics track member cluster state and resource capacity.
cluster_ready_state
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Cluster readiness state. Value of 1 indicates ready, 0 indicates not ready. Critical metric - value of 0 blocks all scheduling and workload operations to that cluster.
- Example query:
# Not-ready clusters
cluster_ready_state == 0
cluster_node_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Total number of nodes in the cluster. Use with cluster_ready_node_number to calculate the node health ratio.
cluster_ready_node_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Number of ready nodes in the cluster. Low ratio vs total nodes indicates node health issues.
- Example query:
# Node readiness ratio per cluster
cluster_ready_node_number / cluster_node_number
# Clusters with less than 80% ready nodes
(cluster_ready_node_number / cluster_node_number) < 0.8
cluster_memory_allocatable_bytes
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Allocatable cluster memory in bytes. Represents total memory capacity available for scheduling.
cluster_cpu_allocatable_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Allocatable CPU in cores. Represents total CPU capacity available for scheduling.
cluster_pod_allocatable_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Allocatable pod capacity. Represents maximum number of pods that can be scheduled.
cluster_memory_allocated_bytes
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Allocated cluster memory in bytes. Sum of memory requests for all scheduled pods.
cluster_cpu_allocated_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Allocated CPU in cores. Sum of CPU requests for all scheduled pods.
- Example query:
# CPU utilization by cluster
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100
# Clusters with high CPU utilization (>85%)
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) > 0.85
cluster_pod_allocated_number
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Number of allocated pods in the cluster. Count of pods currently scheduled.
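- Example query (illustrative, combining with cluster_pod_allocatable_number above):
# Pod capacity utilization by cluster
(cluster_pod_allocated_number / cluster_pod_allocatable_number) * 100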
cluster_sync_status_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
member_cluster: Cluster name
- Description: Duration to sync cluster status once. High latency means delayed detection of cluster state changes.
- Example query:
# P95 cluster status sync latency
histogram_quantile(0.95, sum by (le, member_cluster) (rate(cluster_sync_status_duration_seconds_bucket[5m])))
Resource Propagation Metrics
These metrics track the propagation of resources from Karmada control plane to member clusters.
Propagation Pipeline: Policy Match → Policy Apply → Binding Sync → Work Sync → Resource Create/Update
resource_match_policy_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels: None
- Description: Duration to find a matched PropagationPolicy or ClusterPropagationPolicy for a resource template. First step in propagation pipeline - delays here impact the entire process.
- Buckets: Exponential from 0.001s to ~4s
- Example query:
# P95 policy matching latency
histogram_quantile(0.95, rate(resource_match_policy_duration_seconds_bucket[5m]))
resource_apply_policy_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to apply a PropagationPolicy or ClusterPropagationPolicy to a resource template.
- Buckets: Exponential from 0.001s to ~4s
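- Example query (illustrative):
# P95 policy apply latency
histogram_quantile(0.95, rate(resource_apply_policy_duration_seconds_bucket[5m]))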
karmada_policy_apply_attempts_total
- Component: karmada-controller-manager
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
- Description: Number of attempts to apply a PropagationPolicy or ClusterPropagationPolicy. High error rate indicates policy application failures preventing propagation.
- Example query:
# Policy application error rate
rate(karmada_policy_apply_attempts_total{result="error"}[5m])
/
rate(karmada_policy_apply_attempts_total[5m])
policy_preemption_total
- Component: karmada-controller-manager
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
- Description: Number of preemptions where a resource is taken over by a higher-priority propagation policy. High error rate indicates policy conflict issues.
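- Example query (illustrative; same error-ratio pattern used elsewhere in this document):
# Preemption error rate
rate(policy_preemption_total{result="error"}[5m])
/
rate(policy_preemption_total[5m])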
karmada_binding_sync_work_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to sync Work objects for a ResourceBinding or ClusterResourceBinding. Second step in propagation - converts bindings to executable work.
- Buckets: Exponential from 0.001s to ~4s
- Example query:
# P95 binding-to-work sync latency
histogram_quantile(0.95, rate(karmada_binding_sync_work_duration_seconds_bucket[5m]))
karmada_work_sync_workload_duration_seconds
- Component: karmada-controller-manager, karmada-agent
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to sync the workload to a target member cluster. Critical metric - measures end-to-end sync time from Work object to cluster.
- Buckets: Exponential from 0.001s to ~4s
- Example query:
# P95 work-to-cluster sync latency
histogram_quantile(0.95, rate(karmada_work_sync_workload_duration_seconds_bucket[5m]))
karmada_create_resource_to_cluster
- Component: karmada-controller-manager, karmada-agent
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
apiversion: Resource API version
kind: Resource kind
member_cluster: Target cluster name
recreate: true or false (indicates recreating a previously deleted resource)
- Description: Number of resource creation operations to member clusters. Errors indicate connectivity or permission issues.
- Example query:
# Resource creation error rate by cluster
sum by (member_cluster) (
rate(karmada_create_resource_to_cluster{result="error"}[5m])
)
karmada_update_resource_to_cluster
- Component: karmada-controller-manager, karmada-agent
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
apiversion: Resource API version
kind: Resource kind
member_cluster: Target cluster name
operationResult: Type of update performed
- Description: Number of resource update operations to member clusters.
- Example query:
# Resource update error rate by kind
sum by (kind) (
rate(karmada_update_resource_to_cluster{result="error"}[5m])
)
karmada_delete_resource_from_cluster
- Component: karmada-controller-manager, karmada-agent
- Type: Counter
- Stability: STABLE
- Labels:
result: success or error
apiversion: Resource API version
kind: Resource kind
member_cluster: Target cluster name
- Description: Number of resource deletion operations from member clusters. Errors indicate issues removing resources during cleanup or migration.
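- Example query (illustrative; mirrors the creation error-rate query above):
# Resource deletion error rate by cluster
sum by (member_cluster) (
rate(karmada_delete_resource_from_cluster{result="error"}[5m])
)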
Failover and Eviction Metrics
These metrics track cluster failure handling and resource eviction.
karmada_eviction_queue_depth
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
name: Queue name
- Description: Current depth of the eviction queue. High or growing values indicate failover processing backlog.
- Example query:
# Eviction queue depth
karmada_eviction_queue_depth
# Queue growth rate (positive = backlog)
deriv(karmada_eviction_queue_depth[5m])
karmada_eviction_kind_total
- Component: karmada-controller-manager
- Type: Gauge
- Stability: STABLE
- Labels:
member_cluster: Source cluster name
resource_kind: Resource kind
- Description: Number of resources in eviction queue by kind and source cluster. Use to identify which resource types are being evicted.
- Example query:
# Resources by kind across all clusters
sum by (resource_kind) (karmada_eviction_kind_total)
karmada_eviction_processing_latency_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
name: Queue name
- Description: Latency for processing an eviction task (moving resource from failed cluster to healthy cluster). High latency delays failover.
- Buckets: Exponential from 0.001s to ~32s
- Example query:
# P95 eviction processing latency
histogram_quantile(0.95, rate(karmada_eviction_processing_latency_seconds_bucket[5m]))
karmada_eviction_processing_total
- Component: karmada-controller-manager
- Type: Counter
- Stability: STABLE
- Labels:
name: Queue name
result: success or error
- Description: Total evictions processed. High error rate indicates issues rescheduling workloads during cluster failures.
- Example query:
# Eviction success rate
rate(karmada_eviction_processing_total{result="success"}[5m])
/
rate(karmada_eviction_processing_total[5m])
Autoscaling Metrics
These metrics track FederatedHPA and CronFederatedHPA controllers.
karmada_federatedhpa_process_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to process a FederatedHPA object (one reconciliation loop). High latency delays scaling decisions.
- Buckets: Exponential from 0.01s to ~40s
- Example query:
# P95 FederatedHPA processing latency
histogram_quantile(0.95, rate(karmada_federatedhpa_process_duration_seconds_bucket[5m]))
karmada_federatedhpa_pull_metrics_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
metricType: Metric type
- Description: Duration for FederatedHPA to pull metrics from member clusters. High latency or errors impact scaling responsiveness.
- Buckets: Exponential from 0.01s to ~40s
- Example query:
# P95 metrics pull latency by type
histogram_quantile(0.95, sum by (le, metricType) (rate(karmada_federatedhpa_pull_metrics_duration_seconds_bucket[5m])))
karmada_cronfederatedhpa_process_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to process a CronFederatedHPA object.
- Buckets: Exponential from 0.001s to ~4s
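- Example query (illustrative):
# P95 CronFederatedHPA processing latency
histogram_quantile(0.95, rate(karmada_cronfederatedhpa_process_duration_seconds_bucket[5m]))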
karmada_cronfederatedhpa_rule_process_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: STABLE
- Labels:
result: success or error
- Description: Duration to process a single CronFederatedHPA rule. Use to identify slow or problematic cron rules.
- Buckets: Exponential from 0.001s to ~4s
HPA Controller Metrics
These metrics are from the horizontal pod autoscaler controller (based on Kubernetes upstream).
horizontal_pod_autoscaler_controller_reconciliations_total
- Component: karmada-controller-manager
- Type: Counter
- Stability: ALPHA
- Labels:
action: scale_up, scale_down, or none
error: spec, internal, or none
- Description: Number of HPA reconciliations. action indicates the scaling decision, error indicates the failure type. Spec errors are configuration issues; internal errors are runtime failures.
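- Example query (illustrative; uses the error label values documented above):
# HPA reconciliation error rate
rate(horizontal_pod_autoscaler_controller_reconciliations_total{error!="none"}[5m])
/
rate(horizontal_pod_autoscaler_controller_reconciliations_total[5m])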
horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: ALPHA
- Labels:
action: scale_up, scale_down, or none
error: spec, internal, or none
- Description: Time for one HPA reconciliation loop. Critical for understanding HPA responsiveness.
- Buckets: Exponential from 0.001s to ~32s
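- Example query (illustrative):
# P99 HPA reconciliation latency
histogram_quantile(0.99, rate(horizontal_pod_autoscaler_controller_reconciliation_duration_seconds_bucket[5m]))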
horizontal_pod_autoscaler_controller_metric_computation_total
- Component: karmada-controller-manager
- Type: Counter
- Stability: ALPHA
- Labels:
action: Scaling action
error: Error type
metric_type: Resource, Pods, Object, External, or ContainerResource
- Description: Number of metric computations. Use to identify problematic metric types.
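- Example query (illustrative; assumes error="none" marks a successful computation, as in the reconciliation metrics above):
# Metric computation errors by metric type
sum by (metric_type) (
rate(horizontal_pod_autoscaler_controller_metric_computation_total{error!="none"}[5m])
)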
horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
- Component: karmada-controller-manager
- Type: Histogram
- Stability: ALPHA
- Labels:
action: Scaling action
error: Error type
metric_type: Metric type
- Description: Time to compute one metric. Slow metrics delay scaling.
- Buckets: Exponential from 0.001s to ~32s
- Example query:
# P95 metric computation latency by type
histogram_quantile(0.95, sum by (le, metric_type) (rate(horizontal_pod_autoscaler_controller_metric_computation_duration_seconds_bucket[5m])))
Controller Workqueue Metrics
Standard controller-runtime workqueue metrics available on all Karmada controllers.
workqueue_adds_total
- Component: All controllers
- Type: Counter
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: Total items added to the workqueue. The rate indicates incoming workload.
- Example query:
# Workqueue add rate
sum by (name) (rate(workqueue_adds_total[5m]))
workqueue_depth
- Component: All controllers
- Type: Gauge
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: Current workqueue depth. High values indicate the controller cannot keep up. Values >100 warrant investigation.
- Example query:
# Controllers with deep queues
workqueue_depth > 100
workqueue_queue_duration_seconds
- Component: All controllers
- Type: Histogram
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: How long items stay in queue before processing. High P95/P99 indicates controller saturation.
- Buckets: Exponential from 1e-8s to ~1000s
- Example query:
# P95 queue wait time by controller
histogram_quantile(0.95, sum by (le, name) (rate(workqueue_queue_duration_seconds_bucket[5m])))
workqueue_work_duration_seconds
- Component: All controllers
- Type: Histogram
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: How long processing one item takes. Identifies slow reconciliation loops.
- Buckets: Exponential from 1e-8s to ~1000s
- Example query:
# P95 reconciliation duration by controller
histogram_quantile(0.95, sum by (le, name) (rate(workqueue_work_duration_seconds_bucket[5m])))
workqueue_retries_total
- Component: All controllers
- Type: Counter
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: Total retries processed by the workqueue. High retry rate indicates transient failures or bugs. Retry rate >10% of adds rate is concerning.
- Example query:
# Retry ratio (should be <10%)
rate(workqueue_retries_total[5m]) / rate(workqueue_adds_total[5m])
workqueue_unfinished_work_seconds
- Component: All controllers
- Type: Gauge
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: How long the oldest unfinished work item has been in progress. High values indicate stuck reconciliations.
- Example query:
# Controllers with stuck items (>60 seconds)
workqueue_unfinished_work_seconds > 60
workqueue_longest_running_processor_seconds
- Component: All controllers
- Type: Gauge
- Stability: STABLE
- Labels:
name: Controller/queue name
- Description: How long the longest running processor has been running. Identifies stuck worker goroutines.
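- Example query (illustrative; the 5-minute threshold is a suggested starting point):
# Worker goroutines running longer than 5 minutes
workqueue_longest_running_processor_seconds > 300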
Webhook Metrics
These metrics track admission webhook operations.
controller_runtime_webhook_requests_total
- Component: karmada-webhook
- Type: Counter
- Stability: STABLE
- Labels:
webhook: Webhook name
code: HTTP status code
- Description: Total requests processed by each webhook. Non-2xx codes indicate webhook failures that can block API operations. Critical to monitor.
- Example query:
# Webhook error rate
sum(rate(controller_runtime_webhook_requests_total{code!~"2.."}[5m]))
/
sum(rate(controller_runtime_webhook_requests_total[5m]))
# Failing webhooks by name
sum by (webhook) (
rate(controller_runtime_webhook_requests_total{code!~"2.."}[5m])
)
Pool Metrics
These metrics track object pool usage for reducing memory allocations.
pool_get_operation_total
- Component: karmada-controller-manager, karmada-agent
- Type: Counter
- Stability: STABLE
- Labels:
name: Pool name
from: pool (cache hit) or new (cache miss)
- Description: Total get operations from object pools. from=pool indicates a cache hit, from=new indicates a new allocation. A high miss rate may indicate undersized pools.
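- Example query (illustrative; uses only the labels documented above):
# Pool miss ratio (new allocations vs all gets)
sum by (name) (rate(pool_get_operation_total{from="new"}[5m]))
/
sum by (name) (rate(pool_get_operation_total[5m]))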
pool_put_operation_total
- Component: karmada-controller-manager, karmada-agent
- Type: Counter
- Stability: STABLE
- Labels:
name: Pool name
to: pool (returned to pool) or drop (pool full)
- Description: Total put operations to object pools. to=drop indicates the object was dropped because the pool is full.
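- Example query (illustrative; mirrors the miss-ratio query above):
# Pool drop ratio (objects dropped because the pool was full)
sum by (name) (rate(pool_put_operation_total{to="drop"}[5m]))
/
sum by (name) (rate(pool_put_operation_total[5m]))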
Build Information
This metric provides version and build information for all components.
karmada_build_info
- Component: All components
- Type: Gauge
- Stability: STABLE
- Labels:
git_version: Version string
git_commit: Full commit hash
git_short_commit: Short commit hash
git_tree_state: Git tree state
build_date: Build date
go_version: Go version
compiler: Compiler used
platform: Platform (OS/arch)
- Description: Constant gauge (value=1) with build metadata as labels. Essential for tracking component versions and correlating issues with releases.
- Example queries:
# List all component versions
karmada_build_info
# Count components by version
count by (git_version) (karmada_build_info)
# Check for version mismatches
count(count by (git_version) (karmada_build_info)) > 1
Component Metrics Summary
Quick reference for which component emits which metrics:
karmada-controller-manager
- All cluster_* metrics
- All resource_*, policy_*, binding_*, work_* propagation metrics
- All *_resource_to_cluster / *_resource_from_cluster metrics
- All eviction_* metrics
- All federatedhpa_* and cronfederatedhpa_* metrics
- All horizontal_pod_autoscaler_* metrics
- pool_* metrics
- workqueue_* metrics
- karmada_build_info
karmada-scheduler
- All karmada_scheduler_* metrics (except estimator)
- scheduler_pending_bindings
- workqueue_* metrics
- karmada_build_info
karmada-scheduler-estimator
- All karmada_scheduler_estimator_* metrics
- karmada_build_info
karmada-agent
- Subset of cluster_* metrics (for the local cluster only)
- karmada_work_sync_workload_duration_seconds
- *_resource_to_cluster / *_resource_from_cluster metrics
- pool_* metrics
- workqueue_* metrics
- karmada_build_info
karmada-webhook
- controller_runtime_webhook_requests_total
- karmada_build_info
karmada-descheduler
- workqueue_* metrics
- karmada_build_info
Querying Best Practices
Rate Calculations
Use appropriate time windows for rate() and increase():
# Good: 5-minute rate for dashboards
rate(karmada_scheduler_schedule_attempts_total[5m])
# Good: 1-minute rate for alerting (faster response)
rate(karmada_scheduler_schedule_attempts_total[1m])
# Bad: Too short, susceptible to noise
rate(karmada_scheduler_schedule_attempts_total[30s])
Histogram Quantiles
Use appropriate quantiles for different purposes:
# P50: Median latency (typical case)
histogram_quantile(0.50, rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
# P95: Good for understanding user experience
histogram_quantile(0.95, rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
# P99: Captures tail latency (worst case)
histogram_quantile(0.99, rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
Error Rates
Calculate error rates as ratio of errors to total:
# Scheduling error rate
rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
/
rate(karmada_scheduler_schedule_attempts_total[5m])
# Workqueue retry rate
rate(workqueue_retries_total[5m])
/
rate(workqueue_adds_total[5m])
Resource Utilization
Calculate utilization as percentage:
# CPU utilization by cluster
(cluster_cpu_allocated_number / cluster_cpu_allocatable_number) * 100
# Memory utilization
(cluster_memory_allocated_bytes / cluster_memory_allocatable_bytes) * 100
Label Cardinality
Be mindful of label cardinality in queries:
- member_cluster: Bounded by the number of clusters (typically <100)
- kind, apiversion: Bounded by the resource types used
- name (workqueue): Bounded by the number of controllers (~20-30)
- plugin, webhook: Bounded by installed plugins/webhooks
High-cardinality labels like namespace or individual resource names are intentionally excluded.
Deprecated Labels
The following labels are deprecated and will be removed in v1.18:
| Old Label | New Label | Affected Metrics | Deprecated In |
|---|---|---|---|
| cluster | member_cluster | All cluster metrics | v1.16 |
| cluster_name | member_cluster | All cluster metrics | v1.16 |
Migration: Update all queries, dashboards, and alerts to use member_cluster.
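For example (illustrative; member1 is a placeholder cluster name):
# Before (deprecated label)
cluster_ready_state{cluster="member1"}
# After (current label)
cluster_ready_state{member_cluster="member1"}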
Related Documentation
- Karmada Observability Guide - Complete monitoring setup and best practices
- PromQL Query Reference - Ready-to-use queries
- Working with Prometheus - Prometheus setup guide