Schedule based on Cluster Resource Modeling
Overview
When scheduling an application to a specific cluster, the resource status of the destination cluster is a factor that cannot be ignored. When a cluster does not have enough resources to run a given replica, we want the scheduler to avoid placing the replica there as far as possible. This article focuses on how Karmada performs scheduling based on cluster resource modeling.
Cluster Resource Modeling
During the scheduling process, karmada-scheduler makes decisions based on a number of factors, one of which is the state of the cluster's resources. Karmada currently has two ways of scheduling based on cluster resources: general cluster modeling and customized cluster modeling.
General Cluster Modeling
Start to use General Cluster Resource Models
To expose the state of a cluster's resources, we introduced ResourceSummary to the Cluster API.
For example:
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: 950m
    memory: 290Mi
    pods: "11"
The example above shows both the allocatable and the allocated resources of the cluster.
Schedule based on General Cluster Resource Models
Assume that there is a Pod which will be scheduled to one of the member clusters managed by Karmada.
Member1:
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: 950m
    memory: 290Mi
    pods: "11"
Member2:
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: "2"
    memory: 290Mi
    pods: "11"
Member3:
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: "2"
    memory: 290Mi
    pods: "110"
Assume that the Pod requests 500m CPU. Member1 and Member2 have sufficient resources to run this replica, but Member3 has no Pod quota left (all 110 allocatable Pods are already allocated). Considering the amount of available resources, the scheduler prefers to schedule the Pod to member1.
Cluster | member1 | member2 | member3 |
---|---|---|---|
AvailableReplicas | (4 - 0.95) / 0.5 = 6.1 | (4 - 2) / 0.5 = 4 | 0 |
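The arithmetic in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Karmada's implementation: the function name and the dictionary layout are hypothetical, CPU is expressed in millicores, every replica is assumed to consume one Pod slot, and results are rounded down to whole replicas (so member1 yields 6 rather than 6.1).

```python
def available_replicas(allocatable, allocated, request):
    """Estimate how many replicas of `request` fit into a cluster's free resources.

    All arguments are dicts keyed by resource name. CPU is in millicores and
    "pods" is a plain count. (Hypothetical helper, not a Karmada API.)
    """
    # Assumption: every replica also consumes exactly one Pod slot.
    per_replica = dict(request, pods=1)
    estimates = []
    for name, needed in per_replica.items():
        free = allocatable.get(name, 0) - allocated.get(name, 0)
        estimates.append(max(free, 0) // needed)  # whole replicas only
    # The scarcest resource limits the number of replicas.
    return min(estimates)


# Allocatable and allocated values copied from the three resource summaries.
clusters = {
    "member1": ({"cpu": 4000, "pods": 110}, {"cpu": 950, "pods": 11}),
    "member2": ({"cpu": 4000, "pods": 110}, {"cpu": 2000, "pods": 11}),
    "member3": ({"cpu": 4000, "pods": 110}, {"cpu": 2000, "pods": 110}),
}
request = {"cpu": 500}

estimates = {
    name: available_replicas(allocatable, allocated, request)
    for name, (allocatable, allocated) in clusters.items()
}
best = max(estimates, key=estimates.get)
print(estimates, best)
```

Member3 drops to zero because its Pod quota is exhausted, even though it has as much free CPU as member2.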
Customized Cluster Modeling
Background
ResourceSummary describes the overall available resources of the cluster. However, ResourceSummary is not precise enough: it mechanically sums the resources on all nodes but ignores how those resources are fragmented across the nodes. For example, suppose a cluster with 2000 nodes has only 1 CPU core left on each node. From the ResourceSummary, we get that there are 2000 CPU cores left in the cluster, but in fact this cluster cannot run any pod that requires more than 1 CPU core.
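The gap between the two views can be made concrete with a small sketch. Both function names are hypothetical and the code only models the 2000-node example, with CPU in millicores; it is not how Karmada computes either figure.

```python
def summary_estimate(free_cpu_per_node, cpu_request):
    """Estimate replicas from the cluster-wide total, as a summed view would."""
    return sum(free_cpu_per_node) // cpu_request


def per_node_estimate(free_cpu_per_node, cpu_request):
    """Estimate replicas by fitting whole replicas onto individual nodes."""
    return sum(free // cpu_request for free in free_cpu_per_node)


# 2000 nodes, each with 1 core (1000m) of CPU left.
nodes = [1000] * 2000

# A pod that needs 2 cores looks schedulable from the summed total,
# but no single node can actually host it.
print(summary_estimate(nodes, 2000))   # based on 2000 cores in total
print(per_node_estimate(nodes, 2000))  # based on what each node can hold
```

For a 1-core request the two estimates agree; they diverge only once the request exceeds what a single node has left.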
Therefore, we introduce CustomizedClusterResourceModeling for each cluster, which records the resource profile of each node. Karmada collects node and pod information from each cluster and, for every node, determines which of the user-configured resource models the node falls into.
Start to use Customized Cluster Resource Models
The CustomizedClusterResourceModeling feature gate has evolved to Beta since Karmada v1.4 and is enabled by default. If you use Karmada v1.3, you need to enable this feature gate in karmada-scheduler, karmada-aggregated-server and karmada-controller-manager.
For example, you can use the command below to turn on the feature gate in the karmada-controller-manager.
kubectl --kubeconfig ~/.kube/karmada.config --context karmada-host edit deploy/karmada-controller-manager -nkarmada-system
- command:
  - /bin/karmada-controller-manager
  - --kubeconfig=/etc/kubeconfig
  - --bind-address=0.0.0.0
  - --cluster-status-update-frequency=10s
  - --secure-port=10357
  - --feature-gates=CustomizedClusterResourceModeling=true
  - --v=4
After that, when a cluster is registered to the Karmada control plane, Karmada automatically sets up a generic model for the cluster. You can see it in cluster.spec.
The default resource model is as follows:
resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  - grade: 4
    ranges:
    - max: "16"
      min: "8"
      name: cpu
    - max: 128Gi
      min: 64Gi
      name: memory
  - grade: 5
    ranges:
    - max: "32"
      min: "16"
      name: cpu
    - max: 256Gi
      min: 128Gi
      name: memory
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  - grade: 7
    ranges:
    - max: "128"
      min: "64"
      name: cpu
    - max: 1Ti
      min: 512Gi
      name: memory
  - grade: 8
    ranges:
    - max: "9223372036854775807"
      min: "128"
      name: cpu
    - max: "9223372036854775807"
      min: 1Ti
      name: memory
Customize your cluster resource models
In some cases, the default cluster resource model may not match your cluster. You can adjust the granularity of the cluster resource model to better distribute resources to your cluster.
For example, you can use the command below to customize the cluster resource models of member1.
kubectl --kubeconfig ~/.kube/karmada.config --context karmada-apiserver edit cluster/member1
A customized resource model should meet the following requirements:
- The grade of each model should not be the same.
- The number of resource types in each model should be the same.
- Currently only four resource types are supported: cpu, memory, storage and ephemeral-storage.
- The max value of each resource must be greater than the min value.
- The min value of each resource in the first model should be 0.
- The max value of each resource in the last model should be MaxInt64.
- The resource types of each model should be the same.
- Model intervals for resources must be contiguous and non-overlapping from low-grade to high-grade models.
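The requirements above can be expressed as a small checker. This is a hedged sketch rather than Karmada's own validation code: the function name, the error messages and the input layout (a list of dicts whose quantities are already converted to plain numbers) are all assumptions made for illustration.

```python
MAX_INT64 = 2**63 - 1
GI = 2**30
SUPPORTED_RESOURCES = {"cpu", "memory", "storage", "ephemeral-storage"}


def validate_resource_models(models):
    """Return a list of rule violations; an empty list means the models pass.

    `models` is a list of {"grade": int, "ranges": {name: (min, max)}} dicts.
    (Hypothetical layout; quantities are plain numbers, not Kubernetes strings.)
    """
    if not models:
        return ["no models given"]
    errors = []
    ordered = sorted(models, key=lambda m: m["grade"])
    grades = [m["grade"] for m in ordered]
    if len(set(grades)) != len(grades):
        errors.append("grades must be unique")
    names = set(ordered[0]["ranges"])
    if not names <= SUPPORTED_RESOURCES:
        errors.append(f"unsupported resource types: {sorted(names - SUPPORTED_RESOURCES)}")
    if any(set(m["ranges"]) != names for m in ordered):
        errors.append("every model must define the same resource types")
        return errors  # the per-resource checks below need matching keys
    for name in sorted(names):
        bounds = [m["ranges"][name] for m in ordered]
        if any(high <= low for low, high in bounds):
            errors.append(f"{name}: max must be greater than min")
        if bounds[0][0] != 0:
            errors.append(f"{name}: min of the first model must be 0")
        if bounds[-1][1] != MAX_INT64:
            errors.append(f"{name}: max of the last model must be MaxInt64")
        # Each grade must start exactly where the previous one ends.
        if any(prev[1] != nxt[0] for prev, nxt in zip(bounds, bounds[1:])):
            errors.append(f"{name}: ranges must be contiguous and non-overlapping")
    return errors


good = [
    {"grade": 0, "ranges": {"cpu": (0, 1), "memory": (0, 4 * GI)}},
    {"grade": 1, "ranges": {"cpu": (1, 2), "memory": (4 * GI, 16 * GI)}},
    {"grade": 2, "ranges": {"cpu": (2, MAX_INT64), "memory": (16 * GI, MAX_INT64)}},
]
# Same models, but grade 1 leaves a CPU gap between 1 and 1.5 cores.
gapped = [good[0], {"grade": 1, "ranges": {"cpu": (1.5, 2), "memory": (4 * GI, 16 * GI)}}, good[2]]

print(validate_resource_models(good))
print(validate_resource_models(gapped))
```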
For example, a customized cluster resource model is given below:
resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "9223372036854775807"
      min: "2"
      name: cpu
    - max: "9223372036854775807"
      min: 16Gi
      name: memory
The above is a cluster resource model with three grades. Each grade defines the ranges for two resources, CPU and memory. With this model, a node whose remaining available resources are 0.5 CPU cores and 2Gi of memory is classified as grade 0, while a node with 1.5 CPU cores and 10Gi of memory is classified as grade 1.
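The classification described above can be sketched as follows. Two points are assumptions that the text does not state: each range is treated as half-open, [min, max), and a node whose resources fall into different grades is assigned the lowest of them. The function names and the `MODELS` layout are hypothetical.

```python
GI = 2**30
MAX_INT64 = 2**63 - 1

# The three-grade model from the example above, as {grade: {resource: (min, max)}}.
MODELS = {
    0: {"cpu": (0, 1), "memory": (0, 4 * GI)},
    1: {"cpu": (1, 2), "memory": (4 * GI, 16 * GI)},
    2: {"cpu": (2, MAX_INT64), "memory": (16 * GI, MAX_INT64)},
}


def grade_for_resource(models, name, value):
    """Return the grade whose [min, max) range for `name` contains `value`."""
    for grade in sorted(models):
        low, high = models[grade][name]
        if low <= value < high:  # assumption: min inclusive, max exclusive
            return grade
    raise ValueError(f"{name}={value} is outside every range")


def classify_node(models, available):
    """Assign a node the lowest grade that any of its resources falls into."""
    # Assumption: the scarcest resource decides the node's grade.
    return min(grade_for_resource(models, name, value) for name, value in available.items())


print(classify_node(MODELS, {"cpu": 0.5, "memory": 2 * GI}))
print(classify_node(MODELS, {"cpu": 1.5, "memory": 10 * GI}))
```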
Schedule based on Customized Cluster Resource Models
The cluster resource model divides the nodes into grades that cover different resource intervals. When a Pod needs to be scheduled to a specific cluster, karmada-scheduler uses the resource request of the Pod instance to compare how many nodes in each cluster satisfy the requirement, and schedules the Pod to the cluster with the larger number of such nodes.
Assume that there is a Pod to be scheduled to one of the member clusters managed by Karmada with the same cluster resource models. The remaining available resources of these member clusters are as follows:
Member1:
spec:
  ...
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  ...
...
status:
  - count: 1
    grade: 2
  - count: 6
    grade: 3
Member2:
spec:
  ...
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  ...
...
status:
  - count: 4
    grade: 2
  - count: 4
    grade: 3
Member3:
spec:
  ...
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  ...
...
status:
  - count: 1
    grade: 6
Suppose the Pod requests 3 CPU cores and 20Gi of memory. All nodes of grade 2 and above fulfill this request. Considering the number of nodes available in each cluster, the scheduler prefers to schedule the Pod to member3.
Cluster | member1 | member2 | member3 |
---|---|---|---|
AvailableReplicas | 1 + 6 = 7 | 4 + 4 = 8 | 1 * min(32/3, 256/20) = 10 |
Suppose now that the Pod requests 3 CPU cores and 60Gi of memory. Grade 2 nodes no longer satisfy the memory request, so after considering the number of nodes available in each cluster, the scheduler prefers to schedule the Pod to member1.
Cluster | member1 | member2 | member3 |
---|---|---|---|
AvailableReplicas | 6 * 1 = 6 | 4 * 1 = 4 | 1 * min(32/3, 256/60) = 4 |
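The two tables can be reproduced with the sketch below. The rules in it are inferred from the worked numbers purely for illustration and are assumptions, not Karmada's actual estimator: a grade is taken to qualify when the request does not exceed its upper bounds, and each qualifying node is credited with at least one replica, or with more when the grade's lower bounds cover the request several times. All names are hypothetical.

```python
GI = 2**30

# Bounds of the grades used in the example: {grade: {resource: (min, max)}}.
GRADES = {
    2: {"cpu": (2, 4), "memory": (16 * GI, 32 * GI)},
    3: {"cpu": (4, 8), "memory": (32 * GI, 64 * GI)},
    6: {"cpu": (32, 64), "memory": (256 * GI, 512 * GI)},
}

# Node counts per grade, copied from each cluster's status.
CLUSTERS = {
    "member1": {2: 1, 3: 6},
    "member2": {2: 4, 3: 4},
    "member3": {6: 1},
}


def replicas_per_node(bounds, request):
    """Replicas credited to one node of a grade, or 0 if the grade is too small."""
    # Assumption: a grade qualifies when the request fits under its upper bounds.
    if any(request[name] > bounds[name][1] for name in request):
        return 0
    # Assumption: at least one replica per qualifying node, more when the
    # grade's lower bounds cover the request several times over.
    return max(1, min(bounds[name][0] // request[name] for name in request))


def available_replicas(node_counts, request):
    """Sum the credited replicas over all graded nodes of one cluster."""
    return sum(
        count * replicas_per_node(GRADES[grade], request)
        for grade, count in node_counts.items()
    )


def estimate_all(request):
    return {name: available_replicas(counts, request) for name, counts in CLUSTERS.items()}


first = estimate_all({"cpu": 3, "memory": 20 * GI})
second = estimate_all({"cpu": 3, "memory": 60 * GI})
print(first, max(first, key=first.get))
print(second, max(second, key=second.get))
```

With the larger memory request, member3's single large node is limited by memory (256Gi / 60Gi), which is why member1 overtakes it.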
Disable Cluster Resource Modeling
Resource modeling is always used by the scheduler to make scheduling decisions in scenarios of dynamic replica assignment based on cluster free resources. To build the models, Karmada collects node and pod information from all the clusters it manages. This imposes a considerable performance burden in large-scale scenarios.
You can disable cluster resource modeling by setting --enable-cluster-resource-modeling to false in karmada-controller-manager and karmada-agent.