kubernetes, devops, affinity

Assigning Pods to different nodepools with NodeAffinity


I am trying to assign a cluster of pods to nodepools, and I would like the nodepool that is used to change based on the resources requested by the cluster pods. However, I'd like the pods to prefer the smaller nodepool (worker) and ignore the larger nodes (lgworker), so that they do not trigger a scale-up of the larger pool.

        extraPodConfig:
          tolerations:
            - key: toleration_label
              value: worker
              operator: Equal
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: a_node_selector_label
                    operator: In
                    values:
                      - worker
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  preference:
                    matchExpressions:
                    - key: node_label
                      operator: In
                      values: 
                      - worker
                - weight: 90
                  preference:
                    matchExpressions:
                    - key: node_label
                      operator: In
                      values: 
                      - lgworker


The cluster pods' default resource requests fit easily on the smaller nodes, so I want those to be used first. The larger nodepool should only be triggered when more resources are requested than would fit on a smaller node.
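
For illustration, each worker container's default requests are on the order of the following (these values are hypothetical placeholders, not the actual cluster defaults) and comfortably fit on a worker node:

        resources:
          requests:
            cpu: "1"        # hypothetical request
            memory: 2Gi     # hypothetical request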

I have tried weighting the preferences; however, the default cluster pods are still being scheduled onto the larger nodepool.

Is there something I am missing that would help me properly assign pods to the smaller nodes over the larger nodes?


Solution

  • Using appropriate weighting helps the scheduler prefer the correct node; however, when enough Dask workers are requested, some of those workers may still end up on the lgworker nodes. The fix for this is to configure the kube-scheduler to consider 100% of nodes when scheduling. By default, the kube-scheduler only evaluates a percentage of nodes at a time (determined dynamically from cluster size) for filtering and scoring; see the kube-scheduler documentation for v1.21.

    Node affinity only goes so far: a preferred affinity is not a guarantee, so pods can still end up scheduled on non-preferred nodes.

    Node Affinity v1:

    The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding matchExpressions; the node(s) with the highest sum are the most preferred.

            extraPodConfig:
              tolerations:
                - key: node_toleration
                  value: worker
                  operator: Equal
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: node_label
                        operator: In
                        values:
                          - worker
                          - lgworker
                  preferredDuringSchedulingIgnoredDuringExecution:
                    - weight: 100
                      preference:
                        matchExpressions:
                        - key: node_label
                          operator: In
                          values: 
                          - worker
                    - weight: 1
                      preference:
                        matchExpressions:
                        - key: node_label
                          operator: In
                          values: 
                          - lgworker
    
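    For the affinity and toleration keys above to match anything, the nodes in each pool need the corresponding label and taint (the key names here mirror the hypothetical node_label / node_toleration keys used above; managed nodepools usually let you set these at pool creation instead):

        # Label the nodes in each pool so the nodeAffinity terms can match them
        kubectl label nodes <worker-node>   node_label=worker
        kubectl label nodes <lgworker-node> node_label=lgworker

        # Taint only the worker nodes, so pods carrying the matching toleration land there,
        # while lgworker stays available as the fallback allowed by the required affinity
        kubectl taint nodes <worker-node> node_toleration=worker:NoSchedule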

    So, affecting the kube-scheduler involves updating its configuration, for example:

    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    algorithmSource:
      provider: DefaultProvider
    
    ...
    
    percentageOfNodesToScore: 100
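
    How this file gets applied depends on how the control plane is managed (on most managed clusters the kube-scheduler flags are not editable). On a kubeadm-style control plane, a sketch would be to point the kube-scheduler static pod at the file via its --config flag (the config file path here is hypothetical, and the file must also be mounted into the scheduler container):

    # /etc/kubernetes/manifests/kube-scheduler.yaml (static pod manifest, abbreviated)
    spec:
      containers:
      - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml
        ...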