Kubernetes Book 2 - principles of operation

Masters and nodes

A Kubernetes cluster is made of masters and nodes. These are Linux hosts that can be VMs, bare metal servers in your data center, or instances in a private or public cloud.

Masters (control plane)

A Kubernetes master is a collection of system services that make up the control plane of the cluster.

The simplest setups run all the master services on a single host. However, this is only suitable for labs and test environments. For production environments, multi-master high availability (HA) is a must-have. This is why the major cloud providers implement HA masters as part of their Kubernetes-as-a-Service platforms, such as Azure Kubernetes Service (AKS), AWS Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE).

API server

The API server is the Grand Central Station of Kubernetes. All communication, between all components, goes through the API server.

It exposes a RESTful API to which users POST YAML configuration files over HTTPS. These YAML files, which we sometimes call manifests, contain the desired state of our application. This includes things such as which container image to use, which ports to expose, and how many Pod replicas to run.

All requests to the API server are subject to authentication and authorization checks. Once these checks pass, the config in the YAML file is validated, persisted to the cluster store, and deployed to the cluster.
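
To make that flow concrete, here is a minimal sketch in Go using the official client-go library: it builds the kind of desired state described above (image, port, replica count) and POSTs it to the API server. The kubeconfig path, object names, and image are placeholders invented for illustration, not anything from the book.

<syntaxhighlight lang="go">
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load cluster credentials; the API server authenticates and authorizes
	// every request made with this config. The path is a placeholder.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	replicas := int32(3)
	// Desired state: which image to run, which port to expose, how many replicas.
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "hello-deploy"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "hello"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "hello"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "hello",
						Image: "example/hello:1.0", // placeholder image
						Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
					}},
				},
			},
		},
	}

	// The client serializes this object and POSTs it to the API server over HTTPS;
	// once validated, the API server persists it to the cluster store.
	result, err := clientset.AppsV1().Deployments("default").
		Create(context.TODO(), deploy, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created deployment:", result.Name)
}
</syntaxhighlight>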

Cluster store

If the API server is the brains of the cluster, the cluster store is its heart. It's the only stateful part of the control plane, and it persistently stores the entire configuration and state of the cluster. As such, it's a vital component of the cluster - no cluster store, no cluster.

The cluster store is currently based on etcd, a popular distributed database. As it's the single source of truth for the cluster, you should run three to five etcd replicas for high availability, and you should provide adequate ways to recover when things go wrong.

On the topic of availability, etcd prefers consistency over availability. This means that it will not tolerate a split-brain situation and will halt updates to the cluster in order to maintain consistency. However, if etcd becomes unavailable, applications running on the cluster should continue to work; it's just updates to the cluster configuration that will be halted.

As with all distributed databases, consistency of writes to the database is important. For example, multiple writes to the same value originating from different nodes need to be handled. etcd uses the popular Raft consensus algorithm to accomplish this.
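
As a rough illustration of that consistency-first behaviour, here is a minimal sketch using the official etcd Go client. The endpoint is an assumption (in a real cluster you would normally go through the API server rather than talking to etcd directly): a write only succeeds once a quorum of members has committed it through Raft, so a cluster without quorum returns an error or times out instead of accepting the write.

<syntaxhighlight lang="go">
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; in a real cluster etcd is usually reached
	// only by the API server, not by end users.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A Put is committed through Raft: it returns only after a majority of
	// etcd members agree on the write. Without quorum, it errors or times out.
	if _, err := cli.Put(ctx, "/demo/desired-replicas", "3"); err != nil {
		fmt.Println("write failed (etcd unreachable or no quorum):", err)
		return
	}

	// Reads are linearizable by default, so they reflect the committed state.
	resp, err := cli.Get(ctx, "/demo/desired-replicas")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
</syntaxhighlight>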

Controller manager

The controller manager is a controller of controllers and is shipped as a single monolithic binary. However, despite running as a single process, it implements multiple independent control loops that watch the cluster and respond to events.

Some of the control loops include the node controller, the endpoints controller, and the ReplicaSet controller. Each one runs as a background watch-loop that is constantly watching the API server for changes - the aim of the game is to ensure the current state of the cluster matches the desired state (more on this shortly).

The logic implemented by each control loop is effectively this:

- Obtain the desired state.
- Observe the current state.
- Determine the differences.
- Reconcile the differences.

This logic is at the heart of Kubernetes and declarative design patterns.
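
A stripped-down sketch of that watch-and-reconcile cycle is shown below. It is only an illustration with made-up types and values; real controllers use the API server's watch mechanism, informers, and work queues rather than a simple polling loop like this.

<syntaxhighlight lang="go">
package main

import (
	"fmt"
	"time"
)

// State is a placeholder for whatever a control loop cares about,
// e.g. the number of Pod replicas owned by a ReplicaSet.
type State struct {
	Replicas int
}

// desiredState and currentState stand in for reads from the API server.
func desiredState() State { return State{Replicas: 3} }

var cluster = State{Replicas: 1}

func currentState() State { return cluster }

// reconcile closes the gap between current and desired state,
// e.g. by asking the API server to create or delete Pods.
func reconcile(current, desired State) {
	switch {
	case current.Replicas < desired.Replicas:
		cluster.Replicas++ // pretend we created a Pod
		fmt.Println("created a Pod, now running", cluster.Replicas)
	case current.Replicas > desired.Replicas:
		cluster.Replicas-- // pretend we deleted a Pod
		fmt.Println("deleted a Pod, now running", cluster.Replicas)
	default:
		fmt.Println("in sync at", cluster.Replicas, "replicas")
	}
}

func main() {
	// Each control loop runs this cycle continuously:
	// obtain desired state, observe current state, diff, reconcile.
	for i := 0; i < 5; i++ {
		reconcile(currentState(), desiredState())
		time.Sleep(100 * time.Millisecond)
	}
}
</syntaxhighlight>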

Each control loop is also extremely specialized and only interested in its own little corner of the Kubernetes world. No attempt is made to over-complicate things by implementing awareness of other parts of the system - each takes care of its own task and leaves other components alone. This is key to the distributed design of Kubernetes and adheres to the Unix philosophy of building complex systems from small specialized parts.

Scheduler

At a high level, the scheduler watches for new work tasks and assigns them to appropriate healthy nodes. Behind the scenes, it implements complex logic that filters out nodes incapable of running the Pod and then ranks the nodes that are capable. The ranking system itself is complex, but the node with the highest-ranking score is ultimately selected to run the Pod.

When identifying nodes that are capable of running the Pod, the scheduler performs various predicate checks. These include: is the node tainted, are there any affinity or anti-affinity rules, is the Pod's network port available on the node, does the node have sufficient free resources, and so on. Any node incapable of running the Pod is ignored, and the remaining nodes are ranked according to things such as: does the node already have the required image, how much free resource does the node have, and how many Pods is the node already running. Each criterion is worth points, and the node with the most points is selected to run the Pod.

If the scheduler cannot find a suitable node, the Pod cannot be scheduled and goes into the pending state.

It's not the job of the scheduler to perform the mechanics of running Pods; it just picks the nodes they will be scheduled on.
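
To illustrate the filter-then-score flow described above, here is a simplified sketch. The node fields, scoring weights, and helper functions are all invented for illustration; the real scheduler's predicate and priority plugins are far more sophisticated.

<syntaxhighlight lang="go">
package main

import "fmt"

// Node is a simplified stand-in for a cluster node as the scheduler sees it.
type Node struct {
	Name     string
	Tainted  bool
	FreeCPU  int  // millicores available
	PodCount int  // Pods already running
	HasImage bool // required image already pulled?
	PortFree bool // is the Pod's required port available?
}

// Pod is a simplified work task with its resource request.
type Pod struct {
	Name       string
	CPURequest int
}

// filter drops nodes that cannot run the Pod at all (predicate checks).
func filter(nodes []Node, p Pod) []Node {
	var fit []Node
	for _, n := range nodes {
		if n.Tainted || !n.PortFree || n.FreeCPU < p.CPURequest {
			continue
		}
		fit = append(fit, n)
	}
	return fit
}

// score ranks a remaining node; the weights here are arbitrary illustrations.
func score(n Node, p Pod) int {
	s := n.FreeCPU - p.CPURequest // prefer nodes with more headroom
	s -= n.PodCount * 10          // prefer less-loaded nodes
	if n.HasImage {
		s += 50 // prefer nodes that already have the image
	}
	return s
}

// schedule picks the highest-scoring candidate, or reports that none fit.
func schedule(nodes []Node, p Pod) (string, bool) {
	candidates := filter(nodes, p)
	if len(candidates) == 0 {
		return "", false // Pod stays pending
	}
	best, bestScore := candidates[0], score(candidates[0], p)
	for _, n := range candidates[1:] {
		if s := score(n, p); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best.Name, true
}

func main() {
	nodes := []Node{
		{Name: "node-1", FreeCPU: 500, PodCount: 12, HasImage: false, PortFree: true},
		{Name: "node-2", FreeCPU: 2000, PodCount: 3, HasImage: true, PortFree: true},
		{Name: "node-3", Tainted: true, FreeCPU: 4000, PortFree: true},
	}
	pod := Pod{Name: "web-1", CPURequest: 250}

	if name, ok := schedule(nodes, pod); ok {
		fmt.Println(pod.Name, "scheduled on", name)
	} else {
		fmt.Println(pod.Name, "is pending: no suitable node")
	}
}
</syntaxhighlight>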