Kubernetes Book 2 - principles of operation

Master and nodes

A Kubernetes cluster is made of masters and nodes. These are Linux hosts that can be VMs, bare metal servers in your data center, or instances in a private or public cloud.

Masters (control plane)

A Kubernetes master is a collection of system services that make up the control plane of the cluster.

The simplest setups run all the master services on a single host. However, this is only suitable for labs and test environments. For production environments, multi-master high availability (HA) is a must. This is why the major cloud providers implement HA masters as part of their Kubernetes-as-a-Service platforms, such as Azure Kubernetes Service (AKS), AWS Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE).

API server

The API server is the Grand Central Station of Kubernetes. All communication, between all components, goes through the API server.

It exposes a RESTful API to which users POST YAML configuration files over HTTPS. These YAML files, which we sometimes call manifests, contain the desired state of our application. This includes things like which container image to use, which ports to expose, and how many Pod replicas to run.

All requests to the API Server are subject to authentication and authorization checks, but once these are done, the config in the YAML file is validated, persisted to the cluster store, and deployed to the cluster.
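The order of those checks can be sketched in a few lines of Python. This is only a toy model, not how the real API server is built: the ToyAPIServer class, its in-memory user and permission sets, and the minimal "kind plus name" validation rule are all assumptions made for illustration, and the dictionary stands in for the cluster store.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str       # who is asking (hypothetical identity string)
    verb: str       # e.g. "create"
    manifest: dict  # the parsed YAML manifest


@dataclass
class ToyAPIServer:
    # Hypothetical in-memory stand-ins for real authn/authz and the cluster store.
    known_users: set = field(default_factory=set)
    permissions: set = field(default_factory=set)  # (user, verb) pairs
    cluster_store: dict = field(default_factory=dict)

    def handle(self, req: Request) -> str:
        # 1. Authentication: do we know who this is?
        if req.user not in self.known_users:
            return "401 Unauthorized"
        # 2. Authorization: may this user perform this verb?
        if (req.user, req.verb) not in self.permissions:
            return "403 Forbidden"
        # 3. Validation (toy rule): a manifest needs a kind and a name.
        name = req.manifest.get("metadata", {}).get("name")
        if "kind" not in req.manifest or not name:
            return "422 Unprocessable Entity"
        # 4. Persist the desired state to the cluster store.
        self.cluster_store[(req.manifest["kind"], name)] = req.manifest
        return "201 Created"
```

Nothing is written to the store unless all three earlier checks pass, which is the point of the ordering described above.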

Cluster store

If the API server is the brains of the cluster, the cluster store is its heart. It's the only stateful part of the control plane, and it persistently stores the entire configuration and state of the cluster. As such, it's a vital component of the cluster - no cluster store, no cluster.

The cluster store is currently based on etcd, a popular distributed database. As it's the single source of truth for the cluster, you should run between 3 and 5 etcd replicas for high availability, and you should provide adequate ways to recover when things go wrong.

On the topic of availability, etcd prefers consistency over availability. This means that it will not tolerate a split-brain situation and will halt updates to the cluster in order to maintain consistency. However, if etcd becomes unavailable, applications running on the cluster should continue to work. It's just updates to the cluster configuration that will be halted.

As with all distributed databases, consistency of writes to the database is important. For example, multiple writes to the same value originating from different nodes need to be handled. etcd uses the popular Raft consensus algorithm to accomplish this.
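The advice to run 3 or 5 replicas follows from majority-quorum arithmetic: a Raft-based store only accepts writes while a majority of its members can reach each other. The short sketch below works through the numbers. The function names are our own and are not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


def can_accept_writes(reachable: int, members: int) -> bool:
    """A partition can only commit writes if it still holds a majority."""
    return reachable >= quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"members={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

Note that 4 members tolerate only one failure, the same as 3, which is why odd cluster sizes are preferred. In a split of a 5-member cluster into 3 and 2, only the side with 3 members can keep accepting updates, so the two sides can never diverge.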

Controller manager

The controller manager is a controller of controllers and is shipped as a single monolithic binary. However, despite running as a single process, it implements multiple independent control loops that watch the cluster and respond to events.

Some of the control loops include the node controller, the endpoints controller, and the ReplicaSet controller. Each one runs as a background watch loop that constantly watches the API Server for changes - the aim of the game is to ensure the current state of the cluster matches the desired state (more on this shortly).

The logic implemented by each control loop is effectively this:

  • Obtain the desired state.
  • Observe the current state.
  • Determine the differences.
  • Reconcile the differences.

This logic is at the heart of Kubernetes and declarative design patterns.
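The four steps of a control loop can be sketched as a small Python function. This is a simplified illustration under our own assumptions, not the real controller code: state is modelled as a plain dictionary mapping an object name to its spec, and the reconcile and control_loop_once names are hypothetical.

```python
def reconcile(desired: dict, current: dict) -> list:
    """Determine the differences between desired and current state.

    Both arguments map an object name to its spec. Returns a list of
    (action, name) tuples that would bring `current` in line with `desired`.
    """
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def control_loop_once(desired: dict, current: dict) -> list:
    """One pass of the loop: diff, then apply the actions to `current`."""
    actions = reconcile(desired, current)
    for action, name in actions:
        if action == "delete":
            del current[name]
        else:  # "create" and "update" both write the desired spec
            current[name] = desired[name]
    return actions
```

A real controller runs this pass continuously. Once the two states match, a pass produces no actions at all, so repeating it is harmless.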

Each control loop is also extremely specialized and only interested in its own little corner of the Kubernetes world. No attempt is made to over-complicate things by implementing awareness of other parts of the system - each takes care of its own task and leaves other components alone. This is key to the distributed design of Kubernetes and adheres to the Unix philosophy of building complex systems from small specialized parts.

Scheduler

At a high level, the scheduler watches for new work tasks and assigns them to appropriate healthy nodes. Behind the scenes, it implements complex logic that filters out nodes incapable of running the Pod and then ranks the nodes that are capable. The ranking system itself is complex, but the node with the highest score is eventually selected to run the Pod.

When identifying nodes that are capable of running the Pod, the scheduler performs various predicate checks. These include: is the node tainted, are there any affinity or anti-affinity rules, is the Pod's network port available on the node, and does the node have sufficient free resources. Any node incapable of running the Pod is ignored, and the remaining nodes are ranked according to things such as: does the node already have the required image, how much free resource does the node have, and how many Pods is the node already running. Each criterion is worth points, and the node with the most points is selected to run the Pod.

If the scheduler cannot find a suitable node, the Pod cannot be scheduled and is marked as Pending.
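The filter-then-rank approach can be illustrated with a toy scheduler. The sketch below is not the real scheduler's algorithm. The Node and Pod fields, the two predicate checks, and the point values are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    tainted: bool = False
    images: set = field(default_factory=set)  # images already pulled
    pod_count: int = 0


@dataclass
class Pod:
    name: str
    image: str
    cpu: float


def feasible(node: Node, pod: Pod) -> bool:
    """Predicate checks: the node must be untainted and have enough CPU."""
    return not node.tainted and node.free_cpu >= pod.cpu


def score(node: Node, pod: Pod) -> float:
    """Ranking with made-up weights: higher is better."""
    points = 0.0
    if pod.image in node.images:
        points += 10                    # image already present on the node
    points += node.free_cpu - pod.cpu   # resources left over after placement
    points -= node.pod_count            # prefer less crowded nodes
    return points


def schedule(pod: Pod, nodes: list) -> Optional[str]:
    """Return the chosen node name, or None if the Pod must stay Pending."""
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, pod)).name
```

The function only returns a node name. As with the real scheduler, actually starting the Pod on that node is someone else's job.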

It's not the job of the scheduler to perform the mechanics of running Pods, it just picks the nodes they will be scheduled on.

Cloud controller manager

If the cluster is running on a supported public cloud platform, such as AWS, Azure, or GCP, the control plane will be running a cloud controller manager. Its job is to manage integrations with underlying cloud technologies and services such as instances, load-balancers, and storage.

Control plane

Kubernetes masters run all of the cluster's control plane services. Think of the control plane as the brains of the cluster, where all the control and scheduling decisions are made. Behind the scenes, a master is made up of lots of small specialized control loops and services. These include the API server, the cluster store, the controller manager, and the scheduler.

The API Server is the front-end into the control plane and the only component in the control plane that we interact with directly. It exposes a RESTful endpoint over HTTPS, commonly on port 443 (the kube-apiserver's own default secure port is 6443).

Nodes

Nodes are the workers of a Kubernetes cluster. At a high level they do three things:

  • Watch the API Server for new work assignments.
  • Execute new work assignments.
  • Report back to the control plane.

Kubelet

The kubelet is the star of the show on every node. It's the main Kubernetes agent, and it runs on every node in the cluster. In fact, it's common to use the terms node and kubelet interchangeably.

When someone joins a new node to a cluster, the process involves installing the kubelet, which is then responsible for the node registration process. This effectively pools the node's CPU, RAM, and storage into the wider cluster pool. Think back to the previous chapter where we talked about Kubernetes being a data center OS and abstracting data center resources into a single usable pool.

One of the main jobs of the kubelet is to watch the API server for new work assignments. Any time it sees one, it executes the task and maintains a reporting channel back to the control plane. It also keeps an eye on local static Pod definitions.

If a kubelet can't run a particular task, it reports back to the master and lets the control plane decide what actions to take. For example, if a Pod fails to start on a node, the kubelet is not responsible for finding another node to run it on. It simply reports back to the control plane and the control plane decides what to do.
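This division of labour can be sketched as a single pass of a toy kubelet. The kubelet_sync function, the Pod dictionaries, and the start_container callback (standing in for the container runtime) are hypothetical names for illustration and do not mirror the real kubelet's internals.

```python
def kubelet_sync(node_name: str, assigned_pods: list, start_container) -> list:
    """One pass of a toy kubelet.

    `assigned_pods` is what the kubelet sees from the API server: a list of
    dicts with "name", "node", and "image" keys. `start_container` is a
    callable that raises RuntimeError if the container cannot be started.
    Returns (pod name, status) reports destined for the control plane.
    """
    reports = []
    for pod in assigned_pods:
        if pod["node"] != node_name:
            continue  # work assigned to a different node is not our concern
        try:
            start_container(pod["image"])
            reports.append((pod["name"], "Running"))
        except RuntimeError as err:
            # No attempt to find another node: just report the failure.
            reports.append((pod["name"], f"Failed: {err}"))
    return reports
```

The failure branch is deliberately short. Deciding what happens to a failed Pod is left to the control plane.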

Container runtime

The kubelet needs a container runtime to perform container-related tasks - things like pulling images and starting and stopping containers.

In the early days, Kubernetes had native support for a few container runtimes such as Docker. More recently, it has moved to a plugin model called the Container Runtime Interface (CRI). At a high level, the CRI is an abstraction layer that masks the internal machinery of Kubernetes and exposes a clean, documented interface for external (3rd-party) container runtimes to plug in to.

The CRI is the supported method for integrating runtimes into Kubernetes.

There are lots of container runtimes available for Kubernetes. One popular example is cri-containerd.

Kube-proxy

The last piece of the node puzzle is the kube-proxy. This runs on every node in the cluster and is responsible for local networking. For example, it makes sure each node gets its own unique IP address, and implements local iptables or IPVS rules to handle routing and load-balancing of traffic on the Pod network.

Kubernetes DNS

As well as the various control plane and node components, every Kubernetes cluster has an internal DNS service that is vital to operations.

The cluster's DNS service has a static IP address that is hard-coded into every Pod on the cluster, meaning all containers and Pods know how to find it. Every new Service is automatically registered with the cluster's DNS so that all components in the cluster can find every Service by name. Some other components that are registered with the cluster DNS are StatefulSets and the individual Pods that a StatefulSet manages.

Cluster DNS is based on CoreDNS (https://coredns.io/).
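To make "find every Service by name" concrete, the sketch below builds the DNS names that Services and StatefulSet Pods are conventionally registered under. It assumes the cluster domain is cluster.local, which is the usual default but is configurable per cluster. The helper function names are our own.

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """DNS name of a Service: <service>.<namespace>.svc.<cluster-domain>."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def statefulset_pod_fqdn(statefulset: str, ordinal: int, governing_service: str,
                         namespace: str = "default",
                         cluster_domain: str = "cluster.local") -> str:
    """DNS name of one Pod managed by a StatefulSet.

    Pods are named <statefulset>-<ordinal> and are registered under the
    StatefulSet's governing (headless) Service.
    """
    pod_name = f"{statefulset}-{ordinal}"
    return f"{pod_name}.{service_fqdn(governing_service, namespace, cluster_domain)}"


if __name__ == "__main__":
    print(service_fqdn("web", "prod"))
    print(statefulset_pod_fqdn("db", 0, "db-svc", "prod"))
```

The Service, StatefulSet, and namespace names used here (web, db, db-svc, prod) are placeholders.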

Packaging apps

For an application to run on a Kubernetes cluster, it needs a few things. It must be:

  • Packaged as a container.
  • Wrapped in a Pod.
  • Deployed via a declarative manifest file.

Declarative model and desired state

The declarative model and the concept of desired state are at the very heart of Kubernetes.

In Kubernetes, the declarative model works like this:

  • Declare the desired state of the application (microservice) in a manifest file.
  • POST it to the Kubernetes API server.
  • Kubernetes stores this in the cluster store as the application's desired state.
  • Kubernetes implements the desired state on the cluster.
  • Kubernetes implements watch loops to make sure the current state of the application doesn't vary from the desired state.

Manifest files are written in simple YAML, and they tell Kubernetes how we want an application to look. We call this the desired state. It includes things such as which image to use, how many replicas to have, which network ports to listen on, and how to perform updates.
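As an illustration of what such a manifest contains, here is a Deployment expressed as a Python dictionary that mirrors the structure of the YAML, then serialized to JSON. The field names follow the apps/v1 Deployment schema. The application name, label, image, and port are hypothetical placeholders.

```python
import json

# Desired state for a hypothetical "hello" application.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "hello-deploy"},
    "spec": {
        "replicas": 3,  # how many replicas to have
        "selector": {"matchLabels": {"app": "hello"}},
        "strategy": {   # how to perform updates
            "type": "RollingUpdate",
            "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 1},
        },
        "template": {
            "metadata": {"labels": {"app": "hello"}},
            "spec": {
                "containers": [
                    {
                        "name": "hello",
                        "image": "example.com/hello:1.0",   # which image to use
                        "ports": [{"containerPort": 8080}],  # which port to listen on
                    }
                ]
            },
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(deployment, indent=2))
```

Notice that nothing in the manifest says how to reach this state. It only describes the end result, which is what makes it declarative.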

Once we've created the manifest, we POST it to the API server. The most common way of doing this is with the kubectl command-line utility. This POSTs the manifest as a request to the control plane, usually on port 443.

Once the request is authenticated and authorized, Kubernetes inspects the manifest, identifies which controller to send it to (e.g. the Deployments controller), and records the config in the cluster store as part of the cluster's overall desired state. Once this is done, the work gets scheduled on the cluster. This includes the hard work of pulling images, starting containers, building networks, and starting the application's processes.

Finally, Kubernetes utilizes background reconciliation loops that constantly monitor the state of the cluster. If the current state of the cluster varies from the desired state, Kubernetes will perform whatever tasks are necessary to reconcile the issue.

It's important to understand that what we've described is the opposite of the traditional imperative model. The imperative model is where we issue long lists of platform-specific commands to build things.

Not only is the declarative model a lot simpler than long lists of imperative commands, it also enables self-healing and scaling, and lends itself to version control and self-documentation. It does this by telling the cluster how things should look. If they stop looking like this, the cluster notices the discrepancy and does all of the hard work to reconcile the situation.

Things change when something goes wrong. As soon as the current state of the cluster no longer matches the desired state, Kubernetes kicks into action and attempts to bring the two back into harmony.
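One way to see why the declarative model lends itself to self-healing is to compare how the two styles behave when a request is repeated or when a replica is lost. The sketch below is a deliberately tiny model in which a replica count stands in for the whole application state. Both function names are made up for the example.

```python
def scale_up_by(running: int, count: int) -> int:
    """Imperative command: 'start `count` more replicas'."""
    return running + count


def ensure_replicas(running: int, desired: int) -> int:
    """Declarative request: 'there should be `desired` replicas'.

    Works out how many replicas to start (positive delta) or stop
    (negative delta) from wherever the system currently is.
    """
    delta = desired - running
    return running + delta


if __name__ == "__main__":
    # Repeating an imperative command changes the outcome...
    print(scale_up_by(scale_up_by(1, 2), 2))
    # ...while repeating a declarative request does not.
    print(ensure_replicas(ensure_replicas(1, 3), 3))
    # After losing a replica, the same declaration restores the count.
    print(ensure_replicas(2, 3))
```

Because the declarative request carries the target rather than the steps, the same declaration works whether the cluster is starting from scratch, recovering from a failure, or scaling down.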