Kubernetes Book 2 - principles of operation

From 탱이의 잡동사니

Master and nodes

A Kubernetes cluster is made of masters and nodes. These are Linux hosts that can be VMs, bare metal servers in your data center, or instances in a private or public cloud.

Masters (control plane)

A Kubernetes master is a collection of system services that make up the control plane of the cluster.

The simplest setups run all the master services on a single host. However, this is only suitable for labs and test environments. For production environments, multi-master high availability (HA) is a must-have. This is why the major cloud providers implement HA masters as part of their Kubernetes-as-a-Service platforms such as Azure Kubernetes Service (AKS), AWS Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE).

API server

The API server is the Grand Central Station of Kubernetes. All communication, between all components, goes through the API server.

It exposes a RESTful API to which users POST YAML configuration files over HTTPS. These YAML files, which we sometimes call manifests, contain the desired state of our application. This includes things like: which container image to use, which ports to expose, and how many Pod replicas to run.

All requests to the API Server are subject to authentication and authorization checks, but once these are done, the config in the YAML file is validated, persisted to the cluster store, and deployed to the cluster.

Cluster store

If the API server is the brains of the cluster, the cluster store is its heart. It's the only stateful part of the control plane, and it persistently stores the entire configuration and state of the cluster. As such, it's a vital component of the cluster - no cluster store, no cluster.

The cluster store is currently based on etcd, a popular distributed database. As it's the single source of truth for the cluster, you should run three to five etcd replicas for high availability, and you should provide adequate ways to recover when things go wrong.

On the topic of availability, etcd prefers consistency over availability. This means it will not tolerate a split-brain situation and will halt updates to the cluster in order to maintain consistency. However, if etcd becomes unavailable, applications running on the cluster should continue to work; it's just updates to the cluster configuration that will be halted.

As with all distributed databases, consistency of writes to the database is important. For example, multiple writes to the same value originating from different nodes need to be handled. etcd uses the popular Raft consensus algorithm to accomplish this.

Controller manager

The controller manager is a controller of controllers and is shipped as a single monolithic binary. However, despite running as a single process, it implements multiple independent control loops that watch the cluster and respond to events.

Some of the control loops include: the node controller, the endpoints controller, and the ReplicaSet controller. Each one runs as a background watch-loop that constantly watches the API Server for changes - the aim of the game is to ensure the current state of the cluster matches the desired state (more on this shortly).

The logic implemented by each control loop is effectively this:

  • Obtain the desired state.
  • Observe the current state.
  • Determine the differences.
  • Reconcile the differences.
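As a rough sketch, the loop above can be expressed in a few lines of Python. This is not actual Kubernetes code - the function and parameter names are hypothetical - it just illustrates the desired-state/current-state/diff/reconcile pattern for a replica count:

```python
def reconcile(desired_replicas, current_pods, start_pod, stop_pod):
    """One reconcile step: compare desired state with current state and act on the diff.

    desired_replicas: how many Pods we want (desired state)
    current_pods:     list of currently running Pods (current state)
    start_pod/stop_pod: callbacks that perform the actual work
    """
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        # Too few Pods: start more until current matches desired
        for _ in range(diff):
            current_pods.append(start_pod())
    elif diff < 0:
        # Too many Pods: stop the excess
        for _ in range(-diff):
            stop_pod(current_pods.pop())
    return current_pods
```

In a real controller this step runs repeatedly inside a watch loop, so the system converges on the desired state even as Pods fail or manifests change.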

This logic is at the heart of Kubernetes and declarative design patterns.

Each control loop is also extremely specialized and only interested in its own little corner of the Kubernetes world. No attempt is made to over-complicate things by implementing awareness of other parts of the system - each takes care of its own task and leaves other components alone. This is key to the distributed design of Kubernetes and adheres to the Unix philosophy of building complex systems from small specialized parts.

Scheduler

At a high level, the scheduler watches for new work tasks and assigns them to appropriate healthy nodes. Behind the scenes, it implements complex logic that filters out nodes incapable of running the Pod and then ranks the nodes that are capable. The ranking system itself is complex, but the node with the highest ranking is eventually selected to run the Pod.

When identifying nodes capable of running the Pod, the scheduler performs various predicate checks. These include: is the node tainted, are there any affinity or anti-affinity rules, is the Pod's network port available on the node, and does the node have sufficient free resources. Any node incapable of running the Pod is ignored, and the remaining nodes are ranked according to things such as: does the node already have the required image, how much free resource the node has, and how many Pods the node is already running. Each criterion is worth points, and the node with the most points is selected to run the Pod.
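The two-phase filter-then-score approach can be sketched as follows. This is a toy model, not the real scheduler: the dict fields and scoring weights are invented for illustration, and the real predicate and priority functions are far richer:

```python
def schedule(pod, nodes):
    """Pick a node for a Pod: filter out infeasible nodes, then score the rest.

    pod and nodes are plain dicts with illustrative fields, not real API objects.
    Returns the winning node, or None if the Pod must stay pending.
    """
    # Predicate phase: discard nodes without enough free CPU/memory
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None  # no suitable node: the Pod goes into the pending state

    # Scoring phase: prefer nodes that already have the image,
    # then nodes with the most free resources
    def score(node):
        image_bonus = 10 if pod["image"] in node["images"] else 0
        return image_bonus + node["free_cpu"] + node["free_mem"]

    return max(feasible, key=score)
```

The key point survives the simplification: filtering decides *where the Pod can run*, scoring decides *where it runs best*.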

If the scheduler cannot find a suitable node, the Pod cannot be scheduled and goes into the pending state.

It's not the job of the scheduler to perform the mechanics of running Pods, it just picks the nodes they will be scheduled on.

Cloud controller manager

If the cluster is running on a supported public cloud platform, such as AWS, Azure, or GCP, the control plane will be running a cloud controller manager. Its job is to manage integrations with underlying cloud technologies and services such as instances, load-balancers, and storage.

Control plane

Kubernetes masters run all of the cluster's control plane services. Think of it as the brains of the cluster where all the control and scheduling decisions are made. Behind the scenes, a master is made up of lots of small specialized control loops and services. These include the API server, the cluster store, the controller manager, and the scheduler.

The API Server is the front-end into the control plane and the only component in the control plane that we interact with directly. By default, it exposes a RESTful endpoint on port 443.

Nodes

Nodes are the workers of a Kubernetes cluster. At a high level, they do three things:

  • Watch the API Server for new work assignments.
  • Execute new work assignments.
  • Report back to the control plane.

Kubelet

The Kubelet is the star of the show on every Node. It's the main Kubernetes agent, and it runs on every node in the cluster. In fact, it's common to use the terms node and kubelet interchangeably.

When someone joins a new node to a cluster, the process involves installing the kubelet, which is then responsible for the node registration process. This effectively pools the node's CPU, RAM, and storage into the wider cluster pool. Think back to the previous chapter where we talked about Kubernetes being a data center OS and abstracting data center resources into a single usable pool.

One of the main jobs of the kubelet is to watch the API server for new work assignments. Any time it sees one, it executes the task and maintains a reporting channel back to the control plane. It also keeps an eye on local static Pod definitions.

If a kubelet can't run a particular task, it reports back to the master and lets the control plane decide what actions to take. For example, if a Pod fails to start on a node, the kubelet is not responsible for finding another node to run it on. It simply reports back to the control plane and the control plane decides what to do.

Container runtime

The Kubelet needs a container runtime to perform container-related tasks - things like pulling images and starting and stopping containers.

In the early days, Kubernetes had native support for a few container runtimes such as Docker. More recently, it has moved to a plugin model called the Container Runtime Interface (CRI). This is an abstraction layer for external (3rd-party) container runtimes to plug in to. At a high-level, the CRI masks the internal machinery of Kubernetes and exposes a clean documented interface for 3rd-party container runtime to plug in to.

The CRI is the supported method for integrating runtimes into Kubernetes.

There are lots of container runtimes available for Kubernetes. One popular example is cri-containerd.

Kube-proxy

The last piece of the node puzzle is the kube-proxy. This runs on every node in the cluster and is responsible for local networking. For example, it makes sure each node gets its own unique IP address, and implements local IPTABLES or IPVS rules to handle routing and load-balancing of traffic on the Pod network.

Kubernetes DNS

As well as the various control plane and node components, every Kubernetes cluster has an internal DNS service that is vital to operations.

The cluster's DNS service has a static IP address that is hard-coded into every Pod on the cluster, meaning all containers and Pods know how to find it. Every new Service is automatically registered with the cluster's DNS so that all components in the cluster can find every Service by name. Some other components that are registered with the cluster DNS are StatefulSets and the individual Pods that a StatefulSet manages.

Cluster DNS is based on CoreDNS (https://coredns.io/).

Packaging apps

For an application to run on a Kubernetes cluster, it needs a few things:

  • Packaged as a container.
  • Wrapped in a Pod.
  • Deployed via a declarative manifest file.

Declarative model and desired state

The declarative model and the concept of desired state are at the very heart of Kubernetes.

In Kubernetes, the declarative model works like this:

  • Declare the desired state of the application(microservice) in a manifest file.
  • POST it to the Kubernetes API server.
  • Kubernetes stores this in the cluster store as the application's desired state.
  • Kubernetes implements the desired state on the cluster.
  • Kubernetes implements watch loops to make sure the current state of the application doesn't vary from the desired state.

Manifest files are written in simple YAML, and they tell Kubernetes how we want an application to look. We call this the desired state. It includes things such as: which image to use, how many replicas to have, which network ports to listen on, and how to perform updates.
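As a sketch, a minimal Deployment manifest capturing that desired state might look like this. The object name, labels, and image are placeholders, not values from this book:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deploy           # placeholder name
spec:
  replicas: 3                # how many Pod replicas to run
  selector:
    matchLabels:
      app: web
  template:                  # the Pod template
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25    # which image to use
        ports:
        - containerPort: 80  # which network port to listen on
```

Note that the file describes only what we want - nothing about which nodes to use or how to start the containers. That's the declarative model in action.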

Once we've created the manifest, we POST it to the API server. The most common way of doing this is with the kubectl command-line utility. This POSTs the manifest as a request to the control plane, usually on port 443.

Once the request is authenticated and authorized, Kubernetes inspects the manifest, identifies which controller to send it to (e.g. the Deployments controller), and records the config in the cluster store as part of the cluster's overall desired state. Once this is done, the work gets scheduled on the cluster. This includes the hard work of pulling images, starting containers, building networks, and starting the application's processes.

Finally, Kubernetes utilizes background reconciliation loops that constantly monitor the state of the cluster. If the current state of the cluster varies from the desired state, Kubernetes will perform whatever tasks are necessary to reconcile the difference.

It's important to understand that what we've described is the opposite of the traditional imperative model. The imperative model is where we issue long lists of platform-specific commands to build things.

Not only is the declarative model a lot simpler than long lists of imperative commands, it also enables self-healing, scaling, and lends itself to version control and self-documentation. It does this by telling the cluster how things should look. If they stop looking like this, the cluster notices the discrepancy and does all of the hard work to reconcile the situation.

But when things go wrong, things change. As soon as the current state of the cluster no longer matches the desired state, Kubernetes kicks into action and attempts to bring the two back into harmony.

Pods

Containers must always run inside of Pods.

Pods and containers

The simplest model is to run a single container per Pod. However, there are advanced use-cases that run multiple containers inside a single Pod. For example:

  • Service meshes.
  • Web containers supported by a helper container that pulls the latest content.
  • Containers with a tightly coupled log scraper.

Pod anatomy

At the highest level, a Pod is a ring-fenced environment to run containers. The Pod itself doesn't actually run anything, it's just a sandbox for hosting containers. Keeping it high level, you ring-fence an area of the host OS, build a network stack, create a bunch of kernel namespaces, and run one or more containers in it. That's a Pod.

If you're running multiple containers in a Pod, they all share the same environment. This includes things like the IPC namespace, shared memory, volumes, network stack, and more. As an example, this means that all containers in the same Pod will share the same IP address (the Pod's IP).

If two containers in the same Pod need to talk to each other (container-to-container within the Pod), they can use ports on the Pod's localhost interface.

Multi-container Pods are ideal when you have requirements for tightly coupled containers that may need to share memory and storage. However, if you don't need to tightly couple your containers, you should put them in their own Pods and loosely couple them over the network. This keeps things clean by having each Pod dedicated to a single task.

Pods as the unit of scaling

Pods are also the minimum unit of scheduling in Kubernetes. If Kubernetes needs to scale an app, it adds or removes Pods. It doesn't scale by adding more containers to an existing Pod. Multi-container Pods are only for situations where two different, but complementary, containers need to share resources.

Pods - atomic operations

The deployment of a Pod is an atomic operation. This means that a Pod is either entirely deployed, or not deployed at all. There is never a situation where a partially deployed Pod will be servicing requests. The entire Pod either comes up and is put into service, or it doesn't, and it fails.

A single Pod can only be scheduled to a single node. This is also true of multi-container Pods - all containers in the same Pod will run on the same node.

Pod lifecycle

Pods are mortal. They're created, they live, and they die. If they die unexpectedly, Kubernetes doesn't bring them back to life. Instead, it starts a new one in its place. However, even though the new Pod looks, smells, and feels like the old one, it isn't. It's a shiny new Pod with a shiny new ID and IP address.

This has implications on how we should design our applications. Don't design them so they are tightly coupled to a particular instance of a Pod. Instead, design them so that when Pods fail, a totally new one (with a new ID and IP address) can pop up somewhere else in the cluster and seamlessly take its place.

Deployments

You normally deploy Pods indirectly as part of something bigger.

For example, a Deployment is a higher-level Kubernetes object that wraps around a particular Pod and adds features such as scaling, zero-downtime updates, and versioned rollbacks.

Behind the scenes, they implement a controller and a watch loop that is constantly observing the cluster, making sure that the current state matches the desired state.

Services

Pods are mortal and can die. However, if they're managed via Deployments or DaemonSets, they get replaced when they fail. But replacements come with totally different IPs. This also happens when we perform scaling operations - scaling up adds new Pods with new IP addresses, whereas scaling down takes existing Pods away. Events like these cause a lot of IP churn.

The point is that Pods are unreliable, which poses a challenge: how can other parts of the app reliably use them?

This is where Services come in to play. Services provide reliable networking for a set of Pods.

Digging into a bit more detail, Services are fully fledged objects in the Kubernetes API - just like Pods and Deployments. They have a front-end that consists of a stable DNS name, IP address, and port. On the back-end, they load-balance across a dynamic set of Pods. As Pods come and go, the Service observes this, automatically updates itself, and continues to provide that stable networking endpoint.

The same applies if you scale the number of Pods up or down. New Pods are seamlessly added to the Service, whereas terminated Pods are seamlessly removed.

That's the job of a Service - it's a stable network abstraction point that provides TCP and UDP load-balancing across a dynamic set of Pods.

As they operate at the TCP and UDP layer, Services do not possess application intelligence and cannot provide application-layer routing. For that, you need an Ingress, which understands HTTP and provides host- and path-based routing.

Connecting Pods to Services

Services use labels and a label selector to know which set of Pods to load-balance traffic to. The Service has a label selector that is a list of all the labels a Pod must possess in order for it to receive traffic from the Service.
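The selection rule is simple enough to sketch in a few lines of Python. The data shapes here are invented for illustration - real selectors and labels live on API objects - but the matching logic is the same: a Pod receives traffic only if it carries every label in the Service's selector:

```python
def selects(selector, pod_labels):
    """True if the Pod carries every key/value pair in the Service's label selector."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

def backend_pods(selector, pods):
    """Return the Pods a Service would load-balance traffic to."""
    return [pod for pod in pods if selects(selector, pod["labels"])]
```

Because membership is recomputed from labels rather than fixed IPs, new Pods with matching labels join the Service automatically, and terminated Pods simply drop out.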

Summary

The masters are where the control plane components run. Under the hood, there's a combination of several system services, including the API Server that exposes the public REST interface. Masters make all of the deployment and scheduling decisions, and multi-master HA is important for production-grade environments.

Nodes are where user applications run. Each node runs a service called the kubelet that registers the node with the cluster and communicates with the control plane. This includes receiving new work tasks and maintaining a reporting channel. Nodes also have a container runtime and the kube-proxy service. The container runtime, such as Docker or containerd, is responsible for all container-related operations. The kube-proxy is responsible for networking on the node.