To self hosted and back again

Sean McCord
March 30, 2021

One of the goals of Sidero, as a company, is to help make secure installation and operation of Kubernetes easy (or at least simpler). When we first started Talos OS, we were good citizens of the Kubernetes world: we used kubeadm for cluster installation. The idea of a common tool for all Kubernetes installations sounds good, but in practice, especially when you are building a highly advanced, special-purpose, Kubernetes-focused operating system like Talos OS, kubeadm makes all kinds of assumptions that just don’t fit and are actually wrong for us.

It didn’t take long to feel the strain of using the wrong tool for the job. Kubeadm is designed around a conventional Linux system. It expects all sorts of things which are anathema to Talos OS, such as SSH, shells, and a raft of arbitrary tooling. Even after we’d given kubeadm enough of a fake environment in which to run, and worked out all of the cruft and misaligned pieces, it still wasn’t a good solution. The real pain point is that it is a one-time system: you run kubeadm once, and it installs the critical pieces of Kubernetes across a set of nodes, but the kubeadm configuration is not kept current with the running configuration, so the two ‘drift’ apart. This makes recreating systems … an adventure.

So we next implemented Bootkube. Bootkube hadn’t received much attention since even before CoreOS was fed to the wolves, but it offered what we wanted:

few assumptions about the tools available on the host OS
dynamic (self-hosted) control plane components

Like kubeadm, bootkube is still a one-shot tool: it fires a Kubernetes cluster into place and is done. However, it doesn’t do a bunch of shelling out from Go, and it doesn’t require a bunch of extraneous and potentially dangerous tools. Most importantly, it allows you to install the controlplane of Kubernetes in a self-hosted manner. That is, after it is installed, all the components of the control plane that are necessary to have a working Kubernetes cluster, such as the apiserver, controller-manager, and scheduler, run as pods managed by Kubernetes itself. This means you can update the controlplane in the same way that you update any other component in Kubernetes.

After we updated bootkube for the latest version of Kubernetes, gave it plugin support, and generally modernized it a bit, it worked out really well for streamlining our Kubernetes installation process, and our users were able to upgrade their Kubernetes clusters fairly easily over time.

Bootkube still has some pretty serious problems, though. Foremost, because the controlplane is self-hosted (effectively managing itself), if you ever lose that controlplane, it can be very difficult to get it back up. Bootkube includes tooling that takes care of the worst dangers: the pod-checkpointer (which checkpoints critical Pods and replaces them if they die, outside the scope of the Kubernetes scheduler) and a built-in recovery tool (which reads configuration from etcd and allows you to re-bootstrap your controlplane from that data). Even so, it is still possible to break your controlplane. We added additional APIs to recover that controlplane, which work simply and quickly, and the result was generally pretty good.

However, under bootkube, automatic upgrades are much more dangerous than we were comfortable with. There are a few reasons for this.

While having the controlplane components themselves managed by Kubernetes makes it easier for the user to upgrade Kubernetes, it actually makes it harder to maintain structure and automate upgrades of Kubernetes. This is a direct result of allowing the user full control over the controlplane manifests. You can look at this as a good problem to have: it provides much-desired flexibility for users. On balance, though, that additional flexibility inevitably means it is easy to completely destroy your cluster.

And for our purposes, because Talos OS does not completely control the controlplane components, it cannot gauge whether conflicting options have been injected into controlplane manifests before it asserts its own changes. There are various potential ways around this, but they are all fairly complicated, involving a lot of tracking and machinery, all to reach a sub-optimal result.

Another risk is that some Kubernetes controlplane components (I’m looking at you, controller-manager) are bad about declaring themselves to be up and ready when, in fact, they are going to fail badly. This leads to situations where updates are rolled out and continue to be rolled out, replacing the good components with bad ones, all before the bad replacements have a chance to signal that they are actually bad. And once the control plane has gone down in an upgrade, it becomes quite difficult to recover Kubernetes. This is worse than losing a controlplane because all the nodes went down. In this case, even your checkpoints are now poisoned, so manual intervention is the only option.

Having explored the self-hosted controlplane, we found that, while we needed dynamism in the controlplane, giving up structure and control, and accepting the fragility inherent in a control plane managing itself, was ultimately not a good trade. So while bootkube addressed many of the problems we had with kubeadm, it wasn’t really the ideal tool for us after all.

During this time, we started developing a system called COSI, the Common Operating System Interface. One of the main concepts in COSI is the use of resources and controllers. Resources (static or dynamic) are used by controllers to continually try to reach a desired state. This concept is core to Kubernetes itself, and it is an important design concept for self-healing and distributed systems. It also happens to offer very nice things to an operating system.

With COSI, the real answer became obvious. We had a resource, the Talos Machine Config, which described exactly how we wanted our Kubernetes cluster to be created. All we really needed to do was develop a controller which continually tried to make that cluster a reality by iteratively generating and applying the controlplane manifests.

With this, we could gain the stability of static, file-based manifests, which aren’t themselves dependent on Kubernetes running. That is, there is no more chicken-and-egg problem or fragility. However, unlike the kubeadm solution, our file-based manifests are dynamic in the sense that they are constantly kept up to date from the specifications of our Machine Configs.

If you want to update Kubernetes, all you have to do is update the Machine Config. After that, our controller will automatically work to upgrade (or downgrade!) Kubernetes to exactly what you want. If it encounters a problem, the configuration can simply be rolled back. Importantly, this rollback is possible because all of these components are generated out-of-band from Kubernetes: they are files on the filesystem, not manifests stored inside the cluster. If the cluster dies, we can recover it easily, naturally, and automatically.
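To make the idea concrete, here is a minimal sketch of that kind of controller. It is not the Talos OS or COSI implementation: MachineConfig, render_manifest, reconcile, the component list, the image names, and the version strings are all hypothetical and heavily simplified. It only illustrates the shape of the loop: compare the manifest files on disk against what the config asks for, and rewrite whatever differs.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Illustrative component names only; this list is an assumption for the sketch.
COMPONENTS = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


@dataclass(frozen=True)
class MachineConfig:
    """Hypothetical, heavily simplified stand-in for a Talos Machine Config."""
    kubernetes_version: str


def render_manifest(component, version):
    """Return a toy manifest body (not a real pod spec; image name is a placeholder)."""
    return f"name: {component}\nimage: example.invalid/{component}:{version}\n"


def reconcile(config, manifest_dir):
    """Make manifest_dir match config; return how many files were (re)written."""
    changed = 0
    for component in COMPONENTS:
        path = manifest_dir / f"{component}.yaml"
        desired = render_manifest(component, config.kubernetes_version)
        current = path.read_text() if path.exists() else None
        if current != desired:
            path.write_text(desired)
            changed += 1
    return changed


def demo():
    """Run three reconcile passes and report the number of writes per pass."""
    with tempfile.TemporaryDirectory() as tmp:
        manifest_dir = Path(tmp)
        # Version strings are arbitrary examples.
        first = reconcile(MachineConfig("1.20.5"), manifest_dir)   # initial render
        second = reconcile(MachineConfig("1.20.5"), manifest_dir)  # already converged
        third = reconcile(MachineConfig("1.20.4"), manifest_dir)   # downgrade / rollback
        return [first, second, third]


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that an upgrade, a downgrade, and a rollback are all the same operation: change the config and let the next pass rewrite the files. Nothing in the loop talks to the Kubernetes API, so it keeps working when the cluster is down.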

So now, starting with Talos OS v0.9, we have a rock-solid controlplane which doesn’t need to be directly manipulated by the end user and which even works reliably for single-node clusters and other odd cluster configurations. But users still have all the flexibility they need by using our structured, declarative Machine Configs. Stability and flexibility at the same time.

Oh, it’s also much faster and easier to debug, because a Talos OS-based controlplane has all the regular Kubernetes-based debugging tools. On top of those, we have additional debug tooling on the Talos OS side, which works whether or not the Kubernetes cluster is operational.

So if you want the fastest, easiest, most flexible Kubernetes cluster around, whether you’re in the cloud, on-premises, or even running directly on bare metal, give Talos OS a try today!

https://talos.dev
