Note: Arges is under active development. For the latest and greatest deployment instructions, see the README here.
A new tool for bare metal provisioning with Kubernetes and Talos
In part one of this blog series, I'll be introducing our new suite of projects, codenamed Arges. The goal of the Arges project is to provide Talos users with a robust and reliable way to build and manage Talos-based secure Kubernetes clusters in any environment: in public cloud, on virtualization platforms, or on bare metal. In this post we'll walk through a bare metal deployment, as that is usually the most challenging environment.
We've tried to achieve this by building out a set of tools to help solve the traditional datacenter bootstrapping problems. These tools include an asset management server, a metadata server, and a pair of Cluster API-aware providers for infrastructure provisioning and config generation.
Let's start exploring! If you want to skip right to the code, check it out on GitHub: talos-systems/arges
There are five components in play when running the Arges platform:
- Cluster API
The upstream Cluster API components, which provide CRDs for various abstractions over Kubernetes clusters.
- Cluster API Bootstrap Provider Talos (optional)
A Cluster API (CAPI) bootstrap provider to generate machine configurations across any environment.
- Cluster API Provider Metal
Provides a Cluster API (CAPI) infrastructure provider for bare metal. Given a reference to a bare metal server and some BMC info, this provider will reconcile the necessary custom resources and boot the nodes using IPMI.
- Metal Controller Manager
Provides a "server" CRD in the management cluster, as well as iPXE and TFTP services. The CRD is used to store discovered data (obtained via SMBIOS) about servers that PXE boot against the manager.
- Metal Metadata Server
Provides a Cluster API-aware metadata server for bootstrapping bare metal nodes. The server will attempt to look up a given CAPI machine resource, given the UUID of the system. Once the system is found, it will return the bootstrap data associated with that machine resource.
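To make the metadata server's lookup concrete, here is a minimal sketch in Python. It is not the actual Arges implementation: the `Machine` class, the `find_bootstrap_data` function, the UUIDs, and the in-memory list are all hypothetical stand-ins for the CAPI machine resources the real server would query in the management cluster.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Machine:
    """Hypothetical stand-in for a CAPI machine resource."""
    name: str
    system_uuid: str      # SMBIOS UUID of the physical server
    bootstrap_data: str   # e.g. the generated machine configuration


def find_bootstrap_data(machines: Iterable[Machine], system_uuid: str) -> Optional[str]:
    """Return the bootstrap data for the machine with this UUID, or None."""
    wanted = system_uuid.strip().lower()  # compare UUIDs case-insensitively
    for machine in machines:
        if machine.system_uuid.lower() == wanted:
            return machine.bootstrap_data
    return None  # unknown system: nothing to hand out


# Made-up inventory for illustration only.
machines = [
    Machine("controlplane-0", "4c4c4544-0000-1000-8000-000000000001", "config for controlplane-0"),
    Machine("controlplane-1", "4c4c4544-0000-1000-8000-000000000002", "config for controlplane-1"),
]
```

A booting node would present its UUID, and the server would answer with the matching bootstrap data, or with nothing if no machine resource references that system.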
The goal of the Arges platform is to provide a single point where users can provision and manage their clusters. This single point is referred to as the "management plane". The management plane can be one or more physical or virtual machines that you designate to run the components mentioned above. The only requirement is that these management plane machines must have connectivity to the BMC of the servers that will be provisioned, as well as to the servers themselves. A deployment may look something like this in your environment:
One of the most exciting parts of building Arges was acquiring hardware for our homelabs. (Talos Systems is an all-remote company.) When choosing servers for the home, we were mostly interested in two things: small form factor and IPMI. We wanted to be able to replicate a real datacenter deployment as closely as possible, so this included interacting with the out-of-band management interfaces.
After a fair amount of research, we procured 4 of these small nodes from MITXPC. One of the machines acts as the management plane in the lab, while the other three are control plane nodes for a created cluster. These proved to be pretty capable little machines, with 4 NICs each and the IPMI interfaces we were looking for. To each node, we added 8GB of DDR4 RAM and a 128GB SSD from Amazon. The all-in cost of this hardware was right around $2,000.
Because I like to keep my home office tidy when I can, I also picked up a small rack that is actually geared towards musicians, as well as a couple of rack-mount shelves. These, plus a rack-mount power strip and some zip ties, led to a fairly clean cable management story for all of these nodes.
Note: If you'd like more details on the hardware or my specific setup, feel free to ping me on the Talos community Slack!
Because of our desire to create a layout that mimics a real datacenter, we also wanted to prove out the ability to use some of the wonderful networking capabilities available with more advanced networking equipment, specifically BGP (Border Gateway Protocol) and ECMP (Equal-cost Multipath) routing for Kubernetes control plane nodes. I was fortunate enough to already have a home router that supported these capabilities, the Ubiquiti EdgeRouter X. I've had this router in my home setup for a few years now and I've been continually impressed by the things it can do.
Since networking setups and routers can differ so greatly, it may not be super helpful to detail the EdgeRouter setup, but at a high level I did the following:
- Enable BGP on the router itself by setting up an Autonomous System (AS) number for the router
- Designate another AS number for the Kubernetes cluster that will get created and add each physical node as a neighbor to the router
- When deploying Talos, add the same, known IP address to the loopback interface of each node
- When deploying Kubernetes, deploy a bird daemonset (bird is a daemon that supports BGP) that publishes each node's IPs (including the loopback address mentioned above), along with kube-router, which publishes the pod and service IPs to the router
Once these things were done, I had an HA control plane set up where all of the API servers were accessible by hitting https://10.254.0.5:6443. Additionally, each pod and service in the cluster itself was directly accessible from my home network without the need for ingresses or node port services.
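To illustrate why a single address can front all of the API servers: when every node advertises the same loopback address over BGP, the router ends up with several equal-cost routes to it and spreads flows across them. The Python sketch below only illustrates that idea. The node addresses and the `pick_next_hop` function are hypothetical, and real routers use their own hashing schemes rather than SHA-256.

```python
import hashlib
from typing import Sequence

VIP = "10.254.0.5"  # the shared loopback address from this post

# Hypothetical node addresses; each one advertises VIP to the router.
NEXT_HOPS = ["192.168.1.11", "192.168.1.12", "192.168.1.13"]


def pick_next_hop(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  next_hops: Sequence[str]) -> str:
    """Pick one of the equal-cost next hops by hashing the flow.

    Hashing the flow identifiers keeps every packet of a connection on
    the same node while different connections spread across nodes.
    """
    if not next_hops:
        raise ValueError("no route to destination")
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# The same flow always maps to the same node.
first = pick_next_hop("192.168.1.50", 51000, VIP, 6443, NEXT_HOPS)
second = pick_next_hop("192.168.1.50", 51000, VIP, 6443, NEXT_HOPS)

# If a node stops advertising the address, traffic goes to the remaining nodes.
remaining = [hop for hop in NEXT_HOPS if hop != first]
fallback = pick_next_hop("192.168.1.50", 51000, VIP, 6443, remaining)
```

With this simple modulo scheme, withdrawing a node can also move flows that were on the surviving nodes, which is one reason real implementations are more careful about how they map flows to next hops.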
A view of the route propagation looks like:
This seems like a great place to stop for part one of the "Building Arges" series. I will chase this post with an in-depth guide to deploying the management plane and creating our first Kubernetes cluster.