More than 13 years ago, in 2006, Red Hat announced a new project called Stateless Linux. The big idea: the traditional method of systems management starts with a series of identically configured systems, say, your fleet of web servers or load balancers. Over time, though, the state of these machines diverges because of package upgrades, configuration changes, and manual modifications made by well-meaning sysadmins. The result is a set of “snowflakes”–individual systems, each with its own identity and configuration.
Stateless Linux intended to solve this problem by using a read-only, immutable filesystem for the root partition of each class of machines. When you needed to make a change to the system’s configuration or upgrade a package, that change would be applied to the “golden image” and rolled out to the appropriate machines. This eliminated the snowflake problem, and ensured that the configuration and versions of the underlying software did not diverge between systems.
The benefits here are pretty clear: it should be easier to deploy new machines, easier and safer to manage multiple systems that are intended to be identical, and the infrastructure’s general security posture should improve. You would also know exactly what version of each package was running on each machine, and you would know that the configuration was consistent and appropriate for the workload. It would be easier to test updates and changes before they went to production, too.
Stateless Linux’s specific implementation didn’t exactly catch on. The idea didn’t die entirely, but operators instead embraced tools like Puppet, Chef, and Ansible: tools which accomplished the same thing–enforcing a common state and configuration across a fleet of systems–just using a different technique.
Over the years, advances in cloud and container technology and changes in the ways that apps and services are deployed have made the idea of a stateless system more approachable. Netflix helped popularize the approach of a “blue/green deployment”. This deployment methodology introduced the concept of standing up an entirely new (“green”) group of servers instead of updating live (“blue”) production servers directly. Testing and validation could be done against the green stack without impacting users. Once the testing and validation passed, the traffic would be shifted to the green stack and the blue stack would be terminated. Using a series of golden images to do this, Netflix and others got closer to the goal of fully immutable systems.
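The blue/green flow described above can be sketched in a few lines. This is a minimal, illustrative model, not any real platform’s API: the `Router`, `health_check`, and stack dictionaries are all hypothetical stand-ins for a load balancer, a validation suite, and groups of servers.

```python
# Toy model of a blue/green deployment: stand up green, validate it,
# shift traffic, then retire blue. All names here are illustrative.

class Router:
    """Stand-in traffic router: all requests go to the active stack."""
    def __init__(self, active: str):
        self.active = active

    def switch_to(self, stack_name: str):
        self.active = stack_name


def health_check(stack: dict) -> bool:
    # Stand-in for real validation: smoke tests, canary metrics, etc.
    return all(server["healthy"] for server in stack["servers"])


def blue_green_deploy(router: Router, blue: dict, green: dict) -> str:
    """Validate the green stack, cut traffic over, terminate blue."""
    if not health_check(green):
        # Validation failed: users never saw the new stack.
        return "rollback: traffic stays on blue"
    router.switch_to(green["name"])   # the cut-over is a single switch
    blue["servers"].clear()           # terminate the old stack
    return "deployed: traffic now on green"


router = Router(active="blue")
blue = {"name": "blue", "servers": [{"healthy": True}]}
green = {"name": "green", "servers": [{"healthy": True}, {"healthy": True}]}
result = blue_green_deploy(router, blue, green)
print(result, "| active stack:", router.active)
```

The key property the sketch captures is that validation happens entirely on the idle stack, and the switch itself is the only user-visible step, which is what makes rollback cheap.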
Container and image-building technology like Docker, Kubernetes, and HashiCorp’s Packer has continued to push the industry closer to system immutability, by providing fast and easy ways to generate images that are disposed of and replaced instead of updated in place. The industry realized huge benefits in app stability, security, and operations teams’ workload by treating servers as “cattle, not pets”.
Most of these benefits have applied to the applications running on top of the cloud or container orchestration platform, but little has been done to bring the benefits of immutable systems to the underlying host environments that power these clusters. What if the same concepts that increased reliability and eased the maintenance burden of the applications were also applied to the underlying cluster components? An ephemeral, immutable operating system delivering security and operational stability?
To achieve this goal, we have designed the host OS components of Talos to be immutable, with a SquashFS root filesystem that runs out of RAM and never touches disk. This means that even a dedicated attacker who manages to access the system cannot force the root filesystem to be remounted with write access. This, along with the fact that there is no shell, SSH server, or other way to obtain console access to the Talos hosts, means that the system is very difficult to modify while running. You can be confident that the configuration and code running in production is the same as what you developed and tested, free from unexpected changes by well-intentioned operators or nefarious intruders.
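One simple way to see this property on any Linux host is to look at the mount table: a read-only root shows up with the `ro` option on the `/` entry. Below is a small sketch that parses `/proc/mounts`-formatted text and reports whether the root is read-only. The sample mount table is illustrative (loosely resembling a read-only SquashFS root with writable state kept elsewhere), not captured from a real Talos machine; on a live host you would read `/proc/mounts` itself.

```python
# Check whether the root filesystem is mounted read-only, given text
# in /proc/mounts format: "<device> <mountpoint> <fstype> <options> ...".

def root_is_readonly(mounts_text: str) -> bool:
    """Return True if the '/' mount carries the 'ro' option."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            return "ro" in fields[3].split(",")
    return False


# Illustrative mount table: a squashfs root mounted read-only, with
# writable mounts only for runtime state.
sample = """\
/dev/loop0 / squashfs ro,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
/dev/sda1 /var ext4 rw,relatime 0 0
"""

print("root read-only:", root_is_readonly(sample))
```

On a conventional, mutable distribution the same check against `/proc/mounts` would typically report `rw` for `/`, which is exactly the surface an attacker (or a well-meaning operator) can use to drift the system away from its tested state.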
By using an immutable host architecture for your infrastructure, you gain an additional layer of defense against attackers, assured consistency across all of your development stages, and greater flexibility and agility for upgrades, rollbacks, and other high-stakes operations. Have you experimented with immutable system architecture in your infrastructure? Curious about Talos’s security-first design and API-based management? We would love to hear from you!