Platform engineering is shifting

It’s 2026. Platform engineering is shifting. Your users aren’t just developers anymore. They’re AI agents. Plan for it.

Join IaCConf 2026 to hear from the people building this sh*ft.

Hear from Corey Quinn on “AI Speaks Terraform Like a Tourist,” Matt Gowie on the move from IaC to agents, and Amin Astaneh on 10x code velocity and operational risk.

Plus: how teams replace Terraform workflows with policy-driven automation, deploy AI agents safely, and scale infrastructure with GitOps and Kubernetes.

Join IaCConf 2026, a free virtual event on May 14.

Hey {{first name | there}},

In what feels like a not-so-distant past, I started learning Kubernetes, and the deeper I went into the cloud native world, the more I realized how much there is to learn about distributed systems.

In this issue, I would like to highlight some of the common distributed systems concepts that pop up in and around cloud native tools, and how they shape the barrier to entry for learning them.

Housekeeping:

To make sure you don’t miss future emails, here are two quick GIFs showing how to move this email to your Primary tab and add this address to your contacts.

Kubernetes and its life Raft

A good place to begin is Kubernetes, since it houses many components that are direct results of years of distributed systems research.

Often called “the brain” of Kubernetes, etcd is where Kubernetes stores all of its cluster data. Peeling one layer back, you'd find that etcd is a distributed key-value store. Upon learning this, I eventually went digging into how etcd manages to store all that cluster data reliably across multiple nodes.
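You can even peek at this yourself. Every object Kubernetes knows about lives as a key-value pair under the /registry prefix in etcd. Here's a small sketch using etcd's Go client; the endpoint is a placeholder, and I'm skipping the mutual TLS configuration a real cluster would require:

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Assumes an etcd reachable on localhost without TLS,
        // which you'd typically only see in a test setup.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Every Kubernetes object is stored under /registry.
        resp, err := cli.Get(ctx, "/registry/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
        if err != nil {
            panic(err)
        }
        for _, kv := range resp.Kvs {
            fmt.Println(string(kv.Key)) // e.g. /registry/pods/default/my-pod
        }
    }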

Powering etcd is a consensus algorithm called Raft. Developed by Diego Ongaro and John Ousterhout at Stanford University, Raft was created as a simpler alternative to Paxos, the leading consensus algorithm at the time. Consensus is what lets a cluster of machines agree on a single, ordered history of updates even when some of them fail, which is exactly what replicating cluster data demands.
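At its core, Raft has the cluster elect a leader for a numbered term, and a candidate only becomes leader if a majority of nodes grant it their vote. Here's a toy sketch of that voting rule in Go, purely to illustrate the idea (etcd's real implementation is far more involved):

    package main

    import "fmt"

    type node struct {
        term     int // highest term this node has seen
        votedFor int // candidate voted for in this term (-1 = none)
    }

    // requestVote is a drastically simplified take on Raft's
    // RequestVote RPC: grant the vote only if the candidate's term
    // is current and we haven't already voted for someone else.
    func (n *node) requestVote(candidate, term int) bool {
        if term < n.term {
            return false // stale candidate, reject
        }
        if term > n.term {
            n.term = term
            n.votedFor = -1 // a new term resets our vote
        }
        if n.votedFor == -1 || n.votedFor == candidate {
            n.votedFor = candidate
            return true
        }
        return false
    }

    func main() {
        cluster := []*node{{votedFor: -1}, {votedFor: -1}, {votedFor: -1}}

        // Node 0 times out waiting for a leader and campaigns for term 1.
        votes := 0
        for _, n := range cluster {
            if n.requestVote(0, 1) {
                votes++
            }
        }
        // A majority of 3 is 2, so node 0 wins the election.
        fmt.Printf("votes: %d/%d, elected: %v\n", votes, len(cluster), votes > len(cluster)/2)
    }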

Stacking all of these layers eventually gives you stable storage for cluster data, something you rarely have to think about when you spin up a new cluster.

Stable storage in Prometheus

Prometheus has become a standard for how we monitor cloud-native applications. A number of factors have contributed to the project's success, but arguably one of the biggest ones is its ability to scale.

Peeling one layer back, Prometheus stores data in its TSDB (a time series database), and the TSDB itself leans on a write-ahead log (WAL) to stay durable.

A write-ahead log is exactly what it sounds like. Before any sample lands in the in-memory head block, it's first appended to a log on disk. If Prometheus crashes mid-ingest, on restart it replays the WAL and rebuilds state from the last good checkpoint.
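To make that concrete, here's a toy key-value store in Go that follows the same recipe: append and sync to a log file first, then update memory, and replay the log on startup. This is a sketch of the pattern, not Prometheus's actual WAL format, which is segmented and checksummed:

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    type store struct {
        wal  *os.File
        data map[string]string // in-memory state, like the head block
    }

    func open(path string) (*store, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o644)
        if err != nil {
            return nil, err
        }
        s := &store{wal: f, data: map[string]string{}}

        // Replay: rebuild in-memory state from whatever made it to disk.
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            var k, v string
            if _, err := fmt.Sscanf(sc.Text(), "%s %s", &k, &v); err == nil {
                s.data[k] = v
            }
        }
        return s, sc.Err()
    }

    func (s *store) set(k, v string) error {
        // Log first...
        if _, err := fmt.Fprintf(s.wal, "%s %s\n", k, v); err != nil {
            return err
        }
        if err := s.wal.Sync(); err != nil { // force the append to disk
            return err
        }
        // ...then apply to memory. Crashing between these two steps is
        // safe: replaying the log on restart re-applies the write.
        s.data[k] = v
        return nil
    }

    func main() {
        s, err := open("toy.wal")
        if err != nil {
            panic(err)
        }
        if err := s.set("sample", "42"); err != nil {
            panic(err)
        }
        fmt.Println(s.data["sample"]) // survives a crash and restart
    }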

The pattern shows up everywhere in distributed systems.

Postgres has one; etcd's Raft log is essentially a WAL; Kafka treats the log as the entire abstraction. Sequential disk writes are cheap, and a log is the simplest way to get durability without slowing down ingestion.

Ganesh Vernekar wrote a fantastic deep dive into how this all works. Point being, write-ahead logs have various applications in distributed systems, and it's incredible how nice an abstraction tools like Prometheus have been able to build.

Vitess and logical clocks

Vitess is a database clustering system for MySQL, originally built at YouTube to keep a single MySQL-shaped problem from collapsing under scale. It shards MySQL horizontally and presents the shards as one logical database, which is the kind of thing that sounds straightforward until you have to reason about ordering events across independent servers.

The concept worth highlighting here is logical clocks.

Wall clocks across machines drift, and even with NTP you cannot trust them to order events that happen microseconds apart on different nodes. Leslie Lamport's 1978 paper, “Time, Clocks, and the Ordering of Events in a Distributed System,” introduced the idea that you do not actually need real time to order events in a distributed system; you need a consistent way to establish happens-before relationships. MySQL's GTIDs (Global Transaction Identifiers) are a practical descendant of this. Every transaction gets a unique identifier that captures which server originated it and where it sits in that server's sequence.

I've personally tried my hand at implementing a Lamport clock in the past and it was hardly an easy feat.

I'm sure we could go a long time discussing how many distributed systems concepts carry over into cloud native tooling, and that's partially the aim of this post. But more importantly, I think this all goes to show how well some of these tools create abstractions.

I cannot imagine having to fight consensus in Kubernetes, reason about logical clocks in Vitess, or think about compaction in something like Prometheus just to get day-to-day work done.

So to everyone who has made a contribution, written a guide, or even opened up an issue, a huge thanks for making these tools easier to use.

I managed to get control of this week's issue back from Divine on the condition that I ask you to share this link with a colleague or fellow DevOps Engineer who’d find it useful.

And it’s a wrap!

Jubril Oyentunji
Chief Technology Officer, EverythingDevOps

Keep Reading