Arun Lakshman's Blog

AWS EC2 : What's Running Underneath

Arun Lakshman R — Fri, 12 Jun 2026 10:01:23 GMT

Every developer who’s worked with AWS has launched an EC2 instance. You pick an instance type, choose an AMI, SSH in, and deploy your app. Somewhere in the back of your mind, you know there’s virtualization happening. But that’s where most people stop thinking about it.

Here’s what might surprise you: when AWS launched EC2 in August 2006, every instance ran on Xen - an open-source Type 1 bare-metal hypervisor originally created by Ian Pratt and Keir Fraser at the University of Cambridge in 2003. Then, starting around 2017 with the C5 instance family, AWS began migrating to Nitro: a custom platform built on KVM, which is a Type 2 hosted hypervisor. In the textbook hierarchy, Type 1 sits closer to hardware and is considered superior. So why would AWS move down a tier?

The answer is that the Type 1 vs Type 2 distinction is misleading. What actually matters is where I/O is handled. And Nitro solved that problem in dedicated hardware, making the hypervisor classification almost irrelevant.

Glossary

Before we dive in, here are the key terms you’ll encounter throughout this post:

Hypervisor Software (or firmware) that creates and manages virtual machines. It sits between the physical hardware and the guest operating systems, dividing resources among them.

Type 1 (bare-metal) hypervisor A hypervisor that runs directly on physical hardware with no host operating system underneath. Examples: Xen, VMware ESXi.

Type 2 (hosted) hypervisor A hypervisor that runs as software inside a conventional operating system. Examples: KVM (inside Linux), VirtualBox (inside Windows/macOS).

Virtual Machine (VM) A software emulation of a complete computer. It has its own CPU, memory, disk, and network - all virtual - and runs its own operating system.

Guest OS The operating system running inside a virtual machine. It believes it’s running on real hardware (unless paravirtualized).

Host OS The operating system running on the physical machine that hosts the hypervisor and virtual machines.

Paravirtualization A technique where the guest OS knows it’s virtualized and uses special lightweight drivers to communicate with the hypervisor, instead of the hypervisor emulating real hardware. Faster, but requires guest modification.

Full virtualization The guest OS runs unmodified, believing it’s on real hardware. The hypervisor intercepts and translates privileged instructions. Slower than paravirtualization, but compatible with any OS.

KVM (Kernel-based Virtual Machine) A Linux kernel module that turns Linux into a hypervisor. Handles CPU and memory virtualization using hardware extensions.

QEMU (Quick Emulator) A userspace program that emulates I/O devices (disks, NICs, USB, etc.) for virtual machines. Often paired with KVM - KVM handles CPU/memory, QEMU handles everything else.

Dom0 (Domain 0) In Xen, the first and most privileged virtual machine that boots. It runs Linux, has direct hardware access, and manages all other VMs.

DomU (Domain U) In Xen, an unprivileged guest virtual machine. The “U” stands for unprivileged. It has no direct hardware access and relies on Dom0 for I/O.

VMkernel ESXi’s custom operating system kernel. Not Linux - VMware built it from scratch to handle scheduling, networking, storage, and device drivers within the hypervisor itself.

Hypercall A system call from a guest OS to the hypervisor, analogous to a syscall from a userspace program to the kernel. Used in paravirtualization.

VT-x / AMD-V Hardware virtualization extensions built into Intel and AMD processors. They allow the CPU to natively run guest code without software emulation of privileged instructions.

EPT / NPT Extended Page Tables (Intel) / Nested Page Tables (AMD). Hardware extensions for memory virtualization that let the CPU translate guest virtual addresses to physical addresses without hypervisor intervention on every memory access.

NVMe Non-Volatile Memory Express. A protocol for accessing storage devices over PCIe. In the Nitro context, EBS volumes appear as NVMe devices to the guest.

ENA Elastic Network Adapter. AWS’s custom network driver for Nitro instances, replacing Xen’s paravirtual network frontend.

PCIe Peripheral Component Interconnect Express. The standard high-speed bus for connecting hardware devices (GPUs, NICs, storage controllers) to a CPU.

SR-IOV Single Root I/O Virtualization. A hardware standard that allows a single physical PCIe device to present itself as multiple virtual devices, each assignable directly to a VM - bypassing the hypervisor for I/O.

The Setup

Xen was the natural choice for early EC2: it was open source (GPLv2), battle-tested, and designed from the ground up to run multiple operating systems on a single physical machine. Amazon could take it, modify it, and build a cloud on top without licensing fees or vendor lock-in.

KVM itself was created by Avi Kivity at Qumranet in 2006 and merged into the Linux kernel (version 2.6.20) in February 2007. AWS acquired Annapurna Labs, an Israeli chip design company, in January 2015 for approximately $350 million - and Annapurna became the team that built the Nitro hardware cards.

Pillar 1: What Each Virtualization Layer Actually Does

If you want a visual primer on how virtualization works before we go deeper, this is a solid overview:

To understand why the hypervisor type matters less than you think, you first need to understand what each piece of the virtualization stack is responsible for.

KVM virtualizes CPU execution and memory management. It’s a Linux kernel module that leverages hardware extensions - Intel VT-x (introduced in November 2005 with the Pentium 4 662/672) and AMD-V (introduced in May 2006 with the Athlon 64) - to run guest code directly on the physical CPU. Before these extensions, hypervisors had to use complex binary translation techniques to trap and emulate privileged instructions. With VT-x/AMD-V, the CPU itself understands the concept of a guest and a host, switching between them in hardware. For compute-bound work, the overhead is near zero.

Memory virtualization followed a similar path. Early hypervisors maintained “shadow page tables” - a software layer that translated guest virtual addresses to host physical addresses, intercepting every page table update. This was expensive. Intel EPT (introduced in 2008 with the Nehalem architecture) and AMD NPT (introduced in 2007 with Barcelona) moved this translation into hardware, letting the CPU walk nested page tables without hypervisor intervention.

QEMU (Quick Emulator), originally written by Fabrice Bellard in 2003, emulates I/O devices - virtual disk controllers, network cards, USB devices, graphics adapters, and so on. It presents what looks like real hardware to the guest. Each VM is a QEMU process running in userspace on the host. Before KVM existed, QEMU could do full system emulation entirely in software - including CPU emulation - but it was slow. The KVM+QEMU pairing splits the work: KVM handles the fast path (CPU and memory in kernel space), QEMU handles the complex path (device emulation in userspace).

But here’s the part people miss: you still need a host OS. KVM is a kernel module - it’s not a standalone program. It depends on the Linux scheduler (CFS, the Completely Fair Scheduler, for most of KVM’s history; EEVDF since Linux 6.6) to schedule vCPUs (each vCPU is just a Linux thread). It depends on Linux’s memory manager for page tables, NUMA awareness, hugepages, and KSM (Kernel Same-page Merging, which deduplicates identical memory pages across VMs). QEMU is a regular process that makes syscalls for file I/O, networking, and signal handling. Without Linux underneath, neither can function. If you run ps aux on a KVM host, you’ll see one QEMU process per VM, just like any other program.

And you still need a guest OS. KVM and QEMU together build you a virtual computer - CPU, memory, disk, NIC, all virtualized. But a computer with no operating system is just hardware sitting idle. Something still has to:

Boot up and initialize the virtual hardware
Load drivers for the virtual devices QEMU presents
Implement a filesystem (ext4, XFS, NTFS) on the virtual disk
Provide a TCP/IP networking stack
Offer a kernel that applications can make syscalls against
Manage processes, users, permissions, and libraries

Virtual hardware still needs software to run on it. (This is also why containers became popular - for many workloads, you can skip the guest OS entirely by sharing the host kernel, which is the approach Docker, released in March 2013, takes. Firecracker, open-sourced in November 2018, attacks the same overhead from the other side: it keeps a guest kernel but strips the virtual machine down to a minimal device model.)

Pillar 2: Type 1 vs Type 2 - And Why the Line Is Blurry

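The textbook distinction, which traces back to Robert P. Goldberg’s 1973 thesis “Architectural Principles for Virtual Computer Systems” and is usually cited alongside Gerald J. Popek and Goldberg’s 1974 paper “Formal Requirements for Virtualizable Third Generation Architectures,” is clean: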

Type 1 (bare-metal): The hypervisor runs directly on hardware. No host OS. It manages hardware resources and guest VMs itself.
Type 2 (hosted): The hypervisor runs as software inside a conventional operating system. It depends on the host OS for hardware access.

Xen, a Type 1 hypervisor, brought an important concept into the mainstream: paravirtualization. The term predates Xen (the Denali project used it in 2002), but the Xen team popularized it in their 2003 SOSP paper “Xen and the Art of Virtualization.” Instead of tricking the guest into thinking it’s on real hardware (full virtualization), the guest knows it’s virtualized and cooperates with the hypervisor. Xen’s guests used lightweight “frontend” drivers - blkfront for block devices, netfront for networking - that communicated with “backend” drivers in Dom0 through shared memory ring buffers and event channels. No hardware emulation, no trap-and-emulate overhead. This was critical in 2003 because hardware virtualization extensions (VT-x/AMD-V) didn’t exist yet - paravirtualization was the most practical way to get acceptable performance without resorting to binary translation.

Type 2 guests, by contrast, are typically unaware they’re virtualized. QEMU emulates a complete hardware environment - an Intel e1000 NIC, an IDE or SCSI disk controller - and the guest runs its standard drivers against what it believes is real hardware.

But here’s where it gets interesting. The Linux kernel ships with both Xen guest drivers and KVM host code in the same binary. The drivers/xen/ directory contains the paravirtual frontend drivers for running as a guest on Xen - these were merged upstream between 2007 and 2009 through a sustained effort by the Xen community, particularly Jeremy Fitzhardinge and others at XenSource (later acquired by Citrix in 2007 for $500 million). The virt/kvm/ directory contains the code that makes Linux a hypervisor, merged in February 2007. They coexist peacefully.

At boot, the kernel detects what’s underneath:

# On a Xen instance:
dmesg | grep -i hypervisor
> Hypervisor detected: Xen

# On a KVM/Nitro instance:
dmesg | grep -i hypervisor
> Hypervisor detected: KVM

The same kernel image works on bare metal, on Xen, or on KVM without modification. It simply activates the right code path based on what it finds.
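If you want to check this from code rather than by eye, the detection line is easy to parse. Here is a minimal sketch, assuming you already have the kernel log as lines of text (for example, captured from dmesg). The class and method names are made up for illustration, and the exact string after the colon can vary with kernel version and virtualization mode.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper: extracts the hypervisor name from kernel log lines.
final class HypervisorDetector {
    private static final String MARKER = "Hypervisor detected:";

    // Returns the text after "Hypervisor detected:" from the first matching line,
    // or Optional.empty() when no line matches (as you would expect on bare metal).
    static Optional<String> detect(List<String> kernelLogLines) {
        for (String line : kernelLogLines) {
            int idx = line.indexOf(MARKER);
            if (idx >= 0) {
                return Optional.of(line.substring(idx + MARKER.length()).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample input for illustration only, not a real boot log.
        List<String> sampleLog = List.of(
                "[    0.000000] some unrelated boot message",
                "[    0.000000] Hypervisor detected: KVM");
        System.out.println(detect(sampleLog).orElse("none found"));
    }
}
```

The same parsing works for a Xen guest; only the value after the colon changes.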

And KVM itself blurs the Type 1/Type 2 line. Yes, it runs inside Linux. But it operates in kernel space (ring 0) with direct access to hardware virtualization extensions. It doesn’t emulate a CPU - it runs guest code natively using VT-x/AMD-V. The guest enters a special CPU mode (VMX non-root on Intel), executes at near-native speed, and only exits back to the hypervisor (“VM exit”) when it does something that requires intervention. Performance benchmarks consistently put KVM alongside Type 1 hypervisors. Some people call it “Type 1.5,” which tells you a classification system from the 1970s doesn’t map cleanly onto modern architectures.

Pillar 3: I/O Is the Real Differentiator

If CPU and memory virtualization are essentially solved by hardware extensions, then the real question becomes: who handles I/O, and how?

Each major hypervisor answered this differently, and the differences reveal where performance is actually won or lost.

Xen used Dom0 as an I/O proxy. When Xen boots on a physical machine, the first thing it launches is Dom0 - a privileged Linux VM with direct access to all physical hardware. Dom0 runs real Linux device drivers: the actual Intel NIC driver, the actual SATA controller driver, everything. Every unprivileged guest (DomU) that wants to read a disk block or send a network packet goes through Dom0:

The guest’s frontend driver (blkfront) places a request into a shared memory ring buffer
An event channel notifies Dom0
Dom0’s backend driver (blkback) picks up the request
Dom0 talks to the real hardware using standard Linux drivers
The result travels back through the shared memory ring
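The five steps above can be modeled in a few lines. This is a toy, single-threaded sketch of the split-driver idea, not Xen code: the ring, the request and response types, and the method names are all invented for illustration, and the "event channel" is just a direct method call.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a Xen-style frontend/backend pair; not real Xen code.
final class SplitDriverSketch {

    // Fixed-size ring with free-running producer/consumer counters.
    static final class Ring<T> {
        private final Object[] slots;
        private long produced; // total items ever written
        private long consumed; // total items ever read

        Ring(int capacity) {
            slots = new Object[capacity];
        }

        boolean offer(T item) {
            if (produced - consumed == slots.length) {
                return false; // ring full: the producer has to wait
            }
            slots[(int) (produced % slots.length)] = item;
            produced++;
            return true;
        }

        @SuppressWarnings("unchecked")
        T poll() {
            if (produced == consumed) {
                return null; // ring empty
            }
            T item = (T) slots[(int) (consumed % slots.length)];
            consumed++;
            return item;
        }
    }

    // Stands in for Dom0's backend: drain requests, "do" the I/O, post responses.
    static void backendOnEvent(Ring<Long> requests, Ring<String> responses) {
        Long sector;
        while ((sector = requests.poll()) != null) {
            responses.offer("sector " + sector + ": OK"); // a real backend would call the disk driver here
        }
    }

    // Stands in for the guest's frontend: queue requests, notify, collect replies.
    static List<String> readSectors(long... sectors) {
        Ring<Long> requests = new Ring<>(8);
        Ring<String> responses = new Ring<>(8);
        for (long sector : sectors) {
            if (!requests.offer(sector)) {
                throw new IllegalStateException("request ring full");
            }
        }
        backendOnEvent(requests, responses); // models the event-channel notification
        List<String> completed = new ArrayList<>();
        String response;
        while ((response = responses.poll()) != null) {
            completed.add(response);
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(readSectors(10, 11, 12));
    }
}
```

Notice that every request makes two hops through shared memory and needs the backend to run before anything completes. That backend is Dom0, which is exactly where the cost shows up.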

This was elegant - Xen itself stayed tiny (around 150,000 lines of code in early versions), and it reused Linux’s entire driver ecosystem through Dom0. But Dom0 was a bottleneck. It consumed CPU and memory on every physical host just to proxy I/O. Under heavy I/O load, Dom0 could become saturated. And it was a single point of failure - if Dom0 crashed, every VM on that host lost I/O.

ESXi took the monolithic approach. VMware, founded in 1998 by Diane Greene, Mendel Rosenblum, and others at Stanford, released ESX Server in 2001 and the thin ESXi variant in 2007. VMware built their own mini operating system from scratch - the VMkernel - with its own scheduler, its own TCP/IP stack, its own filesystem (VMFS, a clustered filesystem designed for VM disk images), and its own device drivers. No Dom0, no Linux, no middleman. The hypervisor is the I/O layer. ESXi installs from a ~150MB ISO.

The upside: fewer layers, lower latency, no I/O proxy bottleneck. The downside: VMware has to write and maintain drivers for every piece of hardware they support, which is why they publish a strict Hardware Compatibility List (HCL). You can’t just plug in any NIC and expect it to work - it needs a VMware driver.

KVM/QEMU delegates I/O to userspace. Each VM’s QEMU process emulates virtual devices and translates I/O operations into host Linux syscalls. Guest writes to virtual disk → QEMU catches it → QEMU calls pwrite() on the host → Linux kernel handles the actual disk I/O. It’s flexible and benefits from Linux’s entire driver ecosystem, but there’s overhead in the userspace-to-kernel context switches. Technologies like virtio (a standardized paravirtual I/O framework, proposed by Rusty Russell in 2007 and merged into Linux 2.6.25) reduced this overhead significantly by giving guests lightweight drivers that cooperate with QEMU, similar in spirit to Xen’s frontend/backend model.
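To make the "guest write becomes a host syscall" step concrete, here is a small sketch of a virtual disk whose sectors live in an ordinary host file. It is an analogy, not QEMU code: the class name and the 512-byte sector size are assumptions, and Java's positional FileChannel.write(buffer, offset) stands in for pwrite().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a "virtual disk" backed by a plain file on the host.
final class FileBackedDisk implements AutoCloseable {
    static final int SECTOR_SIZE = 512; // assumed sector size for this sketch
    private final FileChannel image;

    FileBackedDisk(Path imagePath) throws IOException {
        image = FileChannel.open(imagePath,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // "Guest writes sector N" becomes a positional write at byte offset N * SECTOR_SIZE.
    void writeSector(long sector, byte[] data) throws IOException {
        if (data.length != SECTOR_SIZE) {
            throw new IllegalArgumentException("need exactly one sector of data");
        }
        ByteBuffer buf = ByteBuffer.wrap(data);
        long offset = sector * SECTOR_SIZE;
        while (buf.hasRemaining()) {
            offset += image.write(buf, offset); // positional write, similar in spirit to pwrite(2)
        }
    }

    byte[] readSector(long sector) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(SECTOR_SIZE);
        long offset = sector * SECTOR_SIZE;
        while (buf.hasRemaining()) {
            int n = image.read(buf, offset + buf.position());
            if (n < 0) {
                break; // past the end of the image: remaining bytes stay zero
            }
        }
        return buf.array();
    }

    @Override
    public void close() throws IOException {
        image.close();
    }

    public static void main(String[] args) throws IOException {
        Path img = Files.createTempFile("disk", ".img");
        try (FileBackedDisk disk = new FileBackedDisk(img)) {
            byte[] sector = new byte[SECTOR_SIZE];
            sector[0] = 42;
            disk.writeSector(3, sector);
            System.out.println("first byte of sector 3: " + disk.readSector(3)[0]);
        } finally {
            Files.deleteIfExists(img);
        }
    }
}
```

Each sector operation here crosses from userspace into the host kernel and back, which is the overhead virtio reduces and Nitro removes from the host CPU.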

Notice the pattern: in every case, the performance bottleneck isn’t CPU virtualization - hardware extensions made that nearly free. It’s the I/O path. Dom0 proxying, VMkernel processing, QEMU translating - that’s where the latency lives.

Pillar 4: How Nitro Made the Hypervisor Type Irrelevant

AWS saw this clearly. The problem was never “Type 1 vs Type 2.” The problem was that I/O was handled in software, and software I/O has overhead no matter how you architect it.

The Nitro journey happened in stages:

2013: AWS introduced enhanced networking using SR-IOV (Single Root I/O Virtualization) on C3 instances. SR-IOV is a PCIe hardware standard (ratified in 2007 by the PCI-SIG) that allows a single physical NIC to present multiple virtual functions, each assignable directly to a VM. This bypassed Dom0 for networking - the guest talked directly to a virtual function on the physical NIC. It was the first crack in Dom0’s monopoly on I/O.
January 2015: AWS acquired Annapurna Labs for ~$350 million. Annapurna, founded in 2011 in Yokneam, Israel, by Avigdor Willenz (who had previously founded Galileo Technology, which Marvell later acquired), specialized in custom ARM-based SoCs. This acquisition gave AWS the silicon design capability to build custom I/O hardware.
2016: The Nitro card for EBS appeared, offloading storage I/O from the host CPU to a dedicated hardware card. No more Dom0 or QEMU in the storage path.
2017: AWS launched the C5 instance family - the first instance type running on the full Nitro platform. The hypervisor was KVM-based. Networking was handled by the Nitro card for VPC (with ENA drivers). Storage was handled by the Nitro card for EBS (with NVMe drivers). Security and management ran on the Nitro security chip. The host CPU ran a minimal KVM hypervisor that handled only CPU and memory isolation.
2018: AWS open-sourced Firecracker, the microVM monitor built on KVM that powers Lambda and Fargate. Firecracker boots a VM in ~125 milliseconds with ~5MB of memory overhead - demonstrating just how thin the virtualization layer can be when I/O is handled elsewhere.
2022: AWS announced Nitro v5 with further performance improvements, the same year the Nitro Trusted Platform Module (NitroTPM) became generally available for enhanced security.

The architecture shift looks like this

Dom0 is gone. QEMU is not in the I/O path. The hypervisor is so thin it barely exists. And the migration from Xen to KVM was largely transparent to customers because - as we covered in Pillar 2 - the Linux kernel already carried both Xen guest drivers and KVM support. Reasonably current AMIs worked without modification. The kernel detected KVM instead of Xen at boot and activated the right code path. Customers on newer instance types saw NVMe and ENA devices instead of Xen paravirtual devices, and those drivers were already in mainline kernels too; only older images that predated the ENA and NVMe drivers needed an update first.

For everyone else, nobody had to rebuild their AMI. Nobody had to change their deployment scripts. The entire hypervisor substrate changed underneath millions of running workloads, and the abstraction held.

The Takeaway

The Type 1 vs Type 2 classification made sense in the 1970s when Goldberg and Popek formalized virtualization, and it still made sense in 2003 when the choice between Xen and VMware Workstation was a meaningful architectural decision. But hardware virtualization extensions leveled the playing field for CPU and memory. What remained was the I/O problem - and that turned out to be a hardware design problem, not a software classification problem.

AWS didn’t move “down” from Type 1 to Type 2. They moved the thing that actually mattered - I/O - into dedicated silicon, and made the hypervisor layer so thin that its classification became academic. The question isn’t “Type 1 or Type 2?” The question is “where does I/O happen?” And if the answer is “in purpose-built hardware,” the hypervisor type barely matters.

The next time you launch an EC2 instance, you’re not just running a VM. You’re running on more than a decade of architectural decisions - from a Cambridge research project in 2003, through an Israeli chip startup acquisition in 2015, to custom silicon that made the oldest debate in virtualization irrelevant.

Inside Flink’s Control Plane: How Apache Pekko Powers the RPC Layer

Arun Lakshman R — Fri, 05 Jun 2026 19:00:47 GMT

Flink’s distributed components must communicate constantly. TaskManagers report task state changes to JobMaster. JobMaster requests slots from ResourceManager. Dispatchers serve REST API queries about job status. All these components access shared state, particularly the ExecutionGraph. Traditional multi-threading with locks would create race conditions, deadlocks, and unmaintainable code. Flink solves this by adopting the Actor Model through the Akka/Pekko framework. Each component processes all requests on a single thread through a FIFO mailbox. This design eliminates concurrency bugs by architecture, not by locks.

The Problem: Distributed Components and Shared State

Why Components Must Communicate

Flink’s runtime consists of distributed components that exchange messages continuously. The table below shows the primary RPC interactions in a running Flink cluster.

These interactions happen thousands of times per second in a production cluster. A single JobMaster coordinates with hundreds of TaskManagers. Each TaskManager runs dozens of tasks. Every task state change, checkpoint acknowledgment, and heartbeat flows through this RPC layer.

The Shared State Challenge

The ExecutionGraph sits at the center of JobMaster. It tracks the complete state of job execution: which tasks are running, which have finished, which checkpoints are in progress, and which resources are allocated. Multiple components access ExecutionGraph for different purposes.

TaskManagers update ExecutionGraph when they report state changes. A task transitions from DEPLOYING to RUNNING. Another task finishes and transitions to FINISHED. Each update modifies the graph’s internal state.

The CheckpointCoordinator reads ExecutionGraph to trigger checkpoints. It iterates through all execution vertices. It sends checkpoint barriers to each task. It tracks acknowledgments as they arrive.

The Dispatcher serves REST API queries. A user requests job status. The Dispatcher reads ExecutionGraph to return current state. Another user requests checkpoint details. The Dispatcher reads checkpoint metrics from the same graph.

What Breaks Without Protection

Consider what happens if these operations execute concurrently without protection. Thread 1 iterates through ExecutionGraph vertices to trigger a checkpoint. Thread 2 updates a task’s state, modifying the vertex collection. Thread 1’s iterator becomes invalid. If the collection’s fail-fast check notices, the iterator throws ConcurrentModificationException. The checkpoint fails.
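You don’t even need two threads to see the fail-fast check fire. The sketch below is not Flink code: it uses a plain ArrayList with made-up vertex names, and the in-loop add() plays the role of Thread 2 changing the collection while Thread 1 is still iterating.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Illustrative only: shows iterator invalidation on a standard ArrayList.
final class IteratorInvalidationDemo {

    // Returns true if modifying the list mid-iteration trips the fail-fast check.
    static boolean modificationDuringIterationFails() {
        List<String> vertices = new ArrayList<>(List.of("source", "map", "sink"));
        try {
            for (String vertex : vertices) {
                if (vertex.equals("source")) {
                    vertices.add("late-vertex"); // stands in for another thread's update
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration failed: " + modificationDuringIterationFails());
    }
}
```

With real threads the check is best-effort, so the exception is not guaranteed, which is what makes the silent case so dangerous.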

The alternative is worse. Without an exception, Thread 1 reads partially updated state. It triggers checkpoints on some tasks but misses others. It sees a task as RUNNING when it has already FINISHED. The checkpoint completes with inconsistent state. Data corruption follows.

Traditional solutions require locks. Every method that reads ExecutionGraph acquires a read lock. Every method that writes acquires a write lock. The code becomes littered with lock.readLock().lock() and lock.writeLock().lock() calls. Developers must remember to release locks in finally blocks. They must avoid nested lock acquisitions that cause deadlocks. They must reason about every possible thread interleaving across hundreds of methods.
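For comparison, here is roughly what that lock discipline looks like on a tiny piece of state. The class is hypothetical and far simpler than ExecutionGraph; it only uses the standard ReentrantReadWriteLock to show the lock/try/finally shape every accessor would need.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical lock-protected state holder; not Flink code.
final class LockedTaskStates {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> stateByTask = new HashMap<>();

    // Writers take the exclusive lock.
    void update(String task, String state) {
        lock.writeLock().lock();
        try {
            stateByTask.put(task, state);
        } finally {
            lock.writeLock().unlock(); // skipping this would block every later reader and writer
        }
    }

    // Readers take the shared lock; many readers may hold it at once.
    long countInState(String state) {
        lock.readLock().lock();
        try {
            return stateByTask.values().stream().filter(state::equals).count();
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedTaskStates states = new LockedTaskStates();
        states.update("A", "RUNNING");
        states.update("B", "FINISHED");
        System.out.println("running tasks: " + states.countInState("RUNNING"));
    }
}
```

Two methods are manageable. The trouble starts when the same pattern has to be repeated, correctly, across every method that touches the state.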

This approach does not scale. Lock contention becomes a performance bottleneck. Debugging deadlocks in production takes days. New engineers introduce subtle race conditions because they forgot to acquire a lock in one code path.

The Solution: Actor Model via Akka/Pekko

Flink adopts the Actor Model to eliminate these concurrency challenges. The Actor Model, popularized by Erlang and implemented on the JVM by Akka and its Apache-licensed fork, Apache Pekko, provides a simple guarantee: each actor processes one message at a time on a single thread. This guarantee makes shared state access inherently thread-safe without locks.

Core Mechanism: Single Thread Execution

The fundamental insight is simple. Instead of allowing multiple threads to access shared state concurrently, route all access through a single thread. Messages from different callers queue up in a mailbox. A single worker thread processes them one at a time in FIFO order. No two messages execute concurrently. No race conditions are possible.

Multiple Threads → Single Actor. When TaskManager reports a state change, it does not call JobMaster directly. It sends a message to JobMaster’s actor. When CheckpointCoordinator triggers a checkpoint, it sends another message. When REST API queries job status, it sends yet another message. Three different callers. Three different threads. All messages arrive at the same actor.

Actor Mailbox = FIFO Queue. The actor maintains an internal mailbox. Messages arrive and queue up in order. The first message to arrive is the first message processed. The second message waits until the first completes. The third waits for the second. This ordering provides deterministic execution. Given the same message sequence, the actor produces the same results.

MainThreadExecutor = Single Thread. The RpcEndpoint base class provides a MainThreadExecutor. From the endpoint’s point of view, this executor behaves like a single dedicated thread: every RPC method, every internal callback, and every scheduled task executes on it, one at a time. (Under Pekko, the actor may be scheduled on different pool threads over time, but never on more than one at once, so the single-thread guarantee holds.)

No Synchronization Needed. Because all code runs on a single thread, no synchronization is necessary. The ExecutionGraph has no locks. Methods read and write state directly. Iterators remain valid because no concurrent modification is possible. The code reads like a simple single-threaded program. Developers reason about sequential execution, not thread interleavings.

How Message Processing Works

Consider a concrete example. JobMaster receives three messages in quick succession.

Message 1 arrives from TaskManager: updateTaskExecutionState(task=A, state=FINISHED). The mailbox queues this message. The main thread picks it up. JobMaster accesses ExecutionGraph, finds the execution for task A, and updates its state to FINISHED. The main thread completes processing.

Message 2 arrives from CheckpointCoordinator: triggerCheckpoint(checkpointId=42). The mailbox queues it behind Message 1. The main thread picks it up after completing Message 1. JobMaster accesses ExecutionGraph, iterates through all vertices, and triggers checkpoint 42 on each. The iteration is safe because Message 1 already completed. ExecutionGraph is in a consistent state.

Message 3 arrives from REST API: requestJobDetails(). The mailbox queues it behind Message 2. The main thread picks it up after completing Message 2. JobMaster reads ExecutionGraph and returns job details. The read sees all updates from Messages 1 and 2.

This sequential processing eliminates every concurrency concern. Message 2 never sees ExecutionGraph mid-update from Message 1. Message 3 always sees a consistent view. No locks required. No race conditions possible.
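The whole mechanism fits in a toy example. The sketch below is not Flink or Pekko code: MiniJobMaster and its methods are invented, and a single-thread ExecutorService stands in for the actor mailbox. What it shows is many sender threads, one consumer, and state that is never locked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy stand-in for an actor: one thread, one FIFO queue, unsynchronized state.
final class MiniJobMaster implements AutoCloseable {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final List<String> finishedTasks = new ArrayList<>(); // touched only by the mailbox thread

    // "Message": record that a task finished.
    CompletableFuture<Void> updateTaskFinished(String task) {
        return CompletableFuture.runAsync(() -> finishedTasks.add(task), mailbox);
    }

    // "Message": read the current state. The copy is made on the mailbox thread.
    CompletableFuture<List<String>> requestFinishedTasks() {
        return CompletableFuture.supplyAsync(() -> List.copyOf(finishedTasks), mailbox);
    }

    @Override
    public void close() {
        mailbox.shutdown();
    }

    // Many sender threads, one consumer: returns how many updates the actor recorded.
    static int runDemo(int senders) throws InterruptedException {
        try (MiniJobMaster jobMaster = new MiniJobMaster()) {
            Thread[] threads = new Thread[senders];
            for (int i = 0; i < senders; i++) {
                String task = "task-" + i;
                threads[i] = new Thread(() -> jobMaster.updateTaskFinished(task));
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join(); // after this, every update is already in the queue
            }
            // FIFO: this query is queued behind all updates, so it sees every one of them.
            return jobMaster.requestFinishedTasks().join().size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("recorded updates: " + runDemo(100));
    }
}
```

An unsynchronized ArrayList would normally be unsafe with 100 writer threads. Here it is fine, because those threads only enqueue messages and a single thread applies them.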

Architecture: The RPC Abstraction Layers

Flink builds its RPC system in layers. Each layer has a specific responsibility. The layers compose to provide type-safe, single-threaded, distributed method invocation.

To understand Flink’s RPC architecture, it helps to draw parallels with familiar Java patterns. If you’ve used the AWS SDK, Apache Tomcat, or Java Servlets, you already understand the core concepts - just with different names.

Mapping Flink RPC to Familiar Java Patterns

RpcGateway: The Interface Contract (Like AWS SDK Service Clients)

RpcGateway defines the contract for remote calls. It serves the same purpose as an AWS SDK service client interface.

AWS SDK Analogy: When you use S3Client from the AWS SDK, you call methods like putObject() or getObject(). You don’t think about HTTP, serialization, or retries. The interface abstracts the network layer completely. RpcGateway does the same for Flink’s internal communication.

// AWS SDK pattern - you're familiar with this
public interface S3Client {
    PutObjectResponse putObject(PutObjectRequest request);
    GetObjectResponse getObject(GetObjectRequest request);
}

// Flink RPC pattern - same concept, different domain
public interface JobMasterGateway extends RpcGateway {
    CompletableFuture<Acknowledge> updateTaskExecutionState(TaskExecutionState state);
    CompletableFuture<Acknowledge> cancel(Duration timeout);
    // Completes with the path of the written savepoint
    CompletableFuture<String> triggerSavepoint(String targetDirectory, boolean cancelJob);
}

Key Differences from AWS SDK:

Async by Default: Every RpcGateway method returns CompletableFuture. AWS SDK v2 offers both sync (S3Client) and async (S3AsyncClient) variants. Flink chose async-only to make the non-blocking nature explicit. Callers never block waiting for results - they attach callbacks or chain operations.
Bidirectional: AWS SDK clients only make outbound calls. Flink gateways are bidirectional. TaskExecutorGateway lets JobMaster call into TaskManager. JobMasterGateway lets TaskManager call into JobMaster. Both sides expose gateways.
Internal Network: AWS SDK calls traverse the public internet to AWS services. Flink RPC calls stay within the cluster’s internal network, typically using direct TCP connections.

JobMasterGateway declares methods that callers can invoke on JobMaster. The interface serves as documentation - new engineers read it to understand what operations JobMaster supports. Method signatures specify exact parameter types and return types. Javadoc explains semantics. The interface is the source of truth for the RPC contract.
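To see what “attach callbacks or chain operations” means in practice, here is a minimal sketch using only the JDK. StatusGateway and LocalStatusGateway are hypothetical stand-ins, not Flink interfaces; a real gateway would extend RpcGateway and the future would complete when the remote reply arrives.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative only: an async, gateway-style interface and a non-blocking caller.
final class AsyncGatewayDemo {

    // Hypothetical gateway with a single async method.
    interface StatusGateway {
        CompletableFuture<String> requestJobStatus(String jobId);
    }

    // In-process stand-in for a remote endpoint.
    static final class LocalStatusGateway implements StatusGateway {
        @Override
        public CompletableFuture<String> requestJobStatus(String jobId) {
            return CompletableFuture.supplyAsync(() -> "RUNNING");
        }
    }

    // The caller composes on the future instead of blocking a thread on the reply.
    static CompletableFuture<String> describe(StatusGateway gateway, String jobId) {
        return gateway.requestJobStatus(jobId)
                .thenApply(status -> jobId + " is " + status)
                .exceptionally(err -> jobId + " is UNKNOWN (" + err.getMessage() + ")");
    }

    public static void main(String[] args) {
        // join() is only here so the demo prints before the JVM exits.
        System.out.println(describe(new LocalStatusGateway(), "job-1").join());
    }
}
```

The caller never learns whether the reply came from the same JVM or across the network, which is the point of putting the contract in an interface.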

RpcEndpoint: The Base Class (Like a Servlet or Spring Controller)

RpcEndpoint is the server-side handler. Every distributed component extends this class. Think of it as a Servlet that handles incoming requests, but with a critical difference: all requests execute on a single thread.

Servlet Analogy: In a traditional Java web application, you write a Servlet to handle HTTP requests:

// Traditional Servlet - Tomcat spawns a thread per request
public class OrderServlet extends HttpServlet {
    private OrderRepository repository;  // Shared state - needs synchronization!
    
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
        // WARNING: Multiple threads execute this concurrently
        // Must synchronize access to repository
        synchronized(repository) {
            repository.createOrder(parseOrder(req));
        }
    }
}

Flink RpcEndpoint - Same concept, but single-threaded:

// Flink RpcEndpoint - only ONE thread ever executes methods
public class JobMaster extends FencedRpcEndpoint<JobMasterId>
        implements JobMasterGateway {

    private SchedulerNG schedulerNG;  // Shared state - NO synchronization needed!

    @Override
    public CompletableFuture<Acknowledge> updateTaskExecutionState(
            TaskExecutionState state) {
        // SAFE: Only main thread executes this
        // No locks, no synchronization, no race conditions
        schedulerNG.updateTaskExecutionState(state);
        return CompletableFuture.completedFuture(Acknowledge.get());
    }
}

Why Single-Threaded Beats Multi-Threaded Here:

Tomcat’s thread-per-request model works well for stateless web applications. Each request is independent. But Flink’s components maintain complex shared state (ExecutionGraph with thousands of vertices, checkpoint state, slot allocations). The single-threaded model eliminates an entire class of bugs.

Key RpcEndpoint Features:

MainThreadExecutor: The constructor creates a dedicated executor bound to the endpoint. All RPC calls execute through this executor. The class provides methods to schedule work on the main thread:
- runAsync(Runnable) - queues a task for later execution
- callAsync(Callable) - queues a task and returns CompletableFuture
- scheduleRunAsync(Runnable, Duration) - queues work with a delay
Lifecycle Hooks: Like Servlet’s init() and destroy():
- onStart() - runs when the endpoint begins accepting messages
- onStop() - runs during shutdown
Both execute on the main thread, making initialization and cleanup thread-safe.
Thread Safety Check: The validateRunsInMainThread() method catches programming errors early:

protected void validateRunsInMainThread() {
    if (!rpcServer.isCurrentThreadMainThread()) {
        throw new IllegalStateException(
            "This method must be called from within the main thread.");
    }
}
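The features above can be approximated with plain JDK classes. The sketch below is a toy, not Flink’s RpcEndpoint: MiniEndpoint, its thread name, and its counter are invented, and it assumes a real dedicated thread where Flink relies on the actor runtime. It shows runAsync, callAsync, and a main-thread check working together.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy endpoint loosely modeled on the description above; not Flink's RpcEndpoint.
class MiniEndpoint implements AutoCloseable {
    private volatile Thread mainThread; // set when the executor creates its worker thread
    private final ExecutorService mainThreadExecutor =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread t = new Thread(runnable, "mini-endpoint-main");
                mainThread = t;
                return t;
            });
    private int counter; // safe without locks: only the main thread touches it

    // Queue a task for later execution on the main thread.
    void runAsync(Runnable action) {
        mainThreadExecutor.execute(action);
    }

    // Queue a task and hand back a future for its result.
    <V> CompletableFuture<V> callAsync(Callable<V> call) {
        CompletableFuture<V> result = new CompletableFuture<>();
        mainThreadExecutor.execute(() -> {
            try {
                result.complete(call.call());
            } catch (Exception e) {
                result.completeExceptionally(e);
            }
        });
        return result;
    }

    // Fails fast when state is touched from the wrong thread.
    void validateRunsInMainThread() {
        if (Thread.currentThread() != mainThread) {
            throw new IllegalStateException(
                    "This method must be called from within the main thread.");
        }
    }

    int increment() {
        validateRunsInMainThread();
        return ++counter;
    }

    @Override
    public void close() {
        mainThreadExecutor.shutdown();
    }

    public static void main(String[] args) {
        try (MiniEndpoint endpoint = new MiniEndpoint()) {
            endpoint.runAsync(endpoint::increment);
            System.out.println("counter: " + endpoint.callAsync(endpoint::increment).join());
            try {
                endpoint.increment(); // wrong thread: rejected by the check
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }
}
```

The check costs one reference comparison, and it turns a silent data race into an immediate, easy-to-trace failure.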

Component Hierarchy:

JobMaster extends FencedRpcEndpoint - coordinates job execution
TaskExecutor extends RpcEndpoint - runs tasks on worker nodes
ResourceManager extends FencedRpcEndpoint - manages cluster resources
Dispatcher extends FencedRpcEndpoint - handles job submission

RpcService: The Factory and Connection Manager (Like Tomcat’s Connector)

RpcService is an abstraction that manages endpoint lifecycles and gateway connections. It defines the contract for how endpoints are created and how connections are established - but not how messages travel over the wire.

Currently, the only production implementation is PekkoRpcService, which uses Pekko’s actor remoting over TCP. However, the abstraction exists precisely so the transport can be swapped without changing Flink’s core components. Future implementations could use:

gRPC - Industry-standard RPC with HTTP/2, protobuf serialization, and mature tooling
HTTP/REST - Simpler debugging, standard load balancers, firewall-friendly
Custom TCP - Optimized binary protocol without Pekko’s overhead

The key insight: JobMaster, TaskExecutor, and ResourceManager don’t know or care whether messages travel via Pekko actors, gRPC streams, or HTTP requests. They only interact with the RpcService abstraction.

Tomcat Analogy: Tomcat’s Connector accepts incoming connections, manages the thread pool, and routes requests to Servlets. RpcService does the same for Flink. Just as Tomcat can swap between NIO, NIO2, or APR connectors without changing your Servlets, Flink could swap RpcService implementations without changing endpoints.

AWS SDK Analogy: RpcService also resembles SdkClientBuilder combined with connection pooling. The SDK abstracts whether it uses Apache HttpClient, Netty, or URL Connection under the hood:

// AWS SDK - builder creates configured client (transport abstracted)
S3Client s3 = S3Client.builder()
    .region(Region.US_EAST_1)
    .httpClient(ApacheHttpClient.create())  // Could swap to UrlConnectionHttpClient
    .build();

// Flink - RpcService abstraction (transport abstracted)
// Today: PekkoRpcService (actor-based TCP)
// Future: Could be GrpcRpcService, HttpRpcService, etc.
RpcService rpcService = new PekkoRpcService(config, actorSystem);

// These calls work identically regardless of RpcService implementation:
// Start a server (like deploying a Servlet)
rpcService.startServer(jobMaster);

// Connect to remote server (like creating SDK client)
JobMasterGateway gateway = rpcService.connect(address, JobMasterGateway.class).get();

Key RpcService Responsibilities (Interface Contract):

These responsibilities are defined by the RpcService interface. Any implementation - Pekko, gRPC, or HTTP - must fulfill them:

Server Creation: When JobMaster instantiates, it calls rpcService.startServer(this). The implementation creates whatever underlying machinery is needed (actors for Pekko, gRPC stubs for gRPC, servlet registration for HTTP) and starts the main thread executor. The endpoint is now ready to receive messages.
Client Connection: A TaskManager needs to communicate with JobMaster on another machine. It calls rpcService.connect(address, JobMasterGateway.class). The implementation returns a proxy object implementing JobMasterGateway. Whether that proxy sends Pekko messages, gRPC calls, or HTTP requests is an implementation detail hidden from the caller.
Transport Management: The implementation manages its transport layer - ActorSystem for Pekko, ManagedChannel for gRPC, HttpClient for HTTP. It handles configuration, connection pooling, and graceful shutdown.

Why This Abstraction Matters:

Akka’s license change in 2022 forced Flink to migrate from Akka to Pekko, its Apache-licensed fork. This abstraction means a future migration to gRPC or HTTP would only require implementing a new RpcService - no changes to JobMaster, TaskExecutor, or ResourceManager.

RpcServer: The Message Dispatcher (Like DispatcherServlet)

RpcServer is the internal component that dispatches messages to the endpoint.

Spring MVC Analogy: Spring’s DispatcherServlet receives all HTTP requests, determines which controller method to invoke, and dispatches the call. RpcServer does the same for RPC messages.

Key RpcServer Responsibilities:

Thread Tracking: Knows which thread is the endpoint’s main thread. Provides isCurrentThreadMainThread() for safety checks.
Method Invocation: When a message arrives requesting updateTaskExecutionState():
- Locates the method on the endpoint class
- Deserializes the arguments
- Invokes the method reflectively
- Captures the return value
- Serializes the result and sends it back
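The middle three steps of that list (locate, invoke, capture) can be sketched with plain Java reflection. Invocation, EchoEndpoint, and ReflectiveDispatcher are hypothetical names for this illustration, and serialization is left out entirely; this is not Flink's RpcServer code.

```java
import java.lang.reflect.Method;

// Hypothetical message shape: method name, parameter types, arguments.
final class Invocation {
    final String methodName;
    final Class<?>[] parameterTypes;
    final Object[] args;

    Invocation(String methodName, Class<?>[] parameterTypes, Object[] args) {
        this.methodName = methodName;
        this.parameterTypes = parameterTypes;
        this.args = args;
    }
}

// Hypothetical endpoint used only for this illustration.
class EchoEndpoint {
    public String greet(String name, int times) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < times; i++) {
            out.append(name);
        }
        return out.toString();
    }
}

class ReflectiveDispatcher {
    // Locate the method on the endpoint, invoke it, and hand back the return value.
    static Object dispatch(Object endpoint, Invocation invocation) {
        try {
            Method method = endpoint.getClass()
                .getMethod(invocation.methodName, invocation.parameterTypes);
            return method.invoke(endpoint, invocation.args);
        } catch (ReflectiveOperationException e) {
            // Covers "no such method" as well as exceptions thrown by the target.
            throw new IllegalStateException(
                "Could not dispatch " + invocation.methodName, e);
        }
    }
}
```

Including the parameter types in the message is what lets the lookup distinguish between overloaded methods with the same name.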

PekkoInvocationHandler: The Client-Side Proxy (Like AWS SDK’s HTTP Layer)

PekkoInvocationHandler implements InvocationHandler for the dynamic proxy. It converts method calls into network messages.

AWS SDK Analogy: When you call s3Client.putObject(request), the SDK internally:

Serializes the request to HTTP
Signs the request
Sends over HTTPS
Deserializes the response

PekkoInvocationHandler does the same, but with Pekko’s actor messaging instead of HTTP:

// What you write
gateway.updateTaskExecutionState(state);

// What PekkoInvocationHandler does internally (simplified)
public Object invoke(Object proxy, Method method, Object[] args) {
    // 1. Create invocation object (like HTTP request)
    RpcInvocation invocation = new RpcInvocation(
        method.getName(),           // "updateTaskExecutionState"
        method.getParameterTypes(), // [TaskExecutionState.class]
        args                        // [state]
    );
    
    // 2. Send via actor (like HTTP send)
    CompletableFuture result = actorRef.ask(invocation, timeout);
    
    // 3. Return future (response will arrive asynchronously)
    return result;
}

HttpClient Comparison:

Network Path: Client to Server Flow

When an RPC call crosses machine boundaries, a complex flow executes. Understanding this flow helps debug network-related failures.

Client Side: Gateway to Network

The flow mirrors what happens in an AWS SDK call, but with actors instead of HTTP.

Step 1: Obtain Gateway (Like Creating SDK Client)

// AWS SDK
S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build();

// Flink RPC
JobMasterGateway gateway = rpcService.connect(jobMasterAddress, JobMasterGateway.class).get();

The connect() call doesn’t return a real JobMasterGateway implementation. It returns a dynamic proxy created by Proxy.newProxyInstance(). The proxy implements the interface but delegates all calls to PekkoInvocationHandler.
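A small JDK-only sketch shows how such a proxy behaves. GreeterGateway, RecordingHandler, and ProxyDemo are hypothetical names; the handler here only records the call and returns a canned reply, where a real handler would serialize the invocation and send it to the remote endpoint.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical gateway interface for illustration only.
interface GreeterGateway {
    String greet(String name);
}

// Stands in for the client-side handler: every interface call lands in invoke().
class RecordingHandler implements InvocationHandler {
    final List<String> sent = new ArrayList<>();

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) {
        // A real handler would build a message and send it over the network.
        sent.add(method.getName() + Arrays.toString(args));
        return "reply-to-" + method.getName();
    }
}

class ProxyDemo {
    // Builds an object that implements GreeterGateway without any hand-written class.
    static GreeterGateway connect(InvocationHandler handler) {
        return (GreeterGateway) Proxy.newProxyInstance(
            GreeterGateway.class.getClassLoader(),
            new Class<?>[] {GreeterGateway.class},
            handler);
    }
}
```

Calling ProxyDemo.connect(handler).greet("Ada") never runs any greeting logic locally; the handler sees the method name and arguments and decides what to return.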

Step 2: Method Invocation (Like SDK Method Call)

// AWS SDK
PutObjectResponse response = s3.putObject(request);  // Looks local, actually remote

// Flink RPC  
CompletableFuture<Acknowledge> future = gateway.updateTaskExecutionState(state);  // Same pattern

The proxy intercepts the call. No business logic executes locally.

Step 3: Create Invocation Object (Like HTTP Request Building)

// Conceptually similar to:
// HttpRequest.newBuilder()
//     .uri(URI.create("https://s3.amazonaws.com/bucket/key"))
//     .POST(BodyPublishers.ofByteArray(serialize(request)))
//     .build();

RpcInvocation invocation = new RpcInvocation(
    "updateTaskExecutionState",      // Method name (like URL path)
    new Class[]{TaskExecutionState.class},  // Parameter types
    new Object[]{state}              // Arguments (like request body)
);

Step 4: Serialize and Send (Like HTTP Transport)

// AWS SDK uses HTTP client internally
// httpClient.sendAsync(httpRequest, responseHandler);

// Flink uses Pekko actor messaging
actorRef.ask(invocation, timeout);  // Invocation is serialized (Java serialization in Flink's RPC layer) and sent over TCP

Server Side: Network to Execution

Step 1: TCP Receive (Like Tomcat Accepting Connection)

The remote machine receives TCP bytes. Pekko’s network layer reads the frame and routes to the target actor based on the actor path.

Step 2: Actor Receives Message (Like Servlet.service())

PekkoRpcActor receives the message in its onReceive() method - the entry point for all incoming messages.

// Conceptually similar to:
// public void service(HttpServletRequest req, HttpServletResponse resp) {
//     String method = req.getMethod();
//     String path = req.getPathInfo();
//     // Route to appropriate handler
// }

public void onReceive(Object message) {
    if (message instanceof RpcInvocation) {
        handleRpcInvocation((RpcInvocation) message);
    }
}

Step 3: Mailbox Queuing (Unlike Tomcat - This is the Key Difference)

Here’s where Flink diverges from traditional web servers. Tomcat would hand the request to a worker thread from its pool and execute it immediately, concurrently with other requests. Flink enqueues in the mailbox:

Tomcat: Request arrives → Worker thread from pool → Execute handler → Return response
Flink:  Message arrives → Enqueue in mailbox → Wait turn → Main thread executes → Return response

The message joins the queue behind any previously arrived messages. FIFO ordering guarantees deterministic execution.
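The queue-then-execute behaviour can be approximated with a BlockingQueue and one consumer thread. Mailbox and MailboxDemo are hypothetical names for this sketch, and Pekko's real mailbox and dispatcher are considerably more involved, so treat it as a model of the ordering guarantee only.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical mailbox: many senders, exactly one thread that runs messages.
class Mailbox {
    private static final Runnable STOP = () -> { };   // marker message
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread mainThread;

    Mailbox() {
        mainThread = new Thread(() -> {
            try {
                while (true) {
                    Runnable message = queue.take();   // wait for the next message
                    if (message == STOP) {
                        return;
                    }
                    message.run();                     // one at a time, in arrival order
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "mailbox-main");
        mainThread.start();
    }

    void tell(Runnable message) {
        queue.add(message);
    }

    // Enqueue the stop marker behind all earlier messages and wait for the drain.
    void stopAndAwait() {
        queue.add(STOP);
        try {
            mainThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class MailboxDemo {
    // Several sender threads post increments; only the mailbox thread touches total.
    static int run(int senders, int perSender) {
        Mailbox mailbox = new Mailbox();
        int[] total = {0};
        Thread[] threads = new Thread[senders];
        for (int s = 0; s < senders; s++) {
            threads[s] = new Thread(() -> {
                for (int i = 0; i < perSender; i++) {
                    mailbox.tell(() -> total[0]++);
                }
            });
            threads[s].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        mailbox.stopAndAwait();
        return total[0];
    }
}
```

The unsynchronized total[0]++ is safe here only because every increment runs on the single mailbox thread; the senders never touch the counter themselves.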

Step 4: Main Thread Execution (Single-Threaded Handler)

The main thread dequeues the invocation when it reaches the front. It uses reflection to call updateTaskExecutionState(state) on the JobMaster instance. The method executes with full access to internal state - no locks needed.

Step 5: Response (Like HTTP Response)

The method returns CompletableFuture. The actor captures the result, serializes it, and sends bytes back over TCP. The caller’s CompletableFuture completes with the result.

Complete Flow Comparison

Practical Implications

Code Simplicity

The RpcEndpoint pattern transforms how developers write distributed coordination code. Compare two approaches to updating ExecutionGraph.

Without RpcEndpoint (Hypothetical - Like Traditional Servlet):

// Similar to a Servlet with shared state
class JobMaster {
    private ExecutionGraph executionGraph;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    
    void updateTaskState(TaskExecutionState state) {
        lock.writeLock().lock();
        try {
            Execution exec = executionGraph.getExecution(state.getID());
            exec.updateState(state.getExecutionState());
        } finally {
            lock.writeLock().unlock();
        }
    }
    
    JobDetails getJobDetails() {
        lock.readLock().lock();
        try {
            return JobDetails.createFrom(executionGraph);
        } finally {
            lock.readLock().unlock();
        }
    }
}

With RpcEndpoint (Actual Flink):

public class JobMaster extends FencedRpcEndpoint<JobMasterId>
        implements JobMasterGateway {
    
    private SchedulerNG schedulerNG;  // Contains ExecutionGraph
    
    @Override
    public CompletableFuture<Acknowledge> updateTaskExecutionState(
            TaskExecutionState state) {
        // No lock needed - runs on main thread
        Execution exec = schedulerNG.getExecutionGraph()
                                    .getExecution(state.getID());
        exec.updateState(state.getExecutionState());
        return CompletableFuture.completedFuture(Acknowledge.get());
    }
    
    @Override
    public CompletableFuture<JobDetails> requestJobDetails(Duration timeout) {
        // No lock needed - runs on main thread
        return CompletableFuture.completedFuture(
            JobDetails.createFrom(schedulerNG.getExecutionGraph()));
    }
}

The actual Flink code has no locks. Methods read and write state directly. The single-threaded guarantee is architectural, not annotational. Developers cannot forget to acquire a lock because no lock exists.

Debugging Benefits

When investigating issues, the single-threaded model simplifies analysis. All state changes happen in sequence. Given a log of messages, you can reconstruct exact system state at any point. No thread interleavings to consider. No happens-before relationships to reason about.

Flink provides validateRunsInMainThread() for defensive programming. Critical methods call this check at entry. If a developer accidentally calls a state-modifying method from a wrong thread, the check throws immediately. The stack trace points to the violation. The bug is caught in development, not production.

Performance Considerations

The single-threaded model has a trade-off. All operations serialize through one thread. High message volume can create backlog in the mailbox. The main thread becomes a bottleneck.

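Flink mitigates this in practice. RPC methods are designed to be fast. They update in-memory state and return quickly. Heavy computation is offloaded to separate thread pools, and the result is handed back to the main thread (for example via runAsync() or callAsync()) before any endpoint state is touched. Blocking I/O is kept off the main thread.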
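One common way to keep slow work off a single main thread is to run it on a separate pool and then hop back to the main-thread executor before touching state. The sketch below uses only CompletableFuture from the JDK; OffloadDemo, slowLookup, and the thread names are hypothetical, and Flink's own executors are wired differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OffloadDemo {
    // Hypothetical slow call, standing in for blocking I/O.
    static String slowLookup() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "result";
    }

    // Returns the names of the threads that ran the slow step and the state update.
    static String[] run() {
        ExecutorService mainThread =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        ExecutorService ioPool =
            Executors.newFixedThreadPool(2, r -> new Thread(r, "io-worker"));
        String[] ranOn = new String[2];
        try {
            CompletableFuture
                .supplyAsync(() -> {
                    ranOn[0] = Thread.currentThread().getName();
                    return slowLookup();                 // slow part, off the main thread
                }, ioPool)
                .thenAcceptAsync(value -> {
                    ranOn[1] = Thread.currentThread().getName();
                    // endpoint state would be updated here, back on the main thread
                }, mainThread)
                .join();
        } finally {
            mainThread.shutdown();
            ioPool.shutdown();
        }
        return ranOn;
    }
}
```

The main thread is only occupied for the short state update; while slowLookup() sleeps, it stays free to process other messages.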

For most workloads, the main thread handles thousands of messages per second without issue. The simplicity and correctness benefits outweigh the throughput limitation. Debugging a race condition costs more engineering time than optimizing a hot path.

Historical Context: Akka to Pekko

Flink used Akka from its early versions. Akka provided a mature, battle-tested actor implementation. Flink’s usage was focused: message passing between components, single-threaded execution guarantees, and failure detection via DeathWatch.

In September 2022, Lightbend changed Akka’s license from Apache 2.0 to Business Source License (BSL). This license is incompatible with Apache Software Foundation projects. Flink could not continue using new Akka versions.

The Apache Software Foundation responded by forking Akka 2.6.x as Apache Pekko. Pekko maintains Apache 2.0 licensing. It provides API compatibility with Akka 2.6.x. Migration requires updating imports from akka.* to org.apache.pekko.* and configuration keys from akka.* to pekko.*.

Flink 1.18 completed the migration to Pekko. The architecture remains identical. The single-threaded execution guarantee is unchanged. Existing Flink applications require no code changes. Only operators running custom Akka code directly (rare) need updates.

Summary

Flink’s RPC architecture solves a fundamental distributed systems problem. Multiple components must access shared state. Traditional locking creates complexity, deadlocks, and race conditions. The Actor Model provides an elegant alternative.

Each component extends RpcEndpoint. Each RpcEndpoint processes messages on a single thread. The mailbox queues messages in FIFO order. No concurrent access is possible. No locks are needed.

The RPC layer provides type-safe communication. RpcGateway interfaces define contracts (like AWS SDK client interfaces). Dynamic proxies implement these interfaces (like SDK internal handlers). RpcService abstracts the transport layer - currently Pekko, but designed to be swappable with gRPC or HTTP implementations. RpcEndpoint handles requests (like Servlets). The result is distributed method invocation that feels like local calls.

This architecture has served Flink well for years. It enables correct coordination across hundreds of distributed components. It simplifies debugging and testing. It allows developers to write straightforward sequential code for inherently concurrent problems.

Compare And Swap is all you need

Thu, 21 May 2026 00:48:15 GMT

This blog post is inspired by the first 6 chapters of The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit.

Imagine building a distributed counter that must handle millions of updates per second across dozens of threads. Traditional locks serialize access, creating bottlenecks. You need something better: a way for threads to coordinate without blocking, without deadlocks, without the performance collapse that comes with contention. This isn’t just a performance optimization problem; it’s a fundamental question about what synchronization primitives are actually necessary. Can we build wait-free concurrent data structures? Which hardware instructions must processors provide? The answer, discovered through decades of theoretical work, reveals that one primitive, Compare-And-Swap (CAS), is universal.

Modern processors have dozens of cores, and our applications must leverage them all. From distributed databases processing millions of transactions per second to real-time analytics engines crunching streaming data, concurrent programming has moved from a specialized skill to a fundamental requirement. Yet building correct concurrent systems remains notoriously difficult: race conditions lurk in seemingly innocent code, deadlocks emerge from complex lock hierarchies, and performance bottlenecks appear where we least expect them.

The real challenge isn’t just making threads cooperate; it’s understanding which synchronization primitives actually give us the power we need. Over decades, computer scientists developed countless approaches: Peterson’s algorithm for mutual exclusion, Lamport’s bakery algorithm for fairness, sophisticated lock implementations with elaborate protocols. But a deeper question remained unanswered: Are these atomic building blocks fundamentally different in power? Can some primitives solve problems that others simply cannot? More critically for systems engineers: If we’re designing hardware or choosing synchronization mechanisms for a new platform, which primitives must we provide?

The answer fundamentally changed how we think about concurrent programming: Compare-And-Swap (CAS) is universal. This isn’t marketing hyperbole: it’s a mathematically proven property. Any concurrent object that can be specified sequentially can be implemented in a wait-free manner using CAS. From simple locks to complex data structures, from blocking algorithms to non-blocking ones, CAS provides sufficient power to construct them all. This universality explains why every modern processor, from ARM to x86, from mobile chips to data center CPUs, provides CAS or its equivalent as a fundamental instruction.

But what exactly is CAS? At its core, Compare-And-Swap is an atomic operation that reads a memory location, compares it to an expected value, and only updates it to a new value if the comparison succeeds. In Java, this is exposed through classes like AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

// CAS operation: atomically compare and update
AtomicInteger counter = new AtomicInteger(0);  // Initialize to 0

// Thread 1: Try to increment from 0 to 1
int expected = 0;        // What we expect the current value to be
int newValue = 1;         // What we want to set it to
boolean success = counter.compareAndSet(expected, newValue);
// If counter was 0, it's now 1 and success = true
// If counter was already changed by another thread, success = false

// Thread 2 (concurrent): Also tries to increment
int myExpected = 0;
int myNewValue = 1;
boolean mySuccess = counter.compareAndSet(myExpected, myNewValue);
// Only one thread will succeed - CAS guarantees atomicity

The key insight is that compareAndSet executes atomically: it reads the current value, compares it to expected, and only updates to newValue if they match. If another thread modified the value between the read and write, the operation fails and returns false, allowing the thread to retry. This atomicity, the guarantee that the comparison and update happen as a single, indivisible operation, is what makes CAS powerful enough to build universal constructions.
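The retry idea is usually written as a loop: read the current value, compute the new one, and try the CAS again if another thread got in first. CasCounter and runDemo below are hypothetical names for this sketch; in practice AtomicInteger.incrementAndGet() already provides an atomic increment, so the hand-written loop is only there to show the pattern.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Increment with an explicit CAS retry loop.
    int increment() {
        while (true) {
            int current = value.get();          // 1. read
            int next = current + 1;             // 2. compute
            if (value.compareAndSet(current, next)) {
                return next;                    // 3. publish, only if nobody interfered
            }
            // Another thread changed the value between steps 1 and 3: retry.
        }
    }

    int get() {
        return value.get();
    }

    // Runs several threads that each increment perThread times; returns the final count.
    static int runDemo(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }
}
```

A failed CAS means some other thread's CAS succeeded, so the system as a whole always moves forward even though an individual thread may loop more than once.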

Understanding this journey from basic mutual exclusion to universal constructions isn’t just academic: it’s the foundation for reasoning about concurrent systems at scale.

In this post, we’ll explore:

Early mutual exclusion algorithms (Peterson’s, Bakery) that revealed the limitations of read/write operations
Formal definitions of correctness (linearizability) and progress conditions that enable rigorous reasoning
The consensus hierarchy that measures primitive power and reveals fundamental limitations
Universal constructions that prove CAS can implement any concurrent object wait-free

The Synchronization Problem

Mutual exclusion is the foundational problem in concurrent programming, and early solutions revealed both the possibility and the inherent limitations of using only read/write operations. When multiple threads access shared resources, we need mechanisms to ensure critical sections execute atomically: one thread at a time. The pioneering algorithms from the 1960s through 1980s demonstrated that mutual exclusion could be achieved using only memory reads and writes, but at a cost that foreshadowed deeper theoretical constraints.

Peterson’s algorithm elegantly solved mutual exclusion for two threads using just two flags and a turn variable. Each thread signals its intent to enter the critical section, then yields priority to the other thread. The beauty lies in its simplicity: no special hardware instructions required, just careful ordering of reads and writes. Yet this simplicity masks a critical limitation: it only works for two threads. Extending Peterson’s approach to n threads means layering it, for example as a filter lock with n-1 waiting levels or a tournament tree of two-thread locks, and even then, threads must actively spin while waiting, burning CPU cycles.

Here’s Peterson’s algorithm implemented in Java, showing how mutual exclusion is achieved using only read/write operations:

class PetersonsLock {
    // Flags indicating each thread's desire to enter critical section.
    // Two volatile fields rather than a volatile boolean[]: marking an array
    // reference volatile does not make its elements volatile.
    private volatile boolean flag0;
    private volatile boolean flag1;
    // Turn variable: which thread should yield priority
    private volatile int turn;
    
    // Thread 0 calls this to acquire the lock
    public void lock0() {
        flag0 = true;             // Signal intent to enter
        turn = 1;                 // Give priority to thread 1
        // Wait while thread 1 wants to enter AND it's thread 1's turn
        while (flag1 && turn == 1) {
            // Busy-wait: spin until condition is false
            Thread.yield();       // Hint to scheduler (optional)
        }
        // Now in critical section
    }
    
    // Thread 1 calls this to acquire the lock
    public void lock1() {
        flag1 = true;             // Signal intent to enter
        turn = 0;                 // Give priority to thread 0
        // Wait while thread 0 wants to enter AND it's thread 0's turn
        while (flag0 && turn == 0) {
            Thread.yield();
        }
        // Now in critical section
    }
    
    // Thread 0 releases the lock
    public void unlock0() {
        flag0 = false;            // Signal we're done
    }
    
    // Thread 1 releases the lock
    public void unlock1() {
        flag1 = false;            // Signal we're done
    }
}

The algorithm works through careful coordination: each thread sets its flag to true (indicating desire to enter), then sets turn to favor the other thread. The thread waits (spins) only if both threads want to enter AND it’s the other thread’s turn. This ensures mutual exclusion: at most one thread can be in the critical section. However, notice the while loop: threads must continuously check the condition, consuming CPU cycles. This busy-waiting is the blocking behavior that weaker primitives force upon us.

Why this matters in practice: In real systems, busy-waiting wastes CPU cycles that could be used for productive work. If a thread holding a lock is preempted or runs slowly, all waiting threads spin uselessly, consuming power and reducing overall throughput. This is why blocking algorithms can perform poorly under contention: they’re vulnerable to priority inversion, convoying (where slow threads delay fast ones), and wasted CPU cycles. Modern systems need synchronization mechanisms that provide progress guarantees even when some threads are slow or crash.

Lamport’s bakery algorithm generalized the solution to n threads by drawing inspiration from a bakery’s ticket system. Each thread takes a number, and threads enter in numerical order. This achieved both safety (mutual exclusion) and fairness (first-come, first-served), making it a significant theoretical advance. The algorithm’s elegance comes at the price of complexity: comparing ticket numbers requires careful handling of ties, and threads must scan all other threads’ tickets before entering. More critically, like Peterson’s algorithm, bakery forces threads into busy-waiting: they’re blocked not by sleeping, but by continuously checking conditions.
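The ticket idea can be sketched in Java as follows. This follows Lamport's classic formulation with a "choosing" flag per thread; BakeryLock and BakeryDemo are hypothetical names, thread ids are assumed to be 0..n-1, and atomic arrays are used so that every element read and write has volatile semantics.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLongArray;

class BakeryLock {
    private final int n;
    private final AtomicIntegerArray choosing;  // 1 while thread i is picking a ticket
    private final AtomicLongArray ticket;       // 0 means "not waiting"

    BakeryLock(int n) {
        this.n = n;
        this.choosing = new AtomicIntegerArray(n);
        this.ticket = new AtomicLongArray(n);
    }

    void lock(int id) {
        // Doorway: take a number larger than every number currently visible.
        choosing.set(id, 1);
        long max = 0;
        for (int j = 0; j < n; j++) {
            max = Math.max(max, ticket.get(j));
        }
        long mine = max + 1;
        ticket.set(id, mine);
        choosing.set(id, 0);

        // Waiting room: scan every other thread before entering.
        for (int j = 0; j < n; j++) {
            if (j == id) {
                continue;
            }
            while (choosing.get(j) == 1) {
                Thread.yield();                  // j is still picking its ticket
            }
            while (true) {
                long theirs = ticket.get(j);
                // Ties on the ticket number are broken by thread id.
                boolean theyGoFirst = theirs != 0
                    && (theirs < mine || (theirs == mine && j < id));
                if (!theyGoFirst) {
                    break;
                }
                Thread.yield();                  // busy-wait until j has left
            }
        }
    }

    void unlock(int id) {
        ticket.set(id, 0);
    }
}

class BakeryDemo {
    // Each thread increments a plain counter under the lock; returns the final value.
    static int run(int threads, int perThread) {
        BakeryLock lock = new BakeryLock(threads);
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock(id);
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock(id);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }
}
```

Both inner loops are spin loops, which is the busy-waiting described above: a waiting thread keeps re-reading other threads' tickets until it is allowed through.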

What these algorithms collectively demonstrate is profound: mutual exclusion is achievable with read/write operations alone, but the resulting solutions are inherently blocking. Whether through busy-waiting spins in Peterson’s protocol or ticket-checking loops in bakery, threads cannot make progress independently. They must wait, they must check, they must coordinate through shared memory locations that require constant polling. This blocking nature isn’t a flaw in the algorithms: it’s a fundamental consequence of the weakness of read/write operations themselves, a limitation that would prove mathematically inevitable.

But to prove that limitation mathematically, and to understand which primitives are truly necessary, we need formal definitions. What does it mean for a concurrent algorithm to be “correct”? How do we characterize different levels of blocking? These questions require precise answers before we can establish the hierarchy of primitive power.

Defining Correctness

Before we can evaluate synchronization mechanisms or prove algorithms correct, we must first define what “correct” actually means in concurrent systems. In sequential programming, correctness is straightforward: given the same inputs, your function produces the expected output. But in concurrent programming, operations overlap in time, multiple threads interleave their actions unpredictably, and the same sequence of function calls can produce different results depending on timing. Without a formal definition of correctness, we’re left arguing subjectively about whether an implementation “works” or debating whether a test failure represents a real bug or just unfortunate timing.

Linearizability provides the gold standard: a concurrent execution is correct if it appears equivalent to some sequential execution, where each operation takes effect instantaneously at some point between its invocation and response. This “linearization point” gives us a powerful mental model: despite the chaos of concurrent operations overlapping in time, we can reason about them as if they happened one at a time in some valid order. A concurrent queue is correct if it behaves like a sequential FIFO queue, just with operations atomically “snapping” into place at their linearization points. Critically, linearizability is compositional: if each individual object in your system is linearizable, the entire system is linearizable. This compositionality is what makes large-scale concurrent systems tractable: you can reason about components independently without worrying about how their combination might violate correctness.

A quick example makes this concrete. Suppose three threads hit a queue concurrently:

Time -------------------------------------------->

T1:  |--- enqueue(5) ---|
T2:       |--- enqueue(7) ---|
T3:            |-- dequeue() --|

These operations overlap, but linearizability says we can pick one instant within each operation’s interval where it “takes effect.” One valid linearization order: enqueue(5), then enqueue(7), then dequeue() → 5. The queue behaves as if those three calls happened sequentially in that order. Verifying correctness reduces to: does some valid sequential ordering exist that respects the real-time constraints?
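For tiny histories, that existence question can be answered by brute force. The sketch below is a hypothetical checker (QueueHistoryChecker and Op are illustrative names, and the integer timestamps are made up) that tries every ordering consistent with real time and replays it against a sequential FIFO queue. It is exponential in the number of operations, so it is a teaching aid rather than a practical tool.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class QueueHistoryChecker {
    // One completed operation: enqueue(value), or a dequeue that returned value.
    static final class Op {
        final boolean isEnqueue;
        final int value;
        final int start;   // invocation time
        final int end;     // response time

        Op(boolean isEnqueue, int value, int start, int end) {
            this.isEnqueue = isEnqueue;
            this.value = value;
            this.start = start;
            this.end = end;
        }
    }

    static boolean isLinearizable(List<Op> history) {
        return search(new ArrayList<>(history), new ArrayDeque<>());
    }

    // Try each operation that may legally come next, replaying it on a sequential queue.
    private static boolean search(List<Op> remaining, Deque<Integer> queue) {
        if (remaining.isEmpty()) {
            return true;
        }
        for (int i = 0; i < remaining.size(); i++) {
            Op candidate = remaining.get(i);
            if (!mayGoNext(candidate, remaining)) {
                continue;
            }
            Deque<Integer> next = new ArrayDeque<>(queue);
            if (candidate.isEnqueue) {
                next.addLast(candidate.value);
            } else {
                Integer head = next.pollFirst();
                if (head == null || head != candidate.value) {
                    continue;   // a sequential queue would not have returned this value
                }
            }
            List<Op> rest = new ArrayList<>(remaining);
            rest.remove(i);
            if (search(rest, next)) {
                return true;
            }
        }
        return false;
    }

    // Real-time constraint: an operation that finished before the candidate began
    // must be ordered before it.
    private static boolean mayGoNext(Op candidate, List<Op> remaining) {
        for (Op other : remaining) {
            if (other != candidate && other.end < candidate.start) {
                return false;
            }
        }
        return true;
    }
}
```

With three overlapping operations like the ones in the diagram, a dequeue returning 5 is accepted. If the two enqueues and the dequeue do not overlap at all and the dequeue returns 7, no valid ordering exists and the history is rejected.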

Beyond correctness, we need to characterize how much blocking we’re willing to tolerate, which progress conditions formalize into a precise hierarchy. Wait-freedom is the strongest guarantee: every thread completes its operation in a bounded number of steps, regardless of what other threads do, even if they crash or run arbitrarily slowly. Lock-freedom weakens this slightly: at least one thread always makes progress, though individual threads might starve. Obstruction-freedom weakens further: a thread makes progress if it eventually runs without interference. At the bottom sits traditional blocking synchronization using locks, where threads can wait indefinitely.

This hierarchy isn’t just theoretical taxonomy: it has direct performance implications. Wait-free algorithms never stall on slow threads, making them ideal for real-time systems. Lock-free algorithms avoid deadlock and convoying but may starve individual threads. Blocking algorithms are simpler to write but vulnerable to priority inversion, deadlock, and performance collapse under contention.

The progress conditions form a clear hierarchy from strongest to weakest guarantees:

| Progress Condition | Guarantee | Notes |
| --- | --- | --- |
| Wait-Freedom | Every thread completes in bounded steps | Even if others crash or are slow |
| Lock-Freedom | At least one thread always makes progress | Individual threads may starve |
| Obstruction-Freedom | Thread makes progress if it runs alone | May block under contention |
| Blocking (Locks) | Threads can wait indefinitely | Deadlock, convoying possible |

Each level weakens the guarantee: wait-freedom promises per-thread progress, lock-freedom promises system-wide progress, obstruction-freedom promises progress only when uncontended, and blocking makes no progress guarantees. This hierarchy helps us choose the right progress condition for our use case: real-time systems need wait-freedom, while many high-performance systems can tolerate lock-freedom’s potential starvation.
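A Treiber-style stack is a compact illustration of lock-freedom. In the sketch below (TreiberStack and runDemo are names chosen for this example), a failed CAS always means some other thread's CAS succeeded, so the system makes progress, yet a particular thread can in principle keep losing the race, which is exactly why this is lock-free rather than wait-free.

```java
import java.util.concurrent.atomic.AtomicReference;

class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T value) {
        while (true) {
            Node<T> current = top.get();
            Node<T> node = new Node<>(value, current);
            if (top.compareAndSet(current, node)) {
                return;
            }
            // Someone else pushed or popped first; retry against the new top.
        }
    }

    // Returns null when the stack is empty.
    T pop() {
        while (true) {
            Node<T> current = top.get();
            if (current == null) {
                return null;
            }
            if (top.compareAndSet(current, current.next)) {
                return current.value;
            }
        }
    }

    // Pushes from several threads, then counts how many elements come back out.
    static int runDemo(int threads, int perThread) {
        TreiberStack<Integer> stack = new TreiberStack<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    stack.push(i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        int popped = 0;
        while (stack.pop() != null) {
            popped++;
        }
        return popped;
    }
}
```

No thread ever holds a lock, so a thread that is preempted in the middle of push() cannot prevent others from completing their own operations.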

These definitions, linearizability for correctness and progress conditions for liveness, form the vocabulary that makes rigorous reasoning about concurrent systems possible. Without linearizability, we couldn’t formally state what it means for a concurrent hash table or queue to be “correct.” Without progress conditions, we couldn’t distinguish between a lock-free algorithm that guarantees system-wide progress and a wait-free algorithm that guarantees per-thread progress. More importantly, these definitions set up the critical questions that follow: Can we achieve wait-freedom with just read/write operations? Do different atomic primitives offer different guarantees? The precision of these definitions enables the mathematical proofs and impossibility results that come next.

Armed with these definitions, we can now ask the fundamental question: Are all synchronization primitives equally powerful, or do some offer capabilities that others simply cannot provide? The answer, discovered through the consensus problem, reveals a strict hierarchy that explains why modern processors provide CAS.

Primitive Power Hierarchy

Not all atomic operations are created equal: some primitives are fundamentally more powerful than others, capable of solving problems that weaker primitives cannot. We’ve seen that read/write operations suffice for mutual exclusion through algorithms like Peterson’s and Bakery. We’ve defined correctness through linearizability and characterized blocking through progress conditions. But a critical question remains: Are the primitives we choose merely a matter of convenience and performance, or do they fundamentally determine what’s algorithmically possible? Can we achieve wait-free synchronization with read/write registers alone, or do we need stronger hardware support?

Atomic registers, memory locations supporting atomic read and write operations, form the weakest primitive in our hierarchy and establish the baseline for what can and cannot be achieved. They are more capable than they first appear. Multi-reader, multi-writer registers can be constructed from simpler single-writer registers using timestamps. Atomic snapshots allow multiple registers to be read “simultaneously” in a consistent state, giving a coherent view even while other threads are modifying them. Both constructions prove that certain concurrent abstractions are achievable with patient engineering.
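The simplest snapshot technique is the double collect: read every register, read them all again, and accept the result only if nothing changed in between. The sketch below is a minimal version of that idea under stated assumptions: DoubleCollectSnapshot and Cell are hypothetical names, and a fresh immutable Cell per update replaces explicit timestamps. A scan may retry indefinitely while writers keep updating, so this variant is not wait-free; wait-free constructions add a helping mechanism on top.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

class DoubleCollectSnapshot {
    // Immutable cell: every update installs a fresh object,
    // so reference equality is enough to detect a change.
    private static final class Cell {
        final int value;

        Cell(int value) {
            this.value = value;
        }
    }

    private final AtomicReferenceArray<Cell> registers;

    DoubleCollectSnapshot(int size) {
        registers = new AtomicReferenceArray<>(size);
        for (int i = 0; i < size; i++) {
            registers.set(i, new Cell(0));
        }
    }

    void update(int index, int value) {
        registers.set(index, new Cell(value));
    }

    // Read each register once, one after another (not atomic as a whole).
    private Cell[] collect() {
        Cell[] copy = new Cell[registers.length()];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = registers.get(i);
        }
        return copy;
    }

    // Repeat collects until two consecutive ones are identical.
    int[] scan() {
        Cell[] previous = collect();
        while (true) {
            Cell[] current = collect();
            boolean unchanged = true;
            for (int i = 0; i < current.length; i++) {
                if (previous[i] != current[i]) {
                    unchanged = false;
                    break;
                }
            }
            if (unchanged) {
                int[] values = new int[current.length];
                for (int i = 0; i < values.length; i++) {
                    values[i] = current[i].value;
                }
                return values;
            }
            previous = current;   // something moved; try again
        }
    }
}
```

Two identical collects imply there was an instant between them at which every register held exactly the values read, which is what makes the returned array a consistent snapshot.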

Yet throughout these constructions, a pattern emerges: algorithms using only registers lean on threads helping each other and retrying operations, and the wait-free guarantees they do achieve (as in the snapshot construction) do not extend to problems where threads must agree on a single outcome. The constructions work, but they’re complex, and they hint at fundamental limitations lurking beneath the surface.

The consensus problem and its associated hierarchy make these limitations precise. Consensus is deceptively simple: n threads each propose a value, and they must all agree on one of the proposed values. It’s the atomic commitment problem at the heart of distributed systems, the “all or nothing” decision that underlies everything from database transactions to leader election.

Critically, consensus is different from mutual exclusion. While Lamport’s bakery algorithm demonstrates that mutual exclusion can be solved for n threads using only read/write operations (albeit with blocking), consensus is a fundamentally harder problem. Mutual exclusion ensures only one thread accesses a resource at a time: it’s about exclusion. Consensus requires all threads to agree on a single value: it’s about agreement. More importantly, the consensus number measures a primitive’s ability to solve consensus wait-free, not just solve it with blocking. While read/write operations can achieve mutual exclusion for many threads through blocking algorithms, they cannot achieve wait-free consensus for even two threads.

Here’s a concrete example of the consensus problem:

// Consensus problem: n threads propose values, all must agree on one
// Example scenario:
//   Thread 1 proposes: "Alice"
//   Thread 2 proposes: "Bob"  
//   Thread 3 proposes: "Alice"
// All threads must agree on either "Alice" or "Bob" (one of the proposed values)
// This is harder than mutual exclusion because it requires agreement, not just exclusion

Herlihy’s breakthrough insight was that consensus serves as a measuring stick for primitive power. Every synchronization primitive has a “consensus number”: the maximum number of threads for which it can solve consensus wait-free. Read/write registers have consensus number 1 (they can’t even solve two-thread consensus wait-free). Test-and-set and swap have consensus number 2. Compare-and-swap, along with Load-Linked/Store-Conditional, have consensus number infinity: they can solve consensus for any number of threads.
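To see what consensus number 2 looks like in practice, here is a minimal sketch of two-thread consensus built from test-and-set. The class name `TasConsensus` is hypothetical, and `AtomicBoolean.getAndSet(true)` stands in for a hardware test-and-set instruction. The sketch assumes thread ids are 0 and 1 and that each thread calls `decide` at most once.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Wait-free consensus for exactly two threads, using test-and-set.
final class TasConsensus<T> {
    private final AtomicReferenceArray<T> proposals = new AtomicReferenceArray<>(2);
    private final AtomicBoolean taken = new AtomicBoolean(false);

    // id must be 0 or 1; each thread calls decide at most once
    T decide(int id, T proposed) {
        proposals.set(id, proposed);      // publish our proposal before competing
        if (!taken.getAndSet(true)) {     // previous value false: we got there first
            return proposed;
        }
        return proposals.get(1 - id);     // we lost: adopt the winner's proposal
    }
}
```

The loser knows who won only because there is exactly one other thread. With three threads, a loser cannot tell which of the other two got there first, which is the intuition behind test-and-set stopping at two.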

Here’s how CAS solves consensus, demonstrating its power. The same few lines work for two threads or two thousand:

import java.util.concurrent.atomic.AtomicReference;

class Consensus<T> {
    // null means "undecided", so proposals must be non-null
    private final AtomicReference<T> decision = new AtomicReference<>(null);
    
    /**
     * Solve consensus for any number of threads using CAS.
     * Each thread proposes a value; all agree on the first one to succeed.
     */
    public T decide(T proposed) {
        // Try to set decision to our proposed value
        // Only the first thread succeeds; others see non-null and return that value
        if (decision.compareAndSet(null, proposed)) {
            // We won! Our value is the decision
            return proposed;
        } else {
            // Another thread already decided; we agree with their choice
            return decision.get();
        }
    }
}

// Usage:
// Thread 1: result = consensus.decide("Alice")  // Might return "Alice" or "Bob"
// Thread 2: result = consensus.decide("Bob")     // Returns same value as Thread 1
// Both threads now have the same result - consensus achieved!

This simple implementation shows why CAS has consensus number infinity: it can solve consensus for any number of threads by ensuring only one thread’s proposal wins, and all others agree with that winner.
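A small experiment makes the “any number of threads” claim tangible. The sketch below is illustrative only: the class name `ManyThreadConsensusDemo` and its `run` method are hypothetical. It starts several threads that each propose a different value against one CAS cell and collects every decision they observe.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

final class ManyThreadConsensusDemo {
    // Runs `threads` proposers against one CAS cell and returns the distinct decisions seen
    static Set<String> run(int threads) {
        AtomicReference<String> decision = new AtomicReference<>(null);
        Set<String> observed = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            String proposal = "value-" + t;
            workers[t] = new Thread(() -> {
                decision.compareAndSet(null, proposal);  // only the first CAS succeeds
                observed.add(decision.get());            // everyone reads the winner
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return observed;
    }
}
```

However many threads take part, the returned set holds exactly one value. Which proposal wins depends on scheduling, but every thread reports the same one.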

The consensus number tells us the maximum number of threads for which a primitive can solve the consensus problem wait-free. Primitives with higher consensus numbers are strictly more powerful: they can solve problems that weaker primitives cannot. This isn’t just a performance difference; it’s a fundamental computational limitation.

Here’s a comparison of how different primitives stack up:

| Primitive | Consensus Number | Can Solve Mutual Exclusion? | Wait-Free Consensus? | Notes |
| --- | --- | --- | --- | --- |
| Read/Write | 1 | Yes (blocks) | No | Lamport’s algorithm works but requires blocking/busy-waiting |
| Test-and-Set | 2 | Yes | Yes (2 threads max) | Limited to 2 threads for wait-free consensus |
| CAS | ∞ | Yes | Yes | Universal - can solve wait-free consensus for any number of threads |

The impossibility result is what makes this hierarchy mathematically rigorous rather than empirical observation: you cannot solve wait-free consensus for two or more threads using only read/write registers. This isn’t a statement about clever algorithms we haven’t discovered yet: it’s a fundamental impossibility proven through valency arguments and careful reasoning about execution schedules. No matter how ingenious your algorithm, no matter how many registers you use or how cleverly you structure them, you cannot build a wait-free consensus protocol for two threads with read/write operations alone. This explains why Peterson’s and Bakery algorithms must block: the blocking isn’t a design choice, it’s a mathematical necessity given their primitive operations. If you want wait-free synchronization for multiple threads, you need primitives with higher consensus numbers. The hierarchy isn’t about performance optimization: it’s about what’s computationally possible.

Why hardware designers care: When designing a processor, you face a fundamental question: which atomic instructions should you provide? The consensus hierarchy provides a clear answer: if you want software to be able to build wait-free concurrent algorithms, you must provide primitives with consensus number infinity (like CAS). Without CAS, certain classes of problems are literally impossible to solve wait-free. This isn’t a matter of performance: it’s a matter of computational capability. This is why every modern processor architecture converged on providing CAS or its equivalent: they recognized that weak primitives fundamentally limit what software can achieve.

These insights reframe how we think about hardware synchronization support. Processors don’t provide compare-and-swap just because it’s faster than building complex protocols with reads and writes: they provide it because certain problems are literally impossible to solve wait-free without it. This sets up the final revelation: primitives with infinite consensus numbers aren’t just powerful, they’re universal.

But universality is a bold claim. Does having consensus number infinity mean CAS can solve consensus for many threads, or does it mean something more profound? The universal construction theorem provides the answer: CAS doesn’t just solve consensus: it can implement any concurrent object whatsoever.

Universal Solution

The consensus hierarchy revealed a gap between primitives, but primitives with infinite consensus numbers bridge that gap completely. We know read/write registers cannot solve wait-free consensus for even two threads. We know test-and-set gets us to two threads but no further. We know compare-and-swap has consensus number infinity. The universality result states with mathematical precision what that buys us: objects that solve consensus for n threads are universal for n threads. They can implement any concurrent object that has a sequential specification, wait-free, for those n threads.

The universal construction provides the explicit algorithm: given any sequential specification of an object and a consensus primitive, you can build a wait-free concurrent implementation. The construction is elegant in its directness. Maintain a log of operations applied to the object. When a thread wants to perform an operation, it proposes that operation as the “next” one to apply. Threads use consensus to agree on which operation wins. The winner’s operation gets appended to the log and applied to the object state. All threads can then compute the result by replaying the log. Repeat for the next operation. This isn’t an optimization or a special case: it’s a fully general construction that works for any object you can specify sequentially: queues, stacks, hash tables, counters, priority queues, or objects not yet invented.

The universal construction algorithm proceeds as follows:

Operation proposal: When a thread invokes an operation, it creates an operation descriptor containing the operation type and arguments, then proposes this descriptor as the next entry in the shared operation log.
Consensus decision: All threads concurrently proposing operations participate in a consensus protocol. The consensus primitive guarantees that exactly one proposal wins: this is the operation that will be applied next.
Log append: The winning operation descriptor is atomically appended to the shared log. This log serves as the linearization order: operations appear in the order they were decided by consensus.
State reconstruction: Each thread independently replays the log from the beginning, applying each operation sequentially to reconstruct the current object state. Since all threads see the same log, they compute identical states.
Result computation: Threads compute the operation’s return value by examining the reconstructed state. For read operations, this is straightforward. For write operations, the result may depend on the state after applying the operation.
Completion: The thread returns the computed result. Since consensus is wait-free (each thread completes in bounded steps), and log replay is deterministic, the entire operation completes wait-free.

The key insight is that consensus serializes operations (establishing a total order), while log replay ensures all threads compute consistent results without requiring explicit coordination beyond the consensus protocol itself.

Here’s a simplified example of how the universal construction builds a concurrent queue. The sequential specification is straightforward: a queue supports enqueue(item) and dequeue() operations that follow FIFO order.

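import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicReference;

// Simplified universal construction for a queue.
// The log is a linked list of operations. Each entry carries its own one-shot
// consensus object (a CAS on `next`) that decides which operation follows it.
// There is no helping here, so this sketch is lock-free rather than wait-free.
class UniversalQueue<T> {
    // One log entry: an enqueue (with its item) or a dequeue
    private static final class Operation<T> {
        final boolean isEnqueue;
        final T item;
        // Consensus on the next log entry: the first successful CAS wins
        final AtomicReference<Operation<T>> next = new AtomicReference<>(null);

        Operation(boolean isEnqueue, T item) {
            this.isEnqueue = isEnqueue;
            this.item = item;
        }

        // Apply this operation to a sequential queue and return its result
        T apply(Queue<T> state) {
            if (isEnqueue) {
                state.add(item);
                return null;
            }
            return state.poll();
        }
    }

    // Sentinel at the start of the log; it is never applied
    private final Operation<T> head = new Operation<>(false, null);

    public void enqueue(T item) {
        perform(new Operation<>(true, item));
    }

    public T dequeue() {
        return perform(new Operation<>(false, null));
    }

    private T perform(Operation<T> mine) {
        // Private replica of the queue, rebuilt by replaying the log
        Queue<T> state = new LinkedList<>();
        Operation<T> last = head;
        while (true) {
            // Propose our operation for the next slot; only one proposal wins
            last.next.compareAndSet(null, mine);
            Operation<T> winner = last.next.get();
            // Replay the decided operation on our private state
            T result = winner.apply(state);
            if (winner == mine) {
                return result;  // our operation is in the log; this is its result
            }
            last = winner;      // lost this slot: move on and compete for the next
        }
    }
}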

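This demonstrates the universal construction pattern: operations are proposed, consensus decides the winner, the log grows, and all threads independently compute results. This simplified version has performance limitations (everyone replays the entire log), and it does not by itself guarantee that every thread finishes in a bounded number of steps. The full construction adds a helping mechanism, in which threads announce their pending operations and other threads append them on their behalf, and optimized versions avoid replaying the whole log. The key insight is that consensus, which CAS provides for any number of threads, is what makes a wait-free version possible: with helping in place, every thread completes in bounded steps regardless of others’ behavior.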

Why software engineers benefit: The universal construction provides a systematic recipe for building concurrent objects. Instead of inventing clever tricks for each data structure, you can apply the universal construction to any sequential specification. While optimized implementations often outperform the universal construction, it serves as a correctness proof: if CAS can build it wait-free using the universal construction, then optimized wait-free implementations are possible. This gives you confidence when designing concurrent systems: you know that CAS provides sufficient power to build whatever you need. Real-world systems like Java’s ConcurrentHashMap use sophisticated CAS-based algorithms that outperform the universal construction, but the universality theorem guarantees that such implementations exist.
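The “any sequential specification” part can be sketched generically. The class below is a hypothetical illustration, not the textbook construction: `UniversalObject`, `Entry`, and `apply` are names invented here. It takes a factory for the initial sequential state and accepts operations as functions on that state. It assumes operations are deterministic and touch nothing except the replica they are handed, and like most simplified versions it has no helping, so it is lock-free rather than wait-free.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;
import java.util.function.Supplier;

// Generic log-based construction: S is any sequential (single-threaded) state type.
final class UniversalObject<S> {
    // One log entry: an operation plus a one-shot consensus (CAS) on its successor
    private static final class Entry<S> {
        final Function<S, ?> op;
        final AtomicReference<Entry<S>> next = new AtomicReference<>(null);

        Entry(Function<S, ?> op) {
            this.op = op;
        }
    }

    private final Supplier<S> initialState;
    private final Entry<S> head = new Entry<>(null);  // sentinel, never applied

    UniversalObject(Supplier<S> initialState) {
        this.initialState = initialState;
    }

    // Appends op to the shared log and returns its result.
    // Assumption: op is deterministic and only mutates the replica it receives.
    <R> R apply(Function<S, R> op) {
        Entry<S> mine = new Entry<>(op);
        S replica = initialState.get();   // private copy, rebuilt by replay
        Entry<S> last = head;
        while (true) {
            last.next.compareAndSet(null, mine);  // propose for this slot
            Entry<S> winner = last.next.get();    // the decided entry
            if (winner == mine) {
                return op.apply(replica);         // our turn: compute our result
            }
            winner.op.apply(replica);             // replay someone else's operation
            last = winner;
        }
    }
}
```

With this in hand, a concurrent stack is just `new UniversalObject<>(() -> new ArrayDeque<String>())` with operations such as `s -> s.pop()`, and a counter is the same wrapper around an `int[1]`. Nothing in the wrapper knows which data structure it is serving.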

What makes this truly universal is that it guarantees wait-freedom: every thread completes its operation in a bounded number of steps. No thread waits for locks. No thread spins checking conditions. No thread can be blocked by slower threads or crashed threads. Each thread proposes, participates in consensus, computes the result, and completes, all in predictable, bounded time. This is the theoretical ideal of concurrent programming: the responsiveness of sequential code combined with the scalability of parallel execution. The construction proves that wait-freedom isn’t some unattainable dream requiring clever tricks for each data structure: it’s a systematic consequence of having consensus objects.

This universality extends beyond wait-free algorithms to encompass the entire space of concurrent programming. The construction can implement locks themselves: mutual exclusion becomes just another concurrent object built atop consensus. It can implement semaphores, barriers, read-write locks, any synchronization primitive we’ve discussed or will invent. More subtly, while the universal construction produces wait-free implementations, consensus objects can also be used to build lock-free or even blocking implementations with different performance tradeoffs. The point isn’t that CAS forces you into wait-free algorithms; it’s that CAS gives you the power to choose. With weaker primitives like read/write registers, certain algorithmic approaches are simply impossible. With CAS, every approach becomes possible.

Performance Implications

Understanding when CAS helps versus when it might hurt is crucial for practical system design. CAS excels in low-contention scenarios where threads rarely conflict: operations typically succeed on the first attempt, providing excellent performance without the overhead of lock acquisition. CAS also shines when you need progress guarantees: wait-free and lock-free algorithms built with CAS never deadlock and provide stronger liveness guarantees than traditional locks.

However, CAS has trade-offs. Under high contention, CAS can suffer from cache line bouncing: multiple threads repeatedly modifying the same memory location cause expensive cache coherence traffic. In extreme cases, a simple lock might perform better because it serializes access and reduces cache misses. The retry loops in CAS-based algorithms can also waste CPU cycles when many threads compete, though at least one thread always makes progress (lock-freedom).
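One common way to sidestep cache line bouncing is to stop having every thread update the same location. The sketch below contrasts `AtomicLong`, where all threads hit one shared word, with the JDK’s `LongAdder`, which spreads updates across several internal cells and adds them up on read. The class name `ContendedCounters` and its helper methods are hypothetical; the sketch only shows that both produce the same total and makes no claim about timings on any particular machine.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

final class ContendedCounters {
    // Every increment is an atomic read-modify-write on one shared location
    static long countWithAtomicLong(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                counter.incrementAndGet();
            }
        });
        return counter.get();
    }

    // LongAdder spreads updates over several cells and sums them when read
    static long countWithLongAdder(int threads, int perThread) {
        LongAdder counter = new LongAdder();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        });
        return counter.sum();
    }

    // Start `threads` workers running the same task and wait for all of them
    private static void runAll(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(task);
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

The trade-off is that `LongAdder.sum()` is not an atomic snapshot while updates are in flight, so it suits statistics and metrics better than values that other threads must read exactly.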

The choice between wait-free, lock-free, and blocking approaches depends on your requirements:

Wait-free: Best for real-time systems where every thread must complete in bounded time, even if others crash. Higher overhead but strongest guarantees.
Lock-free: Good for high-performance systems where deadlock is unacceptable but some starvation is tolerable. Better scalability than locks under contention.
Blocking (locks): Simplest to reason about and often fastest under high contention due to reduced cache traffic. Vulnerable to deadlock and priority inversion.

Modern systems often use hybrid approaches: CAS for hot paths with low contention, locks for high-contention scenarios, and sophisticated lock-free data structures (like Java’s ConcurrentHashMap) that combine multiple techniques.

Here’s a practical example: a lock-free counter implemented using CAS. This demonstrates CAS’s power in a simple, concrete form:

import java.util.concurrent.atomic.AtomicInteger;

class LockFreeCounter {
    // CAS-based counter: no locks, lock-free increment
    private AtomicInteger value = new AtomicInteger(0);
    
    /**
     * Increment the counter atomically using CAS.
     * This is lock-free: at least one thread makes progress, but retries may be unbounded.
     */
    public void increment() {
        int current;
        do {
            // Read current value
            current = value.get();
            // Try to update: CAS(current, current+1)
            // If another thread changed value, this fails and we retry
        } while (!value.compareAndSet(current, current + 1));
        // Loop exits when CAS succeeds (we won the race)
    }
    
    /**
     * Get the current counter value.
     * This is a simple read, always wait-free.
     */
    public int get() {
        return value.get();
    }
    
    /**
     * Decrement the counter atomically using CAS.
     * Same pattern as increment: retry until CAS succeeds.
     */
    public void decrement() {
        int current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current - 1));
    }
}

The key pattern is the CAS loop: read the current value, attempt to update it, and retry if another thread modified it in between. This is lock-free (at least one thread makes progress) but not wait-free, as retries are unbounded: a thread could theoretically retry indefinitely if other threads keep modifying the value. Contrast this with Peterson’s algorithm, where a thread that is delayed inside the critical section forces the other to busy-wait. CAS gives us the power to build non-blocking algorithms in which a stalled thread never prevents the others from completing.
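The same loop shape works for any read-modify-write update, not just counting. As a minimal sketch, here is a running maximum, written once by hand and once with `AtomicInteger.accumulateAndGet`, which wraps the same retry loop. The class name `MaxTracker` and its methods are hypothetical names for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class MaxTracker {
    private final AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);

    // Hand-written CAS loop: retry until our update applies or is unnecessary
    void record(int sample) {
        int current;
        do {
            current = max.get();
            if (sample <= current) {
                return;  // the stored maximum is already at least as large
            }
        } while (!max.compareAndSet(current, sample));
    }

    // Same effect using the JDK's built-in retry loop
    void recordWithLibrary(int sample) {
        max.accumulateAndGet(sample, Math::max);
    }

    int get() {
        return max.get();
    }
}
```

The early return matters under contention: a thread whose sample is not a new maximum never issues a CAS at all, so it causes no cache traffic beyond the read.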

The practical impact explains the hardware landscape we inhabit today. Every modern processor architecture, from x86’s CMPXCHG to ARM’s LDREX/STREX to RISC-V’s LR/SC to SPARC’s CAS, provides compare-and-swap or its equivalent precisely because universality isn’t just theoretical elegance: it’s engineering necessity. When designing a processor, you could provide dozens of specialized atomic instructions for different data structures. Or you could provide one universal primitive and let software build everything else. The consensus hierarchy proved that some primitives are fundamentally insufficient. The universality theorem proved that CAS is fundamentally sufficient. This is why CAS became the assembly language of concurrency: not through committee decision or vendor preference, but through mathematical inevitability. If your hardware provides consensus objects, your software can build anything. And that “anything” includes both the sophisticated lock-free algorithms powering high-performance systems and the simple, correct locks that make everyday programming tractable.

Key Takeaways

Before we conclude, let’s summarize the essential insights:

CAS is universal: Any concurrent object that can be specified sequentially can be implemented wait-free using CAS. This isn’t just convenient: it’s mathematically proven.
Consensus number measures primitive power: Every synchronization primitive has a consensus number: the maximum number of threads for which it can solve consensus wait-free. Higher consensus numbers mean strictly more powerful primitives.
Wait-free consensus for 2+ threads is impossible with read/write alone: This impossibility result explains why Peterson’s and Bakery algorithms must block: it’s not a design choice, it’s a mathematical necessity.
Modern processors provide CAS because it’s necessary, not just convenient: Hardware designers recognized that certain problems are literally impossible to solve wait-free without CAS-like primitives. This is why every modern architecture converged on CAS.
The universal construction provides a systematic approach: Rather than inventing clever tricks for each data structure, the universal construction gives us a general recipe for building wait-free concurrent objects from consensus primitives.
Performance trade-offs matter: CAS excels under low contention but can suffer from cache line bouncing under high contention. The choice between wait-free, lock-free, and blocking approaches depends on your specific requirements.

Conclusion

The journey from Peterson’s algorithm to universal constructions isn’t just a historical progression: it’s a logical proof that unfolded over decades. Each step builds on the previous, moving from concrete examples to abstract principles, from intuitive algorithms to mathematical impossibility results, and finally to the profound realization that one primitive can serve as the foundation for all concurrent programming.

This theoretical foundation has profound practical implications. When you reach for a concurrent data structure library like Java’s java.util.concurrent, you’re benefiting from algorithms built on CAS. When you debate lock-free versus locked implementations, you’re weighing trade-offs that the consensus hierarchy makes precise. When you evaluate whether your architecture provides adequate synchronization support, you’re applying insights that explain why every modern processor provides CAS.

For practicing engineers, understanding the consensus hierarchy provides a framework for making informed decisions:

Choose CAS-based algorithms when you need progress guarantees and can tolerate some retry overhead
Understand the limitations of read/write operations: they can solve mutual exclusion but require blocking
Recognize that CAS universality means you can build any concurrent object, but optimized implementations often outperform the universal construction
Appreciate why hardware matters: processors provide CAS not as a convenience, but as a necessity for certain classes of problems

As concurrent systems continue to scale, from multi-core processors to distributed systems spanning continents, the principles established by the consensus hierarchy remain foundational. CAS isn’t just another instruction in the processor’s repertoire. It’s the universal building block that makes modern concurrent systems possible, and understanding why it’s universal helps us build better systems for the future.

Bonus: Implementing a Lock with CAS

To make the universality of CAS concrete, let’s implement a simple spin lock using only CAS operations. This demonstrates how CAS can build the fundamental synchronization primitive, mutual exclusion, that we started with.

A lock needs to track whether it’s currently held. We’ll use an AtomicInteger where 0 means unlocked and 1 means locked. The lock() method must atomically check if the lock is 0 and set it to 1 if so. The unlock() method simply sets it back to 0.

import java.util.concurrent.atomic.AtomicInteger;

public class CASLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = unlocked, 1 = locked
    
    /**
     * Acquire the lock by atomically transitioning from unlocked (0) to locked (1).
     * Spins until successful. This is blocking, not lock-free: if the holder
     * stalls, every waiting thread spins without making progress.
     */
    public void lock() {
        // Keep trying until we successfully change state from 0 to 1
        while (!state.compareAndSet(0, 1)) {
            // Lock is held by another thread - spin (busy-wait)
            // In production, you might add Thread.yield() or exponential backoff
        }
        // We successfully acquired the lock
    }
    
    /**
     * Release the lock by setting state back to unlocked (0).
     * This is wait-free - always completes in one step.
     */
    public void unlock() {
        // Simply set state back to 0
        // No CAS needed - only the lock holder calls unlock()
        state.set(0);
    }
    
    /**
     * Try to acquire the lock without blocking.
     * Returns true if lock was acquired, false otherwise.
     */
    public boolean tryLock() {
        return state.compareAndSet(0, 1);
    }
}

How it works:

The lock() method uses CAS in a retry loop: it attempts to atomically change the state from 0 (unlocked) to 1 (locked). If another thread already holds the lock, the CAS fails (because state is already 1), and the thread retries. Only one thread can successfully transition from 0 to 1, ensuring mutual exclusion.

Why this matters:

This implementation demonstrates CAS’s power in a concrete way. We’ve built mutual exclusion, the problem Peterson’s algorithm solved with read/write operations, using CAS. Unlike Peterson’s algorithm, this lock:

Works for any number of threads (not just two)
Uses a single memory location (not multiple flags and turn variables)
Is simpler to understand and reason about

However, this is a spin lock: threads busy-wait when the lock is held. In practice, production locks combine CAS with OS-level blocking primitives (like futex on Linux) to avoid wasting CPU cycles. But the core mechanism, using CAS to atomically transition between states, remains the same.
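Short of full OS-level blocking, the cheapest improvement is the one hinted at in the `lock()` comment: spin on a plain read and back off after a failed attempt. The sketch below is a hypothetical variant, not a production lock. `BackoffSpinLock`, its delay constants, and the `demo` helper are all assumptions made for illustration, and it uses `LockSupport.parkNanos` for the pause rather than a futex-style wait queue.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

final class BackoffSpinLock {
    private static final long MIN_DELAY_NANOS = 1_000;      // assumed tuning value
    private static final long MAX_DELAY_NANOS = 1_000_000;  // assumed tuning value
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        long delay = MIN_DELAY_NANOS;
        while (true) {
            // Test with a read first so waiters do not keep writing the cache line
            while (held.get()) {
                Thread.yield();
            }
            if (held.compareAndSet(false, true)) {
                return;  // acquired
            }
            // Lost the race to another waiter: pause, then try again
            LockSupport.parkNanos(delay);
            delay = Math.min(delay * 2, MAX_DELAY_NANOS);
        }
    }

    void unlock() {
        held.set(false);
    }

    // Demo helper: several threads increment a plain counter under the lock
    static int demo(int threads, int perThread) {
        BackoffSpinLock lock = new BackoffSpinLock();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter[0];
    }
}
```

The CAS at the centre is unchanged from `CASLock`; only the waiting strategy around it differs, which is the point the paragraph above makes about production locks.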

The deeper insight:

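This lock implementation is blocking, not lock-free: if the thread holding the lock is suspended or crashes, every other thread spins without making progress. What CAS guarantees here is that acquisition is decided atomically: each time the lock is free, exactly one contender wins. Mutual exclusion is inherently a blocking abstraction, so there is no wait-free lock to aim for. The universality theorem instead guarantees that the objects you would protect with a lock can be implemented wait-free directly on top of CAS.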

This simple example illustrates the practical reach of CAS: one atomic instruction yields a lock for any number of threads, and a lock can protect any sequential data structure. Universality is the stronger, non-blocking version of the same claim. The universal construction provides the general recipe; this lock is a concrete, practical example of CAS’s power.

References

Herlihy, M., & Shavit, N. (2012). The Art of Multiprocessor Programming (Revised First Edition). Morgan Kaufmann.
Herlihy, M. (1991). Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(1).
Herlihy, M. (1988). Impossibility and universality results for wait-free synchronization. Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing (PODC).
Java java.util.concurrent package: Real-world implementations of CAS-based concurrent data structures.
Peterson, G. L. (1981). Myths about the mutual exclusion problem. Information Processing Letters, 12(3).
Lamport, L. (1974). A new solution of Dijkstra’s concurrent programming problem. Communications of the ACM, 17(8).