Beyond Epoll: Is io_uring Ready for Production?

A technical breakdown of Linux event loops, comparing readiness-based epoll with completion-based io_uring for high-performance networking.

Lenn Voss

Cloud & Infrastructure Writer · Jun 21, 2026 · 6 min read

For more than two decades, epoll has been the backbone of high-performance networking on Linux. Introduced in Linux 2.5.44 in 2002, it rescued servers from the $O(N)$ scaling of select and poll, which rescan every watched descriptor on each call, by introducing an event-notification mechanism whose per-event cost is $O(1)$ in the number of watched descriptors. But as network interfaces pushed past 100 Gbps and storage devices reached microsecond latencies, the architectural cracks in epoll began to show.

Enter io_uring, introduced in Linux 5.1 in 2019. Rather than merely notifying an application when an I/O channel is ready, io_uring shifts the paradigm entirely to a completion model, executing the actual I/O operations asynchronously within the kernel.

While some developers claim io_uring is a drop-in performance savior, others report negligible gains, and security teams have occasionally disabled it entirely. Deciding whether to stick with the battle-tested epoll or migrate to the modern io_uring requires understanding the deep architectural trade-offs between readiness and completion.

The Architectural Divide: Readiness vs. Completion

The fundamental difference between these two subsystems lies in who does the heavy lifting of executing the I/O operation.

epoll is a reactor (readiness-based) model. The kernel tracks interest in file descriptors using a red-black tree (rbr) inside a struct eventpoll. When a socket becomes readable, a virtual file system (VFS) wait-queue callback fires, placing the corresponding epitem onto a doubly-linked ready list (rdllist). The application calls epoll_wait to drain this ready list.

Here is the catch: epoll only tells you that I/O is possible. The application must still issue the actual read(2), write(2), or accept(2) system calls to move the data.

sequenceDiagram
    autonumber
    participant App as Userspace App
    participant Kernel as Linux Kernel
    
    Note over App, Kernel: The epoll (Reactor) Model
    App->>Kernel: epoll_wait()
    Kernel-->>App: Event Ready (FD is readable)
    App->>Kernel: read(FD, buffer)
    Kernel-->>App: Data copied to buffer
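
The two-step pattern in the diagram can be sketched in C as follows. This is a minimal illustration rather than a production event loop: `wait_then_read` and `reactor_demo` are hypothetical helper names, a pipe stands in for a socket, and the one-second timeout is an arbitrary choice.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait until `fd` is readable, then read from it.
 * Returns the number of bytes read, or -1 on error or timeout. */
ssize_t wait_then_read(int fd, char *buf, size_t len)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    /* Step 1: readiness. The kernel only reports that fd is readable. */
    struct epoll_event ready;
    int n = epoll_wait(epfd, &ready, 1, 1000 /* timeout in ms */);
    close(epfd);
    if (n != 1)
        return -1;

    /* Step 2: the application moves the data itself with a second syscall. */
    return read(ready.data.fd, buf, len);
}

/* Demo: write 5 bytes into a pipe, then recover them through the helper. */
ssize_t reactor_demo(void)
{
    int p[2];
    char buf[16];

    if (pipe(p) < 0)
        return -1;
    if (write(p[1], "hello", 5) != 5) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ssize_t got = wait_then_read(p[0], buf, sizeof buf);
    close(p[0]);
    close(p[1]);
    return got;
}
```

A real server would create the epoll instance once and keep it for the life of the loop; it is created per call here only to keep the sketch short.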

io_uring is a proactor (completion-based) model. It bypasses the readiness loop entirely. Instead of asking "is this socket ready to be read?" the application tells the kernel "read from this socket into this buffer, and let me know when you are done."

This is achieved through two ring buffers shared directly between userspace and the kernel: a Submission Queue (SQ) and a Completion Queue (CQ).

sequenceDiagram
    autonumber
    participant App as Userspace App
    participant Kernel as Linux Kernel
    
    Note over App, Kernel: The io_uring (Proactor) Model
    App->>Kernel: Write read request to SQ
    App->>Kernel: io_uring_enter() (Submit)
    Note over Kernel: Kernel performs read asynchronously
    Kernel-->>App: Write completion to CQ

By mapping these ring buffers into the application's address space via mmap, the application can queue multiple asynchronous operations—such as IORING_OP_SEND, IORING_OP_RECV, or IORING_OP_READ—without crossing the kernel boundary for each individual request.
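
To make the batching idea concrete, here is a single-threaded toy model of the two rings. It is not the kernel interface: every name (`toy_ring`, `toy_queue_read`, `toy_enter`, `toy_reap`) is hypothetical, the "kernel" is an ordinary function, and the memory barriers a real shared ring needs are omitted. It shows only that queuing is a plain memory write and that one simulated crossing can consume a whole batch.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Toy stand-ins for SQEs and CQEs; the real structs carry far more fields. */
struct toy_sqe { int fd; uint64_t user_data; };
struct toy_cqe { int res; uint64_t user_data; };

struct toy_ring {
    struct toy_sqe sq[RING_ENTRIES];
    struct toy_cqe cq[RING_ENTRIES];
    uint32_t sq_head, sq_tail;          /* consumer: "kernel", producer: app */
    uint32_t cq_head, cq_tail;          /* consumer: app, producer: "kernel" */
    unsigned crossings;                 /* simulated user/kernel transitions */
};

/* App side: queue a request in shared memory. No "syscall" happens here. */
int toy_queue_read(struct toy_ring *r, int fd, uint64_t user_data)
{
    if (r->sq_tail - r->sq_head == RING_ENTRIES)
        return -1;                      /* submission queue is full */
    r->sq[r->sq_tail & RING_MASK] = (struct toy_sqe){ fd, user_data };
    r->sq_tail++;
    return 0;
}

/* Simulated io_uring_enter(): one crossing consumes queued SQEs and posts
 * one CQE per request, stopping early if the completion queue is full. */
unsigned toy_enter(struct toy_ring *r)
{
    unsigned done = 0;
    r->crossings++;
    while (r->sq_head != r->sq_tail &&
           r->cq_tail - r->cq_head < RING_ENTRIES) {
        struct toy_sqe *sqe = &r->sq[r->sq_head++ & RING_MASK];
        r->cq[r->cq_tail++ & RING_MASK] =
            (struct toy_cqe){ .res = 0, .user_data = sqe->user_data };
        done++;
    }
    return done;
}

/* App side: take one completion off the CQ. Returns 1 if one was available. */
int toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->cq_head == r->cq_tail)
        return 0;
    *out = r->cq[r->cq_head++ & RING_MASK];
    return 1;
}

/* Demo: four reads are queued, then submitted with a single crossing. */
unsigned toy_demo_crossings(void)
{
    struct toy_ring r = {0};
    for (uint64_t i = 0; i < 4; i++)
        toy_queue_read(&r, 3, i);
    unsigned completed = toy_enter(&r);
    assert(completed == 4);
    return r.crossings;
}
```

The `user_data` field is the part worth noticing: completions are matched to requests by that tag, which is also how the real interface correlates a CQE with the SQE that produced it.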

The Cost of Crossing the Boundary

To understand why this matters, look at the math of system calls. Every time an application crosses the user-to-kernel boundary, it pays a performance tax. On modern processors, CPU vulnerability mitigations (like KPTI and Retpolines) have dramatically increased this overhead.

With mitigations enabled, a single system call round trip (strictly a mode switch rather than a full context switch) costs roughly 700 ns. Even with mitigations disabled, it costs about 230 ns.

Consider a highly optimized network proxy handling 100,000 requests per second. Under an epoll architecture, a single request cycle typically requires:

1. epoll_wait to block and wait for events.
2. recv to pull data from the socket.
3. send to dispatch the response.
4. epoll_ctl to re-arm the file descriptor (if using EPOLLONESHOT for thread safety).

That is four system calls per request, or 400,000 system calls per second. At 700 ns each, that is 0.28 seconds of transition overhead per second of wall-clock time: about 28% of one CPU core spent simply crossing the boundary.
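
The arithmetic is easy to re-run for other workloads. The sketch below uses a hypothetical helper, `boundary_core_fraction`, and takes the figures quoted above as inputs; it models only the transition cost, not the work done inside each call.

```c
/* Fraction of one CPU core spent on user/kernel transitions.
 * ns_per_syscall is the per-call transition cost, e.g. 700 or 230. */
double boundary_core_fraction(double requests_per_sec,
                              double syscalls_per_request,
                              double ns_per_syscall)
{
    double overhead_ns_per_sec =
        requests_per_sec * syscalls_per_request * ns_per_syscall;
    return overhead_ns_per_sec / 1e9;   /* 1e9 ns in one second */
}
```

With 100,000 requests per second and four calls per request, the function gives 0.28 at 700 ns per call and 0.092 at 230 ns, so mitigations roughly triple the share of a core lost to the boundary.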

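io_uring mitigates this through batching and shared memory. The application can write dozens of Submission Queue Entries (SQEs) into the SQ and make a single io_uring_enter call that both submits them and waits for completions, which it then reads from the CQ. If $s$ is the cost of one user/kernel transition and $w$ is the kernel work for one operation, a syscall-per-operation design pays $s + w$ per operation. With a batch of $n$ operations the transition is paid once, so the per-operation cost becomes $w + (s + t)/n$, where $t$ is the fixed io_uring framework overhead per submission. The kernel work $w$ does not shrink; only the transition overhead is amortized.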
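
A quick way to see how fast the $(s + t)/n$ term falls is to evaluate it for a few batch sizes. In this sketch `amortized_boundary_ns` is a hypothetical helper, $s$ is the 700 ns figure quoted earlier, and the 100 ns used for $t$ in the note below is an assumed value chosen only for illustration.

```c
#include <assert.h>

/* Per-operation share of the transition cost when n operations are
 * submitted by one io_uring_enter() call: (s + t) / n. */
double amortized_boundary_ns(double s_ns, double t_ns, unsigned n)
{
    assert(n > 0);                      /* a batch has at least one entry */
    return (s_ns + t_ns) / n;
}
```

With $s = 700$ and an assumed $t = 100$, a batch of one costs 800 ns of overhead per operation, while a batch of 32 costs 25 ns. Most of the saving arrives in the first few doublings of the batch size.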

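To eliminate submission syscalls altogether, io_uring offers IORING_SETUP_SQPOLL. This flag starts a dedicated kernel thread that continuously polls the SQ for new entries. The thread occupies a CPU core while it spins, but it lets the application perform high-throughput I/O with zero system calls in the steady state. If the SQ stays empty for longer than the configured idle period (sq_thread_idle), the thread goes to sleep and the application must wake it with an io_uring_enter call.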
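
Requesting the polling thread is a matter of filling in the setup parameters. The sketch below only prepares a `struct io_uring_params` from the kernel's `<linux/io_uring.h>` header and does not create a ring; `sqpoll_params` is a hypothetical helper, and the result would be handed to io_uring_setup(2) or to liburing's `io_uring_queue_init_params()`.

```c
#include <linux/io_uring.h>
#include <string.h>

/* Build setup parameters that ask for a submission-queue polling thread.
 * idle_ms is how long the thread keeps spinning on an empty SQ before it
 * goes to sleep (the sq_thread_idle field, in milliseconds). */
struct io_uring_params sqpoll_params(unsigned idle_ms)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof p);            /* unused fields must be zero */
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;
    return p;
}
```

The idle period is the main tuning knob: a short value gives the core back quickly on a quiet service, at the price of a wake-up syscall when traffic resumes.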

Developer Angle: Real-World Trade-Offs and Caveats

If io_uring is so fast, why hasn't everyone migrated? The reality of systems programming is that there is no free lunch. Switching to a completion-based model introduces several severe engineering challenges.

1. The Memory Footprint Problem

In an epoll loop, you only allocate read buffers when a socket is ready to be read. If you have 20,000 idle connections, they consume almost no buffer memory.

In io_uring, because you submit the read request before any data has arrived, you must allocate a buffer up front and dedicate it to that specific SQE. If you have 20,000 outstanding read requests, you must keep 20,000 buffers reserved in RAM, whether or not the connections ever send anything. This can lead to significant memory bloat.
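
The size of the problem depends on the per-request buffer size, which the scenario above does not specify. The sketch below uses a hypothetical helper, `reserved_buffer_bytes`, and the 16 KiB buffer in the note that follows is an assumption for illustration.

```c
#include <stdint.h>

/* Memory held by buffers that are attached to in-flight read requests. */
uint64_t reserved_buffer_bytes(uint64_t outstanding_reads,
                               uint64_t buffer_bytes)
{
    return outstanding_reads * buffer_bytes;
}
```

For 20,000 outstanding reads with an assumed 16 KiB buffer each, the result is 327,680,000 bytes, or 312.5 MiB, reserved before a single byte has arrived.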

To address this, newer kernels added multishot variants of accept (Linux 5.19) and recv (Linux 6.0), along with provided-buffer rings. A single multishot SQE keeps posting completions as connections or data arrive, and each completion draws a buffer from a shared pool at the moment data is available, so idle connections no longer hold buffers.

2. Concurrency and Memory Barriers

Writing a raw io_uring implementation without helper libraries is notoriously difficult. Because userspace and the kernel read and write the shared ring buffers concurrently, developers must use memory barriers (acquire and release ordering on the head and tail indices) so that neither the compiler nor the CPU reorders the stores that fill an entry with the store that publishes it.

The indexing has its own subtlety. Head and tail are free-running 32-bit counters that are never reset; the slot for an entry is the counter masked with ring_mask (where ring_mask = ring_entries - 1). This works because ring sizes are powers of two, and it keeps indexing correct even when the counters wrap around. To avoid these complexities, developers should almost always use liburing, a helper library that wraps the low-level ring buffers and memory barriers in clean APIs like io_uring_get_sqe() and io_uring_prep_read().
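
The masking trick is small enough to show in isolation. This sketch covers only the index arithmetic, not the memory ordering; `ring_slot` and `ring_pending` are hypothetical helper names, and the ring size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Map a free-running 32-bit counter to a slot in a power-of-two ring. */
uint32_t ring_slot(uint32_t counter, uint32_t ring_entries)
{
    /* The mask only works when ring_entries is a power of two. */
    assert(ring_entries != 0 && (ring_entries & (ring_entries - 1)) == 0);
    uint32_t ring_mask = ring_entries - 1;
    return counter & ring_mask;
}

/* Number of entries currently queued between head and tail. Unsigned
 * subtraction gives the right answer even after tail has wrapped. */
uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

In an 8-entry ring, counter value 8 maps back to slot 0, and so does the value that follows `UINT32_MAX`, because the counter wraps to zero. No branch or modulo instruction is needed in either case.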

3. File I/O Limitations of Epoll

One of the biggest wins for io_uring is regular file I/O. epoll does not work for disk files: poll(2) and select(2) always report a regular file as readable and writable, because a disk file is technically always "ready," and epoll_ctl refuses to register one at all, failing with EPERM. Readiness therefore says nothing useful, and the subsequent read() call can still block the thread while the kernel fetches data that is not in the page cache. io_uring handles disk reads asynchronously, making it a major upgrade for databases and storage engines.
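
The limitation is easy to observe directly: on Linux, registering a regular file with epoll fails with EPERM, as documented in epoll_ctl(2). The sketch below is a small probe, with `epoll_add_regular_file_errno` as a hypothetical helper name and a temporary file from `tmpfile()` standing in for any on-disk file.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Try to register a regular file with epoll.
 * Returns 0 if the kernel accepted it, the errno from epoll_ctl if it
 * refused, or -1 if the setup itself failed. */
int epoll_add_regular_file_errno(void)
{
    FILE *tmp = tmpfile();              /* anonymous regular file */
    if (tmp == NULL)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        fclose(tmp);
        return -1;
    }

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = fileno(tmp);

    int err = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fileno(tmp), &ev) < 0)
        err = errno;                    /* EPERM for regular files */

    close(epfd);
    fclose(tmp);
    return err;
}
```

This is why event loops built on epoll typically push file reads onto a thread pool rather than into the loop itself.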

4. Security and Tooling

Because io_uring lets an application perform the equivalent of many different system calls through a single entry point (io_uring_enter), the individual operations are invisible to tools that filter or audit by system call, such as seccomp profiles. The subsystem has also been a frequent source of kernel vulnerabilities, leading some organizations (including Google) to restrict or disable it in sensitive environments.

Furthermore, syscall-oriented debugging tools such as strace see only io_uring_setup and io_uring_enter, not the operations queued in the rings, and perf profiles can be harder to read because some operations run on kernel worker threads rather than in the submitting thread. Getting accurate numbers usually requires newer tooling, such as BPF-based tracing of io_uring_enter and the kernel's io_uring tracepoints.

When to Port and When to Stay

Migrating a mature codebase from epoll to io_uring is rarely a simple refactoring exercise; it often requires a complete architectural rewrite.

| Feature / Metric | `epoll` | `io_uring` |
| --- | --- | --- |
| Model | Reactor (readiness) | Proactor (completion) |
| Syscall overhead | High (1+ per operation) | Low (batching, or 0 with SQPOLL) |
| File I/O support | No (regular files are not pollable; reads can block) | Yes (asynchronous reads and writes) |
| Memory efficiency | High (on-demand buffering) | Medium (requires buffer pools/multishot) |
| Security profile | Mature, narrow API surface | Active attack surface; restricted by some hosts |
| Minimum kernel | Linux 2.5.44+ | Linux 5.1+ (5.10+ recommended for stability) |
| Complexity | Moderate | High (requires `liburing` or barrier management) |

Keep `epoll` if:

- You have a mature, highly optimized event loop deeply integrated into an existing codebase (like Nginx, HAProxy, or Redis) and have not measured a clear system call bottleneck.
- Your application runs on older enterprise kernels (pre-5.10) or in containerized environments where io_uring might be disabled by security policies.
- Your service manages tens of thousands of mostly idle connections where memory footprint is a primary constraint.

Migrate to `io_uring` if:

- You are building a greenfield, high-performance network server targeting modern Linux kernels (5.10+, or 6.0+ for multishot features).
- Your application performs a mix of network I/O and disk I/O (such as web caches, databases, or storage engines).
- You are using a high-performance network framework like µWebSockets (via its uSockets backend) that has already done the heavy lifting of integrating io_uring and has reported large throughput gains.

Ultimately, io_uring is not just an optimization; it is the future of Linux systems programming. While epoll remains a reliable tool for standard web applications, io_uring is the correct default for any new system pushing the absolute limits of modern hardware.
