Beyond Epoll: Is io_uring Ready for Production?
A technical breakdown of Linux event loops, comparing readiness-based epoll with completion-based io_uring for high-performance networking.
For nearly two decades, epoll has been the undisputed backbone of high-performance networking on Linux. Introduced in Linux 2.5.44 back in 2002, it rescued systems from the $O(N)$ scalability limits of select and poll by introducing an $O(1)$ event-notification mechanism. But as network interfaces pushed past 100 Gbps and storage devices achieved microsecond latencies, the architectural cracks in epoll began to show.
Enter io_uring, introduced in Linux 5.1 in 2019. Rather than merely notifying an application when an I/O channel is ready, io_uring shifts the paradigm entirely to a completion model, executing the actual I/O operations asynchronously within the kernel.
While some developers claim io_uring is a drop-in performance savior, others report negligible gains, and security teams have occasionally disabled it entirely. Deciding whether to stick with the battle-tested epoll or migrate to the modern io_uring requires understanding the deep architectural trade-offs between readiness and completion.
The Architectural Divide: Readiness vs. Completion
The fundamental difference between these two subsystems lies in who does the heavy lifting of executing the I/O operation.
epoll is a reactor (readiness-based) model. The kernel tracks interest in file descriptors using a red-black tree (rbr) inside a struct eventpoll. When a socket becomes readable, a virtual file system (VFS) wait-queue callback fires, placing the corresponding epitem onto a doubly-linked ready list (rdllist). The application calls epoll_wait to drain this ready list.
Here is the catch: epoll only tells you that I/O is possible. The application must still issue the actual read(2), write(2), or accept(2) system calls to move the data.
sequenceDiagram
autonumber
participant App as Userspace App
participant Kernel as Linux Kernel
Note over App, Kernel: The epoll (Reactor) Model
App->>Kernel: epoll_wait()
Kernel-->>App: Event Ready (FD is readable)
App->>Kernel: read(FD, buffer)
Kernel-->>App: Data copied to buffer
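The two-step flow in the diagram can be sketched with Python's standard select.epoll wrapper, which sits directly on top of the same kernel interface. This is a minimal illustration that assumes a Linux host; the socketpair stands in for a real client connection, and the function name wait_then_read is made up for this example.

```python
import select
import socket


def wait_then_read(sock: socket.socket, timeout: float = 1.0) -> bytes:
    """Wait for readiness with epoll, then issue the read ourselves."""
    ep = select.epoll()
    try:
        ep.register(sock.fileno(), select.EPOLLIN)
        # Step 1: epoll_wait() only reports that the descriptor is readable.
        for fd, mask in ep.poll(timeout):
            if fd == sock.fileno() and mask & select.EPOLLIN:
                # Step 2: a second system call actually moves the data.
                return sock.recv(4096)
        return b""  # timed out without a readiness event
    finally:
        ep.close()


# A connected pair of sockets stands in for a client and a server.
client, server = socket.socketpair()
client.sendall(b"ping")
received = wait_then_read(server)
client.close()
server.close()
```

Note that the data transfer takes two kernel crossings: one to learn that the socket is readable and one to read it.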
io_uring is a proactor (completion-based) model. It bypasses the readiness loop entirely. Instead of asking "is this socket ready to be read?" the application tells the kernel "read from this socket into this buffer, and let me know when you are done."
This is achieved through two ring buffers shared directly between userspace and the kernel: a Submission Queue (SQ) and a Completion Queue (CQ).
sequenceDiagram
autonumber
participant App as Userspace App
participant Kernel as Linux Kernel
Note over App, Kernel: The io_uring (Proactor) Model
App->>Kernel: Write read request to SQ
App->>Kernel: io_uring_enter() (Submit)
Note over Kernel: Kernel performs read asynchronously
Kernel-->>App: Write completion to CQ
By mapping these ring buffers into the application's address space via mmap, the application can queue multiple asynchronous operations—such as IORING_OP_SEND, IORING_OP_RECV, or IORING_OP_READ—without crossing the kernel boundary for each individual request.
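To make the queue-then-submit pattern concrete, here is a toy in-process model of a submission and completion queue pair. It is only a sketch: ToyRing, Sqe, and Cqe are hypothetical names, no real io_uring calls are made, and the "kernel" is simulated by a method that drains the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sqe:
    opcode: str     # a label such as "RECV"; not a real kernel constant
    user_data: int  # caller-chosen tag echoed back in the completion


@dataclass
class Cqe:
    user_data: int
    res: int        # result code, analogous to a syscall return value


class ToyRing:
    """In-process stand-in for a submission/completion queue pair."""

    def __init__(self):
        self.sq = deque()
        self.cq = deque()
        self.enter_calls = 0  # counts simulated kernel crossings

    def queue(self, sqe):
        # A plain memory write into the shared ring: no kernel crossing.
        self.sq.append(sqe)

    def enter(self):
        # One simulated io_uring_enter(): the "kernel" drains the whole SQ
        # and posts one completion per submitted entry.
        self.enter_calls += 1
        submitted = len(self.sq)
        while self.sq:
            sqe = self.sq.popleft()
            self.cq.append(Cqe(user_data=sqe.user_data, res=0))
        return submitted


ring = ToyRing()
for tag in range(8):
    ring.queue(Sqe("RECV", tag))
submitted = ring.enter()
tags = [cqe.user_data for cqe in ring.cq]
```

Eight requests are queued and submitted with a single simulated crossing. The toy completes entries in submission order; a real ring makes no such promise, which is why each completion carries the user_data tag.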
The Cost of Crossing the Boundary
To understand why this matters, look at the math of system calls. Every time an application crosses the user-to-kernel boundary, it pays a performance tax. On modern processors, CPU vulnerability mitigations (like KPTI and Retpolines) have dramatically increased this overhead.
With mitigations enabled, a single system call's round trip across the boundary costs roughly 700ns. Even with mitigations disabled, it costs about 230ns. The exact figures vary with the CPU and kernel version.
Consider a highly optimized network proxy handling 100,000 requests per second. Under an epoll architecture, a single request cycle typically requires:
- epoll_wait to block and wait for events.
- recv to pull data from the socket.
- send to dispatch the response.
- epoll_ctl to re-arm the file descriptor (if using EPOLLONESHOT for thread safety).
That is four system calls per request, translating to 400,000 system calls per second. At 700ns per call, that adds up to 0.28 seconds of every second, so the proxy spends roughly 28% of one CPU core simply crossing the boundary.
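The arithmetic behind that estimate is easy to check. The sketch below uses only the figures quoted in the text (100,000 requests per second, four system calls per request, 700ns or 230ns per call); the function name boundary_fraction is chosen for this example.

```python
REQUESTS_PER_SECOND = 100_000
SYSCALLS_PER_REQUEST = 4  # epoll_wait, recv, send, epoll_ctl
NS_PER_SECOND = 1_000_000_000


def boundary_fraction(cost_ns: float) -> float:
    """Fraction of one core-second spent entering and leaving the kernel."""
    syscalls_per_second = REQUESTS_PER_SECOND * SYSCALLS_PER_REQUEST
    return syscalls_per_second * cost_ns / NS_PER_SECOND


with_mitigations = boundary_fraction(700)     # 0.28 of one core
without_mitigations = boundary_fraction(230)  # 0.092 of one core
```

The result is a share of a single core, so a multi-threaded proxy spreads the same cost across however many cores run the event loops.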
io_uring mitigates this through batching and shared memory. The application can write dozens of Submission Queue Entries (SQEs) into the SQ and make a single io_uring_enter call to submit them all and reap completed operations from the CQ. The average cost per I/O operation drops from $s + w$ (where $s$ is the system call overhead and $w$ is the kernel work) to $(s + t)/n + w$, where $t$ is the framework overhead per batch and $n$ is the batch size. The kernel work is still done for every operation, but the boundary cost is shared across the whole batch.
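A few sample values show how quickly the amortized boundary overhead, $(s + t)/n$, shrinks with batch size. The 700ns value for $s$ comes from the text; the 100ns value for $t$ is a made-up placeholder, not a measurement. The kernel work $w$ for each operation is unaffected by batching, so it is left out of the calculation.

```python
def overhead_per_op(s_ns: float, t_ns: float, n: int) -> float:
    """Amortized boundary overhead per operation: (s + t) / n."""
    if n < 1:
        raise ValueError("batch size must be at least 1")
    return (s_ns + t_ns) / n


S_NS = 700.0  # syscall cost from the text (mitigations enabled)
T_NS = 100.0  # hypothetical per-batch framework overhead

table = {n: overhead_per_op(S_NS, T_NS, n) for n in (1, 8, 32)}
# {1: 800.0, 8: 100.0, 32: 25.0}
```

With a batch of one, the overhead is slightly worse than a plain system call because of the added framework cost, which is one reason small or unbatched workloads may see little benefit.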
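For near-zero-syscall execution, io_uring offers IORING_SETUP_SQPOLL. This flag spins up a dedicated kernel thread that continuously polls the SQ for new entries. While this burns a dedicated CPU core, it allows the application to perform high-throughput I/O without system calls in the steady state. The poller thread goes to sleep after a configurable idle period (sq_thread_idle), after which the application must wake it with an io_uring_enter call, so the zero-syscall path only holds while submissions keep arriving.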
Developer Angle: Real-World Trade-Offs and Caveats
If io_uring is so fast, why hasn't everyone migrated? The reality of systems programming is that there is no free lunch. Switching to a completion-based model introduces several severe engineering challenges.
1. The Memory Footprint Problem
In an epoll loop, you only allocate read buffers when a socket is ready to be read. If you have 20,000 idle connections, they consume almost no buffer memory.
In io_uring, because you submit the read request before the data arrives, you must pre-allocate and dedicate a buffer to that specific SQE. If you have 20,000 outstanding read requests, you must keep 20,000 buffers allocated and reserved for the kernel to fill. This can lead to massive memory bloat.
To address this, newer kernels added multishot modes for accept (Linux 5.19) and recv (Linux 6.0), along with provided-buffer rings for buffer pooling. A single SQE can then post a completion each time data arrives, with the kernel picking a buffer from a shared pool only at that moment, which reduces memory pressure.
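A rough calculation shows the scale of the difference. The connection count comes from the text; the 16 KiB buffer size and the 256-buffer pool are assumptions picked for illustration, and real values depend on the workload.

```python
BUFFER_BYTES = 16 * 1024  # assumed per-read buffer size
CONNECTIONS = 20_000      # outstanding reads, from the text
POOL_BUFFERS = 256        # assumed size of a shared buffer pool


def mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


# One dedicated buffer per outstanding read request.
per_request_mib = mib(CONNECTIONS * BUFFER_BYTES)  # 312.5

# Buffers drawn from a shared pool only when data arrives.
pooled_mib = mib(POOL_BUFFERS * BUFFER_BYTES)      # 4.0
```

The pool must still be large enough to cover the number of connections that receive data at the same time, so the right size is a tuning question rather than a fixed number.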
2. Concurrency and Memory Barriers
Writing a raw io_uring implementation without helper libraries is notoriously difficult. Because userspace and the kernel read and write to the shared ring buffers simultaneously, developers must deal with memory barriers to ensure the compiler and CPU do not reorder instructions.
For example, the head and tail indices are free-running 32-bit counters. Because the ring size is always a power of two, the slot for an entry is found by masking the counter with ring_mask (where ring_mask = ring_entries - 1) rather than with a modulo operation, and the result stays correct when the counters wrap around. To avoid these complexities, developers should almost always use liburing, a helper library that abstracts the low-level ring buffers and memory barriers into clean APIs like io_uring_get_sqe() and io_uring_prep_read().
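The index arithmetic can be simulated without touching the kernel. The sketch below models only the masking and wraparound behavior of a power-of-two ring; it does not model memory barriers, and the names slot and in_flight are invented for this example.

```python
RING_ENTRIES = 8              # ring sizes are powers of two
RING_MASK = RING_ENTRIES - 1  # 0b111
U32 = 0xFFFFFFFF              # head and tail behave as 32-bit counters


def slot(counter: int) -> int:
    """Map a free-running counter to an array index."""
    return counter & RING_MASK


def in_flight(head: int, tail: int) -> int:
    """Entries between head and tail, correct across 32-bit wraparound."""
    return (tail - head) & U32


# Start the counters just below the 32-bit limit to exercise the wrap.
head = 0xFFFFFFFE
tail = head
slots = []
for _ in range(4):
    slots.append(slot(tail))
    tail = (tail + 1) & U32  # emulate unsigned 32-bit increment

pending = in_flight(head, tail)
# slots == [6, 7, 0, 1], pending == 4
```

The slot sequence continues smoothly from 7 to 0 even though the counter itself wraps from 0xFFFFFFFF to 0, which is the property the mask provides.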
3. File I/O Limitations of Epoll
One of the biggest wins for io_uring is regular file I/O. epoll does not work with disk files at all: epoll_ctl rejects a regular file descriptor with EPERM, and the older poll and select interfaces always report regular files as readable and writable because disk files are technically always "ready." The subsequent read() call can still block the thread while the data is fetched from disk if it is not already in the page cache. io_uring handles disk reads truly asynchronously, making it a massive upgrade for databases and storage engines.
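One concrete symptom of this limitation is easy to reproduce: on Linux, trying to add a regular file to an epoll instance fails with EPERM. The snippet below is a small check that assumes a Linux host; epoll_accepts is a name made up for this example.

```python
import errno
import os
import select
import tempfile


def epoll_accepts(fd: int) -> bool:
    """Return True if epoll_ctl(EPOLL_CTL_ADD) accepts this descriptor."""
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)
        return True
    except OSError as exc:
        if exc.errno == errno.EPERM:
            return False  # the file type does not support polling
        raise
    finally:
        ep.close()


# A regular file on disk is refused.
with tempfile.TemporaryFile() as f:
    regular_file_ok = epoll_accepts(f.fileno())

# A pipe, by contrast, is a pollable descriptor.
read_end, write_end = os.pipe()
pipe_ok = epoll_accepts(read_end)
os.close(read_end)
os.close(write_end)
```

Event loops built on epoll therefore tend to hand file reads to a thread pool, which is the workaround io_uring makes unnecessary.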
4. Security and Tooling
Because io_uring allows applications to execute the equivalent of a wide variety of system calls asynchronously through a single entry point (io_uring_enter), the individual operations are invisible to tools that filter or audit at the system call level, such as seccomp filters. It has also been a frequent source of kernel-level vulnerabilities, leading some organizations (including Google) to restrict or disable its use in highly sensitive environments.
Furthermore, standard debugging and profiling tools like strace and perf struggle with io_uring's asynchronous execution model. Profiling requires updated tooling and BPF-based tracing of io_uring_enter to get accurate performance metrics.
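Before relying on io_uring in a deployment, it can help to check whether the host restricts it. Recent kernels (6.6 and later, to the best of my knowledge) expose a kernel.io_uring_disabled sysctl; the sketch below reads it and treats a missing file as "unknown". The function name io_uring_policy and the wording of the descriptions are this example's own.

```python
from pathlib import Path
from typing import Optional

SYSCTL_PATH = Path("/proc/sys/kernel/io_uring_disabled")

# Descriptions paraphrase the kernel's sysctl documentation.
MEANINGS = {
    0: "enabled for all processes",
    1: "disabled for unprivileged processes outside the allowed group",
    2: "disabled for all processes",
}


def io_uring_policy(path: Path = SYSCTL_PATH) -> Optional[str]:
    """Describe the host policy, or None if the sysctl is absent or unreadable."""
    try:
        value = int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # older kernel, restricted /proc, or unexpected content
    return MEANINGS.get(value, f"unrecognized value {value}")


policy = io_uring_policy()
```

A None result does not prove io_uring is usable: a seccomp filter applied by a container runtime would not show up here, so attempting to create a ring at startup and falling back to epoll on failure remains the more reliable test.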
When to Port and When to Stay
Migrating a mature codebase from epoll to io_uring is rarely a simple refactoring exercise; it often requires a complete architectural rewrite.
| Feature / Metric | epoll | io_uring |
|---|---|---|
| Model | Reactor (Readiness) | Proactor (Completion) |
| Syscall Overhead | High (1+ per operation) | Low (Batching or 0 with SQPOLL) |
| File I/O Support | No (regular files cannot be polled; reads can block) | Yes (Fully asynchronous) |
| Memory Efficiency | High (On-demand buffering) | Medium (Requires buffer pools/multishot) |
| Security Profile | Mature and highly restricted | Active attack surface; restricted by some hosts |
| Minimum Kernel | Linux 2.5.44+ | Linux 5.1+ (5.10+ recommended for stability) |
| Complexity | Moderate | High (Requires liburing or barrier management) |
Keep epoll if:
- You have a mature, highly optimized event loop deeply integrated into an existing codebase (like Nginx, HAProxy, or Redis) and have not measured a clear system call bottleneck.
- Your application runs on older enterprise kernels (pre-5.10) or in containerized environments where io_uring might be disabled by security policies.
- Your service manages tens of thousands of mostly idle connections where memory footprint is a primary constraint.
Migrate to io_uring if:
- You are building a greenfield, high-performance network server targeting modern Linux kernels (5.10+ or 6.0+ for multishot features).
- Your application performs a mix of network I/O and disk I/O (such as web caches, databases, or storage engines).
- You are using high-performance network frameworks like µWebSockets (via its uSockets backend), which have already done the heavy lifting of integrating io_uring and demonstrated massive throughput gains.
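The kernel-version part of these checklists can be turned into a small startup check. The thresholds (5.10 and 6.0) are the ones named above; the function names and the returned labels are hypothetical, and the version string alone cannot account for distribution backports or security policies.

```python
import platform
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a release string like '6.8.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def suggested_backend(release: str) -> str:
    """Map a kernel release to the backend tier suggested in the text."""
    version = kernel_version(release)
    if version >= (6, 0):
        return "io_uring with multishot"
    if version >= (5, 10):
        return "io_uring"
    return "epoll"


# platform.release() returns the running kernel's release string on Linux.
current_release = platform.release()
examples = {
    r: suggested_backend(r)
    for r in ("5.4.0-150-generic", "5.15.0", "6.8.0-45-generic")
}
```

In practice a version check like this is best treated as a first filter, followed by an actual attempt to set up a ring.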
Ultimately, io_uring is not just an optimization; it is the future of Linux systems programming. While epoll remains a reliable tool for standard web applications, io_uring is the correct default for any new system pushing the absolute limits of modern hardware.
Sources & further reading
- Epoll vs. io_uring in Linux — sibexi.co
- Notes on epoll and io_uring — iafisher.com
- asynchronous - Is epoll a better API than io_uring? - Stack Overflow — stackoverflow.com
- io_uring vs. epoll – Which Is Better in Network Programming? - Alibaba Cloud Community — alibabacloud.com
- io_uring vs epoll - Linux Kernel Internals — kernel-internals.org
Lenn writes about cloud platforms, Kubernetes internals, and the infrastructure decisions that quietly make or break engineering organizations. Based in Berlin's vibrant tech scene, they have a talent for turning dense platform-engineering topics into prose that people actually finish reading.