Here are the benefits of, and some caveats to, running data-path applications in user space, with a look at Linux's UIO framework.
Editor's Note: this article was first published in the International Journal of Information and Education Technology.
Traditionally, packet-processing or data-path applications in Linux have run in the kernel space due to the infrastructure provided by the Linux network stack. Frameworks such as netdevice drivers and netfilters have provided means for applications to directly hook into the packet-processing path within the kernel.
However, a shift toward running data-path applications in the user-space context is now occurring. The Linux user space provides several advantages for applications, including more robust and flexible process management, a standardized system-call interface, simpler resource management, and a large number of libraries for tasks such as XML and regular-expression parsing. It also makes applications more straightforward to debug by providing memory isolation and independent restart. In addition, while kernel-space applications need to conform to General Public License guidelines, user-space applications are not bound by such restrictions.
User-space data-path processing comes with its own overheads. Since the network drivers run in kernel context and use kernel-space memory for packet storage, there is an overhead of copying the packet data from user-space to kernel-space memory and vice versa. User/kernel-mode transitions also usually impose a considerable performance overhead, which conflicts with the low-latency and high-throughput requirements of data-path applications.
In the rest of this article, we shall explore an alternative approach to reduce these overheads for user-space data-path applications.
Mapping memory to user space
As an alternative to the traditional I/O model, the Linux kernel provides user-space applications with a means to map memory available to the kernel directly into a user-space address range. In the context of device drivers, this can give user-space applications direct access to device memory, including configuration registers and I/O descriptors. Every access by the application to the assigned address range ends up directly accessing the device memory.
Several Linux system calls allow this kind of memory mapping, the simplest being the mmap() call. The mmap() call allows the user application to map a physical device address range one page at a time or a contiguous range of physical memory in multiples of page size.
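As a rough illustration of mapping in page-size multiples, the sketch below uses Python's mmap module on an ordinary temporary file as a stand-in for a device region. A real driver would open a device node instead. The helper names (`round_up_to_pages`, `map_region`, `demo`) are hypothetical and chosen only for this example.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE  # mapping granularity reported by the OS


def round_up_to_pages(nbytes):
    """Round a byte count up to a whole number of pages."""
    return ((nbytes + PAGE - 1) // PAGE) * PAGE


def map_region(fd, nbytes, page_index=0):
    """Map `nbytes` (rounded up to whole pages) starting at page `page_index`."""
    return mmap.mmap(fd, round_up_to_pages(nbytes),
                     flags=mmap.MAP_SHARED,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE,
                     offset=page_index * PAGE)


def demo():
    # Assumption: a two-page temporary file stands in for device memory.
    with tempfile.TemporaryFile() as f:
        f.truncate(2 * PAGE)
        region = map_region(f.fileno(), 100, page_index=1)
        region[0:4] = b"\x01\x02\x03\x04"   # store through the mapping
        region.flush()
        f.seek(PAGE)
        data = f.read(4)                     # the same bytes, seen via the file
        region.close()
        return data
```

Requesting 100 bytes still maps a full page, because the kernel maps memory only in whole pages, and the offset must itself be a multiple of the page size.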
Other Linux system calls that avoid copies include vmsplice(), which maps user-space pages into a pipe (a kernel buffer) so that data can be passed without copying, and tee(), which duplicates data between two pipes without bringing it into user space.
The translation from user-space virtual addresses to physical memory is typically cached in the processor's translation lookaside buffer (TLB). The number of TLB entries in a given processor is limited, so the Linux kernel uses them as a cache. The size of the memory region mapped by each entry is typically restricted to the minimum page size supported by the processor, which is commonly 4 kilobytes.
Linux maps the kernel memory using a small set of TLB entries that are fixed at initialization time. For user-space applications, however, the number of TLB entries is limited and each TLB miss can result in a performance hit. To avoid such penalties, Linux provides the concept of Huge-TLB, which allows user-space applications to map pages larger than the default minimum page size of 4KB. This mapping can be used not only for application data but for the text segment as well.
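To make the benefit concrete, the short calculation below counts how many page mappings are needed to cover a buffer pool at two page sizes. The 16 MiB pool and the 2 MiB huge-page size are assumptions picked for illustration. The huge-page sizes actually available depend on the processor.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_entries_needed(region_bytes, page_bytes):
    """Number of page mappings needed to cover a region (ceiling division)."""
    return -(-region_bytes // page_bytes)


pool = 16 * MIB                                   # hypothetical packet-buffer pool
small_pages = tlb_entries_needed(pool, 4 * KIB)   # 4096 mappings
huge_pages = tlb_entries_needed(pool, 2 * MIB)    # 8 mappings
```

With far fewer mappings to hold, the same pool is much less likely to evict TLB entries during packet processing.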
Several efficient zero-copy mechanisms between user space and kernel space have been developed in Linux, based on memory mapping and other techniques. Data-path applications can use these while continuing to leverage the existing kernel-space network-driver implementation. However, these mechanisms still consume CPU cycles, and the per-packet processing cost remains moderately high. Direct access to the hardware from user space eliminates the need for any mechanism to transfer packets back and forth between user space and kernel space, thus reducing the per-packet processing cost.
Linux provides a standard UIO (User I/O) framework for developing user-space-based device drivers. The UIO framework defines a small kernel-space component that performs two key tasks:
a. Indicate device memory regions to user space.
b. Register for device interrupts and provide interrupt indication to user space.
The kernel-space UIO component then exposes the device through a device node such as /dev/uioX and a set of sysfs entries under /sys/class/uio/uioX. The user-space component searches for these entries, reads the device address ranges, and maps them to user-space memory.
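The discovery step can be sketched as follows, assuming the standard UIO sysfs layout in which each memory region appears as /sys/class/uio/uioX/maps/mapN with `addr`, `size`, and `offset` files holding hexadecimal values. The function names are hypothetical, and the `sysfs_root` parameter exists only so the sketch can be tried against a test directory.

```python
import mmap
import os


def read_uio_maps(uio_name, sysfs_root="/sys/class/uio"):
    """Return one dict per memory map of a UIO device, ordered by map index."""
    maps_dir = os.path.join(sysfs_root, uio_name, "maps")
    result = []
    for entry in os.listdir(maps_dir):
        if not entry.startswith("map"):
            continue
        info = {"index": int(entry[3:])}
        for attr in ("addr", "size", "offset"):
            with open(os.path.join(maps_dir, entry, attr)) as f:
                info[attr] = int(f.read().strip(), 16)   # values look like 0x1000
        result.append(info)
    return sorted(result, key=lambda m: m["index"])


def mmap_offset_for(map_index):
    """UIO selects mapping N by passing N pages as the mmap() offset."""
    return map_index * mmap.PAGESIZE
```

An application would then open the device node and call mmap() once per region, passing the region's `size` as the length and `mmap_offset_for(index)` as the offset.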
The user-space component can perform all device-management tasks, including I/O from the device. For interrupts, however, it needs to perform a blocking read() on the device entry, which results in the kernel component putting the user-space application to sleep and waking it up once an interrupt is received.
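A minimal sketch of that blocking read is shown below. It relies on the UIO convention that a read() of exactly four bytes returns the device's total interrupt count as a 32-bit integer. The function name is hypothetical.

```python
import os
import struct


def wait_for_interrupt(fd):
    """Block until the UIO device signals an interrupt; return the event count."""
    data = os.read(fd, 4)                 # sleeps until an interrupt arrives
    if len(data) != 4:
        raise OSError("short read from UIO device")
    (count,) = struct.unpack("=i", data)  # native byte order, signed 32-bit
    return count
```

With a real device, `fd` would come from something like `os.open("/dev/uio0", os.O_RDWR)`, where the node name is only an example. Comparing successive counts lets the application notice interrupts it missed while busy.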
User-space network drivers
The memory required by a network device driver can be of three types:
a. Configuration space: this refers to the common configuration registers of the device.
b. I/O descriptor space: this refers to the descriptors used to exchange data with the device.
c. I/O data space: this refers to the actual I/O data accessed from the device.
Taking the case of a typical Ethernet device, the above can refer to the common device configuration (including MAC configuration), buffer-descriptor rings, and packet data buffers.
In the case of kernel-space network drivers, all three regions are mapped to kernel space, and any access to them from user space is typically abstracted through ioctl() or write() calls, with a copy of the data provided to the user-space application.
User-space network drivers, on the other hand, map all three regions directly to user-space memory. This allows the user-space application to drive the buffer descriptor rings directly from user space. Data buffers can be managed and accessed directly by the application without the overhead of a copy.
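The idea of driving a descriptor ring from user space can be sketched with a simulated ring held in a bytearray. The descriptor layout used here (16-bit status, 16-bit length, 32-bit buffer address, big-endian) and the `EMPTY` flag value are hypothetical placeholders, not taken from any controller's reference manual. In a real driver, `memory` would be the mmap()ed descriptor region.

```python
import struct

BD = struct.Struct(">HHI")   # hypothetical layout: status, length, buffer address
EMPTY = 0x8000               # hypothetical flag: descriptor is owned by hardware


class RxRing:
    def __init__(self, memory, count):
        self.mem = memory        # bytearray or mmap object holding the descriptors
        self.count = count
        self.next = 0            # next descriptor the software will inspect

    def give_to_hardware(self, index, buf_addr):
        """Mark a descriptor empty so the device may fill its buffer."""
        BD.pack_into(self.mem, index * BD.size, EMPTY, 0, buf_addr)

    def poll(self):
        """Return (index, buffer address, length) of one frame, or None."""
        status, length, addr = BD.unpack_from(self.mem, self.next * BD.size)
        if status & EMPTY:
            return None          # hardware has not filled this descriptor yet
        index = self.next
        self.next = (self.next + 1) % self.count
        return index, addr, length
```

After processing a frame, the application would call `give_to_hardware(index, addr)` to recycle the buffer. No system call is involved in either direction.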
Consider a user-space network driver for the eTSEC Ethernet controller on a Freescale QorIQ P1020 platform. The configuration space is a single 4 KB region, aligned to a page boundary, which contains all the device-specific registers, including controller settings, MAC settings, and interrupts. In addition, the MDIO region needs to be mapped to allow configuration of the Ethernet PHY devices. The eTSEC provides up to eight individual buffer descriptor rings, each of which is mapped onto a separate memory region to allow simultaneous access by multiple applications. The data buffers referenced by the descriptor rings are allocated from a single contiguous memory block, which is allocated and mapped to user space at initialization time.
Constraints of user-space drivers
Direct access to network devices brings its own set of complications for user-space applications, which were previously hidden behind several layers of the kernel stack and system calls:
a. Sharing a single network device across multiple applications.
b. Blocking access to network data.
c. Lack of network stack services like TCP/IP.
d. Memory management for packet buffers.
e. Resource management across application restarts.
f. Lack of a standardized driver interface for applications.
Sharing a network device
Unlike the Linux socket layer, which allows multiple applications to open sockets (TCP, UDP, or raw IP), user-space network drivers allow only a single application to access the data from an interface. However, most network interfaces nowadays provide multiple buffer descriptor rings in both the receive and transmit directions, along with some kind of hardware classification mechanism to divert incoming traffic to these rings. Such a mechanism can be used to map individual buffer descriptor rings to different applications, although this limits the number of applications on a single interface to the number of rings supported by the hardware. An alternative is to develop a dispatcher framework on top of the user-space driver to serve multiple applications.
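One possible shape for such a dispatcher is sketched below: frames taken from a single ring are classified in software and queued per application. The class, its methods, and the classifier are hypothetical. The classifier assumes untagged Ethernet frames carrying IPv4 without options and UDP, so the destination port sits at byte offset 36.

```python
from collections import deque


class Dispatcher:
    """Fan frames from one ring out to several registered applications."""

    def __init__(self):
        self.queues = {}          # classification key -> per-application queue

    def register(self, key):
        """Create and return the queue for an application interested in `key`."""
        queue = deque()
        self.queues[key] = queue
        return queue

    def dispatch(self, frame, classify):
        """Queue `frame` for the application registered under classify(frame)."""
        queue = self.queues.get(classify(frame))
        if queue is None:
            return False          # no taker: the caller may drop the frame
        queue.append(frame)
        return True


def udp_dst_port(frame):
    """Assumes 14-byte Ethernet + 20-byte IPv4 headers, then the UDP header."""
    return int.from_bytes(frame[36:38], "big")
```

A software dispatcher removes the ring-count limit, at the price of a classification step per packet.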
Blocking access to data
Unlike traditional socket-based access, which allows a user-space application to block until data is available on the socket or to call poll() to wait on multiple inputs, an application using a user-space driver has to constantly poll the buffer descriptor ring for an indication of incoming data. This can be addressed by a blocking read() call on the UIO device entry, which allows the application to block on receive interrupts from the Ethernet device. It also gives the application the freedom to choose when it is notified of interrupts: instead of being interrupted for each packet, it can implement a polling mechanism that consumes a certain number of buffer descriptor entries before returning to other processing tasks. When all buffer descriptor entries are consumed, the application can again perform a read() to block until further data arrives.
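The block-then-poll pattern might look like the sketch below. All callables passed in (`wait_for_interrupt`, `poll_one`, `handle`, `other_work`) are hypothetical hooks supplied by the application. `poll_one` is assumed to return None when the ring is empty, and the `wakeups` bound exists only so the loop terminates in an example.

```python
def poll_burst(poll_one, handle, budget):
    """Process up to `budget` frames; report whether the ring ran empty."""
    handled = 0
    while handled < budget:
        frame = poll_one()
        if frame is None:
            return handled, True     # drained: safe to block on read() again
        handle(frame)
        handled += 1
    return handled, False            # budget used up: more frames may be pending


def receive_loop(wait_for_interrupt, poll_one, handle, other_work,
                 budget=32, wakeups=1):
    """Block for an interrupt, then alternate bounded bursts with other work."""
    for _ in range(wakeups):
        wait_for_interrupt()         # for example, a blocking read() on the UIO node
        drained = False
        while not drained:
            _, drained = poll_burst(poll_one, handle, budget)
            other_work()
```

The budget bounds how long other tasks are kept waiting under load, while blocking only on an empty ring avoids one interrupt per packet.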
Lack of network-stack services
The Linux network stack and socket interface also abstract basic networking services, such as route lookup and ARP (Address Resolution Protocol), from applications. In the absence of such services, the application has to either run its own equivalent of a network stack or maintain a local copy of the kernel's routing and neighbor databases.
Memory management for buffers
The user-space application also needs to deal with the buffers provided to the network device for storing and retrieving data. Besides allocating and freeing these buffers, it needs to translate each buffer's user-space virtual address to a physical address before handing it to the device. Doing this translation for each buffer at run time can be very costly. Also, since the number of TLB entries in the processor is limited, performance may suffer. The alternative is to use Huge-TLB to allocate a single large chunk of memory and carve the data buffers out of it.
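When every buffer comes from one physically contiguous chunk, translation reduces to adding an offset, as the sketch below shows. The class and its fields are hypothetical. The base addresses would be obtained once at initialization, and physical contiguity of the region is assumed.

```python
class BufferPool:
    """Fixed-size packet buffers carved from one physically contiguous region."""

    def __init__(self, virt_base, phys_base, region_size, buf_size):
        self.virt_base = virt_base
        self.phys_base = phys_base
        # Free list of buffer offsets within the region.
        self.free = [i * buf_size for i in range(region_size // buf_size)]

    def alloc(self):
        """Return the virtual address of a free buffer, or None if exhausted."""
        if not self.free:
            return None
        return self.virt_base + self.free.pop()

    def release(self, virt_addr):
        """Return a buffer to the pool."""
        self.free.append(virt_addr - self.virt_base)

    def virt_to_phys(self, virt_addr):
        """Constant-time translation: the offset is the same from either base."""
        return self.phys_base + (virt_addr - self.virt_base)
```

The result of `virt_to_phys()` is what the application would write into a buffer descriptor, with no per-buffer page-table lookup.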
Resource management across restarts
The application is responsible for allocating and managing device resources and the current state of the device. If the application crashes or is restarted without being given a chance to clean up, the device may be left in an inconsistent state. One way to resolve this is to have the kernel-space UIO component keep track of the application's process state and, on restart, reset the device and any memory mappings created by the application.
Standardized user interface
The current generation of user-space network drivers provides a low-level application programming interface (API) that is often very specific to the device implementation, rather than conforming to a standard system-call API such as recv(). Device-specific APIs mean that the application needs to be ported to each specific network device.
Freedom with limitations
While the UIO framework gives user-space applications the freedom of direct access to network devices, it brings its own share of limitations in terms of sharing across applications and of resource and memory management. The current generation of user-space network drivers works well in a constrained use case in which a single application is tightly coupled to a network device. However, further work on such drivers must address some of these limitations.
Note: this article was first published in the International Journal of Information and Education Technology: Hemant Agrawal and Ravi Malhotra, Member, IACSIT, “Device Drivers in User Space: A Case for Network Device Driver,” International Journal of Information and Education Technology, vol. 2, no. 5, pp. 461-463, 2012.
Hemant Agrawal is a software architect for the Networking Processor Division of Freescale, working on the QorIQ product line. He holds a bachelor's degree in electrical engineering from the Institute of Technology, BHU, India.
Ravi Malhotra is a software architect for the Networking Processor Division of Freescale, working on the QorIQ product line. He holds a bachelor's degree in electrical engineering from the Institute of Technology, BHU, India.