Linux device driver design
This tutorial presents the author's practical experience with writing Linux device drivers to control custom-designed hardware. The tutorial starts by providing an overview of the driver writing process, and describes several example drivers provided with this tutorial. The reader is encouraged to experiment with those example drivers on their own x86 system, as this provides the best learning experience.
The ability of a user-space process to transfer data from multiple PCI boards is contingent on the implementation of both the hardware and driver. The requirements of both the hardware and software are presented.
The drivers in this tutorial are written for the Linux 2.6 kernel. The drivers have been built against 2.6.9-11 (CentOS 4.1), 2.6.13, and 2.6.14 for x86 and PowerPC targets. Details that are clearly described in the book 'Linux Device Drivers', by Corbet, Rubini, and Kroah-Hartman, are not repeated in this tutorial, so the reader is encouraged to obtain a copy.
The Linux 2.6 kernel presents a number of generalized interfaces that the driver writer must first understand, and then implement for their specific driver. The best way to understand the interfaces is to write simple drivers that exercise a subset of the kernel driver interfaces. The following sections describe the interfaces used to implement character device drivers.
The file simple_module.c implements a very basic kernel module. A device driver is a kernel module, but kernel modules are also used to add features to the kernel that have nothing to do with device drivers. Welcome to your first generalized kernel interface.
The basic requirement of a kernel module is that it implements an initialization function and an exit function. Those two functions are identified by the macros module_init() and module_exit(). The example also shows how to pass load-time parameters to the module, and how to set up logging in a module.
The code sets up two logging macros: LOG_ERROR() and LOG_DEBUG(). The debug macro can be removed from the code at compile time (by not defining DEBUG), or can be compiled into the code and then enabled or disabled via the load-time parameter simple_debug. This method of adding log messages to code is easier to maintain (eg. disable) than a series of printk() calls littered throughout the code.
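The two-level switch described above can be tried out in user-space. The following is a minimal sketch, not the code from simple_module.c: fprintf() stands in for printk(), simple_debug is an ordinary variable rather than a module parameter, and log_emit() and log_count are hypothetical helpers added only so the behavior can be observed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define DEBUG 1              /* delete this line to compile LOG_DEBUG() out */

static int simple_debug = 0; /* run-time switch; a module parameter in the driver */
static int log_count = 0;    /* demo only: number of messages actually emitted */

/* Stand-in for printk(): print one prefixed line to stderr. */
static void log_emit(const char *prefix, const char *fmt, ...)
{
    va_list ap;

    assert(prefix != NULL && fmt != NULL);
    fputs(prefix, stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    log_count++;
}

/* Always emitted, regardless of DEBUG or simple_debug. */
#define LOG_ERROR(...) log_emit("simple: ", __VA_ARGS__)

#ifdef DEBUG
/* Compiled in, but only emitted when the run-time switch is set. */
#define LOG_DEBUG(...) \
    do { if (simple_debug) log_emit("simple (debug): ", __VA_ARGS__); } while (0)
#else
/* Compiled out: the call sites cost nothing. */
#define LOG_DEBUG(...) do { } while (0)
#endif
```

Calling LOG_DEBUG("value = %d", 42) with simple_debug set to zero emits nothing, while LOG_ERROR() always emits a line.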
The following shows the driver usage; the // marks are comments, while the $ (user) and # (root) prompts show the commands you enter (bash shell syntax).
So with the load-time parameter simple_debug set to zero, the LOG_DEBUG() message does not appear in the output. The module load and unload messages are generated using the LOG_ERROR() macro so that they are always generated.
The file simple_driver.c implements a simple device driver. What makes it a device driver, and not just a kernel module? In simple_init the driver requests a range of major and minor numbers (the numbers used to represent device nodes in /dev), then allocates memory for an array of device-specific simple_device_t structures, and then registers the character device member, cdev, of each structure in the array with the kernel.
Registration of the character device requires a set of file operations, i.e., a kernel-level implementation of the functions that get called when user-space makes system calls, eg. open(), read(), write(), ioctl(), lseek(), select(), and mmap(). The file operations are stored as function pointers in a struct file_operations; if this code were written in C++, then this structure would be the base class, and your implementation of its functions would be a derived class.
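The base-class analogy can be illustrated in plain user-space C. The sketch below is a cut-down imitation, not the kernel's struct file_operations: the names mini_file_ops, mem_dev, mem_fops, and call_write are hypothetical, and the function signatures are simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down analogue of struct file_operations. */
struct mini_file_ops {
    int  (*open)(void *dev);
    long (*read)(void *dev, char *buf, size_t count);
    long (*write)(void *dev, const char *buf, size_t count);
};

/* One possible "derived class": a device backed by a small memory buffer. */
struct mem_dev {
    char   data[32];
    size_t len;
};

static int mem_open(void *dev)
{
    assert(dev != NULL);
    return 0;
}

static long mem_write(void *dev, const char *buf, size_t count)
{
    struct mem_dev *d = dev;

    if (count > sizeof(d->data))
        count = sizeof(d->data);      /* short write, as a driver may do */
    memcpy(d->data, buf, count);
    d->len = count;
    return (long)count;
}

static long mem_read(void *dev, char *buf, size_t count)
{
    struct mem_dev *d = dev;

    if (count > d->len)
        count = d->len;
    memcpy(buf, d->data, count);
    return (long)count;
}

/* The driver fills in the table once, at registration time. */
static const struct mini_file_ops mem_fops = {
    .open  = mem_open,
    .read  = mem_read,
    .write = mem_write,
};

/* "Kernel side": dispatch through the table without knowing the device type. */
static long call_write(const struct mini_file_ops *ops, void *dev,
                       const char *buf, size_t count)
{
    if (ops == NULL || ops->write == NULL)
        return -EINVAL;               /* operation not implemented */
    return ops->write(dev, buf, count);
}
```

The dispatcher only sees the table of pointers, which is why a driver can leave out an operation it does not support.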
The file simple_driver_test.c is a user-space application that tests the functions of the driver. Install the module, type ls /dev/simple* and once you see device nodes there, run the test. After the test finishes, type dmesg to see the kernel-level messages triggered by the user-space test. Remove the driver, and reinstall it with load-time parameters, eg.
This creates three devices, each responsible for three minor numbers (functions on the device). ls -al /dev/simple* will show the multiple devices created (and their major/minor numbers).
How did the device nodes magically appear in /dev? That's next.
Hotplug, sysfs, and udev
The simple driver initialization code, simple_init, also performs another step: it creates a kernel object, class_simple or class depending on the kernel version, that creates entries in the sys file system, sysfs, in the directory /sys/class. Creating the class object in the initialization code creates the entry /sys/class/simple_driver. Devices managed by the driver are then added to the class object (see the code), creating the device nodes under /sys/class/simple_driver, eg. if no load-time parameters are specified, the driver creates one device, and the node /sys/class/simple_driver/simple_a0 is created.
Why create these class and device 'objects'? The Linux 2.6 kernel supports the concept of hot-pluggable devices, i.e., devices that can be plugged in while the system is turned on, eg. a USB camera. In older Linux systems, if you plugged in a camera, you'd have to look at the output of dmesg to see what the camera was detected as (if at all), and then try and figure out how to get images off the camera!
The Linux 2.6 system generates 'hot-plug' events every time a kernel object is created and destroyed, and these hotplug events trigger the execution of scripts in user-space. The (appropriately written) scripts then automatically populate the /dev entries for a device. A nice feature of these scripts is that you can decide what name to give the device, eg., a camera detected as a USB mass-storage device might be detected as /dev/sda1 in a non-hotplug system, but with hotplug you can set up the camera name to be /dev/camera, much nicer!
The automatic creation of /dev entries relies on three related pieces of infrastructure: hotplug, sysfs, and udev. The man page, man udev, gives details on how the scripts can be set up to create the /dev entries with specific permissions, and how to map a kernel name (eg. that used when the device was added to the class object in simple_init) to a user-space defined name.
On CentOS 4.1, the udev configuration files are kept in /etc/udev/. The line udev_log=no in /etc/udev/udev.conf can be changed to udev_log=yes, and hotplug events will then be written to the system log. For example, as root type tail -f /var/log/messages, and then from another terminal install simple_driver.ko, and you will see the logging of the hotplug events.
The default name given to a single device created by the simple driver is /dev/simple_a0. With no udev scripts in place, the device node is created for use by root only, and is named identically to the string used in simple_init. The permissions on the device node can be changed by creating a udev script containing a single line:
This changes the ownership of all nodes matching the pattern simple_* to owner dwh, group mm, with permissions 0660. The name of the device entry can be changed, or a symbolic link to a device entry can be created, by adding another script, eg. the following creates a symbolic link to the first device entry:
The udev man page gives more details on the options for device naming (eg. a user-supplied program can be run to generate the device name). The automatic creation of /dev entries helps reduce the contents of /dev to just those devices installed. It also provides flexibility to user-space in the naming of device nodes.
For example, in the case of PCI devices it allows the PCI location, eg. bus:dev.fn, to be remapped into a meaningful slot number, eg. instead of a device named /dev/board_00:0c.0, the user-space name can be mapped to /dev/board2.
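The remapping from a PCI location to a slot-based name is the kind of job a small user-supplied naming helper could do. The following is a sketch under stated assumptions: the table contents, the "board" prefix handling, and the function names are hypothetical, chosen only to reproduce the 00:0c.0 to board2 example above; the real mapping depends on the backplane.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical table: PCI device number -> physical slot number. */
struct slot_entry { unsigned int pci_dev; int slot; };

static const struct slot_entry slot_table[] = {
    { 0x0c, 2 },
    { 0x0d, 3 },
};

/* Parse "bus:dev.fn" (hexadecimal fields) and return the slot, or -1. */
static int pci_location_to_slot(const char *location)
{
    unsigned int bus, dev, fn;
    size_t i;

    assert(location != NULL);
    if (sscanf(location, "%x:%x.%x", &bus, &dev, &fn) != 3)
        return -1;                    /* not a bus:dev.fn string */
    for (i = 0; i < sizeof(slot_table) / sizeof(slot_table[0]); i++) {
        if (slot_table[i].pci_dev == dev)
            return slot_table[i].slot;
    }
    return -1;                        /* device number not in the table */
}

/* Build the user-space node name, e.g. "board2"; returns 0 on success. */
static int make_node_name(const char *location, char *name, size_t size)
{
    int slot = pci_location_to_slot(location);

    if (slot < 0)
        return -1;
    return snprintf(name, size, "board%d", slot) < (int)size ? 0 : -1;
}
```

A helper like this would print the generated name for udev to pick up; how it is hooked into the udev configuration is described in the udev man page and is not shown here.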
The class_simple interface, as described in the Linux Device Drivers book, was removed from the kernel in 2.6.13 (according to the ChangeLog for that kernel), and the API changed again slightly. The parallel port user-space driver, ppdev.c, is a nice small (easily understandable) driver that uses the class interface. A diff of different kernel versions of this driver can be used to determine the usage of any API changes (eg. whether a new argument can be assigned NULL).
The driver simple_timer.c implements a single device that uses two different kernel mechanisms for delaying the calls read(), write(), and select(). The test program simple_timer_test.c tests the driver. The driver demonstrates the usage of timers and events.
The driver simple_irq.c implements a single device that uses the parallel port on an x86 PC. To test this driver, you might need to first remove the printer driver and parallel port driver, i.e., modprobe -r lp, modprobe -r parport_pc. The driver creates a kernel timer that fires every second.
The timer handler writes a low and then high to all the data lines on the parallel port. If a data line, one of pins 2 through 9, is jumpered to the interrupt line, pin 10, then an IRQ will be generated every second. The IRQ handler unblocks a blocked read(), write(), or select().
If a data line is not jumpered to the IRQ line, then the blocked calls will time out (after 2s) and continue anyway. The test program simple_irq_test.c tests the driver. The driver demonstrates the usage of timers, IRQs, and events with timeouts.
The driver simple_buffer.c implements a single device that also uses the parallel port on an x86 PC (so you will need to remove simple_irq to test it). This driver is similar to simple_irq.c with the change that IRQs write a time-stamp to an internal buffer, user-space write() writes to that buffer, and read() reads from the buffer. The following are some tests that can be performed using standard command-line tools:
1) Connect the parallel port IRQ to a data line. Install the driver with insmod simple_buffer.ko. Once the /dev/simple node is valid, type cat /dev/simple. A UTC timestamp will be printed every second.
2) Remove the parallel port jumper. Remove the driver. Install the driver and disable the timer and timeout as follows:
On one terminal type cat /dev/simple; on another type echo "Hello" > /dev/simple. (You can also leave the timer enabled and it will just write messages to the log file.)
3) Combine the first two tests (remove and re-install the driver without any load-time parameters); the IRQ will add a complete timestamp message every second, while write will add a complete string (whenever the user triggers a write). No messages will be interrupted, since each procedure locks the internal buffer.
The test shows that the driver works as one would expect; however, take a look at the source for the details. The internal buffer is a resource that is shared between read() (eg. one process), write() (eg. another process), and the IRQ handler (interrupt context).
The driver uses a spin-lock to protect access to the buffer (and its associated buffer count and pointers). Without this protection, an IRQ could interrupt a write, and insert a timestamp into the middle of the string echoed into the driver. Of course in a real driver, the results could be more disastrous.
If the resource (buffer) being protected by the driver were only ever accessed by processes, then a semaphore could be used to protect it. Semaphores can be used to block a process, causing it to sleep while waiting for a resource. Spin-locks are not quite so forgiving.
You are not allowed to sleep, or call a function that might sleep, while holding a spin-lock. Make sure to build your driver development kernel with CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_SPINLOCK_SLEEP enabled, and the kernel will give you a nice reminder if you try to do something bad (eg. calling kmalloc while holding a lock).
The write() and read() operations of the driver need to copy data from (or to) user-space to (or from) a kernel buffer. However, copy_from_user() and copy_to_user() can sleep, so there is no way to copy directly to the spin-lock protected buffer!
There's also the following write sequencing issue: to write data into the buffer, you first need to check whether there is space. However, the spin-lock needs to be held to check the buffer state, so ideally you would hold the lock, check for space, release the lock, and then copy a matching amount of user-data to the kernel. But, since you are no longer holding the lock, an IRQ can come along and use up your space!
The solution, shown in the driver code, is to first copy all the user data into a kernel buffer, and then hold the lock while checking for space. This allows the (sleepable) copy and allocation calls to be performed before holding the lock. Of course in the case of a full buffer and non-blocking write, the allocation and copy from user-space was a waste of time.
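The copy-first, lock-second ordering can be sketched in user-space C. This is an illustration of the ordering only, not the code from simple_buffer.c: a pthread mutex stands in for the spin-lock, malloc() for kmalloc(), memcpy() for copy_from_user(), and the names shared_buf and staged_write are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define SHARED_SIZE 16

struct shared_buf {
    pthread_mutex_t lock;            /* stand-in for the driver's spin-lock */
    char   data[SHARED_SIZE];
    size_t count;                    /* bytes currently stored */
};

static void shared_buf_init(struct shared_buf *b)
{
    assert(b != NULL);
    pthread_mutex_init(&b->lock, NULL);
    b->count = 0;
}

/* Non-blocking write: returns bytes written, -EAGAIN if there is no room. */
static long staged_write(struct shared_buf *b, const char *user, size_t len)
{
    char *staged;
    long ret;

    assert(b != NULL && user != NULL);
    if (len == 0)
        return 0;

    /* Step 1: do everything that might sleep BEFORE taking the lock. */
    staged = malloc(len);            /* kmalloc() in the driver */
    if (staged == NULL)
        return -ENOMEM;
    memcpy(staged, user, len);       /* copy_from_user() in the driver */

    /* Step 2: check for space and append in one locked section, so that
     * nothing can use up the space between the check and the copy. */
    pthread_mutex_lock(&b->lock);
    if (len > SHARED_SIZE - b->count) {
        ret = -EAGAIN;               /* the staging work was wasted */
    } else {
        memcpy(b->data + b->count, staged, len);
        b->count += len;
        ret = (long)len;
    }
    pthread_mutex_unlock(&b->lock);

    free(staged);
    return ret;
}
```

The cost of this ordering is visible in the -EAGAIN path: the allocation and copy are done even when the buffer turns out to be full.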
The code that holds the spin-lock, checks for a condition, and then goes to sleep on a wait-queue if the condition is not met, should look eerily familiar to anyone who has programmed with Pthreads; it is the same pattern of code as used with a mutex and condition variable.
A mutex is used to protect a resource, while a condition variable is used to put a thread to sleep while waiting for some other thread to signal it that the condition has changed. The nice thing about this analogy is that you can write pthreads code to simulate driver buffering operations to 'figure it out' outside of the kernel.
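As a starting point for that kind of experiment, here is a minimal Pthreads sketch of the pattern. It is not the driver code: the mutex plays the role of the spin-lock, the condition variable plays the role of the wait-queue, and the names ring, ring_put, and ring_get are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define RING_SIZE 8

struct ring {
    pthread_mutex_t lock;       /* plays the role of the spin-lock */
    pthread_cond_t  not_empty;  /* plays the role of the wait-queue */
    unsigned char   data[RING_SIZE];
    size_t head;                /* index of the oldest byte */
    size_t count;               /* number of bytes stored */
};

static void ring_init(struct ring *r)
{
    assert(r != NULL);
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_empty, NULL);
    r->head = 0;
    r->count = 0;
}

/* Producer (IRQ handler or write): returns 1 if stored, 0 if full. */
static int ring_put(struct ring *r, unsigned char c)
{
    int stored = 0;

    pthread_mutex_lock(&r->lock);
    if (r->count < RING_SIZE) {
        r->data[(r->head + r->count) % RING_SIZE] = c;
        r->count++;
        stored = 1;
        pthread_cond_signal(&r->not_empty);  /* wake a sleeping consumer */
    }
    pthread_mutex_unlock(&r->lock);
    return stored;
}

/* Consumer (blocking read): sleeps until a byte is available. */
static unsigned char ring_get(struct ring *r)
{
    unsigned char c;

    pthread_mutex_lock(&r->lock);
    /* Re-check the condition after every wake-up; the wait releases the
     * mutex while sleeping and re-acquires it before returning. */
    while (r->count == 0)
        pthread_cond_wait(&r->not_empty, &r->lock);
    c = r->data[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    r->count--;
    pthread_mutex_unlock(&r->lock);
    return c;
}
```

Running ring_get() in one thread and ring_put() in another reproduces the blocked-read behavior; the while loop around the wait is the part that corresponds to re-checking the buffer state after being woken.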
The buffering used in the simple buffer driver is a bit contrived in that there are two 'producers' writing to the buffer, and one 'consumer'. A more likely scenario for a driver would be to have a buffer contended for by a single producer (say the receive IRQ), and a single consumer (say read), and another separate buffer for a single producer (write) and consumer (transmit IRQ).
But even in this situation, you can run into problems if the read from the buffer takes an excessive amount of time, blocking new data from the receive IRQ. One solution to this issue is to use two buffers for each producer-consumer pair; eg. the receive IRQ is initialized to point to an empty buffer, and receive IRQs fill the buffer until a read is issued. At that point the IRQ buffer is passed to read, and the IRQ gets the second empty buffer.
Once read has consumed the contents of the first buffer, if the second buffer in-use by the IRQ has new data, then the buffers are swapped again. In this scheme, the lock only needs to be held to swap the buffers, and since read does not hold the lock once it has a valid buffer, a copy to user-space from the kernel buffer is allowed, removing the need to use an intermediate buffer as shown in the simple buffer driver. The kernel tty layer uses this form of buffering scheme and refers to it as flip-buffering (see linux/tty.h).
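The two-buffer scheme can also be simulated in user-space. The sketch below is a simplified illustration and not the kernel tty implementation: it assumes a single producer and a single consumer, uses a pthread mutex in place of the spin-lock, and the names flip, flip_produce, and flip_consume are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define FLIP_SIZE 64

struct flip {
    pthread_mutex_t lock;        /* stand-in for the spin-lock */
    char   buf[2][FLIP_SIZE];
    int    fill_idx;             /* buffer the producer is filling */
    size_t fill_count;           /* bytes in that buffer */
};

static void flip_init(struct flip *f)
{
    assert(f != NULL);
    pthread_mutex_init(&f->lock, NULL);
    f->fill_idx = 0;
    f->fill_count = 0;
}

/* Producer (receive IRQ): append to the current fill buffer.
 * Returns the number of bytes accepted (less than len if it is full). */
static size_t flip_produce(struct flip *f, const char *src, size_t len)
{
    size_t room;

    pthread_mutex_lock(&f->lock);
    room = FLIP_SIZE - f->fill_count;
    if (len > room)
        len = room;
    memcpy(f->buf[f->fill_idx] + f->fill_count, src, len);
    f->fill_count += len;
    pthread_mutex_unlock(&f->lock);
    return len;
}

/* Consumer (read): dst must hold FLIP_SIZE bytes. Returns bytes copied. */
static size_t flip_consume(struct flip *f, char *dst)
{
    int    full_idx;
    size_t n;

    /* Hold the lock only long enough to swap the buffers. */
    pthread_mutex_lock(&f->lock);
    n = f->fill_count;
    full_idx = f->fill_idx;
    if (n > 0) {
        f->fill_idx = 1 - full_idx;   /* producer moves to the other buffer */
        f->fill_count = 0;
    }
    pthread_mutex_unlock(&f->lock);

    /* The swapped-out buffer is now private to the consumer, so this copy
     * (copy_to_user() in a driver) is done without holding the lock. */
    if (n > 0)
        memcpy(dst, f->buf[full_idx], n);
    return n;
}
```

The design relies on there being exactly one consumer: the producer never touches the swapped-out buffer until the consumer swaps again.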
The simple buffer driver has (at least) two practical applications. If you install it and "cat" the timer-generated time stamps into a file, a plot of the difference between consecutive time stamps, minus 1 second, will show the error in the kernel's ability to generate a 1-second interval.
The observed error of the measured timestamp relative to that same timestamp rounded to the nearest second was about ±0.5ms. If the test PC (laptop) had its ethernet cable disconnected, or the NTP daemon was stopped, the error of the logged timestamps relative to the GPS 1pps tick would gradually increase (100 to 200µs over 10 minutes). If you had a method of generating a higher-frequency square-wave that was also locked to GPS, then you could determine the interrupt latency, and interrupt handling overhead, of the kernel by hammering the IRQ pin at a few kilohertz.
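The interval measurement mentioned above, the difference between consecutive time stamps minus 1 second, reduces to a few lines of arithmetic. The sketch below assumes the logged time stamps have already been parsed into integer microseconds; the parsing step, the NOMINAL_US constant, and the function name are assumptions, since the driver's text format is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define NOMINAL_US 1000000LL   /* nominal 1 second interval, in microseconds */

/* For n time stamps, write the n-1 interval errors (measured interval minus
 * the nominal interval) into err_us. Returns the number of errors written. */
static size_t interval_errors_us(const long long *ts_us, size_t n,
                                 long long *err_us)
{
    size_t i;

    assert(ts_us != NULL && err_us != NULL);
    if (n < 2)
        return 0;              /* need at least two stamps for one interval */
    for (i = 0; i + 1 < n; i++)
        err_us[i] = (ts_us[i + 1] - ts_us[i]) - NOMINAL_US;
    return n - 1;
}
```

For example, time stamps of 0, 1000100, and 1999900 microseconds give interval errors of +100 and -200 microseconds.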
A 'real-world' PCI driver
The experience presented in this document was gained during the development of the Caltech-OVRO Broadband Reconfigurable Array (COBRA) Correlator System. The hardware developed is documented at www.ovro.caltech.edu/~dwh/correlator.
The hardware is currently in use on several radio astronomy projects, eg. the SZ Array (http://astro.uchicago.edu/sza/) and the CARMA array (http://www.mmarray.org). The cPCI digitizer and correlator boards used in the correlator system contain a PLX9054 PCI interface, a Texas Instruments DSP, Altera FLEX10K FPGAs, and on the digitizer, 1GHz analog-to-digital converters.
The digitizer output routes to the FPGAs on the digitizer board, where data is digitally filtered, delayed, and routed to front-panel high-speed connectors. The data travels over LVDS cabling (Ultra-SCSI cables) to the correlator boards, where FPGAs cross-correlate and average the data.
The on-board DSPs retrieve auto- and cross-correlation results from the FPGAs, perform FFTs, further corrections, and average the data for 100ms to 500ms. Data is then transferred to a Linux host.
The system uses a GPS-based NTP server with a 1pps output. The 1pps signal is used to derive a hardware heartbeat, so that the 100ms and 500ms transfers are aligned with real-time. The Linux hosts run NTP pointing to the NTP server, and check that data from boards arrives within a 50ms window relative to a 100ms or 500ms boundary.
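The host-side arrival check is simple modular arithmetic. The sketch below is one possible reading, not the COBRA code: it assumes the arrival time is available in integer milliseconds of NTP-disciplined time, and that the 50ms window is measured from the boundary onwards; the function name is hypothetical.

```c
#include <assert.h>

/* Returns 1 if t_ms falls within window_ms after the most recent multiple
 * of period_ms, otherwise 0. All times are in milliseconds. */
static int arrived_in_window(long long t_ms, long long period_ms,
                             long long window_ms)
{
    assert(t_ms >= 0 && period_ms > 0 && window_ms >= 0);
    return (t_ms % period_ms) <= window_ms;
}
```

With a 100ms period and a 50ms window, an arrival 30ms after a boundary passes and an arrival 70ms after a boundary fails.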
The Linux driver used in the COBRA system is shown graphically in Figure 1, below. The driver implements several character device interfaces to the board: a terminal-like interface with standard-input, output, and error, a read/write control interface, a read-only data interface, and a read-only monitoring interface.
The choice of multiple devices, rather than a complex scheme of I/O control, was determined by the usage of the driver. For example, one objective was to enable the use of standard command line tools like cat, od (octal dump), echo, and dd. These tools know nothing of I/O control calls, so they need to be directed to a device node of a specific 'personality'.
Figure 1: COBRA device driver block diagram. The block diagram shows the relationship between the /dev nodes accessed by user-space applications and the files that implement the driver.
The COBRA control system code controls up to 20 boards in a single sub-system, and data must be collected from each board at about the same time. The standard method for dealing with multiple sources of data is to use the select() call, which uses file-descriptors. So by separating out the data device and monitor device functionality at the driver-level, a user-space server can run a thread containing a select() call that collects all the data from all boards, and serves that data up to clients. Then another thread, or another process even, can run a monitor server containing a thread calling select() on all the monitor file descriptors.
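The select()-based collection loop can be sketched as follows. This is an illustration, not the COBRA server code: the descriptors would come from open() on each board's data device node, the names board_data and collect_once are hypothetical, and the 64-byte chunk size is an arbitrary choice for the demonstration.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_BOARDS 20
#define CHUNK      64

struct board_data {
    char    buf[CHUNK];
    ssize_t len;               /* bytes read this pass, 0 if none */
};

/* Wait up to timeout_ms for data on any descriptor, then read every ready
 * descriptor once. Returns the number of boards that delivered data,
 * 0 on timeout, or -1 if select() failed. */
static int collect_once(const int *fds, int nfds, struct board_data *out,
                        int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int i, maxfd = -1, ready, got = 0;

    assert(fds != NULL && out != NULL && nfds > 0 && nfds <= MAX_BOARDS);

    FD_ZERO(&rfds);
    for (i = 0; i < nfds; i++) {
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
        out[i].len = 0;
    }
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;          /* 0: timeout, -1: error */

    for (i = 0; i < nfds; i++) {
        if (FD_ISSET(fds[i], &rfds)) {
            out[i].len = read(fds[i], out[i].buf, CHUNK);
            if (out[i].len > 0)
                got++;
        }
    }
    return got;
}
```

A data server thread would call a function like this in a loop on the data descriptors, while a separate monitor thread does the same on the monitor descriptors. The sketch can be exercised without hardware by passing in the read ends of pipes.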
Dr. David Hawkins, Senior Scientist at the California Institute of Technology, is currently involved with the design and development of high-speed digital correlator systems for Caltech, U. Chicago, and the CARMA (Caltech, Berkeley, U. Illinois, and U. Maryland) radio observatories.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. For more information, please visit www.embedded.com/esc/sv.
J. Corbet, A. Rubini, and G. Kroah-Hartman. Linux Device Drivers. O'Reilly, 3rd edition, 2005.
D. Hawkins. COBRA device driver. Caltech-OVRO documentation, 2004. (www.ovro.caltech.edu/~dwh/correlator/pdf/cobra_driver.pdf).
D. Hawkins. PLX-9054 PCI Performance Tests. Caltech-OVRO documentation, 2004. (www.ovro.caltech.edu/~dwh/correlator/pdf/pci_performance.pdf).
D. Hawkins. Linux driver design source code. Caltech-OVRO documentation, 2005. (www.ovro.caltech.edu/~dwh/correlator/software/driver_design.tar.gz).