Debugging simulated hardware on Linux, Part 1
Device driver debugging
Test your driver's entire code flow
Imagine that you're asked to develop a device driver from scratch, but you don't have the target hardware available to you while you are developing the software. (Perhaps the hardware and software development are occurring simultaneously.) How would you go about designing, developing, testing, and debugging your driver?
Furthermore, suppose your device driver is very complex, having multiple threads, accessing hardware registers, and utilizing such advanced programming as DMA in user land by making use of
Or, what if your driver's Interrupt Service Routine (ISR) is too complex, having multiple interrupts in a setup where every interrupt has to be treated differently? Alternatively, what if you are asked to write a device driver for an embedded system that does not have a very sophisticated debugging environment? You may also be asked to write software to test the hardware itself.
This article explains the methods I follow during the design, development, and debugging stages of a driver. They help you test the complete code flow of a driver at any or all of those stages. I consider this approach a development strategy rather than a debugging technique.
For a more detailed discussion of the problem and implementation details, please read Part 2 of this two-part series, "Debugging simulated hardware on Linux, Part 2: Interrupts and Interrupt Service Routine."
Chipless or cardless test
This section describes a test of the control flow through the driver's entry points and low-level helper functions. It exercises all the logic and control statements used in the driver source. In this method, all the entry points and helper functions are either replaced by dummy functions or mapped to dummy macros.
To do this, you can use either of these techniques:
- Introduce preprocessor directives (#ifdef) for the functions in the header file. This approach requires recompilation. The macros defined in the header file override the actual function calls and function implementations.
- Introduce separate files and modules that contain the dummy/stub functions. During linking, you link either the original function implementations or the dummy/stub functions. This approach does not require recompilation; it only requires relinking. You can add a provision to the makefile to produce either the debug version or the release version of the final executable.
In both of these techniques, the dummy functions may return success and log an appropriate message when they are called. These dummy or self-testing functions may do very minimal tasks in order to keep the entire system in a safe state, such as setting some device- or driver-related global flags and manipulating global structures. The rest of the driver operations may need these flag settings and structure manipulations.
You can do this testing without needing the actual hardware, so these functions comprise a pseudo driver.
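The first technique can be sketched in plain C as follows. This is only an illustration under stated assumptions: the PSEUDO_DRIVER switch, the my_dev structure, and the my_hw_reset() and my_driver_init() functions are hypothetical names, and printf() stands in for the kernel's printk() so that the sketch builds as an ordinary user-space program.

```c
#include <stdio.h>

/* Hypothetical build switch; a makefile would normally pass -DPSEUDO_DRIVER. */
#define PSEUDO_DRIVER 1

/* Hypothetical global state that the rest of the driver relies on. */
struct my_dev_state {
    int hw_ready;    /* set once the "hardware" has been reset */
    int stub_calls;  /* number of stubbed hardware calls so far */
};

static struct my_dev_state my_dev;

#ifdef PSEUDO_DRIVER
/* Stub: touches no hardware, logs the call, and keeps the state consistent. */
static int my_hw_reset(void)
{
    printf("stub: my_hw_reset() called\n"); /* printk() in a real module */
    my_dev.hw_ready = 1;
    my_dev.stub_calls++;
    return 0; /* always report success */
}
#else
int my_hw_reset(void); /* the real implementation is linked in instead */
#endif

/* Driver logic under test; it is identical in both builds. */
static int my_driver_init(void)
{
    if (my_hw_reset() != 0)
        return -1;
    return my_dev.hw_ready ? 0 : -1;
}
```

Because the stub sets the same flag the real reset routine would set, the unchanged initialization logic can run to completion without a card in the machine.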
How testing works
Say, for instance, that in the init_module() entry point, you scan the PCI bus for your devices. You might have a function this_driver_pci_scan() to scan for your devices. Upon successful detection of the controllers of interest, the function updates device-related global structures. Your PCI scanning function may return the number of cards found that match the device and vendor IDs. In the pseudo version of this function, you can always return a fixed number (for instance, 1).
Instead of calling pci_find_device() and other PCI-related functions within this_driver_pci_scan(), you can set the global structures directly. If you do not want to touch the original source, you can use wrappers so that the PCI-related calls never reach the real PCI bus and simply return success. You could put all of these wrapper functions and inline functions in a header file and include them based on conditional compilation.
To put it simply, you can map the pci_find_device() kernel API to a function you write yourself so that control never reaches the PCI subsystem of Linux. Similarly, you can map all other hardware I/O functions, such as pci_read_config_XXX, either in a separate linkable C source file or in a detachable header file.
These source and header files are the basis for the "self testing" of your driver. With this approach, you can test the source of init_module() and other driver initialization routines.
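A pseudo version of the scanning function described above might look like the following user-space sketch. Only the name this_driver_pci_scan() comes from the text; the my_card structure, the my_cards table, the ID values, and my_init() are assumptions made up for illustration, and printf() again stands in for printk().

```c
#include <stdio.h>

#define MY_MAX_CARDS 4

/* Hypothetical per-card bookkeeping that the real scan would fill in. */
struct my_card {
    unsigned short vendor_id;
    unsigned short device_id;
    int present;
};

static struct my_card my_cards[MY_MAX_CARDS];

/* Pseudo scan: makes no PCI calls, fills in one entry, reports one card. */
static int this_driver_pci_scan(void)
{
    my_cards[0].vendor_id = 0x1234; /* made-up IDs, not a real device */
    my_cards[0].device_id = 0x5678;
    my_cards[0].present = 1;
    printf("pseudo: this_driver_pci_scan() reporting 1 card\n");
    return 1;
}

/* Stand-in for the part of init_module() that depends on the scan result. */
static int my_init(void)
{
    int found = this_driver_pci_scan();

    if (found <= 0)
        return -1; /* a real module would typically return -ENODEV */
    return 0;
}
```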
The basic guideline is that any call that directly accesses the device (such as outb, etc.) should be mapped to a dummy/pseudo function that returns success and may also do some simple tasks like updating global flags and structures.
Similarly, you can map and test the code flow of other driver entry points and their helper functions. You might have a big initialization routine of more than 100 lines of code, and it might have four or five direct hardware access calls, possibly within a loop. The remaining portion of the code could be driver- and device-specific details and the logic associated with them. In this case, you do not need to replace the entire function as a dummy/stub function. Instead, you map only the hardware access calls so that the entire code flow is covered during debugging and testing.
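As a rough illustration of mapping only the hardware access calls, the sketch below redirects outb and inb to a small in-memory array and leaves the surrounding routine untouched. The port base, the array, and my_big_init() are hypothetical; the sketch assumes the Linux argument order outb(value, port) and is written as user-space C so that it can be compiled and stepped through anywhere.

```c
#include <stdio.h>

#define FAKE_PORT_BASE  0x300u /* made-up I/O base address */
#define FAKE_PORT_COUNT 8u

/* In-memory stand-in for the device's I/O ports. */
static unsigned char fake_ports[FAKE_PORT_COUNT];

static void pseudo_outb(unsigned char value, unsigned int port)
{
    /* Unsigned subtraction also rejects ports below the base. */
    if (port - FAKE_PORT_BASE < FAKE_PORT_COUNT)
        fake_ports[port - FAKE_PORT_BASE] = value;
    printf("pseudo: outb(0x%02x, 0x%x)\n", (unsigned int)value, port);
}

static unsigned char pseudo_inb(unsigned int port)
{
    if (port - FAKE_PORT_BASE < FAKE_PORT_COUNT)
        return fake_ports[port - FAKE_PORT_BASE];
    return 0xff; /* value chosen here for ports outside the fake range */
}

/* The only change the driver source sees: the calls are remapped. */
#define outb(value, port) pseudo_outb((value), (port))
#define inb(port)         pseudo_inb((port))

/* Stand-in for a long initialization routine with a few hardware calls. */
static int my_big_init(void)
{
    unsigned int i;

    for (i = 0; i < FAKE_PORT_COUNT; i++)
        outb((unsigned char)(0x10u + i), FAKE_PORT_BASE + i);

    for (i = 0; i < FAKE_PORT_COUNT; i++)
        if (inb(FAKE_PORT_BASE + i) != 0x10u + i)
            return -1; /* read-back mismatch */
    return 0;
}
```

Everything in my_big_init() other than the two remapped calls runs exactly as it would in the real driver, which is the point of mapping calls rather than replacing whole functions.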
You could adopt the same method for non-PCI-based device drivers as well. I used the same technique while developing a USB gadget driver without any trouble.
Testing code flow in ISR
As you know, interrupts are asynchronous and need to be serviced carefully. The ISR runs in a special context, and you should take care that regular kernel code running in process context does not race with the ISR.
So, what is the process to test the code flow in the ISR? The following sections give you a simple roadmap.
By using the polling method, you can trace the complete code path in the ISR.
Schedule the tasklet
You can schedule a tasklet to act as the actual ISR. Tasklet context is close to interrupt context, which is why tasklets, like ISRs, are restricted from using certain blocking calls.
Write the kernel thread
You can disable the interrupt and write a simple kernel thread that executes the ISR at regular intervals, or in a particular sequence or with particular settings that you program.
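The polling idea can be tried out in user space before any kernel code is written. In the sketch below, an ordinary loop takes the place of the kernel thread, and a variable takes the place of the device's interrupt status register; the status bits, counters, and function names are all hypothetical.

```c
#include <stdio.h>

#define IRQ_STATUS_RX 0x01u /* made-up status bits */
#define IRQ_STATUS_TX 0x02u

/* Stand-in for the device's interrupt status register. */
static unsigned int fake_irq_status;
static int rx_handled;
static int tx_handled;

/* The ISR body under test: each interrupt cause is treated differently. */
static void my_isr(void)
{
    unsigned int status = fake_irq_status;

    fake_irq_status = 0; /* acknowledge everything that was pending */
    if (status & IRQ_STATUS_RX)
        rx_handled++;
    if (status & IRQ_STATUS_TX)
        tx_handled++;
    printf("my_isr: status=0x%x\n", status);
}

/* One polling pass; a kernel thread would repeat this and sleep between passes. */
static int my_poll_once(void)
{
    if (fake_irq_status == 0)
        return 0;
    my_isr();
    return 1;
}

/* Feed a programmed sequence of status values and count the serviced passes. */
static int my_poll_sequence(const unsigned int *seq, int n)
{
    int i;
    int serviced = 0;

    for (i = 0; i < n; i++) {
        fake_irq_status = seq[i];
        serviced += my_poll_once();
    }
    return serviced;
}
```

Driving the ISR from a scripted sequence makes every branch reachable on demand, including combinations of causes that are hard to provoke on real hardware.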
Use soft IRQs
To test your ISR in true interrupt context, you can use soft IRQs and software-generated interrupts. For instance, on an x86 platform, you could use the INT instruction to raise an interrupt that causes your driver's ISR to be invoked. The operand of the INT inline assembly instruction identifies the interrupt to raise; note that it is an interrupt vector number, so you need the vector that your controller's IRQ is mapped to rather than the raw IRQ number. A better way to do this would be to use the
Employ a special ioctl function
A more sophisticated way to control how interrupts are raised and to test the ISR is a two-tier architecture: a user application decides which interrupt, or sequence of interrupts, to raise and when, and a special ioctl implementation in kernel land acts on those requests. The application sets the appropriate fields of a request and passes it to the special ioctl, which in turn either raises the interrupts itself or signals the kernel thread to raise them.
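The request passed to such an ioctl could be as simple as the structure in the following sketch. Everything here is an assumption made for illustration: the structure layout, the MY_IOC_RAISE_IRQ command number, and the function names are invented, and the dispatcher is an ordinary function rather than a real ioctl entry point. A real driver would define the command with the kernel's _IOW family of macros and copy the request from user space with copy_from_user().

```c
#include <stdio.h>

#define MY_MAX_SOURCES   4
#define MY_IOC_RAISE_IRQ 1u /* made-up command number */

/* Request filled in by the test application. */
struct my_irq_request {
    unsigned int source; /* which simulated interrupt source to raise */
    unsigned int count;  /* how many times to raise it */
};

static unsigned int my_isr_hits[MY_MAX_SOURCES];

/* ISR under test; here it only records which source fired. */
static void my_isr(unsigned int source)
{
    my_isr_hits[source]++;
}

/* Stand-in for the kernel-side ioctl handler. */
static int my_debug_ioctl(unsigned int cmd, const struct my_irq_request *req)
{
    unsigned int i;

    if (cmd != MY_IOC_RAISE_IRQ || req == NULL)
        return -1; /* unknown command or missing request */
    if (req->source >= MY_MAX_SOURCES)
        return -1; /* reject sources the driver does not have */

    for (i = 0; i < req->count; i++)
        my_isr(req->source);
    printf("raised source %u, %u time(s)\n", req->source, req->count);
    return 0;
}
```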
My personal favorite
I prefer to load the software in the debugger and run each individual path/line of code at least once and test all the branching statements. If you want to try that, the ideal way is to disable the interrupts and perform polling while you are still in the debugger.
In the debugger, you can then set different values for variables at runtime and observe the effect. However, you must be very sure about the time-critical nature of any particular function that you are debugging.
I once had a nightmarish experience while debugging a 400MB VSS source that was developed a decade ago with multilayered and multiple modules. Every time I entered the debugger, the TTL (Time to Live) factor would disconnect the network communication I had established before I started the debugging session. If you are developing a connection-oriented networking driver, you should take this into account.
Ideally, you can run your driver in the debugger and debug it while interrupts are enabled and being processed. Doing so requires some level of expertise with both the debugger and interrupts.
Actual test cases
If you've come this far, you are now ready to test your driver on the actual hardware. After testing all the code paths, carefully remove all the wrappers and dummy/stub functions and let the driver access the hardware.
A natural question to ask at this point is: Why would anyone want to do all the simulation, polling, etc., when the hardware is available? Here are some reasons:
- If the device and/or kernel has some bugs, those bugs must be fixed first before you can proceed with testing and debugging the hardware. Otherwise, you won't know whether the unexpected events/outputs arose because of a hardware/kernel bug or because of a driver bug.
- If the driver you are developing has diagnostic routines to test the hardware, there should not be any bugs in the diagnostics code. Otherwise, you won't be able to test the hardware, since the driver itself is buggy.
Testing and debugging the device driver on the simulated environment is obviously easier than testing the same on the actual hardware. In the simulated environment, you have complete control over raising interrupts, and you can step through the source code. In the actual target device environment, interrupt generation could be more asynchronous. You also may not be able to step through some portion of the code in the target hardware environment if the environment is something similar to the connection-oriented network traffic mentioned in the previous section.
Another reason to go through the trouble of simulation when the hardware is available is that sometimes the kernel and/or the hardware will malfunction, and it is difficult to debug untested code on a new kernel or new hardware. As a developer, you get used to testing your code before it runs on the hardware. When more than one thing is going wrong at once, each individual problem is hard to isolate.
In testing, first enable all the hardware access calls in the initialization routine and see if everything works properly.
One very basic test to conduct on any hardware is the access test: You should be able to read and write to the hardware. This basic test must be conducted before you go further with any other control logic in the driver. Most devices have some well-known registers or memory locations that are pre-initialized (that is, filled with well-known values) during power-up. You should be able to read those registers or memories to confirm that basic hardware read access works properly.
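A read access test of this kind can be sketched as a table of well-known registers and the values they should hold after power-up. The register offsets, expected values, and function names below are invented for illustration; the read routine is passed in as a function pointer so that the same test can run first against a fake and later against the real register read.

```c
#include <stdint.h>
#include <stdio.h>

/* Made-up register map: offset and expected power-up value. */
struct my_known_reg {
    unsigned int offset;
    uint32_t expected;
};

static const struct my_known_reg my_known_regs[] = {
    { 0x00u, 0xCAFE0001u },
    { 0x04u, 0x00000010u },
};

typedef uint32_t (*my_read_fn)(unsigned int offset);

/* Returns the number of registers whose value differs from the expected one. */
static int my_access_test(my_read_fn read_reg)
{
    size_t i;
    int mismatches = 0;

    for (i = 0; i < sizeof(my_known_regs) / sizeof(my_known_regs[0]); i++) {
        uint32_t got = read_reg(my_known_regs[i].offset);

        if (got != my_known_regs[i].expected) {
            printf("access test: reg 0x%02x read 0x%08lx, expected 0x%08lx\n",
                   my_known_regs[i].offset, (unsigned long)got,
                   (unsigned long)my_known_regs[i].expected);
            mismatches++;
        }
    }
    return mismatches;
}

/* Fake read that behaves like a healthy device. */
static uint32_t fake_read_good(unsigned int offset)
{
    return offset == 0x00u ? 0xCAFE0001u : 0x00000010u;
}

/* Fake read that returns all ones, as an unreachable device often does. */
static uint32_t fake_read_dead_bus(unsigned int offset)
{
    (void)offset;
    return 0xFFFFFFFFu;
}
```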
Once these lowest-level functions are tested, you can replace all dummy functions with the actual entry points and helper functions, one by one and level by level.
I use these approaches every time I do a native migration or develop a fresh driver. Incremental testing and development is always better than testing all the untested code in a single shot (brute-force testing). It is always better to test the software as you develop, and implement new features one by one.
The following test cases provide step-by-step instructions on what to check for.
Locking primitives released
Check whether all the locking primitives (spinlocks, semaphores, mutexes, read-write locks) are released in all possible code flows. This helps you distinguish whether you are in an infinite loop or in a deadlock.
If the system ever locks up and you suspect one of the locking primitives, log a message at every place where these locking primitives are used. I also suggest having a logging or debug trace facility in the driver that can be enabled or disabled and can distinguish different levels of debugging. If you were to place a debug trace in an infinite loop, you would definitely get a console full of messages.
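A leveled trace facility of the kind suggested above can be very small. The following is a user-space sketch under stated assumptions: the level names, the MY_TRACE macro, and the fake lock are hypothetical, and vfprintf() stands in for the kernel's printk().

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical trace levels, from least to most verbose. */
enum { MY_DBG_OFF = 0, MY_DBG_ERR, MY_DBG_INFO, MY_DBG_LOCK };

static int my_debug_level = MY_DBG_ERR; /* adjustable at runtime */
static int my_trace_count;              /* messages actually emitted */

/* Emit the message only if its level is enabled. */
static void my_trace(int level, const char *fmt, ...)
{
    va_list ap;

    if (level > my_debug_level)
        return;
    my_trace_count++;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}

#define MY_TRACE(level, ...) my_trace((level), __VA_ARGS__)

static int fake_lock_held; /* stand-in for a spinlock or mutex */

/* Every acquire and release is traced at the LOCK level. */
static void my_locked_op(void)
{
    MY_TRACE(MY_DBG_LOCK, "%s: taking lock\n", __func__);
    fake_lock_held = 1;
    /* ... critical section ... */
    fake_lock_held = 0;
    MY_TRACE(MY_DBG_LOCK, "%s: released lock\n", __func__);
}
```

With the level left at MY_DBG_ERR the lock messages are suppressed; raising it to MY_DBG_LOCK shows each acquire paired with its release, so an acquire without a matching release stands out in the log.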
If the driver is a module, check to see if loading and unloading the driver succeeds.
Check whether automatic loading and automatic unloading works properly without any warnings or messages from the kernel.
Check whether proper shutdown happens after the driver is loaded and unloaded a few times. Any memory leaks or invalid memory accesses that were not thrown during driver access would most likely be thrown during shutdown.
Check whether open, close; open, read, close; and open, write, close succeed. Check all possible combinations at least once.
Also check that all ioctl calls work as required.
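The open, read, write, and close sequences can be driven from a small user-space program such as the sketch below. The function names are hypothetical, and the device node is passed in as a parameter because the real path depends on your driver (for example, something like /dev/mydriver0, which is an invented name).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* open, close */
static int my_open_close(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    return close(fd);
}

/* open, read, close */
static int my_open_read_close(const char *path)
{
    char buf[16];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf));
    if (close(fd) != 0)
        return -1;
    return n < 0 ? -1 : 0;
}

/* open, write, close */
static int my_open_write_close(const char *path)
{
    ssize_t n;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    n = write(fd, "x", 1);
    if (close(fd) != 0)
        return -1;
    return n == 1 ? 0 : -1;
}

/* Runs all three sequences and returns the number that failed. */
static int my_run_sequences(const char *path)
{
    int failures = 0;

    if (my_open_close(path) != 0)       { printf("open/close failed\n");       failures++; }
    if (my_open_read_close(path) != 0)  { printf("open/read/close failed\n");  failures++; }
    if (my_open_write_close(path) != 0) { printf("open/write/close failed\n"); failures++; }
    return failures;
}
```

Running the same program in a loop is also a simple way to repeat the sequences many times when you are looking for leaks.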
Check for memory leaks with the appropriate tools. Since the kernel modules have access to the entire memory space, it may be difficult to distinguish between legal and illegal memory access. Very small memory leaks in the driver may not be easy to catch. You might want to automate the testing if you're going to run your tests a number of times.
The best approach is to go through the entire source and come up with test cases that cover all possible code paths (for example, a for loop, boundary conditions, locks, and so on). Think carefully about all the places the source may fail, and produce test cases for those situations. Sometimes, self-testing code is a better option than a debugger. The debugger is not everything; it is a tool, and it gives better results when you apply analytical thinking and some common sense.
- In Part 2 of this two-part series, "Debugging simulated hardware on Linux, Part 2: Interrupts and Interrupt Service Routine" (developerWorks, November 2005), learn about strategies and implementation details that you can apply to interrupt simulation, including the prerequisites, hardware, software setup, and test cases for testing the Interrupt Service Routine (ISR).
- "Smashing performance with OProfile" (developerWorks, October 2003) introduces a tool that can help identify issues such as poor cache utilization, inefficient type conversion and redundant operations, and branch mispredictions.
- Understanding the Linux Kernel, Third Edition (O'Reilly, November 2005) gives you a guided tour of the code that forms the core of all Linux operating systems.
- Linux Device Drivers, Third Edition (O'Reilly, February 2005) includes full-featured examples that programmers can compile and run without special hardware.