Debugging tools and techniques for Linux on Power

Debugging is a major part of software development that no application developer can avoid. Effective debugging not only shortens the development cycle but also reduces cost. This article introduces techniques for locating bugs in user-space C/C++ and Java applications and describes some of the debugging tools available on Linux for the POWER architecture.

02 October 2013 - Per author request, updated list of memory misuses that Valgrind can identify. See text that follows Listing 7.

Calvin Sze (calvins@us.ibm.com), Linux consultant, IBM

Calvin Sze is a Linux consultant for the IBM eServer Solutions Enablement organization at IBM, based in Austin, TX. Calvin’s main role is to help solution developers bring their applications to Linux on POWER. Calvin has been involved in software development and system integration on both Linux and AIX platforms for more than 10 years. You can contact Calvin at calvins@us.ibm.com.



02 October 2013 (First published 04 August 2005)

Introduction

There are many ways to go about debugging a program, such as printing messages to the screen, using a debugger, or just thinking about how the program runs and making an educated guess about the problem.

Before you can fix a bug, you must locate its source. For example, when a program crashes and produces a core dump, you need to know on which line of code the crash occurred. After you find the line of code in question, you can determine the value of the variables in that function, how the function was called, and, specifically, why the error occurred. Using a debugger makes finding all of this information simple.

This article introduces several techniques for fixing bugs that can be difficult to find by visually inspecting your code and also illustrates how to use tools that are available on Linux on Power architecture.


Tools and techniques for debugging memory problems

Dynamic memory allocation seems straightforward enough: you allocate memory on demand -- using malloc() or one of its variants -- and free memory when it's no longer needed. In practice, memory management problems are among the most common bugs in software because they are usually not obvious when a program starts. For example, a memory leak might go unnoticed until the program has run for days or even months. The following sections show how to use the popular debugging tool Valgrind to find and debug the most common memory bugs.

Before you start using any debugging tool, consider whether it might be beneficial to recompile your application and supporting libraries with debugging information enabled (the -g flag). Without debugging information enabled, the best a debugging tool can do is guess to which function a particular piece of code belongs. This makes both error messages and profiling output nearly useless. With -g, you'll hopefully get messages that point directly to the relevant source code lines.

Valgrind

Valgrind is widely used as an application debugger in the Linux development community. It is especially powerful for finding memory management problems, such as memory leaks and memory corruption, in the program being run. Valgrind is developed by Julian Seward and was ported to the Power architecture by Paul Mackerras.

To install Valgrind, download its source code from the Valgrind Web site (see Resources). Go to the Valgrind directory and issue the following commands:

# ./configure
# make
# make check
# make install

The error report from Valgrind

The output from Valgrind has the following format:

Listing 1. Output from Valgrind
# valgrind du -x -s
.
.
==29404==  Address 0x1189AD84 is 0 bytes after a block of size 12 alloc'd
==29404==    at 0xFFB9964: malloc (vg_replace_malloc.c:130)
==29404==    by 0xFEE1AD0: strdup (in /lib/tls/libc.so.6)
==29404==    by 0xFE94D30: setlocale (in /lib/tls/libc.so.6)
==29404==    by 0x10001414: main (in /usr/bin/du)

The ==29404== prefix is the process ID. The message Address 0x1189AD84 is 0 bytes after a block of size 12 alloc'd indicates that the address lies immediately past the end of a 12-byte allocated block, where there is no valid storage. The lines that follow are the call stack of the allocation: the block was allocated by malloc() (line 130 of the file vg_replace_malloc.c), which was called from strdup(); strdup() was called from setlocale() in the library libc.so.6; and main() called setlocale().

Uninitialized memory

One of the most common bugs occurs when a program uses uninitialized memory. Uninitialized data comes from two sources:

  • Variables that have not been initialized
  • Memory allocated by malloc() that is read before any value has been written to it

The following example uses an uninitialized array:

Listing 2. Use of uninitialized memory
      1 int main(void)
      2 {
      3         int i[5];
      4 
      5         if (i[0] == 0)
      6                 i[1]=1;
      7         return 0;
      8 }

In this example, the integer array i is uninitialized; therefore, i[0] holds an indeterminate value. Using the value of i[0] to decide a conditional branch makes the program's behavior unpredictable. Valgrind catches this error condition easily. When you run Valgrind on this program, you receive the following message:

Listing 3. Message from Valgrind
# gcc -g -o test1 test1.c
# valgrind ./test1
.
.
==31363== 
==31363== Conditional jump or move depends on uninitialised value(s)
==31363==    at 0x1000041C: main (test1.c:5)
==31363== 
==31363== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 7 from 1)
==31363== malloc/free: in use at exit: 0 bytes in 0 blocks.
==31363== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==31363== For counts of detected errors, rerun with: -v
==31363== No malloc'd blocks -- no leaks are possible.

The output of Valgrind indicates that there is a conditional branch that depends on an uninitialized value in line five of file test1.c.
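
The following is a minimal sketch of two ways to remove the error reported in Listing 3, assuming the intent was for the array to start out filled with zeros. The function names are hypothetical and are not part of the original test1.c.

```c
#include <stdlib.h>

/* Variant 1: give the automatic array an initializer so that every
 * element is zero before it is read. */
int branch_on_initialized_array(void)
{
    int i[5] = {0};

    if (i[0] == 0)
        i[1] = 1;
    return i[1];
}

/* Variant 2: for heap memory, calloc() returns zero-filled storage,
 * whereas malloc() leaves the contents indeterminate. */
int branch_on_calloc_block(void)
{
    int *p = calloc(5, sizeof *p);
    int result;

    if (p == NULL)
        return -1;
    if (p[0] == 0)
        p[1] = 1;
    result = p[1];
    free(p);
    return result;
}
```

If you call either function from a small main() and run the result under Valgrind, the conditional branch no longer depends on an uninitialized value.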

Memory leak

A memory leak is another common problem and is often the most difficult one to identify. The primary symptom of a memory leak is that the memory (or heap) associated with a program grows larger and larger as the program continues to run. When the memory consumed by the program approaches the system limit, the program crashes or, in the worst case, hangs or causes the system to crash. Here is an example program with a memory leak bug:

Listing 4. Memory leak example
      1 int main(void)
      2 {
      3         char *p1;
      4         char *p2;
      5 
      6         p1 = (char *) malloc(512);
      7         p2 = (char *) malloc(512);
      8 
      9         p1=p2;
     10 
     11         free(p1);
     12         free(p2);
     13 }

The code above allocates two 512-byte blocks of memory and assigns them to the char pointers p1 and p2, respectively. It then overwrites p1 with p2, so both pointers refer to the second block. As a result, the address of the first block is lost, which causes a memory leak. When you run Valgrind on this program, it returns the following message:

Listing 5. Message from Valgrind
# gcc -g -o test2 test2.c
# valgrind ./test2
.
.
==31468== Invalid free() / delete / delete[]
==31468==    at 0xFFB9FF0: free (vg_replace_malloc.c:152)
==31468==    by 0x100004B0: main (test2.c:12)
==31468== Address 0x11899258 is 0 bytes inside a block of size 512 free'd
==31468==    at 0xFFB9FF0: free (vg_replace_malloc.c:152)
==31468==    by 0x100004A4: main (test2.c:11)
==31468== 
==31468== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 7 from 1)
==31468== malloc/free: in use at exit: 512 bytes in 1 blocks.
==31468== malloc/free: 2 allocs, 2 frees, 1024 bytes allocated.
==31468== For counts of detected errors, rerun with: -v
==31468== searching for pointers to 1 not-freed blocks.
==31468== checked 167936 bytes.
==31468== 
==31468== LEAK SUMMARY:
==31468==    definitely lost: 512 bytes in 1 blocks.
==31468==      possibly lost: 0 bytes in 0 blocks.
==31468==    still reachable: 0 bytes in 0 blocks.
==31468==         suppressed: 0 bytes in 0 blocks.
==31468== Use --leak-check=full to see details of leaked memory.

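As you can see, Valgrind reports that 512 bytes are definitely lost in the program. It also reports an invalid free() at line 12: because p1 and p2 both point to the second block, that block is freed twice (at lines 11 and 12).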
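
The following is a minimal sketch of how Listing 4 might be corrected, assuming the assignment p1=p2 was unintentional. The function name allocate_and_release() is hypothetical.

```c
#include <stdlib.h>

/* Allocate two blocks and release each one exactly once.
 * Returns 0 on success, -1 if an allocation fails. */
int allocate_and_release(void)
{
    char *p1 = malloc(512);
    char *p2 = malloc(512);

    if (p1 == NULL || p2 == NULL) {
        free(p1);            /* free(NULL) is a no-op */
        free(p2);
        return -1;
    }

    /* ... use p1 and p2 here; never overwrite a pointer while it is
     * the only reference to its block ... */

    free(p1);
    p1 = NULL;               /* guards against an accidental second free */
    free(p2);
    p2 = NULL;
    return 0;
}
```

Because each block keeps its own pointer until it is freed, Valgrind should report neither lost bytes nor an invalid free for this variant.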

Illegal write/read

This situation occurs when a program tries to write to or read from a memory address that does not belong to the program. On some systems, the program terminates abnormally with a segmentation fault when this type of error happens. The following example shows a common bug: writing to and reading from an array element beyond the array boundary.

Listing 6. Illegal write/read
      1 int main() {
      2         int i, *iw, *ir;
      3 
      4         iw = (int *)malloc(10*sizeof(int));
      5         ir = (int *)malloc(11*sizeof(int));
      6 
      7 
      8         for (i=0; i<11; i++)
      9                 iw[i] = i;
     10 
     11         for (i=0; i<11; i++)
     12                 ir[i] = iw[i];
     13 
     14         free(iw);
     15         free(ir);
     16 }

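You can see from the program that the write to iw[10] (line 9) and the read of iw[10] (line 12) are not legal because iw has only ten elements, zero through nine. The access to ir[10] is valid because ir has 11 elements. Note that, for the purposes of this example, int iw[10] and iw = (int *)malloc(10*sizeof(int)) are similar: both give iw room for ten integers, although the first declares an array and the second allocates the storage from the heap.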

When you run Valgrind on this program, it returns the following message:

Listing 7. Message from Valgrind
# gcc -g -o test3 test3.c
# valgrind ./test3
.
.
==31522== Invalid write of size 4
==31522==    at 0x100004C0: main (test3.c:9)
==31522==  Address 0x11899050 is 0 bytes after a block of size 40 alloc'd
==31522==    at 0xFFB9964: malloc (vg_replace_malloc.c:130)
==31522==    by 0x10000474: main (test3.c:4)
==31522== 
==31522== Invalid read of size 4
==31522==    at 0x1000050C: main (test3.c:12)
==31522==  Address 0x11899050 is 0 bytes after a block of size 40 alloc'd
==31522==    at 0xFFB9964: malloc (vg_replace_malloc.c:130)
==31522==    by 0x10000474: main (test3.c:4)
==31522== 
==31522== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 7 from 1)
==31522== malloc/free: in use at exit: 0 bytes in 0 blocks.
==31522== malloc/free: 2 allocs, 2 frees, 84 bytes allocated.
==31522== For counts of detected errors, rerun with: -v
==31522== No malloc'd blocks -- no leaks are possible.

Valgrind found one illegal write of four bytes at line 9 and one illegal read of four bytes at line 12 of test3.c.
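
The following is a minimal sketch of one way to keep the loops in Listing 6 inside the allocated blocks: derive both the allocation sizes and the loop bounds from a single constant. N_ELEMS and fill_and_copy() are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define N_ELEMS 10   /* one constant drives both the sizes and the loops */

/* Fill one block, copy it to the other, and return the sum of the
 * copied values, or -1 if an allocation fails. */
int fill_and_copy(void)
{
    int *iw = malloc(N_ELEMS * sizeof *iw);
    int *ir = malloc(N_ELEMS * sizeof *ir);
    int sum = 0;
    size_t i;

    if (iw == NULL || ir == NULL) {
        free(iw);
        free(ir);
        return -1;
    }

    for (i = 0; i < N_ELEMS; i++)      /* indexes 0..9 only */
        iw[i] = (int)i;

    for (i = 0; i < N_ELEMS; i++) {
        ir[i] = iw[i];
        sum += ir[i];
    }

    free(iw);
    free(ir);
    return sum;
}
```

With the bound tied to the allocation size, the off-by-one access to element 10 cannot occur.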

Valgrind can also help to identify other memory misuses, such as:

  • Writing to or reading from a memory area that has already been freed
  • Mismatched use of malloc/new and free/delete in a C++ environment

The memcheck and addrcheck tools work well. In addition, Helgrind, a data race detector, and other plugins are enabled on POWER7 and RHEL AS 5 and SLES 10.
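
Accessing memory after it has been freed is often easier to avoid when the pointer is cleared at the moment its block is released. The following is a minimal sketch of that defensive pattern; release_buffer() and use_then_release() are hypothetical helpers, not part of Valgrind or the C library.

```c
#include <stdlib.h>
#include <string.h>

/* Free *pp and reset it to NULL so that a stale pointer cannot be
 * silently dereferenced or freed a second time. */
void release_buffer(char **pp)
{
    if (pp != NULL) {
        free(*pp);   /* free(NULL) is defined to do nothing */
        *pp = NULL;
    }
}

/* Returns 1 if the buffer was used and released cleanly, 0 otherwise. */
int use_then_release(void)
{
    char *buf = malloc(16);

    if (buf == NULL)
        return 0;
    strcpy(buf, "hello");      /* valid: the block is still allocated */
    release_buffer(&buf);
    /* Writing through the old address at this point would be the
     * use-after-free that Valgrind reports; buf is NULL instead. */
    release_buffer(&buf);      /* harmless: nothing is left to free */
    return buf == NULL;
}
```

This pattern does not replace a tool such as Valgrind, because other copies of the pointer can still be stale, but it removes the most direct route to the bug.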

In addition to Valgrind, there are various other memory debugging tools available, for example, Memwatch and Electric Fence.


Tools and techniques for debugging other program problems

In addition to memory bugs, developers often encounter situations where a program compiles successfully but produces a core dump or a segmentation fault when it runs. Sometimes, too, a program completes but its output does not match what is expected. In both cases, it is likely that somewhere in the code a condition that you believe to be true is actually false. The debuggers described in the following sections help you find the cause of these conditions.

GNU project debugger

GDB, the GNU project debugger, allows you to see what is going on "inside" another program while it executes or what another program was doing at the moment it crashed.

GDB lets you do four main things to help you catch bugs in a program:

  • Start your program, specifying variables or conditions that might affect its behavior
  • Stop your program at a specified place or on a specified condition
  • Examine what has happened when your program stops
  • Change variables or conditions in your program in the middle of execution, so you can experiment with correcting the effects of one bug and go on to learn about another

The program being debugged can be written in C, C++, Pascal, Objective-C, and many other languages. The binary filename of GDB is gdb.

There are many commands in gdb. Use the help command to list the commands and to see how each one is used. The following table lists the most commonly used GDB commands.

Table 1. Most commonly used commands for gdb
Command | Description | Example
help | List classes of commands | help - list classes of commands; help breakpoints - list commands belonging to the breakpoints class; help break - describe the break command
run | Start the debugged program |
kill | Kill execution of the program being debugged | Typically used when execution has already passed the code you want to debug. Issue kill, reset the breakpoints, and run the program again to start over.
cont | Continue execution of the debugged application after a breakpoint, exception, or step |
info break | Display the current breakpoints or watchpoints |
break | Set a breakpoint at the specified line or function | break 93 if i==8 - stop program execution at line 93 when the variable i is equal to 8
step | Step the program until it reaches a different source line. You can use s to abbreviate the step command. |
next | Like the step command, except that it does not "step into" subroutines |
print | Print the value of a variable or an expression | print pointer - print the content of the variable pointer; print *pointer - print the content of the data structure that pointer points to
delete | Delete some breakpoints or auto-display expressions | delete 1 - delete breakpoint number 1. The breakpoints can be displayed by info break.
watch | Set a watchpoint for an expression. A watchpoint stops execution of your program whenever the value of the expression changes. |
where | Print a backtrace of all stack frames | where - with no arguments, dumps the stack of the current thread. To dump the stacks of other threads, combine it with thread apply (see below), for example thread apply all where.
attach | Start viewing an already running process | attach <process_id> - attach to the process with the given process_id. The process_id can be found with the ps command.
info thread | Show currently running threads |
thread apply threadno command | Run a gdb command on a thread | thread apply 3 where - run the where command on thread 3
thread threadno | Select a thread to be the current thread |

If a program crashes and generates a core file, you can view the core file to determine the state the process was in when it terminated. Start gdb with the following command:

# gdb programname corefilename

To debug with a core file, you need the program executable and source files, as well as the core file. You can also name the core file explicitly with the -c option:

# gdb -c core programname

gdb then shows which line of code caused the program to dump core.

Core dumps are disabled by default on both Novell’s SUSE LINUX Enterprise Server 9 (SLES 9) and Red Hat® Enterprise Linux Advanced Server (RHEL AS 4). To enable core dumps, issue ulimit -c unlimited on the command line as root.
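
As an alternative to the shell setting, a program can raise its own core file limit at startup with the POSIX getrlimit() and setrlimit() calls. The following is a minimal sketch, assuming the hard limit already allows core files; enable_core_dumps() is a hypothetical helper name, and this approach is not part of the original article.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit for this
 * process. Returns 0 on success, -1 on failure (errno is set). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;   /* the soft limit may rise to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit is zero, this call succeeds but core dumps stay disabled, because an unprivileged process cannot raise its hard limit.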

The example in Listing 8 illustrates how to use gdb to locate a bug in a program. Listing 8 is a piece of C++ code that contains bugs.

Listing 8. Example code causing segmentation fault and unexpected result
1 #include <iostream>
2 
3 template <class T>
4 class NumBox {
5 public:
6         NumBox (const T &input_num, NumBox<T> *input_next) : Num(input_num),Next(input_next) {
7                 std::cout << "Number Box " <<"\""  << Num  <<"\""  <<" created" <<std::endl;
8         }
9         ~NumBox () {
10                 std::cout << "Number Box " <<"\""  << GetValue() <<"\""  <<" deleted" << std::endl;
11                 Next = 0;
12         }
13 
14         NumBox<T>*GetNext() const { return Next; }
15         void SetNext (NumBox<T> *input_next) { Next = input_next; };
16         const T& GetValue () const { return Num; }
17         void GetValue (const T &input_num) { Num = input_num; }
18 
19 private:
20         NumBox ();
21         T Num;
22         NumBox<T> *Next;
23 };
24 
25 template <class T>
26 class NumChain {
27 
28 public:
29         NumChain () : pointer(0) {};
30         ~NumChain () { delete_boxes (); };
31 
32         int add (const T &new_item) {
33                 return ((pointer = new NumBox<T>(new_item, pointer)) != 0) ? 0 : -1;
34         }
35 
36         int RemoveBox (const T &item_to_remove) {
37                 NumBox<T> *current = pointer;
38                 NumBox<T> *temp = 0;
39 
40                 while (current != 0) {
41                         if (current->GetValue() == item_to_remove) {
42                                 if (temp == 0) {
43                                         // current is pointing to the first number box of the list
44                                         if (current->GetNext() == 0) {
45                                                 // Only one number box in the list
46                                                 pointer = 0;
47                                                 delete current;
48                                                 current = 0;
49                                         } else {
50                                                 delete current;
51                                                 current = 0;
52                                         }
53                                         return 0;
54                                 } else {
55                                         temp->SetNext (current->GetNext());
56                                         delete temp;
57                                         temp = 0;
58                                         return 0;
59                                 }
60                         }
61                         current = 0;
62                         temp = current;
63                         current = current->GetNext();
64                 }
65 
66                 return -1;
67         }
68 
69 
70 private:
71         void delete_boxes (void) {
72                 NumBox<T> *current = pointer;
73                 while (current != 0) {
74                         NumBox<T> *temp = current;
75                         delete current;
76                         current = temp->GetNext();
77                 }
78         }
79 
80         NumBox<T> *pointer;
81 };
82 
83 
84 int main (int argc, char **argv) {
85         int i;
86         NumChain<int> *list = new NumChain<int> ();
87 
88         for (i=0;i<10;i++)
89                 list->add (i);
90 
91         std::cout << "list created" << std::endl;
92 
93         for (i=9; i>=0; i--)
94                 list ->RemoveBox(i);
95 
96         delete list;
97 }

This C++ program in Listing 8 tries to build ten linked number boxes like this:

Figure 1. A list contains ten linked boxes with numbers
Image of ten numbered (0-9) boxes that are linked together

The program then tries to remove them from the list one by one.

Compile and run the program, as shown below:

Listing 9. Compile and run program
# g++ -g -o gdbtest1 gdbtest1.cpp
# ./gdbtest1
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created
Number Box "9" deleted
Segmentation fault

As you can see, this program results in a segmentation fault. Invoke gdb to take a look at the problem, as shown below:

Listing 10. Invoke gdb
# gdb ./gdbtest1
GNU gdb 6.2.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you 
are welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for 
details.
This GDB was configured as "ppc-suse-linux"...Using host libthread_db 
library "/lib/tls/libthread_db.so.1".

(gdb)

You know that the segmentation fault occurs after the number box "9" is deleted. Issue the run and where commands to locate exactly where in the program the segmentation fault occurs.

Listing 11. Issue the run and where commands
(gdb) run
Starting program: /root/test/gdbtest1 
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created
Number Box "9" deleted

Program received signal SIGSEGV, Segmentation fault.
0x10000f74 in NumBox<int>::GetNext (this=0x0) at gdbtest1.cpp:14
14              NumBox<T>*GetNext() const { return Next; }
(gdb) where
#0  0x10000f74 in NumBox<int>::GetNext (this=0x0) at gdbtest1.cpp:14
#1  0x10000d10 in NumChain<int>::RemoveBox (this=0x10012008, item_to_remove=@0xffffe200) at gdbtest1.cpp:63
#2  0x10000978 in main (argc=1, argv=0xffffe554) at gdbtest1.cpp:94
(gdb)

The trace shows that the program receives a segmentation fault at line 14, in NumBox<int>::GetNext (this=0x0). The this pointer is 0x0, so GetNext() was called through a null pointer, which is not a valid address for a number box. The trace also shows that GetNext() was called from line 63. Look at what happens near line 63 in gdbtest1.cpp:

Listing 12. gdbtest1.cpp
     54                       } else {
     55                               temp->SetNext (current->GetNext());
     56                               delete temp;
     57                               temp = 0;
     58                               return 0;
     59                       }
     60               }
     61               current = 0;
     62               temp = current;
     63               current = current->GetNext();
     64       }
     65 
     66       return -1;

Line 61, current = 0, sets current to a null pointer, which line 63 then dereferences; this is the root cause of the segmentation fault. Comment out line 61, save the file as gdbtest2.cpp, compile it, and then run it again.

Listing 13. Run program again (gdbtest2.cpp)
# g++ -g -o gdbtest2 gdbtest2.cpp
# ./gdbtest2
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created
Number Box "9" deleted
Number Box "0" deleted

The program now finishes without segmentation faults. However, the result is not what you would expect: the program deletes Number Box "0" after it deletes Number Box "9," instead of deleting Number Box "8," as you would expect. Use gdb to take a look again.

Listing 14. Look again with gdb
# gdb ./gdbtest2
GNU gdb 6.2.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you 
are welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for 
details.
This GDB was configured as "ppc-suse-linux"...Using host libthread_db 
library "/lib/tls/libthread_db.so.1".

(gdb) break 94 if i==8
Breakpoint 1 at 0x10000968: file gdbtest2.cpp, line 94.
(gdb) run
Starting program: /root/test/gdbtest2 
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created
Number Box "9" deleted

Breakpoint 1, main (argc=1, argv=0xffffe554) at gdbtest2.cpp:94
94                      list ->RemoveBox(i);

You want to find out why the program deletes Number Box 0 instead of Number Box 8, so you should stop the program where you think the program will delete Number Box 8. Set the breakpoint: break 94 if i==8, to stop at line 94 when i is equal to 8. Then step into the RemoveBox() function.

Listing 15. Step into the RemoveBox() function
(gdb) s
38                      NumBox<T> *temp = 0;  
(gdb) s
40                      while (current != 0) {
(gdb) print pointer
$1 = (NumBox<int> *) 0x100120a8
(gdb) print *pointer
$2 = {Num = 0, Next = 0x0}
(gdb)

The pointer already appears to point to Number Box "0," so the bug is likely in the code where the program deletes Number Box "9." To restart the program in gdb, use kill, delete the old breakpoint, add a new breakpoint that triggers when i is equal to 9, and run the program again.

Listing 16. Restart the program in gdb
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) info break
Num Type           Disp Enb Address    What
1   breakpoint     keep y   0x10000968 in main at gdbtest2.cpp:94
        stop only if i == 8
        breakpoint already hit 1 time
(gdb) delete 1
(gdb) break 94 if i==9
Breakpoint 2 at 0x10000968: file gdbtest2.cpp, line 94.
(gdb) run
Starting program: /root/test/gdbtest2 
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created

Breakpoint 2, main (argc=1, argv=0xffffe554) at gdbtest2.cpp:94
94                      list ->RemoveBox(i);
(gdb)

When you step through the RemoveBox() function this time, pay extra attention to which number box list->pointer is pointing because the bug is probably where list->pointer starts to point to Number Box "0." Use the display *pointer command, which is the auto-display function, to monitor this.

Listing 17. Monitor using the display *pointer command
Breakpoint 2, main (argc=1, argv=0xffffe554) at gdbtest2.cpp:94
94            list ->RemoveBox(i);
(gdb) s
NumChain<int>::RemoveBox (this=0x10012008, item_to_remove=@0xffffe200) at gdbtest2.cpp:37
37              NumBox<T> *current = pointer;
(gdb) display *pointer
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
38              NumBox<T> *temp = 0;
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
40              while (current != 0) {
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
41                      if (current->GetValue() == item_to_remove) {
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
NumBox<int>::GetValue (this=0x100120a8) at gdbtest2.cpp:16
16              const T& GetValue () const { return Num; }
(gdb) s
NumChain<int>::RemoveBox (this=0x10012008, item_to_remove=@0xffffe200) at gdbtest2.cpp:42
42                              if (temp == 0) {
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
44                                      if (current->GetNext() == 0) {
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
NumBox<int>::GetNext (this=0x100120a8) at gdbtest2.cpp:14
14              NumBox<T>*GetNext() const { return Next; }
(gdb) s
NumChain<int>::RemoveBox (this=0x10012008, item_to_remove=@0xffffe200) at gdbtest2.cpp:50
50                                              delete current;
1: *this->pointer = {Num = 9, Next = 0x10012098}
(gdb) s
~NumBox (this=0x100120a8) at gdbtest2.cpp:10
10              std::cout << "Number Box " <<"\"" << GetValue() <<"\"" <<" deleted" << std::endl;
(gdb) s
NumBox<int>::GetValue (this=0x100120a8) at gdbtest2.cpp:16
16              const T& GetValue () const { return Num; }
(gdb) s
Number Box "9" deleted
~NumBox (this=0x100120a8) at gdbtest2.cpp:11
11              Next = 0;
(gdb) s
NumChain<int>::RemoveBox (this=0x10012008, item_to_remove=@0xffffe200) at gdbtest2.cpp:51
51                                              current = 0;
1: *this->pointer = {Num = 0, Next = 0x0}
(gdb) s
53                                      return 0;
1: *this->pointer = {Num = 0, Next = 0x0}
(gdb) s
0x10000d1c      66              return -1;
1: *this->pointer = {Num = 0, Next = 0x0}

From the above trace, you can see that list->pointer shows {Num = 0, Next = 0x0} right after Number Box "9" is deleted. The pointer was never updated: it still holds the address of the deleted box (0x100120a8), whose freed contents now look like Number Box "0." This is not proper logic because list->pointer should point to Number Box "8" once Number Box "9" is deleted. The statement pointer = pointer->GetNext(); should therefore be added before line 50, as shown below:

Listing 18. Add pointer = pointer->GetNext(); before line 50
     49                     } else {
     50                             pointer = pointer->GetNext();
     51                             delete current;
     52                             current = 0;
     53                      }
     54                      return 0;

Save the new modified program to gdbtest3.cpp, compile it, and run it again.

Listing 19. Run program again (gdbtest3.cpp)
# g++ -g -o gdbtest3 gdbtest3.cpp
# ./gdbtest3
Number Box "0" created
Number Box "1" created
Number Box "2" created
Number Box "3" created
Number Box "4" created
Number Box "5" created
Number Box "6" created
Number Box "7" created
Number Box "8" created
Number Box "9" created
list created
Number Box "9" deleted
Number Box "8" deleted
Number Box "7" deleted
Number Box "6" deleted
Number Box "5" deleted
Number Box "4" deleted
Number Box "3" deleted
Number Box "2" deleted
Number Box "1" deleted
Number Box "0" deleted

This is the result you expect to see.
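
The removal logic that the gdb session arrived at can be summarized in a compact form. The following is a minimal sketch written in C rather than the article's C++, so it is an analogue and not the author's code; struct box and the box_* functions are hypothetical names. It keeps a prev pointer so that removal works both at the head of the list and in the middle. The test run above only ever removes the head box, so the else branch at lines 54 through 59 of Listing 8, which deletes temp rather than current, is never exercised.

```c
#include <assert.h>
#include <stdlib.h>

struct box {
    int num;
    struct box *next;
};

/* Push a new box onto the front of the list. Returns 0 or -1. */
int box_add(struct box **head, int num)
{
    struct box *b;

    assert(head != NULL);
    b = malloc(sizeof *b);
    if (b == NULL)
        return -1;
    b->num = num;
    b->next = *head;
    *head = b;
    return 0;
}

/* Remove the first box holding num. Returns 0 if found, -1 otherwise. */
int box_remove(struct box **head, int num)
{
    struct box *prev = NULL;
    struct box *cur;

    assert(head != NULL);
    for (cur = *head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur->num == num) {
            if (prev == NULL)
                *head = cur->next;       /* advance the head before freeing */
            else
                prev->next = cur->next;  /* unlink from the middle or tail */
            free(cur);
            return 0;
        }
    }
    return -1;
}

/* Free every box; the next pointer is saved before each free. */
void box_free_all(struct box **head)
{
    struct box *cur;

    assert(head != NULL);
    cur = *head;
    while (cur != NULL) {
        struct box *next = cur->next;
        free(cur);
        cur = next;
    }
    *head = NULL;
}
```

The key point in each function is the same one the debugger exposed: every pointer that is still needed is read or updated before the box it refers to is released.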

Multi-threading environment

There are some special commands in GDB used for multi-threading application debugging. The following example illustrates a deadlock situation and how these commands are used to detect problems in a multi-threading application:

Listing 20. A multi-threading example
#include <stdio.h>
#include <pthread.h>

pthread_mutex_t AccountA_mutex;
pthread_mutex_t AccountB_mutex;


struct BankAccount {
     char account_name[1];
     int balance;
};

struct BankAccount  accountA = {"A", 10000 };
struct BankAccount  accountB = {"B", 20000 };


void * transferAB (void* amount_ptr) {
     int amount = *((int*)amount_ptr);

     pthread_mutex_lock(&AccountA_mutex);
     if (accountA.balance < amount)   {
             printf("There is not enough money in Account A!\n");
             pthread_mutex_unlock(&AccountA_mutex);
             pthread_exit((void *)1);
     }
     accountA.balance -=amount;
     sleep(1);
     pthread_mutex_lock(&AccountB_mutex);
     accountB.balance +=amount;

     pthread_mutex_unlock(&AccountA_mutex); 
     pthread_mutex_unlock(&AccountB_mutex);
}


void * transferBA (void* amount_ptr) {
     int amount = *((int*)amount_ptr);

     pthread_mutex_lock(&AccountB_mutex);
     if (accountB.balance < amount)   {
             printf("There is not enough money in Account B!\n");
             pthread_mutex_unlock(&AccountB_mutex);
             pthread_exit((void *)1);
     }
     accountB.balance -=amount;
     sleep(1);
     pthread_mutex_lock(&AccountA_mutex);
     accountA.balance +=amount;

     pthread_mutex_unlock(&AccountB_mutex);
     pthread_mutex_unlock(&AccountA_mutex);
}



int main(int argc, char* argv[]) {
     int             threadid[4];
     pthread_t       pthread[4];
     int             transfer_amount[4] = {100, 200, 300, 400};
     int             final_balanceA, final_balanceB;


     final_balanceA=accountA.balance-transfer_amount[0]-transfer_amount[1]
                    +transfer_amount[2]+transfer_amount[3];
     final_balanceB=accountB.balance+transfer_amount[0]+transfer_amount[1]
                    -transfer_amount[2]-transfer_amount[3];

     /* pthread_create returns 0 on success and an error number on failure */
     if ((threadid[0] = pthread_create(&pthread[0], NULL, transferAB,
                                       (void*)&transfer_amount[0])) != 0) {
             perror("Thread #0 creation failed.");
             exit (1);
     }

     if ((threadid[1] = pthread_create(&pthread[1], NULL, transferAB,
                                       (void*)&transfer_amount[1])) != 0) {
             perror("Thread #1 creation failed.");
             exit (1);
     }

     if ((threadid[2] = pthread_create(&pthread[2], NULL, transferBA,
                                       (void*)&transfer_amount[2])) != 0) {
             perror("Thread #2 creation failed.");
             exit (1);
     }

     if ((threadid[3] = pthread_create(&pthread[3], NULL, transferBA,
                                       (void*)&transfer_amount[3])) != 0) {
             perror("Thread #3 creation failed.");
             exit (1);
     }

     printf("Transitions are in progress..");

     while ((accountA.balance != final_balanceA) && (accountB.balance != final_balanceB)) {
             printf("..");
     }

     printf("\nAll the money is transferred !!\n");
}

Use gcc to compile the program, as shown below:

# gcc -g -o gdbtest2 gdbtest2.c -L/lib/tls -lpthread

The program gdbtest2 hangs and never comes back with an All the money is transferred !! message.

Attach gdb to the running process to take a look at what is happening inside it.

Listing 21. Attach gdb to the running process
# ps -ef |grep gdbtest2
root      9510  8065  1 06:30 pts/1    00:00:00 ./gdbtest2
root      9516  9400  0 06:30 pts/4    00:00:00 grep gdbtest2
# gdb -pid 9510
GNU gdb 6.2.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you 
are welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for 
details.
This GDB was configured as "ppc-suse-linux".
Attaching to process 9510
Reading symbols from /root/test/gdbtest2...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 1073991712 (LWP 9510)]
[New Thread 1090771744 (LWP 9514)]
[New Thread 1086577440 (LWP 9513)]
[New Thread 1082383136 (LWP 9512)]
[New Thread 1078188832 (LWP 9511)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb) info thread
  5 Thread 1078188832 (LWP 9511)  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
  4 Thread 1082383136 (LWP 9512)  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
  3 Thread 1086577440 (LWP 9513)  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
  2 Thread 1090771744 (LWP 9514)  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
  1 Thread 1073991712 (LWP 9510)  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb)

The output of the info thread command shows that every thread except the main thread (thread #1) is blocked inside the function __lll_lock_wait (), that is, each one is waiting to acquire a lock.

Use the thread apply threadno where command to see exactly where each thread is located:

Listing 22. See where each thread is located
(gdb) thread apply 1 where

Thread 1 (Thread 1073991712 (LWP 9510)):
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
#1  0x0ff4ac28 in __write_nocancel () from /lib/tls/libc.so.6
Previous frame identical to this frame (corrupt stack?)
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb) thread apply 2 where

Thread 2 (Thread 1090771744 (LWP 9514)):
#0  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
#1  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#2  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#3  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#4  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb) thread apply 3 where

Thread 3 (Thread 1086577440 (LWP 9513)):
#0  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
#1  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#2  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#3  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#4  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb) thread apply 4 where

Thread 4 (Thread 1082383136 (LWP 9512)):
#0  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
#1  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#2  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#3  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#4  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6
(gdb) thread apply 5 where

Thread 5 (Thread 1078188832 (LWP 9511)):
#0  0x0ffe94ec in __lll_lock_wait () from /lib/tls/libpthread.so.0
#1  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#2  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#3  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#4  0x0ffe466c in pthread_mutex_lock () from /lib/tls/libpthread.so.0
Previous frame inner to this frame (corrupt stack?)
#0  0x0ff4ac40 in __write_nocancel () from /lib/tls/libc.so.6

Every thread tries to lock a mutex, which is not available, probably because it is locked (or owned) by another thread already. From the evidence above, you know there is a deadlock in the program. You also can see which thread owns a mutex.

Listing 23. See which thread owns a mutex
(gdb) print AccountA_mutex
$1 = {__m_reserved = 2, __m_count = 0, __m_owner = 0x2527,
__m_kind = 0, __m_lock = {__status = 1, __spinlock = 0}}
(gdb) print 0x2527
$2 = 9511
(gdb) print AccountB_mutex
$3 = {__m_reserved = 2, __m_count = 0, __m_owner = 0x2529,
__m_kind = 0, __m_lock = {__status = 1, __spinlock = 0}}
(gdb) print 0x2529
$4 = 9513
(gdb)

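From the output above, you can tell that AccountA_mutex is locked (or owned) by thread 5 (LWP 9511) and AccountB_mutex is locked (or owned) by thread 3 (LWP 9513). Because both of these threads are themselves blocked in pthread_mutex_lock (), each is waiting for the mutex that the other one holds, and neither can make progress.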
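
When a debugger is not at hand, one way to make this kind of hang visible is to replace a blocking lock with a timed one, so that a stuck thread reports a timeout instead of waiting forever. The following is a minimal sketch under stated assumptions, not part of gdbtest2.c: the helper name lock_with_timeout and its seconds parameter are hypothetical, and it relies only on the POSIX calls clock_gettime() and pthread_mutex_timedlock().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/*
 * Hypothetical helper: try to lock m, giving up after the given
 * number of seconds. Returns 0 on success, ETIMEDOUT if the mutex
 * stayed busy for the whole period, or another error number.
 */
int lock_with_timeout(pthread_mutex_t *m, int seconds)
{
    struct timespec deadline;

    assert(m != NULL);

    /* pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME time */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;
    deadline.tv_sec += seconds;

    return pthread_mutex_timedlock(m, &deadline);
}
```

A caller that receives ETIMEDOUT can log which mutex it was waiting for and which one it already holds, which gives much the same information as the gdb session above.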

To resolve the deadlock situation above, lock the mutexes in the same order as shown in the following code:

Listing 24. Lock the mutex in the same order
.
.
void * transferAB (void* amount_ptr) {
        int amount = *((int*)amount_ptr);

        pthread_mutex_lock(&AccountA_mutex);
        pthread_mutex_lock(&AccountB_mutex);
        if (accountA.balance < amount)   {
                printf("There is not enough money in Account A!\n");
                /* both mutexes are held here, so release both before exiting */
                pthread_mutex_unlock(&AccountB_mutex);
                pthread_mutex_unlock(&AccountA_mutex);
                pthread_exit((void *)1);
        }
        accountA.balance -=amount;
        sleep(1);
        accountB.balance +=amount;

        pthread_mutex_unlock(&AccountA_mutex);
        pthread_mutex_unlock(&AccountB_mutex);
        return NULL;
}


void * transferBA (void* amount_ptr) {
        int amount = *((int*)amount_ptr);

        pthread_mutex_lock(&AccountA_mutex);
        pthread_mutex_lock(&AccountB_mutex);
        if (accountB.balance < amount)   {
                printf("There is not enough money in Account B!\n");
                /* both mutexes are held here, so release both before exiting */
                pthread_mutex_unlock(&AccountB_mutex);
                pthread_mutex_unlock(&AccountA_mutex);
                pthread_exit((void *)1);
        }
        accountB.balance -=amount;
        sleep(1);
        accountA.balance +=amount;

        pthread_mutex_unlock(&AccountA_mutex);
        pthread_mutex_unlock(&AccountB_mutex);
        return NULL;
}
.
.

Or, lock each account individually like this:

Listing 25. Lock each account individually
.
.
void * transferAB (void* amount_ptr) {
        int amount = *((int*)amount_ptr);

        pthread_mutex_lock(&AccountA_mutex);
        if (accountA.balance < amount)   {
                printf("There is not enough money in Account A!\n");
                pthread_mutex_unlock(&AccountA_mutex);
                pthread_exit((void *)1);
        }
        accountA.balance -=amount;
        sleep(1);
        pthread_mutex_unlock(&AccountA_mutex);
        pthread_mutex_lock(&AccountB_mutex);
        accountB.balance +=amount;

        pthread_mutex_unlock(&AccountB_mutex);
        return NULL;
}


void * transferBA (void* amount_ptr) {
        int amount = *((int*)amount_ptr);

        pthread_mutex_lock(&AccountB_mutex);
        if (accountB.balance < amount)   {
                printf("There is not enough money in Account B!\n");
                pthread_mutex_unlock(&AccountB_mutex);
                pthread_exit((void *)1);
        }
        accountB.balance -=amount;
        sleep(1);
        pthread_mutex_unlock(&AccountB_mutex);
        pthread_mutex_lock(&AccountA_mutex);
        accountA.balance +=amount;

        pthread_mutex_unlock(&AccountA_mutex);
        return NULL;
}
.
.
.
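
The same-order rule from Listing 24 can be generalized so that a single transfer function works for any pair of accounts. The sketch below is illustrative only: the struct account type and the lock_pair, unlock_pair, and transfer functions are hypothetical names, not the definitions used in gdbtest2.c. It assumes that each account carries its own mutex and always locks the account with the lower address first, so two threads can never hold the two locks in opposite order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical account type: each account carries its own mutex. */
struct account {
    pthread_mutex_t lock;
    int balance;
};

/* Lock two distinct accounts in a fixed global order: lowest address first. */
static void lock_pair(struct account *x, struct account *y)
{
    assert(x != y);   /* locking the same mutex twice would self-deadlock */
    if ((uintptr_t)x > (uintptr_t)y) {
        struct account *tmp = x;
        x = y;
        y = tmp;
    }
    pthread_mutex_lock(&x->lock);
    pthread_mutex_lock(&y->lock);
}

static void unlock_pair(struct account *x, struct account *y)
{
    pthread_mutex_unlock(&x->lock);
    pthread_mutex_unlock(&y->lock);
}

/*
 * Move amount from one account to another.
 * Returns 0 on success, or -1 if the accounts are the same
 * or the source account does not hold enough money.
 */
int transfer(struct account *from, struct account *to, int amount)
{
    int rc = -1;

    if (from == to)
        return -1;

    lock_pair(from, to);
    if (from->balance >= amount) {
        from->balance -= amount;
        to->balance += amount;
        rc = 0;
    }
    unlock_pair(from, to);   /* one exit path, so both locks are always released */
    return rc;
}
```

Because the check and both balance updates happen while both locks are held, other threads never observe a half-completed transfer, which is not the case for the approach in Listing 25.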

To debug 64-bit applications, which are compiled using -m64 with GCC or -q64 with the IBM XL C/C++ compiler, use gdb64, a special version of gdb.

Java debugger

The Java™ debugger, JDB, is a command line debugger for Java classes. It provides inspection and debugging of a local or remote Java Virtual Machine (JVM). Its binary file is jdb.

JDB is packaged with the Java compiler, javac, in the java2 RPM.

There are many ways to start a jdb session. The most common way is to have jdb launch a new Java Virtual Machine with the main class of the application to be debugged. This is done by substituting the command, jdb, for java in the command line. For example, if your application's main class is appClass, then use the following command to debug it under JDB:

# jdb appClass

Another way to use jdb is by attaching it to a JVM that is already running. A VM that is to be debugged with jdb must be started with the following options:

# java -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888 -Djava.compiler=NONE appClass

You can then attach jdb to the VM with the following command:

# jdb -attach 8888

The most commonly used commands for jdb are similar to those for gdb. Refer to Table 1 for details.

Graphical debuggers

One advantage of using the graphical mode of a debugger over its command line counterpart is that you can see each line of source code as it executes while you step through the code in the debugger.

GNU DDD (Data Display Debugger) is a graphical front-end for debuggers, such as GDB and JDB. In addition to the usual front-end features, such as viewing source texts, DDD has become famous through its interactive graphical data display in which data structures are displayed as graphs.

For SLES 9, the DDD binary for PowerPC is shipped separately from the SUSE Linux SDK CDs, or it can be downloaded from Novell. (See Resources.) Red Hat ships the DDD RPM with the RHEL AS 4 CDs.

Figure 2 is a screen capture of DDD when used to debug the example code shown in Listing 19 above (gdbtest3.cpp).

Figure 2. DDD screen shot
Screen capture of DDD as it debugs the example code

By default, DDD uses gdb as the back-end debugger; to switch to jdb, use ddd -jdb to launch DDD.

For more information about DDD, refer to the DDD section of the GNU project Web site. (See Resources.)

strace

The strace command is a powerful debugging tool available for the Linux on POWER architecture. It shows all of the system calls issued by a user-space program, and displays the arguments to the calls and their return values in symbolic form. strace receives its information from the kernel and does not require the kernel to be built in any special way. The applications to be traced do not need to be recompiled for strace either, which is especially convenient when the source of the application is not available.

The following example uses strace to trace cat /etc/shadow when it is run by a normal (non-root) user, and writes the resulting trace to strace.cat.out. The program behaves as usual, although it runs a little slower under strace, and at the end you have a trace file.

$ strace -o strace.cat.out cat /etc/shadow
cat: /etc/shadow: Permission denied
$

Trace files tend to be fairly large. Even for this simple example, strace.cat.out was 111 lines long. The last part of the file contains lines such as:

Listing 26. Lines from strace.cat.out
     88 open("/usr/lib/locale/en_US.UTF-8/LC_NUMERIC", O_RDONLY) = -1 ENOENT (No such file or directory)
     89 open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY) = 3
     90 fstat64(3, {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
     91 mmap(NULL, 54, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4010f000
     92 close(3) = 0
     93 open("/usr/lib/locale/en_US.UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
     94 open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY) = 3
     95 fstat64(3, {st_mode=S_IFREG|0644, st_size=208464, ...}) = 0
     96 mmap(NULL, 208464, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40110000
     97 close(3) = 0
     98 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
     99 open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
    100 write(2, "cat: ", 5) = 5
    101 write(2, "/etc/shadow", 11) = 11
    102 open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    103 open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    104 open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    105 open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    106 open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    107 open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    108 write(2, ": Permission denied", 19) = 19
    109 write(2, "\n", 1) = 1
    110 close(1) = 0
    111 exit_group(1)

Notice that on line 99, the command failed because the system call open("/etc/shadow", O_RDONLY|O_LARGEFILE) returned -1 with the error code EACCES, which indicates that permission was denied.
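
The error code that strace prints is the same value the program sees in errno after the failed call. As a rough illustration of how an application can report it, here is a small sketch. The helper name try_open is hypothetical, and the message format only imitates the strace output; it is not what cat does internally.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open path read-only.
 * Returns 0 on success, or the errno value (for example EACCES or
 * ENOENT) on failure, after printing a line that imitates strace.
 */
int try_open(const char *path)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        int err = errno;   /* save errno before another call can change it */
        fprintf(stderr, "open(\"%s\", O_RDONLY) = -1 (%s)\n",
                path, strerror(err));
        return err;
    }
    close(fd);
    return 0;
}
```

Called as a normal user with "/etc/shadow", this helper would be expected to return EACCES, matching line 99 of the trace.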

In some situations, an application appears to hang and doesn't respond to a signal, such as Ctrl+C (SIGINT). This usually means that a system call made by the application is blocked in kernel mode and has not returned to user mode. strace can be very helpful for determining which system call that is and which parameters were passed to it; you can attach strace to the running process with strace -p followed by its process ID. Incorrect parameters are often the root cause of the problem.

Just as gdb64 is used to debug 64-bit applications, strace64 is used to trace the system calls made by 64-bit applications.


Summary

There are many different tools available to help debug programs for Linux on POWER. The tools described in this article can help you solve many coding problems. Tools, such as Valgrind, which shows the location of memory leaks, illegal writes/reads, and the like, can solve memory management problems.

Using gdb and jdb helps solve problems that cause an application to terminate abnormally as well as bugs that lead to unexpected or unwanted results. The DDD utility makes debugging easier by connecting lines of source code with their execution and by visualizing data structures in its data display window. In addition, strace is a powerful tool that can be used to trace all the system calls made by an application. So, the next time you are faced with squashing a bug on Linux, try one of these tools.


Acknowledgments

I would like to acknowledge Linda Kinnunen for her document template and helpful reviews and John Engel and Chakarat Skawratananond for their technical assistance and reviews of this document.


Special notices

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Resources

  • Get products and technologies
    • Valgrind suite of tools for debugging and profiling x86-Linux programs.
    • GDB: The GNU project debugger.
    • DDD: The GNU graphical front-end for command-line debuggers, such as GDB.
    • DDD binary from Novell.
