The Slightly Disgruntled Scientist

...now 7% more viral!


Valgrind and GDB: Tame the Wild C

| Comments

One thing I get asked a lot — almost daily, in fact — is: hey, why are you so amazing


…at identifying bugs related to undefined behaviour in C?

The answer is simple: by using Valgrind and GDB!

This tutorial is all about using Valgrind as part of your development workflow. Valgrind is an amazing tool for debugging, and I’ll start off by showing you what it actually does just as a standalone tool. From there I’ll show you how to use it in a systematic way to find errors via a debugger. Finally, you’ll see how you can actually add it to your code, so that you can catch runtime errors that might otherwise be concealed by the logic of your code.

So if, like me, you spend most of your waking life writing, maintaining and debugging embedded C code, then it’s time for you to crack open a console and put on your learning hat, and discover a few tricks that will make your life a great deal easier.

Oh, and if you want a shorter tutorial, have a look at Valgrind and GDB on the Cliffs of Inanity blog. It has fewer puns, but it's still good.

What are all these tools and concepts?

What is undefined behaviour?

This post assumes a basic level of knowledge about C, the standards that govern it, and the concept of undefined behaviour... but if you're new to these concepts, here's a quick summary and some references.

Unlike other languages languages (for example, Java), C programs are not required to keep runtime information about array bounds or whether memory accesses are valid. Neither are they required to initialise data to default values (except in very specific cases). If a programmer is not diligent about these things, their program can do something that is completely invalid — that is, undefined behaviour.

Undefined behaviour in a running C program means nothing less than: it is no longer possible to reason about your program. It simply isn’t. Cries of but it caaaaan’t be doing that! or shouldn’t x just be the last value? or I didn’t even have monkeys living in the server to begin with! mean nothing in the face of undefined behaviour.

This makes debugging very, very hard.

If you want to know more about undefined behaviour, refer to:

What is Valgrind? What is Memcheck?

Valgrind is not a single tool, but rather a set of tools for checking memory errors, cache usage, heap usage and other runtime behaviours, usually in C programs. This post focuses on Memcheck, a tool for identifying invalid or incorrect use of memory (stack or heap).

I am actually going to use the terms "Valgrind" and "Memcheck" interchangeably, since Memcheck is the default tool Valgrind uses when you run the command valgrind. Just be aware that there are other tools in there too.

The bugs that I’ve found using Valgrind are the worst of the worst, straight out of the C hall of shame. We’re talking about bugs that:

  • only appear on one person’s machine
  • seem to happen randomly, even in the same environment
  • don’t cause crashes, just give you the wrong output
  • crash, but the stack trace looks totally wrong (How did it crash there? I changed code somewhere else entirely!)
  • only occur at certain optimisation levels
  • only occur with newer compiler versions

Valgrind works by running your executable on a synthetic processer, and whichever tool you’ve selected inserts its own instrumentation code as it runs. You don’t need to recompile with Valgrind, or link with special libraries, or even run debugging builds (although it’s almost always the case that you should). Valgrind runs are significantly slower than normal runs though: about 50 times slower.

But since it cuts your debugging time down by a factor of about a thousand, it’s probably worth it.

What is GDB?

GDB is the GNU debugger. It’s a console driven tool that has the usual range of debugging capabilities (breakpoints with or without conditions, single stepping, etc) and quite a few other capabilities. It doesn’t have a GUI though, and if you’re not used to heavy console work, it can be a little hard on the eyes. While there are tools like Nemiver and Eclipse that provide a graphical interface to GDB, they won’t be useful for the techniques I go over here. Don’t use them for now.

What will you need for this tutorial?

I will be demonstrating this work on a GNU/Linux system (in my case, Ubuntu 14.04). It might work on OS X, or Windows under Cygwin, but I haven’t tried that and don’t really want to.

I don’t assume any prior knowledge of GDB or Valgrind, but you should have:

  • installed software:
    • GDB
    • Valgrind
    • Git
    • GCC or Clang (I’ve only tested with GCC, but Clang has a -ggdb option that will emit the appropriate debugging symbols)
  • knowledge of basic debugging functions (breakpoints, stepping, inspection)
  • proficiency using multiple command lines on Linux

My example code is on Github as valgrind-gdb-tute. Get a local copy with:

git clone https://github.com/detly/valgrind-gdb-tute

You now have a copy of my tutorial code in the directory valgrind-gdb-tute. It’ll actually be at the most recent revision where all the bugs are fixed, but I’ve tagged all the revisions that go with each part of the tutorial.

Example 1: Uninitialised variables

We’ll start with a simple example: we’re going to debug a simulator for a famous cruise liner. The program takes one piece of input from the user, a distance to travel, and displays the ship’s status on the console. So let’s start by checking out the code for the first example:

$ git checkout example-1

Building our simulator is easy:

$ make -sB

Running it is also easy:

$ ./titanic 10
  1 miles, still unsinkable!
3905110326 miles, still unsinkable!
3905110327 miles, still unsinkable!
3905110328 miles, still unsinkable!
3905110329 miles, still unsinkable!
3905110330 miles, still unsinkable!
3905110331 miles, still unsinkable!
3905110332 miles, still unsinkable!
3905110333 miles, still unsinkable!
3905110334 miles, still unsinkable!

That’s… not good. We told the simulator to take the ship 10 nautical miles, and instead it launched the thing out of the Solar System. This is definitely a bug.

This is, incidentally, a point where you might discover the joys of undefined behaviour. It is entirely possible that your machine does not demonstrate this bug, or any of the others! Or it does, but they look different. You may get crashes where I get weird output, or vice versa. This is the problem with undefined behaviour: it's not defined. There's no guarantee that you're going to see the same behaviour that I do, just like there's no guarantee you'll see the same behaviour at different optimisation levels or with different compilers.

It's a paradox of writing a tutorial like this. If you don't see the bugs I do, just go along with the examples anyway. The important thing is: as long as Valgrind gives you the same messages, the examples still work.

Incidentally, I have found that compiling with GCC 4.4 reproduces most of what I describe here. You can use make CC=gcc-4.4 -sB wherever I use make -sB below to do this, if you have GCC 4.4 installed (eg. with apt-get install gcc-4.4 on Ubuntu).

Now if you’re experienced with C you’ll probably recognise that this is due to an uninitialised variable somewhere. I realise our code here is simple enough that you could just inspect it and find out where it is. But this is a tutorial, so let’s pretend that it’s a much bigger system, and that we’re desparately looking for a tool that will let us do this in a systematic way.

Let’s run our program under Valgrind and see what it tells us:

$ valgrind ./titanic 10
==19150== Memcheck, a memory error detector
==19150== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==19150== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==19150== Command: ./titanic 10
==19150== 
==19150== Use of uninitialised value of size 8
==19150==    at 0x4E80A4B: _itoa_word (_itoa.c:179)
==19150==    by 0x4E846F6: vfprintf (vfprintf.c:1660)
==19150==    by 0x4EA79DA: vsprintf (iovsprintf.c:42)
==19150==    by 0x4E8B5C6: sprintf (sprintf.c:32)
==19150==    by 0x400636: full_steam_ahead (titanic.c:18)
==19150==    by 0x400693: main (titanic.c:42)
==19150== 
==19150== Conditional jump or move depends on uninitialised value(s)
==19150==    at 0x4E80A55: _itoa_word (_itoa.c:179)
==19150==    by 0x4E846F6: vfprintf (vfprintf.c:1660)
==19150==    by 0x4EA79DA: vsprintf (iovsprintf.c:42)
==19150==    by 0x4E8B5C6: sprintf (sprintf.c:32)
==19150==    by 0x400636: full_steam_ahead (titanic.c:18)
==19150==    by 0x400693: main (titanic.c:42)

[... snip ...]

==19150== HEAP SUMMARY:
==19150==     in use at exit: 300 bytes in 10 blocks
==19150==   total heap usage: 10 allocs, 0 frees, 300 bytes allocated
==19150== 
==19150== LEAK SUMMARY:
==19150==    definitely lost: 300 bytes in 10 blocks
==19150==    indirectly lost: 0 bytes in 0 blocks
==19150==      possibly lost: 0 bytes in 0 blocks
==19150==    still reachable: 0 bytes in 0 blocks
==19150==         suppressed: 0 bytes in 0 blocks
==19150== Rerun with --leak-check=full to see details of leaked memory
==19150== 
==19150== For counts of detected and suppressed errors, rerun with: -v
==19150== Use --track-origins=yes to see where uninitialised values come from
==19150== ERROR SUMMARY: 302 errors from 12 contexts (suppressed: 0 from 0)

Note that all of Valgrind’s output is prefixed with the PID of your process. (The [... snip ...] was added by me, since Valgrind also displays a lot of errors that follow on from the first one.) We’re only interested in two things here. One is the memory leak, which we’ll come back to. The first thing we’ll look at is:

==19150== Use of uninitialised value of size 8
...
==19150==    by 0x400636: full_steam_ahead (titanic.c:18)
==19150==    by 0x400693: main (titanic.c:42)

This tells us that we’re using uninitialised memory somewhere, which is undefined behaviour in C (C90, §3.16). It also tells us we can find the culprit at line 18 of titanic.c, so let’s take a look:

titanic.c
14
15
16
17
18
const unsigned tugboat_distance; /* TODO: who do we ask about this? */
const unsigned total_distance = distance + tugboat_distance;

char *status = malloc(TOTAL_LENGTH * sizeof(char) * 1);
sprintf(status, "%*u %s", DIGITS_LENGTH, total_distance, LOG_TEXT);

Line 18 contains a result that’s calculated from a few other variables. Fortunately Valgrind tells us right there in its output what we can do to narrow it down:

==19150== For counts of detected and suppressed errors, rerun with: -v
==19150== Use --track-origins=yes to see where uninitialised values come from

Okay, let’s do that. Our first error message becomes:

==24107== Use of uninitialised value of size 8
==24107==    at 0x4E80A4B: _itoa_word (_itoa.c:179)
==24107==    by 0x4E846F6: vfprintf (vfprintf.c:1660)
==24107==    by 0x4EA79DA: vsprintf (iovsprintf.c:42)
==24107==    by 0x4E8B5C6: sprintf (sprintf.c:32)
==24107==    by 0x400636: full_steam_ahead (titanic.c:18)
==24107==    by 0x400693: main (titanic.c:42)
==24107==  Uninitialised value was created by a stack allocation
==24107==    at 0x4005E2: full_steam_ahead (titanic.c:10)

Valgrind tells us that the problematic variable was created on the stack, and gives us a line number (line 10). Line 10 is actually where the function scope starts, not where the variable is declared, but that’s plenty to go on.

It’s not hard to see that the problem is the variable total_distance, which is calculated from tugboat_distance, which is declared on the stack but not initialised.

But let’s pretend that we’re desperate programmers looking at much more complex code that we didn’t write. What would we do if it was hard to identify the problem?

This is where GDB comes to the rescue!

It’s GDB time!

There are two ways to use Valgrind with GDB:

  • Using the --db-attach=yes flag. Using this tells Valgrind to break as soon as it encounters an error. It’s simple, but not as powerful as the second method, since you can’t set breakpoints before the problem. Also, this relies on non-root users being able to utilise ptrace functionality, which requires a non-trivial amount of work on recent Linux distributions.

  • Using this --vgdb-error=0 option. This will set a breakpoint right at the start of main() and gives you instructions for debugging the program further via GDB. This is the mode we’ll be using.

In one window, run this:

==2472== Memcheck, a memory error detector
==2472== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==2472== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==2472== Command: ./titanic 10
==2472== 
==2472== (action at startup) vgdb me ... 
==2472== 
==2472== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==2472==   /path/to/gdb ./titanic
==2472== and then give GDB the following command
==2472==   target remote | /usr/lib/valgrind/../../bin/vgdb --pid=2472
==2472== --pid is optional if only one valgrind process is running

Once again Valgrind conveniently gives you instructions here, so let’s follow them. Open up another console and start GDB:

$ gdb ./titanic
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[... gdb startup text ...]
Reading symbols from ./titanic...done.
(gdb) 

Note that you don’t need to give this command any arguments, you’re just telling GDB what program it’s debugging. Now paste the command from Valgrind’s output to connect to it:

(gdb) target remote | /usr/lib/valgrind/../../bin/vgdb --pid=2472
Remote debugging using | /usr/lib/valgrind/../../bin/vgdb --pid=2472
relaying data between gdb and process 2472
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000000040012d0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) 

Now let’s continue until we hit our first error (note: the program is already started, you don’t have to use run to run it). Valgrind will raise a trap signal (SIGTRAP) when it detects a problem. We can then use GDB’s frame command to get to the stack frame we care about: the one for the full_steam_ahead() function. (Ignore the _itoa.c: No such file or directory error, that happens if you don’t have the source code for your C runtime installed.)

(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000004e80a4b in _itoa_word (value=<optimised out>, value@entry=1, buflim=0xffefff897 "", buflim@entry=0xffefff898 "\340\370\377\376\017", base=base@entry=10, 
    upper_case=<optimised out>) at _itoa.c:179
179 _itoa.c: No such file or directory.
(gdb) backtrace
#0  0x0000000004e80a4b in _itoa_word (value=<optimised out>, value@entry=1, buflim=0xffefff897 "", buflim@entry=0xffefff898 "\340\370\377\376\017", base=base@entry=10, upper_case=<optimised out>) at _itoa.c:179
#1  0x0000000004e846f7 in _IO_vfprintf_internal (s=s@entry=0xffefff8e0, format=<optimised out>, format@entry=0x4007b0 "%*u %s", ap=ap@entry=0xffefffa08) at vfprintf.c:1660
#2  0x0000000004ea79db in __IO_vsprintf (string=0x51fc040 "", format=0x4007b0 "%*u %s", args=args@entry=0xffefffa08) at iovsprintf.c:42
#3  0x0000000004e8b5c7 in __sprintf (s=<optimised out>, format=<optimised out>) at sprintf.c:32
#4  0x0000000000400637 in full_steam_ahead (distance=1) at titanic.c:18
#5  0x0000000000400694 in main (argc=2, argv=0xffefffc28) at titanic.c:42
(gdb) 

Note that the normal program output still gets printed to the other console where you were running Valgrind.

Lastly, let’s move to the stack frame that we’re interested in:

(gdb) frame 4
#4  0x0000000000400637 in full_steam_ahead (distance=1) at titanic.c:18
18      sprintf(status, "%*u %s", DIGITS_LENGTH, total_distance, LOG_TEXT);

It’s Valgrind time!

Now we’re ready to debug in GDB with Valgrind! Any command you want to send to Valgrind must be wrapped up in GDB’s monitor command, so the syntax looks like:

monitor valgrind_command valgrind_command_arg0 etc

One of the most useful commands in Valgrind is get_vbits. This gives you information on a part of your program’s memory (stack or heap), such as whether it can be legitimately accessed and whether it’s been initialised. This monitor command takes an address and a length, so you might naïvely try to do this:

(gdb) monitor get_vbits &tugboat_distance sizeof(tugboat_distance)
missing or malformed address

…but that won’t work. The namespace you have access to in GDB itself is not available to Valgrind, which is operating “on the other side” of the monitor command. So what you have to do is:

(gdb) print &total_distance
$3 = (const unsigned int *) 0xffefffaf8
(gdb) print sizeof(total_distance)
$4 = 4
(gdb) monitor get_vbits 0xffefffaf8 4
ffffffff
(gdb) 

This is the prized information we are after! The get_vbits command tells you whether the bits of the data you’re asking about are:

  • initialised: a 0 is set for that bit (so if a whole byte is initialised, you’ll see 00)
  • uninitialised: a 1 is set for that bit (so if a whole byte is uninitialised, you’ll see ff)
  • unaddressable: two underscores __ are displayed for that whole byte

To make it really clear: Valgrind gives you the per-bit information, but condenses it into half-byte characters. If only half the bits of a char is set, it will show you 0f. If, somehow, every second bit is set, you’ll see aa. (It’s very unlikely that you’ll ever see anything other than 0 or f though.)

Personally, I find the clumsiness of the monitor get_vbits command annoying. So I created the following entry in my ~/.gdbinit file (the file GDB uses for configuration):

~/.gdbinit
0
1
2
define wat
  eval "monitor get_vbits %p %d", &$arg0, sizeof($arg0)
end

I chose wat because I’d remember it. You don’t have to be as sarcastic.

This means that instead of running three GDB commands to get one piece of information, I can now do:

(gdb) wat total_distance
ffffffff

(I got this from a kind user on Stack Overflow. The GDB aliases I use are available in the git repo we’re using, alongside the code, in a file named dot_gdbinit.)

We can use this to distinguish between the variables that go into the calculation for total_distance (distance and tugboat_distance):

(gdb) wat total_distance 
ffffffff
(gdb) wat distance
00000000
(gdb) wat tugboat_distance
ffffffff

Gotcha! This confirms what we found above: tugboat_distance is uninitialised, and it’s “contaminating” total_distance with its invalidity!

To stop debugging, you need to kill your program from within GDB (I forget this every time, of course).

(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) quit
$

So what have we achieved so far? Well, we know that the bug we’re seeing is caused by an uninitialised stack variable, tugboat_distance. Let’s initialise it to 0 until we can confirm that requirement, recompile, and see what happens

$ git checkout example-1-fixed
$ make -sB
$ valgrind ./titanic 10
==25444== Memcheck, a memory error detector
==25444== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==25444== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==25444== Command: ./titanic 10
==25444== 
  1 miles, still unsinkable!
  2 miles, still unsinkable!
  3 miles, still unsinkable!
  4 miles, still unsinkable!
  5 miles, still unsinkable!
  6 miles, still unsinkable!
  7 miles, still unsinkable!
  8 miles, still unsinkable!
  9 miles, still unsinkable!
 10 miles, still unsinkable!
==25444== 
==25444== HEAP SUMMARY:
==25444==     in use at exit: 300 bytes in 10 blocks
==25444==   total heap usage: 10 allocs, 0 frees, 300 bytes allocated
==25444== 
==25444== LEAK SUMMARY:
==25444==    definitely lost: 300 bytes in 10 blocks
==25444==    indirectly lost: 0 bytes in 0 blocks
==25444==      possibly lost: 0 bytes in 0 blocks
==25444==    still reachable: 0 bytes in 0 blocks
==25444==         suppressed: 0 bytes in 0 blocks
==25444== Rerun with --leak-check=full to see details of leaked memory
==25444== 
==25444== For counts of detected and suppressed errors, rerun with: -v
==25444== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

That 0 errors line looks pretty good! The bug is fixed! And what do we do when our code works?

We…

…ship it!

Example 2: memory leaks

In the previous example I mentioned that there were two problems that Valgrind told us about. The second thing worth worrying about is this:

==19150== HEAP SUMMARY:
==19150==     in use at exit: 300 bytes in 10 blocks
==19150==   total heap usage: 10 allocs, 0 frees, 300 bytes allocated
==19150== 
==19150== LEAK SUMMARY:
==19150==    definitely lost: 300 bytes in 10 blocks
==19150==    indirectly lost: 0 bytes in 0 blocks
==19150==      possibly lost: 0 bytes in 0 blocks
==19150==    still reachable: 0 bytes in 0 blocks
==19150==         suppressed: 0 bytes in 0 blocks
==19150== Rerun with --leak-check=full to see details of leaked memory

The text under LEAK SUMMARY means that we have a memory leak! This is not undefined behaviour in C, but it could be a problem, especially if the amount of memory leaked increases with some parameter of our program (which it does here). If we have a shred of honour about us, we should definitely fix it.

Once again Valgrind is nice enough to give us a suggestion for how to proceed (this is a wonderful trait of Valgrind’s, by the way):

$ git checkout example-2
$ make -sB
$ valgrind --leak-check=full ./titanic 1
==27879== Memcheck, a memory error detector
==27879== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==27879== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==27879== Command: ./titanic 1
==27879== 
  1 miles, still unsinkable!
==27879== 
==27879== HEAP SUMMARY:
==27879==     in use at exit: 30 bytes in 1 blocks
==27879==   total heap usage: 1 allocs, 0 frees, 30 bytes allocated
==27879== 
==27879== 30 bytes in 1 blocks are definitely lost in loss record 1 of 1
==27879==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27879==    by 0x400638: full_steam_ahead (titanic.c:17)
==27879==    by 0x4006BA: main (titanic.c:42)
==27879== 
==27879== LEAK SUMMARY:
==27879==    definitely lost: 30 bytes in 1 blocks
==27879==    indirectly lost: 0 bytes in 0 blocks
==27879==      possibly lost: 0 bytes in 0 blocks
==27879==    still reachable: 0 bytes in 0 blocks
==27879==         suppressed: 0 bytes in 0 blocks
==27879== 
==27879== For counts of detected and suppressed errors, rerun with: -v
==27879== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

The section under HEAP SUMMARY has the details we need: there’s a malloc() call at titanic.c:17, but we never free() that allocated memory. It is possible to do more debugging from within GDB using monitor leak_check or monitor block_list, but I won’t go into that here.

Let’s add the necessary free() call and try it out:

$ git checkout example-2-fixed
$ make -sB
valgrind ./titanic 10
[... snip ...]
==31416== HEAP SUMMARY:
==31416==     in use at exit: 0 bytes in 0 blocks
==31416==   total heap usage: 10 allocs, 10 frees, 300 bytes allocated
==31416== 
==31416== All heap blocks were freed -- no leaks are possible
==31416== 
==31416== For counts of detected and suppressed errors, rerun with: -v
==31416== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Another bug fixed!

Example 3: buffer overflows

Let’s say we release our titanic simulator to a customer. A few days later, we get a bug report back:

Hi,

The simulator crashes when we simulate travel distances more than about 1000 nautical miles.

This sounds like a bug, but how come we haven’t seen it ourselves? After all, we accidentally simulated billions of miles back in example 1, and it never actually crashed. Let’s try it out though:

$ ./titanic 1000
  1 miles, still unsinkable!
  2 miles, still unsinkable!
  3 miles, still unsinkable!
[... snip ...]
998 miles, still unsinkable!
999 miles, still unsinkable!
1000 miles, still unsinkable!
$

Well, gee, it works for me. But after some back-and-forth with the customer, the support department discover: they’ve got optimisations enabled. They’ve decided to pull out all the stops and compile with -O2! So look for a line in the makefile that says CFLAGS=-O0 -g and change the -O0 to -O2.

This change is tagged as example-3:

$ git checkout example-3

When I run the newly optimised binary, I get:

$ make -sB
$ ./titanic 1000
  1 miles, still unsinkable!
  2 miles, still unsinkable!
  3 miles, still unsinkable!
[... snip ...]
998 miles, still unsinkable!
999 miles, still unsinkable!
*** buffer overflow detected ***: ./titanic terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7f62a16ad38f]
[... snip ...]
Aborted (core dumped)

Oh dear. Buffer overflow indeed.

Here is where we will see one of the most wonderful features of Valgrind, the thing that makes it worth all the pain of using GDB…

Even though we only get the crash with -O2, Valgrind will still identify the error at -O0! The tag example-3-debug has the makefile changed back to use -O0, so check that out and see what Valgrind has to say:

$ git checkout example-3-debug
$ make -sB
$ valgrind ./titanic 1000
==25729== Memcheck, a memory error detector
==25729== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==25729== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==25729== Command: ./titanic 1000
==25729== 
  1 miles, still unsinkable!
  2 miles, still unsinkable!
  3 miles, still unsinkable!
[... snip ...]
997 miles, still unsinkable!
998 miles, still unsinkable!
999 miles, still unsinkable!
==25729== Invalid write of size 1
==25729==    at 0x4EA79E9: vsprintf (iovsprintf.c:43)
==25729==    by 0x4E8B5C6: sprintf (sprintf.c:32)
==25729==    by 0x40068D: full_steam_ahead (titanic.c:21)
==25729==    by 0x4006EA: main (titanic.c:45)
==25729==  Address 0x52136fe is 0 bytes after a block of size 30 alloc'd
==25729==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25729==    by 0x400660: full_steam_ahead (titanic.c:20)
==25729==    by 0x4006EA: main (titanic.c:45)
==25729== 
==25729== Invalid read of size 1
==25729==    at 0x4E82A03: vfprintf (vfprintf.c:1661)
==25729==    by 0x4E8B498: printf (printf.c:33)
==25729==    by 0x400707: main (titanic.c:46)
==25729==  Address 0x52136fe is 0 bytes after a block of size 30 alloc'd
==25729==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25729==    by 0x400660: full_steam_ahead (titanic.c:20)
==25729==    by 0x4006EA: main (titanic.c:45)
==25729== 
1000 miles, still unsinkable!
==25729== 
==25729== HEAP SUMMARY:
==25729==     in use at exit: 0 bytes in 0 blocks
==25729==   total heap usage: 1,000 allocs, 1,000 frees, 30,000 bytes allocated
==25729== 
==25729== All heap blocks were freed -- no leaks are possible
==25729== 
==25729== For counts of detected and suppressed errors, rerun with: -v
==25729== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

Why is this such an amazing feature? Well, as I mentioned earlier, a big problem with undefined behaviour is that it’s not always repeatable. That makes it hard to be scientific about debugging. But when you run a program under Valgrind, whole classes of undefined behaviour suddenly do become repeatable, meaning you can catch these problems ahead of time, or repeat them with all the debugging information intact.

(In fact, I feel I should warn you here: running Valgrind on optimised code can actually result in false positives. In theory, false positives are possible on debug builds too; in practice I’ve never seen it.)

Back to the debugging. Let’s delve into what’s happening here. Once again, in one console do this:

$ valgrind --vgdb-error=0 ./titanic 1000

…and in another, do this:

$ gdb ./titanic
[... snip ...]
(gdb) target remote | /usr/lib/valgrind/../../bin/vgdb --pid=9909
Remote debugging using | /usr/lib/valgrind/../../bin/vgdb --pid=9909
relaying data between gdb and process 9909
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000000040012d0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) 

It seems that we hit this error in our full_steam_ahead() function, and only when our counter gets up to 1000, so let’s break on that:

(gdb) break titanic.c:21 if distance == 1000
Breakpoint 1 at 0x400665: file titanic.c, line 21.
(gdb) continue
Continuing.

Breakpoint 1, full_steam_ahead (distance=1000) at titanic.c:21
21      sprintf(status, "%*u %s", DIGITS_LENGTH, total_distance, LOG_TEXT);
(gdb) 

So what’s the problem here? Well, look at the length of our expected string:

(gdb) print strlen("1000 miles, still unsinkable!\n")
$1 = 30

To store this, we’d need a buffer that can hold the 30 characters plus the null terminator. Let’s pretend, for now, that it’s not obvious how much memory we allocated. How can we check how much memory is available?

There are a couple of ways. We can use the check_memory command, which requires us to get the address of the buffer:

(gdb) print status
$1 = 0x52136e0 ""
(gdb) monitor check_memory addressable 0x52136e0
Address 0x52136E0 len 1 addressable
 Address 0x52136e0 is 0 bytes inside a block of size 30 alloc'd
==9909==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9909==    by 0x400660: full_steam_ahead (titanic.c:20)
==9909==    by 0x4006EA: main (titanic.c:45)

Note the line: inside a block of size 30 alloc'd — this tells us we have a 30 character buffer available. We can also explicitly ask whether we have a certain length available:

(gdb) monitor check_memory addressable 0x52136e0 31
Address 0x52136E0 len 31 not addressable:
bad address 0x52136FE
 Address 0x52136e0 is 0 bytes inside a block of size 30 alloc'd
==9909==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9909==    by 0x400660: full_steam_ahead (titanic.c:20)
==9909==    by 0x4006EA: main (titanic.c:45)

See: Address 0x52136E0 len 31 not addressable.

Remember the get_vbits command from the first example? That works here too:

(gdb) monitor get_vbits 0x52136e0 31
ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffff__
Address 0x52136E0 len 31 has 1 bytes unaddressable

…and just like the other version, I use this command so often that I’ve defined a similar function for it in my ~/.gdbinit:

define lolwat
  eval "monitor get_vbits %p %d", &$arg0, $arg1
end

Note: just like the wat function, this takes the address of the object you give it. If you find that strange, because you almost always give this function a pointer, just change it to $arg0, $arg1 (ie. remove the & from $arg0).

(Again, you don’t have to be as whimsical as this; call it whatever you like.)

So now I can do:

(gdb) lolwat status[0] 31
ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffff__
Address 0x52136E0 len 31 has 1 bytes unaddressable

See the two underscores at the end of the range? That means that the 31st byte is unaddressable. Now we know what the problem is: we’re trying to write a 31 character string into a 30 character buffer. The solution is obvious! On line 7 in titanic.c, change this:

titanic.c
7
#define DIGITS_LENGTH (3)

…to, oh, I don’t know, this:

titanic.c
7
#define DIGITS_LENGTH (4)

Problem SOLVED FOREVER.

Exampl—

Oh okay, problem not solved. Let’s do the right thing and properly calculate the expected length of the string. Use git to checkout the Clive Palmer edition of our program:

$ git checkout -- titanic.c
$ git checkout clive-palmer-edition
$ make -sB
$ valgrind ./titanic 10000
[... snip ...]
==8077== 
==8077== HEAP SUMMARY:
==8077==     in use at exit: 0 bytes in 0 blocks
==8077==   total heap usage: 10,000 allocs, 10,000 frees, 308,894 bytes allocated
==8077== 
==8077== All heap blocks were freed -- no leaks are possible
==8077== 
==8077== For counts of detected and suppressed errors, rerun with: -v
==8077== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Now we have a beautiful, error-free, properly functioning program. Wonderful!

Let’s mess it up some more. For science.

Example 4: buffer reuse

One of the problems with the examples I’ve used so far is that they’ve all involved fairly simple use of buffers. What happens if you’re working with code that actually allocates all the memory at the start, or reuses buffers in various parts of the program? This might seem like an unnecessary thing to do, and indeed in most cases I’d suggest avoiding these bugs by not doing this in the first place. But it’s not out of the question for embedded or highly optimised code to need to do this.

Buffer reuse means that Valgrind will see the memory as already initialised even if it really shouldn’t. Consider the situation where you have some embedded code that operates in two independent parts:

1
2
3
4
5
6
7
8
9
10
11
12
#include <stdlib.h>

int main(void)
{
    void * buffer = custom_malloc(max(PART_ONE_BUFFER_SIZE, PART_TWO_BUFFER_SIZE));

    part_one(buffer);
    part_two(buffer);

    custom_free(buffer);
    exit 0;
}

Let’s say that both PART_ONE_BUFFER_SIZE and PART_TWO_BUFFER_SIZE are 300 bytes. If part_two(buffer) tries to read byte 301, this is exactly the kind of error that Valgrind will happily catch.

Now consider the problem if PART_ONE_BUFFER_SIZE is actually 400 bytes, and part_one(buffer) writes to all of it. Not even Valgrind will catch the error if part_two(buffer) tries to read 301 bytes. Byte 301 is well within the 400 bytes allocated in buffer. It might not be undefined behaviour as far as C is concerned, but it’s still a bug in our code. The part_two() function should not be reading this memory.

It could be even more subtle: part_two() reads data from the first 300 bytes, but before it’s initialised it properly. The results will now depend on what part_one() has left behind in this space.

So what can we do?

We can demonstrate this scenario with some changes to our code. Update the code for our final demo by doing:

$ git checkout example-4

Imagine that this code is maintained by two separate teams, who don’t really like talking to each other. The team who work on the simulator core decided that anything above about 1000 nautical miles isn’t reliable for the ship, so they changed the message for this range to:

>1k miles, iffy!

Unfortunately, they also introduced a typo in the log text: stilluunsinkable!.

Meanwhile, the frontend team decided that having the exclamation mark at the end of the message made them look like they were surprised that the ship works at all, so they added some code in main() to overwrite it. While they were at it, they noticed the typo so they fixed that too. Finally, to make the code super-efficient, they re-use the buffer if the size doesn’t change:

titanic.c
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
for (nautical_miles = 1; nautical_miles <= to_the_atlantic; nautical_miles++)
{
    const unsigned current_length = log_buffer_length(nautical_miles);

    if (current_length > last_length)
    {
        free(captains_log);
        captains_log = malloc(current_length);
    }

    full_steam_ahead(nautical_miles, captains_log);
    /* The exclamation mark makes us seem surprised. */
    captains_log[current_length - 3] = '.';
    /* Fix spelling error. */
    captains_log[current_length - 14] = ' ';

    printf("%s", captains_log);

    last_length = current_length;
}

If we now build and run it for, say, 1001 miles, we get:

$ make -sB
$ ./titanic 1001
1 miles, still unsinkable.
2 miles, still unsinkable.
[... snip ...]
999 miles, still unsinkable.
1000 miles, still unsinkable.
>1k miles, iffy!
 unsinkable.

Uh oh. There’s something funny going on with that last message. The trailing unsinkable text shouldn’t be there. What’s going on?

Valgrind doesn’t help:

$ valgrind ./titanic 1001
==19724== Memcheck, a memory error detector
==19724== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==19724== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==19724== Command: ./titanic 1001
==19724== 
1 miles, still unsinkable.
2 miles, still unsinkable.

[... snip ...]

998 miles, still unsinkable.
999 miles, still unsinkable.
1000 miles, still unsinkable.
>1k miles, iffy!
 unsinkable.
==19724== 
==19724== HEAP SUMMARY:
==19724==     in use at exit: 0 bytes in 0 blocks
==19724==   total heap usage: 4 allocs, 4 frees, 118 bytes allocated
==19724== 
==19724== All heap blocks were freed -- no leaks are possible
==19724== 
==19724== For counts of detected and suppressed errors, rerun with: -v
==19724== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

No errors here! Well, maybe it has something to do with that buffer we’re re-using. Let’s launch with --vgdb-error=0 this time, and use GDB to see what’s happening:

$ valgrind --vgdb-error=0 ./titanic 1001
==1304== Memcheck, a memory error detector
==1304== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==1304== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==1304== Command: ./titanic 1001
==1304== 
==1304== (action at startup) vgdb me ... 
==1304== 
==1304== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==1304==   /path/to/gdb ./titanic
==1304== and then give GDB the following command
==1304==   target remote | /usr/lib/valgrind/../../bin/vgdb --pid=1304
==1304== --pid is optional if only one valgrind process is running
==1304== 

…and in our other terminal:

$ gdb ./titanic
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
[... snip ...]
Reading symbols from ./titanic...done.
(gdb)  target remote | /usr/lib/valgrind/../../bin/vgdb --pid=1304
Remote debugging using | /usr/lib/valgrind/../../bin/vgdb --pid=1304
relaying data between gdb and process 1304
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000000040012d0 in _start () from /lib64/ld-linux-x86-64.so.2

Let’s break just outside the function call, but only when we’re about to create the 1001st message:

(gdb) break titanic.c:63 if nautical_miles == 1001
Breakpoint 1 at 0x400934: file titanic.c, line 63.
(gdb) continue
Continuing.

Breakpoint 1, main (argc=2, argv=0xffefffc28) at titanic.c:63
63          full_steam_ahead(nautical_miles, captains_log);
(gdb) 

The problem is that Valgrind sees the buffer as already initialised (because it is):

(gdb) lolwat captains_log[0] current_length
00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000

What we need to do is to make it clear to Valgrind that even though we’re re-using the buffer here as an optimisation, it should be considered "new" each time we enter the function that actually writes to it. This can be done with the make_memory undefined command:

(gdb) print current_length 
$1 = 31
(gdb) print captains_log 
$2 = 0x5502160 "1000 miles, still unsinkable.\n"
(gdb) monitor make_memory undefined 0x5502160 31
(gdb) lolwat captains_log[0] current_length
ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffff
(gdb) 

Let’s take a second to reflect on how amazing this is! We can reach into our running program and change whether the memory it uses is initialised or even accessible!

So now our buffer is seen as uninitialised by Valgrind. Let’s set a temporary breakpoint just after the message is printed, and see if we get an error now:

(gdb) tbreak titanic.c:70
Temporary breakpoint 2 at 0x40097c: file titanic.c, line 70.
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000005188a03 in _IO_vfprintf_internal (s=<optimised out>, format=<optimised out>, ap=ap@entry=0xffefffa38) at vfprintf.c:1661
1661    vfprintf.c: No such file or directory.

Aha! What does Valgrind have to say now? In the other terminal we see:

1000 miles, still unsinkable.
==1304== Conditional jump or move depends on uninitialised value(s)
==1304==    at 0x5188A03: vfprintf (vfprintf.c:1661)
==1304==    by 0x5191498: printf (printf.c:33)
==1304==    by 0x40097B: main (titanic.c:69)
==1304== 
==1304== (action on error) vgdb me ... 

Back in the GDB window, let’s see what our buffer state is now:

(gdb) frame 2
#2  0x000000000040097c in main (argc=2, argv=0xffefffc28) at titanic.c:69
69          printf("%s", captains_log);
(gdb) lolwat captains_log[0] current_length 
00000000 00000000 00000000 00000000 0000ffff ffffffff ffffffff 00ffff

Well that doesn’t look right. It looks like our naïve attempts at fixing up those spelling errors backfired on us. We’re overwriting the string terminator with a space because we didn’t realise the format had changed:

titanic.c
63
64
65
66
67
full_steam_ahead(nautical_miles, captains_log);
/* The exclamation mark makes us seem surprised. */
captains_log[current_length - 3] = '.';
/* Fix spelling error. */
captains_log[current_length - 14] = ' ';

Without the null terminator, printf just keeps printing past the intended end of the new string, and displays the text left over from previous iterations. And while this is not undefined behaviour, it is definitely a bug specifically related to buffer re-use. By marking the buffer as uninitialised, we were able to get Valgrind to flag this.

Example 5: Instrumenting code with Valgrind

What I’ve covered here are methods for incorporating Memcheck into your manual debugging workflow, so you can get to the bottom of specific bugs as you discover them.

This is, of course, the stupid way to create bug-free code. The smart way is to detect entire classes of bugs as you write new code, and never check them in. And the best way, of course, is to just not write them in the first place. My fool-proof guide to that will be out next week.

It is possible to use Valgrind’s features from within code, rather than manually typing commands into a GDB console. This is very similar to how we might use assert() to verify certain aspects of our code.

Valgrind provides a memcheck.h header file that contains all of the above features in a form that you can insert into your code. The great thing is, if you insert these commands but don’t run your program under Valgrind, nothing Valgrind-related happens! (My understanding is that Valgrind inserts special sequences of NOPs that a normal machine just ignores, but its synthetic machine understands.)

The best documentation for memcheck.h is, in fact, memcheck.h — that is, the header itself. On Ubuntu systems, if you install the valgrind package, you’ll find it in /usr/include/valgrind/memcheck.h and you can include it in any C file with:

1
#include <valgrind/memcheck.h>

A simple way to get acquainted with the features of memcheck.h is to try to replicate those earlier exercises in code ie. not using GDB.

For example, our first exercise, where we were betrayed by the variable tugboat_distance, we could have added these call into our code instead:

titanic.c
14
15
16
17
18
19
20
21
22
const unsigned tugboat_distance; /* TODO: who do we ask about this? */
const unsigned total_distance = distance + tugboat_distance;

VALGRIND_CHECK_VALUE_IS_DEFINED(distance);
VALGRIND_CHECK_VALUE_IS_DEFINED(tugboat_distance);
VALGRIND_CHECK_VALUE_IS_DEFINED(total_distance);

char *status = malloc(TOTAL_LENGTH * sizeof(char) * 1);
sprintf(status, "%*u %s", DIGITS_LENGTH, total_distance, LOG_TEXT);

It’s really a matter of circumstance as to what’s easier. If the error occurs consistently, it might be quicker to fire up GDB and step through with the monitor commands. If it’s harder to deduce the conditions that trigger the bug, putting one of these calls in can save you a huge amount of time.

You can actually combine the two methods! If you run this program as we did before, with --vgdb-error=0 and GDB doing remote debugging, then when you continue in GDB your program will automatically break if this check fails. This can help you systematically isolate a variable that’s causing problems. So even if you’re not inclined to put these calls into your code permanently, they’re a pretty powerful way to augment the kind of debugging we went through above.

What about the equivalent of my lolwat command — checking a whole range for definedness? In that case you’d use either VALGRIND_CHECK_MEM_IS_ADDRESSABLE() or VALGRIND_CHECK_MEM_IS_DEFINED(). Example 3 would look like:

titanic.c
20
21
22
23
24
25
26
27
28
char *status = malloc(TOTAL_LENGTH * sizeof(char) * 1);

/* We know that "1000 miles, still unsinkable!\n" is the
 * problematic string, so check whether that can fit.
 */
const size_t bad_string_length = strlen("1000 miles, still unsinkable!\n") + 1;
VALGRIND_CHECK_MEM_IS_ADDRESSABLE(buffer, bad_string_length);

sprintf(status, "%*u %s", DIGITS_LENGTH, total_distance, LOG_TEXT);

My simplified buffer reuse examples at the start of example 4 could have been checked with a call to VALGRIND_MAKE_MEM_UNDEFINED():

5
6
7
8
9
10
11
12
13
14
15
16
17
int main(void)
{
    const size_t buffer_size = max(PART_ONE_BUFFER_SIZE, PART_TWO_BUFFER_SIZE);
    void * buffer = malloc(buffer_size);

    VALGRIND_MAKE_MEM_UNDEFINED(buffer, buffer_size);
    part_one(buffer);

    VALGRIND_MAKE_MEM_UNDEFINED(buffer, buffer_size);
    part_two(buffer);

    exit 0;
}

It could even include multiple Valgrind features:

13
14
15
16
17
18
19
20
21
22
VALGRIND_MAKE_MEM_UNDEFINED(buffer, buffer_size);

/* clip_to_zero(x) is x for x >= 0, 0 for x < 0 */

VALGRIND_MAKE_MEM_UNADDRESSABLE(
    (char *)buffer + PART_TWO_BUFFER_SIZE
    ,clip_to_zero(buffer_size - PART_ONE_BUFFER_SIZE)
);

part_two(buffer);

The real example in example 4 would use:

titanic.c
63
64
65
66
67
68
69
70
71
72
73
VALGRIND_MAKE_MEM_UNDEFINED(captains_log, current_length);

full_steam_ahead(nautical_miles, captains_log);
/* The exclamation mark makes us seem surprised. */
captains_log[current_length - 3] = '.';
/* Fix spelling error. */
captains_log[current_length - 14] = ' ';

VALGRIND_CHECK_MEM_IS_DEFINED(captains_log, current_length);

printf("%s", captains_log);

The great thing about these calls is… you can just leave them in! In normal runs they shouldn’t affect program speed. In Valgrind runs they’ll work their magic. You could even have a mock memcheck.h or some indirection if you don’t want the NOPs in your final binary, or if you don’t want to force Valgrind to be a dependency.

It’s down to context as to whether these calls are too specific (ie. they’d only ever catch one bug and you already found it with these calls), but depending on your code they may be a great way to document and enforce some aspect of your API, especially if your code does any low level memory manipulation.

But my real code is much worse than this…

To get defensive for a moment: it would be reasonable for a reader to be sceptical of these examples, because the code is so simple and contrived. You might easily be able to see the bugs unaided, or see how they could be coded better to avoid them. That’s fair.

But trust me when I say that I’ve used these techniques (both GDB and Memcheck calls) on programs much more complex than what I’ve shown here, and it hasn’t been much harder than this. Valgrind has saved me an enormous amount of time and effort as a C developer — in coding, in maintenance, in testing, in support, and in teaching.

So please, use it. Play with it. Run it on your worst code or, if you’re really brave, on your best. And if you have any questions, feel free to ask them below.

Comments