Parallel Debugging

r7174 on the trunk added a mechanism to easily debug parallel engines.


Debugging a parallel engine is difficult. Unlike the serial components of VisIt, a parallel engine is hard to start under a debugger. Presumably we could add some internallauncher magic to start up N debuggers, but that would be cumbersome at best. Further, in some environments it could be very difficult to access the launched engines: they may launch at a later time, or on another machine.

The solution is to wait on a developer-controlled resource.


By causing the engine to essentially spinlock on a resource you control, you can give yourself enough time to attach a debugger to the already-launched engine_par process[es]. Most people end up using files for this:

 while(access("/home/me/lockfile", R_OK) == -1) { /* spin */ }

One would recompile after adding this code, and then launch VisIt as normal. Then look up the pid of the VisIt process[es] and attach to the ones you are interested in via the debugger. Finally, in another terminal run 'touch /home/me/lockfile'. All processes will exit their loop and you'll have the one[s] you care about under your control in the debugger.
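The one-line loop above burns a full core per process while it waits. A slightly friendlier variant sleeps between checks and prints the pid so you know what to attach to. This is only a sketch: the name WaitForLockfile and its poll_ms and max_polls parameters are made up for illustration and are not part of VisIt.

```cpp
#include <unistd.h>   // access(), getpid()
#include <time.h>     // nanosleep()
#include <cstdio>

// Hypothetical helper (not a VisIt function): block until `path` becomes
// readable, checking every poll_ms milliseconds. A negative max_polls waits
// forever; otherwise the helper gives up after that many checks.
// Returns true if the file is readable, false if it gave up.
bool WaitForLockfile(const char *path, int poll_ms, long max_polls)
{
    // Report the pid on stderr so you know which process to attach to.
    std::fprintf(stderr, "pid %d waiting for %s\n", (int)getpid(), path);

    for (long n = 0; max_polls < 0 || n < max_polls; ++n)
    {
        if (access(path, R_OK) == 0)
            return true;

        // Sleep instead of spinning so the waiting process stays idle.
        struct timespec ts = { poll_ms / 1000, (poll_ms % 1000) * 1000000L };
        nanosleep(&ts, nullptr);
    }
    return access(path, R_OK) == 0;
}
```

A call such as WaitForLockfile("/home/me/lockfile", 250, -1) would then take the place of the while loop.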


An annoying feature of this setup is that you have to remove and create the file all the time. At some point, you'll forget to remove it after changing something, your debug run will go too far, and you'll have to start the process again after you make sure to remove the file.

Since you're going to use the debugger to debug the program anyway, why not control process continuation in the debugger? That's what PAR_WaitForDebugger() does for you.

Usage Guide

Add a call to PAR_WaitForDebugger() somewhere in the code before the point where the erroneous behavior you're observing occurs. Recompile. Then start VisIt as normal.

Now attach to the first process in the engine's "group":

 gdb -q -ex "attach $(pgrep engine_par | head -n 1)"

(You can attach to other processes as well, but you need to attach to at least the first one.)

Almost without fail, your process will be 'stuck' waiting for you:

 0x00007f1dd0b7d9ea in PAR_WaitForDebugger () at 
 272             do {

Now you just need to enter one simple command to allow your engine to exit this loop:

 (gdb) set variable i=1

Of course, you get the debugging prompt right back. If you're getting a SIGSEGV and simply want to know where it's happening, just type continue. If you want to do more in-depth debugging, use finish to exit the PAR_WaitForDebugger() function.
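To see why setting i releases the process, it helps to picture the kind of loop that is waiting on it. The following is a guess at the general shape, not VisIt's actual PAR_WaitForDebugger() source: the name WaitForDebuggerSketch and the poll_ms and max_polls parameters are invented here, and the give-up limit exists only so the sketch can be exercised without a debugger.

```cpp
#include <time.h>   // nanosleep()

// Hypothetical sketch, not the real PAR_WaitForDebugger(): loop until a
// debugger changes the local flag `i`, e.g. with "set variable i=1" while
// stopped in this frame. A negative max_polls waits forever.
// Returns true if the flag was set, false if the poll limit ran out.
bool WaitForDebuggerSketch(int poll_ms, long max_polls)
{
    // volatile forces a fresh read on every pass, so a write made from the
    // debugger is seen instead of being optimized away.
    volatile int i = 0;
    long polls = 0;

    do {
        // Sleep between checks so the waiting process stays idle.
        struct timespec ts = { poll_ms / 1000, (poll_ms % 1000) * 1000000L };
        nanosleep(&ts, nullptr);
        ++polls;
    } while (i == 0 && (max_polls < 0 || polls < max_polls));

    // Assumption: in an MPI build, only the first rank would run this loop,
    // and every rank would then meet in a collective such as MPI_Barrier().
    // That part is left out to keep the sketch free of MPI dependencies.
    return i != 0;
}
```

Without a debugger attached nothing ever changes i, so the loop ends only when the poll limit runs out; with max_polls set to -1 it waits until you set the variable by hand.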

As soon as you enter finish or continue, you effectively "release" all of the other processes in the job. Therefore, if you're getting a crash in the 5th engine_par process, make sure to attach a debugger to that process before continuing in the 0th process.
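Picking out "the 5th engine_par process" from a list of pids is easier if each process reports who it is before it starts waiting. One possible approach is sketched below; DescribeProcess is a made-up helper, not something VisIt provides, and the rank is assumed to come from wherever your code already has it (for example, MPI_Comm_rank).

```cpp
#include <unistd.h>   // gethostname(), getpid()
#include <cstdio>
#include <string>

// Hypothetical helper (not a VisIt function): build a line such as
// "rank 5: host node12, pid 4242" for the calling process.
std::string DescribeProcess(int rank)
{
    char host[256];
    if (gethostname(host, sizeof(host)) != 0)
        std::snprintf(host, sizeof(host), "unknown");
    host[sizeof(host) - 1] = '\0';   // gethostname may not terminate on truncation

    char buf[512];
    std::snprintf(buf, sizeof(buf), "rank %d: host %s, pid %d",
                  rank, host, (int)getpid());
    return std::string(buf);
}
```

Printing this string to stderr just before the call to PAR_WaitForDebugger() tells you which machine to log in to and which pid to hand to gdb.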