\section{Debugging}

\subsubsection{How can I debug Charm++ programs?}

There are several ways to debug programs written in Charm++:

\begin{description}

\item[print] By using {\tt CkPrintf}, values from critical points in the program can be
printed.

\item[gdb] This can be used both on a single processor and in parallel
runs. In the latter case, each processor has its own terminal window with a gdb
session attached.

\item[charmdebug] This is the most sophisticated method to debug parallel
programs in Charm++. It is tailored to Charm++ and can display and inspect
chare objects as well as messages in the system. Individual {\em gdb} sessions can be
attached to specific processors on demand.

\end{description}

\subsubsection{How do I use charmdebug?}

Currently charmdebug has been tested only with net- versions of Charm++; testing with other
versions is pending. To get the Charm Debug tool, check out the source code from the repository.
This will create a directory named ccs\_tools. Move to this directory and
build Charm Debug.

\begin{verbatim}
 git clone git://charm.cs.uiuc.edu/ccs_tools.git
 cd ccs_tools
 ant
\end{verbatim}

This will create the executable {\tt bin/charmdebug}. To start, simply substitute ``charmdebug'' for
``charmrun'':

\begin{alltt}shell> <path>/charmdebug ./myprogram\end{alltt}

You can find more detailed information in the debugger manual
\href{http://charm.cs.illinois.edu/manuals/html/debugger/manual-1p.html}{here}.

\subsubsection{Can I use TotalView?}

Yes, on mpi- versions of Charm++. In this case, the program is a regular MPI
application, and as such any tool available for MPI programs can be used. Note
that some of the internal data structures (such as queued messages) might be
difficult to find.

\subsubsection{How do I use {\em gdb} with Charm++ programs?}

It depends on the machine. On the net- versions of Charm++, such as net-linux,
you can just run the serial debugger:
\begin{alltt}shell> gdb myprogram\end{alltt}

If the problem only shows up in parallel and you are working from an X
terminal, you can use the {\em ++debug} or {\em ++debug-no-pause} option of charmrun
to get a separate window for each process:
\begin{alltt}
shell> export DISPLAY="myterminal:0"
shell> ./charmrun ./myprogram +p2 ++debug
\end{alltt}

\subsubsection{When I try to use the {\em ++debug} option I get: {\tt remote
host not responding... connection closed}}

First, make sure the program at least starts to run properly without {\em ++debug}
(i.e., charmrun is working and there are no problems with the program startup
phase). You need to make sure that gdb or dbx, and xterm, are installed
on all the machines you are using (not just the one that is running {\tt charmrun}).

If you are working on remote machines from Linux, you may need to run ``xhost +''
locally to give the remote machines permission to display an xterm on
your desktop. If you are working from a Windows machine, you need an X server
application such as Exceed, set up to give the right permissions
for X windows.

You need to make sure the DISPLAY environment variable on
the remote machine is set correctly to your local machine. ssh and PuTTY are
recommended because they take care of the DISPLAY environment variable automatically,
and you can set up ssh to use tunnels so that it even works from a private
subnet (e.g., 192.168.0.8).

Since the xterm is displayed from the node machines,
you have to make sure they have the correct DISPLAY set. Again, setting
up ssh in the nodelist file to spawn node programs should take care of
that. If you are using rsh, you need to set DISPLAY in {\em \textasciitilde/.charmrunrc},
which will be read at startup by each node program.

\subsubsection{My debugging printouts seem to be out of order. How can I prevent this?}

Printouts from different processors do not normally stay ordered. Consider
the code:
\begin{alltt}
...somewhere... \{
  CkPrintf("cause\textbackslash{}n");
  proxy.effect();
\}
void effect(void) \{
  CkPrintf("effect\textbackslash{}n");
\}
\end{alltt}

Though you might expect this code to always print ``cause, effect'', you
may get ``effect, cause''. This can only happen when the cause and
effect execute on different processors and the cause's output is delayed.

If you pass the extra command-line parameter {\em +syncprint}, then CkPrintf
blocks until the output is queued, so your printouts should at
least happen in causal order. Note that this dramatically slows down
output.
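
When {\em +syncprint} is too slow, one possible workaround is to tag every
debugging line so that interleaved output can at least be attributed and
re-sorted afterwards. The following is a minimal sketch in plain C++, not part
of Charm++: the helper name {\tt tagLine} and the tag format are hypothetical,
and the {\tt pe} argument is assumed to come from {\tt CkMyPe()} in a real
program. The sequence number only recovers the order of lines printed by one
processor; it does not establish causal order across processors.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not a Charm++ API): prefix a debug message with the
// processor number and a sequence number maintained by the caller.
// 'pe' is assumed to be the value of CkMyPe() in a real Charm++ program.
std::string tagLine(int pe, unsigned long seq, const std::string& msg) {
  assert(pe >= 0);
  char prefix[64];
  std::snprintf(prefix, sizeof(prefix), "[PE %d #%lu] ", pe, seq);
  return std::string(prefix) + msg;
}
```

The tagged string would then be handed to {\tt CkPrintf} in place of the bare
message, with the caller incrementing its own counter after each line.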

\subsubsection{Is there a way to flush the print buffers in Charm++ (like
{\tt fflush()})?}

Charm++ automatically flushes the print buffers at every newline and at
program exit. There is no way to manually flush the buffers at any other
point.

\subsubsection{My Charm++ program is causing a seg fault, and the debugger shows that
it's crashing inside {\em malloc} or {\em printf} or {\em fopen}!}

This is not a bug in the C library; it is a bug in your program: you are
corrupting the heap. Link your program again with {\em -memory paranoid} and
run it again in the debugger. {\em -memory paranoid} checks the heap and
detects buffer over- and under-run errors, double-deletes, delete-garbage,
and other common mistakes that trash the heap.
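
As an illustration of the kind of mistake involved, the sketch below shows a
classic buffer over-run in a comment, next to a corrected version. It is plain
C++ and independent of Charm++; the function name {\tt duplicateName} is
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: return a heap-allocated copy of a C string.
// The caller owns the result and must release it with delete[].
char* duplicateName(const char* name) {
  assert(name != nullptr);
  const std::size_t len = std::strlen(name);
  // BUG (shown for illustration only, do not use):
  //   char* copy = new char[len];   // no room for the terminating '\0'
  //   std::strcpy(copy, name);      // writes one byte past the buffer
  // The stray byte may damage the allocator's bookkeeping, so the crash
  // can surface much later, inside malloc, printf, or fopen.
  char* copy = new char[len + 1];    // fixed: reserve space for '\0'
  std::memcpy(copy, name, len + 1);  // copies the terminator as well
  return copy;
}
```

Because the damage and the crash are separated in time, a debugger stack trace
alone rarely points at the offending line, which is why a heap checker helps.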

\subsubsection{Everything works fine on one processor, but when I run on
multiple processors it crashes!}

It's very convenient to do your testing on one processor (i.e., with
{\em +p1}), but there are several things that only happen on multiple processors.

A single processor has just one set of global variables, but with multiple
processors each one has its own copy of the global variables. This means that on one processor,
you can set a global variable and it stays set ``everywhere'' (i.e., right
here!), while on two processors the global variable never gets initialized
on the other processor. If you must use globals, either set them on every
processor or make them into {\em readonly} globals.

A single processor has just one address space, so you actually {\em can}
pass pointers around between chares. When running on multiple processors,
the pointers dangle. This can cause incredibly weird behavior, such as reading
from uninitialized data or corrupting the heap. The solution is to never,
ever send pointers in messages: you need to send the data the pointer points
to, not the pointer.
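
To make the difference concrete, the following sketch copies the pointed-to
values into a flat byte buffer, which is what has to travel between address
spaces, and rebuilds them on the receiving side. It is plain C++ with
hypothetical names ({\tt packValues}, {\tt unpackValues}); in an actual Charm++
program the runtime's parameter marshalling or PUP framework would normally
perform this copying for you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy 'count' doubles into a byte buffer that could be
// placed in a message. The buffer holds the values, not the address 'data'.
std::vector<char> packValues(const double* data, std::size_t count) {
  assert(data != nullptr || count == 0);
  std::vector<char> buffer(count * sizeof(double));
  if (count > 0) std::memcpy(buffer.data(), data, buffer.size());
  return buffer;
}

// Hypothetical helper: rebuild the values on the receiving side.
std::vector<double> unpackValues(const std::vector<char>& buffer) {
  assert(buffer.size() % sizeof(double) == 0);
  std::vector<double> values(buffer.size() / sizeof(double));
  if (!values.empty()) std::memcpy(values.data(), buffer.data(), buffer.size());
  return values;
}
```

After packing, the receiver's copy no longer depends on the sender's memory:
the sender can modify or free its array without affecting what was sent.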

\subsubsection{I get the error: ``{\tt Group ID is zero-{}- invalid!}''. What does
this mean?}

The {\em group} it is referring to is the chare group. This
error is often due to using an uninitialized proxy or handle, but it
may also indicate severe corruption. Run with {\em ++debug} and check
whether you just sent a message via an uninitialized proxy.

\subsubsection{I get the error: {\tt Null-Method Called. Program may have Unregistered
Module!!} What does this mean?}

You are trying to use code from a module that has not been properly
initialized.

To fix this, add an ``extern module'' declaration to the {\em .ci} file for
your {\em mainmodule}:
\begin{alltt}
mainmodule whatever \{
  extern module someModule;
  ...
\}
\end{alltt}

\subsubsection{When I run my program, it gives this error:}

\begin{alltt}
Charmrun: error on request socket-{}-
Socket closed before recv.
\end{alltt}

This means that the node program died without informing {\tt charmrun}
about it, which typically means a segmentation fault while in the interrupt
handler or other critical communications code. This indicates severe
corruption in Charm++'s data structures, which is likely the result of
a heap corruption bug in your program. Re-linking with {\em -memory paranoid}
may clarify the true problem.

\subsubsection{When I run my program, sometimes I get a {\tt Hangup}, and
sometimes {\tt Bus Error}. What do these messages indicate?}

{\tt Bus Error} and {\tt Hangup} are both indications that your
program is terminating abnormally, i.e., with an uncaught signal (SEGV or
SIGBUS). You should definitely run the program with gdb, or use {\em ++debug}.
Bus errors often mean there is an alignment problem; check whether your compiler
or environment offers support for detecting these.
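
One common source of such alignment problems is casting a byte pointer to a
wider type and dereferencing it. The sketch below, in plain C++ with
hypothetical function names, reads and writes a 32-bit value at an arbitrary
byte offset using {\tt std::memcpy}, which is well defined regardless of
alignment. It is meant as an illustration of the technique, not as the cause of
any particular bus error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: store a 32-bit value at any byte offset.
void storeU32(unsigned char* buffer, std::size_t offset, std::uint32_t value) {
  assert(buffer != nullptr);
  std::memcpy(buffer + offset, &value, sizeof(value));
}

// Hypothetical helper: load a 32-bit value from any byte offset.
// Risky alternative (may raise SIGBUS on strict-alignment machines):
//   return *reinterpret_cast<const std::uint32_t*>(buffer + offset);
std::uint32_t loadU32(const unsigned char* buffer, std::size_t offset) {
  assert(buffer != nullptr);
  std::uint32_t value;
  std::memcpy(&value, buffer + offset, sizeof(value));
  return value;
}
```

The caller is responsible for making sure that {\tt offset + 4} does not exceed
the size of the buffer.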