\section{Debugging}

\subsubsection{How can I debug Charm++ programs?}

There are many ways to debug programs written in Charm++:

\begin{description}

\item[print] By using {\tt CkPrintf}, values from critical points in the program can be
printed.

\item[gdb] This can be used both on a single processor and in parallel
simulations. In the latter case, each processor has a terminal window with a gdb
session connected to it.

\item[charmdebug] This is the most sophisticated method for debugging parallel
programs in Charm++. It is tailored to Charm++ and can display and inspect
chare objects as well as messages in the system. Individual {\em gdb} sessions can be
attached to specific processors on demand.

\end{description}
\subsubsection{How do I use charmdebug?}

Currently charmdebug has been tested only with net- versions of Charm++; testing with
other versions is pending. To get the Charm Debug tool, check out the source code from the repository.
This will create a directory named ccs\_tools. Move to this directory and
build Charm Debug.

\begin{verbatim}
git clone git://charm.cs.uiuc.edu/ccs_tools.git
cd ccs_tools
ant
\end{verbatim}

This will create the executable {\tt bin/charmdebug}. To start it, simply substitute ``charmdebug'' for
``charmrun'':

\begin{alltt}shell> <path>/charmdebug ./myprogram\end{alltt}

You can find more detailed information in the debugger manual
\href{http://charm.cs.illinois.edu/manuals/html/debugger/manual-1p.html}{here}.
\subsubsection{Can I use TotalView?}

Yes, on mpi- versions of Charm++. In this case, the program is a regular MPI
application, and as such any tool available for MPI programs can be used. Note
that some of the internal data structures (such as queued messages) might be
difficult to find.
\subsubsection{How do I use {\em gdb} with Charm++ programs?}

It depends on the machine. On the net- versions of Charm++, like net-linux,
you can just run the serial debugger:
\begin{alltt}shell> gdb myprogram\end{alltt}

If the problem only shows up in parallel, and you're running on an X
terminal, you can use the {\em ++debug} or {\em ++debug-no-pause} options of charmrun
to get a separate window for each process:
\begin{alltt}
shell> export DISPLAY="myterminal:0"
shell> ./charmrun ./myprogram +p2 ++debug
\end{alltt}
%On the SGI Origin2000, you can again run with ++debug, but this only
%prints out the process ID of each processor and waits 10 seconds. In another
%window, you have to manually attach a debugger to the running process,
%like this:
%<br><tt>&nbsp; > ./charmrun ./myprogram +p2 ++debug</tt>
%<br><tt>Running on 2 processors:&nbsp; ./myprogram ++debug</tt>
%<br><tt>CHARMDEBUG> Processor 0 has PID 34554234</tt>
%<br><tt>CHARMDEBUG> Processor 1 has PID 35086430</tt>
%<br><tt>...</tt>
%<br><tt>&nbsp; > dbx -p 34554234</tt>
\subsubsection{When I try to use the {\em ++debug} option I get: {\tt remote
host not responding... connection closed}}

First, make sure the program at least starts to run properly without {\em ++debug}
(i.e., charmrun is working and there are no problems with the program startup
phase). You need to make sure that gdb or dbx, and xterm, are installed
on all the machines you are using (not the one that is running {\tt charmrun}).
If you are working on remote machines from Linux, you may need to run ``xhost +''
locally to give the remote machines permission to display an xterm on
your desktop. If you are working from a Windows machine, you need an X server
application such as Exceed, set up with the right permissions
for X windows. You need to make sure the DISPLAY environment variable on
the remote machine is set correctly to your local machine. I recommend
ssh and putty, because they take care of the DISPLAY environment automatically,
and you can set up ssh to use tunnels so that it even works from a private
subnet (e.g., 192.168.0.8). Since the xterm is displayed from the node machines,
you have to make sure they have the correct DISPLAY set. Again, setting
up ssh in the nodelist file to spawn node programs should take care of
that. If you are using rsh, you need to set DISPLAY in {\em \~{}/.charmrunrc},
which will be read at startup time by each node program.
%<li>
%<b>I've been having some trouble using </b><tt>charmrun</tt><b> with the
%</b><tt>++debug</tt><b>
%option. I have XWinPro running and use ttssh to do X forwarding. I can
%get xemacs to pop up, but when I try to
%</b><tt>charmrun pgm ++debug</tt><b>,
%I receive the following error messages from ttssh:
%</b><tt>Remote X application
%sent incorrect authentication data. Its X session is being cancelled.</tt></li>

%<br>X forwarding generally doesn't work with charmrun, because X forwarding
%assumes all the programs you want to see are running on the machine you're
%logged in to.&nbsp; Charmrun starts your program on the nodes of the parallel
%machine, which generally confuses X forwarding.&nbsp; So forget about forwarding
%and just set the DISPLAY directly.
%<br>&nbsp;
%<li>
%<b>How else can I debug my Charm++ programs?</b></li>

%<br>The usual methods still work in parallel--diagnostic printouts, selective
%code removal, working up from tiny input sizes, etc.
\subsubsection{My debugging printouts seem to be out of order. How can I prevent this?}

Printouts from different processors do not normally stay ordered. Consider
the code:
\begin{alltt}
...somewhere... \{
  CkPrintf("cause\textbackslash{}n");
  proxy.effect();
\}
void effect(void) \{
  CkPrintf("effect\textbackslash{}n");
\}
\end{alltt}

Though you might expect this code to always print ``cause, effect'', you
may get ``effect, cause''. This can only happen when the cause and
effect execute on different processors and the output of the cause is delayed.

If you pass the extra command-line parameter {\em +syncprint}, then CkPrintf
actually blocks until the output is queued, so your printouts should at
least happen in causal order. Note that this slows down output dramatically.
\subsubsection{Is there a way to flush the print buffers in Charm++ (like
{\tt fflush()})?}

Charm++ automatically flushes the print buffers at every newline and at
program exit. There is no way to manually flush the buffers at another
point.
\subsubsection{My Charm++ program is causing a seg fault, and the debugger shows that
it's crashing inside {\em malloc} or {\em printf} or {\em fopen}!}

This isn't a bug in the C library, it's a bug in your program -- you're
corrupting the heap. Link your program again with {\em -memory paranoid} and
run it again in the debugger. {\em -memory paranoid} will check the heap and
detect buffer over- and under-run errors, double-deletes, delete-garbage,
and other common mistakes that trash the heap.
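
To make the idea of over- and under-run detection concrete, here is a minimal sketch of a guard-byte check in plain C++. It is a generic illustration only and makes no claim about how {\em -memory paranoid} is implemented inside Charm++. The names {\tt GuardedBuffer}, {\tt kGuard} and {\tt kPattern} are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical illustration only; NOT the Charm++ implementation of
// -memory paranoid. A user buffer is surrounded by guard bytes that
// hold a known pattern, so a stray write next to the buffer is visible.
constexpr std::size_t kGuard = 8;          // guard bytes on each side (assumed size)
constexpr unsigned char kPattern = 0xAB;   // fill pattern for the guards (assumed value)

class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t n)
        : size_(n), storage_(n + 2 * kGuard, kPattern) {
        // Zero the user-visible region; the guards keep the pattern.
        std::memset(data(), 0, n);
    }

    // Pointer to the user-visible region, just past the front guard.
    unsigned char* data() { return storage_.data() + kGuard; }
    std::size_t size() const { return size_; }

    // True if both guard regions still hold the original pattern.
    bool guardsIntact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (storage_[i] != kPattern) return false;                   // under-run
            if (storage_[kGuard + size_ + i] != kPattern) return false;  // over-run
        }
        return true;
    }

private:
    std::size_t size_;
    std::vector<unsigned char> storage_;
};
```

Writing to {\tt data()[size()]} on such a buffer lands in the rear guard, so a later call to {\tt guardsIntact()} returns false. A checking allocator can run this kind of test when memory is freed and report the corruption near its cause, rather than much later inside {\em malloc}.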
\subsubsection{Everything works fine on one processor, but when I run on
multiple processors it crashes!}

It's very convenient to do your testing on one processor (i.e., with
{\em +p1}); but there are several things that only happen on multiple processors.

A single processor has just one set of global variables, but with multiple
processors each processor has its own copy. This means on one processor,
you can set a global variable and it stays set ``everywhere'' (i.e., right
here!), while on two processors the global variable never gets initialized
on the other processor. If you must use globals, either set them on every
processor or make them into {\em readonly} globals.

A single processor has just one address space, so you actually {\em can}
pass pointers around between chares. When running on multiple processors,
a pointer received from another processor dangles. This can cause incredibly weird behavior -- reading
from uninitialized data, corrupting the heap, etc. The solution is to never,
ever send pointers in messages -- you need to send the data the pointer points
to, not the pointer.
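
The difference between sending a pointer and sending the data it points to can be sketched in plain C++. The helper names {\tt packDoubles} and {\tt unpackDoubles} are hypothetical and are not part of the Charm++ API; in a real Charm++ program the runtime's own marshalling facilities play this role. The sketch also assumes sender and receiver share the same byte order and type sizes.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; not part of the Charm++ API.
// Flatten the values into a byte buffer (element count first), so the
// receiver never needs a pointer into the sender's address space.
std::vector<char> packDoubles(const std::vector<double>& values) {
    const std::size_t count = values.size();
    std::vector<char> bytes(sizeof(count) + count * sizeof(double));
    std::memcpy(bytes.data(), &count, sizeof(count));
    if (count > 0) {
        std::memcpy(bytes.data() + sizeof(count), values.data(),
                    count * sizeof(double));
    }
    return bytes;
}

// Rebuild an independent copy of the values on the receiving side.
// Returns an empty vector if the buffer is too short to be valid.
std::vector<double> unpackDoubles(const std::vector<char>& bytes) {
    std::size_t count = 0;
    if (bytes.size() < sizeof(count)) return {};
    std::memcpy(&count, bytes.data(), sizeof(count));
    if ((bytes.size() - sizeof(count)) / sizeof(double) < count) return {};
    std::vector<double> values(count);
    if (count > 0) {
        std::memcpy(values.data(), bytes.data() + sizeof(count),
                    count * sizeof(double));
    }
    return values;
}
```

The byte buffer returned by {\tt packDoubles} is what would travel in a message. The receiver calls {\tt unpackDoubles} and owns a fresh copy, so nothing depends on an address that is only meaningful on the sending processor.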
\subsubsection{I get the error: ``{\tt Group ID is zero-{}- invalid!}''. What does
this mean?}

The {\em group} it is referring to is the chare group. This
error is often due to using an uninitialized proxy or handle, but it's
possible this indicates severe corruption. Run with {\em ++debug} and check
whether you just sent a message via an uninitialized proxy.
\subsubsection{I get the error: {\tt Null-Method Called. Program may have Unregistered
Module!!} What does this mean?}

You are trying to use code from a module that has not been properly
initialized.

So, in the {\em .ci} file for your {\em mainmodule}, you should
add an ``extern module'' declaration:
\begin{alltt}
mainmodule whatever \{
  extern module someModule;
  ...
\};
\end{alltt}
\subsubsection{When I run my program, it gives this error:}

\begin{alltt}
Charmrun: error on request socket-{}-
Socket closed before recv.
\end{alltt}

This means that the node program died without informing {\tt charmrun}
about it, which typically means a segmentation fault while in the interrupt
handler or other critical communications code. This indicates severe
corruption in Charm++'s data structures, which is likely the result of
a heap corruption bug in your program. Re-linking with {\em -memory paranoid}
may clarify the true problem.
\subsubsection{When I run my program, sometimes I get a {\tt Hangup}, and
sometimes {\tt Bus Error}. What do these messages indicate?}

{\tt Bus Error} and {\tt Hangup} both are indications that your
program is terminating abnormally, i.e., with an uncaught signal (SIGSEGV or
SIGBUS). You should definitely run the program with gdb, or use {\em ++debug}.
Bus Errors often mean there is an alignment problem; check whether your compiler
or environment offers support for detecting these.
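
One common source of alignment problems is reading a multi-byte value through a pointer cast into the middle of a byte buffer. The sketch below shows a generic C++ idiom that avoids this by copying the bytes instead. The function name {\tt readU32} is hypothetical and the example is not taken from Charm++; whether a misaligned access actually raises a Bus Error depends on the hardware.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper for illustration; not part of the Charm++ API.
// Read a 32-bit value from an arbitrary, possibly misaligned address.
// Copying the bytes avoids dereferencing a misaligned uint32_t pointer,
// which is undefined behavior and can fault on some hardware.
std::uint32_t readU32(const unsigned char* p) {
    std::uint32_t value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}
```

Replacing an expression such as {\tt *(uint32\_t*)(buf + 1)} with {\tt readU32(buf + 1)} keeps the same result on machines that tolerate misaligned loads and removes the fault on those that do not.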