HACKING ON THE GNUBOY SOURCE TREE

In preparation for the first release, I'm putting together a simple
document to aid anyone interested in playing around with or improving
the gnuboy source. First of all, before working on anything, you
should know my policies as maintainer. I'm happy to accept contributed
code, but there are a few guidelines:

* Obviously, all code must be able to be distributed under the GNU
  GPL. This means that your terms of use for the code must be
  equivalent to or weaker than those of the GPL. Public domain and
  MIT-style licenses are perfectly fine for new code that doesn't
  incorporate existing parts of gnuboy, e.g. libraries, but anything
  derived from or built upon the GPL'd code can only be distributed
  under GPL. When in

* Please stick to a coding and naming convention similar to the
  existing code. I can reformat contributions if I need to when
  integrating them, but it makes it much easier if that's already done
  by the coder. In particular, indentation is a single tab (char 9),
  and all symbols are all lowercase, except for macros, which are all
  uppercase.

* All code must be completely deterministic and consistent across all
  platforms. This results in the two following rules...

    * No floating point code whatsoever. Use fixed point or better yet
      exact analytical integer methods as opposed to any approximation.

    * No threads. Emulation with threads is a poor approximation if
      done sloppily, and it's slow anyway even if done right since
      things must be kept synchronous. Also, threads are not portable.
      Just say no to threads.

* All non-portable code belongs in the sys/ or asm/ trees. #ifdef
  should be avoided except for general conditionally-compiled code, as
  opposed to little special cases for one particular cpu or operating
  system. (e.g. #ifdef USE_ASM is ok, #ifdef __i386__ is NOT!)

* That goes for *nix code too. gnuboy is written in ANSI C, and I'm
  not going to go adding K&R function declarations or #ifdef's to make
  sure the standard library is functional. If your system is THAT
  broken, fix the system, don't "fix" the emulator.

* Please no feature-creep. If something can be done through an
  external utility or front-end, or through clever use of the rc
  subsystem, don't add extra code to the main program.

* On that note, the modules in the sys/ tree serve the singular
  purpose of implementing calls necessary to get input and display
  graphics (and eventually sound). Unlike in poorly-designed
  emulators, they are not there to give every different target
  platform its own gui and different set of key bindings.

* Furthermore, the main loop is not in the platform-specific code, and
  it will never be. Windows people, put your code that would normally
  go in a message loop in ev_refresh and/or sys_sleep!

* Commented code is welcome but not required.

* I prefer asm in AT&T syntax (the style used by *nix assemblers and
  likewise DJGPP) as opposed to Intel/NASM/etc style. If you really
  must use a different style, I can convert it, but I don't want to
  add extra dependencies on nonstandard assemblers to the build
  process. Also, portable C versions of all code should be available.

* Have fun with it. If my demands stifle your creativity, feel free to
  fork your own projects. I can always adapt and merge code later if
  your rogue ideas are good enough. :)

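To make the "exact integer methods" guideline above a bit more
concrete, here is a minimal sketch of converting emulated cycles to
output samples without floating point. Everything in it (struct
resampler, cycles_to_samples, CYCLES_PER_SEC) is a made-up name for
illustration, not something taken from the gnuboy source. The idea is
to carry the division remainder forward so no fraction is ever lost.

```c
#include <assert.h>

#define CYCLES_PER_SEC (1UL << 21)	/* emulated cycles per second */

/* Hypothetical converter state, not part of gnuboy. */
struct resampler
{
	unsigned long hz;	/* output sample rate */
	unsigned long rem;	/* remainder carried from earlier calls */
};

/* Returns how many whole samples the given cycles produce. The
 * remainder is kept in r->rem, so the running total is exact and
 * identical on every platform. */
unsigned long cycles_to_samples(struct resampler *r, unsigned long cycles)
{
	unsigned long total;

	/* Keep the product below 2**32 for 32-bit unsigned long. */
	assert(cycles <= 65536UL && r->hz <= 48000UL);

	total = cycles * r->hz + r->rem;
	r->rem = total % CYCLES_PER_SEC;
	return total / CYCLES_PER_SEC;
}
```

Feeding it exactly one second worth of cycles in small pieces yields
exactly hz samples, however the pieces are sized.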
OK, enough of that. Now for the fun part...


THE SOURCE TREE STRUCTURE

README               - general information related to using gnuboy
INSTALL              - compiling and installation instructions
HACKING              - this file, obviously
COPYING              - the GNU GPL, grants freedom under condition of
                       preserving it

Version              - doubles as a C and makefile include, identifies
                       version number
Rules                - generic build rules to be included by makefiles
Makefile.*           - system-specific makefiles
configure*           - script for generating *nix makefiles

sys/*/*              - hardware and software platform-specific code
asm/*/*              - optimized asm versions of some code, not used yet
asm/*/asm.h          - header specifying which functions are replaced
                       by asm
asm/i386/asmnames.h  - #defines to fix _ prefix brain damage on
                       DOS/Windows

main.c               - entry point, event handler...basically a mess
loader.c             - handles file io for rom and ram
emu.c                - another mess, basically the frame loop that
                       calls state.c
debug.c              - currently just cpu trace, eventually interactive
                       debugging
hw.c                 - interrupt generation, gamepad state, dma, etc.
mem.c                - memory mapper, read and write operations
fastmem.h            - short static functions that will inline for
                       fast memory io
regs.h               - macros for accessing hardware registers
save.c               - savestate handling

cpu.c                - main cpu emulation
cpuregs.h            - macros for cpu registers and flags
cpucore.h            - data tables for cpu emulation
asm/i386/cpu.s       - entire cpu core, rewritten in asm

fb.h                 - abstract framebuffer definition, extern from
                       platform-specifics
lcd.c                - main control of refresh procedure
lcd.h                - vram, palette, and internal structures for
                       refresh
asm/i386/lcd.s       - asm versions of a few critical functions
lcdc.c               - lcdc phase transitioning

input.h              - internal keycode definitions, etc.
keytables.c          - translations between key names and internal
                       keycodes
events.c             - event queue

[resource/config subsystem]
rc.h                 - structure defs
rccmds.c             - command parser/processor
rcvars.c             - variable exports and command to set rcvars
rckeys.c             - keybindings

path.c               - path searching
split.c              - general purpose code to split strings into
                       argv-style arrays


OVERVIEW OF PROGRAM FLOW

The initial entry point is main() in main.c, which processes the
command line, calls the system/video initialization routines, loads
the rom/sram, and passes control to the main loop in emu.c. Note that
the system-specific main() hook has been removed since it is not
needed.

There have been significant changes to gnuboy's main loop since the
original 0.8.0 release. The former state.c is no more, and the new
code that takes its place, in lcdc.c, is now called from the cpu loop.
This is slightly unfortunate for performance reasons, but it is
necessary to handle some strange special cases.

Still, unlike some emulators, gnuboy's main loop is not the cpu
emulation loop. Instead, a main loop in emu.c, which handles video
refresh, polling events, sleeping between frames, etc., calls
cpu_emulate, passing it an ideal number of cycles to run. The actual
number of cycles for which the cpu runs will vary slightly depending
on the length of the final instruction processed, but it should never
be more than 8 or 9 beyond the ideal cycle count passed, and the
actual number will be returned to the calling function in case it
needs this information. The cpu code now takes care of all timer and
lcdc events in its main loop, so the caller no longer needs to be
aware of such things.

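As a rough sketch of how a caller might use the returned cycle count:
the cpu_emulate below is only a stand-in that pretends every
instruction costs 6 cycles, and run_frames is a hypothetical driver,
not the loop found in emu.c. It subtracts each call's overshoot from
the next request so the long-run total stays on target.

```c
#include <assert.h>

/* Stand-in for the real cpu_emulate: runs whole fake instructions
 * (6 cycles each) until at least `cycles` have elapsed, then returns
 * the number of cycles actually used. */
int cpu_emulate(int cycles)
{
	int used = 0;

	assert(cycles > 0);
	while (used < cycles)
		used += 6;
	return used;
}

/* Hypothetical frame driver. `budget` is the ideal cycle count per
 * frame and must exceed the longest instruction. */
long run_frames(int nframes, int budget)
{
	long total = 0;
	int debt = 0;	/* overshoot left over from the previous call */
	int i;

	for (i = 0; i < nframes; i++)
	{
		int request = budget - debt;
		int used = cpu_emulate(request);

		debt = used - request;
		total += used;
		/* video refresh, event polling and sleeping go here */
	}
	return total;
}
```

With a budget of 100 the total after any number of frames never drifts
more than one instruction past nframes * 100.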
Note that all cycle counts are measured in CGB double speed MACHINE
cycles (2**21 Hz), NOT hardware clock cycles (2**23 Hz). This is
necessary because the cpu speed can be switched between single and
double speed during a single call to cpu_emulate. When running in
single speed or DMG mode, all instruction lengths are doubled.

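In code, the unit convention above amounts to something like the
following sketch. The macro and function names are invented for the
example; only the two frequencies and the doubling rule come from the
text.

```c
#include <assert.h>

#define HW_CLOCK_HZ (1L << 23)	/* hardware clock, CGB double speed */
#define CYCLE_HZ    (1L << 21)	/* unit used for all cycle counts */

/* Hypothetical helper: cost of an instruction in the common unit.
 * `machine_cycles` is its length in machine cycles at the current cpu
 * speed; `double_speed` is nonzero in CGB double speed mode. */
int cost_in_units(int machine_cycles, int double_speed)
{
	assert(machine_cycles > 0);
	return double_speed ? machine_cycles : machine_cycles * 2;
}
```

Because everything is expressed in one unit, a speed switch in the
middle of a call only changes the multiplier applied to subsequent
instructions.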
As for the LCDC state, things are much simpler now. No more huge
glorious state table, no more P/Q/R, just a couple of simple
functions. Aside from the number of cycles left before the next state
change, all the state information fits nicely in the locations the
Game Boy itself provides for it -- the LCDC, STAT, and LY registers.

If the special cases for the last line of VBLANK look strange to you,
good. There's some weird stuff going on here. According to documents
I've found, LY changes from 153 to 0 early in the last line, then
remains at 0 until the end of the first visible scanline. I don't
recall finding any roms that rely on this behavior, but I implemented

That covers the basics. As for flow of execution, here's a simplified
call tree that covers most of the significant function calls taking
place in normal operation:

  |_ loader_init                 loader.c
  | |_ div_advance               cpu.c *
  | |_ timer_advance             cpu.c *
  | |_ lcdc_advance              cpu.c *
  | | \_ lcdc_trans              lcdc.c
  | | |_ lcd_refreshline         lcd.c
  | | |_ stat_change             lcdc.c
  | | | \_ lcd_begin             lcd.c
  | | \_ stat_trigger            lcdc.c
  | \_ sound_advance             cpu.c *

(* included in cpu.c so they can inline; also in cpu.s)


MEMORY READ/WRITE MAP

Whenever possible, gnuboy avoids emulating memory reads and writes
with a function call. To this end, two pointer tables are kept -- one
for reading, the other for writing. They are indexed by bits 12-15 of
the address in Game Boy memory space, and yield a base pointer from
which the whole address can be used as an offset to access Game Boy
memory with no function calls whatsoever. For regions that cannot be
accessed without function calls, the pointer in the table is NULL.

For example, reading from address addr can be accomplished by testing
to make sure mbc.rmap[addr>>12] is not NULL, then simply reading
mbc.rmap[addr>>12][addr].

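Here is a small self-contained sketch of that lookup. The names
backing, rmap, map_region, readb and mem_read_slow are stand-ins for
this example; the real table lives in mbc.rmap and the real slow path
is mem_read. The one non-obvious detail is that each stored base
pointer is biased by the region's start address, which is what lets
the full address serve as the index.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

byte backing[0x20000];	/* storage; all pointers stay inside it */
byte *rmap[16];		/* one base pointer per 4k page */
int slow_reads;		/* counts slow-path reads, for the demo */

/* Slow path for pages with no direct mapping (io registers etc). */
byte mem_read_slow(unsigned addr)
{
	(void)addr;
	slow_reads++;
	return 0xFF;
}

/* Make `len` bytes at `data` appear at Game Boy address `start`. */
void map_region(unsigned start, unsigned len, byte *data)
{
	unsigned page;

	assert((start & 0xFFF) == 0 && (len & 0xFFF) == 0);
	assert(start + len <= 0x10000);
	/* The biased pointer must not leave the backing array. */
	assert(data >= backing + start);

	for (page = start >> 12; page < (start + len) >> 12; page++)
		rmap[page] = data - start;
}

/* Fast read: one table lookup, one NULL test, one indexed load. */
byte readb(unsigned addr)
{
	byte *base = rmap[(addr >> 12) & 0xF];

	return base ? base[addr] : mem_read_slow(addr);
}
```

Mapping a rom bank stored at backing + 0x10000 to address 0x4000 makes
readb(0x4123) return the bank's byte 0x123, while an unmapped address
such as 0xFF40 falls through to the function call.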
And for the disbelievers in this optimization, here are some numbers
to compare. First, FFL2 with memory tables disabled:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 28.69      0.57     0.57                             refresh_2
 13.17      0.84     0.26  4307863     0.06     0.06  mem_read
 11.63      1.07     0.23                             cpu_emulate

Now, with memory tables enabled:

 38.86      0.66     0.66                             refresh_2
  8.42      0.80     0.14   156380     0.91     0.91  spr_enum
  6.76      0.91     0.11   483134     0.24     1.31  lcdc_trans
  6.16      1.02     0.10                             cpu_emulate
  0.59      1.61     0.01   216497     0.05     0.05  mem_read

As you can see, not only does mem_read take up (proportionally) 1/20
as much time, since it is rarely called, but the main cpu loop in
cpu_emulate also runs considerably faster with all the function call
overhead and cache misses avoided.

These tests were performed on a K6-2/450 with the assembly cores
enabled; your mileage may vary. Regardless, however, I think it's
clear that using the address mapping tables is quite a worthwhile
optimization.


LCD RENDERING CORE DESIGN

The LCD core presently used in gnuboy is very much a high-level one,
performing the task of rasterizing scanlines as many independent steps
rather than one big loop, as is often seen in other emulators and the
original gnuboy LCD core. In some ways, this is a bit of a tradeoff --
there's a good deal of overhead in rebuilding the tile pattern cache
for roms that change their tile patterns frequently, such as full
motion video demos. Even so, I consider the method we're presently
using far superior to generating the output display directly from the
gameboy tiledata -- in the vast majority of roms, tiles are changed so
infrequently that the overhead is irrelevant. Even if the tiles are
changed rapidly, the only chance for overhead beyond what would be
present in a monolithic rendering loop lies in (host cpu) cache misses
and the possibility that we might (tile pattern) cache a tile that has
changed but that will never actually be used, or that will only be
used in one orientation (horizontally and vertically flipped versions
of all tiles are cached as well). Such tile caching issues could be
addressed in the long term if they cause a problem, but I don't see it
hurting performance too significantly at present. As for host cpu
cache miss issues, I find that putting multiple data decoding and
rendering steps together in a single loop harms performance much more
significantly than building a 256k (pattern) cache table, on account
of interfering with branch prediction, register allocation, and so on.

Well, with those justifications given, let's proceed to the steps
involved in rendering a scanline:

updatepatpix() - updates tile pattern cache.

tilebuf() - reads gb tile memory according to its complicated tile
addressing system which can be changed via the LCDC register, and
outputs nice linear arrays of the actual tile indices used in the
background and window on the present line.

Before continuing, let me explain the output format used by the
following functions. There is a byte array scan.buf, accessible by
macro as BUF, which is the output buffer for the line. The structure
of this array is simple: it is composed of 6 bpp gameboy color
numbers, where the bits 0-1 are the color number from the tile, bits
2-4 are the (cgb or dmg) palette index, and bit 5 is 0 for background
or window, 1 for sprite.

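Spelled out as code, the bit layout just described looks roughly like
this. The helper names (pix_pack and friends) are invented for the
example and do not appear in the gnuboy source; only the bit positions
are taken from the description above.

```c
#include <assert.h>

typedef unsigned char byte;

/* Build one 6 bpp value: bits 0-1 color, bits 2-4 palette index,
 * bit 5 set for sprites. */
byte pix_pack(int color, int pal, int is_sprite)
{
	assert(color >= 0 && color < 4);
	assert(pal >= 0 && pal < 8);
	return (byte)(color | (pal << 2) | (is_sprite ? 0x20 : 0));
}

/* Accessors for the three fields. */
int pix_color(byte p)  { return p & 3; }
int pix_pal(byte p)    { return (p >> 2) & 7; }
int pix_sprite(byte p) { return (p >> 5) & 1; }
```

For instance, color 3 of palette 5 on a sprite packs to 0x37.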
What is the justification for using a strange format like this, rather
than raw host color numbers for output? Well, believe it or not, it
improves performance. It's already necessary to have the gameboy color
numbers available for use in sprite priority. And, when running in
mono gb mode, building this output data is VERY fast -- it's just a
matter of doing 64 bit copies from the tile pattern cache to the
output buffer.

Furthermore, using a unified output format like this eliminates the
need to have separate rendering functions for each host color depth or
mode. We just call a one-line function to apply a palette to the
output buffer as we copy it to the video display, and we're done. And,
if you're not convinced about performance, just do some profiling.
You'll see that the vast majority of the graphics time is spent in the
one-line copy function (render_[124] depending on bytes per pixel),
even when using the fast asm versions of those routines. That is to
say, any overhead in the following functions is for all intents and
purposes irrelevant to performance. With that said, here they are:

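To give an idea of what such a copy function does, here is a sketch
for a host with 2 bytes per pixel. The name render_line_2 and its
signature are assumptions made for the example, not the actual
render_2 from the source. It assumes a 64-entry table mapping every
possible 6 bpp value to a host pixel.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

/* Copy `count` pixels from the 6 bpp line buffer `src` to the host
 * display at `dest`, translating each one through `pal`. */
void render_line_2(unsigned short *dest, const byte *src,
	const unsigned short *pal, size_t count)
{
	assert(dest && src && pal);
	while (count--)
		*dest++ = pal[*src++ & 63];
}
```

Versions for 1 and 4 bytes per pixel would differ only in the type of
dest and pal, which is why one intermediate format can serve every
host color depth.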
bg_scan() - expands the background layer to the output buffer.

wnd_scan() - expands the window layer.

spr_scan() - expands the sprites. Note that this requires spr_enum()
to have been called already to build a list of which sprites are
visible on the current scanline and sort them by priority.

It should be noted that the background and window functions also have
color counterparts, which are considerably slower due to merging of
palette data. At this point, they're staying down around 8% time
according to the profiler, so I don't see a major need to rewrite them
anytime soon. It should be considered, however, that a different
intermediate format could be used for gbc, or that asm versions of
these two routines could be written, in the long term.

Finally, some notes on palettes. You may be wondering why the 6 bpp
intermediate output can't be used directly on 256-color display
targets. After all, that would give a huge performance boost. The
problem, however, is that the gameboy palette can change midscreen,
whereas none of the presently targeted host systems can handle such a
thing, much less do it portably. For color roms, using our own
internal color mappings in addition to the host system palette is
essential. For details on how this is accomplished, read palette.c.

Now, in the long term, it MAY be possible to use the 6 bpp color
"almost" directly for mono roms. Note that I say almost. The idea is
this. Using the color number as an index into a table is slow. It
takes an extra read and causes various pipeline stalls depending on
the host cpu architecture. But, since there are relatively few
possible mono palettes, it may actually be possible to set up the host
palette in a clever way so as to cover all the possibilities, then use
some fancy arithmetic or bit-twiddling to convert without a lookup
table -- and this could presumably be done 4 pixels at a time with
32-bit operations. This area remains to be explored, but if it works,
it might end up being the last hurdle to getting realtime emulation
working on very low-end systems like the i486.

Rather than processing sound after every few instructions (and thus
killing cache locality), we update sound in big chunks. Yet this in no
way affects precise sound timing, because sound_mix is always called
before reading or writing a sound register, and at the end of

The main sound module interfaces with the system-specific code through
one structure, pcm, and a few functions: pcm_init, pcm_close, and
pcm_submit. While the first two should be obvious, pcm_submit needs
some explaining. Whenever realtime sound output is operational,
pcm_submit is responsible for timing, and should not return until it
has successfully processed all the data in its input buffer (pcm.buf).
On *nix sound devices, this typically means just waiting for the write
syscall to return, but on systems such as DOS where low level IO must
be handled in the program, pcm_submit needs to delay until the current
position in the DMA buffer has advanced sufficiently to make space for
the new samples, then copy them.

For special sound output implementations like write-to-file or the
dummy sound device, pcm_submit should write the data immediately and
return 0, indicating to the caller that other methods must be used for
timing. On real sound devices that are presently functional,
pcm_submit should return 1, regardless of whether it buffered or
actually wrote the sound data.

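As an illustration of the return value contract, a dummy sound device
could look something like the sketch below. The struct layout shown
here is an assumption made so the example stands alone (check the real
definition of pcm in the source before relying on any field), and the
body is only one plausible way to "write the data immediately".

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

/* Assumed layout, for illustration only. */
struct pcm
{
	int hz;		/* sample rate */
	int len;	/* capacity of buf in bytes */
	int pos;	/* bytes currently queued in buf */
	byte *buf;	/* samples waiting to be submitted */
};

struct pcm pcm;

/* Dummy device: throw the queued samples away at once and return 0,
 * telling the caller that it must handle timing some other way. */
int pcm_submit(void)
{
	assert(pcm.pos >= 0 && pcm.pos <= pcm.len);
	pcm.pos = 0;
	return 0;
}
```

A caller that gets 0 back would then fall back on its own timing, for
example by sleeping until the next frame is due, whereas a return of 1
means the sound device has already paced things.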
And yes, for unices without OSS, we hope to add piped audio output
soon. Perhaps the Sun audio device and a few others as well.


OPTIMIZED ASSEMBLY CODE

A lot can be said on this matter. Nothing has been said yet.

Apologies, there is no interactive debugger in gnuboy at present. I'm
still working out the design for it. In the long run, it should be
integrated with the rc subsystem, kinda like a cross between gdb and
Quake's ever-famous console. Whether it will require a terminal device
or support the graphical display remains to be determined.

In the meantime, you can use the debug trace code already
implemented. Just "set trace 1" from your gnuboy.rc or the command
line. Read debug.c for info on how to interpret the output, which is
condensed as much as possible and not quite self-explanatory.

On all systems on which it is available, the gnu compiler should
probably be used. Writing code specific to non-free compilers makes it
impossible for free software users to actively contribute. On the
other hand, compiler-specific code should always be kept to a minimum,
to make porting to or from non-gnu compilers easier.

Porting to new cpu architectures should not be necessary. Just make
sure you unset IS_LITTLE_ENDIAN in the makefiles to enable the big
endian default if the target system is big endian. If you do have
problems building on certain cpus, however, let us know. Eventually,
we will also want asm cpu and graphics code for popular host cpus, but
this can wait, since the c code should be sufficiently fast on most

The bulk of porting efforts will probably be spent on adding support
for new operating systems, and on systems with multiple video (or
sound, once that's implemented) architectures, new interfaces for
those. In general, the operating system interface code goes in a
directory under sys/ named for the os (e.g. sys/nix/ for *nix
systems), and display interfaces likewise go in their respective
directories under sys/ (e.g. sys/x11/ for the x window system

For guidelines in writing new system and display interface modules, I
recommend reading the files in the sys/dos/, sys/svga/, and sys/nix/
directories. These are some of the simpler versions (aside from the
tricky dos keyboard handling), as opposed to all the mess needed for

Also, please be aware that the existing system and display interface
modules are somewhat primitive; they are designed to be as quick and
sloppy as possible while still functioning properly. Eventually they
will be greatly improved.

Finally, remember your obligations under the GNU GPL. If you produce
any binaries that are compiled strictly from the source you received,
and you intend to release those, you *must* also release the exact
sources you used to produce those binaries. This is not pseudo-free
software like Snes9x where binaries usually appear before the latest
source, and where the source only compiles on one or two platforms;
this is true free software, and the source to all binaries always
needs to be available at the same time or sooner than the
corresponding binaries, if binaries are to be released at all. This of
course applies to all releases, not just new ports, but from
experience I find that ports people usually need the most reminding.

That's it for now. More info will eventually follow. Happy hacking!