doc/DONE

DONE

* Improve cycle time; attack the current critical path, which is in
  ME. First try switching from a tag-sequential cache to the more
  conventional way select (using a 32-bit 4-input mux). This gives us
  another stage to do the load alignment and sign-extension.
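
  A minimal behavioral sketch of the two pieces mentioned above: the
  way-select mux and the load alignment plus sign-extension that can
  move to the following stage. The function names, the tag/data list
  representation and the big-endian byte order are assumptions made
  for illustration, not taken from the RTL.

```python
def way_select(tags, data, lookup_tag):
    """Model of the 4-input mux: return the word of the way whose tag matches."""
    for way_tag, way_data in zip(tags, data):
        if way_tag == lookup_tag:
            return way_data
    return None  # no way matched: a miss


def align_and_extend(word, byte_offset, size, signed, big_endian=True):
    """Extract a size-byte field from a 32-bit word and extend it to 32 bits.

    big_endian=True is an assumption; pass False for little-endian lanes.
    """
    assert size in (1, 2, 4) and byte_offset % size == 0
    if big_endian:
        shift = 8 * (4 - size - byte_offset)
    else:
        shift = 8 * byte_offset
    mask = (1 << (8 * size)) - 1
    value = (word >> shift) & mask
    if signed and value >> (8 * size - 1):
        # Replicate the sign bit into the upper bits of the 32-bit result.
        value |= 0xFFFFFFFF & ~mask
    return value
```

  For example, a signed halfword load at byte offset 2 would first pick
  the word with way_select and then call
  align_and_extend(word, 2, 2, signed=True).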

* Implement D$ flushing!

* Implement a slew of performance counters, most specifically, count

  - I$ misses
  - D$ misses
  - SB full hazards
  - load-use hazards
  - branch hazards
  - waiting for mult hazards
  - waiting for div hazards
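
  As a rough illustration of what such counters record, here is a
  small software model. The event names and the PerfCounters class are
  hypothetical and only mirror the list above; the real counters would
  live in the RTL.

```python
from collections import Counter

# Hypothetical event names, one per counter in the list above.
EVENTS = ("icache_miss", "dcache_miss", "sb_full", "load_use",
          "branch", "mult_wait", "div_wait")


class PerfCounters:
    def __init__(self):
        self.counts = Counter({name: 0 for name in EVENTS})
        self.cycles = 0

    def tick(self, active):
        """Advance one cycle; `active` holds the events asserted in it."""
        self.cycles += 1
        for name in active:
            if name not in EVENTS:
                raise KeyError(name)
            self.counts[name] += 1
```

  Calling tick() once per simulated cycle keeps every counter relative
  to the same cycle count, so rates such as D$ misses per cycle can be
  derived afterwards.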

* Implemented DIV and DIVU

* Added a 16-bit async sram controller for LPRP.

* sram_ctrl.v: burst reads can now be issued back to back, fully
  saturating the memory.

* Substantially redid the memory protocol. It now supports pipelining
  and burst reads. To simplify, *all* reads are burst reads of a fixed
  burst length (currently 4) and *all* writes are single, non-burst.
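
  The shape of that protocol can be sketched with a small behavioral
  model. The BurstMemory class, word addressing, in-order return of
  read data and reading unwritten words as 0 are all assumptions made
  for the sketch; only the fixed burst length of 4 and the single-word
  writes come from the note above.

```python
from collections import deque

BURST_LENGTH = 4  # "currently 4" per the note above


class BurstMemory:
    def __init__(self):
        self.words = {}           # word address -> 32-bit value
        self.in_flight = deque()  # reads accepted but not yet answered

    def write(self, addr, value):
        """Single, non-burst write of one word."""
        self.words[addr] = value & 0xFFFFFFFF

    def issue_read(self, addr):
        """Accept a read request; several may be outstanding (pipelining)."""
        self.in_flight.append(addr)

    def next_burst(self):
        """Return the BURST_LENGTH words of the oldest outstanding read."""
        base = self.in_flight.popleft()
        return [self.words.get(base + i, 0) for i in range(BURST_LENGTH)]
```

  Issuing two reads before collecting any data models the pipelining:
  the second request is accepted while the first burst is still
  outstanding.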

* Rewrote and chopped off early history. George Orwell would have been
  proud.

* Separate the memory bus from the peripheral bus.

  Things get interesting with peripherals that need DMA-like access
  to memory, such as video interfaces.

  A consequence would be that Flash etc. would be on the peripheral
  bus, so if we wish to cache this, we'd have to distinguish IO from
  uncachable.
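
  One way to picture the distinction is an address map in which the
  bus and the cacheability of a region are separate attributes. The
  regions, base addresses and sizes below are entirely hypothetical
  and only illustrate the idea.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    base: int
    size: int
    bus: str         # "memory" or "peripheral"
    cacheable: bool


# Hypothetical address map, purely for illustration.
REGIONS = [
    Region("sram",  0x0000_0000, 0x0010_0000, "memory",     True),
    Region("flash", 0x1000_0000, 0x0040_0000, "peripheral", True),
    Region("uart",  0x2000_0000, 0x0000_0100, "peripheral", False),
]


def classify(addr):
    """Return the region containing addr, or raise if it is unmapped."""
    for region in REGIONS:
        if region.base <= addr < region.base + region.size:
            return region
    raise ValueError(f"unmapped address {addr:#010x}")
```

  With such a table, Flash can sit on the peripheral bus and still be
  cached, while true IO registers stay uncached.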

* Find a way to preload the RTL simulated SRAM (loading the kernel
  with the bootloader under RTL simulation takes *WAY* too long).



* Once a sufficient testing structure is in place, start replacing the
  pipeline with a high performance one. [DONE]
* Fix the arrays that can't be inferred as RAM. Notably the register
  file. [DONE]
* Restart the pipeline instead of stalling [DONE]
* Forward results instead of restarting [DONE]
* Move configuration parameters (such as cache size, etc) to a global
  configuration file. [DONE]
* Wizard-generated RAM blocks (to know exactly what I get) [DONE]
* Extend the forwarding in DE and use dual-port memory w/o bypassing
  [DONE]
* Handle uncached loads (uncached stores work as a side effect of the
  write-through cache).

  1. Move the x_res mux up to just after the data cache out and make
     sure that x_res falls through the shifter network when not
     loading [DONE]

  2. Add another memory event (uncached_load), make sure it doesn't
     accidentally write to the data cache array or tags.

  3. Make uncached loads kill the pipe and set up the uncached_load
     event.

  4. When the data for uncached_load comes in, thread it through the
     shifter network and restart the next instruction.

  5. Test by making all loads uncachable.
[DONE]
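
  Steps 2 to 5 can be pictured with a small software model. The
  MemStage class, its fields, the "fill" event name and the use of
  pc + 4 as the restart address are assumptions for the sketch; only
  the uncached_load event name comes from the list above.

```python
class MemStage:
    def __init__(self, cacheable):
        self.cacheable = cacheable  # hypothetical predicate: addr -> bool
        self.cache = {}             # word address -> data (array and tags)
        self.event = None           # pending memory event, if any
        self.log = []

    def load(self, pc, addr):
        if self.cacheable(addr) and addr in self.cache:
            return self.cache[addr]  # ordinary hit
        kind = "fill" if self.cacheable(addr) else "uncached_load"
        self.event = (kind, pc, addr)
        self.log.append("kill")      # step 3: kill the pipe
        return None

    def data_arrived(self, word):
        kind, pc, addr = self.event
        self.event = None
        if kind == "fill":
            self.cache[addr] = word  # step 2: only fills touch the cache
        # Step 4: in hardware the word would pass through the shifter
        # network here before the next instruction is restarted.
        self.log.append(f"restart@{pc + 4:#x}")
        return word
```

  Constructing it as MemStage(lambda addr: False) corresponds to
  step 5, where every load is treated as uncachable.
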
* Restart on load-use hazards. [DONE]
* Restart branches whose delay slots get delayed (i.e. due to cache
  misses) [DONE]

* Fix the current strange behavior of tinymon. Fault points to a
  serial port problem. Prefer a software workaround, as the
  peripherals are going to get an overhaul later anyway. [DONE - it
  was two actual bugs in the pipeline]
* Found a new home: git://repo.or.cz/yari.git

* Implement MULTU
* Implement MULT

* Matching IO behaviour like I do now is unsustainable. As a general
  principle, while cosimulating, higher level models could take the IO
  events from the lower levels (which can ultimately be the running
  FPGA).
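
  A minimal sketch of that principle: the lower-level side records the
  IO reads it performs, and the higher-level model replays them
  instead of modelling the devices itself. The RecordingIO and
  ReplayIO classes and the (address, value) trace format are
  hypothetical.

```python
class RecordingIO:
    """Lower-level side: performs the real IO read and records it."""

    def __init__(self, device_read):
        self.device_read = device_read  # callable: addr -> value
        self.trace = []                 # list of (addr, value) events

    def read(self, addr):
        value = self.device_read(addr)
        self.trace.append((addr, value))
        return value


class ReplayIO:
    """Higher-level side: answers IO reads from the recorded trace."""

    def __init__(self, trace):
        self.trace = iter(trace)

    def read(self, addr):
        rec_addr, value = next(self.trace)
        if rec_addr != addr:
            # The two models have diverged in their IO sequence.
            raise RuntimeError(
                f"IO mismatch: expected {rec_addr:#x}, got {addr:#x}")
        return value
```

  Checking the address on replay doubles as a cosimulation check: a
  mismatch means the two models issued different IO sequences.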