doc/PIPELINE_DESIGN

   1 YARI Pipeline Design, version 1
   2
   3 Assuming the classic five stage pipeline:
   4 - IF - instruction fetch,
   5 - DE - decode and register fetch,
   6 - EX - bypassing and execution,
   7 - ME - memory access, and
   8 - WB - results write-back.
   9
  10 Implementing this is trivial if you ignore hazards.
  11
  12 The main hazards we have to deal with are:
  13
  14 - WAR hazards - in the simplest pipeline this only happens for loads
  15   followed by an immediate use. The most obvious way to deal with this
  16   is stalling the use.
  17
  18 - Control transfers, such as branches (B and Bxx), jumps (J),
  19   subroutine calls (JAL and JRAL), etc. We'll call all of these
  20   branches for simplicity. MIPS adds the twist of delayed branches,
  21   meaning that the instruction immediately following the branch (the
  22   "delay slot") is executed before the control is transferred. If any
  23   of the instructions following the delay are present in the pipe they
  24   must be flushed.
  25
  26 - Memory busy / data cache (D$) misses. Again, the most obvious way to
  27   deal with this is to stall the pipe until the memory/cache is ready.
  28
  29 - Instruction cache (I$) misses.
  30
  31 Unfortunately, stalling the pipeline not only is very expensive (a
  32 global signal gating all flip-flops in the pipeline, and memory blocks
  33 can't be stalled so needs extra logic), but stalling for WAR hazards
  34 means introducing bubbles in the pipeline, which in turn makes dealing
  35 with the delay slot very complicated (branches must flush everything
  36 following the delay slot, but the delay slot can be in several
  37 locations depending on the history of stalls).
  38
  39 We thus reject stalling and instead restart the pipeline whenever we
  40 encounter a hazard. This generally means that means a higher cost of
  41 hazards in number of pipeline bubbles, but this is partially offset by
  42 a shorter cycle time and the fact that hazards doesn't happen all the
  43 time (given the probability of the hazard, you can work out much speed
  44 up you need to offset the additional hazard costs).
  45
  46
  47 Restarting upon hazards:
  48
  49 - I$ miss - not much choice here, the IF stage will emit bubbles until
  50   it's ready.
  51
  52 - Branches - to simplify the handling of delay slots we will require
  53   that a taken branch will never see a bubble in it's delay slot. If
  54   that is detected, we treat it as a different kind of hazard and
  55   restart the branch instruction, flushing everything following and
  56   including the branch itself. This trick only works because all
  57   branches in MIPS are idempotent. Otherwise, a taken branch flushes
  58   everything following the delay slot and restart from the branch
  59   target.
  60
  61 - D$ misses/memory busy - there are several options here, but they are
  62   somewhat dependent on how the ME stage is implemented.
  63
  64   The simplest option is to kill the pipeline while memory is
  65   busy/while the D$ fills and restart the instruction following the
  66   store/load. Unfortunately, as this hazard is fairly deep in the
  67   pipeline it pays the highest penalty for restarting as opposed to
  68   stalling.
  69
  70   A slightly better option is keep restarting the following
  71   instruction until we know that memory will be ready before that
  72   instruction propagates down to the ME stage.
  73
  74   We can further reduce store hazards by applying a small store buffer
  75   (~ FIFO). The caveat is of course that loads would either need to
  76   check against the store buffer or stall until it has drained.
  77
  78   Raising the D$ size and/or associativity will help lower the
  79   frequency of the load hazards. Another technique that can sometimes
  80   hide load latencies is non-blocking loads. Here the idea is simply
  81   to decouple the load pipeline, keeping track of which registers have
  82   outstanding loads, thus causes WAR hazards whenever a use of such a
  83   register is attempted. As data eventually arrives they can be
  84   written to the register file using a second write port (requiring
  85   two write ports to the register file), or they can wait from the
  86   write port to be otherwise idle and write back the loaded
  87   results. Alas, in practice for MIPS code there tends to be a fairly
  88   short distance between the load and its use, rendering this
  89   elaborate technique less valuable.
  90
  91 - WAR - simply restart (**) the use. This hazard is non-existent for
  92   properly scheduled code (assuming blocking caches).
  93
  94   (**) Alas, nothing is quite that simple. The instructions restarted
  95   in the WAR hazards could be delay-slots, and if so, would cause the
  96   effect of the branch to be ignored. To issue this restart, we must
  97   know if the instruction is in a delay-slot, and if it is, we restart
  98   the preceding instruction. (This again depends on the impotence of
  99   MIPS branches.)
 100
 101
 102 Choosing between hazards: as hazards are generally detected in
 103 detected in different stages, it's important to ensure that the hazard
 104 of the instruction closet to committing, that is, furthest along in
 105 the pipe, takes priority.
 106
 107
 108 Implementing the hazard handling
 109
 110 Intuitively we would want to restart as soon as a hazard can be
 111 detected, however to more stages that can issue a restart, the larger
 112 the priority arbitration as well larger the mux for the restart
 113 control (address and stages to flush). It will be better to focus on
 114 the frequent hazards (say, branches and D$ misses) and only accept
 115 restarts from the corresponding stages (say, EX and ME). A less
 116 important hazard as fx. WAR detected in DE, will have to ask EX to
 117 perform the restart. However, care must be taken to ensure than these
 118 pending hazards are handled correctly (for example, that no
 119 instruction that will be flushed by the pending restart can cause any
 120 hazards or other effects).
 121
 122 Naively implemented flushing a pipeline stage means clearing all the
 123 flip flops, turning instructions into nops. This is very expensive, so
 124 instead provide all stage with a "valid" bit. Any effect from a stage
 125 (hazards, memory ops, register writes etc) must be ignored if the
 126 stage is invalid (that is, it's a pipeline bubble).