/* arch/alpha/lib/ev6-copy_page.S */
/* The following comparison of this routine vs the normal copy_page.S
was written by an unnamed ev6 hardware designer and forwarded to me
via Steven Hobbs <hobbs@steven.zko.dec.com>.

First Problem: STQ overflows.
-----------------------------

It would be nice if EV6 handled every resource overflow efficiently,
but for some it doesn't, and store queue (STQ) overflows are among
them. An STQ overflow causes a trap and a restart of the pipe.

To get around this we sometimes use (to borrow a term from a VSSAD
researcher) "aeration". The idea is to slow the rate at which the
processor receives valid instructions by inserting nops in the fetch
path. In doing so, you can prevent the overflow and actually make
the code run faster. You can, of course, take advantage of the fact
that the processor can fetch at most 4 aligned instructions per cycle.

I inserted enough nops to force it to take 10 cycles to fetch the
loop code. In theory, EV6 should be able to execute this loop in
9 cycles, but I was not able to get it to run that fast: the initial
conditions were such that I could not reach this optimum rate on
(chaotic) EV6. I wrote the code such that everything would issue
in order.
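
The fetch arithmetic behind aeration can be sketched as follows. This is a rough model, not the designer's tooling: it uses only the stated limit of 4 aligned instructions fetched per cycle and ignores every other pipeline effect. The function names and the 28-instruction loop body in the usage note are hypothetical.

```python
import math

FETCH_WIDTH = 4  # aligned instructions fetched per cycle, as stated in the text


def fetch_cycles(n_instructions, fetch_width=FETCH_WIDTH):
    """Cycles needed just to fetch a block that starts on an aligned boundary."""
    return math.ceil(n_instructions / fetch_width)


def nops_to_fill(n_useful, target_cycles, fetch_width=FETCH_WIDTH):
    """Nops needed to pad a block out to exactly target_cycles fetch groups."""
    total_slots = target_cycles * fetch_width
    if n_useful > total_slots:
        raise ValueError("block already needs more than target_cycles to fetch")
    return total_slots - n_useful
```

For a hypothetical loop body of 28 useful instructions, `nops_to_fill(28, 10)` asks for 12 nops, and `fetch_cycles(28 + 12)` then gives the 10-cycle fetch time mentioned above.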

Second Problem: Dcache index matches.
-------------------------------------

If you are going to use this routine on randomly chosen aligned pages,
there is a 25% chance that the source and destination pages will be at
the same dcache indices. Without care, this results in many nasty
memory traps.

The solution is to schedule the prefetches to avoid the memory
conflicts. I schedule the wh64 prefetches farther ahead of the
read prefetches to avoid this problem.
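
One way to arrive at the 25% figure is sketched below. The cache geometry is an assumption that the text does not state: a 64 KB, 2-way set-associative dcache and 8 KB pages. Under that assumption each way spans 32 KB, so a page can land in one of 4 page-sized index ranges. The names are hypothetical, and `addr` stands for whichever address bits index the cache.

```python
DCACHE_BYTES = 64 * 1024  # assumed EV6 dcache capacity
DCACHE_WAYS = 2           # assumed associativity
PAGE_BYTES = 8 * 1024     # assumed Alpha page size

WAY_BYTES = DCACHE_BYTES // DCACHE_WAYS  # bytes covered by the index bits
PAGE_SLOTS = WAY_BYTES // PAGE_BYTES     # page-sized index ranges per way


def page_slot(addr):
    """Page-sized range of dcache indices that a page-aligned address maps to."""
    return (addr % WAY_BYTES) // PAGE_BYTES


def same_indices(src, dst):
    """True when two page-aligned addresses occupy the same dcache indices."""
    return page_slot(src) == page_slot(dst)


def collision_probability():
    """Chance that two independently placed pages share a slot."""
    return 1 / PAGE_SLOTS
```

With these assumed sizes, `collision_probability()` is 0.25, and pages at offsets 0x0000 and 0x8000 collide while pages at 0x0000 and 0x2000 do not.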

Third Problem: Needs more prefetching.
--------------------------------------

To improve the code, I added deeper prefetching to take fuller
advantage of EV6's bandwidth.

I also prefetched the read stream. Note that adding the read prefetch
forced me to add another cycle to the innermost kernel, up to 11
from the original 8 cycles per iteration. We could improve performance
further by unrolling the loop and doing multiple prefetches per cycle.

I think that the code below will be very robust and fast for the
purposes of copying aligned pages. It is slower than the normal
routine when both source and destination pages are in the dcache, but
it is my guess that this is less important than the dcache miss case. */

/* Prefetch 5 read cache lines; write-hint 10 cache lines. */
/* Main prefetching/write-hinting loop. */
/* This gives the extra cycle of aeration above the minimum. */
/* Prefetch the final 5 cache lines of the read stream. */
/* Non-prefetching, non-write-hinting cleanup loop for the
final 10 cache lines. */
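
The phases named in the comments above can be modelled as a schedule of events. This is a simplified sketch, not a transcription of the assembly: it assumes a page of 128 cache lines (8 KB pages, 64-byte lines), takes the distances of 5 and 10 lines from the comments, and picks an arbitrary order for the events inside one loop iteration. All names are hypothetical.

```python
LINES_PER_PAGE = 128  # assumed: 8 KB page / 64-byte cache line
READ_AHEAD = 5        # read prefetch distance, in cache lines
WRITE_AHEAD = 10      # write-hint distance, in cache lines


def copy_schedule(n_lines=LINES_PER_PAGE):
    """Return the ordered (op, line) events for copying one page."""
    events = []
    # Prologue: prefetch 5 read lines and write-hint 10 lines.
    events += [("prefetch", i) for i in range(READ_AHEAD)]
    events += [("hint", i) for i in range(WRITE_AHEAD)]
    # Main loop: copy line i while prefetching and hinting ahead of it.
    main_iters = n_lines - WRITE_AHEAD
    for i in range(main_iters):
        events.append(("prefetch", i + READ_AHEAD))
        events.append(("hint", i + WRITE_AHEAD))
        events.append(("copy", i))
    # Prefetch the final 5 lines of the read stream.
    events += [("prefetch", i) for i in range(main_iters + READ_AHEAD, n_lines)]
    # Cleanup loop: copy the final 10 lines with no prefetch or hint.
    events += [("copy", i) for i in range(main_iters, n_lines)]
    return events


def is_valid(events):
    """Each line is prefetched and hinted exactly once, both before its copy."""
    seen = {"prefetch": set(), "hint": set()}
    for op, line in events:
        if op == "copy":
            if line not in seen["prefetch"] or line not in seen["hint"]:
                return False
        elif line in seen[op]:
            return False
        else:
            seen[op].add(line)
    return True
```

Under these assumptions the main loop runs for 128 - 10 iterations and the cleanup loop handles the remaining 10 lines. `is_valid(copy_schedule())` checks that no line is copied before it has been both prefetched and write-hinted.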