   1 Copyright 2000-2005 Free Software Foundation, Inc.
   2
   3 This file is part of the GNU MP Library.
   4
   5 The GNU MP Library is free software; you can redistribute it and/or modify
   6 it under the terms of either:
   7
   8   * the GNU Lesser General Public License as published by the Free
   9     Software Foundation; either version 3 of the License, or (at your
  10     option) any later version.
  11
  12 or
  13
  14   * the GNU General Public License as published by the Free Software
  15     Foundation; either version 2 of the License, or (at your option) any
  16     later version.
  17
  18 or both in parallel, as here.
  19
  20 The GNU MP Library is distributed in the hope that it will be useful, but
  21 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  22 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
  23 for more details.
  24
  25 You should have received copies of the GNU General Public License and the
  26 GNU Lesser General Public License along with the GNU MP Library.  If not,
  27 see https://www.gnu.org/licenses/.
  28
  29
  30
  31                       IA-64 MPN SUBROUTINES
  32
  33
  34 This directory contains mpn functions for the IA-64 architecture.
  35
  36
  37 CODE ORGANIZATION
  38
  39         mpn/ia64          itanium-2, and generic ia64
  40
  41 The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
  42 chips were ever sold, and Itanium 2 is more powerful, so the latter is what
  43 we concentrate on.
  44
  45
  46
  47 CHIP NOTES
  48
  49 The IA-64 ISA packs instructions three at a time into 128-bit bundles.
  50 Programmers/compilers need to put explicit breaks `;;' when there are WAW or
  51 RAW dependencies, with some notable exceptions.  Such "breaks" are typically
  52 at the end of a bundle, but can be put between operations within some bundle
  53 types too.
  54
  55 The Itanium 1 and Itanium 2 implementations can under ideal conditions
  56 execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
  57 to do integer operations, while the Itanium 2 allows all 6 to be integer
  58 operations.
  59
  60 Taken cloop branches seem to insert a bubble into the pipeline most of the
  61 time on Itanium 1.
  62
  63 Loads to the fp registers bypass the L1 cache and thus get extremely long
  64 latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.
  65
  66 Software pipelining with the br.ctop instruction causes delays, since
  67 many issue slots are taken up by instructions with zero predicates, and
  68 since many extra instructions are needed to set things up.  These features
  69 are clearly designed for code density, not speed.
  70
  71 Misc pipeline limitations (Itanium 1):
  72 * The getf.sig instruction can only execute in M0.
  73 * At most four integer instructions/cycle.
  74 * Nops take up resources like any plain instructions.
  75
  76 Misc pipeline limitations (Itanium 2):
  77 * The getf.sig instruction can only execute in M0.
  78 * Nops take up resources like any plain instructions.
  79
  80
  81 ASSEMBLY SYNTAX
  82
  83 .align pads with nops in a text segment, but gas 2.14 and earlier
  84 incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
  85 it come out as break instructions.  We use the ALIGN() macro in
  86 mpn/ia64/ia64-defs.m4 when the padding might be executed across.  That macro
  87 suppresses any .align if the problem is detected by configure.  Lack of
  88 alignment might hurt performance but will at least be correct.
  89
  90 foo:: to create a global symbol is not accepted by gas.  Use separate
  91 ".global foo" and "foo:" instead.
  92
  93 .global is the standard global directive.  gas accepts .globl, but hpux "as"
  94 doesn't.
  95
  96 .proc / .endp generate the appropriate .type and .size information for ELF,
  97 so the latter directives don't need to be given explicitly.
  98
  99 .pred.rel "mutex"... is standard for annotating predicate register
 100 relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.
 101
 102 .pred directives can't be put on a line with a label, like
 103 ".Lfoo: .pred ...", since the HP assembler on HP-UX 11.23 rejects that.
 104 gas is happy with it, and past versions of the HP assembler had seemed ok.
 105
 106 // is the standard comment sequence, but we prefer "C" since it inhibits m4
 107 macro expansion.  See comments in ia64-defs.m4.
 108
 109
 110 REGISTER USAGE
 111
 112 Special:
 113    r0: constant 0
 114    r1: global pointer (gp)
 115    r8: return value
 116    r12: stack pointer (sp)
 117    r13: thread pointer (tp)
 118 Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
 119 Caller-saves but rotating: r32-
 120
 121
 122 ================================================================
 123 mpn_add_n, mpn_sub_n:
 124
 125 The current code runs at 1.25 c/l on Itanium 2.
 126
 127 ================================================================
 128 mpn_mul_1:
 129
 130 The current code runs at 2 c/l on Itanium 2.
 131
 132 Using a blocked approach, working off of 4 separate places in the operands,
 133 one could make use of the xma accumulation, and approach 1 c/l.
 134
 135         ldf8 [up]
 136         xma.l
 137         xma.hu
 138         stf8  [wrp]
 139
 140 ================================================================
 141 mpn_addmul_1:
 142
 143 The current code runs at 2 c/l on Itanium 2.
 144
 145 It seems possible to use a blocked approach, as with mpn_mul_1.  We should
 146 read rp[] to integer registers, allowing for just one getf.sig per cycle.
 147
 148         ld8  [rp]
 149         ldf8 [up]
 150         xma.l
 151         xma.hu
 152         getf.sig
 153         add+add+cmp+cmp
 154         st8  [wrp]
 155
 156 These 10 instructions can be scheduled to approach 1.667 c/l (10 insn at 6
 157 insn/cycle), and with the 4 cycle latency of xma, this means we need at
 158 least 3 blocks.  Using ldfp8 we could approach 1.583 c/l.
 159
 160 ================================================================
 161 mpn_submul_1:
 162
 163 The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
 164 ldfp8 with all the alignment headaches that implies.
 165
 166 ================================================================
 167 mpn_addmul_N
 168
 169 For best speed, we need to give up using mpn_addmul_2 as the main multiply
 170 building block, and instead take multiple v limbs per loop.  For the Itanium
 171 1, we need to take about 8 limbs at a time for full speed.  For the Itanium
 172 2, something like mpn_addmul_4 should be enough.
 173
 174 The add+cmp+cmp+add we use in the other routines is optimal for shortening
 175 recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
 176 recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
 177 better.
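
The trade-off above can be illustrated in C.  This is one possible reading
of the four-operation sequence, not code taken from the GMP sources, and the
names add_serial and add_select are hypothetical.  add_select spends two
adds and two compares, none of which depend on the incoming carry, so the
carry recurrence shrinks to a single select.  add_serial is an ordinary
carry chain shown only for contrast.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name for a 64-bit limb */

/* Plain carry chain: the second add and compare must wait for cin. */
limb_t add_serial(limb_t a, limb_t b, unsigned cin, unsigned *cout)
{
    assert(cin <= 1);
    limb_t s = a + b;
    unsigned c1 = s < a;        /* carry out of a + b   */
    limb_t r = s + cin;
    unsigned c2 = r < s;        /* carry out of s + cin */
    *cout = c1 | c2;            /* at most one of the two can be set */
    return r;
}

/* Short recurrence: both outcomes are ready before cin arrives. */
limb_t add_select(limb_t a, limb_t b, unsigned cin, unsigned *cout)
{
    assert(cin <= 1);
    limb_t s0 = a + b;          /* add: sum if cin == 0       */
    limb_t s1 = s0 + 1;         /* add: sum if cin == 1       */
    unsigned c0 = s0 < a;       /* cmp: carry-out if cin == 0 */
    unsigned c1 = s1 <= a;      /* cmp: carry-out if cin == 1 */
    *cout = cin ? c1 : c0;      /* only this step depends on cin */
    return cin ? s1 : s0;
}
```

On IA-64 the final select would presumably map onto predicated
instructions, which is an assumption here rather than something the text
above states.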
 178
 179 /* First load the 8 values from v */
 180         ldfp8           v0, v1 = [r35], 16;;
 181         ldfp8           v2, v3 = [r35], 16;;
 182         ldfp8           v4, v5 = [r35], 16;;
 183         ldfp8           v6, v7 = [r35], 16;;
 184
 185 /* In the inner loop, get a new U limb and store a result limb. */
 186         mov             lc = un
 187 Loop:   ldf8            u0 = [r33], 8
 188         ld8             r0 = [r32]
 189         xma.l           lp0 = v0, u0, hp0
 190         xma.hu          hp0 = v0, u0, hp0
 191         xma.l           lp1 = v1, u0, hp1
 192         xma.hu          hp1 = v1, u0, hp1
 193         xma.l           lp2 = v2, u0, hp2
 194         xma.hu          hp2 = v2, u0, hp2
 195         xma.l           lp3 = v3, u0, hp3
 196         xma.hu          hp3 = v3, u0, hp3
 197         xma.l           lp4 = v4, u0, hp4
 198         xma.hu          hp4 = v4, u0, hp4
 199         xma.l           lp5 = v5, u0, hp5
 200         xma.hu          hp5 = v5, u0, hp5
 201         xma.l           lp6 = v6, u0, hp6
 202         xma.hu          hp6 = v6, u0, hp6
 203         xma.l           lp7 = v7, u0, hp7
 204         xma.hu          hp7 = v7, u0, hp7
 205         getf.sig        l0 = lp0
 206         getf.sig        l1 = lp1
 207         getf.sig        l2 = lp2
 208         getf.sig        l3 = lp3
 209         getf.sig        l4 = lp4
 210         getf.sig        l5 = lp5
 211         getf.sig        l6 = lp6
 212         add+cmp+add     xx, l0, r0
 213         add+cmp+add     acc0, acc1, l1
 214         add+cmp+add     acc1, acc2, l2
 215         add+cmp+add     acc2, acc3, l3
 216         add+cmp+add     acc3, acc4, l4
 217         add+cmp+add     acc4, acc5, l5
 218         add+cmp+add     acc5, acc6, l6
 219         getf.sig        acc6 = lp7
 220         st8             [r32] = xx, 8
 221         br.cloop Loop
 222
 223         49 insn at max 6 insn/cycle:            8.167 cycles/limb8
 224         11 memops at max 2 memops/cycle:        5.5 cycles/limb8
 225         16 fpops at max 2 fpops/cycle:          8 cycles/limb8
 226         21 intops at max 4 intops/cycle:        5.25 cycles/limb8
 227         11+21 memops+intops at max 4/cycle:     8 cycles/limb8
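
The table above is a resource-bound calculation: each line divides an
operation count by an issue limit, and the largest quotient is the best
the loop can do.  A minimal C sketch of that arithmetic follows.  The struct
and function names are hypothetical, and the issue limits are simply the
ones quoted in the table, not figures checked against the processor
manuals.

```c
#include <assert.h>

/* Per-iteration operation counts (hypothetical helper type). */
struct loop_mix {
    int insns;    /* all instructions        */
    int memops;   /* loads, stores, getf.sig */
    int fpops;    /* xma.l / xma.hu          */
    int intops;   /* add and cmp             */
};

static double max2(double a, double b) { return a > b ? a : b; }

/* Lower bound on cycles per iteration, using the table's issue limits. */
double min_cycles(struct loop_mix m)
{
    assert(m.insns > 0);
    double bound = m.insns / 6.0;                       /* 6 insn/cycle   */
    bound = max2(bound, m.memops / 2.0);                /* 2 memops/cycle */
    bound = max2(bound, m.fpops / 2.0);                 /* 2 fpops/cycle  */
    bound = max2(bound, m.intops / 4.0);                /* 4 intops/cycle */
    bound = max2(bound, (m.memops + m.intops) / 4.0);   /* combined limit */
    return bound;
}
```

With the counts from the loop above (49, 11, 16, 21) the overall issue width
is the binding limit, at 49/6 cycles per iteration.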
 228
 229 ================================================================
 230 mpn_lshift, mpn_rshift
 231
 232 The current code runs at 1 cycle/limb on Itanium 2.
 233
 234 Using 63 separate loops, we could use the double-word shrp instruction.
 235 That instruction has a plain single-cycle latency.  We need 63 loops since
 236 this instruction only accepts an immediate count.  That would lead to a somewhat
 237 silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp
 238 each cycle plus shl/shr going down I1 for a further limb every second
 239 cycle).
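
To show what a double-word shift buys, here is a minimal C sketch of a
right shift built on a model of shrp.  The names shrp_model and ref_rshift
are hypothetical and this is not the GMP code.  In C the count is a run-time
argument; on IA-64 it would have to be an immediate, which is why one loop
per count from 1 to 63 would be needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name for a 64-bit limb */

/* Model of shrp: low 64 bits of the 128-bit value hi:lo shifted right
   by cnt.  Only valid here for 1 <= cnt <= 63. */
limb_t shrp_model(limb_t hi, limb_t lo, unsigned cnt)
{
    return (lo >> cnt) | (hi << (64 - cnt));
}

/* rp[0..n-1] = up[0..n-1] >> cnt, returning the bits shifted out of
   up[0], placed in the high end of the returned limb. */
limb_t ref_rshift(limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
    assert(n > 0 && cnt >= 1 && cnt <= 63);

    limb_t out = up[0] << (64 - cnt);
    for (size_t i = 0; i + 1 < n; i++)
        rp[i] = shrp_model(up[i + 1], up[i], cnt);   /* one op per limb */
    rp[n - 1] = up[n - 1] >> cnt;                    /* top limb */
    return out;
}
```

Each result limb below the top one costs a single shrp_model call, where a
version without the double-word shift needs a shl, a shr and an or.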
 240
 241 ================================================================
 242 mpn_copyi, mpn_copyd
 243
 244 The current code runs at 0.5 c/l on Itanium 2.  But that holds only for L1
 245 cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
 246 scheduling isn't great.  It might be best to actually use modulo scheduled
 247 loops, since that will allow us to do better load-use scheduling without too
 248 much unrolling.
 249
 250 Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
 251 2, according to tune/speed.  Cache bank conflicts?
 252
 253
 254
 255 REFERENCES
 256
 257 Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
 258 Intel documents 245317-004, 245318-004, 245319-004, October 2002.  Volume 1
 259 includes an Itanium optimization guide.
 260
 261 Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
 262 document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
 263 etc.
 264
 265 Intel Itanium Architecture Assembly Language Reference Guide, Intel document
 266 248801-004, 2000-2002.  Describes assembly instruction syntax and other
 267 directives.
 268
 269 Itanium Software Conventions and Runtime Architecture Guide, Intel document
 270 245358-003, May 2001.  Describes calling conventions, including stack
 271 unwinding requirements.
 272
 273 Intel Itanium Processor Reference Manual for Software Optimization, Intel
 274 document 245473-003, November 2001.
 275
 276 Intel Itanium 2 Processor Reference Manual for Software Development and
 277 Optimization, Intel document 251110-003, May 2004.
 278
 279 All the above documents can be found online at
 280
 281     http://developer.intel.com/design/itanium/manuals.htm