source/libs/gmp/gmp-src/mpn/cray/README

Copyright 2000-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

Several issues reduce speed on Cray systems.  For systems with cfp
floating point, the main obstacle is forming 128-bit products.  For
IEEE systems, the main issue is addition, and in particular computing
the carry.  There is no vectorizing unsigned-less-than instruction, and
the sequence that implements that operation is very long.
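
To give a feel for the carry problem, here is a minimal C sketch that
derives the carry out of a limb addition from bitwise operations only,
which is one common way to emulate an unsigned less-than.  The name
add_limb_carry and the use of uint64_t limbs are assumptions made for
this example; the instruction sequence actually used on Cray systems
may look different.

```c
#include <stdint.h>

/* Hypothetical helper, not part of GMP: add two 64-bit limbs, store the
   sum, and return the carry out without an unsigned comparison.  */
static uint64_t
add_limb_carry (uint64_t *sum, uint64_t u, uint64_t v)
{
  uint64_t s = u + v;           /* wraps modulo 2^64 */

  *sum = s;
  /* There is a carry out of the top bit when both top bits are set, or
     when exactly one is set and the top bit of the sum is clear.  */
  return ((u & v) | ((u | v) & ~s)) >> 63;
}
```

The returned value equals (s < u), but it costs several logical
operations per limb instead of a single compare.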

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.
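
The reason shifting is easy can be seen in a plain C sketch: each
result limb depends on only two source limbs, so no carry chain links
the iterations.  The function name lshift_limbs is hypothetical, and
the sketch assumes 64-bit limbs, n >= 1, 0 < cnt < 64, and operands
that are either identical or do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP code: shift {up,n} left by cnt bits into
   {rp,n} and return the bits shifted out of the most significant limb.  */
static uint64_t
lshift_limbs (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  uint64_t out = up[n - 1] >> (64 - cnt);
  size_t i;

  /* Work from the high end so that an in-place shift reads each source
     limb before it is overwritten.  */
  for (i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return out;
}
```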

For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
in different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)
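
The table entries appear to follow a simple pattern, which the C
sketch below reproduces.  It rests on assumptions that this file does
not state: limbs are 64 bits, a part is ceil(64/parts) bits wide, and
sums of partial products stay exact only up to 48 bits (the mantissa
width of the traditional Cray floating-point format).  The struct and
function names are made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper, not part of GMP.  */
struct split_cost
{
  unsigned bits_per_part;   /* width of each part of V */
  uint64_t max_vn;          /* products that can be summed exactly */
  unsigned multiplies;      /* partial products per limb pair */
  double peak_cycles;       /* cycles/limb at two products per cycle */
};

static struct split_cost
split_cost (unsigned u_parts, unsigned v_parts)
{
  struct split_cost c;
  unsigned u_bits = (64 + u_parts - 1) / u_parts;   /* ceil (64 / u_parts) */

  c.bits_per_part = (64 + v_parts - 1) / v_parts;
  /* Assumed 48-bit exact window: the bits left over after one product
     give the number of products that can be accumulated.  */
  c.max_vn = (uint64_t) 1 << (48 - u_bits - c.bits_per_part);
  c.multiplies = u_parts * v_parts;
  c.peak_cycles = c.multiplies / 2.0;
  return c;
}
```

For instance, split_cost (2, 5) gives 13 bits/part, a max vn of 8, 10
multiplies and 5 peak cycles/limb, matching the second column of the
first table.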

IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1: add limb by limb, saving each carry-out in cy[].  */
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          cy[i + 1] = s < up[i];   /* test before the store; rp may be up */
          rp[i] = s; }
      more_carries = 0;
    /* Pass 2: add the saved carries, counting any new carries.  */
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
      cys = 0;
      /* Pass 3 (rare): ripple the carries that pass 2 generated.  */
      if (more_carries)
        {
          cys = rp[1] < cy[1];
          for (i = 2; i < n; i++)
            { c = rp[i] < cy[i];   /* carry generated by pass 2 at limb i */
              rp[i] += cys;
              cys = c | (rp[i] < cys); }
        }
      return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.
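
The three-operand add could be sketched in portable C as follows.
This is only an illustration under stated assumptions: 64-bit limbs,
n >= 1, and a caller-supplied carry array cy with room for n + 1
entries.  The name add3_n and its signature are hypothetical.  The
first two loops have no dependency between iterations; the final
ripple loop also checks whether the second loop wrapped a limb, so
that such a carry is not lost.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP code: {rp,n} = {ap,n} + {bp,n} + {cp,n}.
   Returns the carry out, which is 0, 1 or 2.  */
static uint64_t
add3_n (uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
        const uint64_t *cp, size_t n, unsigned char *cy)
{
  uint64_t s, t, c, cys = 0;
  size_t i, more_carries = 0;

  /* Pass 1: limb-wise sums; cy[i + 1] collects both carry-outs.  */
  for (i = 0; i < n; i++)
    {
      s = ap[i] + bp[i];
      t = s + cp[i];
      cy[i + 1] = (s < ap[i]) + (t < s);
      rp[i] = t;
    }

  /* Pass 2: add each saved carry to the next limb up.  */
  for (i = 1; i < n; i++)
    {
      s = rp[i] + cy[i];
      rp[i] = s;
      more_carries += s < cy[i];
    }

  /* Pass 3 (rare): ripple the carries generated by pass 2.  */
  if (more_carries)
    for (i = 1; i < n; i++)
      {
        c = rp[i] < cy[i];      /* did pass 2 wrap this limb?  */
        rp[i] += cys;
        cys = c | (rp[i] < cys);
      }
  return cys + cy[n];
}
```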

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
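
A minimal C sketch of why fewer bits per limb help with addition: when
each limb holds only 62 bits, the sum of two limbs plus a carry fits
in a 64-bit word, so the carry is read off with a shift rather than an
unsigned comparison.  The names add_n_62, LIMB_BITS and LIMB_MASK are
assumptions for this example, and the sketch makes no claim about
reaching the cycle counts quoted above.  (GMP calls such unused high
bits nails.)

```c
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 62                                  /* assumed payload bits */
#define LIMB_MASK ((UINT64_C (1) << LIMB_BITS) - 1)

/* Hypothetical sketch, not GMP code: add two n-limb numbers whose limbs
   each hold LIMB_BITS bits, and return the carry out.  */
static uint64_t
add_n_62 (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t s, cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      s = up[i] + vp[i] + cy;   /* at most 2^63 - 1, so it cannot wrap */
      rp[i] = s & LIMB_MASK;    /* keep the low 62 bits */
      cy = s >> LIMB_BITS;      /* the carry sits just above them */
    }
  return cy;
}
```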