doc/insref.src

   1 \A{iref} x86 Instruction Reference
   2
   3 This appendix provides a complete list of the machine instructions
   4 which NASM will assemble, and a short description of the function of
   5 each one.
   6
   7 It is not intended to be an exhaustive documentation on the fine
   8 details of the instructions' function, such as which exceptions they
   9 can trigger: for such documentation, you should go to Intel's Web
  10 site, \W{http://developer.intel.com/design/Pentium4/manuals/}\c{http://developer.intel.com/design/Pentium4/manuals/}.
  11
  12 Instead, this appendix is intended primarily to provide
  13 documentation on the way the instructions may be used within NASM.
  14 For example, looking up \c{LOOP} will tell you that NASM allows
  15 \c{CX} or \c{ECX} to be specified as an optional second argument to
  16 the \c{LOOP} instruction, to enforce which of the two possible
  17 counter registers should be used if the default is not the one
  18 desired.
  19
  20 The instructions are not quite listed in alphabetical order, since
  21 groups of instructions with similar functions are lumped together in
  22 the same entry. Most of them don't move very far from their
  23 alphabetic position because of this.
  24
  25
  26 \H{iref-opr} Key to Operand Specifications
  27
  28 The instruction descriptions in this appendix specify their operands
  29 using the following notation:
  30
  31 \b Registers: \c{reg8} denotes an 8-bit \i{general purpose
  32 register}, \c{reg16} denotes a 16-bit general purpose register,
  33 \c{reg32} a 32-bit one and \c{reg64} a 64-bit one. \c{fpureg} denotes
  34 one of the eight FPU stack registers, \c{mmxreg} denotes one of the
  35 eight 64-bit MMX registers, and \c{segreg} denotes a segment register.
   36 \c{xmmreg} denotes one of the 8 (or, in x64 long mode, 16) SSE XMM registers.
  37 In addition, some registers (such as \c{AL}, \c{DX}, \c{ECX} or \c{RAX})
  38 may be specified explicitly.
  39
  40 \b Immediate operands: \c{imm} denotes a generic \i{immediate operand}.
  41 \c{imm8}, \c{imm16} and \c{imm32} are used when the operand is
  42 intended to be a specific size. For some of these instructions, NASM
  43 needs an explicit specifier: for example, \c{ADD ESP,16} could be
  44 interpreted as either \c{ADD r/m32,imm32} or \c{ADD r/m32,imm8}.
  45 NASM chooses the former by default, and so you must specify \c{ADD
   46 ESP,BYTE 16} for the latter. As a special case, \c{imm64} is allowed
   47 for particular x64 forms of the \c{MOV} instruction.
  48
  49 \b Memory references: \c{mem} denotes a generic \i{memory reference};
  50 \c{mem8}, \c{mem16}, \c{mem32}, \c{mem64} and \c{mem80} are used
  51 when the operand needs to be a specific size. Again, a specifier is
  52 needed in some cases: \c{DEC [address]} is ambiguous and will be
  53 rejected by NASM. You must specify \c{DEC BYTE [address]}, \c{DEC
  54 WORD [address]} or \c{DEC DWORD [address]} instead.
  55
  56 \b \i{Restricted memory references}: one form of the \c{MOV}
  57 instruction allows a memory address to be specified \e{without}
  58 allowing the normal range of register combinations and effective
  59 address processing. This is denoted by \c{memoffs8}, \c{memoffs16},
  60 \c{memoffs32} or \c{memoffs64}.
  61
  62 \b Register or memory choices: many instructions can accept either a
  63 register \e{or} a memory reference as an operand. \c{r/m8} is
  64 shorthand for \c{reg8/mem8}; similarly \c{r/m16} and \c{r/m32}.
   65 In legacy x86 modes, \c{r/m64} is MMX-related, and is shorthand for
  66 \c{mmxreg/mem64}. When utilizing the x86-64 architecture extension,
  67 \c{r/m64} denotes use of a 64-bit GPR as well, and is shorthand for
  68 \c{reg64/mem64}.
  69
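The \c{ADD ESP,16} ambiguity described above can be made concrete with a small sketch. It assumes \c{BITS 32} (so no operand-size prefix is emitted) and uses opcode extension \c{/0} for \c{ADD}, as given in Intel's manuals; the helper names are hypothetical and are not part of NASM.

```python
import struct

ESP = 4  # register value of ESP (see the register value list)

def modrm(mod, reg, rm):
    """Pack the mod, spare (register) and r/m fields into one ModR/M byte."""
    return (mod << 6) | (reg << 3) | rm

def add_r32_imm32(reg_value, imm):
    # ADD r/m32,imm32 -> 81 /0 id ; mod=3 selects a direct register operand
    return bytes([0x81, modrm(3, 0, reg_value)]) + struct.pack("<I", imm & 0xFFFFFFFF)

def add_r32_imm8(reg_value, imm):
    # ADD r/m32,imm8 -> 83 /0 ib ; the byte is sign-extended by the processor
    return bytes([0x83, modrm(3, 0, reg_value)]) + struct.pack("<b", imm)

long_form = add_r32_imm32(ESP, 16)    # what ADD ESP,16 denotes by default
short_form = add_r32_imm8(ESP, 16)    # what ADD ESP,BYTE 16 denotes

print(long_form.hex(" "))
print(short_form.hex(" "))
```

The two forms have the same effect; the \c{BYTE} form is simply three bytes shorter.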
  70
  71 \H{iref-opc} Key to Opcode Descriptions
  72
  73 This appendix also provides the opcodes which NASM will generate for
  74 each form of each instruction. The opcodes are listed in the
  75 following way:
  76
  77 \b A hex number, such as \c{3F}, indicates a fixed byte containing
  78 that number.
  79
  80 \b A hex number followed by \c{+r}, such as \c{C8+r}, indicates that
  81 one of the operands to the instruction is a register, and the
  82 `register value' of that register should be added to the hex number
  83 to produce the generated byte. For example, EDX has register value
  84 2, so the code \c{C8+r}, when the register operand is EDX, generates
  85 the hex byte \c{CA}. Register values for specific registers are
  86 given in \k{iref-rv}.
  87
  88 \b A hex number followed by \c{+cc}, such as \c{40+cc}, indicates
  89 that the instruction name has a condition code suffix, and the
  90 numeric representation of the condition code should be added to the
  91 hex number to produce the generated byte. For example, the code
  92 \c{40+cc}, when the instruction contains the \c{NE} condition,
  93 generates the hex byte \c{45}. Condition codes and their numeric
  94 representations are given in \k{iref-cc}.
  95
  96 \b A slash followed by a digit, such as \c{/2}, indicates that one
  97 of the operands to the instruction is a memory address or register
  98 (denoted \c{mem} or \c{r/m}, with an optional size). This is to be
  99 encoded as an effective address, with a \i{ModR/M byte}, an optional
 100 \i{SIB byte}, and an optional displacement, and the spare (register)
 101 field of the ModR/M byte should be the digit given (which will be
 102 from 0 to 7, so it fits in three bits). The encoding of effective
 103 addresses is given in \k{iref-ea}.
 104
 105 \b The code \c{/r} combines the above two: it indicates that one of
 106 the operands is a memory address or \c{r/m}, and another is a
 107 register, and that an effective address should be generated with the
 108 spare (register) field in the ModR/M byte being equal to the
 109 `register value' of the register operand. The encoding of effective
 110 addresses is given in \k{iref-ea}; register values are given in
 111 \k{iref-rv}.
 112
 113 \b The codes \c{ib}, \c{iw} and \c{id} indicate that one of the
 114 operands to the instruction is an immediate value, and that this is
 115 to be encoded as a byte, little-endian word or little-endian
 116 doubleword respectively.
 117
 118 \b The codes \c{rb}, \c{rw} and \c{rd} indicate that one of the
 119 operands to the instruction is an immediate value, and that the
 120 \e{difference} between this value and the address of the end of the
 121 instruction is to be encoded as a byte, word or doubleword
 122 respectively. Where the form \c{rw/rd} appears, it indicates that
 123 either \c{rw} or \c{rd} should be used according to whether assembly
 124 is being performed in \c{BITS 16} or \c{BITS 32} state respectively.
 125
 126 \b The codes \c{ow} and \c{od} indicate that one of the operands to
 127 the instruction is a reference to the contents of a memory address
 128 specified as an immediate value: this encoding is used in some forms
 129 of the \c{MOV} instruction in place of the standard
 130 effective-address mechanism. The displacement is encoded as a word
 131 or doubleword. Again, \c{ow/od} denotes that \c{ow} or \c{od} should
 132 be chosen according to the \c{BITS} setting.
 133
 134 \b The codes \c{o16} and \c{o32} indicate that the given form of the
 135 instruction should be assembled with operand size 16 or 32 bits. In
 136 other words, \c{o16} indicates a \c{66} prefix in \c{BITS 32} state,
 137 but generates no code in \c{BITS 16} state; and \c{o32} indicates a
 138 \c{66} prefix in \c{BITS 16} state but generates nothing in \c{BITS
 139 32}.
 140
 141 \b The codes \c{a16} and \c{a32}, similarly to \c{o16} and \c{o32},
 142 indicate the address size of the given form of the instruction.
 143 Where this does not match the \c{BITS} setting, a \c{67} prefix is
 144 required. Please note that \c{a16} is useless in long mode as
  145 16-bit addressing is deprecated on the x86-64 architecture extension.
 146
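As a rough illustration of the \c{+r}, \c{+cc} and \c{o16}/\c{o32} codes, the following sketch reproduces the two worked examples from the list above. The function names are hypothetical, and only one synonym per condition code is included.

```python
# Register values and condition codes as listed in this appendix.
REG32 = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3,
         "ESP": 4, "EBP": 5, "ESI": 6, "EDI": 7}
COND = {"O": 0, "NO": 1, "B": 2, "AE": 3, "E": 4, "NE": 5, "BE": 6, "A": 7,
        "S": 8, "NS": 9, "P": 10, "NP": 11, "L": 12, "GE": 13, "LE": 14, "G": 15}

def plus_r(base, reg):
    """Opcode byte for a code such as C8+r."""
    return base + REG32[reg]

def plus_cc(base, cond):
    """Opcode byte for a code such as 40+cc."""
    return base + COND[cond]

def operand_size_prefix(operand_bits, bits_setting):
    """Bytes produced by o16 (operand_bits=16) or o32 (operand_bits=32)
    under BITS 16 or BITS 32: a 66 prefix only when the two differ."""
    return b"" if operand_bits == bits_setting else b"\x66"

print(hex(plus_r(0xC8, "EDX")))           # 0xca
print(hex(plus_cc(0x40, "NE")))           # 0x45
print(operand_size_prefix(16, 32).hex())  # 66
```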
 147
 148 \S{iref-rv} Register Values
 149
 150 Where an instruction requires a register value, it is already
 151 implicit in the encoding of the rest of the instruction what type of
 152 register is intended: an 8-bit general-purpose register, a segment
 153 register, a debug register, an MMX register, or whatever. Therefore
 154 there is no problem with registers of different types sharing an
 155 encoding value.
 156
  157 Please note that, of the register classes listed below, the register
  158 extension (REX) classes require the REX prefix, which is only available
  159 in long mode on an x86-64 processor. This applies to every register
  160 numbered higher than 7, and also to \c{SPL}, \c{BPL}, \c{SIL} and \c{DIL}.
 161
 162 The encodings for the various classes of register are:
 163
 164 \b 8-bit general registers: \c{AL} is 0, \c{CL} is 1, \c{DL} is 2,
 165 \c{BL} is 3, \c{AH} is 4, \c{CH} is 5, \c{DH} is 6 and \c{BH} is
 166 7. Please note that \c{AH}, \c{BH}, \c{CH} and \c{DH} are not
 167 addressable when using the REX prefix in long mode.
 168
 169 \b 8-bit general register extensions (REX): \c{SPL} is 4, \c{BPL} is 5,
 170 \c{SIL} is 6, \c{DIL} is 7, \c{R8B} is 8, \c{R9B} is 9, \c{R10B} is 10,
 171 \c{R11B} is 11, \c{R12B} is 12, \c{R13B} is 13, \c{R14B} is 14 and
 172 \c{R15B} is 15.
 173
 174 \b 16-bit general registers: \c{AX} is 0, \c{CX} is 1, \c{DX} is 2,
 175 \c{BX} is 3, \c{SP} is 4, \c{BP} is 5, \c{SI} is 6, and \c{DI} is 7.
 176
 177 \b 16-bit general register extensions (REX): \c{R8W} is 8, \c{R9W} is 9,
  178 \c{R10W} is 10, \c{R11W} is 11, \c{R12W} is 12, \c{R13W} is 13, \c{R14W}
 179 is 14 and \c{R15W} is 15.
 180
 181 \b 32-bit general registers: \c{EAX} is 0, \c{ECX} is 1, \c{EDX} is
 182 2, \c{EBX} is 3, \c{ESP} is 4, \c{EBP} is 5, \c{ESI} is 6, and
 183 \c{EDI} is 7.
 184
 185 \b 32-bit general register extensions (REX): \c{R8D} is 8, \c{R9D} is 9,
 186 \c{R10D} is 10, \c{R11D} is 11, \c{R12D} is 12, \c{R13D} is 13, \c{R14D}
 187 is 14 and \c{R15D} is 15.
 188
 189 \b 64-bit general register extensions (REX): \c{RAX} is 0, \c{RCX} is 1,
 190 \c{RDX} is 2, \c{RBX} is 3, \c{RSP} is 4, \c{RBP} is 5, \c{RSI} is 6,
 191 \c{RDI} is 7, \c{R8} is 8, \c{R9} is 9, \c{R10} is 10, \c{R11} is 11,
 192 \c{R12} is 12, \c{R13} is 13, \c{R14} is 14 and \c{R15} is 15.
 193
 194 \b \i{Segment registers}: \c{ES} is 0, \c{CS} is 1, \c{SS} is 2, \c{DS}
 195 is 3, \c{FS} is 4, and \c{GS} is 5.
 196
 197 \b \I{floating-point, registers}Floating-point registers: \c{ST0}
 198 is 0, \c{ST1} is 1, \c{ST2} is 2, \c{ST3} is 3, \c{ST4} is 4,
 199 \c{ST5} is 5, \c{ST6} is 6, and \c{ST7} is 7.
 200
 201 \b 64-bit \i{MMX registers}: \c{MM0} is 0, \c{MM1} is 1, \c{MM2} is 2,
 202 \c{MM3} is 3, \c{MM4} is 4, \c{MM5} is 5, \c{MM6} is 6, and \c{MM7}
 203 is 7.
 204
 205 \b 128-bit \i{XMM (SSE) registers}: \c{XMM0} is 0, \c{XMM1} is 1,
 206 \c{XMM2} is 2, \c{XMM3} is 3, \c{XMM4} is 4, \c{XMM5} is 5, \c{XMM6} is
 207 6 and \c{XMM7} is 7.
 208
 209 \b 128-bit \i{XMM (SSE) register} extensions (REX): \c{XMM8} is 8,
 210 \c{XMM9} is 9, \c{XMM10} is 10, \c{XMM11} is 11, \c{XMM12} is 12,
 211 \c{XMM13} is 13, \c{XMM14} is 14 and \c{XMM15} is 15.
 212
 213 \b \i{Control registers}: \c{CR0} is 0, \c{CR2} is 2, \c{CR3} is 3,
 214 and \c{CR4} is 4.
 215
 216 \b \i{Control register} extensions: \c{CR8} is 8.
 217
 218 \b \i{Debug registers}: \c{DR0} is 0, \c{DR1} is 1, \c{DR2} is 2,
 219 \c{DR3} is 3, \c{DR6} is 6, and \c{DR7} is 7.
 220
 221 \b \i{Test registers}: \c{TR3} is 3, \c{TR4} is 4, \c{TR5} is 5,
 222 \c{TR6} is 6, and \c{TR7} is 7.
 223
 224 (Note that wherever a register name contains a number, that number
 225 is also the register value for that register.)
 226
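The closing note, that a number inside a register name is also its register value, can be sketched as follows. The helpers are hypothetical. Splitting a value into one extension bit and a three-bit field reflects how the values 8 to 15 are carried by the REX prefix; that split is an illustration and not a description of NASM's internals.

```python
import re

def numbered_register_value(name):
    """Register value of a register whose name contains a number,
    for example R10W, XMM12, ST3 or CR8."""
    match = re.search(r"\d+", name)
    if match is None:
        raise ValueError(f"{name} has no number in its name")
    return int(match.group())

def split_register_value(value):
    """Split a register value 0..15 into (extension bit, low three bits)."""
    return value >> 3, value & 7

for reg in ("ST3", "MM5", "R10W", "XMM12"):
    value = numbered_register_value(reg)
    print(reg, value, split_register_value(value))
```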
 227
 228 \S{iref-cc} \i{Condition Codes}
 229
 230 The available condition codes are given here, along with their
 231 numeric representations as part of opcodes. Many of these condition
 232 codes have synonyms, so several will be listed at a time.
 233
 234 In the following descriptions, the word `either', when applied to two
 235 possible trigger conditions, is used to mean `either or both'. If
 236 `either but not both' is meant, the phrase `exactly one of' is used.
 237
 238 \b \c{O} is 0 (trigger if the overflow flag is set); \c{NO} is 1.
 239
 240 \b \c{B}, \c{C} and \c{NAE} are 2 (trigger if the carry flag is
 241 set); \c{AE}, \c{NB} and \c{NC} are 3.
 242
 243 \b \c{E} and \c{Z} are 4 (trigger if the zero flag is set); \c{NE}
 244 and \c{NZ} are 5.
 245
 246 \b \c{BE} and \c{NA} are 6 (trigger if either of the carry or zero
 247 flags is set); \c{A} and \c{NBE} are 7.
 248
 249 \b \c{S} is 8 (trigger if the sign flag is set); \c{NS} is 9.
 250
 251 \b \c{P} and \c{PE} are 10 (trigger if the parity flag is set);
 252 \c{NP} and \c{PO} are 11.
 253
 254 \b \c{L} and \c{NGE} are 12 (trigger if exactly one of the sign and
 255 overflow flags is set); \c{GE} and \c{NL} are 13.
 256
 257 \b \c{LE} and \c{NG} are 14 (trigger if either the zero flag is set,
 258 or exactly one of the sign and overflow flags is set); \c{G} and
 259 \c{NLE} are 15.
 260
 261 Note that in all cases, the sense of a condition code may be
 262 reversed by changing the low bit of the numeric representation.
 263
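A minimal sketch of the condition code table and of the low-bit rule for reversing a condition. The names \c{CONDITION_CODES}, \c{invert} and \c{holds} are illustrative only; the trigger conditions are transcribed from the list above.

```python
# Numeric value -> synonyms, as listed in this section.
CONDITION_CODES = {
    0: ("O",), 1: ("NO",), 2: ("B", "C", "NAE"), 3: ("AE", "NB", "NC"),
    4: ("E", "Z"), 5: ("NE", "NZ"), 6: ("BE", "NA"), 7: ("A", "NBE"),
    8: ("S",), 9: ("NS",), 10: ("P", "PE"), 11: ("NP", "PO"),
    12: ("L", "NGE"), 13: ("GE", "NL"), 14: ("LE", "NG"), 15: ("G", "NLE"),
}
VALUE_OF = {name: value
            for value, names in CONDITION_CODES.items() for name in names}

def invert(name):
    """Reverse the sense of a condition by flipping the low bit of its value."""
    return CONDITION_CODES[VALUE_OF[name] ^ 1][0]

def holds(name, cf=0, pf=0, zf=0, sf=0, of=0):
    """Evaluate a condition code against a set of status flags."""
    value = VALUE_OF[name]
    # One base test per even-numbered code: O, B, E, BE, S, P, L, LE.
    base = [of, cf, zf, cf or zf, sf, pf, sf != of, zf or (sf != of)][value >> 1]
    # Odd-numbered codes are the negation of the even code below them.
    return bool(base) != bool(value & 1)

print(invert("NE"), invert("L"), invert("NBE"))
```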
 264 For details of when an instruction sets each of the status flags,
 265 see the individual instruction, plus the Status Flags reference
  266 in \k{iref-Flags}.
 267
 268
 269 \S{iref-SSE-cc} \i{SSE Condition Predicates}
 270
 271 The condition predicates for SSE comparison instructions are the
 272 codes used as part of the opcode, to determine what form of
 273 comparison is being carried out. In each case, the imm8 value is
 274 the final byte of the opcode encoding, and the predicate is the
 275 code used as part of the mnemonic for the instruction (equivalent
 276 to the "cc" in an integer instruction that used a condition code).
  277 The instructions that use this will give details of what the various
  278 mnemonics are; this table is intended to help you work out the details
  279 of what is happening.
 280
 281 \c Predi-  imm8  Description Relation where:   Emula- Result   QNaN
 282 \c  cate  Encod-             A Is 1st Operand  tion   if NaN   Signal
 283 \c         ing               B Is 2nd Operand         Operand  Invalid
 284 \c
 285 \c EQ     000B   equal       A = B                    False     No
 286 \c
 287 \c LT     001B   less-than   A < B                    False     Yes
 288 \c
 289 \c LE     010B   less-than-  A <= B                   False     Yes
 290 \c                or-equal
 291 \c
 292 \c ---    ----   greater     A > B             Swap   False     Yes
 293 \c               than                          Operands,
 294 \c                                             Use LT
 295 \c
 296 \c ---    ----   greater-    A >= B            Swap   False     Yes
 297 \c               than-or-equal                 Operands,
 298 \c                                             Use LE
 299 \c
 300 \c UNORD  011B   unordered   A, B = Unordered         True      No
 301 \c
 302 \c NEQ    100B   not-equal   A != B                   True      No
 303 \c
 304 \c NLT    101B   not-less-   NOT(A < B)               True      Yes
 305 \c               than
 306 \c
 307 \c NLE    110B   not-less-   NOT(A <= B)              True      Yes
 308 \c               than-or-
 309 \c               equal
 310 \c
 311 \c ---    ----   not-greater NOT(A > B)        Swap   True      Yes
 312 \c               than                          Operands,
 313 \c                                             Use NLT
 314 \c
 315 \c ---    ----   not-greater NOT(A >= B)       Swap   True      Yes
 316 \c               than-                         Operands,
 317 \c               or-equal                      Use NLE
 318 \c
  319 \c ORD    111B   ordered     A, B = Ordered           False     No
 320
 321 The unordered relationship is true when at least one of the two
 322 values being compared is a NaN or in an unsupported format.
 323
 324 Note that the comparisons which are listed as not having a predicate
 325 or encoding can only be achieved through software emulation, as
  326 described in the "emulation" column. Note in particular that a
  327 comparison such as \c{greater-than} is not the same as \c{NLE}, as,
 328 unlike with the \c{CMP} instruction, it has to take into account the
 329 possibility of one operand containing a NaN or an unsupported numeric
 330 format.
 331
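The difference between an emulated \c{greater-than} and \c{NLE} can be seen in a small model of the table. This is only a sketch: it uses Python floats, models just the truth value of each relation (not the result written by the real instructions), and ignores the invalid-operation signalling column. The names \c{compare} and \c{greater_than} are hypothetical.

```python
import math

def unordered(a, b):
    """True when at least one operand is a NaN."""
    return math.isnan(a) or math.isnan(b)

# imm8 encoding -> (predicate, relation), following the table above.
PREDICATES = {
    0: ("EQ", lambda a, b: a == b),
    1: ("LT", lambda a, b: a < b),
    2: ("LE", lambda a, b: a <= b),
    3: ("UNORD", unordered),
    4: ("NEQ", lambda a, b: not (a == b)),
    5: ("NLT", lambda a, b: not (a < b)),
    6: ("NLE", lambda a, b: not (a <= b)),
    7: ("ORD", lambda a, b: not unordered(a, b)),
}

def compare(imm8, a, b):
    """Truth value of the predicate with the given imm8 encoding."""
    return PREDICATES[imm8][1](a, b)

def greater_than(a, b):
    """Emulated A > B: swap the operands and use LT."""
    return compare(1, b, a)

nan = float("nan")
print(greater_than(nan, 1.0))  # False
print(compare(6, nan, 1.0))    # True: NLE disagrees when a NaN is involved
```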
 332
 333 \S{iref-Flags} \i{Status Flags}
 334
 335 The status flags provide some information about the result of the
 336 arithmetic instructions. This information can be used by conditional
  337 instructions (such as \c{Jcc} and \c{CMOVcc}) as well as by some of
 338 the other instructions (such as \c{ADC} and \c{INTO}).
 339
 340 There are 6 status flags:
 341
 342 \c CF - Carry flag.
 343
 344 Set if an arithmetic operation generates a
 345 carry or a borrow out of the most-significant bit of the result;
 346 cleared otherwise. This flag indicates an overflow condition for
 347 unsigned-integer arithmetic. It is also used in multiple-precision
 348 arithmetic.
 349
 350 \c PF - Parity flag.
 351
 352 Set if the least-significant byte of the result contains an even
 353 number of 1 bits; cleared otherwise.
 354
 355 \c AF - Adjust flag.
 356
 357 Set if an arithmetic operation generates a carry or a borrow
 358 out of bit 3 of the result; cleared otherwise. This flag is used
 359 in binary-coded decimal (BCD) arithmetic.
 360
 361 \c ZF - Zero flag.
 362
 363 Set if the result is zero; cleared otherwise.
 364
 365 \c SF - Sign flag.
 366
 367 Set equal to the most-significant bit of the result, which is the
 368 sign bit of a signed integer. (0 indicates a positive value and 1
 369 indicates a negative value.)
 370
 371 \c OF - Overflow flag.
 372
 373 Set if the integer result is too large a positive number or too
 374 small a negative number (excluding the sign-bit) to fit in the
 375 destination operand; cleared otherwise. This flag indicates an
 376 overflow condition for signed-integer (two's complement) arithmetic.
 377
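As an illustration of the six definitions above, the following sketch computes the status flags for an 8-bit addition. It is a simplified model with a hypothetical function name, intended only to make the definitions concrete; individual instructions may treat particular flags differently.

```python
def add8_flags(a, b, carry_in=0):
    """Add two 8-bit values and return (result, flags)."""
    raw = a + b + carry_in
    result = raw & 0xFF
    flags = {
        # Carry out of the most-significant bit.
        "CF": raw > 0xFF,
        # Even number of 1 bits in the least-significant byte.
        "PF": bin(result).count("1") % 2 == 0,
        # Carry out of bit 3.
        "AF": (a & 0x0F) + (b & 0x0F) + carry_in > 0x0F,
        "ZF": result == 0,
        # Copy of the most-significant bit.
        "SF": bool(result & 0x80),
        # Operands share a sign that the result does not have.
        "OF": bool(~(a ^ b) & (a ^ result) & 0x80),
    }
    return result, flags

print(add8_flags(0x7F, 0x01))
print(add8_flags(0xFF, 0x01))
```

Adding 1 to \c{0x7F} overflows as a signed addition but not as an unsigned one, and adding 1 to \c{0xFF} does the opposite.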
 378
 379 \S{iref-ea} Effective Address Encoding: \i{ModR/M} and \i{SIB}
 380
 381 An \i{effective address} is encoded in up to three parts: a ModR/M
 382 byte, an optional SIB byte, and an optional byte, word or doubleword
 383 displacement field.
 384
 385 The ModR/M byte consists of three fields: the \c{mod} field, ranging
 386 from 0 to 3, in the upper two bits of the byte, the \c{r/m} field,
 387 ranging from 0 to 7, in the lower three bits, and the spare
 388 (register) field in the middle (bit 3 to bit 5). The spare field is
 389 not relevant to the effective address being encoded, and either
 390 contains an extension to the instruction opcode or the register
 391 value of another operand.
 392
 393 The ModR/M system can be used to encode a direct register reference
 394 rather than a memory access. This is always done by setting the
 395 \c{mod} field to 3 and the \c{r/m} field to the register value of
 396 the register in question (it must be a general-purpose register, and
 397 the size of the register must already be implicit in the encoding of
 398 the rest of the instruction). In this case, the SIB byte and
 399 displacement field are both absent.
 400
 401 In 16-bit addressing mode (either \c{BITS 16} with no \c{67} prefix,
 402 or \c{BITS 32} with a \c{67} prefix), the SIB byte is never used.
 403 The general rules for \c{mod} and \c{r/m} (there is an exception,
 404 given below) are:
 405
 406 \b The \c{mod} field gives the length of the displacement field: 0
 407 means no displacement, 1 means one byte, and 2 means two bytes.
 408
 409 \b The \c{r/m} field encodes the combination of registers to be
 410 added to the displacement to give the accessed address: 0 means
 411 \c{BX+SI}, 1 means \c{BX+DI}, 2 means \c{BP+SI}, 3 means \c{BP+DI},
 412 4 means \c{SI} only, 5 means \c{DI} only, 6 means \c{BP} only, and 7
 413 means \c{BX} only.
 414
 415 However, there is a special case:
 416
 417 \b If \c{mod} is 0 and \c{r/m} is 6, the effective address encoded
 418 is not \c{[BP]} as the above rules would suggest, but instead
 419 \c{[disp16]}: the displacement field is present and is two bytes
 420 long, and no registers are added to the displacement.
 421
 422 Therefore the effective address \c{[BP]} cannot be encoded as
 423 efficiently as \c{[BX]}; so if you code \c{[BP]} in a program, NASM
 424 adds a notional 8-bit zero displacement, and sets \c{mod} to 1,
 425 \c{r/m} to 6, and the one-byte displacement field to 0.
 426
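The 16-bit rules, including the \c{[BP]} special case, can be sketched as a small encoder. The function \c{encode_ea16} is hypothetical; it picks the shortest displacement that fits and treats the one-byte displacement as signed, which is how the processor interprets it.

```python
import struct

# r/m field for each register combination in 16-bit addressing.
RM16 = {("BX", "SI"): 0, ("BX", "DI"): 1, ("BP", "SI"): 2, ("BP", "DI"): 3,
        ("SI",): 4, ("DI",): 5, ("BP",): 6, ("BX",): 7}

def encode_ea16(regs=(), disp=0, spare=0):
    """Return the ModR/M byte plus displacement for a 16-bit effective address.
    regs is a tuple of register names, or () for a plain [disp16]."""
    if not regs:
        # Special case: mod=0, r/m=6 means [disp16] with no registers.
        mod, rm = 0, 6
        tail = struct.pack("<H", disp & 0xFFFF)
    else:
        rm = RM16[tuple(regs)]
        if disp == 0 and rm != 6:
            mod, tail = 0, b""
        elif -128 <= disp <= 127:
            # [BP] with no displacement also lands here, with a zero byte.
            mod, tail = 1, struct.pack("<b", disp)
        else:
            mod, tail = 2, struct.pack("<H", disp & 0xFFFF)
    return bytes([(mod << 6) | (spare << 3) | rm]) + tail

print(encode_ea16(("BX",)).hex(" "))
print(encode_ea16(("BP",)).hex(" "))
print(encode_ea16((), 0x1234).hex(" "))
```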
 427 In 32-bit addressing mode (either \c{BITS 16} with a \c{67} prefix,
 428 or \c{BITS 32} with no \c{67} prefix) the general rules (again,
 429 there are exceptions) for \c{mod} and \c{r/m} are:
 430
 431 \b The \c{mod} field gives the length of the displacement field: 0
 432 means no displacement, 1 means one byte, and 2 means four bytes.
 433
 434 \b If only one register is to be added to the displacement, and it
 435 is not \c{ESP}, the \c{r/m} field gives its register value, and the
 436 SIB byte is absent. If the \c{r/m} field is 4 (which would encode
 437 \c{ESP}), the SIB byte is present and gives the combination and
 438 scaling of registers to be added to the displacement.
 439
 440 If the SIB byte is present, it describes the combination of
 441 registers (an optional base register, and an optional index register
 442 scaled by multiplication by 1, 2, 4 or 8) to be added to the
 443 displacement. The SIB byte is divided into the \c{scale} field, in
 444 the top two bits, the \c{index} field in the next three, and the
 445 \c{base} field in the bottom three. The general rules are:
 446
 447 \b The \c{base} field encodes the register value of the base
 448 register.
 449
 450 \b The \c{index} field encodes the register value of the index
 451 register, unless it is 4, in which case no index register is used
 452 (so \c{ESP} cannot be used as an index register).
 453
 454 \b The \c{scale} field encodes the multiplier by which the index
 455 register is scaled before adding it to the base and displacement: 0
 456 encodes a multiplier of 1, 1 encodes 2, 2 encodes 4 and 3 encodes 8.
 457
 458 The exceptions to the 32-bit encoding rules are:
 459
 460 \b If \c{mod} is 0 and \c{r/m} is 5, the effective address encoded
 461 is not \c{[EBP]} as the above rules would suggest, but instead
 462 \c{[disp32]}: the displacement field is present and is four bytes
 463 long, and no registers are added to the displacement.
 464
 465 \b If \c{mod} is 0, \c{r/m} is 4 (meaning the SIB byte is present)
 466 and \c{base} is 5, the effective address encoded is not
 467 \c{[EBP+index]} as the above rules would suggest, but instead
 468 \c{[disp32+index]}: the displacement field is present and is four
 469 bytes long, and there is no base register (but the index register is
 470 still processed in the normal way).
 471
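A comparable sketch for 32-bit addressing, covering the SIB byte and both exceptions listed above. Again \c{encode_ea32} is a hypothetical helper that chooses the shortest encoding it can; an assembler may make other choices in some cases.

```python
import struct

REG32 = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3,
         "ESP": 4, "EBP": 5, "ESI": 6, "EDI": 7}
SCALE = {1: 0, 2: 1, 4: 2, 8: 3}

def encode_ea32(base=None, index=None, scale=1, disp=0, spare=0):
    """Return ModR/M, optional SIB and displacement for a 32-bit address."""
    if index == "ESP":
        raise ValueError("ESP cannot be used as an index register")
    need_sib = index is not None or base == "ESP"
    if base is None:
        # Exceptions: mod=0 with r/m=5, or with SIB base=5, means disp32.
        mod, tail = 0, struct.pack("<I", disp & 0xFFFFFFFF)
        rm = 4 if need_sib else 5
    else:
        if disp == 0 and base != "EBP":
            mod, tail = 0, b""
        elif -128 <= disp <= 127:
            mod, tail = 1, struct.pack("<b", disp)
        else:
            mod, tail = 2, struct.pack("<I", disp & 0xFFFFFFFF)
        rm = 4 if need_sib else REG32[base]
    out = [(mod << 6) | (spare << 3) | rm]
    if need_sib:
        index_field = 4 if index is None else REG32[index]  # 4 = no index
        base_field = 5 if base is None else REG32[base]
        out.append((SCALE[scale] << 6) | (index_field << 3) | base_field)
    return bytes(out) + tail

print(encode_ea32(base="ESP").hex(" "))
print(encode_ea32(base="EBX", index="ESI", scale=4, disp=8).hex(" "))
```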
 472
 473 \S{iref-rex} Register Extensions: The \i{REX} Prefix
 474
  475 The Register Extensions prefix, or \i{REX} for short, is the means
  476 of accessing extended registers on the x86-64 architecture. \i{REX}
  477 is considered an instruction prefix, but it must come after
  478 all other prefixes and thus immediately before the first instruction
  479 opcode byte. \i{REX} can therefore be thought of as an "Opcode
  480 Prefix" instead. The \i{REX} prefix itself is indicated by a value
  481 of 0x4X, where X is one of 16 different combinations of the actual
  482 \i{REX} flags.
 483
  484 The \i{REX} prefix flags consist of four 1-bit extension fields.
 485 These flags are found in the lower nibble of the actual \i{REX}
 486 prefix opcode. Below is the list of \i{REX} prefix flags, from
 487 high bit to low bit.
 488
 489 \c{REX.W}: When set, this flag indicates the use of a 64-bit operand,
 490 as opposed to the default of using 32-bit operands as found in 32-bit
 491 Protected Mode.
 492
 493 \c{REX.R}: When set, this flag extends the \c{reg (spare)} field of
  494 the \c{ModRM} byte. Overall, this raises the number of addressable
 495 registers in this field from 8 to 16.
 496
 497 \c{REX.X}: When set, this flag extends the \c{index} field of the
  498 \c{SIB} byte. Overall, this raises the number of addressable
 499 registers in this field from 8 to 16.
 500
  501 \c{REX.B}: When set, this flag extends the \c{r/m} field of the
  502 \c{ModRM} byte or the \c{base} field of the \c{SIB} byte. This flag
  503 can also represent an extension to the opcode register \c{(+r)} field.
  504 Which of these applies depends on the instruction being encoded. Overall,
  505 this raises the number of addressable registers in these fields from 8 to 16.
 506
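The layout of the four flags can be illustrated by assembling a REX byte by hand. The sketch below uses hypothetical helper names and, as a worked case, the 64-bit register-to-register form of \c{ADD}, taken here to be \c{REX.W 01 /r} as in Intel's manuals; that form is an assumption of the example and is not listed in this appendix.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def split_register_value(value):
    """Split a register value 0..15 into (extension bit, low three bits)."""
    return value >> 3, value & 7

def add_r64_r64(dest, src):
    """Encode ADD dest,src for two 64-bit registers given by register value."""
    b, rm = split_register_value(dest)   # destination goes in the r/m field
    r, reg = split_register_value(src)   # source goes in the spare field
    modrm = (3 << 6) | (reg << 3) | rm   # mod=3: direct register reference
    return bytes([rex_prefix(w=1, r=r, b=b), 0x01, modrm])

R10, RAX, R8 = 10, 0, 8
print(add_r64_r64(R10, RAX).hex(" "))  # 49 01 c2
print(add_r64_r64(RAX, R8).hex(" "))   # 4c 01 c0
```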
  507 Internal use of the \i{REX} prefix by the processor is consistent,
 508 yet non-trivial. Most instructions use the \i{REX} prefix as
 509 indicated by the above flags. Some instructions require the \i{REX}
 510 prefix to be present even if the flags are empty. Some instructions
 511 default to a 64-bit operand and require the \i{REX} prefix only for
  512 actual register extensions, and thus ignore the \c{REX.W} field
 513 completely.
 514
 515 At any rate, NASM is designed to handle, and fully supports, the
 516 \i{REX} prefix internally. Please read the appropriate processor
 517 documentation for further information on the \i{REX} prefix.
 518
 519 You may have noticed that opcodes 0x40 through 0x4F are actually
 520 opcodes for the INC/DEC instructions for each General Purpose
 521 Register. This is, of course, correct... for legacy x86. While
 522 in long mode, opcodes 0x40 through 0x4F are reserved for use as
 523 the REX prefix. The other opcode forms of the INC/DEC instructions
 524 are used instead.
 525
 526
 527 \H{iref-flg} Key to Instruction Flags
 528
 529 Given along with each instruction in this appendix is a set of
 530 flags, denoting the type of the instruction. The types are as follows:
 531
 532 \b \c{8086}, \c{186}, \c{286}, \c{386}, \c{486}, \c{PENT} and \c{P6}
 533 denote the lowest processor type that supports the instruction. Most
 534 instructions run on all processors above the given type; those that
 535 do not are documented. The Pentium II contains no additional
 536 instructions beyond the P6 (Pentium Pro); from the point of view of
 537 its instruction set, it can be thought of as a P6 with MMX
 538 capability.
 539
 540 \b \c{3DNOW} indicates that the instruction is a 3DNow! one, and will
 541 run on the AMD K6-2 and later processors. ATHLON extensions to the
 542 3DNow! instruction set are documented as such.
 543
 544 \b \c{CYRIX} indicates that the instruction is specific to Cyrix
 545 processors, for example the extra MMX instructions in the Cyrix
 546 extended MMX instruction set.
 547
 548 \b \c{FPU} indicates that the instruction is a floating-point one,
 549 and will only run on machines with a coprocessor (automatically
 550 including 486DX, Pentium and above).
 551
 552 \b \c{KATMAI} indicates that the instruction was introduced as part
 553 of the Katmai New Instruction set. These instructions are available
 554 on the Pentium III and later processors. Those which are not
 555 specifically SSE instructions are also available on the AMD Athlon.
 556
 557 \b \c{MMX} indicates that the instruction is an MMX one, and will
 558 run on MMX-capable Pentium processors and the Pentium II.
 559
 560 \b \c{PRIV} indicates that the instruction is a protected-mode
 561 management instruction. Many of these may only be used in protected
 562 mode, or only at privilege level zero.
 563
 564 \b \c{SSE} and \c{SSE2} indicate that the instruction is a Streaming
 565 SIMD Extension instruction. These instructions operate on multiple
 566 values in a single operation. SSE was introduced with the Pentium III
 567 and SSE2 was introduced with the Pentium 4.
 568
 569 \b \c{UNDOC} indicates that the instruction is an undocumented one,
 570 and not part of the official Intel Architecture; it may or may not
 571 be supported on any given machine.
 572
 573 \b \c{WILLAMETTE} indicates that the instruction was introduced as
 574 part of the new instruction set in the Pentium 4 and Intel Xeon
 575 processors. These instructions are also known as SSE2 instructions.
 576
 577 \b \c{X64} indicates that the instruction was introduced as part of
 578 the new instruction set in the x86-64 architecture extension,
 579 commonly referred to as x64, AMD64 or EM64T.
 580
 581
 582 \H{iref-inst} x86 Instruction Set
 583
 584
 585 \S{insAAA} \i\c{AAA}, \i\c{AAS}, \i\c{AAM}, \i\c{AAD}: ASCII
 586 Adjustments
 587
 588 \c AAA                           ; 37                   [8086]
 589
 590 \c AAS                           ; 3F                   [8086]
 591
 592 \c AAD                           ; D5 0A                [8086]
 593 \c AAD imm                       ; D5 ib                [8086]
 594
 595 \c AAM                           ; D4 0A                [8086]
 596 \c AAM imm                       ; D4 ib                [8086]
 597
 598 These instructions are used in conjunction with the add, subtract,
 599 multiply and divide instructions to perform binary-coded decimal
 600 arithmetic in \e{unpacked} (one BCD digit per byte - easy to
 601 translate to and from \c{ASCII}, hence the instruction names) form.
 602 There are also packed BCD instructions \c{DAA} and \c{DAS}: see
 603 \k{insDAA}.
 604
 605 \b \c{AAA} (ASCII Adjust After Addition) should be used after a
 606 one-byte \c{ADD} instruction whose destination was the \c{AL}
 607 register: by means of examining the value in the low nibble of
 608 \c{AL} and also the auxiliary carry flag \c{AF}, it determines
 609 whether the addition has overflowed, and adjusts it (and sets
 610 the carry flag) if so. You can add long BCD strings together
 611 by doing \c{ADD}/\c{AAA} on the low digits, then doing
 612 \c{ADC}/\c{AAA} on each subsequent digit.
 613
 614 \b \c{AAS} (ASCII Adjust AL After Subtraction) works similarly to
 615 \c{AAA}, but is for use after \c{SUB} instructions rather than
 616 \c{ADD}.
 617
 618 \b \c{AAM} (ASCII Adjust AX After Multiply) is for use after you
 619 have multiplied two decimal digits together and left the result
 620 in \c{AL}: it divides \c{AL} by ten and stores the quotient in
 621 \c{AH}, leaving the remainder in \c{AL}. The divisor 10 can be
 622 changed by specifying an operand to the instruction: a particularly
 623 handy use of this is \c{AAM 16}, causing the two nibbles in \c{AL}
 624 to be separated into \c{AH} and \c{AL}.
 625
 626 \b \c{AAD} (ASCII Adjust AX Before Division) performs the inverse
 627 operation to \c{AAM}: it multiplies \c{AH} by ten, adds it to
 628 \c{AL}, and sets \c{AH} to zero. Again, the multiplier 10 can
 629 be changed.
 630
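The arithmetic performed by \c{AAM} and \c{AAD}, as described above, can be modelled in a few lines. This sketch uses hypothetical function names and ignores the effect on the flags.

```python
def aam(al, base=10):
    """Model of AAM: return (AH, AL) = (AL div base, AL mod base)."""
    return al // base, al % base

def aad(ah, al, base=10):
    """Model of AAD: return (AH, AL) = (0, AL + AH * base), kept to 8 bits."""
    return 0, (al + ah * base) & 0xFF

print(aam(7 * 9))       # (6, 3): the two unpacked digits of 63
print(aam(0xAB, 16))    # (10, 11): AAM 16 separates the two nibbles
print(aad(6, 3))        # (0, 63)
```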
 631
 632 \S{insADC} \i\c{ADC}: Add with Carry
 633
 634 \c ADC r/m8,reg8                 ; 10 /r                [8086]
 635 \c ADC r/m16,reg16               ; o16 11 /r            [8086]
 636 \c ADC r/m32,reg32               ; o32 11 /r            [386]
 637
 638 \c ADC reg8,r/m8                 ; 12 /r                [8086]
 639 \c ADC reg16,r/m16               ; o16 13 /r            [8086]
 640 \c ADC reg32,r/m32               ; o32 13 /r            [386]
 641
 642 \c ADC r/m8,imm8                 ; 80 /2 ib             [8086]
 643 \c ADC r/m16,imm16               ; o16 81 /2 iw         [8086]
 644 \c ADC r/m32,imm32               ; o32 81 /2 id         [386]
 645
 646 \c ADC r/m16,imm8                ; o16 83 /2 ib         [8086]
 647 \c ADC r/m32,imm8                ; o32 83 /2 ib         [386]
 648
 649 \c ADC AL,imm8                   ; 14 ib                [8086]
 650 \c ADC AX,imm16                  ; o16 15 iw            [8086]
 651 \c ADC EAX,imm32                 ; o32 15 id            [386]
 652
 653 \c{ADC} performs integer addition: it adds its two operands
 654 together, plus the value of the carry flag, and leaves the result in
 655 its destination (first) operand. The destination operand can be a
 656 register or a memory location. The source operand can be a register,
 657 a memory location or an immediate value.
 658
 659 The flags are set according to the result of the operation: in
 660 particular, the carry flag is affected and can be used by a
 661 subsequent \c{ADC} instruction.
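
     A typical use is multi-word addition. As a sketch, assuming 32-bit
     code and an arbitrary choice of registers, the 64-bit value in
     \c{ECX:EBX} could be added to the one in \c{EDX:EAX} like this:

     \c         add     eax,ebx         ; low doublewords; CF records any carry out
     \c         adc     edx,ecx         ; high doublewords plus that carry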
 662
 663 In the forms with an 8-bit immediate second operand and a longer
 664 first operand, the second operand is considered to be signed, and is
 665 sign-extended to the length of the first operand. In these cases,
 666 the \c{BYTE} qualifier is necessary to force NASM to generate this
 667 form of the instruction.
 668
 669 To add two numbers without also adding the contents of the carry
 670 flag, use \c{ADD} (\k{insADD}).
 671
 672
 673 \S{insADD} \i\c{ADD}: Add Integers
 674
 675 \c ADD r/m8,reg8                 ; 00 /r                [8086]
 676 \c ADD r/m16,reg16               ; o16 01 /r            [8086]
 677 \c ADD r/m32,reg32               ; o32 01 /r            [386]
 678
 679 \c ADD reg8,r/m8                 ; 02 /r                [8086]
 680 \c ADD reg16,r/m16               ; o16 03 /r            [8086]
 681 \c ADD reg32,r/m32               ; o32 03 /r            [386]
 682
 683 \c ADD r/m8,imm8                 ; 80 /0 ib             [8086]
 684 \c ADD r/m16,imm16               ; o16 81 /0 iw         [8086]
 685 \c ADD r/m32,imm32               ; o32 81 /0 id         [386]
 686
 687 \c ADD r/m16,imm8                ; o16 83 /0 ib         [8086]
 688 \c ADD r/m32,imm8                ; o32 83 /0 ib         [386]
 689
 690 \c ADD AL,imm8                   ; 04 ib                [8086]
 691 \c ADD AX,imm16                  ; o16 05 iw            [8086]
 692 \c ADD EAX,imm32                 ; o32 05 id            [386]
 693
 694 \c{ADD} performs integer addition: it adds its two operands
 695 together, and leaves the result in its destination (first) operand.
 696 The destination operand can be a register or a memory location.
 697 The source operand can be a register, a memory location or an
 698 immediate value.
 699
 700 The flags are set according to the result of the operation: in
 701 particular, the carry flag is affected and can be used by a
 702 subsequent \c{ADC} instruction.
 703
 704 In the forms with an 8-bit immediate second operand and a longer
 705 first operand, the second operand is considered to be signed, and is
 706 sign-extended to the length of the first operand. In these cases,
 707 the \c{BYTE} qualifier is necessary to force NASM to generate this
 708 form of the instruction.
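
     For instance, assuming 32-bit code, the short form might be
     requested as follows; the register and constants are arbitrary
     choices for the illustration.

     \c         add     ebx,byte 4      ; 8-bit immediate, sign-extended to 32 bits
     \c         add     ebx,byte -4     ; stored as 0xFC, sign-extended to 0xFFFFFFFC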
 709
 710
 711 \S{insADDPD} \i\c{ADDPD}: ADD Packed Double-Precision FP Values
 712
 713 \c ADDPD xmm1,xmm2/mem128        ; 66 0F 58 /r     [WILLAMETTE,SSE2]
 714
 715 \c{ADDPD} performs addition on each of two packed double-precision
 716 FP value pairs.
 717
 718 \c    dst[0-63]   := dst[0-63]   + src[0-63],
 719 \c    dst[64-127] := dst[64-127] + src[64-127].
 720
 721 The destination is an \c{XMM} register. The source operand can be
 722 either an \c{XMM} register or a 128-bit memory location.
 723
 724
 725 \S{insADDPS} \i\c{ADDPS}: ADD Packed Single-Precision FP Values
 726
 727 \c ADDPS xmm1,xmm2/mem128        ; 0F 58 /r        [KATMAI,SSE]
 728
 729 \c{ADDPS} performs addition on each of four packed single-precision
 730 FP value pairs.
 731
 732 \c    dst[0-31]   := dst[0-31]   + src[0-31],
 733 \c    dst[32-63]  := dst[32-63]  + src[32-63],
 734 \c    dst[64-95]  := dst[64-95]  + src[64-95],
 735 \c    dst[96-127] := dst[96-127] + src[96-127].
 736
 737 The destination is an \c{XMM} register. The source operand can be
 738 either an \c{XMM} register or a 128-bit memory location.
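
     A minimal sketch of the memory-operand form follows. The labels
     \c{vec_a} and \c{vec_b} are hypothetical, and each is assumed to
     name four single-precision values on a 16-byte boundary.

     \c         movaps  xmm0,[vec_a]    ; load four single-precision values
     \c         addps   xmm0,[vec_b]    ; add the four values at vec_b, lane by lane
     \c         movaps  [vec_a],xmm0    ; store the four sums back to vec_a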
 739
 740
 741 \S{insADDSD} \i\c{ADDSD}: ADD Scalar Double-Precision FP Values
 742
 743 \c ADDSD xmm1,xmm2/mem64         ; F2 0F 58 /r     [WILLAMETTE,SSE2]
 744
 745 \c{ADDSD} adds the low double-precision FP values from the source
 746 and destination operands and stores the double-precision FP result
 747 in the destination operand.
 748
 749 \c    dst[0-63]   := dst[0-63] + src[0-63],
 750 \c    dst[64-127] remains unchanged.
 751
 752 The destination is an \c{XMM} register. The source operand can be
 753 either an \c{XMM} register or a 64-bit memory location.
 754
 755
 756 \S{insADDSS} \i\c{ADDSS}: ADD Scalar Single-Precision FP Values
 757
 758 \c ADDSS xmm1,xmm2/mem32         ; F3 0F 58 /r     [KATMAI,SSE]
 759
 760 \c{ADDSS} adds the low single-precision FP values from the source
 761 and destination operands and stores the single-precision FP result
 762 in the destination operand.
 763
 764 \c    dst[0-31]   := dst[0-31] + src[0-31],
 765 \c    dst[32-127] remains unchanged.
 766
 767 The destination is an \c{XMM} register. The source operand can be
 768 either an \c{XMM} register or a 32-bit memory location.
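
     As an illustration, two scalars in memory could be summed as
     follows; \c{val_a} and \c{val_b} are hypothetical labels, each
     assumed to name one single-precision value.

     \c         movss   xmm0,[val_a]    ; low doubleword := val_a
     \c         addss   xmm0,[val_b]    ; low doubleword := val_a + val_b
     \c         movss   [val_a],xmm0    ; store the sum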
 769
 770
 771 \S{insAND} \i\c{AND}: Bitwise AND
 772
 773 \c AND r/m8,reg8                 ; 20 /r                [8086]
 774 \c AND r/m16,reg16               ; o16 21 /r            [8086]
 775 \c AND r/m32,reg32               ; o32 21 /r            [386]
 776
 777 \c AND reg8,r/m8                 ; 22 /r                [8086]
 778 \c AND reg16,r/m16               ; o16 23 /r            [8086]
 779 \c AND reg32,r/m32               ; o32 23 /r            [386]
 780
 781 \c AND r/m8,imm8                 ; 80 /4 ib             [8086]
 782 \c AND r/m16,imm16               ; o16 81 /4 iw         [8086]
 783 \c AND r/m32,imm32               ; o32 81 /4 id         [386]
 784
 785 \c AND r/m16,imm8                ; o16 83 /4 ib         [8086]
 786 \c AND r/m32,imm8                ; o32 83 /4 ib         [386]
 787
 788 \c AND AL,imm8                   ; 24 ib                [8086]
 789 \c AND AX,imm16                  ; o16 25 iw            [8086]
 790 \c AND EAX,imm32                 ; o32 25 id            [386]
 791
 792 \c{AND} performs a bitwise AND operation between its two operands
 793 (i.e. each bit of the result is 1 if and only if the corresponding
 794 bits of the two inputs were both 1), and stores the result in the
 795 destination (first) operand. The destination operand can be a
 796 register or a memory location. The source operand can be a register,
 797 a memory location or an immediate value.
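
     For example, the low nibble of a byte might be isolated like this
     (the value \c{0x5A} is arbitrary):

     \c         mov     al,0x5A         ; arbitrary example value
     \c         and     al,0x0F         ; keep bits 0-3 only: AL = 0x0A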
 798
 799 In the forms with an 8-bit immediate second operand and a longer
 800 first operand, the second operand is considered to be signed, and is
 801 sign-extended to the length of the first operand. In these cases,
 802 the \c{BYTE} qualifier is necessary to force NASM to generate this
 803 form of the instruction.
 804
 805 The \c{MMX} instruction \c{PAND} (see \k{insPAND}) performs the same
 806 operation on the 64-bit \c{MMX} registers.
 807
 808
 809 \S{insANDNPD} \i\c{ANDNPD}: Bitwise Logical AND NOT of
 810 Packed Double-Precision FP Values
 811
 812 \c ANDNPD xmm1,xmm2/mem128       ; 66 0F 55 /r     [WILLAMETTE,SSE2]
 813
 814 \c{ANDNPD} inverts the bits of the two double-precision
 815 floating-point values in the destination register, and then
 816 performs a logical AND between the two double-precision
 817 floating-point values in the source operand and the temporary
 818 inverted result, storing the result in the destination register.
 819
 820 \c    dst[0-63]   := src[0-63]   AND NOT dst[0-63],
 821 \c    dst[64-127] := src[64-127] AND NOT dst[64-127].
 822
 823 The destination is an \c{XMM} register. The source operand can be
 824 either an \c{XMM} register or a 128-bit memory location.
 825
 826
 827 \S{insANDNPS} \i\c{ANDNPS}: Bitwise Logical AND NOT of
 828 Packed Single-Precision FP Values
 829
 830 \c ANDNPS xmm1,xmm2/mem128       ; 0F 55 /r        [KATMAI,SSE]
 831
 832 \c{ANDNPS} inverts the bits of the four single-precision
 833 floating-point values in the destination register, and then
 834 performs a logical AND between the four single-precision
 835 floating-point values in the source operand and the temporary
 836 inverted result, storing the result in the destination register.
 837
 838 \c    dst[0-31]   := src[0-31]   AND NOT dst[0-31],
 839 \c    dst[32-63]  := src[32-63]  AND NOT dst[32-63],
 840 \c    dst[64-95]  := src[64-95]  AND NOT dst[64-95],
 841 \c    dst[96-127] := src[96-127] AND NOT dst[96-127].
 842
 843 The destination is an \c{XMM} register. The source operand can be
 844 either an \c{XMM} register or a 128-bit memory location.
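
     Because it is the destination that gets inverted, a mask normally
     goes in the destination register. The sketch below clears the sign
     bits of four single-precision values, giving their absolute values;
     \c{sign_mask} and \c{values} are hypothetical labels, assumed to be
     16-byte aligned.

     \c         movaps  xmm1,[sign_mask] ; assumed: each doubleword = 0x80000000
     \c         andnps  xmm1,[values]    ; xmm1 := values AND NOT sign_mask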
 845
 846
 847 \S{insANDPD} \i\c{ANDPD}: Bitwise Logical AND For Double FP
 848
 849 \c ANDPD xmm1,xmm2/mem128        ; 66 0F 54 /r     [WILLAMETTE,SSE2]
 850
 851 \c{ANDPD} performs a bitwise logical AND of the two double-precision
 852 floating-point values in the source and destination operands, and
 853 stores the result in the destination register.
 854
 855 \c    dst[0-63]   := src[0-63]   AND dst[0-63],
 856 \c    dst[64-127] := src[64-127] AND dst[64-127].
 857
 858 The destination is an \c{XMM} register. The source operand can be
 859 either an \c{XMM} register or a 128-bit memory location.
 860
 861
 862 \S{insANDPS} \i\c{ANDPS}: Bitwise Logical AND For Single FP
 863
 864 \c ANDPS xmm1,xmm2/mem128        ; 0F 54 /r        [KATMAI,SSE]
 865
 866 \c{ANDPS} performs a bitwise logical AND of the four single-precision
 867 floating-point values in the source and destination operands, and
 868 stores the result in the destination register.
 869
 870 \c    dst[0-31]   := src[0-31]   AND dst[0-31],
 871 \c    dst[32-63]  := src[32-63]  AND dst[32-63],
 872 \c    dst[64-95]  := src[64-95]  AND dst[64-95],
 873 \c    dst[96-127] := src[96-127] AND dst[96-127].
 874
 875 The destination is an \c{XMM} register. The source operand can be
 876 either an \c{XMM} register or a 128-bit memory location.
 877
 878
 879 \S{insARPL} \i\c{ARPL}: Adjust RPL Field of Selector
 880
 881 \c ARPL r/m16,reg16              ; 63 /r                [286,PRIV]
 882
 883 \c{ARPL} expects its two word operands to be segment selectors. It
 884 adjusts the \i\c{RPL} (requested privilege level - stored in the bottom
 885 two bits of the selector) field of the destination (first) operand
 886 to ensure that it is no less than (i.e. no more privileged than) the
 887 \c{RPL} field of the source operand. The zero flag is set if and only
 888 if a change had to be made.
 889
 890
 891 \S{insBOUND} \i\c{BOUND}: Check Array Index against Bounds
 892
 893 \c BOUND reg16,mem               ; o16 62 /r            [186]
 894 \c BOUND reg32,mem               ; o32 62 /r            [386]
 895
 896 \c{BOUND} expects its second operand to point to an area of memory
 897 containing two signed values of the same size as its first operand
 898 (i.e. two words for the 16-bit form; two doublewords for the 32-bit
 899 form). It performs two signed comparisons: if the value in the
 900 register passed as its first operand is less than the first of the
 901 in-memory values, or is greater than the second, it throws a
 902 \c{BR} exception. Otherwise, it does nothing.
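
     A minimal sketch, assuming 32-bit code, is shown below. The label
     \c{limits} and the bound values are hypothetical choices made for
     the example.

     \c         section .data
     \c limits: dd      0               ; first value: lower bound
     \c         dd      99              ; second value: upper bound
     \c
     \c         section .text
     \c         bound   ebx,[limits]    ; raises BR if EBX falls outside the stored bounds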
 903
 904
 905 \S{insBSF} \i\c{BSF}, \i\c{BSR}: Bit Scan
 906
 907 \c BSF reg16,r/m16               ; o16 0F BC /r         [386]
 908 \c BSF reg32,r/m32               ; o32 0F BC /r         [386]
 909
 910 \c BSR reg16,r/m16               ; o16 0F BD /r         [386]
 911 \c BSR reg32,r/m32               ; o32 0F BD /r         [386]
 912
 913 \b \c{BSF} searches for the least significant set bit in its source
 914 (second) operand, and if it finds one, stores the index in
 915 its destination (first) operand. If no set bit is found, the
 916 contents of the destination operand are undefined. If the source
 917 operand is zero, the zero flag is set.
 918
 919 \b \c{BSR} performs the same function, but searches from the top
 920 instead, so it finds the most significant set bit.
 921
 922 Bit indices are from 0 (least significant) to 15 or 31 (most
 923 significant). The destination operand can only be a register.
 924 The source operand can be a register or a memory location.
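
     Since the destination is undefined when no bit is set, code will
     usually test the zero flag before using the result. A sketch,
     with hypothetical label names:

     \c scan:   bsf     ecx,eax         ; ECX := index of the lowest set bit in EAX
     \c         jz      .none           ; ZF set: EAX was zero, so ECX is undefined
     \c         shr     eax,cl          ; example use: move that bit down to bit 0
     \c .none:  ret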
 925
 926
 927 \S{insBSWAP} \i\c{BSWAP}: Byte Swap
 928
 929 \c BSWAP reg32                   ; o32 0F C8+r          [486]
 930
 931 \c{BSWAP} swaps the order of the four bytes of a 32-bit register:
 932 bits 0-7 exchange places with bits 24-31, and bits 8-15 swap with
 933 bits 16-23. There is no explicit 16-bit equivalent: to byte-swap
 934 \c{AX}, \c{BX}, \c{CX} or \c{DX}, \c{XCHG} can be used. When \c{BSWAP}
 935 is used with a 16-bit register, the result is undefined.
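
     For example, with an arbitrary starting value:

     \c         mov     eax,0x12345678
     \c         bswap   eax             ; EAX = 0x78563412
     \c         xchg    al,ah           ; swap the two bytes of AX: EAX = 0x78561234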
 936
 937
 938 \S{insBT} \i\c{BT}, \i\c{BTC}, \i\c{BTR}, \i\c{BTS}: Bit Test
 939
 940 \c BT r/m16,reg16                ; o16 0F A3 /r         [386]
 941 \c BT r/m32,reg32                ; o32 0F A3 /r         [386]
 942 \c BT r/m16,imm8                 ; o16 0F BA /4 ib      [386]
 943 \c BT r/m32,imm8                 ; o32 0F BA /4 ib      [386]
 944
 945 \c BTC r/m16,reg16               ; o16 0F BB /r         [386]
 946 \c BTC r/m32,reg32               ; o32 0F BB /r         [386]
 947 \c BTC r/m16,imm8                ; o16 0F BA /7 ib      [386]
 948 \c BTC r/m32,imm8                ; o32 0F BA /7 ib      [386]
 949
 950 \c BTR r/m16,reg16               ; o16 0F B3 /r         [386]
 951 \c BTR r/m32,reg32               ; o32 0F B3 /r         [386]
 952 \c BTR r/m16,imm8                ; o16 0F BA /6 ib      [386]
 953 \c BTR r/m32,imm8                ; o32 0F BA /6 ib      [386]
 954
 955 \c BTS r/m16,reg16               ; o16 0F AB /r         [386]
 956 \c BTS r/m32,reg32               ; o32 0F AB /r         [386]
 957 \c BTS r/m16,imm8                ; o16 0F BA /5 ib      [386]
 958 \c BTS r/m32,imm8                ; o32 0F BA /5 ib      [386]
 959
 960 These instructions all test one bit of their first operand, whose
 961 index is given by the second operand, and store the value of that
 962 bit into the carry flag. Bit indices are from 0 (least significant)
 963 to 15 or 31 (most significant).
 964
 965 In addition to storing the original value of the bit into the carry
 966 flag, \c{BTR} also resets (clears) the bit in the operand itself.
 967 \c{BTS} sets the bit, and \c{BTC} complements the bit. \c{BT} does
 968 not modify its operands.
 969
 970 The destination can be a register or a memory location. The source can
 971 be a register or an immediate value.
 972
 973 If the destination operand is a register, the bit offset should be
 974 in the range 0-15 (for 16-bit operands) or 0-31 (for 32-bit operands).
 975 An immediate value outside these ranges will be taken modulo 16/32
 976 by the processor.
 977
 978 If the destination operand is a memory location, then an immediate
 979 bit offset follows the same rules as for a register. If the bit offset
 980 is in a register, then it can be anything within the signed range of
 981 the register used (i.e. for a 32-bit operand, it can be (-2^31) to (2^31 - 1)).
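
     As a sketch of a typical test-and-set sequence, assuming 32-bit
     code and a hypothetical doubleword variable \c{flags}:

     \c claim:  bts     dword [flags],3 ; CF := old value of bit 3, then set the bit
     \c         jc      .busy           ; the bit was already set
     \c         ; ... the bit was clear before and is set now ...
     \c .busy:  ret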
 982
 983
 984 \S{insCALL} \i\c{CALL}: Call Subroutine
 985
 986 \c CALL imm                      ; E8 rw/rd             [8086]
 987 \c CALL imm:imm16                ; o16 9A iw iw         [8086]
 988 \c CALL imm:imm32                ; o32 9A id iw         [386]
 989 \c CALL FAR mem16                ; o16 FF /3            [8086]
 990 \c CALL FAR mem32                ; o32 FF /3            [386]
 991 \c CALL r/m16                    ; o16 FF /2            [8086]
 992 \c CALL r/m32                    ; o32 FF /2            [386]
 993
 994 \c{CALL} calls a subroutine, by means of pushing the current
 995 instruction pointer (\c{IP}) and optionally \c{CS} as well on the
 996 stack, and then jumping to a given address.
 997
 998 \c{CS} is pushed as well as \c{IP} if and only if the call is a far
 999 call, i.e. a destination segment address is specified in the
1000 instruction. The forms involving two colon-separated arguments are
1001 far calls; so are the \c{CALL FAR mem} forms.
1002
1003 The immediate \i{near call} takes one of two forms (\c{CALL imm16} or
1004 \c{CALL imm32}), determined by the current operand size. For 16-bit
1005 operands, you would use \c{CALL 0x1234}, and for 32-bit operands you would
1006 use \c{CALL 0x12345678}. NASM encodes the target as a relative offset.
1007
1008 You can choose between the two immediate \i{far call} forms
1009 (\c{CALL imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords:
1010 \c{CALL WORD 0x1234:0x5678} or \c{CALL DWORD 0x1234:0x56789abc}.
1011
1012 The \c{CALL FAR mem} forms execute a far call by loading the
1013 destination address out of memory. The address loaded consists of 16
1014 or 32 bits of offset (depending on the operand size), and 16 bits of
1015 segment. The operand size may be overridden using \c{CALL WORD FAR
1016 mem} or \c{CALL DWORD FAR mem}.
1017
1018 The \c{CALL r/m} forms execute a \i{near call} (within the same
1019 segment), loading the destination address out of memory or out of a
1020 register. The keyword \c{NEAR} may be specified, for clarity, in
1021 these forms, but is not necessary. Again, operand size can be
1022 overridden using \c{CALL WORD mem} or \c{CALL DWORD mem}.
1023
1024 As a convenience, NASM does not require you to call a far procedure
1025 symbol by coding the cumbersome \c{CALL SEG routine:routine}, but
1026 instead allows the easier synonym \c{CALL FAR routine}.
1027
1028 The \c{CALL r/m} forms given above are near calls; NASM will accept
1029 the \c{NEAR} keyword (e.g. \c{CALL NEAR [address]}), even though it
1030 is not strictly necessary.
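
     The forms described above might be written as follows. The labels
     \c{routine}, \c{vector} and \c{farptr} are hypothetical names used
     only for this illustration.

     \c         call    routine         ; near call to a label
     \c         call    eax             ; near call, target held in a register
     \c         call    [vector]        ; near call, target loaded from memory
     \c         call    far [farptr]    ; far call, offset and segment loaded from memory
     \c         call    0x1234:0x5678   ; far call to an immediate segment:offset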
1031
1032
1033 \S{insCBW} \i\c{CBW}, \i\c{CWD}, \i\c{CDQ}, \i\c{CWDE}: Sign Extensions
1034
1035 \c CBW                           ; o16 98               [8086]
1036 \c CWDE                          ; o32 98               [386]
1037
1038 \c CWD                           ; o16 99               [8086]
1039 \c CDQ                           ; o32 99               [386]
1040
1041 All these instructions sign-extend a short value into a longer one,
1042 by replicating the top bit of the original value to fill the
1043 extended one.
1044
1045 \c{CBW} extends \c{AL} into \c{AX} by repeating the top bit of
1046 \c{AL} in every bit of \c{AH}. \c{CWDE} extends \c{AX} into
1047 \c{EAX}. \c{CWD} extends \c{AX} into \c{DX:AX} by repeating
1048 the top bit of \c{AX} throughout \c{DX}, and \c{CDQ} extends
1049 \c{EAX} into \c{EDX:EAX}.
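
     A common reason to sign-extend is to prepare the dividend for a
     signed division. The values in this sketch are arbitrary:

     \c         mov     ax,-7
     \c         cwd                     ; DX:AX := -7, so DX = 0xFFFF
     \c         mov     bx,2
     \c         idiv    bx              ; AX = -3 (quotient), DX = -1 (remainder)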
1050
1051
1052 \S{insCLC} \i\c{CLC}, \i\c{CLD}, \i\c{CLI}, \i\c{CLTS}: Clear Flags
1053
1054 \c CLC                           ; F8                   [8086]
1055 \c CLD                           ; FC                   [8086]
1056 \c CLI                           ; FA                   [8086]
1057 \c CLTS                          ; 0F 06                [286,PRIV]
1058
1059 These instructions clear various flags. \c{CLC} clears the carry
1060 flag; \c{CLD} clears the direction flag; \c{CLI} clears the
1061 interrupt flag (thus disabling interrupts); and \c{CLTS} clears the
1062 task-switched (\c{TS}) flag in \c{CR0}.
1063
1064 To set the carry, direction, or interrupt flags, use the \c{STC},
1065 \c{STD} and \c{STI} instructions (\k{insSTC}). To invert the carry
1066 flag, use \c{CMC} (\k{insCMC}).
1067
1068
1069 \S{insCLFLUSH} \i\c{CLFLUSH}: Flush Cache Line
1070
1071 \c CLFLUSH mem                   ; 0F AE /7        [WILLAMETTE,SSE2]
1072
1073 \c{CLFLUSH} invalidates the cache line that contains the linear address
1074 specified by the source operand from all levels of the processor cache
1075 hierarchy (data and instruction). If, at any level of the cache
1076 hierarchy, the line is inconsistent with memory (dirty) it is written
1077 to memory before invalidation. The source operand points to a
1078 byte-sized memory location.
1079
1080 Although \c{CLFLUSH} is flagged \c{SSE2} and above, it may not be
1081 present on all processors which have \c{SSE2} support, and it may be
1082 supported on other processors; the \c{CPUID} instruction (\k{insCPUID})
1083 will return a bit which indicates support for the \c{CLFLUSH} instruction.
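
     As a sketch of such a check: the \c{CLFLUSH} feature flag is bit 19
     of \c{EDX} after executing \c{CPUID} with \c{EAX} set to 1. The
     labels, including \c{buffer}, are hypothetical.

     \c flush:  mov     eax,1
     \c         cpuid                   ; flags in EDX; also overwrites EAX, EBX and ECX
     \c         bt      edx,19          ; CF := bit 19, the CLFLUSH feature flag
     \c         jnc     .unsupported
     \c         clflush [buffer]        ; flush the cache line containing buffer
     \c .unsupported:
     \c         ret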
1084
1085
1086 \S{insCMC} \i\c{CMC}: Complement Carry Flag
1087
1088 \c CMC                           ; F5                   [8086]
1089
1090 \c{CMC} changes the value of the carry flag: if it was 0, it sets it
1091 to 1, and vice versa.
1092
1093
1094 \S{insCMOVcc} \i\c{CMOVcc}: Conditional Move
1095
1096 \c CMOVcc reg16,r/m16            ; o16 0F 40+cc /r      [P6]
1097 \c CMOVcc reg32,r/m32            ; o32 0F 40+cc /r      [P6]
1098
1099 \c{CMOV} moves its source (second) operand into its destination
1100 (first) operand if the given condition code is satisfied; otherwise
1101 it does nothing.
1102
1103 For a list of condition codes, see \k{iref-cc}.
1104
1105 Although the \c{CMOV} instructions are flagged \c{P6} and above, they
1106 may not be supported by all Pentium Pro processors; the \c{CPUID}
1107 instruction (\k{insCPUID}) will return a bit which indicates whether
1108 conditional moves are supported.
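
     For example, a branch-free signed maximum of two registers (chosen
     arbitrarily here) could be sketched as:

     \c         cmp     eax,ebx
     \c         cmovl   eax,ebx         ; if EAX < EBX (signed), EAX := EBX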
1109
1110
1111 \S{insCMP} \i\c{CMP}: Compare Integers
1112
1113 \c CMP r/m8,reg8                 ; 38 /r                [8086]
1114 \c CMP r/m16,reg16               ; o16 39 /r            [8086]
1115 \c CMP r/m32,reg32               ; o32 39 /r            [386]
1116
1117 \c CMP reg8,r/m8                 ; 3A /r                [8086]
1118 \c CMP reg16,r/m16               ; o16 3B /r            [8086]
1119 \c CMP reg32,r/m32               ; o32 3B /r            [386]
1120
1121 \c CMP r/m8,imm8                 ; 80 /7 ib             [8086]
1122 \c CMP r/m16,imm16               ; o16 81 /7 iw         [8086]
1123 \c CMP r/m32,imm32               ; o32 81 /7 id         [386]
1124
1125 \c CMP r/m16,imm8                ; o16 83 /7 ib         [8086]
1126 \c CMP r/m32,imm8                ; o32 83 /7 ib         [386]
1127
1128 \c CMP AL,imm8                   ; 3C ib                [8086]
1129 \c CMP AX,imm16                  ; o16 3D iw            [8086]
1130 \c CMP EAX,imm32                 ; o32 3D id            [386]
1131
1132 \c{CMP} performs a `mental' subtraction of its second operand from
1133 its first operand, and affects the flags as if the subtraction had
1134 taken place, but does not store the result of the subtraction
1135 anywhere.
1136
1137 In the forms with an 8-bit immediate second operand and a longer
1138 first operand, the second operand is considered to be signed, and is
1139 sign-extended to the length of the first operand. In these cases,
1140 the \c{BYTE} qualifier is necessary to force NASM to generate this
1141 form of the instruction.
1142
1143 The destination operand can be a register or a memory location. The
1144 source can be a register, memory location or an immediate value of
1145 the same size as the destination.
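
     \c{CMP} is usually followed by a conditional jump or another
     flag-consuming instruction. A sketch with an arbitrary constant
     and hypothetical labels:

     \c check:  cmp     eax,10          ; set flags as for EAX - 10; EAX is unchanged
     \c         jae     .done           ; unsigned test: skip if EAX >= 10
     \c         mov     eax,10          ; otherwise raise EAX to 10
     \c .done:  ret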
1146
1147
1148 \S{insCMPccPD} \i\c{CMPccPD}: Packed Double-Precision FP Compare
1149 \I\c{CMPEQPD} \I\c{CMPLTPD} \I\c{CMPLEPD} \I\c{CMPUNORDPD}
1150 \I\c{CMPNEQPD} \I\c{CMPNLTPD} \I\c{CMPNLEPD} \I\c{CMPORDPD}
1151
1152 \c CMPPD xmm1,xmm2/mem128,imm8   ; 66 0F C2 /r ib  [WILLAMETTE,SSE2]
1153
1154 \c CMPEQPD xmm1,xmm2/mem128      ; 66 0F C2 /r 00  [WILLAMETTE,SSE2]
1155 \c CMPLTPD xmm1,xmm2/mem128      ; 66 0F C2 /r 01  [WILLAMETTE,SSE2]
1156 \c CMPLEPD xmm1,xmm2/mem128      ; 66 0F C2 /r 02  [WILLAMETTE,SSE2]
1157 \c CMPUNORDPD xmm1,xmm2/mem128   ; 66 0F C2 /r 03  [WILLAMETTE,SSE2]
1158 \c CMPNEQPD xmm1,xmm2/mem128     ; 66 0F C2 /r 04  [WILLAMETTE,SSE2]
1159 \c CMPNLTPD xmm1,xmm2/mem128     ; 66 0F C2 /r 05  [WILLAMETTE,SSE2]
1160 \c CMPNLEPD xmm1,xmm2/mem128     ; 66 0F C2 /r 06  [WILLAMETTE,SSE2]
1161 \c CMPORDPD xmm1,xmm2/mem128     ; 66 0F C2 /r 07  [WILLAMETTE,SSE2]
1162
1163 The \c{CMPccPD} instructions compare the two packed double-precision
1164 FP values in the source and destination operands, and return the
1165 result of the comparison in the destination register. The result of
1166 each comparison is a quadword mask of all 1s (comparison true) or
1167 all 0s (comparison false).
1168
1169 The destination is an \c{XMM} register. The source can be either an
1170 \c{XMM} register or a 128-bit memory location.
1171
1172 The third operand is an 8-bit immediate value, of which the low 3
1173 bits define the type of comparison. For ease of programming, the
1174 8 two-operand pseudo-instructions are provided, with the third
1175 operand already filled in. The \I{Condition Predicates}
1176 \c{Condition Predicates} are:
1177
1178 \c EQ     0   Equal
1179 \c LT     1   Less-than
1180 \c LE     2   Less-than-or-equal
1181 \c UNORD  3   Unordered
1182 \c NEQ    4   Not-equal
1183 \c NLT    5   Not-less-than
1184 \c NLE    6   Not-less-than-or-equal
1185 \c ORD    7   Ordered
1186
1187 For more details of the comparison predicates, and details of how
1188 to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
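
     To illustrate the relationship between the two spellings, the
     following two lines should assemble to the same bytes, since the
     predicate for less-than is 1:

     \c         cmppd   xmm0,xmm1,1     ; predicate given explicitly
     \c         cmpltpd xmm0,xmm1       ; the same comparison via the pseudo-instruction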
1189
1190
1191 \S{insCMPccPS} \i\c{CMPccPS}: Packed Single-Precision FP Compare
1192 \I\c{CMPEQPS} \I\c{CMPLTPS} \I\c{CMPLEPS} \I\c{CMPUNORDPS}
1193 \I\c{CMPNEQPS} \I\c{CMPNLTPS} \I\c{CMPNLEPS} \I\c{CMPORDPS}
1194
1195 \c CMPPS xmm1,xmm2/mem128,imm8   ; 0F C2 /r ib     [KATMAI,SSE]
1196
1197 \c CMPEQPS xmm1,xmm2/mem128      ; 0F C2 /r 00     [KATMAI,SSE]
1198 \c CMPLTPS xmm1,xmm2/mem128      ; 0F C2 /r 01     [KATMAI,SSE]
1199 \c CMPLEPS xmm1,xmm2/mem128      ; 0F C2 /r 02     [KATMAI,SSE]
1200 \c CMPUNORDPS xmm1,xmm2/mem128   ; 0F C2 /r 03     [KATMAI,SSE]
1201 \c CMPNEQPS xmm1,xmm2/mem128     ; 0F C2 /r 04     [KATMAI,SSE]
1202 \c CMPNLTPS xmm1,xmm2/mem128     ; 0F C2 /r 05     [KATMAI,SSE]
1203 \c CMPNLEPS xmm1,xmm2/mem128     ; 0F C2 /r 06     [KATMAI,SSE]
1204 \c CMPORDPS xmm1,xmm2/mem128     ; 0F C2 /r 07     [KATMAI,SSE]
1205
1206 The \c{CMPccPS} instructions compare the two packed single-precision
1207 FP values in the source and destination operands, and return the
1208 result of the comparison in the destination register. The result of
1209 each comparison is a doubleword mask of all 1s (comparison true) or
1210 all 0s (comparison false).
1211
1212 The destination is an \c{XMM} register. The source can be either an
1213 \c{XMM} register or a 128-bit memory location.
1214
1215 The third operand is an 8-bit immediate value, of which the low 3
1216 bits define the type of comparison. For ease of programming, the
1217 8 two-operand pseudo-instructions are provided, with the third
1218 operand already filled in. The \I{Condition Predicates}
1219 \c{Condition Predicates} are:
1220
1221 \c EQ     0   Equal
1222 \c LT     1   Less-than
1223 \c LE     2   Less-than-or-equal
1224 \c UNORD  3   Unordered
1225 \c NEQ    4   Not-equal
1226 \c NLT    5   Not-less-than
1227 \c NLE    6   Not-less-than-or-equal
1228 \c ORD    7   Ordered
1229
1230 For more details of the comparison predicates, and details of how
1231 to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
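
     The resulting masks are typically combined with the bitwise
     instructions to select values without branching. As a sketch,
     assuming \c{xmm0} holds a, \c{xmm1} holds b, and \c{xmm2} is free
     for use as scratch, a lane-by-lane minimum (ignoring the finer
     points of NaN handling) might look like this:

     \c         movaps  xmm2,xmm0       ; xmm2 := a
     \c         cmpltps xmm2,xmm1       ; xmm2 := mask, all 1s in each lane where a < b
     \c         andps   xmm0,xmm2       ; xmm0 := a where a < b, else 0
     \c         andnps  xmm2,xmm1       ; xmm2 := b where not (a < b), else 0
     \c         orps    xmm0,xmm2       ; xmm0 := the smaller value in each lane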
1232
1233
1234 \S{insCMPSB} \i\c{CMPSB}, \i\c{CMPSW}, \i\c{CMPSD}: Compare Strings
1235
1236 \c CMPSB                         ; A6                   [8086]
1237 \c CMPSW                         ; o16 A7               [8086]
1238 \c CMPSD                         ; o32 A7               [386]
1239
1240 \c{CMPSB} compares the byte at \c{[DS:SI]} or \c{[DS:ESI]} with the
1241 byte at \c{[ES:DI]} or \c{[ES:EDI]}, and sets the flags accordingly.
1242 It then increments or decrements (depending on the direction flag:
1243 increments if the flag is clear, decrements if it is set) \c{SI} and
1244 \c{DI} (or \c{ESI} and \c{EDI}).
1245
1246 The registers used are \c{SI} and \c{DI} if the address size is 16
1247 bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
1248 an address size not equal to the current \c{BITS} setting, you can
1249 use an explicit \i\c{a16} or \i\c{a32} prefix.
1250
1251 The segment register used to load from \c{[SI]} or \c{[ESI]} can be
1252 overridden by using a segment register name as a prefix (for
1253 example, \c{ES CMPSB}). The use of \c{ES} for the load from \c{[DI]}
1254 or \c{[EDI]} cannot be overridden.
1255
1256 \c{CMPSW} and \c{CMPSD} work in the same way, but they compare a
1257 word or a doubleword instead of a byte, and increment or decrement
1258 the addressing registers by 2 or 4 instead of 1.
1259
1260 The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
1261 \c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
1262 \c{ECX} - again, the address size chooses which) times until the
1263 first unequal or equal byte is found.
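
     A sketch of a simple block comparison follows. It assumes 32-bit
     code in which \c{DS} and \c{ES} address the same memory, and a
     non-zero length; \c{str_a}, \c{str_b} and \c{len} are hypothetical
     symbols.

     \c compare:
     \c         cld                     ; direction flag clear: addresses count upwards
     \c         mov     esi,str_a       ; first block, read through DS
     \c         mov     edi,str_b       ; second block, read through ES
     \c         mov     ecx,len         ; maximum number of bytes to compare
     \c         repe cmpsb              ; stop at the first mismatch or when ECX hits 0
     \c         jne     .differ         ; ZF clear: a mismatching byte was found
     \c         ; ... the blocks are equal ...
     \c .differ:
     \c         ret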
1264
1265
1266 \S{insCMPccSD} \i\c{CMPccSD}: Scalar Double-Precision FP Compare
1267 \I\c{CMPEQSD} \I\c{CMPLTSD} \I\c{CMPLESD} \I\c{CMPUNORDSD}
1268 \I\c{CMPNEQSD} \I\c{CMPNLTSD} \I\c{CMPNLESD} \I\c{CMPORDSD}
1269
1270 \c CMPSD xmm1,xmm2/mem64,imm8    ; F2 0F C2 /r ib  [WILLAMETTE,SSE2]
1271
1272 \c CMPEQSD xmm1,xmm2/mem64       ; F2 0F C2 /r 00  [WILLAMETTE,SSE2]
1273 \c CMPLTSD xmm1,xmm2/mem64       ; F2 0F C2 /r 01  [WILLAMETTE,SSE2]
1274 \c CMPLESD xmm1,xmm2/mem64       ; F2 0F C2 /r 02  [WILLAMETTE,SSE2]
1275 \c CMPUNORDSD xmm1,xmm2/mem64    ; F2 0F C2 /r 03  [WILLAMETTE,SSE2]
1276 \c CMPNEQSD xmm1,xmm2/mem64      ; F2 0F C2 /r 04  [WILLAMETTE,SSE2]
1277 \c CMPNLTSD xmm1,xmm2/mem64      ; F2 0F C2 /r 05  [WILLAMETTE,SSE2]
1278 \c CMPNLESD xmm1,xmm2/mem64      ; F2 0F C2 /r 06  [WILLAMETTE,SSE2]
1279 \c CMPORDSD xmm1,xmm2/mem64      ; F2 0F C2 /r 07  [WILLAMETTE,SSE2]
1280
1281 The \c{CMPccSD} instructions compare the low-order double-precision
1282 FP values in the source and destination operands, and return the
1283 result of the comparison in the destination register. The result of
1284 the comparison is a quadword mask of all 1s (comparison true) or
1285 all 0s (comparison false).
1286
1287 The destination is an \c{XMM} register. The source can be either an
1288 \c{XMM} register or a 64-bit memory location.
1289
1290 The third operand is an 8-bit immediate value, of which the low 3
1291 bits define the type of comparison. For ease of programming, the
1292 8 two-operand pseudo-instructions are provided, with the third
1293 operand already filled in. The \I{Condition Predicates}
1294 \c{Condition Predicates} are:
1295
1296 \c EQ     0   Equal
1297 \c LT     1   Less-than
1298 \c LE     2   Less-than-or-equal
1299 \c UNORD  3   Unordered
1300 \c NE     4   Not-equal
1301 \c NLT    5   Not-less-than
1302 \c NLE    6   Not-less-than-or-equal
1303 \c ORD    7   Ordered
1304
1305 For more details of the comparison predicates, and details of how
1306 to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
1307
1308
1309 \S{insCMPccSS} \i\c{CMPccSS}: Scalar Single-Precision FP Compare
1310 \I\c{CMPEQSS} \I\c{CMPLTSS} \I\c{CMPLESS} \I\c{CMPUNORDSS}
1311 \I\c{CMPNEQSS} \I\c{CMPNLTSS} \I\c{CMPNLESS} \I\c{CMPORDSS}
1312
1313 \c CMPSS xmm1,xmm2/mem32,imm8    ; F3 0F C2 /r ib  [KATMAI,SSE]
1314
1315 \c CMPEQSS xmm1,xmm2/mem32       ; F3 0F C2 /r 00  [KATMAI,SSE]
1316 \c CMPLTSS xmm1,xmm2/mem32       ; F3 0F C2 /r 01  [KATMAI,SSE]
1317 \c CMPLESS xmm1,xmm2/mem32       ; F3 0F C2 /r 02  [KATMAI,SSE]
1318 \c CMPUNORDSS xmm1,xmm2/mem32    ; F3 0F C2 /r 03  [KATMAI,SSE]
1319 \c CMPNEQSS xmm1,xmm2/mem32      ; F3 0F C2 /r 04  [KATMAI,SSE]
1320 \c CMPNLTSS xmm1,xmm2/mem32      ; F3 0F C2 /r 05  [KATMAI,SSE]
1321 \c CMPNLESS xmm1,xmm2/mem32      ; F3 0F C2 /r 06  [KATMAI,SSE]
1322 \c CMPORDSS xmm1,xmm2/mem32      ; F3 0F C2 /r 07  [KATMAI,SSE]
1323
1324 The \c{CMPccSS} instructions compare the low-order single-precision
1325 FP values in the source and destination operands, and return the
1326 result of the comparison in the destination register. The result of
1327 the comparison is a doubleword mask of all 1s (comparison true) or
1328 all 0s (comparison false).
1329
1330 The destination is an \c{XMM} register. The source can be either an
1331 \c{XMM} register or a 32-bit memory location.
1332
1333 The third operand is an 8-bit immediate value, of which the low 3
1334 bits define the type of comparison. For ease of programming, the
1335 8 two-operand pseudo-instructions are provided, with the third
1336 operand already filled in. The \I{Condition Predicates}
1337 \c{Condition Predicates} are:
1338
1339 \c EQ     0   Equal
1340 \c LT     1   Less-than
1341 \c LE     2   Less-than-or-equal
1342 \c UNORD  3   Unordered
1343 \c NE     4   Not-equal
1344 \c NLT    5   Not-less-than
1345 \c NLE    6   Not-less-than-or-equal
1346 \c ORD    7   Ordered
1347
For more details of the comparison predicates, and details of how
to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
1350
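As a rough sketch of how the resulting mask might be used, the
following fragment selects the smaller of two values without
branching. The labels \c{a} and \c{b} are hypothetical dword
variables, not anything defined by NASM; if either value is a NaN,
the comparison is false and \c{b} is selected.

\c         movss   xmm0,[a]        ; xmm0 = a, upper doublewords zeroed
\c         movss   xmm1,[b]        ; xmm1 = b, upper doublewords zeroed
\c         movaps  xmm2,xmm0       ; work on a copy of a
\c         cmpltss xmm2,xmm1       ; low dword = all 1s if a < b, else all 0s
\c         andps   xmm0,xmm2       ; keep a where the mask is set
\c         andnps  xmm2,xmm1       ; keep b where the mask is clear
\c         orps    xmm0,xmm2       ; low dword of xmm0 = (a < b) ? a : b

Swapping the roles of \c{a} and \c{b} in the \c{CMPLTSS} line tests
\c{a > b} instead, since that is the same condition as \c{b < a}.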
1351
1352 \S{insCMPXCHG} \i\c{CMPXCHG}, \i\c{CMPXCHG486}: Compare and Exchange
1353
1354 \c CMPXCHG r/m8,reg8             ; 0F B0 /r             [PENT]
1355 \c CMPXCHG r/m16,reg16           ; o16 0F B1 /r         [PENT]
1356 \c CMPXCHG r/m32,reg32           ; o32 0F B1 /r         [PENT]
1357
1358 \c CMPXCHG486 r/m8,reg8          ; 0F A6 /r             [486,UNDOC]
1359 \c CMPXCHG486 r/m16,reg16        ; o16 0F A7 /r         [486,UNDOC]
1360 \c CMPXCHG486 r/m32,reg32        ; o32 0F A7 /r         [486,UNDOC]
1361
These two instructions perform exactly the same operation; however,
apparently some (but not all) 486 processors implement it under a
non-standard opcode, so NASM provides the undocumented
\c{CMPXCHG486} form to generate that non-standard opcode.
1366
\c{CMPXCHG} compares its destination (first) operand to the value in
\c{AL}, \c{AX} or \c{EAX} (depending on the operand size of the
instruction). If they are equal, it copies its source (second)
operand into the destination and sets the zero flag. Otherwise, it
clears the zero flag and copies the destination operand into \c{AL},
\c{AX} or \c{EAX}.
1372
1373 The destination can be either a register or a memory location. The
1374 source is a register.
1375
1376 \c{CMPXCHG} is intended to be used for atomic operations in
1377 multitasking or multiprocessor environments. To safely update a
1378 value in shared memory, for example, you might load the value into
1379 \c{EAX}, load the updated value into \c{EBX}, and then execute the
1380 instruction \c{LOCK CMPXCHG [value],EBX}. If \c{value} has not
1381 changed since being loaded, it is updated with your desired new
1382 value, and the zero flag is set to let you know it has worked. (The
1383 \c{LOCK} prefix prevents another processor doing anything in the
1384 middle of this operation: it guarantees atomicity.) However, if
1385 another processor has modified the value in between your load and
1386 your attempted store, the store does not happen, and you are
1387 notified of the failure by a cleared zero flag, so you can go round
1388 and try again.
1389
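The retry loop described above might be sketched as follows, here
for an atomic increment. The labels \c{value} and \c{retry} are
hypothetical, and the fragment assumes 32-bit code:

\c         mov     eax,[value]         ; load the current value
\c retry:  lea     ebx,[eax+1]         ; compute the desired new value
\c         lock cmpxchg [value],ebx    ; store EBX only if [value] still equals EAX
\c         jnz     retry               ; ZF clear: EAX now holds the fresh value

There is no need to reload \c{EAX} by hand before going round again,
because a failed \c{CMPXCHG} has already copied the current contents
of \c{value} into it.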
1390
1391 \S{insCMPXCHG8B} \i\c{CMPXCHG8B}: Compare and Exchange Eight Bytes
1392
1393 \c CMPXCHG8B mem                 ; 0F C7 /1             [PENT]
1394
1395 This is a larger and more unwieldy version of \c{CMPXCHG}: it
1396 compares the 64-bit (eight-byte) value stored at \c{[mem]} with the
1397 value in \c{EDX:EAX}. If they are equal, it sets the zero flag and
1398 stores \c{ECX:EBX} into the memory area. If they are unequal, it
1399 clears the zero flag and stores the memory contents into \c{EDX:EAX}.
1400
1401 \c{CMPXCHG8B} can be used with the \c{LOCK} prefix, to allow atomic
1402 execution. This is useful in multi-processor and multi-tasking
1403 environments.
1404
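A minimal sketch of an atomic 64-bit increment built on this
instruction is shown below. The label \c{counter} is a hypothetical
eight-byte variable, and 32-bit code is assumed:

\c         mov     eax,[counter]       ; low dword of the expected value
\c         mov     edx,[counter+4]     ; high dword of the expected value
\c retry:  mov     ebx,eax
\c         mov     ecx,edx
\c         add     ebx,1               ; ECX:EBX = EDX:EAX + 1
\c         adc     ecx,0
\c         lock cmpxchg8b [counter]    ; store ECX:EBX if [counter] = EDX:EAX
\c         jnz     retry               ; on failure EDX:EAX has been reloaded

The initial pair of loads is not atomic, but that does not matter
here: if they happened to read an inconsistent value, the comparison
fails and the loop retries with the value that \c{CMPXCHG8B} itself
returned.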
1405
1406 \S{insCOMISD} \i\c{COMISD}: Scalar Ordered Double-Precision FP Compare and Set EFLAGS
1407
1408 \c COMISD xmm1,xmm2/mem64        ; 66 0F 2F /r     [WILLAMETTE,SSE2]
1409
\c{COMISD} compares the low-order double-precision FP values in the
two source operands. ZF, PF and CF are set according to the result.
OF, SF and AF are cleared. The unordered result is returned if either
source is a NaN (QNaN or SNaN).

The first source operand is an \c{XMM} register. The second can be
either an \c{XMM} register or a 64-bit memory location.
1417
1418 The flags are set according to the following rules:
1419
1420 \c    Result          Flags        Values
1421
1422 \c    UNORDERED:      ZF,PF,CF <-- 111;
1423 \c    GREATER_THAN:   ZF,PF,CF <-- 000;
1424 \c    LESS_THAN:      ZF,PF,CF <-- 001;
1425 \c    EQUAL:          ZF,PF,CF <-- 100;
1426
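Because the result arrives in ZF, PF and CF, it can be tested with
the ordinary unsigned conditional jumps. The following sketch shows
one way of dispatching on all four outcomes; \c{x}, \c{y} and the
jump targets are hypothetical labels:

\c         movsd   xmm0,[x]
\c         comisd  xmm0,[y]
\c         jp      unordered       ; PF = 1: at least one operand is a NaN
\c         ja      greater         ; ZF = 0 and CF = 0: x > y
\c         jb      less            ; CF = 1: x < y
\c                                 ; fall through: ZF = 1, so x = y

The parity test has to come first, since the unordered result sets
ZF and CF as well and would otherwise be mistaken for "less than" or
"equal".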
1427
1428 \S{insCOMISS} \i\c{COMISS}: Scalar Ordered Single-Precision FP Compare and Set EFLAGS
1429
\c COMISS xmm1,xmm2/mem32        ; 0F 2F /r        [KATMAI,SSE]
1431
\c{COMISS} compares the low-order single-precision FP values in the
two source operands. ZF, PF and CF are set according to the result.
OF, SF and AF are cleared. The unordered result is returned if either
source is a NaN (QNaN or SNaN).

The first source operand is an \c{XMM} register. The second can be
either an \c{XMM} register or a 32-bit memory location.
1439
1440 The flags are set according to the following rules:
1441
1442 \c    Result          Flags        Values
1443
1444 \c    UNORDERED:      ZF,PF,CF <-- 111;
1445 \c    GREATER_THAN:   ZF,PF,CF <-- 000;
1446 \c    LESS_THAN:      ZF,PF,CF <-- 001;
1447 \c    EQUAL:          ZF,PF,CF <-- 100;
1448
1449
1450 \S{insCPUID} \i\c{CPUID}: Get CPU Identification Code
1451
1452 \c CPUID                         ; 0F A2                [PENT]
1453
1454 \c{CPUID} returns various information about the processor it is
1455 being executed on. It fills the four registers \c{EAX}, \c{EBX},
1456 \c{ECX} and \c{EDX} with information, which varies depending on the
1457 input contents of \c{EAX}.
1458
1459 \c{CPUID} also acts as a barrier to serialize instruction execution:
1460 executing the \c{CPUID} instruction guarantees that all the effects
1461 (memory modification, flag modification, register modification) of
1462 previous instructions have been completed before the next
1463 instruction gets fetched.
1464
1465 The information returned is as follows:
1466
1467 \b If \c{EAX} is zero on input, \c{EAX} on output holds the maximum
1468 acceptable input value of \c{EAX}, and \c{EBX:EDX:ECX} contain the
1469 string \c{"GenuineIntel"} (or not, if you have a clone processor).
1470 That is to say, \c{EBX} contains \c{"Genu"} (in NASM's own sense of
1471 character constants, described in \k{chrconst}), \c{EDX} contains
1472 \c{"ineI"} and \c{ECX} contains \c{"ntel"}.
1473
1474 \b If \c{EAX} is one on input, \c{EAX} on output contains version
1475 information about the processor, and \c{EDX} contains a set of
1476 feature flags, showing the presence and absence of various features.
1477 For example, bit 8 is set if the \c{CMPXCHG8B} instruction
1478 (\k{insCMPXCHG8B}) is supported, bit 15 is set if the conditional
1479 move instructions (\k{insCMOVcc} and \k{insFCMOVB}) are supported,
1480 and bit 23 is set if \c{MMX} instructions are supported.
1481
\b If \c{EAX} is two on input, \c{EAX}, \c{EBX}, \c{ECX} and \c{EDX}
all contain information about caches and TLBs (Translation Lookaside
Buffers).
1485
1486 For more information on the data returned from \c{CPUID}, see the
1487 documentation from Intel and other processor manufacturers.
1488
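As an illustration, the fragment below stores the vendor string and
then tests for MMX support. It assumes the processor supports
\c{CPUID} at all; \c{vendor} is a hypothetical 12-byte buffer and
\c{no_mmx} a hypothetical label. Note that all four of \c{EAX},
\c{EBX}, \c{ECX} and \c{EDX} are overwritten.

\c         xor     eax,eax
\c         cpuid                       ; EAX = 0: maximum input value and vendor string
\c         mov     [vendor],ebx        ; e.g. "Genu"
\c         mov     [vendor+4],edx      ; e.g. "ineI"
\c         mov     [vendor+8],ecx      ; e.g. "ntel"
\c         cmp     eax,1
\c         jb      no_mmx              ; input value 1 is not supported
\c         mov     eax,1
\c         cpuid                       ; EAX = 1: version and feature flags
\c         test    edx,1 << 23         ; bit 23: MMX
\c         jz      no_mmx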
1489
1490 \S{insCVTDQ2PD} \i\c{CVTDQ2PD}:
1491 Packed Signed INT32 to Packed Double-Precision FP Conversion
1492
1493 \c CVTDQ2PD xmm1,xmm2/mem64      ; F3 0F E6 /r     [WILLAMETTE,SSE2]
1494
1495 \c{CVTDQ2PD} converts two packed signed doublewords from the source
1496 operand to two packed double-precision FP values in the destination
1497 operand.
1498
1499 The destination operand is an \c{XMM} register. The source can be
1500 either an \c{XMM} register or a 64-bit memory location. If the
1501 source is a register, the packed integers are in the low quadword.
1502
1503
1504 \S{insCVTDQ2PS} \i\c{CVTDQ2PS}:
1505 Packed Signed INT32 to Packed Single-Precision FP Conversion
1506
1507 \c CVTDQ2PS xmm1,xmm2/mem128     ; 0F 5B /r        [WILLAMETTE,SSE2]
1508
1509 \c{CVTDQ2PS} converts four packed signed doublewords from the source
1510 operand to four packed single-precision FP values in the destination
1511 operand.
1512
1513 The destination operand is an \c{XMM} register. The source can be
1514 either an \c{XMM} register or a 128-bit memory location.
1515
1516 For more details of this instruction, see the Intel Processor manuals.
1517
1518
1519 \S{insCVTPD2DQ} \i\c{CVTPD2DQ}:
1520 Packed Double-Precision FP to Packed Signed INT32 Conversion
1521
1522 \c CVTPD2DQ xmm1,xmm2/mem128     ; F2 0F E6 /r     [WILLAMETTE,SSE2]
1523
1524 \c{CVTPD2DQ} converts two packed double-precision FP values from the
1525 source operand to two packed signed doublewords in the low quadword
1526 of the destination operand. The high quadword of the destination is
1527 set to all 0s.
1528
1529 The destination operand is an \c{XMM} register. The source can be
1530 either an \c{XMM} register or a 128-bit memory location.
1531
1532 For more details of this instruction, see the Intel Processor manuals.
1533
1534
1535 \S{insCVTPD2PI} \i\c{CVTPD2PI}:
1536 Packed Double-Precision FP to Packed Signed INT32 Conversion
1537
1538 \c CVTPD2PI mm,xmm/mem128        ; 66 0F 2D /r     [WILLAMETTE,SSE2]
1539
1540 \c{CVTPD2PI} converts two packed double-precision FP values from the
1541 source operand to two packed signed doublewords in the destination
1542 operand.
1543
1544 The destination operand is an \c{MMX} register. The source can be
1545 either an \c{XMM} register or a 128-bit memory location.
1546
1547 For more details of this instruction, see the Intel Processor manuals.
1548
1549
1550 \S{insCVTPD2PS} \i\c{CVTPD2PS}:
1551 Packed Double-Precision FP to Packed Single-Precision FP Conversion
1552
1553 \c CVTPD2PS xmm1,xmm2/mem128     ; 66 0F 5A /r     [WILLAMETTE,SSE2]
1554
1555 \c{CVTPD2PS} converts two packed double-precision FP values from the
1556 source operand to two packed single-precision FP values in the low
1557 quadword of the destination operand. The high quadword of the
1558 destination is set to all 0s.
1559
1560 The destination operand is an \c{XMM} register. The source can be
1561 either an \c{XMM} register or a 128-bit memory location.
1562
1563 For more details of this instruction, see the Intel Processor manuals.
1564
1565
1566 \S{insCVTPI2PD} \i\c{CVTPI2PD}:
1567 Packed Signed INT32 to Packed Double-Precision FP Conversion
1568
1569 \c CVTPI2PD xmm,mm/mem64         ; 66 0F 2A /r     [WILLAMETTE,SSE2]
1570
1571 \c{CVTPI2PD} converts two packed signed doublewords from the source
1572 operand to two packed double-precision FP values in the destination
1573 operand.
1574
1575 The destination operand is an \c{XMM} register. The source can be
1576 either an \c{MMX} register or a 64-bit memory location.
1577
1578 For more details of this instruction, see the Intel Processor manuals.
1579
1580
1581 \S{insCVTPI2PS} \i\c{CVTPI2PS}:
Packed Signed INT32 to Packed Single-Precision FP Conversion
1583
1584 \c CVTPI2PS xmm,mm/mem64         ; 0F 2A /r        [KATMAI,SSE]
1585
1586 \c{CVTPI2PS} converts two packed signed doublewords from the source
1587 operand to two packed single-precision FP values in the low quadword
1588 of the destination operand. The high quadword of the destination
1589 remains unchanged.
1590
1591 The destination operand is an \c{XMM} register. The source can be
1592 either an \c{MMX} register or a 64-bit memory location.
1593
1594 For more details of this instruction, see the Intel Processor manuals.
1595
1596
1597 \S{insCVTPS2DQ} \i\c{CVTPS2DQ}:
1598 Packed Single-Precision FP to Packed Signed INT32 Conversion
1599
1600 \c CVTPS2DQ xmm1,xmm2/mem128     ; 66 0F 5B /r     [WILLAMETTE,SSE2]
1601
1602 \c{CVTPS2DQ} converts four packed single-precision FP values from the
1603 source operand to four packed signed doublewords in the destination operand.
1604
1605 The destination operand is an \c{XMM} register. The source can be
1606 either an \c{XMM} register or a 128-bit memory location.
1607
1608 For more details of this instruction, see the Intel Processor manuals.
1609
1610
1611 \S{insCVTPS2PD} \i\c{CVTPS2PD}:
1612 Packed Single-Precision FP to Packed Double-Precision FP Conversion
1613
1614 \c CVTPS2PD xmm1,xmm2/mem64      ; 0F 5A /r        [WILLAMETTE,SSE2]
1615
1616 \c{CVTPS2PD} converts two packed single-precision FP values from the
1617 source operand to two packed double-precision FP values in the destination
1618 operand.
1619
1620 The destination operand is an \c{XMM} register. The source can be
1621 either an \c{XMM} register or a 64-bit memory location. If the source
1622 is a register, the input values are in the low quadword.
1623
1624 For more details of this instruction, see the Intel Processor manuals.
1625
1626
1627 \S{insCVTPS2PI} \i\c{CVTPS2PI}:
1628 Packed Single-Precision FP to Packed Signed INT32 Conversion
1629
1630 \c CVTPS2PI mm,xmm/mem64         ; 0F 2D /r        [KATMAI,SSE]
1631
1632 \c{CVTPS2PI} converts two packed single-precision FP values from
1633 the source operand to two packed signed doublewords in the destination
1634 operand.
1635
1636 The destination operand is an \c{MMX} register. The source can be
1637 either an \c{XMM} register or a 64-bit memory location. If the
1638 source is a register, the input values are in the low quadword.
1639
1640 For more details of this instruction, see the Intel Processor manuals.
1641
1642
1643 \S{insCVTSD2SI} \i\c{CVTSD2SI}:
1644 Scalar Double-Precision FP to Signed INT32 Conversion
1645
1646 \c CVTSD2SI reg32,xmm/mem64      ; F2 0F 2D /r     [WILLAMETTE,SSE2]
1647
1648 \c{CVTSD2SI} converts a double-precision FP value from the source
1649 operand to a signed doubleword in the destination operand.
1650
1651 The destination operand is a general purpose register. The source can be
1652 either an \c{XMM} register or a 64-bit memory location. If the
1653 source is a register, the input value is in the low quadword.
1654
1655 For more details of this instruction, see the Intel Processor manuals.
1656
1657
1658 \S{insCVTSD2SS} \i\c{CVTSD2SS}:
1659 Scalar Double-Precision FP to Scalar Single-Precision FP Conversion
1660
\c CVTSD2SS xmm1,xmm2/mem64      ; F2 0F 5A /r     [WILLAMETTE,SSE2]
1662
1663 \c{CVTSD2SS} converts a double-precision FP value from the source
1664 operand to a single-precision FP value in the low doubleword of the
1665 destination operand. The upper 3 doublewords are left unchanged.
1666
1667 The destination operand is an \c{XMM} register. The source can be
1668 either an \c{XMM} register or a 64-bit memory location. If the
1669 source is a register, the input value is in the low quadword.
1670
1671 For more details of this instruction, see the Intel Processor manuals.
1672
1673
1674 \S{insCVTSI2SD} \i\c{CVTSI2SD}:
1675 Signed INT32 to Scalar Double-Precision FP Conversion
1676
1677 \c CVTSI2SD xmm,r/m32            ; F2 0F 2A /r     [WILLAMETTE,SSE2]
1678
1679 \c{CVTSI2SD} converts a signed doubleword from the source operand to
1680 a double-precision FP value in the low quadword of the destination
1681 operand. The high quadword is left unchanged.
1682
1683 The destination operand is an \c{XMM} register. The source can be either
1684 a general purpose register or a 32-bit memory location.
1685
1686 For more details of this instruction, see the Intel Processor manuals.
1687
1688
1689 \S{insCVTSI2SS} \i\c{CVTSI2SS}:
1690 Signed INT32 to Scalar Single-Precision FP Conversion
1691
1692 \c CVTSI2SS xmm,r/m32            ; F3 0F 2A /r     [KATMAI,SSE]
1693
1694 \c{CVTSI2SS} converts a signed doubleword from the source operand to a
1695 single-precision FP value in the low doubleword of the destination operand.
1696 The upper 3 doublewords are left unchanged.
1697
1698 The destination operand is an \c{XMM} register. The source can be either
1699 a general purpose register or a 32-bit memory location.
1700
1701 For more details of this instruction, see the Intel Processor manuals.
1702
1703
1704 \S{insCVTSS2SD} \i\c{CVTSS2SD}:
1705 Scalar Single-Precision FP to Scalar Double-Precision FP Conversion
1706
1707 \c CVTSS2SD xmm1,xmm2/mem32      ; F3 0F 5A /r     [WILLAMETTE,SSE2]
1708
1709 \c{CVTSS2SD} converts a single-precision FP value from the source operand
1710 to a double-precision FP value in the low quadword of the destination
1711 operand. The upper quadword is left unchanged.
1712
1713 The destination operand is an \c{XMM} register. The source can be either
1714 an \c{XMM} register or a 32-bit memory location. If the source is a
1715 register, the input value is contained in the low doubleword.
1716
1717 For more details of this instruction, see the Intel Processor manuals.
1718
1719
1720 \S{insCVTSS2SI} \i\c{CVTSS2SI}:
1721 Scalar Single-Precision FP to Signed INT32 Conversion
1722
1723 \c CVTSS2SI reg32,xmm/mem32      ; F3 0F 2D /r     [KATMAI,SSE]
1724
1725 \c{CVTSS2SI} converts a single-precision FP value from the source
1726 operand to a signed doubleword in the destination operand.
1727
1728 The destination operand is a general purpose register. The source can be
1729 either an \c{XMM} register or a 32-bit memory location. If the
1730 source is a register, the input value is in the low doubleword.
1731
1732 For more details of this instruction, see the Intel Processor manuals.
1733
1734
1735 \S{insCVTTPD2DQ} \i\c{CVTTPD2DQ}:
1736 Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
1737
1738 \c CVTTPD2DQ xmm1,xmm2/mem128    ; 66 0F E6 /r     [WILLAMETTE,SSE2]
1739
\c{CVTTPD2DQ} converts two packed double-precision FP values in the source
operand to two packed signed doublewords in the low quadword of the
destination operand. If the result is inexact, it is truncated (rounded
toward zero). The high quadword is set to all 0s.
1744
1745 The destination operand is an \c{XMM} register. The source can be
1746 either an \c{XMM} register or a 128-bit memory location.
1747
1748 For more details of this instruction, see the Intel Processor manuals.
1749
1750
1751 \S{insCVTTPD2PI} \i\c{CVTTPD2PI}:
1752 Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
1753
1754 \c CVTTPD2PI mm,xmm/mem128        ; 66 0F 2C /r     [WILLAMETTE,SSE2]
1755
\c{CVTTPD2PI} converts two packed double-precision FP values in the source
operand to two packed signed doublewords in the destination operand.
If the result is inexact, it is truncated (rounded toward zero).
1759
1760 The destination operand is an \c{MMX} register. The source can be
1761 either an \c{XMM} register or a 128-bit memory location.
1762
1763 For more details of this instruction, see the Intel Processor manuals.
1764
1765
1766 \S{insCVTTPS2DQ} \i\c{CVTTPS2DQ}:
1767 Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
1768
1769 \c CVTTPS2DQ xmm1,xmm2/mem128    ; F3 0F 5B /r     [WILLAMETTE,SSE2]
1770
1771 \c{CVTTPS2DQ} converts four packed single-precision FP values in the source
1772 operand to four packed signed doublewords in the destination operand.
1773 If the result is inexact, it is truncated (rounded toward zero).
1774
1775 The destination operand is an \c{XMM} register. The source can be
1776 either an \c{XMM} register or a 128-bit memory location.
1777
1778 For more details of this instruction, see the Intel Processor manuals.
1779
1780
1781 \S{insCVTTPS2PI} \i\c{CVTTPS2PI}:
1782 Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
1783
1784 \c CVTTPS2PI mm,xmm/mem64         ; 0F 2C /r       [KATMAI,SSE]
1785
\c{CVTTPS2PI} converts two packed single-precision FP values in the source
operand to two packed signed doublewords in the destination operand.
If the result is inexact, it is truncated (rounded toward zero).

The destination operand is an \c{MMX} register. The source can be
either an \c{XMM} register or a 64-bit memory location. If the source
is a register, the input values are in the low quadword.
1794
1795 For more details of this instruction, see the Intel Processor manuals.
1796
1797
1798 \S{insCVTTSD2SI} \i\c{CVTTSD2SI}:
1799 Scalar Double-Precision FP to Signed INT32 Conversion with Truncation
1800
1801 \c CVTTSD2SI reg32,xmm/mem64      ; F2 0F 2C /r    [WILLAMETTE,SSE2]
1802
1803 \c{CVTTSD2SI} converts a double-precision FP value in the source operand
1804 to a signed doubleword in the destination operand. If the result is
1805 inexact, it is truncated (rounded toward zero).
1806
1807 The destination operand is a general purpose register. The source can be
1808 either an \c{XMM} register or a 64-bit memory location. If the source is a
1809 register, the input value is in the low quadword.
1810
1811 For more details of this instruction, see the Intel Processor manuals.
1812
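The difference between this instruction and \c{CVTSD2SI}
(\k{insCVTSD2SI}) can be seen in a short sketch. The label \c{val} is
a hypothetical 64-bit variable assumed to hold 2.75, and the rounding
mode in \c{MXCSR} is assumed to be the default, round to nearest:

\c         movsd     xmm0,[val]        ; xmm0 = 2.75
\c         cvtsd2si  eax,xmm0          ; EAX = 3 (rounded according to MXCSR)
\c         cvttsd2si ebx,xmm0          ; EBX = 2 (always rounded toward zero)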
1813
1814 \S{insCVTTSS2SI} \i\c{CVTTSS2SI}:
1815 Scalar Single-Precision FP to Signed INT32 Conversion with Truncation
1816
\c CVTTSS2SI reg32,xmm/mem32      ; F3 0F 2C /r    [KATMAI,SSE]
1818
1819 \c{CVTTSS2SI} converts a single-precision FP value in the source operand
1820 to a signed doubleword in the destination operand. If the result is
1821 inexact, it is truncated (rounded toward zero).
1822
1823 The destination operand is a general purpose register. The source can be
1824 either an \c{XMM} register or a 32-bit memory location. If the source is a
1825 register, the input value is in the low doubleword.
1826
1827 For more details of this instruction, see the Intel Processor manuals.
1828
1829
1830 \S{insDAA} \i\c{DAA}, \i\c{DAS}: Decimal Adjustments
1831
1832 \c DAA                           ; 27                   [8086]
1833 \c DAS                           ; 2F                   [8086]
1834
1835 These instructions are used in conjunction with the add and subtract
1836 instructions to perform binary-coded decimal arithmetic in
1837 \e{packed} (one BCD digit per nibble) form. For the unpacked
1838 equivalents, see \k{insAAA}.
1839
\c{DAA} should be used after a one-byte \c{ADD} instruction whose
destination was the \c{AL} register: by examining the value in
\c{AL} and the auxiliary carry flag \c{AF}, it determines whether
either digit of the addition has overflowed, and adjusts it (and
sets the carry and auxiliary-carry flags) if so. You can add long
BCD strings together by doing \c{ADD}/\c{DAA} on the low two digits,
then doing \c{ADC}/\c{DAA} on each subsequent pair of digits.
1848
1849 \c{DAS} works similarly to \c{DAA}, but is for use after \c{SUB}
1850 instructions rather than \c{ADD}.
1851
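The \c{ADD}/\c{DAA} followed by \c{ADC}/\c{DAA} sequence might look
like this for two four-digit packed BCD numbers. The labels
\c{num1}, \c{num2} and \c{sum} are hypothetical two-byte variables,
stored low digits first:

\c         mov     al,[num1]           ; low two digits
\c         add     al,[num2]
\c         daa                         ; adjust, leaving any decimal carry in CF
\c         mov     [sum],al            ; MOV does not disturb CF
\c         mov     al,[num1+1]         ; next two digits
\c         adc     al,[num2+1]         ; include the carry left by DAA
\c         daa
\c         mov     [sum+1],al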
1852
1853 \S{insDEC} \i\c{DEC}: Decrement Integer
1854
1855 \c DEC reg16                     ; o16 48+r             [8086]
1856 \c DEC reg32                     ; o32 48+r             [386]
1857 \c DEC r/m8                      ; FE /1                [8086]
1858 \c DEC r/m16                     ; o16 FF /1            [8086]
1859 \c DEC r/m32                     ; o32 FF /1            [386]
1860
1861 \c{DEC} subtracts 1 from its operand. It does \e{not} affect the
1862 carry flag: to affect the carry flag, use \c{SUB something,1} (see
1863 \k{insSUB}). \c{DEC} affects all the other flags according to the result.
1864
1865 This instruction can be used with a \c{LOCK} prefix to allow atomic
1866 execution.
1867
1868 See also \c{INC} (\k{insINC}).
1869
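The fact that \c{DEC} leaves the carry flag alone is what makes it
usable as a loop counter in multi-precision arithmetic. The sketch
below adds one multi-dword number to another; it assumes that \c{ESI}
and \c{EDI} point to the source and destination, and that \c{ECX}
holds a non-zero count of doublewords. The label \c{next} is
hypothetical.

\c         clc                         ; no carry into the first ADC
\c next:   mov     eax,[esi]
\c         adc     [edi],eax           ; add with the carry from the previous dword
\c         lea     esi,[esi+4]         ; LEA does not touch the flags
\c         lea     edi,[edi+4]
\c         dec     ecx                 ; changes ZF but leaves CF intact
\c         jnz     next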
1870
1871 \S{insDIV} \i\c{DIV}: Unsigned Integer Divide
1872
1873 \c DIV r/m8                      ; F6 /6                [8086]
1874 \c DIV r/m16                     ; o16 F7 /6            [8086]
1875 \c DIV r/m32                     ; o32 F7 /6            [386]
1876
1877 \c{DIV} performs unsigned integer division. The explicit operand
1878 provided is the divisor; the dividend and destination operands are
1879 implicit, in the following way:
1880
1881 \b For \c{DIV r/m8}, \c{AX} is divided by the given operand; the
1882 quotient is stored in \c{AL} and the remainder in \c{AH}.
1883
1884 \b For \c{DIV r/m16}, \c{DX:AX} is divided by the given operand; the
1885 quotient is stored in \c{AX} and the remainder in \c{DX}.
1886
1887 \b For \c{DIV r/m32}, \c{EDX:EAX} is divided by the given operand;
1888 the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
1889
1890 Signed integer division is performed by the \c{IDIV} instruction:
1891 see \k{insIDIV}.
1892
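Since the dividend is implicit, the upper half has to be set up
before the instruction is executed. A minimal sketch of a 32-bit
unsigned division, with hypothetical dword variables \c{dividend}
and \c{divisor}:

\c         mov     eax,[dividend]
\c         xor     edx,edx             ; zero-extend the dividend into EDX:EAX
\c         div     dword [divisor]     ; EAX = quotient, EDX = remainder

If the divisor is zero, or the quotient is too large to fit in the
destination (which can easily happen if \c{EDX} is left holding
garbage), the processor raises a divide error exception.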
1893
1894 \S{insDIVPD} \i\c{DIVPD}: Packed Double-Precision FP Divide
1895
1896 \c DIVPD xmm1,xmm2/mem128        ; 66 0F 5E /r     [WILLAMETTE,SSE2]
1897
1898 \c{DIVPD} divides the two packed double-precision FP values in
1899 the destination operand by the two packed double-precision FP
1900 values in the source operand, and stores the packed double-precision
1901 results in the destination register.
1902
1903 The destination is an \c{XMM} register. The source operand can be
1904 either an \c{XMM} register or a 128-bit memory location.
1905
1906 \c    dst[0-63]   := dst[0-63]   / src[0-63],
1907 \c    dst[64-127] := dst[64-127] / src[64-127].
1908
1909
1910 \S{insDIVPS} \i\c{DIVPS}: Packed Single-Precision FP Divide
1911
1912 \c DIVPS xmm1,xmm2/mem128        ; 0F 5E /r        [KATMAI,SSE]
1913
1914 \c{DIVPS} divides the four packed single-precision FP values in
1915 the destination operand by the four packed single-precision FP
1916 values in the source operand, and stores the packed single-precision
1917 results in the destination register.
1918
1919 The destination is an \c{XMM} register. The source operand can be
1920 either an \c{XMM} register or a 128-bit memory location.
1921
1922 \c    dst[0-31]   := dst[0-31]   / src[0-31],
1923 \c    dst[32-63]  := dst[32-63]  / src[32-63],
1924 \c    dst[64-95]  := dst[64-95]  / src[64-95],
1925 \c    dst[96-127] := dst[96-127] / src[96-127].
1926
1927
1928 \S{insDIVSD} \i\c{DIVSD}: Scalar Double-Precision FP Divide
1929
1930 \c DIVSD xmm1,xmm2/mem64         ; F2 0F 5E /r     [WILLAMETTE,SSE2]
1931
1932 \c{DIVSD} divides the low-order double-precision FP value in the
1933 destination operand by the low-order double-precision FP value in
1934 the source operand, and stores the double-precision result in the
1935 destination register.
1936
1937 The destination is an \c{XMM} register. The source operand can be
1938 either an \c{XMM} register or a 64-bit memory location.
1939
1940 \c    dst[0-63]   := dst[0-63] / src[0-63],
1941 \c    dst[64-127] remains unchanged.
1942
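For example, the ratio of two signed integers might be computed in
double precision as follows, using \c{CVTSI2SD} (\k{insCVTSI2SD}) to
convert the inputs. The choice of registers is arbitrary and
\c{ratio} is a hypothetical 64-bit variable:

\c         cvtsi2sd xmm0,eax           ; low quadword of xmm0 = EAX as a double
\c         cvtsi2sd xmm1,ecx           ; low quadword of xmm1 = ECX as a double
\c         divsd    xmm0,xmm1          ; low quadword of xmm0 = EAX / ECX
\c         movsd    [ratio],xmm0       ; store the 64-bit result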
1943
1944 \S{insDIVSS} \i\c{DIVSS}: Scalar Single-Precision FP Divide
1945
1946 \c DIVSS xmm1,xmm2/mem32         ; F3 0F 5E /r     [KATMAI,SSE]
1947
1948 \c{DIVSS} divides the low-order single-precision FP value in the
1949 destination operand by the low-order single-precision FP value in
1950 the source operand, and stores the single-precision result in the
1951 destination register.
1952
1953 The destination is an \c{XMM} register. The source operand can be
1954 either an \c{XMM} register or a 32-bit memory location.
1955
1956 \c    dst[0-31]   := dst[0-31] / src[0-31],
1957 \c    dst[32-127] remains unchanged.
1958
1959
1960 \S{insEMMS} \i\c{EMMS}: Empty MMX State
1961
1962 \c EMMS                          ; 0F 77                [PENT,MMX]
1963
1964 \c{EMMS} sets the FPU tag word (marking which floating-point registers
1965 are available) to all ones, meaning all registers are available for
1966 the FPU to use. It should be used after executing \c{MMX} instructions
1967 and before executing any subsequent floating-point operations.
1968
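A typical placement might look like the following sketch, in which
\c{src}, \c{dst} and \c{scale} are hypothetical variables:

\c         movq    mm0,[src]
\c         paddb   mm0,[src+8]
\c         movq    [dst],mm0           ; last MMX instruction of the routine
\c         emms                        ; mark all FPU registers as empty again
\c         fld     dword [scale]       ; floating-point code can now proceed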
1969
1970 \S{insENTER} \i\c{ENTER}: Create Stack Frame
1971
1972 \c ENTER imm,imm                 ; C8 iw ib             [186]
1973
\c{ENTER} constructs a \i{stack frame} for a high-level language
procedure call. The first operand (the \c{iw} in the opcode
definition above) gives the amount of stack space to allocate for
local variables; the second (the \c{ib} above) gives the nesting
level of the procedure (for languages like Pascal, with nested
procedures).
1980
1981 The function of \c{ENTER}, with a nesting level of zero, is
1982 equivalent to
1983
1984 \c           PUSH EBP            ; or PUSH BP         in 16 bits
1985 \c           MOV EBP,ESP         ; or MOV BP,SP       in 16 bits
1986 \c           SUB ESP,operand1    ; or SUB SP,operand1 in 16 bits
1987
1988 This creates a stack frame with the procedure parameters accessible
1989 upwards from \c{EBP}, and local variables accessible downwards from
1990 \c{EBP}.
1991
1992 With a nesting level of one, the stack frame created is 4 (or 2)
1993 bytes bigger, and the value of the final frame pointer \c{EBP} is
1994 accessible in memory at \c{[EBP-4]}.
1995
1996 This allows \c{ENTER}, when called with a nesting level of two, to
1997 look at the stack frame described by the \e{previous} value of
1998 \c{EBP}, find the frame pointer at offset -4 from that, and push it
1999 along with its new frame pointer, so that when a level-two procedure
2000 is called from within a level-one procedure, \c{[EBP-4]} holds the
2001 frame pointer of the most recent level-one procedure call and
2002 \c{[EBP-8]} holds that of the most recent level-two call. And so on,
2003 for nesting levels up to 31.
2004
2005 Stack frames created by \c{ENTER} can be destroyed by the \c{LEAVE}
2006 instruction: see \k{insLEAVE}.
2007
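A procedure using \c{ENTER} and \c{LEAVE} might be laid out as in the
sketch below. It assumes 32-bit code, a near call, and two dword
parameters pushed by the caller; the name \c{myproc} is hypothetical.

\c myproc: enter   8,0                 ; save EBP, set up the frame, reserve 8 bytes
\c         mov     eax,[ebp+8]         ; first parameter
\c         mov     [ebp-4],eax         ; first local variable
\c         add     eax,[ebp+12]        ; second parameter
\c         mov     [ebp-8],eax         ; second local variable
\c         leave                       ; restore ESP and EBP
\c         ret

The parameters start at \c{[EBP+8]} because the saved \c{EBP} and the
return address occupy the eight bytes immediately above the frame
pointer.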
2008
2009 \S{insF2XM1} \i\c{F2XM1}: Calculate 2**X-1
2010
2011 \c F2XM1                         ; D9 F0                [8086,FPU]
2012
2013 \c{F2XM1} raises 2 to the power of \c{ST0}, subtracts one, and
2014 stores the result back into \c{ST0}. The initial contents of \c{ST0}
2015 must be a number in the range -1.0 to +1.0.
2016
2017
2018 \S{insFABS} \i\c{FABS}: Floating-Point Absolute Value
2019
2020 \c FABS                          ; D9 E1                [8086,FPU]
2021
\c{FABS} computes the absolute value of \c{ST0} by clearing the sign
bit, and stores the result back in \c{ST0}.
2024
2025
2026 \S{insFADD} \i\c{FADD}, \i\c{FADDP}: Floating-Point Addition
2027
2028 \c FADD mem32                    ; D8 /0                [8086,FPU]
2029 \c FADD mem64                    ; DC /0                [8086,FPU]
2030
2031 \c FADD fpureg                   ; D8 C0+r              [8086,FPU]
2032 \c FADD ST0,fpureg               ; D8 C0+r              [8086,FPU]
2033
2034 \c FADD TO fpureg                ; DC C0+r              [8086,FPU]
2035 \c FADD fpureg,ST0               ; DC C0+r              [8086,FPU]
2036
2037 \c FADDP fpureg                  ; DE C0+r              [8086,FPU]
2038 \c FADDP fpureg,ST0              ; DE C0+r              [8086,FPU]
2039
2040 \b \c{FADD}, given one operand, adds the operand to \c{ST0} and stores
2041 the result back in \c{ST0}. If the operand has the \c{TO} modifier,
2042 the result is stored in the register given rather than in \c{ST0}.
2043
2044 \b \c{FADDP} performs the same function as \c{FADD TO}, but pops the
2045 register stack after storing the result.
2046
2047 The given two-operand forms are synonyms for the one-operand forms.
2048
To add an integer value to \c{ST0}, use the \c{FIADD} instruction
(\k{insFIADD}).
2051
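As a small illustration of the memory and popping forms, this sketch
sums three single-precision values; \c{a}, \c{b} and \c{c} are
hypothetical dword variables:

\c         fld     dword [a]           ; ST0 = a
\c         fld     dword [b]           ; ST0 = b, ST1 = a
\c         fadd    dword [c]           ; ST0 = b + c
\c         faddp   st1                 ; ST1 = a + b + c, then pop: ST0 = a + b + c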
2052
2053 \S{insFBLD} \i\c{FBLD}, \i\c{FBSTP}: BCD Floating-Point Load and Store
2054
2055 \c FBLD mem80                    ; DF /4                [8086,FPU]
2056 \c FBSTP mem80                   ; DF /6                [8086,FPU]
2057
2058 \c{FBLD} loads an 80-bit (ten-byte) packed binary-coded decimal
2059 number from the given memory address, converts it to a real, and
2060 pushes it on the register stack. \c{FBSTP} stores the value of
2061 \c{ST0}, in packed BCD, at the given address and then pops the
2062 register stack.
2063
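One use for \c{FBSTP} is converting a binary integer to decimal
digits. A minimal sketch, in which \c{n} is a hypothetical dword
variable and \c{bcd} a hypothetical ten-byte buffer:

\c         fild    dword [n]           ; push the integer n onto the register stack
\c         fbstp   tword [bcd]         ; store it as packed BCD and pop
\c         fbld    tword [bcd]         ; load it back: ST0 = n again

In the stored form, each of the low nine bytes holds two decimal
digits, least significant byte first, and the top bit of the tenth
byte holds the sign.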
2064
2065 \S{insFCHS} \i\c{FCHS}: Floating-Point Change Sign
2066
2067 \c FCHS                          ; D9 E0                [8086,FPU]
2068
2069 \c{FCHS} negates the number in \c{ST0}, by inverting the sign bit:
2070 negative numbers become positive, and vice versa.
2071
2072
2073 \S{insFCLEX} \i\c{FCLEX}, \c{FNCLEX}: Clear Floating-Point Exceptions
2074
2075 \c FCLEX                         ; 9B DB E2             [8086,FPU]
2076 \c FNCLEX                        ; DB E2                [8086,FPU]
2077
2078 \c{FCLEX} clears any floating-point exceptions which may be pending.
2079 \c{FNCLEX} does the same thing but doesn't wait for previous
2080 floating-point operations (including the \e{handling} of pending
2081 exceptions) to finish first.
2082
2083
2084 \S{insFCMOVB} \i\c{FCMOVcc}: Floating-Point Conditional Move
2085
2086 \c FCMOVB fpureg                 ; DA C0+r              [P6,FPU]
2087 \c FCMOVB ST0,fpureg             ; DA C0+r              [P6,FPU]
2088
2089 \c FCMOVE fpureg                 ; DA C8+r              [P6,FPU]
2090 \c FCMOVE ST0,fpureg             ; DA C8+r              [P6,FPU]
2091
2092 \c FCMOVBE fpureg                ; DA D0+r              [P6,FPU]
2093 \c FCMOVBE ST0,fpureg            ; DA D0+r              [P6,FPU]
2094
2095 \c FCMOVU fpureg                 ; DA D8+r              [P6,FPU]
2096 \c FCMOVU ST0,fpureg             ; DA D8+r              [P6,FPU]
2097
2098 \c FCMOVNB fpureg                ; DB C0+r              [P6,FPU]
2099 \c FCMOVNB ST0,fpureg            ; DB C0+r              [P6,FPU]
2100
2101 \c FCMOVNE fpureg                ; DB C8+r              [P6,FPU]
2102 \c FCMOVNE ST0,fpureg            ; DB C8+r              [P6,FPU]
2103
2104 \c FCMOVNBE fpureg               ; DB D0+r              [P6,FPU]
2105 \c FCMOVNBE ST0,fpureg           ; DB D0+r              [P6,FPU]
2106
2107 \c FCMOVNU fpureg                ; DB D8+r              [P6,FPU]
2108 \c FCMOVNU ST0,fpureg            ; DB D8+r              [P6,FPU]
2109
2110 The \c{FCMOV} instructions perform conditional move operations: each
2111 of them moves the contents of the given register into \c{ST0} if its
2112 condition is satisfied, and does nothing if not.
2113
2114 The conditions are not the same as the standard condition codes used
2115 with conditional jump instructions. The conditions \c{B}, \c{BE},
2116 \c{NB}, \c{NBE}, \c{E} and \c{NE} are exactly as normal, but none of
2117 the other standard ones are supported. Instead, the condition \c{U}
2118 and its counterpart \c{NU} are provided; the \c{U} condition is
2119 satisfied if the last two floating-point numbers compared were
2120 \e{unordered}, i.e. they were not equal but neither one could be
2121 said to be greater than the other, for example if they were NaNs.
2122 (The flag state which signals this is the setting of the parity
2123 flag: so the \c{U} condition is notionally equivalent to \c{PE}, and
2124 \c{NU} is equivalent to \c{PO}.)
2125
The \c{FCMOV} conditions test the main processor's status flags, not
the FPU status flags, so using \c{FCMOV} directly after \c{FCOM}
will not work. Instead, you should either use \c{FCOMI}, which writes
directly to the main CPU flags word, or use \c{FSTSW} (\k{insFSTSW})
to extract the FPU flags into \c{AX}, followed by \c{SAHF} to copy
them into the CPU flags.
2131
2132 Although the \c{FCMOV} instructions are flagged \c{P6} above, they
2133 may not be supported by all Pentium Pro processors; the \c{CPUID}
2134 instruction (\k{insCPUID}) will return a bit which indicates whether
2135 conditional moves are supported.
2136
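As an illustration of the \c{FCOMI} route, the following sketch leaves
the larger of \c{ST0} and \c{ST1} in \c{ST0}. It assumes a processor
that supports both \c{FCOMI} and \c{FCMOV}; note that an unordered
comparison also sets the carry flag, so the move is performed in that
case too.

\c         fcomi   st0,st1         ; compare ST0 with ST1, setting CPU flags
\c         fcmovb  st0,st1         ; if ST0 < ST1 (carry set), ST0 = ST1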
2137
2138 \S{insFCOM} \i\c{FCOM}, \i\c{FCOMP}, \i\c{FCOMPP}, \i\c{FCOMI},
2139 \i\c{FCOMIP}: Floating-Point Compare
2140
2141 \c FCOM mem32                    ; D8 /2                [8086,FPU]
2142 \c FCOM mem64                    ; DC /2                [8086,FPU]
2143 \c FCOM fpureg                   ; D8 D0+r              [8086,FPU]
2144 \c FCOM ST0,fpureg               ; D8 D0+r              [8086,FPU]
2145
2146 \c FCOMP mem32                   ; D8 /3                [8086,FPU]
2147 \c FCOMP mem64                   ; DC /3                [8086,FPU]
2148 \c FCOMP fpureg                  ; D8 D8+r              [8086,FPU]
2149 \c FCOMP ST0,fpureg              ; D8 D8+r              [8086,FPU]
2150
2151 \c FCOMPP                        ; DE D9                [8086,FPU]
2152
2153 \c FCOMI fpureg                  ; DB F0+r              [P6,FPU]
2154 \c FCOMI ST0,fpureg              ; DB F0+r              [P6,FPU]
2155
2156 \c FCOMIP fpureg                 ; DF F0+r              [P6,FPU]
2157 \c FCOMIP ST0,fpureg             ; DF F0+r              [P6,FPU]
2158
2159 \c{FCOM} compares \c{ST0} with the given operand, and sets the FPU
2160 flags accordingly. \c{ST0} is treated as the left-hand side of the
2161 comparison, so that the carry flag is set (for a `less-than' result)
2162 if \c{ST0} is less than the given operand.
2163
2164 \c{FCOMP} does the same as \c{FCOM}, but pops the register stack
2165 afterwards. \c{FCOMPP} compares \c{ST0} with \c{ST1} and then pops
2166 the register stack twice.
2167
2168 \c{FCOMI} and \c{FCOMIP} work like the corresponding forms of
2169 \c{FCOM} and \c{FCOMP}, but write their results directly to the CPU
2170 flags register rather than the FPU status word, so they can be
2171 immediately followed by conditional jump or conditional move
2172 instructions.
2173
2174 The \c{FCOM} instructions differ from the \c{FUCOM} instructions
2175 (\k{insFUCOM}) only in the way they handle quiet NaNs: \c{FUCOM}
2176 will handle them silently and set the condition code flags to an
2177 `unordered' result, whereas \c{FCOM} will generate an exception.
2178
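On processors without \c{FCOMI}, the result of \c{FCOM} can be acted
upon by copying the FPU status word into the CPU flags. The following
is a sketch for 16-bit or 32-bit code only; the variable \c{limit} and
the labels \c{unordered} and \c{below} are hypothetical.

\c         fcom    dword [limit]   ; compare ST0 with a 32-bit real in memory
\c         fnstsw  ax              ; AX = FPU status word
\c         sahf                    ; C0 -> CF, C2 -> PF, C3 -> ZF
\c         jp      unordered       ; C2 set: the operands were unordered
\c         jb      below           ; C0 set: ST0 < [limit]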
2179
2180 \S{insFCOS} \i\c{FCOS}: Cosine
2181
2182 \c FCOS                          ; D9 FF                [386,FPU]
2183
2184 \c{FCOS} computes the cosine of \c{ST0} (in radians), and stores the
2185 result in \c{ST0}. The absolute value of \c{ST0} must be less than 2**63.
2186
2187 See also \c{FSINCOS} (\k{insFSIN}).
2188
2189
2190 \S{insFDECSTP} \i\c{FDECSTP}: Decrement Floating-Point Stack Pointer
2191
2192 \c FDECSTP                       ; D9 F6                [8086,FPU]
2193
2194 \c{FDECSTP} decrements the `top' field in the floating-point status
2195 word. This has the effect of rotating the FPU register stack by one,
2196 as if the contents of \c{ST7} had been pushed on the stack. See also
2197 \c{FINCSTP} (\k{insFINCSTP}).
2198
2199
2200 \S{insFDISI} \i\c{FxDISI}, \i\c{FxENI}: Disable and Enable Floating-Point Interrupts
2201
2202 \c FDISI                         ; 9B DB E1             [8086,FPU]
2203 \c FNDISI                        ; DB E1                [8086,FPU]
2204
2205 \c FENI                          ; 9B DB E0             [8086,FPU]
2206 \c FNENI                         ; DB E0                [8086,FPU]
2207
2208 \c{FDISI} and \c{FENI} disable and enable floating-point interrupts.
2209 These instructions are only meaningful on original 8087 processors:
2210 the 287 and above treat them as no-operation instructions.
2211
2212 \c{FNDISI} and \c{FNENI} do the same thing as \c{FDISI} and \c{FENI}
2213 respectively, but without waiting for the floating-point processor
2214 to finish what it was doing first.
2215
2216
2217 \S{insFDIV} \i\c{FDIV}, \i\c{FDIVP}, \i\c{FDIVR}, \i\c{FDIVRP}: Floating-Point Division
2218
2219 \c FDIV mem32                    ; D8 /6                [8086,FPU]
2220 \c FDIV mem64                    ; DC /6                [8086,FPU]
2221
2222 \c FDIV fpureg                   ; D8 F0+r              [8086,FPU]
2223 \c FDIV ST0,fpureg               ; D8 F0+r              [8086,FPU]
2224
2225 \c FDIV TO fpureg                ; DC F8+r              [8086,FPU]
2226 \c FDIV fpureg,ST0               ; DC F8+r              [8086,FPU]
2227
2228 \c FDIVR mem32                   ; D8 /7                [8086,FPU]
2229 \c FDIVR mem64                   ; DC /7                [8086,FPU]
2230
2231 \c FDIVR fpureg                  ; D8 F8+r              [8086,FPU]
2232 \c FDIVR ST0,fpureg              ; D8 F8+r              [8086,FPU]
2233
2234 \c FDIVR TO fpureg               ; DC F0+r              [8086,FPU]
2235 \c FDIVR fpureg,ST0              ; DC F0+r              [8086,FPU]
2236
2237 \c FDIVP fpureg                  ; DE F8+r              [8086,FPU]
2238 \c FDIVP fpureg,ST0              ; DE F8+r              [8086,FPU]
2239
2240 \c FDIVRP fpureg                 ; DE F0+r              [8086,FPU]
2241 \c FDIVRP fpureg,ST0             ; DE F0+r              [8086,FPU]
2242
2243 \b \c{FDIV} divides \c{ST0} by the given operand and stores the result
2244 back in \c{ST0}, unless the \c{TO} qualifier is given, in which case
2245 it divides the given operand by \c{ST0} and stores the result in the
2246 operand.
2247
2248 \b \c{FDIVR} does the same thing, but does the division the other way
2249 up: so if \c{TO} is not given, it divides the given operand by
2250 \c{ST0} and stores the result in \c{ST0}, whereas if \c{TO} is given
2251 it divides \c{ST0} by its operand and stores the result in the
2252 operand.
2253
2254 \b \c{FDIVP} operates like \c{FDIV TO}, but pops the register stack
2255 once it has finished.
2256
2257 \b \c{FDIVRP} operates like \c{FDIVR TO}, but pops the register stack
2258 once it has finished.
2259
2260 For FP/Integer divisions, see \c{FIDIV} (\k{insFIDIV}).
2261
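As a sketch of how the operand order works out in practice, assume two
hypothetical 32-bit real variables \c{a} and \c{b}. After both have
been loaded, \c{FDIVP} yields \c{a/b}, whereas \c{FDIVRP} in its place
would yield \c{b/a}:

\c         fld     dword [a]       ; ST0 = a
\c         fld     dword [b]       ; ST0 = b, ST1 = a
\c         fdivp   st1             ; ST1 = ST1 / ST0 = a / b, then pop
\c                                 ; (FDIVRP ST1 would give ST0 / ST1 = b / a)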
2262
2263 \S{insFEMMS} \i\c{FEMMS}: Faster Enter/Exit of the MMX or floating-point state
2264
2265 \c FEMMS                         ; 0F 0E           [PENT,3DNOW]
2266
2267 \c{FEMMS} can be used in place of the \c{EMMS} instruction on
2268 processors which support the 3DNow! instruction set. Following
2269 execution of \c{FEMMS}, the state of the \c{MMX/FP} registers
2270 is undefined, and this allows a faster context switch between
\c{FP} and \c{MMX} instructions. The \c{FEMMS} instruction can
also be used \e{before} executing \c{MMX} instructions.
2273
2274
2275 \S{insFFREE} \i\c{FFREE}: Flag Floating-Point Register as Unused
2276
2277 \c FFREE fpureg                  ; DD C0+r              [8086,FPU]
2278 \c FFREEP fpureg                 ; DF C0+r              [286,FPU,UNDOC]
2279
2280 \c{FFREE} marks the given register as being empty.
2281
2282 \c{FFREEP} marks the given register as being empty, and then
2283 pops the register stack.
2284
2285
2286 \S{insFIADD} \i\c{FIADD}: Floating-Point/Integer Addition
2287
2288 \c FIADD mem16                   ; DE /0                [8086,FPU]
2289 \c FIADD mem32                   ; DA /0                [8086,FPU]
2290
2291 \c{FIADD} adds the 16-bit or 32-bit integer stored in the given
2292 memory location to \c{ST0}, storing the result in \c{ST0}.
2293
2294
2295 \S{insFICOM} \i\c{FICOM}, \i\c{FICOMP}: Floating-Point/Integer Compare
2296
2297 \c FICOM mem16                   ; DE /2                [8086,FPU]
2298 \c FICOM mem32                   ; DA /2                [8086,FPU]
2299
2300 \c FICOMP mem16                  ; DE /3                [8086,FPU]
2301 \c FICOMP mem32                  ; DA /3                [8086,FPU]
2302
2303 \c{FICOM} compares \c{ST0} with the 16-bit or 32-bit integer stored
2304 in the given memory location, and sets the FPU flags accordingly.
2305 \c{FICOMP} does the same, but pops the register stack afterwards.
2306
2307
2308 \S{insFIDIV} \i\c{FIDIV}, \i\c{FIDIVR}: Floating-Point/Integer Division
2309
2310 \c FIDIV mem16                   ; DE /6                [8086,FPU]
2311 \c FIDIV mem32                   ; DA /6                [8086,FPU]
2312
2313 \c FIDIVR mem16                  ; DE /7                [8086,FPU]
2314 \c FIDIVR mem32                  ; DA /7                [8086,FPU]
2315
2316 \c{FIDIV} divides \c{ST0} by the 16-bit or 32-bit integer stored in
2317 the given memory location, and stores the result in \c{ST0}.
2318 \c{FIDIVR} does the division the other way up: it divides the
2319 integer by \c{ST0}, but still stores the result in \c{ST0}.
2320
2321
2322 \S{insFILD} \i\c{FILD}, \i\c{FIST}, \i\c{FISTP}: Floating-Point/Integer Conversion
2323
2324 \c FILD mem16                    ; DF /0                [8086,FPU]
2325 \c FILD mem32                    ; DB /0                [8086,FPU]
2326 \c FILD mem64                    ; DF /5                [8086,FPU]
2327
2328 \c FIST mem16                    ; DF /2                [8086,FPU]
2329 \c FIST mem32                    ; DB /2                [8086,FPU]
2330
2331 \c FISTP mem16                   ; DF /3                [8086,FPU]
2332 \c FISTP mem32                   ; DB /3                [8086,FPU]
2333 \c FISTP mem64                   ; DF /7                [8086,FPU]
2334
2335 \c{FILD} loads an integer out of a memory location, converts it to a
2336 real, and pushes it on the FPU register stack. \c{FIST} converts
2337 \c{ST0} to an integer and stores that in memory; \c{FISTP} does the
2338 same as \c{FIST}, but pops the register stack afterwards.
2339
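For instance, the following sketch scales an integer by a real factor
and stores the result as an integer again. The variables \c{count},
\c{scale} and \c{result} are hypothetical; the conversion performed by
\c{FISTP} rounds according to the rounding mode currently set in the
FPU control word (see \k{insFLDCW}).

\c         fild    dword [count]   ; push the 32-bit integer as a real
\c         fmul    dword [scale]   ; ST0 = count * scale (scale is a 32-bit real)
\c         fistp   dword [result]  ; convert to integer, store, and pop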
2340
2341 \S{insFIMUL} \i\c{FIMUL}: Floating-Point/Integer Multiplication
2342
2343 \c FIMUL mem16                   ; DE /1                [8086,FPU]
2344 \c FIMUL mem32                   ; DA /1                [8086,FPU]
2345
2346 \c{FIMUL} multiplies \c{ST0} by the 16-bit or 32-bit integer stored
2347 in the given memory location, and stores the result in \c{ST0}.
2348
2349
2350 \S{insFINCSTP} \i\c{FINCSTP}: Increment Floating-Point Stack Pointer
2351
2352 \c FINCSTP                       ; D9 F7                [8086,FPU]
2353
2354 \c{FINCSTP} increments the `top' field in the floating-point status
2355 word. This has the effect of rotating the FPU register stack by one,
2356 as if the register stack had been popped; however, unlike the
2357 popping of the stack performed by many FPU instructions, it does not
2358 flag the new \c{ST7} (previously \c{ST0}) as empty. See also
2359 \c{FDECSTP} (\k{insFDECSTP}).
2360
2361
\S{insFINIT} \i\c{FINIT}, \i\c{FNINIT}: Initialize Floating-Point Unit
2363
2364 \c FINIT                         ; 9B DB E3             [8086,FPU]
2365 \c FNINIT                        ; DB E3                [8086,FPU]
2366
\c{FINIT} initializes the FPU to its default state. It flags all
registers as empty, without actually changing their values, and
clears the top-of-stack pointer. \c{FNINIT} does the same, without
first waiting for pending exceptions to clear.
2371
2372
\S{insFISUB} \i\c{FISUB}, \i\c{FISUBR}: Floating-Point/Integer Subtraction
2374
2375 \c FISUB mem16                   ; DE /4                [8086,FPU]
2376 \c FISUB mem32                   ; DA /4                [8086,FPU]
2377
2378 \c FISUBR mem16                  ; DE /5                [8086,FPU]
2379 \c FISUBR mem32                  ; DA /5                [8086,FPU]
2380
2381 \c{FISUB} subtracts the 16-bit or 32-bit integer stored in the given
2382 memory location from \c{ST0}, and stores the result in \c{ST0}.
2383 \c{FISUBR} does the subtraction the other way round, i.e. it
2384 subtracts \c{ST0} from the given integer, but still stores the
2385 result in \c{ST0}.
2386
2387
2388 \S{insFLD} \i\c{FLD}: Floating-Point Load
2389
2390 \c FLD mem32                     ; D9 /0                [8086,FPU]
2391 \c FLD mem64                     ; DD /0                [8086,FPU]
2392 \c FLD mem80                     ; DB /5                [8086,FPU]
2393 \c FLD fpureg                    ; D9 C0+r              [8086,FPU]
2394
2395 \c{FLD} loads a floating-point value out of the given register or
2396 memory location, and pushes it on the FPU register stack.
2397
2398
2399 \S{insFLD1} \i\c{FLDxx}: Floating-Point Load Constants
2400
2401 \c FLD1                          ; D9 E8                [8086,FPU]
2402 \c FLDL2E                        ; D9 EA                [8086,FPU]
2403 \c FLDL2T                        ; D9 E9                [8086,FPU]
2404 \c FLDLG2                        ; D9 EC                [8086,FPU]
2405 \c FLDLN2                        ; D9 ED                [8086,FPU]
2406 \c FLDPI                         ; D9 EB                [8086,FPU]
2407 \c FLDZ                          ; D9 EE                [8086,FPU]
2408
2409 These instructions push specific standard constants on the FPU
2410 register stack.
2411
2412 \c  Instruction    Constant pushed
2413
2414 \c  FLD1           1
2415 \c  FLDL2E         base-2 logarithm of e
2416 \c  FLDL2T         base-2 log of 10
2417 \c  FLDLG2         base-10 log of 2
2418 \c  FLDLN2         base-e log of 2
2419 \c  FLDPI          pi
2420 \c  FLDZ           zero
2421
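For example, the circumference of a circle might be computed as
follows, assuming a hypothetical 32-bit real variable \c{radius}:

\c         fldpi                   ; ST0 = pi
\c         fadd    st0,st0         ; ST0 = 2 * pi
\c         fmul    dword [radius]  ; ST0 = 2 * pi * radius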
2422
2423 \S{insFLDCW} \i\c{FLDCW}: Load Floating-Point Control Word
2424
2425 \c FLDCW mem16                   ; D9 /5                [8086,FPU]
2426
2427 \c{FLDCW} loads a 16-bit value out of memory and stores it into the
2428 FPU control word (governing things like the rounding mode, the
2429 precision, and the exception masks). See also \c{FSTCW}
2430 (\k{insFSTCW}). If exceptions are enabled and you don't want to
2431 generate one, use \c{FCLEX} or \c{FNCLEX} (\k{insFCLEX}) before
2432 loading the new control word.
2433
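A common use of \c{FLDCW} is to change the rounding mode temporarily.
The following sketch makes \c{FISTP} truncate towards zero;
\c{oldcw}, \c{newcw} and \c{result} are hypothetical variables (two
16-bit words and one 32-bit integer).

\c         fnstcw  word [oldcw]    ; save the current control word
\c         mov     ax,[oldcw]
\c         or      ax,0C00h        ; rounding control (bits 10-11) = 11b
\c         mov     [newcw],ax
\c         fldcw   word [newcw]    ; round towards zero from now on
\c         fistp   dword [result]  ; store ST0 truncated, then pop
\c         fldcw   word [oldcw]    ; restore the original rounding mode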
2434
2435 \S{insFLDENV} \i\c{FLDENV}: Load Floating-Point Environment
2436
2437 \c FLDENV mem                    ; D9 /4                [8086,FPU]
2438
2439 \c{FLDENV} loads the FPU operating environment (control word, status
2440 word, tag word, instruction pointer, data pointer and last opcode)
2441 from memory. The memory area is 14 or 28 bytes long, depending on
2442 the CPU mode at the time. See also \c{FSTENV} (\k{insFSTENV}).
2443
2444
2445 \S{insFMUL} \i\c{FMUL}, \i\c{FMULP}: Floating-Point Multiply
2446
2447 \c FMUL mem32                    ; D8 /1                [8086,FPU]
2448 \c FMUL mem64                    ; DC /1                [8086,FPU]
2449
2450 \c FMUL fpureg                   ; D8 C8+r              [8086,FPU]
2451 \c FMUL ST0,fpureg               ; D8 C8+r              [8086,FPU]
2452
2453 \c FMUL TO fpureg                ; DC C8+r              [8086,FPU]
2454 \c FMUL fpureg,ST0               ; DC C8+r              [8086,FPU]
2455
2456 \c FMULP fpureg                  ; DE C8+r              [8086,FPU]
2457 \c FMULP fpureg,ST0              ; DE C8+r              [8086,FPU]
2458
2459 \c{FMUL} multiplies \c{ST0} by the given operand, and stores the
2460 result in \c{ST0}, unless the \c{TO} qualifier is used in which case
2461 it stores the result in the operand. \c{FMULP} performs the same
2462 operation as \c{FMUL TO}, and then pops the register stack.
2463
2464
2465 \S{insFNOP} \i\c{FNOP}: Floating-Point No Operation
2466
2467 \c FNOP                          ; D9 D0                [8086,FPU]
2468
2469 \c{FNOP} does nothing.
2470
2471
2472 \S{insFPATAN} \i\c{FPATAN}, \i\c{FPTAN}: Arctangent and Tangent
2473
2474 \c FPATAN                        ; D9 F3                [8086,FPU]
2475 \c FPTAN                         ; D9 F2                [8086,FPU]
2476
2477 \c{FPATAN} computes the arctangent, in radians, of the result of
2478 dividing \c{ST1} by \c{ST0}, stores the result in \c{ST1}, and pops
2479 the register stack. It works like the C \c{atan2} function, in that
2480 changing the sign of both \c{ST0} and \c{ST1} changes the output
2481 value by pi (so it performs true rectangular-to-polar coordinate
2482 conversion, with \c{ST1} being the Y coordinate and \c{ST0} being
2483 the X coordinate, not merely an arctangent).
2484
\c{FPTAN} computes the tangent of the value in \c{ST0} (in radians),
stores the result back into \c{ST0}, and then pushes the constant 1.0
on the register stack, so that the tangent ends up in \c{ST1}.
2487
2488 The absolute value of \c{ST0} must be less than 2**63.
2489
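The rectangular-to-polar use of \c{FPATAN} can be sketched as follows,
assuming hypothetical 64-bit real variables \c{x}, \c{y} and
\c{angle}:

\c         fld     qword [y]       ; ST0 = y
\c         fld     qword [x]       ; ST0 = x, ST1 = y
\c         fpatan                  ; ST0 = angle of the point (x,y), in radians
\c         fstp    qword [angle]   ; store the angle and pop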
2490
2491 \S{insFPREM} \i\c{FPREM}, \i\c{FPREM1}: Floating-Point Partial Remainder
2492
2493 \c FPREM                         ; D9 F8                [8086,FPU]
2494 \c FPREM1                        ; D9 F5                [386,FPU]
2495
2496 These instructions both produce the remainder obtained by dividing
2497 \c{ST0} by \c{ST1}. This is calculated, notionally, by dividing
2498 \c{ST0} by \c{ST1}, rounding the result to an integer, multiplying
2499 by \c{ST1} again, and computing the value which would need to be
2500 added back on to the result to get back to the original value in
2501 \c{ST0}.
2502
2503 The two instructions differ in the way the notional round-to-integer
2504 operation is performed. \c{FPREM} does it by rounding towards zero,
2505 so that the remainder it returns always has the same sign as the
2506 original value in \c{ST0}; \c{FPREM1} does it by rounding to the
2507 nearest integer, so that the remainder always has at most half the
2508 magnitude of \c{ST1}.
2509
2510 Both instructions calculate \e{partial} remainders, meaning that
2511 they may not manage to provide the final result, but might leave
2512 intermediate results in \c{ST0} instead. If this happens, they will
2513 set the C2 flag in the FPU status word; therefore, to calculate a
2514 remainder, you should repeatedly execute \c{FPREM} or \c{FPREM1}
2515 until C2 becomes clear.
2516
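The loop described above might be written as follows. This is only a
sketch: \c{value}, \c{divisor} and \c{result} are hypothetical 64-bit
real variables.

\c         fld     qword [divisor] ; ST0 = divisor
\c         fld     qword [value]   ; ST0 = value, ST1 = divisor
\c again:  fprem                   ; partial remainder of ST0 / ST1
\c         fnstsw  ax              ; AX = FPU status word
\c         test    ah,04h          ; C2 is bit 10 of the status word
\c         jnz     again           ; C2 set: reduction not yet complete
\c         fstp    qword [result]  ; store the remainder and pop
\c         fstp    st0             ; discard the divisor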
2517
2518 \S{insFRNDINT} \i\c{FRNDINT}: Floating-Point Round to Integer
2519
2520 \c FRNDINT                       ; D9 FC                [8086,FPU]
2521
2522 \c{FRNDINT} rounds the contents of \c{ST0} to an integer, according
2523 to the current rounding mode set in the FPU control word, and stores
2524 the result back in \c{ST0}.
2525
2526
2527 \S{insFRSTOR} \i\c{FSAVE}, \i\c{FRSTOR}: Save/Restore Floating-Point State
2528
2529 \c FSAVE mem                     ; 9B DD /6             [8086,FPU]
2530 \c FNSAVE mem                    ; DD /6                [8086,FPU]
2531
2532 \c FRSTOR mem                    ; DD /4                [8086,FPU]
2533
2534 \c{FSAVE} saves the entire floating-point unit state, including all
2535 the information saved by \c{FSTENV} (\k{insFSTENV}) plus the
2536 contents of all the registers, to a 94 or 108 byte area of memory
2537 (depending on the CPU mode). \c{FRSTOR} restores the floating-point
2538 state from the same area of memory.
2539
2540 \c{FNSAVE} does the same as \c{FSAVE}, without first waiting for
2541 pending floating-point exceptions to clear.
2542
2543
2544 \S{insFSCALE} \i\c{FSCALE}: Scale Floating-Point Value by Power of Two
2545
2546 \c FSCALE                        ; D9 FD                [8086,FPU]
2547
2548 \c{FSCALE} scales a number by a power of two: it rounds \c{ST1}
2549 towards zero to obtain an integer, then multiplies \c{ST0} by two to
2550 the power of that integer, and stores the result in \c{ST0}.
2551
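For example, to multiply a real by two to the power of an integer,
assuming hypothetical variables \c{n} (a 32-bit integer) and \c{x}
(a 64-bit real):

\c         fild    dword [n]       ; ST0 = n
\c         fld     qword [x]       ; ST0 = x, ST1 = n
\c         fscale                  ; ST0 = x * 2**n
\c         fstp    st1             ; drop n, leaving the result in ST0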
2552
2553 \S{insFSETPM} \i\c{FSETPM}: Set Protected Mode
2554
2555 \c FSETPM                        ; DB E4                [286,FPU]
2556
2557 This instruction initializes protected mode on the 287 floating-point
2558 coprocessor. It is only meaningful on that processor: the 387 and
2559 above treat the instruction as a no-operation.
2560
2561
2562 \S{insFSIN} \i\c{FSIN}, \i\c{FSINCOS}: Sine and Cosine
2563
2564 \c FSIN                          ; D9 FE                [386,FPU]
2565 \c FSINCOS                       ; D9 FB                [386,FPU]
2566
2567 \c{FSIN} calculates the sine of \c{ST0} (in radians) and stores the
2568 result in \c{ST0}. \c{FSINCOS} does the same, but then pushes the
2569 cosine of the same value on the register stack, so that the sine
2570 ends up in \c{ST1} and the cosine in \c{ST0}. \c{FSINCOS} is faster
2571 than executing \c{FSIN} and \c{FCOS} (see \k{insFCOS}) in succession.
2572
2573 The absolute value of \c{ST0} must be less than 2**63.
2574
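A short sketch of \c{FSINCOS}, assuming hypothetical 64-bit real
variables \c{angle}, \c{sine} and \c{cosine}:

\c         fld     qword [angle]   ; ST0 = angle in radians
\c         fsincos                 ; ST0 = cos(angle), ST1 = sin(angle)
\c         fstp    qword [cosine]  ; store the cosine and pop
\c         fstp    qword [sine]    ; store the sine and pop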
2575
2576 \S{insFSQRT} \i\c{FSQRT}: Floating-Point Square Root
2577
2578 \c FSQRT                         ; D9 FA                [8086,FPU]
2579
2580 \c{FSQRT} calculates the square root of \c{ST0} and stores the
2581 result in \c{ST0}.
2582
2583
2584 \S{insFST} \i\c{FST}, \i\c{FSTP}: Floating-Point Store
2585
2586 \c FST mem32                     ; D9 /2                [8086,FPU]
2587 \c FST mem64                     ; DD /2                [8086,FPU]
2588 \c FST fpureg                    ; DD D0+r              [8086,FPU]
2589
2590 \c FSTP mem32                    ; D9 /3                [8086,FPU]
2591 \c FSTP mem64                    ; DD /3                [8086,FPU]
2592 \c FSTP mem80                    ; DB /7                [8086,FPU]
2593 \c FSTP fpureg                   ; DD D8+r              [8086,FPU]
2594
2595 \c{FST} stores the value in \c{ST0} into the given memory location
2596 or other FPU register. \c{FSTP} does the same, but then pops the
2597 register stack.
2598
2599
2600 \S{insFSTCW} \i\c{FSTCW}: Store Floating-Point Control Word
2601
2602 \c FSTCW mem16                   ; 9B D9 /7             [8086,FPU]
2603 \c FNSTCW mem16                  ; D9 /7                [8086,FPU]
2604
2605 \c{FSTCW} stores the \c{FPU} control word (governing things like the
2606 rounding mode, the precision, and the exception masks) into a 2-byte
2607 memory area. See also \c{FLDCW} (\k{insFLDCW}).
2608
2609 \c{FNSTCW} does the same thing as \c{FSTCW}, without first waiting
2610 for pending floating-point exceptions to clear.
2611
2612
2613 \S{insFSTENV} \i\c{FSTENV}: Store Floating-Point Environment
2614
2615 \c FSTENV mem                    ; 9B D9 /6             [8086,FPU]
2616 \c FNSTENV mem                   ; D9 /6                [8086,FPU]
2617
2618 \c{FSTENV} stores the \c{FPU} operating environment (control word,
2619 status word, tag word, instruction pointer, data pointer and last
2620 opcode) into memory. The memory area is 14 or 28 bytes long,
2621 depending on the CPU mode at the time. See also \c{FLDENV}
2622 (\k{insFLDENV}).
2623
2624 \c{FNSTENV} does the same thing as \c{FSTENV}, without first waiting
2625 for pending floating-point exceptions to clear.
2626
2627
2628 \S{insFSTSW} \i\c{FSTSW}: Store Floating-Point Status Word
2629
2630 \c FSTSW mem16                   ; 9B DD /7             [8086,FPU]
2631 \c FSTSW AX                      ; 9B DF E0             [286,FPU]
2632
2633 \c FNSTSW mem16                  ; DD /7                [8086,FPU]
2634 \c FNSTSW AX                     ; DF E0                [286,FPU]
2635
2636 \c{FSTSW} stores the \c{FPU} status word into \c{AX} or into a 2-byte
2637 memory area.
2638
2639 \c{FNSTSW} does the same thing as \c{FSTSW}, without first waiting
2640 for pending floating-point exceptions to clear.
2641
2642
2643 \S{insFSUB} \i\c{FSUB}, \i\c{FSUBP}, \i\c{FSUBR}, \i\c{FSUBRP}: Floating-Point Subtract
2644
2645 \c FSUB mem32                    ; D8 /4                [8086,FPU]
2646 \c FSUB mem64                    ; DC /4                [8086,FPU]
2647
2648 \c FSUB fpureg                   ; D8 E0+r              [8086,FPU]
2649 \c FSUB ST0,fpureg               ; D8 E0+r              [8086,FPU]
2650
2651 \c FSUB TO fpureg                ; DC E8+r              [8086,FPU]
2652 \c FSUB fpureg,ST0               ; DC E8+r              [8086,FPU]
2653
2654 \c FSUBR mem32                   ; D8 /5                [8086,FPU]
2655 \c FSUBR mem64                   ; DC /5                [8086,FPU]
2656
2657 \c FSUBR fpureg                  ; D8 E8+r              [8086,FPU]
2658 \c FSUBR ST0,fpureg              ; D8 E8+r              [8086,FPU]
2659
2660 \c FSUBR TO fpureg               ; DC E0+r              [8086,FPU]
2661 \c FSUBR fpureg,ST0              ; DC E0+r              [8086,FPU]
2662
2663 \c FSUBP fpureg                  ; DE E8+r              [8086,FPU]
2664 \c FSUBP fpureg,ST0              ; DE E8+r              [8086,FPU]
2665
2666 \c FSUBRP fpureg                 ; DE E0+r              [8086,FPU]
2667 \c FSUBRP fpureg,ST0             ; DE E0+r              [8086,FPU]
2668
2669 \b \c{FSUB} subtracts the given operand from \c{ST0} and stores the
2670 result back in \c{ST0}, unless the \c{TO} qualifier is given, in
2671 which case it subtracts \c{ST0} from the given operand and stores
2672 the result in the operand.
2673
2674 \b \c{FSUBR} does the same thing, but does the subtraction the other
2675 way up: so if \c{TO} is not given, it subtracts \c{ST0} from the given
2676 operand and stores the result in \c{ST0}, whereas if \c{TO} is given
2677 it subtracts its operand from \c{ST0} and stores the result in the
2678 operand.
2679
2680 \b \c{FSUBP} operates like \c{FSUB TO}, but pops the register stack
2681 once it has finished.
2682
2683 \b \c{FSUBRP} operates like \c{FSUBR TO}, but pops the register stack
2684 once it has finished.
2685
2686
2687 \S{insFTST} \i\c{FTST}: Test \c{ST0} Against Zero
2688
2689 \c FTST                          ; D9 E4                [8086,FPU]
2690
2691 \c{FTST} compares \c{ST0} with zero and sets the FPU flags
2692 accordingly. \c{ST0} is treated as the left-hand side of the
2693 comparison, so that a `less-than' result is generated if \c{ST0} is
2694 negative.
2695
2696
2697 \S{insFUCOM} \i\c{FUCOMxx}: Floating-Point Unordered Compare
2698
2699 \c FUCOM fpureg                  ; DD E0+r              [386,FPU]
2700 \c FUCOM ST0,fpureg              ; DD E0+r              [386,FPU]
2701
2702 \c FUCOMP fpureg                 ; DD E8+r              [386,FPU]
2703 \c FUCOMP ST0,fpureg             ; DD E8+r              [386,FPU]
2704
2705 \c FUCOMPP                       ; DA E9                [386,FPU]
2706
2707 \c FUCOMI fpureg                 ; DB E8+r              [P6,FPU]
2708 \c FUCOMI ST0,fpureg             ; DB E8+r              [P6,FPU]
2709
2710 \c FUCOMIP fpureg                ; DF E8+r              [P6,FPU]
2711 \c FUCOMIP ST0,fpureg            ; DF E8+r              [P6,FPU]
2712
2713 \b \c{FUCOM} compares \c{ST0} with the given operand, and sets the
2714 FPU flags accordingly. \c{ST0} is treated as the left-hand side of
2715 the comparison, so that the carry flag is set (for a `less-than'
2716 result) if \c{ST0} is less than the given operand.
2717
2718 \b \c{FUCOMP} does the same as \c{FUCOM}, but pops the register stack
2719 afterwards. \c{FUCOMPP} compares \c{ST0} with \c{ST1} and then pops
2720 the register stack twice.
2721
2722 \b \c{FUCOMI} and \c{FUCOMIP} work like the corresponding forms of
2723 \c{FUCOM} and \c{FUCOMP}, but write their results directly to the CPU
2724 flags register rather than the FPU status word, so they can be
2725 immediately followed by conditional jump or conditional move
2726 instructions.
2727
2728 The \c{FUCOM} instructions differ from the \c{FCOM} instructions
2729 (\k{insFCOM}) only in the way they handle quiet NaNs: \c{FUCOM} will
2730 handle them silently and set the condition code flags to an
2731 `unordered' result, whereas \c{FCOM} will generate an exception.
2732
2733
2734 \S{insFXAM} \i\c{FXAM}: Examine Class of Value in \c{ST0}
2735
2736 \c FXAM                          ; D9 E5                [8086,FPU]
2737
2738 \c{FXAM} sets the FPU flags \c{C3}, \c{C2} and \c{C0} depending on
2739 the type of value stored in \c{ST0}:
2740
2741 \c  Register contents     Flags
2742
2743 \c  Unsupported format    000
2744 \c  NaN                   001
2745 \c  Finite number         010
2746 \c  Infinity              011
2747 \c  Zero                  100
2748 \c  Empty register        101
2749 \c  Denormal              110
2750
2751 Additionally, the \c{C1} flag is set to the sign of the number.
2752
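A sketch of how the classification might be tested, here branching to
a hypothetical label \c{is_nan} when \c{ST0} holds a NaN. In the high
byte of the status word, \c{C0} is bit 0, \c{C2} is bit 2 and \c{C3}
is bit 6:

\c         fxam                    ; classify the value in ST0
\c         fnstsw  ax              ; AX = FPU status word
\c         and     ah,45h          ; keep C3 (40h), C2 (04h) and C0 (01h)
\c         cmp     ah,01h          ; C3,C2,C0 = 001 means NaN
\c         je      is_nan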
2753
2754 \S{insFXCH} \i\c{FXCH}: Floating-Point Exchange
2755
2756 \c FXCH                          ; D9 C9                [8086,FPU]
2757 \c FXCH fpureg                   ; D9 C8+r              [8086,FPU]
2758 \c FXCH fpureg,ST0               ; D9 C8+r              [8086,FPU]
2759 \c FXCH ST0,fpureg               ; D9 C8+r              [8086,FPU]
2760
2761 \c{FXCH} exchanges \c{ST0} with a given FPU register. The no-operand
2762 form exchanges \c{ST0} with \c{ST1}.
2763
2764
2765 \S{insFXRSTOR} \i\c{FXRSTOR}: Restore \c{FP}, \c{MMX} and \c{SSE} State
2766
\c FXRSTOR memory                ; 0F AE /1         [P6,SSE,FPU]
2768
2769 The \c{FXRSTOR} instruction reloads the \c{FPU}, \c{MMX} and \c{SSE}
2770 state (environment and registers), from the 512 byte memory area defined
2771 by the source operand. This data should have been written by a previous
2772 \c{FXSAVE}.
2773
2774
2775 \S{insFXSAVE} \i\c{FXSAVE}: Store \c{FP}, \c{MMX} and \c{SSE} State
2776
2777 \c FXSAVE memory                 ; 0F AE /0         [P6,SSE,FPU]
2778
The \c{FXSAVE} instruction writes the current \c{FPU}, \c{MMX}
2780 and \c{SSE} technology states (environment and registers), to the
2781 512 byte memory area defined by the destination operand. It does this
2782 without checking for pending unmasked floating-point exceptions
2783 (similar to the operation of \c{FNSAVE}).
2784
Unlike the \c{FSAVE/FNSAVE} instructions, \c{FXSAVE} leaves the
contents of the \c{FPU}, \c{MMX} and \c{SSE} state intact in the
processor after the state has been saved. This instruction has been
optimized to maximize floating-point save performance.
2789
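A minimal sketch of saving and restoring the state, using a
hypothetical buffer \c{fxarea}. The memory area must be aligned on a
16-byte boundary:

\c         fxsave  [fxarea]        ; save the FPU, MMX and SSE state
\c         ; ... code which alters that state ...
\c         fxrstor [fxarea]        ; restore the saved state

\c         section .bss
\c         alignb  16              ; FXSAVE needs a 16-byte aligned area
\c fxarea: resb    512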
2790
2791 \S{insFXTRACT} \i\c{FXTRACT}: Extract Exponent and Significand
2792
2793 \c FXTRACT                       ; D9 F4                [8086,FPU]
2794
2795 \c{FXTRACT} separates the number in \c{ST0} into its exponent and
2796 significand (mantissa), stores the exponent back into \c{ST0}, and
2797 then pushes the significand on the register stack (so that the
2798 significand ends up in \c{ST0}, and the exponent in \c{ST1}).
2799
2800
2801 \S{insFYL2X} \i\c{FYL2X}, \i\c{FYL2XP1}: Compute Y times Log2(X) or Log2(X+1)
2802
2803 \c FYL2X                         ; D9 F1                [8086,FPU]
2804 \c FYL2XP1                       ; D9 F9                [8086,FPU]
2805
2806 \c{FYL2X} multiplies \c{ST1} by the base-2 logarithm of \c{ST0},
2807 stores the result in \c{ST1}, and pops the register stack (so that
2808 the result ends up in \c{ST0}). \c{ST0} must be non-zero and
2809 positive.
2810
2811 \c{FYL2XP1} works the same way, but replacing the base-2 log of
2812 \c{ST0} with that of \c{ST0} plus one. This time, \c{ST0} must have
2813 magnitude no greater than 1 minus half the square root of two.
2814
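For example, a plain base-2 logarithm can be obtained by loading 1.0
as the multiplier. This is only a sketch: \c{x} is a hypothetical
label for a positive double-precision value in memory.

\c         fld1                    ; ST0 = 1.0
\c         fld     qword [x]       ; ST0 = x, ST1 = 1.0
\c         fyl2x                   ; ST0 = 1.0 * log2(x)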
2815
2816 \S{insHLT} \i\c{HLT}: Halt Processor
2817
2818 \c HLT                           ; F4                   [8086,PRIV]
2819
2820 \c{HLT} puts the processor into a halted state, where it will
2821 perform no more operations until restarted by an interrupt or a
2822 reset.
2823
2824 On the 286 and later processors, this is a privileged instruction.
2825
2826
2827 \S{insIBTS} \i\c{IBTS}: Insert Bit String
2828
2829 \c IBTS r/m16,reg16              ; o16 0F A7 /r         [386,UNDOC]
2830 \c IBTS r/m32,reg32              ; o32 0F A7 /r         [386,UNDOC]
2831
2832 The implied operation of this instruction is:
2833
2834 \c IBTS r/m16,AX,CL,reg16
2835 \c IBTS r/m32,EAX,CL,reg32
2836
2837 Writes a bit string from the source operand to the destination.
2838 \c{CL} indicates the number of bits to be copied, from the low bits
2839 of the source. \c{(E)AX} indicates the low order bit offset in the
2840 destination that is written to. For example, if \c{CL} is set to 4
2841 and \c{AX} (for 16-bit code) is set to 5, bits 0-3 of \c{src} will
2842 be copied to bits 5-8 of \c{dst}. This instruction is very poorly
2843 documented, and I have been unable to find any official source of
2844 documentation on it.
2845
2846 \c{IBTS} is supported only on the early Intel 386s, and conflicts
2847 with the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM
2848 supports it only for completeness. Its counterpart is \c{XBTS}
2849 (see \k{insXBTS}).
2850
2851
2852 \S{insIDIV} \i\c{IDIV}: Signed Integer Divide
2853
2854 \c IDIV r/m8                     ; F6 /7                [8086]
2855 \c IDIV r/m16                    ; o16 F7 /7            [8086]
2856 \c IDIV r/m32                    ; o32 F7 /7            [386]
2857
2858 \c{IDIV} performs signed integer division. The explicit operand
2859 provided is the divisor; the dividend and destination operands
2860 are implicit, in the following way:
2861
2862 \b For \c{IDIV r/m8}, \c{AX} is divided by the given operand;
2863 the quotient is stored in \c{AL} and the remainder in \c{AH}.
2864
2865 \b For \c{IDIV r/m16}, \c{DX:AX} is divided by the given operand;
2866 the quotient is stored in \c{AX} and the remainder in \c{DX}.
2867
2868 \b For \c{IDIV r/m32}, \c{EDX:EAX} is divided by the given operand;
2869 the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
2870
2871 Unsigned integer division is performed by the \c{DIV} instruction:
2872 see \k{insDIV}.
2873
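Because the dividend is implicit and twice the width of the divisor,
it usually has to be sign-extended first. A short sketch for the
32-bit form, using \c{CDQ} to fill \c{EDX} (the values are chosen
purely for illustration):

\c         mov     eax,-7          ; low half of the dividend
\c         cdq                     ; sign-extend EAX into EDX:EAX
\c         mov     ecx,2           ; divisor
\c         idiv    ecx             ; EAX = -3 (quotient), EDX = -1 (remainder)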
2874
2875 \S{insIMUL} \i\c{IMUL}: Signed Integer Multiply
2876
2877 \c IMUL r/m8                     ; F6 /5                [8086]
2878 \c IMUL r/m16                    ; o16 F7 /5            [8086]
2879 \c IMUL r/m32                    ; o32 F7 /5            [386]
2880
2881 \c IMUL reg16,r/m16              ; o16 0F AF /r         [386]
2882 \c IMUL reg32,r/m32              ; o32 0F AF /r         [386]
2883
2884 \c IMUL reg16,imm8               ; o16 6B /r ib         [186]
2885 \c IMUL reg16,imm16              ; o16 69 /r iw         [186]
2886 \c IMUL reg32,imm8               ; o32 6B /r ib         [386]
2887 \c IMUL reg32,imm32              ; o32 69 /r id         [386]
2888
2889 \c IMUL reg16,r/m16,imm8         ; o16 6B /r ib         [186]
2890 \c IMUL reg16,r/m16,imm16        ; o16 69 /r iw         [186]
2891 \c IMUL reg32,r/m32,imm8         ; o32 6B /r ib         [386]
2892 \c IMUL reg32,r/m32,imm32        ; o32 69 /r id         [386]
2893
2894 \c{IMUL} performs signed integer multiplication. For the
2895 single-operand form, the other operand and destination are
2896 implicit, in the following way:
2897
2898 \b For \c{IMUL r/m8}, \c{AL} is multiplied by the given operand;
2899 the product is stored in \c{AX}.
2900
2901 \b For \c{IMUL r/m16}, \c{AX} is multiplied by the given operand;
2902 the product is stored in \c{DX:AX}.
2903
2904 \b For \c{IMUL r/m32}, \c{EAX} is multiplied by the given operand;
2905 the product is stored in \c{EDX:EAX}.
2906
2907 The two-operand form multiplies its two operands and stores the
2908 result in the destination (first) operand. The three-operand
2909 form multiplies its last two operands and stores the result in
2910 the first operand.
2911
2912 The two-operand form with an immediate second operand is in
2913 fact a shorthand for the three-operand form, as can be seen by
2914 examining the opcode descriptions: in the two-operand form, the
2915 code \c{/r} takes both its register and \c{r/m} parts from the
2916 same operand (the first one).
2917
2918 In the forms with an 8-bit immediate operand and another longer
2919 source operand, the immediate operand is considered to be signed,
2920 and is sign-extended to the length of the other source operand.
2921 In these cases, the \c{BYTE} qualifier is necessary to force
2922 NASM to generate this form of the instruction.
2923
2924 Unsigned integer multiplication is performed by the \c{MUL}
2925 instruction: see \k{insMUL}.
2926
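The multi-operand forms described above might be written as follows
(a sketch only; the registers and constants are arbitrary):

\c         imul    eax,ebx         ; EAX = EAX * EBX
\c         imul    eax,ebx,10      ; EAX = EBX * 10
\c         imul    eax,10          ; shorthand for IMUL EAX,EAX,10
\c         imul    eax,byte 10     ; as above, forcing the imm8 encoding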
2927
2928 \S{insIN} \i\c{IN}: Input from I/O Port
2929
2930 \c IN AL,imm8                    ; E4 ib                [8086]
2931 \c IN AX,imm8                    ; o16 E5 ib            [8086]
2932 \c IN EAX,imm8                   ; o32 E5 ib            [386]
2933 \c IN AL,DX                      ; EC                   [8086]
2934 \c IN AX,DX                      ; o16 ED               [8086]
2935 \c IN EAX,DX                     ; o32 ED               [386]
2936
2937 \c{IN} reads a byte, word or doubleword from the specified I/O port,
2938 and stores it in the given destination register. The port number may
2939 be specified as an immediate value if it is between 0 and 255, and
2940 otherwise must be stored in \c{DX}. See also \c{OUT} (\k{insOUT}).
2941
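A brief sketch of the two ways of giving the port number; the port
numbers used here are purely illustrative:

\c         in      al,0x60         ; port number fits in 8 bits
\c         mov     dx,0x3f8        ; port number above 255 ...
\c         in      al,dx           ; ... so it must be passed in DX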
2942
2943 \S{insINC} \i\c{INC}: Increment Integer
2944
2945 \c INC reg16                     ; o16 40+r             [8086]
2946 \c INC reg32                     ; o32 40+r             [386]
2947 \c INC r/m8                      ; FE /0                [8086]
2948 \c INC r/m16                     ; o16 FF /0            [8086]
2949 \c INC r/m32                     ; o32 FF /0            [386]
2950
2951 \c{INC} adds 1 to its operand. It does \e{not} affect the carry
2952 flag: to affect the carry flag, use \c{ADD something,1} (see
2953 \k{insADD}). \c{INC} affects all the other flags according to the result.
2954
2955 This instruction can be used with a \c{LOCK} prefix to allow atomic execution.
2956
2957 See also \c{DEC} (\k{insDEC}).
2958
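As an illustration, assuming a hypothetical doubleword variable
\c{counter} in memory:

\c         lock inc dword [counter]  ; atomic increment, CF unchanged
\c         add     dword [counter],1 ; use ADD if the carry is needed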
2959
2960 \S{insINSB} \i\c{INSB}, \i\c{INSW}, \i\c{INSD}: Input String from I/O Port
2961
2962 \c INSB                          ; 6C                   [186]
2963 \c INSW                          ; o16 6D               [186]
2964 \c INSD                          ; o32 6D               [386]
2965
2966 \c{INSB} inputs a byte from the I/O port specified in \c{DX} and
2967 stores it at \c{[ES:DI]} or \c{[ES:EDI]}. It then increments or
2968 decrements (depending on the direction flag: increments if the flag
2969 is clear, decrements if it is set) \c{DI} or \c{EDI}.
2970
2971 The register used is \c{DI} if the address size is 16 bits, and
2972 \c{EDI} if it is 32 bits. If you need to use an address size not
2973 equal to the current \c{BITS} setting, you can use an explicit
2974 \i\c{a16} or \i\c{a32} prefix.
2975
2976 Segment override prefixes have no effect for this instruction: the
2977 use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be
2978 overridden.
2979
2980 \c{INSW} and \c{INSD} work in the same way, but they input a word or
2981 a doubleword instead of a byte, and increment or decrement the
2982 addressing register by 2 or 4 instead of 1.
2983
2984 The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
2985 \c{ECX} - again, the address size chooses which) times.
2986
2987 See also \c{OUTSB}, \c{OUTSW} and \c{OUTSD} (\k{insOUTSB}).
2988
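A minimal sketch of a repeated input, assuming 32-bit code, a
hypothetical label \c{buffer} reachable through \c{ES}, and an
arbitrary port number:

\c         mov     dx,0x1f0        ; hypothetical I/O port
\c         mov     edi,buffer      ; destination is [ES:EDI]
\c         mov     ecx,256         ; number of words to read
\c         cld                     ; clear direction flag: EDI counts upwards
\c         rep insw                ; read 256 words, adding 2 to EDI each time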
2989
2990 \S{insINT} \i\c{INT}: Software Interrupt
2991
2992 \c INT imm8                      ; CD ib                [8086]
2993
2994 \c{INT} causes a software interrupt through a specified vector
2995 number from 0 to 255.
2996
2997 The code generated by the \c{INT} instruction is always two bytes
2998 long: although there are short forms for some \c{INT} instructions,
2999 NASM does not generate them when it sees the \c{INT} mnemonic. In
3000 order to generate single-byte breakpoint instructions, use the
3001 \c{INT3} or \c{INT1} instructions (see \k{insINT1}) instead.
3002
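The difference in generated code can be seen by comparing the two
spellings of the breakpoint interrupt:

\c         int     3               ; CD 03 (two bytes)
\c         int3                    ; CC    (one byte)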
3003
3004 \S{insINT1} \i\c{INT3}, \i\c{INT1}, \i\c{ICEBP}, \i\c{INT01}: Breakpoints
3005
3006 \c INT1                          ; F1                   [P6]
3007 \c ICEBP                         ; F1                   [P6]
3008 \c INT01                         ; F1                   [P6]
3009
3010 \c INT3                          ; CC                   [8086]
3011 \c INT03                         ; CC                   [8086]
3012
3013 \c{INT1} and \c{INT3} are short one-byte forms of the instructions
3014 \c{INT 1} and \c{INT 3} (see \k{insINT}). They perform a similar
3015 function to their longer counterparts, but take up less code space.
3016 They are used as breakpoints by debuggers.
3017
3018 \b \c{INT1}, and its alternative synonyms \c{INT01} and \c{ICEBP}, is
3019 an instruction used by in-circuit emulators (ICEs). It is present,
3020 though not documented, on some processors down to the 286, but is
3021 only documented for the Pentium Pro. \c{INT3} is the instruction
3022 normally used as a breakpoint by debuggers.
3023
3024 \b \c{INT3}, and its synonym \c{INT03}, is not precisely equivalent to
3025 \c{INT 3}: the short form, since it is designed to be used as a
3026 breakpoint, bypasses the normal \c{IOPL} checks in virtual-8086 mode,
3027 and also does not go through interrupt redirection.
3028
3029
3030 \S{insINTO} \i\c{INTO}: Interrupt if Overflow
3031
3032 \c INTO                          ; CE                   [8086]
3033
3034 \c{INTO} performs an \c{INT 4} software interrupt (see \k{insINT})
3035 if and only if the overflow flag is set.
3036
3037
3038 \S{insINVD} \i\c{INVD}: Invalidate Internal Caches
3039
3040 \c INVD                          ; 0F 08                [486]
3041
3042 \c{INVD} invalidates and empties the processor's internal caches,
3043 and causes the processor to instruct external caches to do the same.
3044 It does not write the contents of the caches back to memory first:
3045 any modified data held in the caches will be lost. To write the data
3046 back first, use \c{WBINVD} (\k{insWBINVD}).
3047
3048
3049 \S{insINVLPG} \i\c{INVLPG}: Invalidate TLB Entry
3050
3051 \c INVLPG mem                    ; 0F 01 /7             [486]
3052
\c{INVLPG} invalidates the translation lookaside buffer (TLB) entry
associated with the supplied memory address.
3055
3056
3057 \S{insIRET} \i\c{IRET}, \i\c{IRETW}, \i\c{IRETD}: Return from Interrupt
3058
3059 \c IRET                          ; CF                   [8086]
3060 \c IRETW                         ; o16 CF               [8086]
3061 \c IRETD                         ; o32 CF               [386]
3062
3063 \c{IRET} returns from an interrupt (hardware or software) by means
3064 of popping \c{IP} (or \c{EIP}), \c{CS} and the flags off the stack
3065 and then continuing execution from the new \c{CS:IP}.
3066
3067 \c{IRETW} pops \c{IP}, \c{CS} and the flags as 2 bytes each, taking
3068 6 bytes off the stack in total. \c{IRETD} pops \c{EIP} as 4 bytes,
3069 pops a further 4 bytes of which the top two are discarded and the
3070 bottom two go into \c{CS}, and pops the flags as 4 bytes as well,
3071 taking 12 bytes off the stack.
3072
3073 \c{IRET} is a shorthand for either \c{IRETW} or \c{IRETD}, depending
3074 on the default \c{BITS} setting at the time.
3075
3076
3077 \S{insJcc} \i\c{Jcc}: Conditional Branch
3078
3079 \c Jcc imm                       ; 70+cc rb             [8086]
3080 \c Jcc NEAR imm                  ; 0F 80+cc rw/rd       [386]
3081
3082 The \i{conditional jump} instructions execute a near (same segment)
3083 jump if and only if their conditions are satisfied. For example,
3084 \c{JNZ} jumps only if the zero flag is not set.
3085
3086 The ordinary form of the instructions has only a 128-byte range; the
3087 \c{NEAR} form is a 386 extension to the instruction set, and can
3088 span the full size of a segment. NASM will not override your choice
3089 of jump instruction: if you want \c{Jcc NEAR}, you have to use the
3090 \c{NEAR} keyword.
3091
3092 The \c{SHORT} keyword is allowed on the first form of the
3093 instruction, for clarity, but is not necessary.
3094
3095 For details of the condition codes, see \k{iref-cc}.
3096
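For instance, with \c{nearby} and \c{faraway} as hypothetical labels,
the two forms of \c{JNE} would be coded as:

\c         cmp     eax,ebx
\c         jne     nearby          ; 75 rb: short form
\c         jne     near faraway    ; 0F 85 rw/rd: anywhere in the segment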
3097
3098 \S{insJCXZ} \i\c{JCXZ}, \i\c{JECXZ}: Jump if CX/ECX Zero
3099
3100 \c JCXZ imm                      ; a16 E3 rb            [8086]
3101 \c JECXZ imm                     ; a32 E3 rb            [386]
3102
\c{JCXZ} performs a short jump (with maximum range 128 bytes) if and
only if the \c{CX} register contains 0. \c{JECXZ} does the
same thing, but with \c{ECX}.
3106
3107
3108 \S{insJMP} \i\c{JMP}: Jump
3109
3110 \c JMP imm                       ; E9 rw/rd             [8086]
3111 \c JMP SHORT imm                 ; EB rb                [8086]
3112 \c JMP imm:imm16                 ; o16 EA iw iw         [8086]
3113 \c JMP imm:imm32                 ; o32 EA id iw         [386]
3114 \c JMP FAR mem                   ; o16 FF /5            [8086]
3115 \c JMP FAR mem32                 ; o32 FF /5            [386]
3116 \c JMP r/m16                     ; o16 FF /4            [8086]
3117 \c JMP r/m32                     ; o32 FF /4            [386]
3118
3119 \c{JMP} jumps to a given address. The address may be specified as an
3120 absolute segment and offset, or as a relative jump within the
3121 current segment.
3122
3123 \c{JMP SHORT imm} has a maximum range of 128 bytes, since the
3124 displacement is specified as only 8 bits, but takes up less code
3125 space. NASM does not choose when to generate \c{JMP SHORT} for you:
3126 you must explicitly code \c{SHORT} every time you want a short jump.
3127
You can choose between the two immediate \i{far jump} forms (\c{JMP
imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: \c{JMP
WORD 0x1234:0x5678} or \c{JMP DWORD 0x1234:0x56789abc}.
3131
3132 The \c{JMP FAR mem} forms execute a far jump by loading the
3133 destination address out of memory. The address loaded consists of 16
3134 or 32 bits of offset (depending on the operand size), and 16 bits of
3135 segment. The operand size may be overridden using \c{JMP WORD FAR
3136 mem} or \c{JMP DWORD FAR mem}.
3137
3138 The \c{JMP r/m} forms execute a \i{near jump} (within the same
3139 segment), loading the destination address out of memory or out of a
3140 register. The keyword \c{NEAR} may be specified, for clarity, in
3141 these forms, but is not necessary. Again, operand size can be
3142 overridden using \c{JMP WORD mem} or \c{JMP DWORD mem}.
3143
3144 As a convenience, NASM does not require you to jump to a far symbol
3145 by coding the cumbersome \c{JMP SEG routine:routine}, but instead
3146 allows the easier synonym \c{JMP FAR routine}.
3147
The \c{JMP r/m} forms given above are near jumps; NASM will accept
the \c{NEAR} keyword (e.g. \c{JMP NEAR [address]}), even though it
is not strictly necessary.
3151
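The forms above might be summarised by the following sketch, which
assumes 32-bit code; \c{done}, \c{vector} and \c{table} are
hypothetical labels:

\c         jmp     short done      ; EB rb: 8-bit displacement
\c         jmp     done            ; E9 rw/rd: full-size displacement
\c         jmp     0x1234:0x5678   ; far jump to an immediate segment:offset
\c         jmp     far [vector]    ; far jump through a pointer in memory
\c         jmp     [table+ebx*4]   ; near jump through a table entry
\c         jmp     eax             ; near jump to the address in EAX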
3152
3153 \S{insLAHF} \i\c{LAHF}: Load AH from Flags
3154
3155 \c LAHF                          ; 9F                   [8086]
3156
3157 \c{LAHF} sets the \c{AH} register according to the contents of the
3158 low byte of the flags word.
3159
3160 The operation of \c{LAHF} is:
3161
3162 \c  AH <-- SF:ZF:0:AF:0:PF:1:CF
3163
3164 See also \c{SAHF} (\k{insSAHF}).
3165
3166
3167 \S{insLAR} \i\c{LAR}: Load Access Rights
3168
3169 \c LAR reg16,r/m16               ; o16 0F 02 /r         [286,PRIV]
3170 \c LAR reg32,r/m32               ; o32 0F 02 /r         [286,PRIV]
3171
3172 \c{LAR} takes the segment selector specified by its source (second)
3173 operand, finds the corresponding segment descriptor in the GDT or
3174 LDT, and loads the access-rights byte of the descriptor into its
3175 destination (first) operand.
3176
3177
3178 \S{insLDMXCSR} \i\c{LDMXCSR}: Load Streaming SIMD Extension
3179  Control/Status
3180
3181 \c LDMXCSR mem32                 ; 0F AE /2        [KATMAI,SSE]
3182
\c{LDMXCSR} loads 32 bits of data from the specified memory location
into the \c{MXCSR} control/status register. \c{MXCSR} is used to
enable masked/unmasked exception handling, to set rounding modes,
to set flush-to-zero mode, and to view exception status flags.
3187
3188 For details of the \c{MXCSR} register, see the Intel processor docs.
3189
See also \c{STMXCSR} (\k{insSTMXCSR}).
3191
3192
3193 \S{insLDS} \i\c{LDS}, \i\c{LES}, \i\c{LFS}, \i\c{LGS}, \i\c{LSS}: Load Far Pointer
3194
3195 \c LDS reg16,mem                 ; o16 C5 /r            [8086]
3196 \c LDS reg32,mem                 ; o32 C5 /r            [386]
3197
3198 \c LES reg16,mem                 ; o16 C4 /r            [8086]
3199 \c LES reg32,mem                 ; o32 C4 /r            [386]
3200
3201 \c LFS reg16,mem                 ; o16 0F B4 /r         [386]
3202 \c LFS reg32,mem                 ; o32 0F B4 /r         [386]
3203
3204 \c LGS reg16,mem                 ; o16 0F B5 /r         [386]
3205 \c LGS reg32,mem                 ; o32 0F B5 /r         [386]
3206
3207 \c LSS reg16,mem                 ; o16 0F B2 /r         [386]
3208 \c LSS reg32,mem                 ; o32 0F B2 /r         [386]
3209
3210 These instructions load an entire far pointer (16 or 32 bits of
3211 offset, plus 16 bits of segment) out of memory in one go. \c{LDS},
3212 for example, loads 16 or 32 bits from the given memory address into
3213 the given register (depending on the size of the register), then
3214 loads the \e{next} 16 bits from memory into \c{DS}. \c{LES},
3215 \c{LFS}, \c{LGS} and \c{LSS} work in the same way but use the other
3216 segment registers.
3217
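As a sketch in 16-bit code, assuming a hypothetical label \c{farptr}
that holds an offset followed by a segment:

\c farptr: dw      0x0010          ; offset, stored first
\c         dw      0xb800          ; segment, stored next
\c
\c         les     di,[farptr]     ; DI = 0x0010, ES = 0xB800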
3218
3219 \S{insLEA} \i\c{LEA}: Load Effective Address
3220
3221 \c LEA reg16,mem                 ; o16 8D /r            [8086]
3222 \c LEA reg32,mem                 ; o32 8D /r            [386]
3223
3224 \c{LEA}, despite its syntax, does not access memory. It calculates
3225 the effective address specified by its second operand as if it were
3226 going to load or store data from it, but instead it stores the
3227 calculated address into the register specified by its first operand.
3228 This can be used to perform quite complex calculations (e.g. \c{LEA
3229 EAX,[EBX+ECX*4+100]}) in one instruction.
3230
3231 \c{LEA}, despite being a purely arithmetic instruction which
3232 accesses no memory, still requires square brackets around its second
3233 operand, as if it were a memory reference.
3234
The size of the calculation is the current \e{address} size, and the
size that the result is stored as is the current \e{operand} size.
If the two sizes differ, a 32-bit address is truncated to its low
16 bits before being stored, and a 16-bit address is zero-extended
to 32 bits.
3240
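Two illustrative uses of \c{LEA} as an arithmetic instruction (the
registers are arbitrary):

\c         lea     eax,[ebx+ebx*4]     ; EAX = EBX * 5
\c         lea     eax,[ebx+ecx*4+100] ; EAX = EBX + ECX*4 + 100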
3241
3242 \S{insLEAVE} \i\c{LEAVE}: Destroy Stack Frame
3243
3244 \c LEAVE                         ; C9                   [186]
3245
3246 \c{LEAVE} destroys a stack frame of the form created by the
3247 \c{ENTER} instruction (see \k{insENTER}). It is functionally
3248 equivalent to \c{MOV ESP,EBP} followed by \c{POP EBP} (or \c{MOV
3249 SP,BP} followed by \c{POP BP} in 16-bit mode).
3250
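A typical use, sketched here for 32-bit code with a hypothetical
routine \c{func}, pairs \c{LEAVE} with a hand-written prologue:

\c func:   push    ebp
\c         mov     ebp,esp
\c         sub     esp,8           ; room for two doubleword locals
\c         ; ... body of the routine ...
\c         leave                   ; MOV ESP,EBP then POP EBP
\c         ret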
3251
3252 \S{insLFENCE} \i\c{LFENCE}: Load Fence
3253
3254 \c LFENCE                        ; 0F AE /5        [WILLAMETTE,SSE2]
3255
3256 \c{LFENCE} performs a serialising operation on all loads from memory
3257 that were issued before the \c{LFENCE} instruction. This guarantees that
3258 all memory reads before the \c{LFENCE} instruction are visible before any
3259 reads after the \c{LFENCE} instruction.
3260
\c{LFENCE} is ordered with respect to other \c{LFENCE} instructions, \c{MFENCE},
any memory read and any other serialising instruction (such as \c{CPUID}).
3263
3264 Weakly ordered memory types can be used to achieve higher processor
3265 performance through such techniques as out-of-order issue and
3266 speculative reads. The degree to which a consumer of data recognizes
3267 or knows that the data is weakly ordered varies among applications
3268 and may be unknown to the producer of this data. The \c{LFENCE}
3269 instruction provides a performance-efficient way of ensuring load
3270 ordering between routines that produce weakly-ordered results and
3271 routines that consume that data.
3272
3273 \c{LFENCE} uses the following ModRM encoding:
3274
3275 \c           Mod (7:6)        = 11B
3276 \c           Reg/Opcode (5:3) = 101B
3277 \c           R/M (2:0)        = 000B
3278
3279 All other ModRM encodings are defined to be reserved, and use
3280 of these encodings risks incompatibility with future processors.
3281
3282 See also \c{SFENCE} (\k{insSFENCE}) and \c{MFENCE} (\k{insMFENCE}).
3283
3284
3285 \S{insLGDT} \i\c{LGDT}, \i\c{LIDT}, \i\c{LLDT}: Load Descriptor Tables
3286
3287 \c LGDT mem                      ; 0F 01 /2             [286,PRIV]
3288 \c LIDT mem                      ; 0F 01 /3             [286,PRIV]
3289 \c LLDT r/m16                    ; 0F 00 /2             [286,PRIV]
3290
\c{LGDT} and \c{LIDT} both take a 6-byte memory area as an operand:
they load a 16-bit size limit (stored in the first two bytes) and a
32-bit linear address (stored in the following four bytes) from that
area into the \c{GDTR} (global descriptor table register) or \c{IDTR}
(interrupt descriptor table register). These are the only instructions
which directly use \e{linear} addresses, rather than segment/offset pairs.
3297
3298 \c{LLDT} takes a segment selector as an operand. The processor looks
3299 up that selector in the GDT and stores the limit and base address
3300 given there into the \c{LDTR} (local descriptor table register).
3301
3302 See also \c{SGDT}, \c{SIDT} and \c{SLDT} (\k{insSGDT}).
3303
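The 6-byte operand of \c{LGDT} might be laid out as in the following
sketch. The labels are hypothetical, the two descriptors are the
conventional flat 4GB code and data segments, and the code assumes
that the offset of \c{gdt} is also its linear address (that is, a
segment base of zero):

\c gdt:    dq      0                   ; null descriptor
\c         dq      0x00cf9a000000ffff  ; flat 32-bit code segment
\c         dq      0x00cf92000000ffff  ; flat 32-bit data segment
\c gdt_end:
\c
\c gdtr:   dw      gdt_end - gdt - 1   ; 16-bit limit (size minus one)
\c         dd      gdt                 ; 32-bit linear base address
\c
\c         lgdt    [gdtr]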
3304
\S{insLMSW} \i\c{LMSW}: Load Machine Status Word
3306
3307 \c LMSW r/m16                    ; 0F 01 /6             [286,PRIV]
3308
3309 \c{LMSW} loads the bottom four bits of the source operand into the
3310 bottom four bits of the \c{CR0} control register (or the Machine
3311 Status Word, on 286 processors). See also \c{SMSW} (\k{insSMSW}).
3312
3313
3314 \S{insLOADALL} \i\c{LOADALL}, \i\c{LOADALL286}: Load Processor State
3315
3316 \c LOADALL                       ; 0F 07                [386,UNDOC]
3317 \c LOADALL286                    ; 0F 05                [286,UNDOC]
3318
3319 This instruction, in its two different-opcode forms, is apparently
3320 supported on most 286 processors, some 386 and possibly some 486.
3321 The opcode differs between the 286 and the 386.
3322
3323 The function of the instruction is to load all information relating
3324 to the state of the processor out of a block of memory: on the 286,
3325 this block is located implicitly at absolute address \c{0x800}, and
3326 on the 386 and 486 it is at \c{[ES:EDI]}.
3327
3328
3329 \S{insLODSB} \i\c{LODSB}, \i\c{LODSW}, \i\c{LODSD}: Load from String
3330
3331 \c LODSB                         ; AC                   [8086]
3332 \c LODSW                         ; o16 AD               [8086]
3333 \c LODSD                         ; o32 AD               [386]
3334
3335 \c{LODSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} into \c{AL}.
3336 It then increments or decrements (depending on the direction flag:
3337 increments if the flag is clear, decrements if it is set) \c{SI} or
3338 \c{ESI}.
3339
3340 The register used is \c{SI} if the address size is 16 bits, and
3341 \c{ESI} if it is 32 bits. If you need to use an address size not
3342 equal to the current \c{BITS} setting, you can use an explicit
3343 \i\c{a16} or \i\c{a32} prefix.
3344
3345 The segment register used to load from \c{[SI]} or \c{[ESI]} can be
3346 overridden by using a segment register name as a prefix (for
3347 example, \c{ES LODSB}).
3348
3349 \c{LODSW} and \c{LODSD} work in the same way, but they load a
3350 word or a doubleword instead of a byte, and increment or decrement
3351 the addressing registers by 2 or 4 instead of 1.
3352
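As a sketch, a loop which reads successive bytes of a zero-terminated
string, assuming 32-bit code and a hypothetical label \c{string}:

\c         mov     esi,string
\c         cld                     ; clear direction flag: ESI counts upwards
\c next:   lodsb                   ; AL = [DS:ESI], then ESI = ESI + 1
\c         test    al,al
\c         jnz     next            ; stop once the zero byte has been read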
3353
3354 \S{insLOOP} \i\c{LOOP}, \i\c{LOOPE}, \i\c{LOOPZ}, \i\c{LOOPNE}, \i\c{LOOPNZ}: Loop with Counter
3355
3356 \c LOOP imm                      ; E2 rb                [8086]
3357 \c LOOP imm,CX                   ; a16 E2 rb            [8086]
3358 \c LOOP imm,ECX                  ; a32 E2 rb            [386]
3359
3360 \c LOOPE imm                     ; E1 rb                [8086]
3361 \c LOOPE imm,CX                  ; a16 E1 rb            [8086]
3362 \c LOOPE imm,ECX                 ; a32 E1 rb            [386]
3363 \c LOOPZ imm                     ; E1 rb                [8086]
3364 \c LOOPZ imm,CX                  ; a16 E1 rb            [8086]
3365 \c LOOPZ imm,ECX                 ; a32 E1 rb            [386]
3366
3367 \c LOOPNE imm                    ; E0 rb                [8086]
3368 \c LOOPNE imm,CX                 ; a16 E0 rb            [8086]
3369 \c LOOPNE imm,ECX                ; a32 E0 rb            [386]
3370 \c LOOPNZ imm                    ; E0 rb                [8086]
3371 \c LOOPNZ imm,CX                 ; a16 E0 rb            [8086]
3372 \c LOOPNZ imm,ECX                ; a32 E0 rb            [386]
3373
3374 \c{LOOP} decrements its counter register (either \c{CX} or \c{ECX} -
3375 if one is not specified explicitly, the \c{BITS} setting dictates
3376 which is used) by one, and if the counter does not become zero as a
3377 result of this operation, it jumps to the given label. The jump has
3378 a range of 128 bytes.
3379
3380 \c{LOOPE} (or its synonym \c{LOOPZ}) adds the additional condition
3381 that it only jumps if the counter is nonzero \e{and} the zero flag
3382 is set. Similarly, \c{LOOPNE} (and \c{LOOPNZ}) jumps only if the
3383 counter is nonzero and the zero flag is clear.
3384
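A minimal counted loop, assuming \c{BITS 32} so that \c{ECX} is the
default counter (writing \c{LOOP again,CX} would select \c{CX}
instead); \c{again} is a hypothetical label:

\c         xor     eax,eax
\c         mov     ecx,10
\c again:  add     eax,ecx         ; body: accumulate the counter value
\c         loop    again           ; ECX = ECX - 1; jump while ECX is nonzero

This runs the body ten times and leaves 55 in \c{EAX}.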
3385
3386 \S{insLSL} \i\c{LSL}: Load Segment Limit
3387
3388 \c LSL reg16,r/m16               ; o16 0F 03 /r         [286,PRIV]
3389 \c LSL reg32,r/m32               ; o32 0F 03 /r         [286,PRIV]
3390
3391 \c{LSL} is given a segment selector in its source (second) operand;
3392 it computes the segment limit value by loading the segment limit
3393 field from the associated segment descriptor in the \c{GDT} or \c{LDT}.
3394 (This involves shifting left by 12 bits if the segment limit is
3395 page-granular, and not if it is byte-granular; so you end up with a
3396 byte limit in either case.) The segment limit obtained is then
3397 loaded into the destination (first) operand.
3398
3399
3400 \S{insLTR} \i\c{LTR}: Load Task Register
3401
3402 \c LTR r/m16                     ; 0F 00 /3             [286,PRIV]
3403
3404 \c{LTR} looks up the segment base and limit in the GDT or LDT
3405 descriptor specified by the segment selector given as its operand,
3406 and loads them into the Task Register.
3407
3408
3409 \S{insMASKMOVDQU} \i\c{MASKMOVDQU}: Byte Mask Write
3410
3411 \c MASKMOVDQU xmm1,xmm2          ; 66 0F F7 /r     [WILLAMETTE,SSE2]
3412
\c{MASKMOVDQU} stores data from xmm1 to the location specified by
\c{DS:(E)DI}. The size of the store depends on the address-size
attribute. The most significant bit in each byte of the mask
register xmm2 is used to selectively write the data (0 = no write,
1 = write) on a per-byte basis.
3418
3419
3420 \S{insMASKMOVQ} \i\c{MASKMOVQ}: Byte Mask Write
3421
3422 \c MASKMOVQ mm1,mm2              ; 0F F7 /r        [KATMAI,MMX]
3423
\c{MASKMOVQ} stores data from mm1 to the location specified by
\c{DS:(E)DI}. The size of the store depends on the address-size
attribute. The most significant bit in each byte of the mask
register mm2 is used to selectively write the data (0 = no write,
1 = write) on a per-byte basis.
3429
3430
3431 \S{insMAXPD} \i\c{MAXPD}: Return Packed Double-Precision FP Maximum
3432
3433 \c MAXPD xmm1,xmm2/m128          ; 66 0F 5F /r     [WILLAMETTE,SSE2]
3434
3435 \c{MAXPD} performs a SIMD compare of the packed double-precision
3436 FP numbers from xmm1 and xmm2/mem, and stores the maximum values
3437 of each pair of values in xmm1. If the values being compared are
3438 both zeroes, source2 (xmm2/m128) would be returned. If source2
3439 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
3440 destination (i.e., a QNaN version of the SNaN is not returned).
3441
3442
3443 \S{insMAXPS} \i\c{MAXPS}: Return Packed Single-Precision FP Maximum
3444
3445 \c MAXPS xmm1,xmm2/m128          ; 0F 5F /r        [KATMAI,SSE]
3446
3447 \c{MAXPS} performs a SIMD compare of the packed single-precision
3448 FP numbers from xmm1 and xmm2/mem, and stores the maximum values
3449 of each pair of values in xmm1. If the values being compared are
3450 both zeroes, source2 (xmm2/m128) would be returned. If source2
3451 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
3452 destination (i.e., a QNaN version of the SNaN is not returned).
3453
3454
3455 \S{insMAXSD} \i\c{MAXSD}: Return Scalar Double-Precision FP Maximum
3456
3457 \c MAXSD xmm1,xmm2/m64           ; F2 0F 5F /r     [WILLAMETTE,SSE2]
3458
3459 \c{MAXSD} compares the low-order double-precision FP numbers from
3460 xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
3461 values being compared are both zeroes, source2 (xmm2/m64) would
3462 be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
3463 forwarded unchanged to the destination (i.e., a QNaN version of
3464 the SNaN is not returned). The high quadword of the destination
3465 is left unchanged.
3466
3467
3468 \S{insMAXSS} \i\c{MAXSS}: Return Scalar Single-Precision FP Maximum
3469
3470 \c MAXSS xmm1,xmm2/m32           ; F3 0F 5F /r     [KATMAI,SSE]
3471
3472 \c{MAXSS} compares the low-order single-precision FP numbers from
3473 xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
3474 values being compared are both zeroes, source2 (xmm2/m32) would
3475 be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
3476 forwarded unchanged to the destination (i.e., a QNaN version of
3477 the SNaN is not returned). The high three doublewords of the
3478 destination are left unchanged.
3479
3480
3481 \S{insMFENCE} \i\c{MFENCE}: Memory Fence
3482
3483 \c MFENCE                        ; 0F AE /6        [WILLAMETTE,SSE2]
3484
3485 \c{MFENCE} performs a serialising operation on all loads from memory
3486 and writes to memory that were issued before the \c{MFENCE} instruction.
3487 This guarantees that all memory reads and writes before the \c{MFENCE}
3488 instruction are completed before any reads and writes after the
3489 \c{MFENCE} instruction.
3490
\c{MFENCE} is ordered with respect to other \c{MFENCE} instructions,
\c{LFENCE}, \c{SFENCE}, any memory read and any other serialising
instruction (such as \c{CPUID}).
3494
3495 Weakly ordered memory types can be used to achieve higher processor
3496 performance through such techniques as out-of-order issue, speculative
3497 reads, write-combining, and write-collapsing. The degree to which a
3498 consumer of data recognizes or knows that the data is weakly ordered
3499 varies among applications and may be unknown to the producer of this
3500 data. The \c{MFENCE} instruction provides a performance-efficient way
3501 of ensuring load and store ordering between routines that produce
3502 weakly-ordered results and routines that consume that data.
3503
3504 \c{MFENCE} uses the following ModRM encoding:
3505
3506 \c           Mod (7:6)        = 11B
3507 \c           Reg/Opcode (5:3) = 110B
3508 \c           R/M (2:0)        = 000B
3509
3510 All other ModRM encodings are defined to be reserved, and use
3511 of these encodings risks incompatibility with future processors.
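
As an illustration of how the three fields above combine, the following C sketch packs them into a single ModRM byte. The helper name \c{modrm_byte} is hypothetical and is not part of NASM; it only models the bit layout. With the \c{MFENCE} field values it yields \c{F0h}, so the complete instruction is the byte sequence \c{0F AE F0}.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack the Mod (bits 7:6), Reg/Opcode (bits 5:3)
   and R/M (bits 2:0) fields into one ModRM byte. */
uint8_t modrm_byte(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}
```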
3512
3513 See also \c{LFENCE} (\k{insLFENCE}) and \c{SFENCE} (\k{insSFENCE}).
3514
3515
3516 \S{insMINPD} \i\c{MINPD}: Return Packed Double-Precision FP Minimum
3517
3518 \c MINPD xmm1,xmm2/m128          ; 66 0F 5D /r     [WILLAMETTE,SSE2]
3519
3520 \c{MINPD} performs a SIMD compare of the packed double-precision
3521 FP numbers from xmm1 and xmm2/mem, and stores the minimum values
3522 of each pair of values in xmm1. If the values being compared are
3523 both zeroes, source2 (xmm2/m128) would be returned. If source2
3524 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
3525 destination (i.e., a QNaN version of the SNaN is not returned).
3526
3527
3528 \S{insMINPS} \i\c{MINPS}: Return Packed Single-Precision FP Minimum
3529
3530 \c MINPS xmm1,xmm2/m128          ; 0F 5D /r        [KATMAI,SSE]
3531
3532 \c{MINPS} performs a SIMD compare of the packed single-precision
3533 FP numbers from xmm1 and xmm2/mem, and stores the minimum values
3534 of each pair of values in xmm1. If the values being compared are
3535 both zeroes, source2 (xmm2/m128) would be returned. If source2
3536 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
3537 destination (i.e., a QNaN version of the SNaN is not returned).
3538
3539
3540 \S{insMINSD} \i\c{MINSD}: Return Scalar Double-Precision FP Minimum
3541
3542 \c MINSD xmm1,xmm2/m64           ; F2 0F 5D /r     [WILLAMETTE,SSE2]
3543
3544 \c{MINSD} compares the low-order double-precision FP numbers from
3545 xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
3546 values being compared are both zeroes, source2 (xmm2/m64) would
3547 be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
3548 forwarded unchanged to the destination (i.e., a QNaN version of
3549 the SNaN is not returned). The high quadword of the destination
3550 is left unchanged.
3551
3552
3553 \S{insMINSS} \i\c{MINSS}: Return Scalar Single-Precision FP Minimum
3554
3555 \c MINSS xmm1,xmm2/m32           ; F3 0F 5D /r     [KATMAI,SSE]
3556
3557 \c{MINSS} compares the low-order single-precision FP numbers from
3558 xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
3559 values being compared are both zeroes, source2 (xmm2/m32) would
3560 be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
3561 forwarded unchanged to the destination (i.e., a QNaN version of
3562 the SNaN is not returned). The high three doublewords of the
3563 destination are left unchanged.
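
The zero and NaN rules described for the \c{MIN} and \c{MAX} instructions above can be modelled with a single ordered comparison. The following C sketch covers the low single-precision element only. The function names are hypothetical, the code assumes IEEE 754 arithmetic in the C implementation, and it does not model exception flags or the difference between an SNaN and a QNaN.

```c
#include <assert.h>

static_assert(sizeof(float) == 4, "model assumes a 32-bit single-precision float");

/* dst models xmm1 (source1), src models xmm2/m32 (source2).
   The comparison is false when either input is a NaN and when both
   inputs are zeroes of either sign, so src is returned in those cases. */
float minss_model(float dst, float src)
{
    return (dst < src) ? dst : src;
}

float maxss_model(float dst, float src)
{
    return (dst > src) ? dst : src;
}
```

Because source2 is returned when both inputs are zeroes, the sign of the resulting zero depends on operand order: \c{minss_model(0.0f, -0.0f)} gives a negative zero, while \c{minss_model(-0.0f, 0.0f)} gives a positive one.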
3564
3565
3566 \S{insMOV} \i\c{MOV}: Move Data
3567
3568 \c MOV r/m8,reg8                 ; 88 /r                [8086]
3569 \c MOV r/m16,reg16               ; o16 89 /r            [8086]
3570 \c MOV r/m32,reg32               ; o32 89 /r            [386]
3571 \c MOV reg8,r/m8                 ; 8A /r                [8086]
3572 \c MOV reg16,r/m16               ; o16 8B /r            [8086]
3573 \c MOV reg32,r/m32               ; o32 8B /r            [386]
3574
3575 \c MOV reg8,imm8                 ; B0+r ib              [8086]
3576 \c MOV reg16,imm16               ; o16 B8+r iw          [8086]
3577 \c MOV reg32,imm32               ; o32 B8+r id          [386]
3578 \c MOV r/m8,imm8                 ; C6 /0 ib             [8086]
3579 \c MOV r/m16,imm16               ; o16 C7 /0 iw         [8086]
3580 \c MOV r/m32,imm32               ; o32 C7 /0 id         [386]
3581
3582 \c MOV AL,memoffs8               ; A0 ow/od             [8086]
3583 \c MOV AX,memoffs16              ; o16 A1 ow/od         [8086]
3584 \c MOV EAX,memoffs32             ; o32 A1 ow/od         [386]
3585 \c MOV memoffs8,AL               ; A2 ow/od             [8086]
3586 \c MOV memoffs16,AX              ; o16 A3 ow/od         [8086]
3587 \c MOV memoffs32,EAX             ; o32 A3 ow/od         [386]
3588
3589 \c MOV r/m16,segreg              ; o16 8C /r            [8086]
3590 \c MOV r/m32,segreg              ; o32 8C /r            [386]
3591 \c MOV segreg,r/m16              ; o16 8E /r            [8086]
3592 \c MOV segreg,r/m32              ; o32 8E /r            [386]
3593
3594 \c MOV reg32,CR0/2/3/4           ; 0F 20 /r             [386]
3595 \c MOV reg32,DR0/1/2/3/6/7       ; 0F 21 /r             [386]
3596 \c MOV reg32,TR3/4/5/6/7         ; 0F 24 /r             [386]
3597 \c MOV CR0/2/3/4,reg32           ; 0F 22 /r             [386]
3598 \c MOV DR0/1/2/3/6/7,reg32       ; 0F 23 /r             [386]
3599 \c MOV TR3/4/5/6/7,reg32         ; 0F 26 /r             [386]
3600
3601 \c{MOV} copies the contents of its source (second) operand into its
3602 destination (first) operand.
3603
3604 In all forms of the \c{MOV} instruction, the two operands are the
3605 same size, except for moving between a segment register and an
3606 \c{r/m32} operand. These instructions are treated exactly like the
3607 corresponding 16-bit equivalent (so that, for example, \c{MOV
3608 DS,EAX} functions identically to \c{MOV DS,AX} but saves a prefix
3609 when in 32-bit mode), except that when a segment register is moved
3610 into a 32-bit destination, the top two bytes of the result are
3611 undefined.
3612
3613 \c{MOV} may not use \c{CS} as a destination.
3614
3615 \c{CR4} is supported only on the Pentium and above.
3616
3617 Test registers are supported on 386/486 processors and on some
3618 non-Intel Pentium class processors.
3619
3620
3621 \S{insMOVAPD} \i\c{MOVAPD}: Move Aligned Packed Double-Precision FP Values
3622
3623 \c MOVAPD xmm1,xmm2/mem128       ; 66 0F 28 /r     [WILLAMETTE,SSE2]
3624 \c MOVAPD xmm1/mem128,xmm2       ; 66 0F 29 /r     [WILLAMETTE,SSE2]
3625
3626 \c{MOVAPD} moves a double quadword containing 2 packed double-precision
3627 FP values from the source operand to the destination. When the source
3628 or destination operand is a memory location, it must be aligned on a
3629 16-byte boundary.
3630
3631 To move data in and out of memory locations that are not known to be on
3632 16-byte boundaries, use the \c{MOVUPD} instruction (\k{insMOVUPD}).
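
Whether an address meets the 16-byte requirement can be checked from its low four bits. The following C sketch shows one way to do that. The helper names are hypothetical, and the code assumes a flat address space in which converting a pointer to \c{uintptr_t} yields its numeric address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p lies on a 16-byte boundary,
   that is, if the low four bits of its address are zero. */
bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: round an address up to the next 16-byte boundary. */
void *align_up16(void *p)
{
    void *q = (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
    assert(is_aligned16(q));
    return q;
}
```

A buffer that passes \c{is_aligned16} may be accessed with \c{MOVAPD}; for anything else, \c{MOVUPD} is the safe choice.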
3633
3634
3635 \S{insMOVAPS} \i\c{MOVAPS}: Move Aligned Packed Single-Precision FP Values
3636
3637 \c MOVAPS xmm1,xmm2/mem128       ; 0F 28 /r        [KATMAI,SSE]
3638 \c MOVAPS xmm1/mem128,xmm2       ; 0F 29 /r        [KATMAI,SSE]
3639
3640 \c{MOVAPS} moves a double quadword containing 4 packed single-precision
3641 FP values from the source operand to the destination. When the source
3642 or destination operand is a memory location, it must be aligned on a
3643 16-byte boundary.
3644
3645 To move data in and out of memory locations that are not known to be on
3646 16-byte boundaries, use the \c{MOVUPS} instruction (\k{insMOVUPS}).
3647
3648
3649 \S{insMOVD} \i\c{MOVD}: Move Doubleword to/from MMX Register
3650
3651 \c MOVD mm,r/m32                 ; 0F 6E /r             [PENT,MMX]
3652 \c MOVD r/m32,mm                 ; 0F 7E /r             [PENT,MMX]
3653 \c MOVD xmm,r/m32                ; 66 0F 6E /r     [WILLAMETTE,SSE2]
3654 \c MOVD r/m32,xmm                ; 66 0F 7E /r     [WILLAMETTE,SSE2]
3655
3656 \c{MOVD} copies 32 bits from its source (second) operand into its
3657 destination (first) operand. When the destination is a 64-bit \c{MMX}
3658 register or a 128-bit \c{XMM} register, the input value is zero-extended
3659 to fill the destination register.
3660
3661
3662 \S{insMOVDQ2Q} \i\c{MOVDQ2Q}: Move Quadword from XMM to MMX Register
3663
3664 \c MOVDQ2Q mm,xmm                ; F2 0F D6 /r     [WILLAMETTE,SSE2]
3665
3666 \c{MOVDQ2Q} moves the low quadword from the source operand to the
3667 destination operand.
3668
3669
3670 \S{insMOVDQA} \i\c{MOVDQA}: Move Aligned Double Quadword
3671
3672 \c MOVDQA xmm1,xmm2/m128         ; 66 0F 6F /r     [WILLAMETTE,SSE2]
3673 \c MOVDQA xmm1/m128,xmm2         ; 66 0F 7F /r     [WILLAMETTE,SSE2]
3674
3675 \c{MOVDQA} moves a double quadword from the source operand to the
3676 destination operand. When the source or destination operand is a
3677 memory location, it must be aligned to a 16-byte boundary.
3678
3679 To move a double quadword to or from unaligned memory locations,
3680 use the \c{MOVDQU} instruction (\k{insMOVDQU}).
3681
3682
3683 \S{insMOVDQU} \i\c{MOVDQU}: Move Unaligned Double Quadword
3684
3685 \c MOVDQU xmm1,xmm2/m128         ; F3 0F 6F /r     [WILLAMETTE,SSE2]
3686 \c MOVDQU xmm1/m128,xmm2         ; F3 0F 7F /r     [WILLAMETTE,SSE2]
3687
3688 \c{MOVDQU} moves a double quadword from the source operand to the
3689 destination operand. When the source or destination operand is a
3690 memory location, the memory may be unaligned.
3691
3692 To move a double quadword to or from known aligned memory locations,
3693 use the \c{MOVDQA} instruction (\k{insMOVDQA}).
3694
3695
3696 \S{insMOVHLPS} \i\c{MOVHLPS}: Move Packed Single-Precision FP High to Low
3697
3698 \c MOVHLPS xmm1,xmm2             ; 0F 12 /r        [KATMAI,SSE]
3699
3700 \c{MOVHLPS} moves the two packed single-precision FP values from the
3701 high quadword of the source register xmm2 to the low quadword of the
3702 destination register, xmm1. The upper quadword of xmm1 is left unchanged.
3703
3704 The operation of this instruction is:
3705
3706 \c    dst[0-63]   := src[64-127];
3707 \c    dst[64-127] remains unchanged.
3708
3709
3710 \S{insMOVHPD} \i\c{MOVHPD}: Move High Packed Double-Precision FP
3711
3712 \c MOVHPD xmm,m64               ; 66 0F 16 /r      [WILLAMETTE,SSE2]
3713 \c MOVHPD m64,xmm               ; 66 0F 17 /r      [WILLAMETTE,SSE2]
3714
3715 \c{MOVHPD} moves a double-precision FP value between the source and
3716 destination operands. One of the operands is a 64-bit memory location,
3717 the other is the high quadword of an \c{XMM} register.
3718
3719 The operation of this instruction is:
3720
3721 \c    mem[0-63]   := xmm[64-127];
3722
3723 or
3724
3725 \c    xmm[0-63]   remains unchanged;
3726 \c    xmm[64-127] := mem[0-63].
3727
3728
3729 \S{insMOVHPS} \i\c{MOVHPS}: Move High Packed Single-Precision FP
3730
3731 \c MOVHPS xmm,m64               ; 0F 16 /r         [KATMAI,SSE]
3732 \c MOVHPS m64,xmm               ; 0F 17 /r         [KATMAI,SSE]
3733
3734 \c{MOVHPS} moves two packed single-precision FP values between the source
3735 and destination operands. One of the operands is a 64-bit memory location,
3736 the other is the high quadword of an \c{XMM} register.
3737
3738 The operation of this instruction is:
3739
3740 \c    mem[0-63]   := xmm[64-127];
3741
3742 or
3743
3744 \c    xmm[0-63]   remains unchanged;
3745 \c    xmm[64-127] := mem[0-63].
3746
3747
3748 \S{insMOVLHPS} \i\c{MOVLHPS}: Move Packed Single-Precision FP Low to High
3749
3750 \c MOVLHPS xmm1,xmm2             ; 0F 16 /r         [KATMAI,SSE]
3751
3752 \c{MOVLHPS} moves the two packed single-precision FP values from the
3753 low quadword of the source register xmm2 to the high quadword of the
3754 destination register, xmm1. The low quadword of xmm1 is left unchanged.
3755
3756 The operation of this instruction is:
3757
3758 \c    dst[0-63]   remains unchanged;
3759 \c    dst[64-127] := src[0-63].
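
The two register-to-register moves, \c{MOVHLPS} (\k{insMOVHLPS}) and \c{MOVLHPS}, can be modelled in C as follows. This is only a sketch: the type \c{xmm_t} and the function names are hypothetical, and each register is represented as a pair of 64-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an XMM register as two 64-bit halves. */
typedef struct {
    uint64_t lo;   /* bits 0-63   */
    uint64_t hi;   /* bits 64-127 */
} xmm_t;

static_assert(sizeof(xmm_t) == 16, "model expects a 128-bit register image");

/* MOVHLPS: dst[0-63] := src[64-127]; dst[64-127] remains unchanged. */
xmm_t movhlps_model(xmm_t dst, xmm_t src)
{
    dst.lo = src.hi;
    return dst;
}

/* MOVLHPS: dst[64-127] := src[0-63]; dst[0-63] remains unchanged. */
xmm_t movlhps_model(xmm_t dst, xmm_t src)
{
    dst.hi = src.lo;
    return dst;
}
```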
3760
3761 \S{insMOVLPD} \i\c{MOVLPD}: Move Low Packed Double-Precision FP
3762
3763 \c MOVLPD xmm,m64                ; 66 0F 12 /r     [WILLAMETTE,SSE2]
3764 \c MOVLPD m64,xmm                ; 66 0F 13 /r     [WILLAMETTE,SSE2]
3765
3766 \c{MOVLPD} moves a double-precision FP value between the source and
3767 destination operands. One of the operands is a 64-bit memory location,
3768 the other is the low quadword of an \c{XMM} register.
3769
3770 The operation of this instruction is:
3771
3772 \c    mem[0-63]   := xmm[0-63];
3773
3774 or
3775
3776 \c    xmm[0-63]   := mem[0-63];
3777 \c    xmm[64-127] remains unchanged.
3778
3779 \S{insMOVLPS} \i\c{MOVLPS}: Move Low Packed Single-Precision FP
3780
3781 \c MOVLPS xmm,m64                ; 0F 12 /r        [KATMAI,SSE]
3782 \c MOVLPS m64,xmm                ; 0F 13 /r        [KATMAI,SSE]
3783
3784 \c{MOVLPS} moves two packed single-precision FP values between the source
3785 and destination operands. One of the operands is a 64-bit memory location,
3786 the other is the low quadword of an \c{XMM} register.
3787
3788 The operation of this instruction is:
3789
3790 \c    mem[0-63]   := xmm[0-63];
3791
3792 or
3793
3794 \c    xmm[0-63]   := mem[0-63];
3795 \c    xmm[64-127] remains unchanged.
3796
3797
3798 \S{insMOVMSKPD} \i\c{MOVMSKPD}: Extract Packed Double-Precision FP Sign Mask
3799
3800 \c MOVMSKPD reg32,xmm              ; 66 0F 50 /r   [WILLAMETTE,SSE2]
3801
3802 \c{MOVMSKPD} writes a 2-bit mask to reg32, formed of the most significant
3803 (sign) bits of each double-precision FP number of the source operand.
3804
3805
3806 \S{insMOVMSKPS} \i\c{MOVMSKPS}: Extract Packed Single-Precision FP Sign Mask
3807
3808 \c MOVMSKPS reg32,xmm              ; 0F 50 /r      [KATMAI,SSE]
3809
3810 \c{MOVMSKPS} writes a 4-bit mask to reg32, formed of the most significant
3811 (sign) bits of each single-precision FP number of the source operand.
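
A C model of the mask construction may make the bit assignment clearer. The sketch below rests on stated assumptions: the function name is hypothetical, element 0 is taken to be the lowest doubleword of the source register and maps to bit 0 of the result, and all remaining result bits are zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "model assumes a 32-bit float");

/* Bit i of the result is the sign bit (bit 31) of element i. */
uint32_t movmskps_model(const float src[4])
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &src[i], sizeof bits);   /* read the raw bit pattern */
        mask |= (bits >> 31) << i;
    }
    return mask;
}
```

Note that the mask reflects the raw sign bit, so a negative zero sets its bit just as any other negative value does.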
3812
3813
3814 \S{insMOVNTDQ} \i\c{MOVNTDQ}: Move Double Quadword Non Temporal
3815
3816 \c MOVNTDQ m128,xmm              ; 66 0F E7 /r     [WILLAMETTE,SSE2]
3817
3818 \c{MOVNTDQ} moves the double quadword from the \c{XMM} source
3819 register to the destination memory location, using a non-temporal
3820 hint. This store instruction minimizes cache pollution.
3821
3822
3823 \S{insMOVNTI} \i\c{MOVNTI}: Move Doubleword Non Temporal
3824
3825 \c MOVNTI m32,reg32              ; 0F C3 /r        [WILLAMETTE,SSE2]
3826
3827 \c{MOVNTI} moves the doubleword in the source register
3828 to the destination memory location, using a non-temporal
3829 hint. This store instruction minimizes cache pollution.
3830
3831
3832 \S{insMOVNTPD} \i\c{MOVNTPD}: Move Aligned Two Packed Double-Precision
3833 FP Values Non Temporal
3834
3835 \c MOVNTPD m128,xmm              ; 66 0F 2B /r     [WILLAMETTE,SSE2]
3836
3837 \c{MOVNTPD} moves the double quadword from the \c{XMM} source
3838 register to the destination memory location, using a non-temporal
3839 hint. This store instruction minimizes cache pollution. The memory
3840 location must be aligned to a 16-byte boundary.
3841
3842
3843 \S{insMOVNTPS} \i\c{MOVNTPS}: Move Aligned Four Packed Single-Precision
3844 FP Values Non Temporal
3845
3846 \c MOVNTPS m128,xmm              ; 0F 2B /r        [KATMAI,SSE]
3847
3848 \c{MOVNTPS} moves the double quadword from the \c{XMM} source
3849 register to the destination memory location, using a non-temporal
3850 hint. This store instruction minimizes cache pollution. The memory
3851 location must be aligned to a 16-byte boundary.
3852
3853
3854 \S{insMOVNTQ} \i\c{MOVNTQ}: Move Quadword Non Temporal
3855
3856 \c MOVNTQ m64,mm                 ; 0F E7 /r        [KATMAI,MMX]
3857
3858 \c{MOVNTQ} moves the quadword in the \c{MMX} source register
3859 to the destination memory location, using a non-temporal
3860 hint. This store instruction minimizes cache pollution.
3861
3862
3863 \S{insMOVQ} \i\c{MOVQ}: Move Quadword to/from MMX Register
3864
3865 \c MOVQ mm1,mm2/m64               ; 0F 6F /r             [PENT,MMX]
3866 \c MOVQ mm1/m64,mm2               ; 0F 7F /r             [PENT,MMX]
3867
3868 \c MOVQ xmm1,xmm2/m64             ; F3 0F 7E /r    [WILLAMETTE,SSE2]
3869 \c MOVQ xmm1/m64,xmm2             ; 66 0F D6 /r    [WILLAMETTE,SSE2]
3870
3871 \c{MOVQ} copies 64 bits from its source (second) operand into its
3872 destination (first) operand. When the source is an \c{XMM} register,
3873 the low quadword is moved. When the destination is an \c{XMM} register,
3874 the destination is the low quadword, and the high quadword is cleared.
3875
3876
3877 \S{insMOVQ2DQ} \i\c{MOVQ2DQ}: Move Quadword from MMX to XMM Register
3878
3879 \c MOVQ2DQ xmm,mm                ; F3 0F D6 /r     [WILLAMETTE,SSE2]
3880
3881 \c{MOVQ2DQ} moves the quadword from the source operand to the low
3882 quadword of the destination operand, and clears the high quadword.
3883
3884
3885 \S{insMOVSB} \i\c{MOVSB}, \i\c{MOVSW}, \i\c{MOVSD}: Move String
3886
3887 \c MOVSB                         ; A4                   [8086]
3888 \c MOVSW                         ; o16 A5               [8086]
3889 \c MOVSD                         ; o32 A5               [386]
3890
3891 \c{MOVSB} copies the byte at \c{[DS:SI]} or \c{[DS:ESI]} to
3892 \c{[ES:DI]} or \c{[ES:EDI]}. It then increments or decrements
3893 (depending on the direction flag: increments if the flag is clear,
3894 decrements if it is set) \c{SI} and \c{DI} (or \c{ESI} and \c{EDI}).
3895
3896 The registers used are \c{SI} and \c{DI} if the address size is 16
3897 bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
3898 an address size not equal to the current \c{BITS} setting, you can
3899 use an explicit \i\c{a16} or \i\c{a32} prefix.
3900
3901 The segment register used to load from \c{[SI]} or \c{[ESI]} can be
3902 overridden by using a segment register name as a prefix (for
3903 example, \c{es movsb}). The use of \c{ES} for the store to \c{[DI]}
3904 or \c{[EDI]} cannot be overridden.
3905
3906 \c{MOVSW} and \c{MOVSD} work in the same way, but they copy a word
3907 or a doubleword instead of a byte, and increment or decrement the
3908 addressing registers by 2 or 4 instead of 1.
3909
3910 The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
3911 \c{ECX} - again, the address size chooses which) times.
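
The behaviour of \c{REP MOVSB} can be sketched in C as below. This is a simplified model rather than a description of the hardware: the structure and function names are hypothetical, memory is a flat byte array, and segment registers and address-size wrap-around are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical machine state: flat memory, the two index registers,
   the count register and the direction flag. */
typedef struct {
    uint8_t *mem;
    size_t   si, di, cx;
    int      df;            /* 0 = increment, 1 = decrement */
} movs_state;

/* REP MOVSB: copy one byte, step SI and DI according to the
   direction flag, decrement CX, and repeat until CX is zero. */
void rep_movsb_model(movs_state *s)
{
    assert(s != NULL && s->mem != NULL);
    while (s->cx != 0) {
        s->mem[s->di] = s->mem[s->si];
        if (s->df) { s->si--; s->di--; }
        else       { s->si++; s->di++; }
        s->cx--;
    }
}
```

With the direction flag set, the index registers should start at the last byte of each block, since they are stepped downwards after every copy.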
3912
3913
3914 \S{insMOVSD} \i\c{MOVSD}: Move Scalar Double-Precision FP Value
3915
3916 \c MOVSD xmm1,xmm2/m64           ; F2 0F 10 /r     [WILLAMETTE,SSE2]
3917 \c MOVSD xmm1/m64,xmm2           ; F2 0F 11 /r     [WILLAMETTE,SSE2]
3918
3919 \c{MOVSD} moves a double-precision FP value from the source operand
3920 to the destination operand. When the source or destination is a
3921 register, the low-order FP value is read or written.
3922
3923
3924 \S{insMOVSS} \i\c{MOVSS}: Move Scalar Single-Precision FP Value
3925
3926 \c MOVSS xmm1,xmm2/m32           ; F3 0F 10 /r     [KATMAI,SSE]
3927 \c MOVSS xmm1/m32,xmm2           ; F3 0F 11 /r     [KATMAI,SSE]
3928
3929 \c{MOVSS} moves a single-precision FP value from the source operand
3930 to the destination operand. When the source or destination is a
3931 register, the low-order FP value is read or written.
3932
3933
3934 \S{insMOVSX} \i\c{MOVSX}, \i\c{MOVZX}: Move Data with Sign or Zero Extend
3935
3936 \c MOVSX reg16,r/m8              ; o16 0F BE /r         [386]
3937 \c MOVSX reg32,r/m8              ; o32 0F BE /r         [386]
3938 \c MOVSX reg32,r/m16             ; o32 0F BF /r         [386]
3939
3940 \c MOVZX reg16,r/m8              ; o16 0F B6 /r         [386]
3941 \c MOVZX reg32,r/m8              ; o32 0F B6 /r         [386]
3942 \c MOVZX reg32,r/m16             ; o32 0F B7 /r         [386]
3943
3944 \c{MOVSX} sign-extends its source (second) operand to the length of
3945 its destination (first) operand, and copies the result into the
3946 destination operand. \c{MOVZX} does the same, but zero-extends
3947 rather than sign-extending.
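
The difference between the two extensions can be shown with a small C model. The function names are hypothetical, and the sketch covers only the forms with a 32-bit destination; for a 16-bit destination the result would be truncated to its low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* MOVSX: copy the top bit of the source into every higher bit. */
uint32_t movsx_model(uint32_t src, unsigned src_bits)
{
    assert(src_bits == 8 || src_bits == 16);
    uint32_t mask = (1u << src_bits) - 1u;    /* bits that hold the source */
    uint32_t sign = 1u << (src_bits - 1u);    /* the source's top bit      */
    src &= mask;
    return (src & sign) ? (src | ~mask) : src;
}

/* MOVZX: fill every higher bit with zero. */
uint32_t movzx_model(uint32_t src, unsigned src_bits)
{
    assert(src_bits == 8 || src_bits == 16);
    return src & ((1u << src_bits) - 1u);
}
```

For the source byte \c{80h}, the sign-extending model gives \c{FFFFFF80h} and the zero-extending model gives \c{00000080h}.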
3948
3949
3950 \S{insMOVUPD} \i\c{MOVUPD}: Move Unaligned Packed Double-Precision FP Values
3951
3952 \c MOVUPD xmm1,xmm2/mem128       ; 66 0F 10 /r     [WILLAMETTE,SSE2]
3953 \c MOVUPD xmm1/mem128,xmm2       ; 66 0F 11 /r     [WILLAMETTE,SSE2]
3954
3955 \c{MOVUPD} moves a double quadword containing 2 packed double-precision
3956 FP values from the source operand to the destination. This instruction
3957 makes no assumptions about alignment of memory operands.
3958
3959 To move data in and out of memory locations that are known to be on 16-byte
3960 boundaries, use the \c{MOVAPD} instruction (\k{insMOVAPD}).
3961
3962
3963 \S{insMOVUPS} \i\c{MOVUPS}: Move Unaligned Packed Single-Precision FP Values
3964
3965 \c MOVUPS xmm1,xmm2/mem128       ; 0F 10 /r        [KATMAI,SSE]
3966 \c MOVUPS xmm1/mem128,xmm2       ; 0F 11 /r        [KATMAI,SSE]
3967
3968 \c{MOVUPS} moves a double quadword containing 4 packed single-precision
3969 FP values from the source operand to the destination. This instruction
3970 makes no assumptions about alignment of memory operands.
3971
3972 To move data in and out of memory locations that are known to be on 16-byte
3973 boundaries, use the \c{MOVAPS} instruction (\k{insMOVAPS}).
3974
3975
3976 \S{insMUL} \i\c{MUL}: Unsigned Integer Multiply
3977
3978 \c MUL r/m8                      ; F6 /4                [8086]
3979 \c MUL r/m16                     ; o16 F7 /4            [8086]
3980 \c MUL r/m32                     ; o32 F7 /4            [386]
3981
3982 \c{MUL} performs unsigned integer multiplication. The other operand
3983 to the multiplication, and the destination operand, are implicit, in
3984 the following way:
3985
3986 \b For \c{MUL r/m8}, \c{AL} is multiplied by the given operand; the
3987 product is stored in \c{AX}.
3988
3989 \b For \c{MUL r/m16}, \c{AX} is multiplied by the given operand;
3990 the product is stored in \c{DX:AX}.
3991
3992 \b For \c{MUL r/m32}, \c{EAX} is multiplied by the given operand;
3993 the product is stored in \c{EDX:EAX}.
3994
3995 Signed integer multiplication is performed by the \c{IMUL}
3996 instruction: see \k{insIMUL}.
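
The implicit operands of the three \c{MUL} forms can be modelled in C as follows. The function names and the use of output pointers for the register pairs are assumptions of this sketch; flag updates are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MUL r/m8: AX := AL * src. */
uint16_t mul8_model(uint8_t al, uint8_t src)
{
    return (uint16_t)((unsigned)al * src);
}

/* MUL r/m16: DX:AX := AX * src. */
void mul16_model(uint16_t ax, uint16_t src, uint16_t *dx_out, uint16_t *ax_out)
{
    assert(dx_out != NULL && ax_out != NULL);
    uint32_t product = (uint32_t)ax * src;
    *dx_out = (uint16_t)(product >> 16);
    *ax_out = (uint16_t)(product & 0xFFFFu);
}

/* MUL r/m32: EDX:EAX := EAX * src. */
void mul32_model(uint32_t eax, uint32_t src, uint32_t *edx_out, uint32_t *eax_out)
{
    assert(edx_out != NULL && eax_out != NULL);
    uint64_t product = (uint64_t)eax * src;
    *edx_out = (uint32_t)(product >> 32);
    *eax_out = (uint32_t)(product & 0xFFFFFFFFu);
}
```

In each case the product is twice as wide as the operands, so it can never overflow its destination.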
3997
3998
3999 \S{insMULPD} \i\c{MULPD}: Packed Double-FP Multiply
4000
4001 \c MULPD xmm1,xmm2/mem128        ; 66 0F 59 /r     [WILLAMETTE,SSE2]
4002
4003 \c{MULPD} performs a SIMD multiply of the packed double-precision FP
4004 values in both operands, and stores the results in the destination register.
4005
4006
4007 \S{insMULPS} \i\c{MULPS}: Packed Single-FP Multiply
4008
4009 \c MULPS xmm1,xmm2/mem128        ; 0F 59 /r        [KATMAI,SSE]
4010
4011 \c{MULPS} performs a SIMD multiply of the packed single-precision FP
4012 values in both operands, and stores the results in the destination register.
4013
4014
4015 \S{insMULSD} \i\c{MULSD}: Scalar Double-FP Multiply
4016
4017 \c MULSD xmm1,xmm2/mem64         ; F2 0F 59 /r     [WILLAMETTE,SSE2]
4018
4019 \c{MULSD} multiplies the lowest double-precision FP values of both
4020 operands, and stores the result in the low quadword of xmm1.
4021
4022
4023 \S{insMULSS} \i\c{MULSS}: Scalar Single-FP Multiply
4024
4025 \c MULSS xmm1,xmm2/mem32         ; F3 0F 59 /r     [KATMAI,SSE]
4026
4027 \c{MULSS} multiplies the lowest single-precision FP values of both
4028 operands, and stores the result in the low doubleword of xmm1.
4029
4030
4031 \S{insNEG} \i\c{NEG}, \i\c{NOT}: Two's and One's Complement
4032
4033 \c NEG r/m8                      ; F6 /3                [8086]
4034 \c NEG r/m16                     ; o16 F7 /3            [8086]
4035 \c NEG r/m32                     ; o32 F7 /3            [386]
4036
4037 \c NOT r/m8                      ; F6 /2                [8086]
4038 \c NOT r/m16                     ; o16 F7 /2            [8086]
4039 \c NOT r/m32                     ; o32 F7 /2            [386]
4040
4041 \c{NEG} replaces the contents of its operand by the two's complement
4042 negation (invert all the bits and then add one) of the original
4043 value. \c{NOT}, similarly, performs one's complement (inverts all
4044 the bits).
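
The relationship between the two operations can be expressed in a few lines of C. The function names are hypothetical, the operand size is passed explicitly as an assumption of this sketch, and flag updates are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the low 8, 16 or 32 bits of a value. */
static uint32_t width_mask(unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    return 0xFFFFFFFFu >> (32u - bits);
}

/* NOT: invert all the bits of the operand. */
uint32_t not_model(uint32_t v, unsigned bits)
{
    return ~v & width_mask(bits);
}

/* NEG: invert all the bits and then add one. */
uint32_t neg_model(uint32_t v, unsigned bits)
{
    return (~v + 1u) & width_mask(bits);
}
```

One consequence of two's complement arithmetic is that the most negative value negates to itself: \c{neg_model(0x80, 8)} is again \c{80h}.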
4045
4046
4047 \S{insNOP} \i\c{NOP}: No Operation
4048
4049 \c NOP                           ; 90                   [8086]
4050
4051 \c{NOP} performs no operation. Its opcode is the same as that
4052 generated by \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the
4053 processor mode; see \k{insXCHG}).
4054
4055
4056 \S{insOR} \i\c{OR}: Bitwise OR
4057
4058 \c OR r/m8,reg8                  ; 08 /r                [8086]
4059 \c OR r/m16,reg16                ; o16 09 /r            [8086]
4060 \c OR r/m32,reg32                ; o32 09 /r            [386]
4061
4062 \c OR reg8,r/m8                  ; 0A /r                [8086]
4063 \c OR reg16,r/m16                ; o16 0B /r            [8086]
4064 \c OR reg32,r/m32                ; o32 0B /r            [386]
4065
4066 \c OR r/m8,imm8                  ; 80 /1 ib             [8086]
4067 \c OR r/m16,imm16                ; o16 81 /1 iw         [8086]
4068 \c OR r/m32,imm32                ; o32 81 /1 id         [386]
4069
4070 \c OR r/m16,imm8                 ; o16 83 /1 ib         [8086]
4071 \c OR r/m32,imm8                 ; o32 83 /1 ib         [386]
4072
4073 \c OR AL,imm8                    ; 0C ib                [8086]
4074 \c OR AX,imm16                   ; o16 0D iw            [8086]
4075 \c OR EAX,imm32                  ; o32 0D id            [386]
4076
4077 \c{OR} performs a bitwise OR operation between its two operands
4078 (i.e. each bit of the result is 1 if and only if at least one of the
4079 corresponding bits of the two inputs was 1), and stores the result
4080 in the destination (first) operand.
4081
4082 In the forms with an 8-bit immediate second operand and a longer
4083 first operand, the second operand is considered to be signed, and is
4084 sign-extended to the length of the first operand. In these cases,
4085 the \c{BYTE} qualifier is necessary to force NASM to generate this
4086 form of the instruction.
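
The effect of the sign extension is easy to overlook, so a small C model may help. The function name is hypothetical and the operand size is passed explicitly as an assumption of this sketch; flag updates are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* OR r/m16,imm8 and OR r/m32,imm8: the 8-bit immediate is
   sign-extended to the operand size before the OR is performed. */
uint32_t or_imm8_model(uint32_t dst, uint8_t imm8, unsigned bits)
{
    assert(bits == 16 || bits == 32);
    uint32_t mask = 0xFFFFFFFFu >> (32u - bits);
    uint32_t ext  = (imm8 & 0x80u) ? (0xFFFFFF00u | imm8) : imm8;
    return (dst | ext) & mask;
}
```

An immediate of \c{80h} or above therefore sets every bit above bit 7 of the destination, which is why this short form cannot stand in for an arbitrary 16-bit or 32-bit immediate.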
4087
4088 The MMX instruction \c{POR} (see \k{insPOR}) performs the same
4089 operation on the 64-bit MMX registers.
4090
4091
4092 \S{insORPD} \i\c{ORPD}: Bit-wise Logical OR of Double-Precision FP Data
4093
4094 \c ORPD xmm1,xmm2/m128           ; 66 0F 56 /r     [WILLAMETTE,SSE2]
4095
4096 \c{ORPD} performs a bit-wise logical OR between xmm1 and xmm2/mem,
4097 and stores the result in xmm1. If the source operand is a memory
4098 location, it must be aligned to a 16-byte boundary.
4099
4100
4101 \S{insORPS} \i\c{ORPS}: Bit-wise Logical OR of Single-Precision FP Data
4102
4103 \c ORPS xmm1,xmm2/m128           ; 0F 56 /r        [KATMAI,SSE]
4104
4105 \c{ORPS} performs a bit-wise logical OR between xmm1 and xmm2/mem,
4106 and stores the result in xmm1. If the source operand is a memory
4107 location, it must be aligned to a 16-byte boundary.
4108
4109
4110 \S{insOUT} \i\c{OUT}: Output Data to I/O Port
4111
4112 \c OUT imm8,AL                   ; E6 ib                [8086]
4113 \c OUT imm8,AX                   ; o16 E7 ib            [8086]
4114 \c OUT imm8,EAX                  ; o32 E7 ib            [386]
4115 \c OUT DX,AL                     ; EE                   [8086]
4116 \c OUT DX,AX                     ; o16 EF               [8086]
4117 \c OUT DX,EAX                    ; o32 EF               [386]
4118
4119 \c{OUT} writes the contents of the given source register to the
4120 specified I/O port. The port number may be specified as an immediate
4121 value if it is between 0 and 255, and otherwise must be stored in
4122 \c{DX}. See also \c{IN} (\k{insIN}).
4123
4124
4125 \S{insOUTSB} \i\c{OUTSB}, \i\c{OUTSW}, \i\c{OUTSD}: Output String to I/O Port
4126
4127 \c OUTSB                         ; 6E                   [186]
4128 \c OUTSW                         ; o16 6F               [186]
4129 \c OUTSD                         ; o32 6F               [386]
4130
4131 \c{OUTSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} and writes
4132 it to the I/O port specified in \c{DX}. It then increments or
4133 decrements (depending on the direction flag: increments if the flag
4134 is clear, decrements if it is set) \c{SI} or \c{ESI}.
4135
4136 The register used is \c{SI} if the address size is 16 bits, and
4137 \c{ESI} if it is 32 bits. If you need to use an address size not
4138 equal to the current \c{BITS} setting, you can use an explicit
4139 \i\c{a16} or \i\c{a32} prefix.
4140
4141 The segment register used to load from \c{[SI]} or \c{[ESI]} can be
4142 overridden by using a segment register name as a prefix (for
4143 example, \c{es outsb}).
4144
4145 \c{OUTSW} and \c{OUTSD} work in the same way, but they output a
4146 word or a doubleword instead of a byte, and increment or decrement
4147 the addressing registers by 2 or 4 instead of 1.
4148
4149 The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
4150 \c{ECX} - again, the address size chooses which) times.
4151
4152
4153 \S{insPACKSSDW} \i\c{PACKSSDW}, \i\c{PACKSSWB}, \i\c{PACKUSWB}: Pack Data
4154
4155 \c PACKSSDW mm1,mm2/m64          ; 0F 6B /r             [PENT,MMX]
4156 \c PACKSSWB mm1,mm2/m64          ; 0F 63 /r             [PENT,MMX]
4157 \c PACKUSWB mm1,mm2/m64          ; 0F 67 /r             [PENT,MMX]
4158
4159 \c PACKSSDW xmm1,xmm2/m128       ; 66 0F 6B /r     [WILLAMETTE,SSE2]
4160 \c PACKSSWB xmm1,xmm2/m128       ; 66 0F 63 /r     [WILLAMETTE,SSE2]
4161 \c PACKUSWB xmm1,xmm2/m128       ; 66 0F 67 /r     [WILLAMETTE,SSE2]
4162
4163 All these instructions start by combining the source and destination
4164 operands, then split the result into smaller sections, which they
4165 then pack into the destination register. The \c{MMX} versions pack
4166 two 64-bit operands into one 64-bit register, while the \c{SSE}
4167 versions pack two 128-bit operands into one 128-bit register.
4168
4169 \b \c{PACKSSWB} splits the combined value into words, and then reduces
4170 the words to bytes, using signed saturation. It then packs the bytes
4171 into the destination register in the same order the words were in.
4172
4173 \b \c{PACKSSDW} performs the same operation as \c{PACKSSWB}, except that
4174 it reduces doublewords to words, then packs them into the destination
4175 register.
4176
4177 \b \c{PACKUSWB} performs the same operation as \c{PACKSSWB}, except that
4178 it uses unsigned saturation when reducing the size of the elements.
4179
4180 To perform signed saturation, a number that is too large is replaced by
4181 the largest signed number (\c{7FFFh} or \c{7Fh}) that \e{will} fit, and
4182 one that is too small is replaced by the smallest signed number
4183 (\c{8000h} or \c{80h}) that will fit. To perform unsigned saturation, the
4184 input is still treated as signed: a value too large is replaced by the
4185 largest unsigned number that will fit (\c{FFh}), and a negative value by zero.
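
Signed saturation, as used by \c{PACKSSWB} and \c{PACKSSDW}, can be sketched in C as below. The function names are hypothetical. The packing routine models the \c{MMX} form of \c{PACKSSWB} and assumes that the destination's words supply the low four result bytes and the source's words the high four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturation of a word to a byte. */
int8_t sat_word_to_byte(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;     /* too large: 7Fh */
    if (w < INT8_MIN) return INT8_MIN;     /* too small: 80h */
    return (int8_t)w;
}

/* Signed saturation of a doubleword to a word. */
int16_t sat_dword_to_word(int32_t d)
{
    if (d > INT16_MAX) return INT16_MAX;   /* too large: 7FFFh */
    if (d < INT16_MIN) return INT16_MIN;   /* too small: 8000h */
    return (int16_t)d;
}

/* PACKSSWB on MMX-sized operands: eight words become eight bytes. */
void packsswb_model(const int16_t dst[4], const int16_t src[4], int8_t out[8])
{
    assert(dst != NULL && src != NULL && out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i]     = sat_word_to_byte(dst[i]);
        out[i + 4] = sat_word_to_byte(src[i]);
    }
}
```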
4186
4187
4188 \S{insPADDB} \i\c{PADDB}, \i\c{PADDW}, \i\c{PADDD}: Add Packed Integers
4189
4190 \c PADDB mm1,mm2/m64             ; 0F FC /r             [PENT,MMX]
4191 \c PADDW mm1,mm2/m64             ; 0F FD /r             [PENT,MMX]
4192 \c PADDD mm1,mm2/m64             ; 0F FE /r             [PENT,MMX]
4193
4194 \c PADDB xmm1,xmm2/m128          ; 66 0F FC /r     [WILLAMETTE,SSE2]
4195 \c PADDW xmm1,xmm2/m128          ; 66 0F FD /r     [WILLAMETTE,SSE2]
4196 \c PADDD xmm1,xmm2/m128          ; 66 0F FE /r     [WILLAMETTE,SSE2]
4197
4198 \c{PADDx} performs packed addition of the two operands, storing the
4199 result in the destination (first) operand.
4200
4201 \b \c{PADDB} treats the operands as packed bytes, and adds each byte
4202 individually;
4203
4204 \b \c{PADDW} treats the operands as packed words;
4205
4206 \b \c{PADDD} treats its operands as packed doublewords.
4207
4208 When an individual result is too large to fit in its destination, it
4209 is wrapped around and the low bits are stored, with the carry bit
4210 discarded.
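
The lane-by-lane wrap-around can be modelled in C on a 64-bit, \c{MMX}-sized operand. The function name and the explicit lane-width parameter are assumptions of this sketch: a width of 8 corresponds to \c{PADDB}, 16 to \c{PADDW} and 32 to \c{PADDD}.

```c
#include <assert.h>
#include <stdint.h>

/* Add two packed operands lane by lane, discarding each lane's carry. */
uint64_t padd_model(uint64_t a, uint64_t b, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32);
    uint64_t lane_mask = (1ull << lane_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += lane_bits) {
        /* Only the low lane_bits of the sum are kept, so the carry
           out of this lane never reaches the next one. */
        uint64_t sum = ((a >> shift) + (b >> shift)) & lane_mask;
        result |= sum << shift;
    }
    return result;
}
```

Adding 1 to \c{01FFh} shows the difference between lane widths: with byte lanes the low byte wraps to zero and the result is \c{0100h}, whereas with word lanes the carry propagates and the result is \c{0200h}.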
4211
4212
4213 \S{insPADDQ} \i\c{PADDQ}: Add Packed Quadword Integers
4214
4215 \c PADDQ mm1,mm2/m64             ; 0F D4 /r             [PENT,MMX]
4216
4217 \c PADDQ xmm1,xmm2/m128          ; 66 0F D4 /r     [WILLAMETTE,SSE2]
4218
4219 \c{PADDQ} adds the quadwords in the source and destination operands, and
4220 stores the result in the destination register.
4221
4222 When an individual result is too large to fit in its destination, it
4223 is wrapped around and the low bits are stored, with the carry bit
4224 discarded.
4225
4226
4227 \S{insPADDSB} \i\c{PADDSB}, \i\c{PADDSW}: Add Packed Signed Integers With Saturation
4228
4229 \c PADDSB mm1,mm2/m64            ; 0F EC /r             [PENT,MMX]
4230 \c PADDSW mm1,mm2/m64            ; 0F ED /r             [PENT,MMX]
4231
4232 \c PADDSB xmm1,xmm2/m128         ; 66 0F EC /r     [WILLAMETTE,SSE2]
4233 \c PADDSW xmm1,xmm2/m128         ; 66 0F ED /r     [WILLAMETTE,SSE2]
4234
4235 \c{PADDSx} performs packed addition of the two operands, storing the
4236 result in the destination (first) operand.
4237 \c{PADDSB} treats the operands as packed bytes, and adds each byte
4238 individually; and \c{PADDSW} treats the operands as packed words.
4239
4240 When an individual result is too large to fit in its destination, a
4241 saturated value is stored. The resulting value is the value with the
4242 largest magnitude of the same sign as the result which will fit in
4243 the available space.
4244
4245
4246 \S{insPADDSIW} \i\c{PADDSIW}: MMX Packed Addition to Implicit Destination
4247
4248 \c PADDSIW mmxreg,r/m64          ; 0F 51 /r             [CYRIX,MMX]
4249
4250 \c{PADDSIW}, specific to the Cyrix extensions to the MMX instruction
4251 set, performs the same function as \c{PADDSW}, except that the result
4252 is placed in an implied register.
4253
4254 To work out the implied register, invert the lowest bit in the register
4255 number. So \c{PADDSIW MM0,MM2} would put the result in \c{MM1}, but
4256 \c{PADDSIW MM1,MM2} would put the result in \c{MM0}.
4257
4258
4259 \S{insPADDUSB} \i\c{PADDUSB}, \i\c{PADDUSW}: Add Packed Unsigned Integers With Saturation
4260
4261 \c PADDUSB mm1,mm2/m64           ; 0F DC /r             [PENT,MMX]
4262 \c PADDUSW mm1,mm2/m64           ; 0F DD /r             [PENT,MMX]
4263
4264 \c PADDUSB xmm1,xmm2/m128         ; 66 0F DC /r    [WILLAMETTE,SSE2]
4265 \c PADDUSW xmm1,xmm2/m128         ; 66 0F DD /r    [WILLAMETTE,SSE2]
4266
4267 \c{PADDUSx} performs packed addition of the two operands, storing the
4268 result in the destination (first) operand.
4269 \c{PADDUSB} treats the operands as packed bytes, and adds each byte
4270 individually; and \c{PADDUSW} treats the operands as packed words.
4271
4272 When an individual result is too large to fit in its destination, a
4273 saturated value is stored. The resulting value is the maximum value
4274 that will fit in the available space.
4275
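To contrast the signed and unsigned forms of saturation, the
following sketch assumes (for illustration only) that byte 0 of
\c{MM1} holds \c{0x7F}, byte 0 of \c{MM2} holds \c{0xFF}, and byte 0
of \c{MM3} holds \c{0x02}:

\c          paddsb  mm1,mm3         ; signed: 127 + 2 saturates to 0x7F
\c          paddusb mm2,mm3         ; unsigned: 255 + 2 saturates to 0xFF

With the wrapping \c{PADDB}, the same inputs would instead give
\c{0x81} and \c{0x01} respectively.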
4276
4277 \S{insPAND} \i\c{PAND}, \i\c{PANDN}: MMX Bitwise AND and AND-NOT
4278
4279 \c PAND mm1,mm2/m64              ; 0F DB /r             [PENT,MMX]
4280 \c PANDN mm1,mm2/m64             ; 0F DF /r             [PENT,MMX]
4281
4282 \c PAND xmm1,xmm2/m128           ; 66 0F DB /r     [WILLAMETTE,SSE2]
4283 \c PANDN xmm1,xmm2/m128          ; 66 0F DF /r     [WILLAMETTE,SSE2]
4284
4285
4286 \c{PAND} performs a bitwise AND operation between its two operands
4287 (i.e. each bit of the result is 1 if and only if the corresponding
4288 bits of the two inputs were both 1), and stores the result in the
4289 destination (first) operand.
4290
4291 \c{PANDN} performs the same operation, but performs a one's
4292 complement operation on the destination (first) operand first.
4293
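A common use of the pair is a per-bit select. The following sketch
assumes, purely for illustration, that \c{MM0} holds a mask, \c{MM1}
holds a value \c{a} and \c{MM2} holds a value \c{b}; it also relies
on \c{POR}, the MMX bitwise OR instruction.

\c          pand    mm1,mm0         ; mm1 = a AND mask
\c          pandn   mm0,mm2         ; mm0 = (NOT mask) AND b
\c          por     mm0,mm1         ; mm0 = a where mask is 1, else b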
4294
4295 \S{insPAUSE} \i\c{PAUSE}: Spin Loop Hint
4296
4297 \c PAUSE                         ; F3 90           [WILLAMETTE,SSE2]
4298
4299 \c{PAUSE} provides a hint to the processor that the following code
4300 is a spin loop. This improves processor performance by avoiding
4301 possible memory order violations. On older processors, this instruction
4302 operates as a \c{NOP}.
4303
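A typical placement, sketched here for \c{BITS 32} code with a
hypothetical doubleword flag \c{lock_var} that another processor is
expected to clear, is inside the body of the wait loop:

\c spin:    pause                   ; hint that this is a spin loop
\c          cmp     dword [lock_var],0
\c          jne     spin            ; keep waiting until it becomes zero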
4304
4305 \S{insPAVEB} \i\c{PAVEB}: MMX Packed Average
4306
4307 \c PAVEB mmxreg,r/m64            ; 0F 50 /r             [CYRIX,MMX]
4308
4309 \c{PAVEB}, specific to the Cyrix MMX extensions, treats its two
4310 operands as vectors of eight unsigned bytes, and calculates the
4311 average of the corresponding bytes in the operands. The resulting
4312 vector of eight averages is stored in the first operand.
4313
4314 This opcode maps to \c{MOVMSKPS r32, xmm} on processors that support
4315 the SSE instruction set.
4316
4317
4318 \S{insPAVGB} \i\c{PAVGB}, \i\c{PAVGW}: Average Packed Integers
4319
4320 \c PAVGB mm1,mm2/m64             ; 0F E0 /r        [KATMAI,MMX]
4321 \c PAVGW mm1,mm2/m64             ; 0F E3 /r        [KATMAI,MMX,SM]
4322
4323 \c PAVGB xmm1,xmm2/m128          ; 66 0F E0 /r     [WILLAMETTE,SSE2]
4324 \c PAVGW xmm1,xmm2/m128          ; 66 0F E3 /r     [WILLAMETTE,SSE2]
4325
4326 \c{PAVGB} and \c{PAVGW} add the unsigned data elements of the source
4327 operand to the unsigned data elements of the destination register,
4328 then add 1 to the temporary results. The results of the add are then
4329 each independently right-shifted by one bit position. The high order
4330 bits of each element are filled with the carry bits of the corresponding
4331 sum.
4332
4333 \b \c{PAVGB} operates on packed unsigned bytes, and
4334
4335 \b \c{PAVGW} operates on packed unsigned words.
4336
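As a worked example, assume (for illustration only) that byte 0 of
\c{MM0} holds 1 and byte 0 of \c{MM1} holds 2, while byte 1 of both
registers holds 255:

\c          pavgb   mm0,mm1         ; byte 0: (1 + 2 + 1) >> 1 = 2
\c                                  ; byte 1: (255 + 255 + 1) >> 1 = 255

The second result shows why the carry bit matters: the intermediate
sum of 511 does not fit in eight bits, but the shifted result does.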
4337
4338 \S{insPAVGUSB} \i\c{PAVGUSB}: Average of unsigned packed 8-bit values
4339
4340 \c PAVGUSB mm1,mm2/m64           ; 0F 0F /r BF          [PENT,3DNOW]
4341
4342 \c{PAVGUSB} adds the unsigned data elements of the source operand to
4343 the unsigned data elements of the destination register, then adds 1
4344 to the temporary results. The results of the add are then each
4345 independently right-shifted by one bit position. The high order bits
4346 of each element are filled with the carry bits of the corresponding
4347 sum.
4348
4349 This instruction performs exactly the same operations as the \c{PAVGB}
4350 \c{MMX} instruction (\k{insPAVGB}).
4351
4352
4353 \S{insPCMPEQB} \i\c{PCMPxx}: Compare Packed Integers
4354
4355 \c PCMPEQB mm1,mm2/m64           ; 0F 74 /r             [PENT,MMX]
4356 \c PCMPEQW mm1,mm2/m64           ; 0F 75 /r             [PENT,MMX]
4357 \c PCMPEQD mm1,mm2/m64           ; 0F 76 /r             [PENT,MMX]
4358
4359 \c PCMPGTB mm1,mm2/m64           ; 0F 64 /r             [PENT,MMX]
4360 \c PCMPGTW mm1,mm2/m64           ; 0F 65 /r             [PENT,MMX]
4361 \c PCMPGTD mm1,mm2/m64           ; 0F 66 /r             [PENT,MMX]
4362
4363 \c PCMPEQB xmm1,xmm2/m128        ; 66 0F 74 /r     [WILLAMETTE,SSE2]
4364 \c PCMPEQW xmm1,xmm2/m128        ; 66 0F 75 /r     [WILLAMETTE,SSE2]
4365 \c PCMPEQD xmm1,xmm2/m128        ; 66 0F 76 /r     [WILLAMETTE,SSE2]
4366
4367 \c PCMPGTB xmm1,xmm2/m128        ; 66 0F 64 /r     [WILLAMETTE,SSE2]
4368 \c PCMPGTW xmm1,xmm2/m128        ; 66 0F 65 /r     [WILLAMETTE,SSE2]
4369 \c PCMPGTD xmm1,xmm2/m128        ; 66 0F 66 /r     [WILLAMETTE,SSE2]
4370
4371 The \c{PCMPxx} instructions all treat their operands as vectors of
4372 bytes, words, or doublewords; corresponding elements of the source
4373 and destination are compared, and the corresponding element of the
4374 destination (first) operand is set to all zeros or all ones
4375 depending on the result of the comparison.
4376
4377 \b \c{PCMPxxB} treats the operands as vectors of bytes;
4378
4379 \b \c{PCMPxxW} treats the operands as vectors of words;
4380
4381 \b \c{PCMPxxD} treats the operands as vectors of doublewords;
4382
4383 \b \c{PCMPEQx} sets the corresponding element of the destination
4384 operand to all ones if the two elements compared are equal;
4385
4386 \b \c{PCMPGTx} sets the destination element to all ones if the element
4387 of the first (destination) operand is greater (treated as a signed
4388 integer) than that of the second (source) operand.
4389
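Two short sketches follow, with register contents assumed only for
the sake of the example. Comparing a register with itself is always
true, which gives a compact way of producing all ones. The signed
nature of \c{PCMPGTx} shows up when word 0 of \c{MM2} holds
\c{0xFFFF} (that is, -1) and word 0 of \c{MM1} holds \c{0x0001}:

\c          pcmpeqb mm0,mm0         ; every byte equals itself: all ones
\c          pcmpgtw mm2,mm1         ; word 0: -1 > 1 is false, so 0x0000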
4390
4391 \S{insPDISTIB} \i\c{PDISTIB}: MMX Packed Distance and Accumulate
4392 with Implied Register
4393
4394 \c PDISTIB mm,m64                ; 0F 54 /r             [CYRIX,MMX]
4395
4396 \c{PDISTIB}, specific to the Cyrix MMX extensions, treats its two
4397 input operands as vectors of eight unsigned bytes. For each byte
4398 position, it finds the absolute difference between the bytes in that
4399 position in the two input operands, and adds that value to the byte
4400 in the same position in the implied output register. The addition is
4401 saturated to an unsigned byte in the same way as \c{PADDUSB}.
4402
4403 To work out the implied register, invert the lowest bit in the register
4404 number. So \c{PDISTIB MM0,M64} would put the result in \c{MM1}, but
4405 \c{PDISTIB MM1,M64} would put the result in \c{MM0}.
4406
4407 Note that \c{PDISTIB} cannot take a register as its second source
4408 operand.
4409
4410 Operation:
4411
4412 \c    dstI[0-7]     := dstI[0-7]   + ABS(src0[0-7] - src1[0-7]),
4413 \c    dstI[8-15]    := dstI[8-15]  + ABS(src0[8-15] - src1[8-15]),
4414 \c    .......
4415 \c    .......
4416 \c    dstI[56-63]   := dstI[56-63] + ABS(src0[56-63] - src1[56-63]).
4417
4418
4419 \S{insPEXTRW} \i\c{PEXTRW}: Extract Word
4420
4421 \c PEXTRW reg32,mm,imm8          ; 0F C5 /r ib     [KATMAI,MMX]
4422 \c PEXTRW reg32,xmm,imm8         ; 66 0F C5 /r ib  [WILLAMETTE,SSE2]
4423
4424 \c{PEXTRW} moves the word in the source register (second operand)
4425 that is pointed to by the count operand (third operand), into the
4426 lower half of a 32-bit general purpose register. The upper half of
4427 the register is cleared to all 0s.
4428
4429 When the source operand is an \c{MMX} register, the two least
4430 significant bits of the count specify the source word. When it is
4431 an \c{XMM} register, the three least significant bits specify the
4432 word location.
4433
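For example, assuming (purely for illustration) that \c{MM0} holds
the quadword \c{0x4444333322221111}:

\c          pextrw  eax,mm0,2       ; EAX = 0x00003333 (bits 32-47)
\c          pextrw  eax,mm0,6       ; same result: only the low two
\c                                  ; bits of the count (6 AND 3 = 2)
\c                                  ; are used for an MMX source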
4434
4435 \S{insPF2ID} \i\c{PF2ID}: Packed Single-Precision FP to Integer Convert
4436
4437 \c PF2ID mm1,mm2/m64             ; 0F 0F /r 1D          [PENT,3DNOW]
4438
4439 \c{PF2ID} converts two single-precision FP values in the source operand
4440 to signed 32-bit integers, using truncation, and stores them in the
4441 destination operand. Source values that are outside the range supported
4442 by the destination are saturated to the largest absolute value of the
4443 same sign.
4444
4445
4446 \S{insPF2IW} \i\c{PF2IW}: Packed Single-Precision FP to Integer Word Convert
4447
4448 \c PF2IW mm1,mm2/m64             ; 0F 0F /r 1C          [PENT,3DNOW]
4449
4450 \c{PF2IW} converts two single-precision FP values in the source operand
4451 to signed 16-bit integers, using truncation, and stores them in the
4452 destination operand. Source values that are outside the range supported
4453 by the destination are saturated to the largest absolute value of the
4454 same sign.
4455
4456 \b In the K6-2 and K6-III, the 16-bit value is zero-extended to 32 bits
4457 before storing.
4458
4459 \b In the K6-2+, K6-III+ and Athlon processors, the value is sign-extended
4460 to 32 bits before storing.
4461
4462
4463 \S{insPFACC} \i\c{PFACC}: Packed Single-Precision FP Accumulate
4464
4465 \c PFACC mm1,mm2/m64             ; 0F 0F /r AE          [PENT,3DNOW]
4466
4467 \c{PFACC} adds the two single-precision FP values from the destination
4468 operand together, then adds the two single-precision FP values from the
4469 source operand, and places the results in the low and high doublewords
4470 of the destination operand.
4471
4472 The operation is:
4473
4474 \c    dst[0-31]   := dst[0-31] + dst[32-63],
4475 \c    dst[32-63]  := src[0-31] + src[32-63].
4476
4477
4478 \S{insPFADD} \i\c{PFADD}: Packed Single-Precision FP Addition
4479
4480 \c PFADD mm1,mm2/m64             ; 0F 0F /r 9E          [PENT,3DNOW]
4481
4482 \c{PFADD} performs addition on each of two packed single-precision
4483 FP value pairs.
4484
4485 \c    dst[0-31]   := dst[0-31]  + src[0-31],
4486 \c    dst[32-63]  := dst[32-63] + src[32-63].
4487
4488
4489 \S{insPFCMP} \i\c{PFCMPxx}: Packed Single-Precision FP Compare
4490 \I\c{PFCMPEQ} \I\c{PFCMPGE} \I\c{PFCMPGT}
4491
4492 \c PFCMPEQ mm1,mm2/m64           ; 0F 0F /r B0          [PENT,3DNOW]
4493 \c PFCMPGE mm1,mm2/m64           ; 0F 0F /r 90          [PENT,3DNOW]
4494 \c PFCMPGT mm1,mm2/m64           ; 0F 0F /r A0          [PENT,3DNOW]
4495
4496 The \c{PFCMPxx} instructions compare the packed single-precision FP values
4497 in the source and destination operands, and set the destination
4498 according to the result. If the condition is true, the destination is
4499 set to all 1s, otherwise it's set to all 0s.
4500
4501 \b \c{PFCMPEQ} tests whether dst == src;
4502
4503 \b \c{PFCMPGE} tests whether dst >= src;
4504
4505 \b \c{PFCMPGT} tests whether dst >  src.
4506
4507
4508 \S{insPFMAX} \i\c{PFMAX}: Packed Single-Precision FP Maximum
4509
4510 \c PFMAX mm1,mm2/m64             ; 0F 0F /r A4          [PENT,3DNOW]
4511
4512 \c{PFMAX} returns the higher of each pair of single-precision FP values.
4513 If the higher value is zero, it is returned as positive zero.
4514
4515
4516 \S{insPFMIN} \i\c{PFMIN}: Packed Single-Precision FP Minimum
4517
4518 \c PFMIN mm1,mm2/m64             ; 0F 0F /r 94          [PENT,3DNOW]
4519
4520 \c{PFMIN} returns the lower of each pair of single-precision FP values.
4521 If the lower value is zero, it is returned as positive zero.
4522
4523
4524 \S{insPFMUL} \i\c{PFMUL}: Packed Single-Precision FP Multiply
4525
4526 \c PFMUL mm1,mm2/m64             ; 0F 0F /r B4          [PENT,3DNOW]
4527
4528 \c{PFMUL} returns the product of each pair of single-precision FP values.
4529
4530 \c    dst[0-31]  := dst[0-31]  * src[0-31],
4531 \c    dst[32-63] := dst[32-63] * src[32-63].
4532
4533
4534 \S{insPFNACC} \i\c{PFNACC}: Packed Single-Precision FP Negative Accumulate
4535
4536 \c PFNACC mm1,mm2/m64            ; 0F 0F /r 8A          [PENT,3DNOW]
4537
4538 \c{PFNACC} performs a negative accumulate of the two single-precision
4539 FP values in the source and destination registers. The result of the
4540 accumulate from the destination register is stored in the low doubleword
4541 of the destination, and the result of the source accumulate is stored in
4542 the high doubleword of the destination register.
4543
4544 The operation is:
4545
4546 \c    dst[0-31]  := dst[0-31] - dst[32-63],
4547 \c    dst[32-63] := src[0-31] - src[32-63].
4548
4549
4550 \S{insPFPNACC} \i\c{PFPNACC}: Packed Single-Precision FP Mixed Accumulate
4551
4552 \c PFPNACC mm1,mm2/m64           ; 0F 0F /r 8E          [PENT,3DNOW]
4553
4554 \c{PFPNACC} performs a positive accumulate of the two single-precision
4555 FP values in the source register and a negative accumulate of the
4556 destination register. The result of the accumulate from the destination
4557 register is stored in the low doubleword of the destination, and the
4558 result of the source accumulate is stored in the high doubleword of the
4559 destination register.
4560
4561 The operation is:
4562
4563 \c    dst[0-31]  := dst[0-31] - dst[32-63],
4564 \c    dst[32-63] := src[0-31] + src[32-63].
4565
4566
4567 \S{insPFRCP} \i\c{PFRCP}: Packed Single-Precision FP Reciprocal Approximation
4568
4569 \c PFRCP mm1,mm2/m64             ; 0F 0F /r 96          [PENT,3DNOW]
4570
4571 \c{PFRCP} performs a low precision estimate of the reciprocal of the
4572 low-order single-precision FP value in the source operand, storing the
4573 result in both halves of the destination register. The result is accurate
4574 to 14 bits.
4575
4576 For higher precision reciprocals, this instruction should be followed by
4577 two more instructions: \c{PFRCPIT1} (\k{insPFRCPIT1}) and \c{PFRCPIT2}
4578 (\k{insPFRCPIT2}). This will result in 24-bit accuracy. For more details,
4579 see the AMD 3DNow! technology manual.
4580
4581
4582 \S{insPFRCPIT1} \i\c{PFRCPIT1}: Packed Single-Precision FP Reciprocal,
4583 First Iteration Step
4584
4585 \c PFRCPIT1 mm1,mm2/m64          ; 0F 0F /r A6          [PENT,3DNOW]
4586
4587 \c{PFRCPIT1} performs the first intermediate step in the calculation of
4588 the reciprocal of a single-precision FP value. The first source value
4589 (\c{mm1}) is the original value, and the second source value (\c{mm2/m64})
4590 is the result of a \c{PFRCP} instruction.
4591
4592 For the final step in a reciprocal, returning the full 24-bit accuracy
4593 of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
4594 more details, see the AMD 3DNow! technology manual.
4595
4596
4597 \S{insPFRCPIT2} \i\c{PFRCPIT2}: Packed Single-Precision FP
4598 Reciprocal/Reciprocal Square Root, Second Iteration Step
4599
4600 \c PFRCPIT2 mm1,mm2/m64          ; 0F 0F /r B6          [PENT,3DNOW]
4601
4602 \c{PFRCPIT2} performs the second and final intermediate step in the
4603 calculation of a reciprocal or reciprocal square root, refining the
4604 values returned by the \c{PFRCP} and \c{PFRSQRT} instructions,
4605 respectively.
4606
4607 The first source value (\c{mm1}) is the output of either a \c{PFRCPIT1}
4608 or a \c{PFRSQIT1} instruction, and the second source is the output of
4609 either the \c{PFRCP} or the \c{PFRSQRT} instruction. For more details,
4610 see the AMD 3DNow! technology manual.
4611
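The operand roles described in this and the two preceding sections
can be put together as in the following sketch. It assumes a
hypothetical label \c{w} holding one single-precision value, and uses
the MMX instruction \c{PUNPCKLDQ} to copy that value into both
halves of the register; the AMD 3DNow! technology manual remains the
authoritative source for the recommended sequence.

\c          movd      mm0,[w]       ; mm0 = 0   | w
\c          pfrcp     mm1,mm0       ; mm1 = 1/w | 1/w  (about 14 bits)
\c          punpckldq mm0,mm0       ; mm0 = w   | w
\c          pfrcpit1  mm0,mm1       ; first iteration step
\c          pfrcpit2  mm0,mm1       ; mm0 = 1/w | 1/w  (24 bits)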
4612
4613 \S{insPFRSQIT1} \i\c{PFRSQIT1}: Packed Single-Precision FP Reciprocal
4614 Square Root, First Iteration Step
4615
4616 \c PFRSQIT1 mm1,mm2/m64          ; 0F 0F /r A7          [PENT,3DNOW]
4617
4618 \c{PFRSQIT1} performs the first intermediate step in the calculation of
4619 the reciprocal square root of a single-precision FP value. The first
4620 source value (\c{mm1}) is the square of the result of a \c{PFRSQRT}
4621 instruction, and the second source value (\c{mm2/m64}) is the original
4622 value.
4623
4624 For the final step in a calculation, returning the full 24-bit accuracy
4625 of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
4626 more details, see the AMD 3DNow! technology manual.
4627
4628
4629 \S{insPFRSQRT} \i\c{PFRSQRT}: Packed Single-Precision FP Reciprocal
4630 Square Root Approximation
4631
4632 \c PFRSQRT mm1,mm2/m64           ; 0F 0F /r 97          [PENT,3DNOW]
4633
4634 \c{PFRSQRT} performs a low precision estimate of the reciprocal square
4635 root of the low-order single-precision FP value in the source operand,
4636 storing the result in both halves of the destination register. The result
4637 is accurate to 15 bits.
4638
4639 For higher precision reciprocal square roots, this instruction should be
4640 followed by two more instructions: \c{PFRSQIT1} (\k{insPFRSQIT1}) and
4641 \c{PFRCPIT2} (\k{insPFRCPIT2}). This will result in 24-bit accuracy.
4642 For more details, see the AMD 3DNow! technology manual.
4643
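As a sketch of the full refinement, assuming a hypothetical label
\c{a} holding one single-precision value: because \c{PFRSQIT1}
expects the square of the estimate as its first operand, the
sequence also needs a \c{PFMUL} and a saved copy of the estimate.
The AMD 3DNow! technology manual remains the authoritative source
for the recommended sequence.

\c          movd      mm0,[a]       ; mm0 = 0  | a
\c          pfrsqrt   mm1,mm0       ; mm1 = x0 | x0, estimate of 1/sqrt(a)
\c          movq      mm2,mm1       ; keep a copy of the estimate
\c          pfmul     mm1,mm1       ; mm1 = x0*x0 | x0*x0
\c          punpckldq mm0,mm0       ; mm0 = a  | a
\c          pfrsqit1  mm1,mm0       ; first iteration step
\c          pfrcpit2  mm1,mm2       ; mm1 = 1/sqrt(a) in both halves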
4644
4645 \S{insPFSUB} \i\c{PFSUB}: Packed Single-Precision FP Subtract
4646
4647 \c PFSUB mm1,mm2/m64             ; 0F 0F /r 9A          [PENT,3DNOW]
4648
4649 \c{PFSUB} subtracts the single-precision FP values in the source from
4650 those in the destination, and stores the result in the destination
4651 operand.
4652
4653 \c    dst[0-31]  := dst[0-31]  - src[0-31],
4654 \c    dst[32-63] := dst[32-63] - src[32-63].
4655
4656
4657 \S{insPFSUBR} \i\c{PFSUBR}: Packed Single-Precision FP Reverse Subtract
4658
4659 \c PFSUBR mm1,mm2/m64            ; 0F 0F /r AA          [PENT,3DNOW]
4660
4661 \c{PFSUBR} subtracts the single-precision FP values in the destination
4662 from those in the source, and stores the result in the destination
4663 operand.
4664
4665 \c    dst[0-31]  := src[0-31]  - dst[0-31],
4666 \c    dst[32-63] := src[32-63] - dst[32-63].
4667
4668
4669 \S{insPI2FD} \i\c{PI2FD}: Packed Doubleword Integer to Single-Precision FP Convert
4670
4671 \c PI2FD mm1,mm2/m64             ; 0F 0F /r 0D          [PENT,3DNOW]
4672
4673 \c{PI2FD} converts two signed 32-bit integers in the source operand
4674 to single-precision FP values, using truncation of significant digits,
4675 and stores them in the destination operand.
4676
4677
4678 \S{insPI2FW} \i\c{PI2FW}: Packed Word Integer to Single-Precision FP Convert
4679
4680 \c PI2FW mm1,mm2/m64             ; 0F 0F /r 0C          [PENT,3DNOW]
4681
4682 \c{PI2FW} converts two signed 16-bit integers in the source operand
4683 to single-precision FP values, and stores them in the destination
4684 operand. The input values are in the low word of each doubleword.
4685
4686
4687 \S{insPINSRW} \i\c{PINSRW}: Insert Word
4688
4689 \c PINSRW mm,r16/r32/m16,imm8    ; 0F C4 /r ib     [KATMAI,MMX]
4690 \c PINSRW xmm,r16/r32/m16,imm8   ; 66 0F C4 /r ib  [WILLAMETTE,SSE2]
4691
4692 \c{PINSRW} takes a word from a 16-bit register (or the low half of a
4693 32-bit register), or from memory, and inserts it at the word position
4694 in the destination register selected by the count operand (third
4695 operand). If the destination is an \c{MMX} register, the low two bits
4696 of the count byte are used; if it is an \c{XMM} register, the low three
4697 bits are used. The insertion is done in such a way that the other
4698 words from the destination register are left untouched.
4699
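For example, assuming (purely for illustration) that \c{MM0} holds
\c{0x4444333322221111} and \c{EAX} holds \c{0xDEADBEEF}:

\c          pinsrw  mm0,eax,1       ; word 1 (bits 16-31) := 0xBEEF
\c                                  ; mm0 = 0x44443333BEEF1111

Only the low half of \c{EAX} is used, and words 0, 2 and 3 of
\c{MM0} keep their previous contents.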
4700
4701 \S{insPMACHRIW} \i\c{PMACHRIW}: Packed Multiply and Accumulate with Rounding
4702
4703 \c PMACHRIW mm,m64               ; 0F 5E /r             [CYRIX,MMX]
4704
4705 \c{PMACHRIW} takes two packed 16-bit integer inputs, multiplies the
4706 values in the inputs, rounds on bit 15 of each result, then adds bits
4707 15-30 of each result to the corresponding position of the \e{implied}
4708 destination register.
4709
4710 The operation of this instruction is:
4711
4712 \c    dstI[0-15]  := dstI[0-15]  + (mm[0-15] *m64[0-15]
4713 \c                                           + 0x00004000)[15-30],
4714 \c    dstI[16-31] := dstI[16-31] + (mm[16-31]*m64[16-31]
4715 \c                                           + 0x00004000)[15-30],
4716 \c    dstI[32-47] := dstI[32-47] + (mm[32-47]*m64[32-47]
4717 \c                                           + 0x00004000)[15-30],
4718 \c    dstI[48-63] := dstI[48-63] + (mm[48-63]*m64[48-63]
4719 \c                                           + 0x00004000)[15-30].
4720
4721 Note that \c{PMACHRIW} cannot take a register as its second source
4722 operand.
4723
4724
4725 \S{insPMADDWD} \i\c{PMADDWD}: MMX Packed Multiply and Add
4726
4727 \c PMADDWD mm1,mm2/m64           ; 0F F5 /r             [PENT,MMX]
4728 \c PMADDWD xmm1,xmm2/m128        ; 66 0F F5 /r     [WILLAMETTE,SSE2]
4729
4730 \c{PMADDWD} treats its two inputs as vectors of signed words. It
4731 multiplies corresponding elements of the two operands, giving doubleword
4732 results. These are then added together in pairs and stored in the
4733 destination operand.
4734
4735 The operation of this instruction is:
4736
4737 \c    dst[0-31]   := (dst[0-15] * src[0-15])
4738 \c                                + (dst[16-31] * src[16-31]);
4739 \c    dst[32-63]  := (dst[32-47] * src[32-47])
4740 \c                                + (dst[48-63] * src[48-63]);
4741
4742 The following apply to the \c{SSE} version of the instruction:
4743
4744 \c    dst[64-95]  := (dst[64-79] * src[64-79])
4745 \c                                + (dst[80-95] * src[80-95]);
4746 \c    dst[96-127] := (dst[96-111] * src[96-111])
4747 \c                                + (dst[112-127] * src[112-127]).
4748
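As a worked example of the MMX form, assume two hypothetical data
items \c{vec_a} and \c{vec_b}, each holding four signed words (the
names and values are illustrative only):

\c vec_a:   dw      1,2,3,4
\c vec_b:   dw      5,6,7,8

\c          movq    mm0,[vec_a]
\c          pmaddwd mm0,[vec_b]     ; low dword:  1*5 + 2*6 = 17
\c                                  ; high dword: 3*7 + 4*8 = 53

Adding the two doublewords afterwards would give the dot product of
the two vectors.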
4749
4750 \S{insPMAGW} \i\c{PMAGW}: MMX Packed Magnitude
4751
4752 \c PMAGW mm1,mm2/m64             ; 0F 52 /r             [CYRIX,MMX]
4753
4754 \c{PMAGW}, specific to the Cyrix MMX extensions, treats both its
4755 operands as vectors of four signed words. It compares the absolute
4756 values of the words in corresponding positions, and sets each word
4757 of the destination (first) operand to whichever of the two words in
4758 that position had the larger absolute value.
4759
4760
4761 \S{insPMAXSW} \i\c{PMAXSW}: Packed Signed Integer Word Maximum
4762
4763 \c PMAXSW mm1,mm2/m64            ; 0F EE /r        [KATMAI,MMX]
4764 \c PMAXSW xmm1,xmm2/m128         ; 66 0F EE /r     [WILLAMETTE,SSE2]
4765
4766 \c{PMAXSW} compares each pair of words in the two source operands, and
4767 for each pair it stores the maximum value in the destination register.
4768
4769
4770 \S{insPMAXUB} \i\c{PMAXUB}: Packed Unsigned Integer Byte Maximum
4771
4772 \c PMAXUB mm1,mm2/m64            ; 0F DE /r        [KATMAI,MMX]
4773 \c PMAXUB xmm1,xmm2/m128         ; 66 0F DE /r     [WILLAMETTE,SSE2]
4774
4775 \c{PMAXUB} compares each pair of bytes in the two source operands, and
4776 for each pair it stores the maximum value in the destination register.
4777
4778
4779 \S{insPMINSW} \i\c{PMINSW}: Packed Signed Integer Word Minimum
4780
4781 \c PMINSW mm1,mm2/m64            ; 0F EA /r        [KATMAI,MMX]
4782 \c PMINSW xmm1,xmm2/m128         ; 66 0F EA /r     [WILLAMETTE,SSE2]
4783
4784 \c{PMINSW} compares each pair of words in the two source operands, and
4785 for each pair it stores the minimum value in the destination register.
4786
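Used together with \c{PMAXSW} (\k{insPMAXSW}), this gives a simple
way to clamp signed words to a range. The sketch below assumes two
hypothetical 8-byte data items, \c{lo_bound} and \c{hi_bound}, each
holding four signed words, with every lower bound no greater than
the matching upper bound:

\c          pmaxsw  mm0,[lo_bound]  ; raise anything below the lower bound
\c          pminsw  mm0,[hi_bound]  ; lower anything above the upper bound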
4787
4788 \S{insPMINUB} \i\c{PMINUB}: Packed Unsigned Integer Byte Minimum
4789
4790 \c PMINUB mm1,mm2/m64            ; 0F DA /r        [KATMAI,MMX]
4791 \c PMINUB xmm1,xmm2/m128         ; 66 0F DA /r     [WILLAMETTE,SSE2]
4792
4793 \c{PMINUB} compares each pair of bytes in the two source operands, and
4794 for each pair it stores the minimum value in the destination register.
4795
4796
4797 \S{insPMOVMSKB} \i\c{PMOVMSKB}: Move Byte Mask To Integer
4798
4799 \c PMOVMSKB reg32,mm             ; 0F D7 /r        [KATMAI,MMX]
4800 \c PMOVMSKB reg32,xmm            ; 66 0F D7 /r     [WILLAMETTE,SSE2]
4801
4802 \c{PMOVMSKB} returns an 8-bit or 16-bit mask formed of the most
4803 significant bits of each byte of the source operand (8 bits for an
4804 \c{MMX} register, 16 bits for an \c{XMM} register).
4805
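A frequent pairing is with a packed compare, so that the result of
the comparison can be tested in a general purpose register. The
following sketch assumes \c{BITS 32} code and a hypothetical label
\c{no_match}:

\c          pcmpeqb  mm0,mm1        ; 0xFF in each byte where mm0 = mm1
\c          pmovmskb eax,mm0        ; bit n of EAX set if byte n matched
\c          test     eax,eax
\c          jz       no_match       ; taken if no bytes were equal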
4806
4807 \S{insPMULHRW} \i\c{PMULHRWC}, \i\c{PMULHRIW}: Multiply Packed 16-bit Integers
4808 With Rounding, and Store High Word
4809
4810 \c PMULHRWC mm1,mm2/m64         ; 0F 59 /r              [CYRIX,MMX]
4811 \c PMULHRIW mm1,mm2/m64         ; 0F 5D /r              [CYRIX,MMX]
4812
4813 These instructions take two packed 16-bit integer inputs, multiply the
4814 values in the inputs, round on bit 15 of each result, then store bits
4815 15-30 of each result to the corresponding position of the destination
4816 register.
4817
4818 \b For \c{PMULHRWC}, the destination is the first source operand.
4819
4820 \b For \c{PMULHRIW}, the destination is an implied register (worked out
4821 as described for \c{PADDSIW} (\k{insPADDSIW})).
4822
4823 The operation of these instructions is:
4824
4825 \c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00004000)[15-30]
4826 \c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00004000)[15-30]
4827 \c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00004000)[15-30]
4828 \c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00004000)[15-30]
4829
4830 See also \c{PMULHRWA} (\k{insPMULHRWA}) for a 3DNow! version of this
4831 instruction.
4832
4833
4834 \S{insPMULHRWA} \i\c{PMULHRWA}: Multiply Packed 16-bit Integers
4835 With Rounding, and Store High Word
4836
4837 \c PMULHRWA mm1,mm2/m64          ; 0F 0F /r B7     [PENT,3DNOW]
4838
4839 \c{PMULHRWA} takes two packed 16-bit integer inputs, multiplies
4840 the values in the inputs, rounds on bit 16 of each result, then
4841 stores bits 16-31 of each result to the corresponding position
4842 of the destination register.
4843
4844 The operation of this instruction is:
4845
4846 \c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00008000)[16-31];
4847 \c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00008000)[16-31];
4848 \c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00008000)[16-31];
4849 \c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00008000)[16-31].
4850
4851 See also \c{PMULHRWC} (\k{insPMULHRW}) for a Cyrix version of this
4852 instruction.
4853
4854
4855 \S{insPMULHUW} \i\c{PMULHUW}: Multiply Packed 16-bit Integers,
4856 and Store High Word
4857
4858 \c PMULHUW mm1,mm2/m64           ; 0F E4 /r        [KATMAI,MMX]
4859 \c PMULHUW xmm1,xmm2/m128        ; 66 0F E4 /r     [WILLAMETTE,SSE2]
4860
4861 \c{PMULHUW} takes two packed unsigned 16-bit integer inputs, multiplies
4862 the values in the inputs, then stores bits 16-31 of each result to the
4863 corresponding position of the destination register.
4864
4865
4866 \S{insPMULHW} \i\c{PMULHW}, \i\c{PMULLW}: Multiply Packed 16-bit Integers,
4867 and Store
4868
4869 \c PMULHW mm1,mm2/m64            ; 0F E5 /r             [PENT,MMX]
4870 \c PMULLW mm1,mm2/m64            ; 0F D5 /r             [PENT,MMX]
4871
4872 \c PMULHW xmm1,xmm2/m128         ; 66 0F E5 /r     [WILLAMETTE,SSE2]
4873 \c PMULLW xmm1,xmm2/m128         ; 66 0F D5 /r     [WILLAMETTE,SSE2]
4874
4875 \c{PMULxW} takes two packed signed 16-bit integer inputs, and
4876 multiplies the values in the inputs, forming doubleword results.
4877
4878 \b \c{PMULHW} then stores the top 16 bits of each doubleword in the
4879 destination (first) operand;
4880
4881 \b \c{PMULLW} stores the bottom 16 bits of each doubleword in the
4882 destination operand.
4883
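The two halves can be recombined to obtain the full doubleword
products. The following sketch assumes the inputs are in \c{MM0} and
\c{MM1}, uses \c{MM2} and \c{MM3} as scratch registers (all chosen
for illustration), and relies on the MMX unpack instructions
\c{PUNPCKLWD} and \c{PUNPCKHWD} to interleave the words:

\c          movq      mm2,mm0       ; copy the first input
\c          pmullw    mm0,mm1       ; low 16 bits of each product
\c          pmulhw    mm2,mm1       ; high 16 bits of each signed product
\c          movq      mm3,mm0       ; copy the low halves
\c          punpcklwd mm0,mm2       ; mm0 = products of words 0 and 1
\c          punpckhwd mm3,mm2       ; mm3 = products of words 2 and 3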
4884
4885 \S{insPMULUDQ} \i\c{PMULUDQ}: Multiply Packed Unsigned
4886 32-bit Integers, and Store
4887
4888 \c PMULUDQ mm1,mm2/m64           ; 0F F4 /r        [WILLAMETTE,SSE2]
4889 \c PMULUDQ xmm1,xmm2/m128        ; 66 0F F4 /r     [WILLAMETTE,SSE2]
4890
4891 \c{PMULUDQ} takes two packed unsigned 32-bit integer inputs, and
4892 multiplies the values in the inputs, forming quadword results. The
4893 source is either an unsigned doubleword in the low doubleword of a
4894 64-bit operand, or two unsigned doublewords in the first and
4895 third doublewords of a 128-bit operand. This produces either one or
4896 two 64-bit results, which are stored in the respective quadword
4897 locations of the destination register.
4898
4899 The operation is:
4900
4901 \c    dst[0-63]   := dst[0-31]  * src[0-31];
4902 \c    dst[64-127] := dst[64-95] * src[64-95].
4903
4904
4905 \S{insPMVccZB} \i\c{PMVccZB}: MMX Packed Conditional Move
4906
4907 \c PMVZB mmxreg,mem64            ; 0F 58 /r             [CYRIX,MMX]
4908 \c PMVNZB mmxreg,mem64           ; 0F 5A /r             [CYRIX,MMX]
4909 \c PMVLZB mmxreg,mem64           ; 0F 5B /r             [CYRIX,MMX]
4910 \c PMVGEZB mmxreg,mem64          ; 0F 5C /r             [CYRIX,MMX]
4911
4912 These instructions, specific to the Cyrix MMX extensions, perform
4913 parallel conditional moves. The two input operands are treated as
4914 vectors of eight bytes. Each byte of the destination (first) operand
4915 is either written from the corresponding byte of the source (second)
4916 operand, or left alone, depending on the value of the byte in the
4917 \e{implied} operand (specified in the same way as \c{PADDSIW}, in
4918 \k{insPADDSIW}).
4919
4920 \b \c{PMVZB} performs each move if the corresponding byte in the
4921 implied operand is zero;
4922
4923 \b \c{PMVNZB} moves if the byte is non-zero;
4924
4925 \b \c{PMVLZB} moves if the byte is less than zero;
4926
4927 \b \c{PMVGEZB} moves if the byte is greater than or equal to zero.
4928
4929 Note that these instructions cannot take a register as their second
4930 source operand.
4931
4932
4933 \S{insPOP} \i\c{POP}: Pop Data from Stack
4934
4935 \c POP reg16                     ; o16 58+r             [8086]
4936 \c POP reg32                     ; o32 58+r             [386]
4937
4938 \c POP r/m16                     ; o16 8F /0            [8086]
4939 \c POP r/m32                     ; o32 8F /0            [386]
4940
4941 \c POP CS                        ; 0F                   [8086,UNDOC]
4942 \c POP DS                        ; 1F                   [8086]
4943 \c POP ES                        ; 07                   [8086]
4944 \c POP SS                        ; 17                   [8086]
4945 \c POP FS                        ; 0F A1                [386]
4946 \c POP GS                        ; 0F A9                [386]
4947
4948 \c{POP} loads a value from the stack (from \c{[SS:SP]} or
4949 \c{[SS:ESP]}) and then increments the stack pointer.
4950
4951 The address-size attribute of the instruction determines whether
4952 \c{SP} or \c{ESP} is used as the stack pointer: to deliberately
4953 override the default given by the \c{BITS} setting, you can use an
4954 \i\c{a16} or \i\c{a32} prefix.
4955
4956 The operand-size attribute of the instruction determines whether the
4957 stack pointer is incremented by 2 or 4: this means that segment
4958 register pops in \c{BITS 32} mode will pop 4 bytes off the stack and
4959 discard the upper two of them. If you need to override that, you can
4960 use an \i\c{o16} or \i\c{o32} prefix.
4961
4962 The above opcode listings give two forms for general-purpose
4963 register pop instructions: for example, \c{POP BX} has the two forms
4964 \c{5B} and \c{8F C3}. NASM will always generate the shorter form
4965 when given \c{POP BX}. NDISASM will disassemble both.
4966
4967 \c{POP CS} is not a documented instruction, and is not supported on
4968 any processor above the 8086 (since they use \c{0Fh} as an opcode
4969 prefix for instruction set extensions). However, at least some 8086
4970 processors do support it, and so NASM generates it for completeness.
4971
4972
4973 \S{insPOPA} \i\c{POPAx}: Pop All General-Purpose Registers
4974
4975 \c POPA                          ; 61                   [186]
4976 \c POPAW                         ; o16 61               [186]
4977 \c POPAD                         ; o32 61               [386]
4978
4979 \b \c{POPAW} pops a word from the stack into each of, successively,
4980 \c{DI}, \c{SI}, \c{BP}, nothing (it discards a word from the stack
4981 which was a placeholder for \c{SP}), \c{BX}, \c{DX}, \c{CX} and
4982 \c{AX}. It is intended to reverse the operation of \c{PUSHAW} (see
4983 \k{insPUSHA}), but it ignores the value for \c{SP} that was pushed
4984 on the stack by \c{PUSHAW}.
4985
4986 \b \c{POPAD} pops twice as much data, and places the results in
4987 \c{EDI}, \c{ESI}, \c{EBP}, nothing (placeholder for \c{ESP}),
4988 \c{EBX}, \c{EDX}, \c{ECX} and \c{EAX}. It reverses the operation of
4989 \c{PUSHAD}.
4990
4991 \c{POPA} is an alias mnemonic for either \c{POPAW} or \c{POPAD},
4992 depending on the current \c{BITS} setting.
4993
4994 Note that the registers are popped in reverse order of their numeric
4995 values in opcodes (see \k{iref-rv}).
4996
4997
4998 \S{insPOPF} \i\c{POPFx}: Pop Flags Register
4999
5000 \c POPF                          ; 9D                   [8086]
5001 \c POPFW                         ; o16 9D               [8086]
5002 \c POPFD                         ; o32 9D               [386]
5003
5004 \b \c{POPFW} pops a word from the stack and stores it in the bottom 16
5005 bits of the flags register (or the whole flags register, on
5006 processors below a 386).
5007
5008 \b \c{POPFD} pops a doubleword and stores it in the entire flags register.
5009
5010 \c{POPF} is an alias mnemonic for either \c{POPFW} or \c{POPFD},
5011 depending on the current \c{BITS} setting.
5012
5013 See also \c{PUSHF} (\k{insPUSHF}).
5014
5015
5016 \S{insPOR} \i\c{POR}: MMX Bitwise OR
5017
5018 \c POR mm1,mm2/m64               ; 0F EB /r             [PENT,MMX]
5019 \c POR xmm1,xmm2/m128            ; 66 0F EB /r     [WILLAMETTE,SSE2]
5020
5021 \c{POR} performs a bitwise OR operation between its two operands
5022 (i.e. each bit of the result is 1 if and only if at least one of the
5023 corresponding bits of the two inputs was 1), and stores the result
5024 in the destination (first) operand.
5025
5026
5027 \S{insPREFETCH} \i\c{PREFETCH}: Prefetch Data Into Caches
5028
5029 \c PREFETCH mem8                 ; 0F 0D /0             [PENT,3DNOW]
5030 \c PREFETCHW mem8                ; 0F 0D /1             [PENT,3DNOW]
5031
5032 \c{PREFETCH} and \c{PREFETCHW} fetch the line of data from memory that
5033 contains the specified byte. \c{PREFETCHW} behaves differently on the
5034 Athlon than on earlier processors.
5035
5036 For more details, see the 3DNow! Technology Manual.
5037
5038
5039 \S{insPREFETCHh} \i\c{PREFETCHh}: Prefetch Data Into Caches
5040 \I\c{PREFETCHNTA} \I\c{PREFETCHT0} \I\c{PREFETCHT1} \I\c{PREFETCHT2}
5041
5042 \c PREFETCHNTA m8                ; 0F 18 /0        [KATMAI]
5043 \c PREFETCHT0 m8                 ; 0F 18 /1        [KATMAI]
5044 \c PREFETCHT1 m8                 ; 0F 18 /2        [KATMAI]
5045 \c PREFETCHT2 m8                 ; 0F 18 /3        [KATMAI]
5046
5047 The \c{PREFETCHh} instructions fetch the line of data from memory
5048 that contains the specified byte. The line is placed in the cache
5049 according to the rules selected by the locality hint \c{h}.
5050
5051 The hints are:
5052
5053 \b \c{T0} (temporal data) - prefetch data into all levels of the
5054 cache hierarchy.
5055
5056 \b \c{T1} (temporal data with respect to first level cache) -
5057 prefetch data into level 2 cache and higher.
5058
5059 \b \c{T2} (temporal data with respect to second level cache) -
5060 prefetch data into level 2 cache and higher.
5061
5062 \b \c{NTA} (non-temporal data with respect to all cache levels) -
5063 prefetch data into non-temporal cache structure and into a
5064 location close to the processor, minimizing cache pollution.
5065
5066 Note that this group of instructions doesn't provide a guarantee
5067 that the data will be in the cache when it is needed. For more
5068 details, see the Intel IA32 Software Developer Manual, Volume 2.
5069
5070
5071 \S{insPSADBW} \i\c{PSADBW}: Packed Sum of Absolute Differences
5072
5073 \c PSADBW mm1,mm2/m64            ; 0F F6 /r        [KATMAI,MMX]
5074 \c PSADBW xmm1,xmm2/m128         ; 66 0F F6 /r     [WILLAMETTE,SSE2]
5075
5076 \c{PSADBW} computes the absolute differences of the packed unsigned
5077 bytes in its two operands. The differences are summed to give a word
5078 result in the low 16 bits of the destination, and the rest of the
5079 register is cleared; in the \c{XMM} form this is done separately for each
5080 64-bit half. The destination operand is an \c{MMX} or an \c{XMM} register;
5081 the source operand can be either a register or a memory operand.
5082
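As an informal model of the \c{MMX} form, the following Python sketch treats each operand as a 64-bit integer. The function name \c{psadbw_mmx} is invented for this illustration and is not part of NASM.

```python
def psadbw_mmx(dst, src):
    """Model PSADBW on two 64-bit MMX values held as Python integers."""
    total = 0
    for i in range(8):
        a = (dst >> (8 * i)) & 0xFF   # unsigned byte i of the destination
        b = (src >> (8 * i)) & 0xFF   # unsigned byte i of the source
        total += abs(a - b)
    # The sum is at most 8 * 255 = 2040, so it always fits in the low word;
    # every bit above the low word of the result is zero.
    return total


print(psadbw_mmx(0x0102030405060708, 0x0807060504030201))  # prints 32
```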
5083
5084 \S{insPSHUFD} \i\c{PSHUFD}: Shuffle Packed Doublewords
5085
5086 \c PSHUFD xmm1,xmm2/m128,imm8    ; 66 0F 70 /r ib  [WILLAMETTE,SSE2]
5087
5088 \c{PSHUFD} shuffles the doublewords in the source (second) operand
5089 according to the encoding specified by imm8, and stores the result
5090 in the destination (first) operand.
5091
5092 Bits 0 and 1 of imm8 encode the source position of the doubleword to
5093 be copied to position 0 in the destination operand. Bits 2 and 3
5094 encode for position 1, bits 4 and 5 encode for position 2, and bits
5095 6 and 7 encode for position 3. For example, an encoding of binary 10
5096 in bits 0 and 1 of imm8 indicates that the doubleword at bits 64-95 of
5097 the source operand will be copied to bits 0-31 of the destination.
5098
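The selector decoding can be sketched in Python as below, assuming the 128-bit operand is held as a single integer. The function name \c{pshufd} and the sample value are invented for this illustration only.

```python
def pshufd(src, imm8):
    """Model PSHUFD: src is a 128-bit integer, imm8 holds four 2-bit selectors."""
    result = 0
    for pos in range(4):
        sel = (imm8 >> (2 * pos)) & 0b11            # selector for destination position pos
        dword = (src >> (32 * sel)) & 0xFFFFFFFF    # chosen source doubleword
        result |= dword << (32 * pos)
    return result


src = 0xDDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
print(hex(pshufd(src, 0x1B)))   # 0x1B is 00 01 10 11 in binary: reverses the order
```

An immediate of \c{0xE4} (binary 11 10 01 00) selects each doubleword for its own position, and so copies the source unchanged.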
5099
5100 \S{insPSHUFHW} \i\c{PSHUFHW}: Shuffle Packed High Words
5101
5102 \c PSHUFHW xmm1,xmm2/m128,imm8   ; F3 0F 70 /r ib  [WILLAMETTE,SSE2]
5103
5104 \c{PSHUFHW} shuffles the words in the high quadword of the source
5105 (second) operand according to the encoding specified by imm8, and
5106 stores the result in the high quadword of the destination (first)
5107 operand.
5108
5109 The operation of this instruction is similar to the \c{PSHUFW}
5110 instruction, except that the source and destination are the top
5111 quadword of a 128-bit operand, instead of being 64-bit operands.
5112 The low quadword is copied from the source to the destination
5113 without any changes.
5114
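A minimal Python sketch of this behaviour, assuming the 128-bit operand is held as one integer, might look as follows. The function name \c{pshufhw} is hypothetical; the point of interest is that the low quadword bypasses the shuffle entirely.

```python
def pshufhw(src, imm8):
    """Model PSHUFHW on a 128-bit integer: only the high quadword is shuffled."""
    low = src & 0xFFFFFFFFFFFFFFFF            # low quadword is copied unchanged
    high = (src >> 64) & 0xFFFFFFFFFFFFFFFF
    shuffled = 0
    for pos in range(4):
        sel = (imm8 >> (2 * pos)) & 0b11      # selects one of the four high words
        word = (high >> (16 * sel)) & 0xFFFF
        shuffled |= word << (16 * pos)
    return (shuffled << 64) | low


src = 0x7777_6666_5555_4444_3333_2222_1111_0000
print(hex(pshufhw(src, 0x1B)))   # high four words reversed, low four untouched
```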
5115
5116 \S{insPSHUFLW} \i\c{PSHUFLW}: Shuffle Packed Low Words
5117
5118 \c PSHUFLW xmm1,xmm2/m128,imm8   ; F2 0F 70 /r ib  [WILLAMETTE,SSE2]
5119
5120 \c{PSHUFLW} shuffles the words in the low quadword of the source
5121 (second) operand according to the encoding specified by imm8, and
5122 stores the result in the low quadword of the destination (first)
5123 operand.
5124
5125 The operation of this instruction is similar to the \c{PSHUFW}
5126 instruction, except that the source and destination are the low
5127 quadword of a 128-bit operand, instead of being 64-bit operands.
5128 The high quadword is copied from the source to the destination
5129 without any changes.
5130
5131
5132 \S{insPSHUFW} \i\c{PSHUFW}: Shuffle Packed Words
5133
5134 \c PSHUFW mm1,mm2/m64,imm8       ; 0F 70 /r ib     [KATMAI,MMX]
5135
5136 \c{PSHUFW} shuffles the words in the source (second) operand
5137 according to the encoding specified by imm8, and stores the result
5138 in the destination (first) operand.
5139
5140 Bits 0 and 1 of imm8 encode the source position of the word to be
5141 copied to position 0 in the destination operand. Bits 2 and 3 encode
5142 for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7
5143 encode for position 3. For example, an encoding of binary 10 in bits 0 and 1
5144 of imm8 indicates that the word at bits 32-47 of the source operand
5145 will be copied to bits 0-15 of the destination.
5146
5147
5148 \S{insPSLLD} \i\c{PSLLx}: Packed Data Bit Shift Left Logical
5149
5150 \c PSLLW mm1,mm2/m64             ; 0F F1 /r             [PENT,MMX]
5151 \c PSLLW mm,imm8                 ; 0F 71 /6 ib          [PENT,MMX]
5152
5153 \c PSLLW xmm1,xmm2/m128          ; 66 0F F1 /r     [WILLAMETTE,SSE2]
5154 \c PSLLW xmm,imm8                ; 66 0F 71 /6 ib  [WILLAMETTE,SSE2]
5155
5156 \c PSLLD mm1,mm2/m64             ; 0F F2 /r             [PENT,MMX]
5157 \c PSLLD mm,imm8                 ; 0F 72 /6 ib          [PENT,MMX]
5158
5159 \c PSLLD xmm1,xmm2/m128          ; 66 0F F2 /r     [WILLAMETTE,SSE2]
5160 \c PSLLD xmm,imm8                ; 66 0F 72 /6 ib  [WILLAMETTE,SSE2]
5161
5162 \c PSLLQ mm1,mm2/m64             ; 0F F3 /r             [PENT,MMX]
5163 \c PSLLQ mm,imm8                 ; 0F 73 /6 ib          [PENT,MMX]
5164
5165 \c PSLLQ xmm1,xmm2/m128          ; 66 0F F3 /r     [WILLAMETTE,SSE2]
5166 \c PSLLQ xmm,imm8                ; 66 0F 73 /6 ib  [WILLAMETTE,SSE2]
5167
5168 \c PSLLDQ xmm1,imm8              ; 66 0F 73 /7 ib  [WILLAMETTE,SSE2]
5169
5170 \c{PSLLx} performs logical left shifts of the data elements in the
5171 destination (first) operand, moving each bit in the separate elements
5172 left by the number of bits specified in the source (second) operand,
5173 clearing the low-order bits as they are vacated. \c{PSLLDQ}
5174 shifts bytes, not bits.
5175
5176 \b \c{PSLLW} shifts word sized elements.
5177
5178 \b \c{PSLLD} shifts doubleword sized elements.
5179
5180 \b \c{PSLLQ} shifts quadword sized elements.
5181
5182 \b \c{PSLLDQ} shifts double quadword sized elements.
5183
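The element-wise behaviour can be sketched in Python as follows. The helper names \c{psll} and \c{pslldq} are invented for this illustration. Returning zero for counts at or beyond the element width (or 16 bytes for \c{PSLLDQ}) follows Intel's documented behaviour, which this entry does not spell out.

```python
def psll(value, count, elem_bits, total_bits=64):
    """Model PSLLW/PSLLD/PSLLQ: shift each element left, filling with zeros."""
    mask = (1 << elem_bits) - 1
    result = 0
    for i in range(total_bits // elem_bits):
        elem = (value >> (i * elem_bits)) & mask
        # Bits shifted out of an element are lost, not carried into its neighbour.
        shifted = (elem << count) & mask if count < elem_bits else 0
        result |= shifted << (i * elem_bits)
    return result


def pslldq(value, count):
    """Model PSLLDQ: shift the whole 128-bit value left by count bytes."""
    return (value << (8 * count)) & ((1 << 128) - 1) if count < 16 else 0


print(hex(psll(0x8001000200030004, 1, 16)))   # PSLLW by 1: prints 0x2000400060008
```

Pass \c{total_bits=128} to model the \c{XMM} forms.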
5184
5185 \S{insPSRAD} \i\c{PSRAx}: Packed Data Bit Shift Right Arithmetic
5186
5187 \c PSRAW mm1,mm2/m64             ; 0F E1 /r             [PENT,MMX]
5188 \c PSRAW mm,imm8                 ; 0F 71 /4 ib          [PENT,MMX]
5189
5190 \c PSRAW xmm1,xmm2/m128          ; 66 0F E1 /r     [WILLAMETTE,SSE2]
5191 \c PSRAW xmm,imm8                ; 66 0F 71 /4 ib  [WILLAMETTE,SSE2]
5192
5193 \c PSRAD mm1,mm2/m64             ; 0F E2 /r             [PENT,MMX]
5194 \c PSRAD mm,imm8                 ; 0F 72 /4 ib          [PENT,MMX]
5195
5196 \c PSRAD xmm1,xmm2/m128          ; 66 0F E2 /r     [WILLAMETTE,SSE2]
5197 \c PSRAD xmm,imm8                ; 66 0F 72 /4 ib  [WILLAMETTE,SSE2]
5198
5199 \c{PSRAx} performs arithmetic right shifts of the data elements in the
5200 destination (first) operand, moving each bit in the separate elements
5201 right by the number of bits specified in the source (second) operand,
5202 setting the high-order bits to the value of the original sign bit.
5203
5204 \b \c{PSRAW} shifts word sized elements.
5205
5206 \b \c{PSRAD} shifts doubleword sized elements.
5207
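To show how the sign bit is propagated, here is a small Python sketch. The function name \c{psra} is hypothetical, and clamping large counts so that every bit becomes a copy of the sign bit is an assumption taken from Intel's description rather than from the text above.

```python
def psra(value, count, elem_bits, total_bits=64):
    """Model PSRAW/PSRAD: arithmetic right shift of each element."""
    mask = (1 << elem_bits) - 1
    count = min(count, elem_bits - 1)    # larger counts just replicate the sign bit
    result = 0
    for i in range(total_bits // elem_bits):
        elem = (value >> (i * elem_bits)) & mask
        if elem >> (elem_bits - 1):      # sign bit set: reinterpret as negative
            elem -= 1 << elem_bits
        # Python's >> on a negative int is already an arithmetic shift.
        result |= ((elem >> count) & mask) << (i * elem_bits)
    return result


print(hex(psra(0x80004000FFFE0002, 1, 16)))   # PSRAW by 1: prints 0xc0002000ffff0001
```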
5208
5209 \S{insPSRLD} \i\c{PSRLx}: Packed Data Bit Shift Right Logical
5210
5211 \c PSRLW mm1,mm2/m64             ; 0F D1 /r             [PENT,MMX]
5212 \c PSRLW mm,imm8                 ; 0F 71 /2 ib          [PENT,MMX]
5213
5214 \c PSRLW xmm1,xmm2/m128          ; 66 0F D1 /r     [WILLAMETTE,SSE2]
5215 \c PSRLW xmm,imm8                ; 66 0F 71 /2 ib  [WILLAMETTE,SSE2]
5216
5217 \c PSRLD mm1,mm2/m64             ; 0F D2 /r             [PENT,MMX]
5218 \c PSRLD mm,imm8                 ; 0F 72 /2 ib          [PENT,MMX]
5219
5220 \c PSRLD xmm1,xmm2/m128          ; 66 0F D2 /r     [WILLAMETTE,SSE2]
5221 \c PSRLD xmm,imm8                ; 66 0F 72 /2 ib  [WILLAMETTE,SSE2]
5222
5223 \c PSRLQ mm1,mm2/m64             ; 0F D3 /r             [PENT,MMX]
5224 \c PSRLQ mm,imm8                 ; 0F 73 /2 ib          [PENT,MMX]
5225
5226 \c PSRLQ xmm1,xmm2/m128          ; 66 0F D3 /r     [WILLAMETTE,SSE2]
5227 \c PSRLQ xmm,imm8                ; 66 0F 73 /2 ib  [WILLAMETTE,SSE2]
5228
5229 \c PSRLDQ xmm1,imm8              ; 66 0F 73 /3 ib  [WILLAMETTE,SSE2]
5230
5231 \c{PSRLx} performs logical right shifts of the data elements in the
5232 destination (first) operand, moving each bit in the separate elements
5233 right by the number of bits specified in the source (second) operand,
5234 clearing the high-order bits as they are vacated. \c{PSRLDQ}
5235 shifts bytes, not bits.
5236
5237 \b \c{PSRLW} shifts word sized elements.
5238
5239 \b \c{PSRLD} shifts doubleword sized elements.
5240
5241 \b \c{PSRLQ} shifts quadword sized elements.
5242
5243 \b \c{PSRLDQ} shifts double quadword sized elements.
5244
5245
5246 \S{insPSUBB} \i\c{PSUBx}: Subtract Packed Integers
5247
5248 \c PSUBB mm1,mm2/m64             ; 0F F8 /r             [PENT,MMX]
5249 \c PSUBW mm1,mm2/m64             ; 0F F9 /r             [PENT,MMX]
5250 \c PSUBD mm1,mm2/m64             ; 0F FA /r             [PENT,MMX]
5251 \c PSUBQ mm1,mm2/m64             ; 0F FB /r        [WILLAMETTE,SSE2]
5252
5253 \c PSUBB xmm1,xmm2/m128          ; 66 0F F8 /r     [WILLAMETTE,SSE2]
5254 \c PSUBW xmm1,xmm2/m128          ; 66 0F F9 /r     [WILLAMETTE,SSE2]
5255 \c PSUBD xmm1,xmm2/m128          ; 66 0F FA /r     [WILLAMETTE,SSE2]
5256 \c PSUBQ xmm1,xmm2/m128          ; 66 0F FB /r     [WILLAMETTE,SSE2]
5257
5258 \c{PSUBx} subtracts packed integers in the source operand from those
5259 in the destination operand. It doesn't differentiate between signed
5260 and unsigned integers, and doesn't set any of the flags.
5261
5262 \b \c{PSUBB} operates on byte sized elements.
5263
5264 \b \c{PSUBW} operates on word sized elements.
5265
5266 \b \c{PSUBD} operates on doubleword sized elements.
5267
5268 \b \c{PSUBQ} operates on quadword sized elements.
5269
5270
5271 \S{insPSUBSB} \i\c{PSUBSxx}, \i\c{PSUBUSx}: Subtract Packed Integers With Saturation
5272
5273 \c PSUBSB mm1,mm2/m64            ; 0F E8 /r             [PENT,MMX]
5274 \c PSUBSW mm1,mm2/m64            ; 0F E9 /r             [PENT,MMX]
5275
5276 \c PSUBSB xmm1,xmm2/m128         ; 66 0F E8 /r     [WILLAMETTE,SSE2]
5277 \c PSUBSW xmm1,xmm2/m128         ; 66 0F E9 /r     [WILLAMETTE,SSE2]
5278
5279 \c PSUBUSB mm1,mm2/m64           ; 0F D8 /r             [PENT,MMX]
5280 \c PSUBUSW mm1,mm2/m64           ; 0F D9 /r             [PENT,MMX]
5281
5282 \c PSUBUSB xmm1,xmm2/m128        ; 66 0F D8 /r     [WILLAMETTE,SSE2]
5283 \c PSUBUSW xmm1,xmm2/m128        ; 66 0F D9 /r     [WILLAMETTE,SSE2]
5284
5285 \c{PSUBSx} and \c{PSUBUSx} subtract packed integers in the source
5286 operand from those in the destination operand, and use saturation for
5287 results that are outside the range supported by the destination operand.
5288
5289 \b \c{PSUBSB} operates on signed bytes, and uses signed saturation on the
5290 results.
5291
5292 \b \c{PSUBSW} operates on signed words, and uses signed saturation on the
5293 results.
5294
5295 \b \c{PSUBUSB} operates on unsigned bytes, and uses unsigned saturation on
5296 the results.
5297
5298 \b \c{PSUBUSW} operates on unsigned words, and uses unsigned saturation on
5299 the results.
5300
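The difference between signed and unsigned saturation can be sketched in Python as follows. The function name \c{psub_sat} and its parameters are invented for this illustration; operands are modelled as plain integers.

```python
def psub_sat(dst, src, elem_bits, signed, total_bits=64):
    """Model PSUBSB/PSUBSW (signed=True) and PSUBUSB/PSUBUSW (signed=False)."""
    mask = (1 << elem_bits) - 1
    if signed:
        lo, hi = -(1 << (elem_bits - 1)), (1 << (elem_bits - 1)) - 1
    else:
        lo, hi = 0, mask
    result = 0
    for i in range(total_bits // elem_bits):
        a = (dst >> (i * elem_bits)) & mask
        b = (src >> (i * elem_bits)) & mask
        if signed:
            # Reinterpret the raw bit patterns as two's-complement values.
            a -= (a >> (elem_bits - 1)) << elem_bits
            b -= (b >> (elem_bits - 1)) << elem_bits
        diff = max(lo, min(hi, a - b))     # clamp to the range instead of wrapping
        result |= (diff & mask) << (i * elem_bits)
    return result


print(hex(psub_sat(0x10, 0x20, 8, signed=False)))   # PSUBUSB: 10h - 20h clamps to 0x0
print(hex(psub_sat(0x80, 0x01, 8, signed=True)))    # PSUBSB: -128 - 1 clamps to 0x80
```

By contrast, the plain \c{PSUBx} instructions (\k{insPSUBB}) would wrap around in both cases.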
5301
5302 \S{insPSUBSIW} \i\c{PSUBSIW}: MMX Packed Subtract with Saturation to
5303 Implied Destination
5304
5305 \c PSUBSIW mm1,mm2/m64           ; 0F 55 /r             [CYRIX,MMX]
5306
5307 \c{PSUBSIW}, specific to the Cyrix extensions to the MMX instruction
5308 set, performs the same function as \c{PSUBSW}, except that the
5309 result is not placed in the register specified by the first operand,
5310 but instead in the implied destination register, specified as for
5311 \c{PADDSIW} (\k{insPADDSIW}).
5312
5313
5314 \S{insPSWAPD} \i\c{PSWAPD}: Swap Packed Data
5315 \I\c{PSWAPW}
5316
5317 \c PSWAPD mm1,mm2/m64            ; 0F 0F /r BB     [PENT,3DNOW]
5318
5319 \c{PSWAPD} swaps the packed doublewords in the source operand, and
5320 stores the result in the destination operand.
5321
5322 In the \c{K6-2} and \c{K6-III} processors, this opcode uses the
5323 mnemonic \c{PSWAPW}, and it swaps the order of words when copying
5324 from the source to the destination.
5325
5326 The operation in the \c{K6-2} and \c{K6-III} processors is:
5327
5328 \c    dst[0-15]  = src[48-63];
5329 \c    dst[16-31] = src[32-47];
5330 \c    dst[32-47] = src[16-31];
5331 \c    dst[48-63] = src[0-15].
5332
5333 The operation in the \c{K6-x+}, \c{ATHLON} and later processors is:
5334
5335 \c    dst[0-31]  = src[32-63];
5336 \c    dst[32-63] = src[0-31].
5337
5338
5339 \S{insPUNPCKHBW} \i\c{PUNPCKxxx}: Unpack and Interleave Data
5340
5341 \c PUNPCKHBW mm1,mm2/m64         ; 0F 68 /r             [PENT,MMX]
5342 \c PUNPCKHWD mm1,mm2/m64         ; 0F 69 /r             [PENT,MMX]
5343 \c PUNPCKHDQ mm1,mm2/m64         ; 0F 6A /r             [PENT,MMX]
5344
5345 \c PUNPCKHBW xmm1,xmm2/m128      ; 66 0F 68 /r     [WILLAMETTE,SSE2]
5346 \c PUNPCKHWD xmm1,xmm2/m128      ; 66 0F 69 /r     [WILLAMETTE,SSE2]
5347 \c PUNPCKHDQ xmm1,xmm2/m128      ; 66 0F 6A /r     [WILLAMETTE,SSE2]
5348 \c PUNPCKHQDQ xmm1,xmm2/m128     ; 66 0F 6D /r     [WILLAMETTE,SSE2]
5349
5350 \c PUNPCKLBW mm1,mm2/m32         ; 0F 60 /r             [PENT,MMX]
5351 \c PUNPCKLWD mm1,mm2/m32         ; 0F 61 /r             [PENT,MMX]
5352 \c PUNPCKLDQ mm1,mm2/m32         ; 0F 62 /r             [PENT,MMX]
5353
5354 \c PUNPCKLBW xmm1,xmm2/m128      ; 66 0F 60 /r     [WILLAMETTE,SSE2]
5355 \c PUNPCKLWD xmm1,xmm2/m128      ; 66 0F 61 /r     [WILLAMETTE,SSE2]
5356 \c PUNPCKLDQ xmm1,xmm2/m128      ; 66 0F 62 /r     [WILLAMETTE,SSE2]
5357 \c PUNPCKLQDQ xmm1,xmm2/m128     ; 66 0F 6C /r     [WILLAMETTE,SSE2]
5358
5359 The \c{PUNPCKxxx} instructions all treat their operands as vectors,
5360 and produce a new vector by interleaving elements from the two inputs.
5361 The \c{PUNPCKHxx} instructions start by throwing away the bottom half
5362 of each input operand, and the \c{PUNPCKLxx} instructions throw away
5363 the top half.
5364
5365 The remaining elements are then interleaved into the destination,
5366 alternating elements from the second (source) operand and the first
5367 (destination) operand: so the leftmost part of each element in the
5368 result always comes from the second operand, and the rightmost from
5369 the destination.
5370
5371 \b \c{PUNPCKxBW} works a byte at a time, producing word sized output
5372 elements.
5373
5374 \b \c{PUNPCKxWD} works a word at a time, producing doubleword sized
5375 output elements.
5376
5377 \b \c{PUNPCKxDQ} works a doubleword at a time, producing quadword sized
5378 output elements.
5379
5380 \b \c{PUNPCKxQDQ} works a quadword at a time, producing double quadword
5381 sized output elements.
5382
5383 So, for example, for \c{MMX} operands, if the first operand held
5384 \c{0x7A6A5A4A3A2A1A0A} and the second held \c{0x7B6B5B4B3B2B1B0B},
5385 then:
5386
5387 \b \c{PUNPCKHBW} would return \c{0x7B7A6B6A5B5A4B4A}.
5388
5389 \b \c{PUNPCKHWD} would return \c{0x7B6B7A6A5B4B5A4A}.
5390
5391 \b \c{PUNPCKHDQ} would return \c{0x7B6B5B4B7A6A5A4A}.
5392
5393 \b \c{PUNPCKLBW} would return \c{0x3B3A2B2A1B1A0B0A}.
5394
5395 \b \c{PUNPCKLWD} would return \c{0x3B2B3A2A1B0B1A0A}.
5396
5397 \b \c{PUNPCKLDQ} would return \c{0x3B2B1B0B3A2A1A0A}.
5398
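The six results above can be reproduced with a short Python model. The function name \c{punpck} and its parameters are invented for this illustration; operands are held as integers, and \c{total_bits=128} would model the \c{XMM} forms.

```python
def punpck(dst, src, elem_bits, high, total_bits=64):
    """Model PUNPCKHxx (high=True) and PUNPCKLxx (high=False) on integers."""
    mask = (1 << elem_bits) - 1
    half = total_bits // elem_bits // 2      # elements kept from each operand
    first = half if high else 0              # index of the first element kept
    result = 0
    for i in range(half):
        d = (dst >> ((first + i) * elem_bits)) & mask
        s = (src >> ((first + i) * elem_bits)) & mask
        # Destination element goes in the lower slot, source element above it.
        result |= d << (2 * i * elem_bits)
        result |= s << ((2 * i + 1) * elem_bits)
    return result


a, b = 0x7A6A5A4A3A2A1A0A, 0x7B6B5B4B3B2B1B0B
print(hex(punpck(a, b, 8, high=True)))     # PUNPCKHBW: prints 0x7b7a6b6a5b5a4b4a
print(hex(punpck(a, b, 32, high=False)))   # PUNPCKLDQ: prints 0x3b2b1b0b3a2a1a0a
```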
5399
5400 \S{insPUSH} \i\c{PUSH}: Push Data on Stack
5401
5402 \c PUSH reg16                    ; o16 50+r             [8086]
5403 \c PUSH reg32                    ; o32 50+r             [386]
5404
5405 \c PUSH r/m16                    ; o16 FF /6            [8086]
5406 \c PUSH r/m32                    ; o32 FF /6            [386]
5407
5408 \c PUSH CS                       ; 0E                   [8086]
5409 \c PUSH DS                       ; 1E                   [8086]
5410 \c PUSH ES                       ; 06                   [8086]
5411 \c PUSH SS                       ; 16                   [8086]
5412 \c PUSH FS                       ; 0F A0                [386]
5413 \c PUSH GS                       ; 0F A8                [386]
5414
5415 \c PUSH imm8                     ; 6A ib                [186]
5416 \c PUSH imm16                    ; o16 68 iw            [186]
5417 \c PUSH imm32                    ; o32 68 id            [386]
5418
5419 \c{PUSH} decrements the stack pointer (\c{SP} or \c{ESP}) by 2 or 4,
5420 and then stores the given value at \c{[SS:SP]} or \c{[SS:ESP]}.
5421
5422 The address-size attribute of the instruction determines whether
5423 \c{SP} or \c{ESP} is used as the stack pointer: to deliberately
5424 override the default given by the \c{BITS} setting, you can use an
5425 \i\c{a16} or \i\c{a32} prefix.
5426
5427 The operand-size attribute of the instruction determines whether the
5428 stack pointer is decremented by 2 or 4: this means that segment
5429 register pushes in \c{BITS 32} mode will push 4 bytes on the stack,
5430 of which the upper two are undefined. If you need to override that,
5431 you can use an \i\c{o16} or \i\c{o32} prefix.
5432
5433 The above opcode listings give two forms for general-purpose
5434 \i{register push} instructions: for example, \c{PUSH BX} has the two
5435 forms \c{53} and \c{FF F3}. NASM will always generate the shorter
5436 form when given \c{PUSH BX}. NDISASM will disassemble both.
5437
5438 Unlike the undocumented and barely supported \c{POP CS}, \c{PUSH CS}
5439 is a perfectly valid and sensible instruction, supported on all
5440 processors.
5441
5442 The instruction \c{PUSH SP} may be used to distinguish an 8086 from
5443 later processors: on an 8086, the value of \c{SP} stored is the
5444 value it has \e{after} the push instruction, whereas on later
5445 processors it is the value \e{before} the push instruction.
5446
5447
5448 \S{insPUSHA} \i\c{PUSHAx}: Push All General-Purpose Registers
5449
5450 \c PUSHA                         ; 60                   [186]
5451 \c PUSHAD                        ; o32 60               [386]
5452 \c PUSHAW                        ; o16 60               [186]
5453
5454 \c{PUSHAW} pushes, in succession, \c{AX}, \c{CX}, \c{DX}, \c{BX},
5455 \c{SP}, \c{BP}, \c{SI} and \c{DI} on the stack, decrementing the
5456 stack pointer by a total of 16.
5457
5458 \c{PUSHAD} pushes, in succession, \c{EAX}, \c{ECX}, \c{EDX},
5459 \c{EBX}, \c{ESP}, \c{EBP}, \c{ESI} and \c{EDI} on the stack,
5460 decrementing the stack pointer by a total of 32.
5461
5462 In both cases, the value of \c{SP} or \c{ESP} pushed is its
5463 \e{original} value, as it had before the instruction was executed.
5464
5465 \c{PUSHA} is an alias mnemonic for either \c{PUSHAW} or \c{PUSHAD},
5466 depending on the current \c{BITS} setting.
5467
5468 Note that the registers are pushed in order of their numeric values
5469 in opcodes (see \k{iref-rv}).
5470
5471 See also \c{POPA} (\k{insPOPA}).
5472
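The relationship between \c{PUSHAW} and \c{POPAW} can be sketched with a toy Python model, in which the stack is a list and the registers are a dictionary. All names here are invented for this illustration, and memory addressing is not modelled.

```python
PUSH_ORDER = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def pushaw(regs, stack):
    """Model PUSHAW: push eight words, storing the original SP in the SP slot."""
    original_sp = regs["SP"]
    for name in PUSH_ORDER:
        stack.append(original_sp if name == "SP" else regs[name])
        regs["SP"] = (regs["SP"] - 2) & 0xFFFF    # each word push moves SP down by 2


def popaw(regs, stack):
    """Model POPAW: pop in reverse order, discarding the stored SP image."""
    for name in reversed(PUSH_ORDER):
        value = stack.pop()
        regs["SP"] = (regs["SP"] + 2) & 0xFFFF
        if name != "SP":
            regs[name] = value


regs = {"AX": 1, "CX": 2, "DX": 3, "BX": 4, "SP": 0x100, "BP": 5, "SI": 6, "DI": 7}
stack = []
pushaw(regs, stack)
print(hex(regs["SP"]))   # prints 0xf0: SP has dropped by 16
```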
5473
5474 \S{insPUSHF} \i\c{PUSHFx}: Push Flags Register
5475
5476 \c PUSHF                         ; 9C                   [8086]
5477 \c PUSHFD                        ; o32 9C               [386]
5478 \c PUSHFW                        ; o16 9C               [8086]
5479
5480 \b \c{PUSHFW} pushes the bottom 16 bits of the flags register
5481 (or the whole flags register, on processors below a 386) onto
5482 the stack.
5483
5484 \b \c{PUSHFD} pushes the entire flags register onto the stack.
5485
5486 \c{PUSHF} is an alias mnemonic for either \c{PUSHFW} or \c{PUSHFD},
5487 depending on the current \c{BITS} setting.
5488
5489 See also \c{POPF} (\k{insPOPF}).
5490
5491
5492 \S{insPXOR} \i\c{PXOR}: MMX Bitwise XOR
5493
5494 \c PXOR mm1,mm2/m64              ; 0F EF /r             [PENT,MMX]
5495 \c PXOR xmm1,xmm2/m128           ; 66 0F EF /r     [WILLAMETTE,SSE2]
5496
5497 \c{PXOR} performs a bitwise XOR operation between its two operands
5498 (i.e. each bit of the result is 1 if and only if exactly one of the
5499 corresponding bits of the two inputs was 1), and stores the result
5500 in the destination (first) operand.
5501
5502
5503 \S{insRCL} \i\c{RCL}, \i\c{RCR}: Bitwise Rotate through Carry Bit
5504
5505 \c RCL r/m8,1                    ; D0 /2                [8086]
5506 \c RCL r/m8,CL                   ; D2 /2                [8086]
5507 \c RCL r/m8,imm8                 ; C0 /2 ib             [186]
5508 \c RCL r/m16,1                   ; o16 D1 /2            [8086]
5509 \c RCL r/m16,CL                  ; o16 D3 /2            [8086]
5510 \c RCL r/m16,imm8                ; o16 C1 /2 ib         [186]
5511 \c RCL r/m32,1                   ; o32 D1 /2            [386]
5512 \c RCL r/m32,CL                  ; o32 D3 /2            [386]
5513 \c RCL r/m32,imm8                ; o32 C1 /2 ib         [386]
5514
5515 \c RCR r/m8,1                    ; D0 /3                [8086]
5516 \c RCR r/m8,CL                   ; D2 /3                [8086]
5517 \c RCR r/m8,imm8                 ; C0 /3 ib             [186]
5518 \c RCR r/m16,1                   ; o16 D1 /3            [8086]
5519 \c RCR r/m16,CL                  ; o16 D3 /3            [8086]
5520 \c RCR r/m16,imm8                ; o16 C1 /3 ib         [186]
5521 \c RCR r/m32,1                   ; o32 D1 /3            [386]
5522 \c RCR r/m32,CL                  ; o32 D3 /3            [386]
5523 \c RCR r/m32,imm8                ; o32 C1 /3 ib         [386]
5524
5525 \c{RCL} and \c{RCR} perform a 9-bit, 17-bit or 33-bit bitwise
5526 rotation operation, involving the given source/destination (first)
5527 operand and the carry bit. Thus, for example, in the operation
5528 \c{RCL AL,1}, a 9-bit rotation is performed in which \c{AL} is
5529 shifted left by 1, the top bit of \c{AL} moves into the carry flag,
5530 and the original value of the carry flag is placed in the low bit of
5531 \c{AL}.
5532
5533 The number of bits to rotate by is given by the second operand. Only
5534 the bottom five bits of the rotation count are considered by
5535 processors above the 8086.
5536
5537 You can force the longer (186 and upwards, beginning with a \c{C1}
5538 byte) form of \c{RCL foo,1} by using a \c{BYTE} prefix: \c{RCL
5539 foo,BYTE 1}. Similarly with \c{RCR}.
5540
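The rotation through the carry flag can be modelled in Python as below. The function names \c{rcl} and \c{rcr} are invented for this illustration, and only the carry flag is modelled. Reducing the masked count modulo 9, 17 or 33 follows Intel's description of these instructions and is not stated above.

```python
def rcl(value, carry, count, bits=8):
    """Model RCL: rotate value plus the carry flag left as one (bits + 1)-bit unit."""
    width = bits + 1
    count = (count & 0x1F) % width          # 5-bit count, then wrap at 9, 17 or 33
    wide = (carry << bits) | value          # carry flag sits above the top data bit
    wide = ((wide << count) | (wide >> (width - count))) & ((1 << width) - 1)
    return wide & ((1 << bits) - 1), wide >> bits    # (new value, new carry)


def rcr(value, carry, count, bits=8):
    """Model RCR: the same rotation, in the opposite direction."""
    width = bits + 1
    count = (count & 0x1F) % width
    wide = (carry << bits) | value
    wide = ((wide >> count) | (wide << (width - count))) & ((1 << width) - 1)
    return wide & ((1 << bits) - 1), wide >> bits


print(rcl(0x80, 0, 1))   # RCL AL,1 with AL = 80h, CF = 0: prints (0, 1)
```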
5541
5542 \S{insRCPPS} \i\c{RCPPS}: Packed Single-Precision FP Reciprocal
5543
5544 \c RCPPS xmm1,xmm2/m128          ; 0F 53 /r        [KATMAI,SSE]
5545
5546 \c{RCPPS} returns an approximation of the reciprocal of the packed
5547 single-precision FP values from xmm2/m128. The maximum error for this
5548 approximation is: |Error| <= 1.5 x 2^-12
5549
5550
5551 \S{insRCPSS} \i\c{RCPSS}: Scalar Single-Precision FP Reciprocal
5552
5553 \c RCPSS xmm1,xmm2/m32           ; F3 0F 53 /r     [KATMAI,SSE]
5554
5555 \c{RCPSS} returns an approximation of the reciprocal of the lower
5556 single-precision FP value from xmm2/m32; the upper three fields are
5557 passed through from xmm1. The maximum error for this approximation is:
5558 |Error| <= 1.5 x 2^-12
5559
5560
5561 \S{insRDMSR} \i\c{RDMSR}: Read Model-Specific Registers
5562
5563 \c RDMSR                         ; 0F 32                [PENT,PRIV]
5564
5565 \c{RDMSR} reads the processor Model-Specific Register (MSR) whose
5566 index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
5567 See also \c{WRMSR} (\k{insWRMSR}).
5568
5569
5570 \S{insRDPMC} \i\c{RDPMC}: Read Performance-Monitoring Counters
5571
5572 \c RDPMC                         ; 0F 33                [P6]
5573
5574 \c{RDPMC} reads the processor performance-monitoring counter whose
5575 index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
5576
5577 This instruction is available on P6 and later processors and on MMX
5578 class processors.
5579
5580
5581 \S{insRDSHR} \i\c{RDSHR}: Read SMM Header Pointer Register
5582
5583 \c RDSHR r/m32                   ; 0F 36 /0        [386,CYRIX,SMM]
5584
5585 \c{RDSHR} reads the contents of the SMM header pointer register and
5586 saves it to the destination operand, which can be either a 32-bit
5587 memory location or a 32-bit register.
5588
5589 See also \c{WRSHR} (\k{insWRSHR}).
5590
5591
5592 \S{insRDTSC} \i\c{RDTSC}: Read Time-Stamp Counter
5593
5594 \c RDTSC                         ; 0F 31                [PENT]
5595
5596 \c{RDTSC} reads the processor's time-stamp counter into \c{EDX:EAX}.
5597
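The \c{EDX:EAX} notation means that \c{EDX} holds the high 32 bits of the 64-bit result and \c{EAX} the low 32 bits. As a minimal Python sketch (the function name is invented for this illustration), the two halves combine like this:

```python
def edx_eax_to_int(edx, eax):
    """Combine the two 32-bit halves of an EDX:EAX result into one integer."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


print(hex(edx_eax_to_int(0x1, 0x2)))   # prints 0x100000002
```

The same convention applies to the results of \c{RDMSR} (\k{insRDMSR}) and \c{RDPMC} (\k{insRDPMC}).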
5598
5599 \S{insRET} \i\c{RET}, \i\c{RETF}, \i\c{RETN}: Return from Procedure Call
5600
5601 \c RET                           ; C3                   [8086]
5602 \c RET imm16                     ; C2 iw                [8086]
5603
5604 \c RETF                          ; CB                   [8086]
5605 \c RETF imm16                    ; CA iw                [8086]
5606
5607 \c RETN                          ; C3                   [8086]
5608 \c RETN imm16                    ; C2 iw                [8086]
5609
5610 \b \c{RET}, and its exact synonym \c{RETN}, pop \c{IP} or \c{EIP} from
5611 the stack and transfer control to the new address. Optionally, if a
5612 numeric operand is provided, they increment the stack pointer
5613 by a further \c{imm16} bytes after popping the return address.
5614
5615 \b \c{RETF} executes a far return: after popping \c{IP}/\c{EIP}, it
5616 then pops \c{CS}, and \e{then} increments the stack pointer by the
5617 optional argument if present.
5618
5619
5620 \S{insROL} \i\c{ROL}, \i\c{ROR}: Bitwise Rotate
5621
5622 \c ROL r/m8,1                    ; D0 /0                [8086]
5623 \c ROL r/m8,CL                   ; D2 /0                [8086]
5624 \c ROL r/m8,imm8                 ; C0 /0 ib             [186]
5625 \c ROL r/m16,1                   ; o16 D1 /0            [8086]
5626 \c ROL r/m16,CL                  ; o16 D3 /0            [8086]
5627 \c ROL r/m16,imm8                ; o16 C1 /0 ib         [186]
5628 \c ROL r/m32,1                   ; o32 D1 /0            [386]
5629 \c ROL r/m32,CL                  ; o32 D3 /0            [386]
5630 \c ROL r/m32,imm8                ; o32 C1 /0 ib         [386]
5631
5632 \c ROR r/m8,1                    ; D0 /1                [8086]
5633 \c ROR r/m8,CL                   ; D2 /1                [8086]
5634 \c ROR r/m8,imm8                 ; C0 /1 ib             [186]
5635 \c ROR r/m16,1                   ; o16 D1 /1            [8086]
5636 \c ROR r/m16,CL                  ; o16 D3 /1            [8086]
5637 \c ROR r/m16,imm8                ; o16 C1 /1 ib         [186]
5638 \c ROR r/m32,1                   ; o32 D1 /1            [386]
5639 \c ROR r/m32,CL                  ; o32 D3 /1            [386]
5640 \c ROR r/m32,imm8                ; o32 C1 /1 ib         [386]
5641
5642 \c{ROL} and \c{ROR} perform a bitwise rotation operation on the given
5643 source/destination (first) operand. Thus, for example, in the
5644 operation \c{ROL AL,1}, an 8-bit rotation is performed in which
5645 \c{AL} is shifted left by 1 and the original top bit of \c{AL} moves
5646 round into the low bit.
5647
5648 The number of bits to rotate by is given by the second operand. Only
5649 the bottom five bits of the rotation count are considered by processors
5650 above the 8086.
5651
5652 You can force the longer (186 and upwards, beginning with a \c{C1}
5653 byte) form of \c{ROL foo,1} by using a \c{BYTE} prefix: \c{ROL
5654 foo,BYTE 1}. Similarly with \c{ROR}.
5655
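For comparison with \c{RCL} and \c{RCR} (\k{insRCL}), a plain rotation keeps all bits inside the operand. The Python sketch below uses invented function names and does not model the flags. Reducing the masked count modulo the operand size is an assumption based on Intel's description.

```python
def rol(value, count, bits=8):
    """Model ROL: bits leaving the top of the operand re-enter at the bottom."""
    count = (count & 0x1F) % bits
    mask = (1 << bits) - 1
    return ((value << count) | (value >> (bits - count))) & mask


def ror(value, count, bits=8):
    """Model ROR: bits leaving the bottom of the operand re-enter at the top."""
    count = (count & 0x1F) % bits
    mask = (1 << bits) - 1
    return ((value >> count) | (value << (bits - count))) & mask


print(hex(rol(0x81, 1)))   # ROL AL,1 with AL = 81h: prints 0x3
```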
5656
5657 \S{insRSDC} \i\c{RSDC}: Restore Segment Register and Descriptor
5658
5659 \c RSDC segreg,m80               ; 0F 79 /r        [486,CYRIX,SMM]
5660
5661 \c{RSDC} restores a segment register (DS, ES, FS, GS, or SS) from mem80,
5662 and sets up its descriptor.
5663
5664
5665 \S{insRSLDT} \i\c{RSLDT}: Restore LDTR and Descriptor
5666
5667 \c RSLDT m80                     ; 0F 7B /0        [486,CYRIX,SMM]
5668
5669 \c{RSLDT} restores the Local Descriptor Table Register (LDTR) from mem80.
5670
5671
5672 \S{insRSM} \i\c{RSM}: Resume from System-Management Mode
5673
5674 \c RSM                           ; 0F AA                [PENT]
5675
5676 \c{RSM} returns the processor to its normal operating mode when it
5677 was in System-Management Mode.
5678
5679
5680 \S{insRSQRTPS} \i\c{RSQRTPS}: Packed Single-Precision FP Square Root Reciprocal
5681
5682 \c RSQRTPS xmm1,xmm2/m128        ; 0F 52 /r        [KATMAI,SSE]
5683
5684 \c{RSQRTPS} computes the approximate reciprocals of the square
5685 roots of the packed single-precision floating-point values in the
5686 source and stores the results in xmm1. The maximum error for this
5687 approximation is: |Error| <= 1.5 x 2^-12
5688
5689
5690 \S{insRSQRTSS} \i\c{RSQRTSS}: Scalar Single-Precision FP Square Root Reciprocal
5691
5692 \c RSQRTSS xmm1,xmm2/m32         ; F3 0F 52 /r     [KATMAI,SSE]
5693
\c{RSQRTSS} returns an approximation of the reciprocal of the
square root of the lowest order single-precision FP value from
the source, and stores it in the low doubleword of the destination
register. The upper three doublewords of \c{xmm1} are preserved. The
maximum relative error of this approximation is:
|Relative Error| <= 1.5 x 2^-12
5699
5700
5701 \S{insRSTS} \i\c{RSTS}: Restore TSR and Descriptor
5702
5703 \c RSTS m80                      ; 0F 7D /0        [486,CYRIX,SMM]
5704
\c{RSTS} restores the Task State Register (TSR) from mem80.
5706
5707
5708 \S{insSAHF} \i\c{SAHF}: Store AH to Flags
5709
5710 \c SAHF                          ; 9E                   [8086]
5711
5712 \c{SAHF} sets the low byte of the flags word according to the
5713 contents of the \c{AH} register.
5714
5715 The operation of \c{SAHF} is:
5716
5717 \c  AH --> SF:ZF:0:AF:0:PF:1:CF
5718
5719 See also \c{LAHF} (\k{insLAHF}).
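
As an illustration, a common use of \c{SAHF} is to transfer the FPU
condition codes into the CPU flags after a floating-point comparison.
The following is a minimal sketch only: the variables \c{a} and \c{b}
and the label \c{.below} are hypothetical, and \c{FNSTSW AX} needs a
287 or later.

\c         fld     dword [a]       ; ST0 = a
\c         fcomp   dword [b]       ; compare ST0 with b, then pop
\c         fnstsw  ax              ; FPU status word -> AX
\c         sahf                    ; C0 -> CF, C2 -> PF, C3 -> ZF
\c         jb      .below          ; taken if a < b (or if unordered)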
5720
5721
5722 \S{insSAL} \i\c{SAL}, \i\c{SAR}: Bitwise Arithmetic Shifts
5723
5724 \c SAL r/m8,1                    ; D0 /4                [8086]
5725 \c SAL r/m8,CL                   ; D2 /4                [8086]
5726 \c SAL r/m8,imm8                 ; C0 /4 ib             [186]
5727 \c SAL r/m16,1                   ; o16 D1 /4            [8086]
5728 \c SAL r/m16,CL                  ; o16 D3 /4            [8086]
5729 \c SAL r/m16,imm8                ; o16 C1 /4 ib         [186]
5730 \c SAL r/m32,1                   ; o32 D1 /4            [386]
5731 \c SAL r/m32,CL                  ; o32 D3 /4            [386]
5732 \c SAL r/m32,imm8                ; o32 C1 /4 ib         [386]
5733
5734 \c SAR r/m8,1                    ; D0 /7                [8086]
5735 \c SAR r/m8,CL                   ; D2 /7                [8086]
5736 \c SAR r/m8,imm8                 ; C0 /7 ib             [186]
5737 \c SAR r/m16,1                   ; o16 D1 /7            [8086]
5738 \c SAR r/m16,CL                  ; o16 D3 /7            [8086]
5739 \c SAR r/m16,imm8                ; o16 C1 /7 ib         [186]
5740 \c SAR r/m32,1                   ; o32 D1 /7            [386]
5741 \c SAR r/m32,CL                  ; o32 D3 /7            [386]
5742 \c SAR r/m32,imm8                ; o32 C1 /7 ib         [386]
5743
5744 \c{SAL} and \c{SAR} perform an arithmetic shift operation on the given
5745 source/destination (first) operand. The vacated bits are filled with
5746 zero for \c{SAL}, and with copies of the original high bit of the
5747 source operand for \c{SAR}.
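
To make the difference concrete, the fragment below (a sketch, with
the register and values chosen purely for illustration) applies both
shifts to the same negative byte:

\c         mov     al,0xF0         ; AL = 11110000b (-16 as a signed byte)
\c         sar     al,2            ; AL = 11111100b (0xFC, i.e. -4)
\c
\c         mov     al,0xF0
\c         sal     al,2            ; AL = 11000000b (0xC0)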
5748
5749 \c{SAL} is a synonym for \c{SHL} (see \k{insSHL}). NASM will
5750 assemble either one to the same code, but NDISASM will always
5751 disassemble that code as \c{SHL}.
5752
5753 The number of bits to shift by is given by the second operand. Only
5754 the bottom five bits of the shift count are considered by processors
5755 above the 8086.
5756
You can force the longer (186 and upwards, beginning with a \c{C0} or
\c{C1} byte) form of \c{SAL foo,1} by using a \c{BYTE} prefix: \c{SAL
foo,BYTE 1}. The same applies to \c{SAR}.
5760
5761
5762 \S{insSALC} \i\c{SALC}: Set AL from Carry Flag
5763
5764 \c SALC                          ; D6                  [8086,UNDOC]
5765
5766 \c{SALC} is an early undocumented instruction similar in concept to
5767 \c{SETcc} (\k{insSETcc}). Its function is to set \c{AL} to zero if
5768 the carry flag is clear, or to \c{0xFF} if it is set.
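
A short sketch of this behaviour follows. Since the instruction is
undocumented, the usual caveats apply; it is also not available in
64-bit mode.

\c         stc                     ; CF = 1
\c         salc                    ; AL = 0xFF
\c         clc                     ; CF = 0
\c         salc                    ; AL = 0x00

The documented sequence \c{SBB AL,AL} produces the same value in
\c{AL}, although unlike \c{SALC} it also modifies the flags.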
5769
5770
5771 \S{insSBB} \i\c{SBB}: Subtract with Borrow
5772
5773 \c SBB r/m8,reg8                 ; 18 /r                [8086]
5774 \c SBB r/m16,reg16               ; o16 19 /r            [8086]
5775 \c SBB r/m32,reg32               ; o32 19 /r            [386]
5776
5777 \c SBB reg8,r/m8                 ; 1A /r                [8086]
5778 \c SBB reg16,r/m16               ; o16 1B /r            [8086]
5779 \c SBB reg32,r/m32               ; o32 1B /r            [386]
5780
5781 \c SBB r/m8,imm8                 ; 80 /3 ib             [8086]
5782 \c SBB r/m16,imm16               ; o16 81 /3 iw         [8086]
5783 \c SBB r/m32,imm32               ; o32 81 /3 id         [386]
5784
5785 \c SBB r/m16,imm8                ; o16 83 /3 ib         [8086]
5786 \c SBB r/m32,imm8                ; o32 83 /3 ib         [386]
5787
5788 \c SBB AL,imm8                   ; 1C ib                [8086]
5789 \c SBB AX,imm16                  ; o16 1D iw            [8086]
5790 \c SBB EAX,imm32                 ; o32 1D id            [386]
5791
5792 \c{SBB} performs integer subtraction: it subtracts its second
5793 operand, plus the value of the carry flag, from its first, and
5794 leaves the result in its destination (first) operand. The flags are
5795 set according to the result of the operation: in particular, the
5796 carry flag is affected and can be used by a subsequent \c{SBB}
5797 instruction.
5798
5799 In the forms with an 8-bit immediate second operand and a longer
5800 first operand, the second operand is considered to be signed, and is
5801 sign-extended to the length of the first operand. In these cases,
5802 the \c{BYTE} qualifier is necessary to force NASM to generate this
5803 form of the instruction.
5804
5805 To subtract one number from another without also subtracting the
5806 contents of the carry flag, use \c{SUB} (\k{insSUB}).
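
For instance, a 64-bit subtraction in 32-bit code can be built from a
\c{SUB} of the low halves followed by an \c{SBB} of the high halves.
The choice of registers in this sketch is arbitrary:

\c         ; EDX:EAX = EDX:EAX - EBX:ECX
\c         sub     eax,ecx         ; low doublewords; CF = borrow out
\c         sbb     edx,ebx         ; high doublewords, minus the borrow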
5807
5808
5809 \S{insSCASB} \i\c{SCASB}, \i\c{SCASW}, \i\c{SCASD}: Scan String
5810
5811 \c SCASB                         ; AE                   [8086]
5812 \c SCASW                         ; o16 AF               [8086]
5813 \c SCASD                         ; o32 AF               [386]
5814
5815 \c{SCASB} compares the byte in \c{AL} with the byte at \c{[ES:DI]}
5816 or \c{[ES:EDI]}, and sets the flags accordingly. It then increments
5817 or decrements (depending on the direction flag: increments if the
5818 flag is clear, decrements if it is set) \c{DI} (or \c{EDI}).
5819
5820 The register used is \c{DI} if the address size is 16 bits, and
5821 \c{EDI} if it is 32 bits. If you need to use an address size not
5822 equal to the current \c{BITS} setting, you can use an explicit
5823 \i\c{a16} or \i\c{a32} prefix.
5824
5825 Segment override prefixes have no effect for this instruction: the
5826 use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be
5827 overridden.
5828
5829 \c{SCASW} and \c{SCASD} work in the same way, but they compare a
5830 word to \c{AX} or a doubleword to \c{EAX} instead of a byte to
5831 \c{AL}, and increment or decrement the addressing registers by 2 or
5832 4 instead of 1.
5833
The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
\c{ECX} - again, the address size chooses which) times, stopping
early at the first unequal element (for \c{REPE}) or the first equal
element (for \c{REPNE}).
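
As a sketch of a typical use, the fragment below finds the length of a
zero-terminated string. It assumes 32-bit code in a flat memory model
where \c{ES} addresses the same memory as \c{DS}; the label
\c{string} is hypothetical.

\c         mov     edi,string      ; ES:EDI -> start of the string
\c         xor     al,al           ; search for a zero byte
\c         mov     ecx,-1          ; effectively no length limit
\c         cld                     ; scan towards higher addresses
\c         repne   scasb           ; stops just past the terminator
\c         not     ecx             ; ECX = bytes scanned, terminator included
\c         dec     ecx             ; ECX = length without the terminator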
5838
5839
5840 \S{insSETcc} \i\c{SETcc}: Set Register from Condition
5841
5842 \c SETcc r/m8                    ; 0F 90+cc /2          [386]
5843
5844 \c{SETcc} sets the given 8-bit operand to zero if its condition is
5845 not satisfied, and to 1 if it is.
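
For example, the following sketch turns the result of a signed
comparison into a 0 or 1 value without branching (the registers are
chosen arbitrarily):

\c         cmp     eax,ebx
\c         setl    cl              ; CL = 1 if EAX < EBX (signed), else 0
\c         movzx   ecx,cl          ; widen the result to 32 bits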
5846
5847
5848 \S{insSFENCE} \i\c{SFENCE}: Store Fence
5849
5850 \c SFENCE                 ; 0F AE /7               [KATMAI]
5851
5852 \c{SFENCE} performs a serialising operation on all writes to memory
5853 that were issued before the \c{SFENCE} instruction. This guarantees that
5854 all memory writes before the \c{SFENCE} instruction are visible before any
5855 writes after the \c{SFENCE} instruction.
5856
\c{SFENCE} is ordered with respect to other \c{SFENCE} instructions,
\c{MFENCE}, any memory write and any other serialising instruction
(such as \c{CPUID}).
5859
5860 Weakly ordered memory types can be used to achieve higher processor
5861 performance through such techniques as out-of-order issue,
5862 write-combining, and write-collapsing. The degree to which a consumer
5863 of data recognizes or knows that the data is weakly ordered varies
5864 among applications and may be unknown to the producer of this data.
The \c{SFENCE} instruction provides a performance-efficient way of
ensuring store ordering between routines that produce weakly-ordered
results and routines that consume this data.
5868
5869 \c{SFENCE} uses the following ModRM encoding:
5870
5871 \c           Mod (7:6)        = 11B
5872 \c           Reg/Opcode (5:3) = 111B
5873 \c           R/M (2:0)        = 000B
5874
5875 All other ModRM encodings are defined to be reserved, and use
5876 of these encodings risks incompatibility with future processors.
5877
5878 See also \c{LFENCE} (\k{insLFENCE}) and \c{MFENCE} (\k{insMFENCE}).
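
A producer routine might use \c{SFENCE} roughly as sketched below.
This assumes that \c{EDI} points to a 16-byte aligned buffer (as
\c{MOVNTPS} requires) and that \c{ready} is a hypothetical flag
polled by the consumer.

\c         movntps [edi],xmm0      ; weakly-ordered (non-temporal) stores
\c         movntps [edi+16],xmm1
\c         sfence                  ; make the stores above visible first
\c         mov     dword [ready],1 ; only then publish the flag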
5879
5880
5881 \S{insSGDT} \i\c{SGDT}, \i\c{SIDT}, \i\c{SLDT}: Store Descriptor Table Pointers
5882
5883 \c SGDT mem                      ; 0F 01 /0             [286,PRIV]
5884 \c SIDT mem                      ; 0F 01 /1             [286,PRIV]
5885 \c SLDT r/m16                    ; 0F 00 /0             [286,PRIV]
5886
\c{SGDT} and \c{SIDT} both take a 6-byte memory area as an operand:
they store the contents of the GDTR (global descriptor table
register) or IDTR (interrupt descriptor table register) into that
area as a 16-bit size limit followed by a 32-bit linear address (the
limit occupies the two lowest-addressed bytes). These are the only
instructions which directly use \e{linear} addresses, rather than
segment/offset pairs.
5893
5894 \c{SLDT} stores the segment selector corresponding to the LDT (local
5895 descriptor table) into the given operand.
5896
5897 See also \c{LGDT}, \c{LIDT} and \c{LLDT} (\k{insLGDT}).
5898
5899
5900 \S{insSHL} \i\c{SHL}, \i\c{SHR}: Bitwise Logical Shifts
5901
5902 \c SHL r/m8,1                    ; D0 /4                [8086]
5903 \c SHL r/m8,CL                   ; D2 /4                [8086]
5904 \c SHL r/m8,imm8                 ; C0 /4 ib             [186]
5905 \c SHL r/m16,1                   ; o16 D1 /4            [8086]
5906 \c SHL r/m16,CL                  ; o16 D3 /4            [8086]
5907 \c SHL r/m16,imm8                ; o16 C1 /4 ib         [186]
5908 \c SHL r/m32,1                   ; o32 D1 /4            [386]
5909 \c SHL r/m32,CL                  ; o32 D3 /4            [386]
5910 \c SHL r/m32,imm8                ; o32 C1 /4 ib         [386]
5911
5912 \c SHR r/m8,1                    ; D0 /5                [8086]
5913 \c SHR r/m8,CL                   ; D2 /5                [8086]
5914 \c SHR r/m8,imm8                 ; C0 /5 ib             [186]
5915 \c SHR r/m16,1                   ; o16 D1 /5            [8086]
5916 \c SHR r/m16,CL                  ; o16 D3 /5            [8086]
5917 \c SHR r/m16,imm8                ; o16 C1 /5 ib         [186]
5918 \c SHR r/m32,1                   ; o32 D1 /5            [386]
5919 \c SHR r/m32,CL                  ; o32 D3 /5            [386]
5920 \c SHR r/m32,imm8                ; o32 C1 /5 ib         [386]
5921
5922 \c{SHL} and \c{SHR} perform a logical shift operation on the given
5923 source/destination (first) operand. The vacated bits are filled with
5924 zero.
5925
5926 A synonym for \c{SHL} is \c{SAL} (see \k{insSAL}). NASM will
5927 assemble either one to the same code, but NDISASM will always
5928 disassemble that code as \c{SHL}.
5929
5930 The number of bits to shift by is given by the second operand. Only
5931 the bottom five bits of the shift count are considered by processors
5932 above the 8086.
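
The following sketch, using arbitrary values, shows both shifts and
the effect of the count masking:

\c         mov     eax,5
\c         shl     eax,3           ; EAX = 40 (5 * 8)
\c         shr     eax,2           ; EAX = 10 (40 / 4, unsigned)
\c
\c         mov     cl,33
\c         shl     eax,cl          ; 386 and later: count is 33 AND 31 = 1,
\c                                 ; so EAX = 20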
5933
You can force the longer (186 and upwards, beginning with a \c{C0} or
\c{C1} byte) form of \c{SHL foo,1} by using a \c{BYTE} prefix: \c{SHL
foo,BYTE 1}. The same applies to \c{SHR}.
5937
5938
5939 \S{insSHLD} \i\c{SHLD}, \i\c{SHRD}: Bitwise Double-Precision Shifts
5940
\c SHLD r/m16,reg16,imm8         ; o16 0F A4 /r ib      [386]
\c SHLD r/m32,reg32,imm8         ; o32 0F A4 /r ib      [386]
\c SHLD r/m16,reg16,CL           ; o16 0F A5 /r         [386]
\c SHLD r/m32,reg32,CL           ; o32 0F A5 /r         [386]
5945
5946 \c SHRD r/m16,reg16,imm8         ; o16 0F AC /r ib      [386]
5947 \c SHRD r/m32,reg32,imm8         ; o32 0F AC /r ib      [386]
5948 \c SHRD r/m16,reg16,CL           ; o16 0F AD /r         [386]
5949 \c SHRD r/m32,reg32,CL           ; o32 0F AD /r         [386]
5950
5951 \b \c{SHLD} performs a double-precision left shift. It notionally
5952 places its second operand to the right of its first, then shifts
5953 the entire bit string thus generated to the left by a number of
5954 bits specified in the third operand. It then updates only the
5955 \e{first} operand according to the result of this. The second
5956 operand is not modified.
5957
5958 \b \c{SHRD} performs the corresponding right shift: it notionally
5959 places the second operand to the \e{left} of the first, shifts the
5960 whole bit string right, and updates only the first operand.
5961
5962 For example, if \c{EAX} holds \c{0x01234567} and \c{EBX} holds
5963 \c{0x89ABCDEF}, then the instruction \c{SHLD EAX,EBX,4} would update
5964 \c{EAX} to hold \c{0x12345678}. Under the same conditions, \c{SHRD
5965 EAX,EBX,4} would update \c{EAX} to hold \c{0xF0123456}.
5966
5967 The number of bits to shift by is given by the third operand. Only
5968 the bottom five bits of the shift count are considered.
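
A typical application is shifting a value wider than one register.
The sketch below shifts the 64-bit quantity held in \c{EDX:EAX} left
by \c{CL} bits; it assumes that \c{CL} is between 0 and 31, and the
register assignment is only illustrative.

\c         shld    edx,eax,cl      ; EDX takes its new low bits from EAX
\c         shl     eax,cl          ; then shift the low half itself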
5969
5970
5971 \S{insSHUFPD} \i\c{SHUFPD}: Shuffle Packed Double-Precision FP Values
5972
5973 \c SHUFPD xmm1,xmm2/m128,imm8    ; 66 0F C6 /r ib  [WILLAMETTE,SSE2]
5974
5975 \c{SHUFPD} moves one of the packed double-precision FP values from
5976 the destination operand into the low quadword of the destination
5977 operand; the upper quadword is generated by moving one of the
5978 double-precision FP values from the source operand into the
5979 destination. The select (third) operand selects which of the values
5980 are moved to the destination register.
5981
5982 The select operand is an 8-bit immediate: bit 0 selects which value
5983 is moved from the destination operand to the result (where 0 selects
5984 the low quadword and 1 selects the high quadword) and bit 1 selects
5985 which value is moved from the source operand to the result.
5986 Bits 2 through 7 of the shuffle operand are reserved.
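
Two illustrative uses, with registers chosen arbitrarily:

\c         shufpd  xmm0,xmm0,1     ; swap the two quadwords of XMM0
\c         shufpd  xmm0,xmm1,2     ; low quadword from XMM0 (low),
\c                                 ; high quadword from XMM1 (high)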
5987
5988
5989 \S{insSHUFPS} \i\c{SHUFPS}: Shuffle Packed Single-Precision FP Values
5990
5991 \c SHUFPS xmm1,xmm2/m128,imm8    ; 0F C6 /r ib     [KATMAI,SSE]
5992
5993 \c{SHUFPS} moves two of the packed single-precision FP values from
5994 the destination operand into the low quadword of the destination
5995 operand; the upper quadword is generated by moving two of the
5996 single-precision FP values from the source operand into the
5997 destination. The select (third) operand selects which of the
5998 values are moved to the destination register.
5999
The select operand is an 8-bit immediate: bits 0 and 1 select the
value to be moved from the destination operand to the low doubleword
of the result, bits 2 and 3 select the value to be moved from the
destination operand to the second doubleword of the result, bits 4
and 5 select the value to be moved from the source operand to the
third doubleword of the result, and bits 6 and 7 select the value to
be moved from the source operand to the high doubleword of the result.
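
When both operands are the same register, \c{SHUFPS} acts as a general
permutation of its four elements. Two common cases are sketched below:

\c         shufps  xmm0,xmm0,0x00  ; copy element 0 to all four positions
\c         shufps  xmm1,xmm1,0x1B  ; 00 01 10 11b: reverse the four elements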
6007
6008
6009 \S{insSMI} \i\c{SMI}: System Management Interrupt
6010
6011 \c SMI                           ; F1                   [386,UNDOC]
6012
6013 \c{SMI} puts some AMD processors into SMM mode. It is available on some
6014 386 and 486 processors, and is only available when DR7 bit 12 is set,
6015 otherwise it generates an Int 1.
6016
6017
6018 \S{insSMINT} \i\c{SMINT}, \i\c{SMINTOLD}: Software SMM Entry (CYRIX)
6019
6020 \c SMINT                         ; 0F 38                [PENT,CYRIX]
6021 \c SMINTOLD                      ; 0F 7E                [486,CYRIX]
6022
6023 \c{SMINT} puts the processor into SMM mode. The CPU state information is
6024 saved in the SMM memory header, and then execution begins at the SMM base
6025 address.
6026
6027 \c{SMINTOLD} is the same as \c{SMINT}, but was the opcode used on the 486.
6028
This pair of opcodes is specific to the Cyrix and compatible range of
processors (Cyrix, IBM, VIA).
6031
6032
6033 \S{insSMSW} \i\c{SMSW}: Store Machine Status Word
6034
6035 \c SMSW r/m16                    ; 0F 01 /4             [286,PRIV]
6036
6037 \c{SMSW} stores the bottom half of the \c{CR0} control register (or
6038 the Machine Status Word, on 286 processors) into the destination
6039 operand. See also \c{LMSW} (\k{insLMSW}).
6040
For 32-bit code, this would store all of \c{CR0} in the specified
register (or the bottom 16 bits if the destination is a memory location),
without needing an operand size override byte.
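
One traditional use of \c{SMSW} is to test whether the processor is
running in protected mode. A minimal sketch follows; the label
\c{.protected} is hypothetical.

\c         smsw    ax              ; AX = low 16 bits of CR0
\c         test    al,1            ; bit 0 is PE (protection enable)
\c         jnz     .protected      ; set: protected mode is active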
6044
6045
6046 \S{insSQRTPD} \i\c{SQRTPD}: Packed Double-Precision FP Square Root
6047
6048 \c SQRTPD xmm1,xmm2/m128         ; 66 0F 51 /r     [WILLAMETTE,SSE2]
6049
\c{SQRTPD} calculates the square roots of the packed double-precision
FP values from the source operand, and stores the double-precision
results in the destination register.
6053
6054
6055 \S{insSQRTPS} \i\c{SQRTPS}: Packed Single-Precision FP Square Root
6056
6057 \c SQRTPS xmm1,xmm2/m128         ; 0F 51 /r        [KATMAI,SSE]
6058
\c{SQRTPS} calculates the square roots of the packed single-precision
FP values from the source operand, and stores the single-precision
results in the destination register.
6062
6063
6064 \S{insSQRTSD} \i\c{SQRTSD}: Scalar Double-Precision FP Square Root
6065
6066 \c SQRTSD xmm1,xmm2/m128         ; F2 0F 51 /r     [WILLAMETTE,SSE2]
6067
\c{SQRTSD} calculates the square root of the low-order double-precision
FP value from the source operand, and stores the double-precision
result in the destination register. The high quadword remains unchanged.
6071
6072
6073 \S{insSQRTSS} \i\c{SQRTSS}: Scalar Single-Precision FP Square Root
6074
6075 \c SQRTSS xmm1,xmm2/m128         ; F3 0F 51 /r     [KATMAI,SSE]
6076
6077 \c{SQRTSS} calculates the square root of the low-order single-precision
6078 FP value from the source operand, and stores the single-precision
6079 result in the destination register. The three high doublewords remain
6080 unchanged.
6081
6082
6083 \S{insSTC} \i\c{STC}, \i\c{STD}, \i\c{STI}: Set Flags
6084
6085 \c STC                           ; F9                   [8086]
6086 \c STD                           ; FD                   [8086]
6087 \c STI                           ; FB                   [8086]
6088
6089 These instructions set various flags. \c{STC} sets the carry flag;
6090 \c{STD} sets the direction flag; and \c{STI} sets the interrupt flag
6091 (thus enabling interrupts).
6092
6093 To clear the carry, direction, or interrupt flags, use the \c{CLC},
6094 \c{CLD} and \c{CLI} instructions (\k{insCLC}). To invert the carry
6095 flag, use \c{CMC} (\k{insCMC}).
6096
6097
6098 \S{insSTMXCSR} \i\c{STMXCSR}: Store Streaming SIMD Extension
6099  Control/Status
6100
6101 \c STMXCSR m32                   ; 0F AE /3        [KATMAI,SSE]
6102
6103 \c{STMXCSR} stores the contents of the \c{MXCSR} control/status
6104 register to the specified memory location. \c{MXCSR} is used to
6105 enable masked/unmasked exception handling, to set rounding modes,
6106 to set flush-to-zero mode, and to view exception status flags.
6107 The reserved bits in the \c{MXCSR} register are stored as 0s.
6108
6109 For details of the \c{MXCSR} register, see the Intel processor docs.
6110
6111 See also \c{LDMXCSR} (\k{insLDMXCSR}).
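
The usual pattern is read-modify-write in combination with
\c{LDMXCSR}. The sketch below enables flush-to-zero mode (bit 15 of
\c{MXCSR}) in 32-bit code, using the stack as scratch space; that
choice is an assumption made for the example.

\c         sub     esp,4               ; reserve a doubleword
\c         stmxcsr [esp]               ; read the current MXCSR
\c         or      dword [esp],0x8000  ; set the flush-to-zero bit
\c         ldmxcsr [esp]               ; write the new value back
\c         add     esp,4               ; release the scratch space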
6112
6113
6114 \S{insSTOSB} \i\c{STOSB}, \i\c{STOSW}, \i\c{STOSD}: Store Byte to String
6115
6116 \c STOSB                         ; AA                   [8086]
6117 \c STOSW                         ; o16 AB               [8086]
6118 \c STOSD                         ; o32 AB               [386]
6119
\c{STOSB} stores the byte in \c{AL} at \c{[ES:DI]} or \c{[ES:EDI]};
no flags are affected. It then increments or decrements
(depending on the direction flag: increments if the flag is clear,
decrements if it is set) \c{DI} (or \c{EDI}).
6124
6125 The register used is \c{DI} if the address size is 16 bits, and
6126 \c{EDI} if it is 32 bits. If you need to use an address size not
6127 equal to the current \c{BITS} setting, you can use an explicit
6128 \i\c{a16} or \i\c{a32} prefix.
6129
6130 Segment override prefixes have no effect for this instruction: the
6131 use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be
6132 overridden.
6133
6134 \c{STOSW} and \c{STOSD} work in the same way, but they store the
6135 word in \c{AX} or the doubleword in \c{EAX} instead of the byte in
6136 \c{AL}, and increment or decrement the addressing registers by 2 or
6137 4 instead of 1.
6138
6139 The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
6140 \c{ECX} - again, the address size chooses which) times.
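
For example, the following sketch clears a 1024-byte buffer in 32-bit
code. It assumes that \c{ES} addresses the buffer; the label
\c{buffer} is hypothetical.

\c         mov     edi,buffer      ; ES:EDI -> start of the buffer
\c         xor     eax,eax         ; value to store: 0
\c         mov     ecx,1024/4      ; number of doublewords
\c         cld                     ; store towards higher addresses
\c         rep     stosd           ; ECX stores of EAX, EDI += 4 each time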
6141
6142
6143 \S{insSTR} \i\c{STR}: Store Task Register
6144
6145 \c STR r/m16                     ; 0F 00 /1             [286,PRIV]
6146
\c{STR} stores the segment selector corresponding to the contents of
the Task Register into its operand. When the operand size is 32 bits and
the destination is a register, the upper 16 bits are cleared to 0s.
When the destination operand is a memory location, 16 bits are
written regardless of the operand size.
6152
6153
6154 \S{insSUB} \i\c{SUB}: Subtract Integers
6155
6156 \c SUB r/m8,reg8                 ; 28 /r                [8086]
6157 \c SUB r/m16,reg16               ; o16 29 /r            [8086]
6158 \c SUB r/m32,reg32               ; o32 29 /r            [386]
6159
6160 \c SUB reg8,r/m8                 ; 2A /r                [8086]
6161 \c SUB reg16,r/m16               ; o16 2B /r            [8086]
6162 \c SUB reg32,r/m32               ; o32 2B /r            [386]
6163
6164 \c SUB r/m8,imm8                 ; 80 /5 ib             [8086]
6165 \c SUB r/m16,imm16               ; o16 81 /5 iw         [8086]
6166 \c SUB r/m32,imm32               ; o32 81 /5 id         [386]
6167
6168 \c SUB r/m16,imm8                ; o16 83 /5 ib         [8086]
6169 \c SUB r/m32,imm8                ; o32 83 /5 ib         [386]
6170
6171 \c SUB AL,imm8                   ; 2C ib                [8086]
6172 \c SUB AX,imm16                  ; o16 2D iw            [8086]
6173 \c SUB EAX,imm32                 ; o32 2D id            [386]
6174
6175 \c{SUB} performs integer subtraction: it subtracts its second
6176 operand from its first, and leaves the result in its destination
6177 (first) operand. The flags are set according to the result of the
6178 operation: in particular, the carry flag is affected and can be used
6179 by a subsequent \c{SBB} instruction (\k{insSBB}).
6180
6181 In the forms with an 8-bit immediate second operand and a longer
6182 first operand, the second operand is considered to be signed, and is
6183 sign-extended to the length of the first operand. In these cases,
6184 the \c{BYTE} qualifier is necessary to force NASM to generate this
6185 form of the instruction.
6186
6187
6188 \S{insSUBPD} \i\c{SUBPD}: Packed Double-Precision FP Subtract
6189
6190 \c SUBPD xmm1,xmm2/m128          ; 66 0F 5C /r     [WILLAMETTE,SSE2]
6191
\c{SUBPD} subtracts the packed double-precision FP values of
the source operand from those of the destination operand, and
stores the result in the destination operand.
6195
6196
6197 \S{insSUBPS} \i\c{SUBPS}: Packed Single-Precision FP Subtract
6198
6199 \c SUBPS xmm1,xmm2/m128          ; 0F 5C /r        [KATMAI,SSE]
6200
\c{SUBPS} subtracts the packed single-precision FP values of
the source operand from those of the destination operand, and
stores the result in the destination operand.
6204
6205
\S{insSUBSD} \i\c{SUBSD}: Scalar Double-Precision FP Subtract
6207
6208 \c SUBSD xmm1,xmm2/m128          ; F2 0F 5C /r     [WILLAMETTE,SSE2]
6209
\c{SUBSD} subtracts the low-order double-precision FP value of
the source operand from that of the destination operand, and
stores the result in the destination operand. The high
quadword is unchanged.
6214
6215
\S{insSUBSS} \i\c{SUBSS}: Scalar Single-Precision FP Subtract
6217
6218 \c SUBSS xmm1,xmm2/m128          ; F3 0F 5C /r     [KATMAI,SSE]
6219
\c{SUBSS} subtracts the low-order single-precision FP value of
the source operand from that of the destination operand, and
stores the result in the destination operand. The three high
doublewords are unchanged.
6224
6225
6226 \S{insSVDC} \i\c{SVDC}: Save Segment Register and Descriptor
6227
6228 \c SVDC m80,segreg               ; 0F 78 /r        [486,CYRIX,SMM]
6229
6230 \c{SVDC} saves a segment register (DS, ES, FS, GS, or SS) and its
6231 descriptor to mem80.
6232
6233
6234 \S{insSVLDT} \i\c{SVLDT}: Save LDTR and Descriptor
6235
6236 \c SVLDT m80                     ; 0F 7A /0        [486,CYRIX,SMM]
6237
\c{SVLDT} saves the Local Descriptor Table Register (LDTR) to mem80.
6239
6240
6241 \S{insSVTS} \i\c{SVTS}: Save TSR and Descriptor
6242
6243 \c SVTS m80                      ; 0F 7C /0        [486,CYRIX,SMM]
6244
6245 \c{SVTS} saves the Task State Register (TSR) to mem80.
6246
6247
6248 \S{insSYSCALL} \i\c{SYSCALL}: Call Operating System
6249
6250 \c SYSCALL                       ; 0F 05                [P6,AMD]
6251
6252 \c{SYSCALL} provides a fast method of transferring control to a fixed
6253 entry point in an operating system.
6254
6255 \b The \c{EIP} register is copied into the \c{ECX} register.
6256
6257 \b Bits [31-0] of the 64-bit SYSCALL/SYSRET Target Address Register
6258 (\c{STAR}) are copied into the \c{EIP} register.
6259
6260 \b Bits [47-32] of the \c{STAR} register specify the selector that is
6261 copied into the \c{CS} register.
6262
\b Bits [47-32]+1000b of the \c{STAR} register specify the selector that
is copied into the \c{SS} register.
6265
6266 The \c{CS} and \c{SS} registers should not be modified by the operating
6267 system between the execution of the \c{SYSCALL} instruction and its
6268 corresponding \c{SYSRET} instruction.
6269
6270 For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
6271 (AMD document number 21086.pdf).
6272
6273
6274 \S{insSYSENTER} \i\c{SYSENTER}: Fast System Call
6275
6276 \c SYSENTER                      ; 0F 34                [P6]
6277
6278 \c{SYSENTER} executes a fast call to a level 0 system procedure or
6279 routine. Before using this instruction, various MSRs need to be set
6280 up:
6281
6282 \b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
6283 privilege level 0 code segment. (This value is also used to compute
6284 the segment selector of the privilege level 0 stack segment.)
6285
6286 \b \c{SYSENTER_EIP_MSR} contains the 32-bit offset into the privilege
6287 level 0 code segment to the first instruction of the selected operating
6288 procedure or routine.
6289
6290 \b \c{SYSENTER_ESP_MSR} contains the 32-bit stack pointer for the
6291 privilege level 0 stack.
6292
6293 \c{SYSENTER} performs the following sequence of operations:
6294
6295 \b Loads the segment selector from the \c{SYSENTER_CS_MSR} into the
6296 \c{CS} register.
6297
6298 \b Loads the instruction pointer from the \c{SYSENTER_EIP_MSR} into
6299 the \c{EIP} register.
6300
6301 \b Adds 8 to the value in \c{SYSENTER_CS_MSR} and loads it into the
6302 \c{SS} register.
6303
6304 \b Loads the stack pointer from the \c{SYSENTER_ESP_MSR} into the
6305 \c{ESP} register.
6306
6307 \b Switches to privilege level 0.
6308
6309 \b Clears the \c{VM} flag in the \c{EFLAGS} register, if the flag
6310 is set.
6311
6312 \b Begins executing the selected system procedure.
6313
In particular, note that this instruction does not save the values of
\c{CS} or \c{(E)IP}. If you need to return to the calling code, you
need to write your code to cater for this.
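
One possible arrangement, sketched below, is for the caller to load
the resume address and stack pointer into the registers that
\c{SYSEXIT} later consumes. This is only a convention assumed for the
example: the privilege level 0 entry code must cooperate by preserving
\c{EDX} and \c{ECX}, and the labels are hypothetical.

\c do_syscall:
\c         mov     edx,.resume     ; address at which user code continues
\c         mov     ecx,esp         ; user stack pointer to restore
\c         sysenter                ; enter the privilege level 0 routine
\c .resume:
\c         ret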
6317
6318 For more information, see the Intel Architecture Software Developer's
6319 Manual, Volume 2.
6320
6321
6322 \S{insSYSEXIT} \i\c{SYSEXIT}: Fast Return From System Call
6323
6324 \c SYSEXIT                       ; 0F 35                [P6,PRIV]
6325
6326 \c{SYSEXIT} executes a fast return to privilege level 3 user code.
6327 This instruction is a companion instruction to the \c{SYSENTER}
6328 instruction, and can only be executed by privilege level 0 code.
6329 Various registers need to be set up before calling this instruction:
6330
6331 \b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
6332 privilege level 0 code segment in which the processor is currently
6333 executing. (This value is used to compute the segment selectors for
6334 the privilege level 3 code and stack segments.)
6335
6336 \b \c{EDX} contains the 32-bit offset into the privilege level 3 code
6337 segment to the first instruction to be executed in the user code.
6338
6339 \b \c{ECX} contains the 32-bit stack pointer for the privilege level 3
6340 stack.
6341
6342 \c{SYSEXIT} performs the following sequence of operations:
6343
6344 \b Adds 16 to the value in \c{SYSENTER_CS_MSR} and loads the sum into
6345 the \c{CS} selector register.
6346
6347 \b Loads the instruction pointer from the \c{EDX} register into the
6348 \c{EIP} register.
6349
6350 \b Adds 24 to the value in \c{SYSENTER_CS_MSR} and loads the sum
6351 into the \c{SS} selector register.
6352
6353 \b Loads the stack pointer from the \c{ECX} register into the \c{ESP}
6354 register.
6355
6356 \b Switches to privilege level 3.
6357
6358 \b Begins executing the user code at the \c{EIP} address.
6359
6360 For more information on the use of the \c{SYSENTER} and \c{SYSEXIT}
6361 instructions, see the Intel Architecture Software Developer's
6362 Manual, Volume 2.
6363
6364
6365 \S{insSYSRET} \i\c{SYSRET}: Return From Operating System
6366
6367 \c SYSRET                        ; 0F 07                [P6,AMD,PRIV]
6368
6369 \c{SYSRET} is the return instruction used in conjunction with the
6370 \c{SYSCALL} instruction to provide fast entry/exit to an operating system.
6371
6372 \b The \c{ECX} register, which points to the next sequential instruction
6373 after the corresponding \c{SYSCALL} instruction, is copied into the \c{EIP}
6374 register.
6375
6376 \b Bits [63-48] of the \c{STAR} register specify the selector that is copied
6377 into the \c{CS} register.
6378
6379 \b Bits [63-48]+1000b of the \c{STAR} register specify the selector that is
6380 copied into the \c{SS} register.
6381
6382 \b Bits [1-0] of the \c{SS} register are set to 11b (RPL of 3) regardless of
6383 the value of bits [49-48] of the \c{STAR} register.
6384
6385 The \c{CS} and \c{SS} registers should not be modified by the operating
6386 system between the execution of the \c{SYSCALL} instruction and its
6387 corresponding \c{SYSRET} instruction.
6388
6389 For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
6390 (AMD document number 21086.pdf).
6391
6392
6393 \S{insTEST} \i\c{TEST}: Test Bits (notional bitwise AND)
6394
6395 \c TEST r/m8,reg8                ; 84 /r                [8086]
6396 \c TEST r/m16,reg16              ; o16 85 /r            [8086]
6397 \c TEST r/m32,reg32              ; o32 85 /r            [386]
6398
6399 \c TEST r/m8,imm8                ; F6 /0 ib             [8086]
6400 \c TEST r/m16,imm16              ; o16 F7 /0 iw         [8086]
6401 \c TEST r/m32,imm32              ; o32 F7 /0 id         [386]
6402
6403 \c TEST AL,imm8                  ; A8 ib                [8086]
6404 \c TEST AX,imm16                 ; o16 A9 iw            [8086]
6405 \c TEST EAX,imm32                ; o32 A9 id            [386]
6406
6407 \c{TEST} performs a `mental' bitwise AND of its two operands, and
6408 affects the flags as if the operation had taken place, but does not
6409 store the result of the operation anywhere.
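
Two common idioms are sketched below; the labels \c{.odd} and
\c{.zero} are hypothetical.

\c         test    al,1            ; AND AL with 1, discarding the result
\c         jnz     .odd            ; ZF clear: bit 0 of AL was set
\c
\c         test    eax,eax         ; ZF set if EAX is zero; EAX unchanged
\c         jz      .zero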
6410
6411
6412 \S{insUCOMISD} \i\c{UCOMISD}: Unordered Scalar Double-Precision FP
6413 compare and set EFLAGS
6414
6415 \c UCOMISD xmm1,xmm2/m128        ; 66 0F 2E /r     [WILLAMETTE,SSE2]
6416
6417 \c{UCOMISD} compares the low-order double-precision FP numbers in the
6418 two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
6419 \c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
6420 in the \c{EFLAGS} register are zeroed out. The unordered predicate
6421 (\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
6422 operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
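
The flag results can be tested with ordinary conditional jumps. The
following sketch assumes 32-bit code and hypothetical labels. The
ordered outcomes it relies on (\c{CF} set for less than, \c{ZF} set
for equal, all three flags clear for greater than) are not listed
above and are assumed here from the processor manuals. \c{PF} is
checked first, because the unordered result sets all three flags.

\c         bits 32
\c
\c         ucomisd xmm0,xmm1       ; compare low doubles of xmm0 and xmm1
\c         jp      unordered       ; PF=1: at least one operand is a NaN
\c         ja      greater         ; CF=0 and ZF=0: xmm0 > xmm1
\c         jb      less            ; CF=1: xmm0 < xmm1
\c equal:                          ; ZF=1: xmm0 = xmm1
\c         ret
\c unordered:
\c         ret
\c greater:
\c         ret
\c less:
\c         ret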
6423
6424
\S{insUCOMISS} \i\c{UCOMISS}: Unordered Scalar Single-Precision FP
Compare and Set EFLAGS
6427
\c UCOMISS xmm1,xmm2/m32         ; 0F 2E /r        [KATMAI,SSE]
6429
6430 \c{UCOMISS} compares the low-order single-precision FP numbers in the
6431 two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
6432 \c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
6433 in the \c{EFLAGS} register are zeroed out. The unordered predicate
6434 (\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
6435 operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
6436
6437
6438 \S{insUD2} \i\c{UD0}, \i\c{UD1}, \i\c{UD2}: Undefined Instruction
6439
6440 \c UD0                           ; 0F FF                [186,UNDOC]
6441 \c UD1                           ; 0F B9                [186,UNDOC]
6442 \c UD2                           ; 0F 0B                [186]
6443
6444 \c{UDx} can be used to generate an invalid opcode exception, for testing
6445 purposes.
6446
6447 \c{UD0} is specifically documented by AMD as being reserved for this
6448 purpose.
6449
6450 \c{UD1} is documented by Intel as being available for this purpose.
6451
6452 \c{UD2} is specifically documented by Intel as being reserved for this
6453 purpose. Intel document this as the preferred method of generating an
6454 invalid opcode exception.
6455
6456 All these opcodes can be used to generate invalid opcode exceptions on
6457 all currently available processors.
6458
6459
6460 \S{insUMOV} \i\c{UMOV}: User Move Data
6461
6462 \c UMOV r/m8,reg8                ; 0F 10 /r             [386,UNDOC]
6463 \c UMOV r/m16,reg16              ; o16 0F 11 /r         [386,UNDOC]
6464 \c UMOV r/m32,reg32              ; o32 0F 11 /r         [386,UNDOC]
6465
6466 \c UMOV reg8,r/m8                ; 0F 12 /r             [386,UNDOC]
6467 \c UMOV reg16,r/m16              ; o16 0F 13 /r         [386,UNDOC]
6468 \c UMOV reg32,r/m32              ; o32 0F 13 /r         [386,UNDOC]
6469
6470 This undocumented instruction is used by in-circuit emulators to
6471 access user memory (as opposed to host memory). It is used just like
6472 an ordinary memory/register or register/register \c{MOV}
6473 instruction, but accesses user space.
6474
6475 This instruction is only available on some AMD and IBM 386 and 486
6476 processors.
6477
6478
6479 \S{insUNPCKHPD} \i\c{UNPCKHPD}: Unpack and Interleave High Packed
6480 Double-Precision FP Values
6481
6482 \c UNPCKHPD xmm1,xmm2/m128       ; 66 0F 15 /r     [WILLAMETTE,SSE2]
6483
6484 \c{UNPCKHPD} performs an interleaved unpack of the high-order data
6485 elements of the source and destination operands, saving the result
6486 in \c{xmm1}. It ignores the lower half of the sources.
6487
6488 The operation of this instruction is:
6489
6490 \c    dst[63-0]   := dst[127-64];
6491 \c    dst[127-64] := src[127-64].
6492
6493
6494 \S{insUNPCKHPS} \i\c{UNPCKHPS}: Unpack and Interleave High Packed
6495 Single-Precision FP Values
6496
6497 \c UNPCKHPS xmm1,xmm2/m128       ; 0F 15 /r        [KATMAI,SSE]
6498
6499 \c{UNPCKHPS} performs an interleaved unpack of the high-order data
6500 elements of the source and destination operands, saving the result
6501 in \c{xmm1}. It ignores the lower half of the sources.
6502
6503 The operation of this instruction is:
6504
6505 \c    dst[31-0]   := dst[95-64];
6506 \c    dst[63-32]  := src[95-64];
6507 \c    dst[95-64]  := dst[127-96];
6508 \c    dst[127-96] := src[127-96].
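
To make the element movement concrete, here is a small sketch in
32-bit code. The labels \c{vec_a} and \c{vec_b} and their contents
are made up for illustration, and the data is assumed to be 16-byte
aligned when loaded, as \c{MOVAPS} requires.

\c         bits 32
\c
\c         movaps  xmm0,[vec_a]    ; xmm0 = a3:a2:a1:a0 (a0 in bits 31-0)
\c         movaps  xmm1,[vec_b]    ; xmm1 = b3:b2:b1:b0
\c         unpckhps xmm0,xmm1      ; xmm0 = b3:a3:b2:a2
\c         ret
\c
\c         align   16
\c vec_a:  dd      1.0, 2.0, 3.0, 4.0      ; a0, a1, a2, a3
\c vec_b:  dd      5.0, 6.0, 7.0, 8.0      ; b0, b1, b2, b3

With this data, \c{xmm0} ends up holding 3.0, 7.0, 4.0 and 8.0, reading
from the lowest element upwards.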
6509
6510
6511 \S{insUNPCKLPD} \i\c{UNPCKLPD}: Unpack and Interleave Low Packed
6512 Double-Precision FP Data
6513
6514 \c UNPCKLPD xmm1,xmm2/m128       ; 66 0F 14 /r     [WILLAMETTE,SSE2]
6515
6516 \c{UNPCKLPD} performs an interleaved unpack of the low-order data
6517 elements of the source and destination operands, saving the result
in \c{xmm1}. It ignores the upper half of the sources.
6519
6520 The operation of this instruction is:
6521
6522 \c    dst[63-0]   := dst[63-0];
6523 \c    dst[127-64] := src[63-0].
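
One possible use is packing two scalar doubles into a single register.
The sketch below assumes 32-bit code; the labels \c{val_lo} and
\c{val_hi} are hypothetical.

\c         bits 32
\c
\c         movsd   xmm0,[val_lo]   ; xmm0[63-0] = 1.5
\c         movsd   xmm1,[val_hi]   ; xmm1[63-0] = 2.5
\c         unpcklpd xmm0,xmm1      ; xmm0[63-0] = 1.5, xmm0[127-64] = 2.5
\c         ret
\c
\c val_lo: dq      1.5
\c val_hi: dq      2.5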
6524
6525
6526 \S{insUNPCKLPS} \i\c{UNPCKLPS}: Unpack and Interleave Low Packed
6527 Single-Precision FP Data
6528
6529 \c UNPCKLPS xmm1,xmm2/m128       ; 0F 14 /r        [KATMAI,SSE]
6530
6531 \c{UNPCKLPS} performs an interleaved unpack of the low-order data
6532 elements of the source and destination operands, saving the result
in \c{xmm1}. It ignores the upper half of the sources.
6534
6535 The operation of this instruction is:
6536
6537 \c    dst[31-0]   := dst[31-0];
6538 \c    dst[63-32]  := src[31-0];
6539 \c    dst[95-64]  := dst[63-32];
6540 \c    dst[127-96] := src[63-32].
6541
6542
6543 \S{insVERR} \i\c{VERR}, \i\c{VERW}: Verify Segment Readability/Writability
6544
6545 \c VERR r/m16                    ; 0F 00 /4             [286,PRIV]
6546
6547 \c VERW r/m16                    ; 0F 00 /5             [286,PRIV]
6548
6549 \b \c{VERR} sets the zero flag if the segment specified by the selector
6550 in its operand can be read from at the current privilege level.
6551 Otherwise it is cleared.
6552
\b \c{VERW} sets the zero flag if the segment specified by the selector
in its operand can be written to at the current privilege level.
Otherwise it is cleared.
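
A minimal sketch of checking a selector before using it might look
like this. The selector value \c{SEL} and the labels are hypothetical,
and 32-bit protected-mode code is assumed.

\c         bits 32
\c
\c SEL     equ     0x0023          ; hypothetical selector value
\c
\c         mov     ax,SEL
\c         verr    ax              ; ZF=1 if the segment is readable
\c         jnz     no_access
\c         verw    ax              ; ZF=1 if the segment is writable
\c         jnz     no_access
\c         ret                     ; readable and writable
\c no_access:
\c         ret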
6554
6555
6556 \S{insWAIT} \i\c{WAIT}: Wait for Floating-Point Processor
6557
6558 \c WAIT                          ; 9B                   [8086]
6559 \c FWAIT                         ; 9B                   [8086]
6560
6561 \c{WAIT}, on 8086 systems with a separate 8087 FPU, waits for the
6562 FPU to have finished any operation it is engaged in before
6563 continuing main processor operations, so that (for example) an FPU
6564 store to main memory can be guaranteed to have completed before the
6565 CPU tries to read the result back out.
6566
On later processors, \c{WAIT} is unnecessary for this purpose, and
6568 it has the alternative purpose of ensuring that any pending unmasked
6569 FPU exceptions have happened before execution continues.
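
The 8087 case described above can be sketched as follows, assuming
16-bit code and a hypothetical variable \c{result}:

\c         bits 16
\c
\c         fistp   word [result]   ; FPU begins storing an integer
\c         wait                    ; hold the CPU until the FPU is done
\c         mov     ax,[result]     ; now safe to read the stored value
\c         ret
\c
\c result: dw      0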
6570
6571
6572 \S{insWBINVD} \i\c{WBINVD}: Write Back and Invalidate Cache
6573
6574 \c WBINVD                        ; 0F 09                [486]
6575
6576 \c{WBINVD} invalidates and empties the processor's internal caches,
6577 and causes the processor to instruct external caches to do the same.
6578 It writes the contents of the caches back to memory first, so no
6579 data is lost. To flush the caches quickly without bothering to write
6580 the data back first, use \c{INVD} (\k{insINVD}).
6581
6582
6583 \S{insWRMSR} \i\c{WRMSR}: Write Model-Specific Registers
6584
6585 \c WRMSR                         ; 0F 30                [PENT]
6586
6587 \c{WRMSR} writes the value in \c{EDX:EAX} to the processor
6588 Model-Specific Register (MSR) whose index is stored in \c{ECX}.
6589 See also \c{RDMSR} (\k{insRDMSR}).
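
As an illustration, the following sketch clears a 64-bit MSR. It
assumes 32-bit code running at privilege level 0, and it assumes that
index 10h selects the time-stamp counter, which is not stated
elsewhere in this appendix and should be checked against the
processor documentation.

\c         bits 32
\c
\c MSR_TSC equ     0x10            ; assumed index: time-stamp counter
\c
\c         mov     ecx,MSR_TSC     ; ECX selects the MSR
\c         xor     eax,eax         ; low 32 bits of the new value
\c         xor     edx,edx         ; high 32 bits of the new value
\c         wrmsr                   ; MSR[ECX] := EDX:EAX
\c         ret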
6590
6591
6592 \S{insWRSHR} \i\c{WRSHR}: Write SMM Header Pointer Register
6593
6594 \c WRSHR r/m32                   ; 0F 37 /0        [386,CYRIX,SMM]
6595
6596 \c{WRSHR} loads the contents of either a 32-bit memory location or a
6597 32-bit register into the SMM header pointer register.
6598
6599 See also \c{RDSHR} (\k{insRDSHR}).
6600
6601
6602 \S{insXADD} \i\c{XADD}: Exchange and Add
6603
6604 \c XADD r/m8,reg8                ; 0F C0 /r             [486]
6605 \c XADD r/m16,reg16              ; o16 0F C1 /r         [486]
6606 \c XADD r/m32,reg32              ; o32 0F C1 /r         [486]
6607
6608 \c{XADD} exchanges the values in its two operands, and then adds
6609 them together and writes the result into the destination (first)
6610 operand. This instruction can be used with a \c{LOCK} prefix for
6611 multi-processor synchronisation purposes.
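
A common application is an atomic fetch-and-add. The sketch below
assumes 32-bit code and a hypothetical shared variable \c{counter}:

\c         bits 32
\c
\c         mov     eax,1
\c         lock xadd [counter],eax ; counter := counter + 1, atomically
\c                                 ; EAX := previous value of counter
\c         ret
\c
\c counter: dd     0

Because the old value comes back in \c{EAX}, each caller receives a
distinct number even when several processors increment \c{counter} at
the same time.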
6612
6613
6614 \S{insXBTS} \i\c{XBTS}: Extract Bit String
6615
6616 \c XBTS reg16,r/m16              ; o16 0F A6 /r         [386,UNDOC]
6617 \c XBTS reg32,r/m32              ; o32 0F A6 /r         [386,UNDOC]
6618
6619 The implied operation of this instruction is:
6620
6621 \c XBTS r/m16,reg16,AX,CL
6622 \c XBTS r/m32,reg32,EAX,CL
6623
\c{XBTS} writes a bit string from the source operand to the destination. \c{CL}
6625 indicates the number of bits to be copied, and \c{(E)AX} indicates the
6626 low order bit offset in the source. The bits are written to the low
6627 order bits of the destination register. For example, if \c{CL} is set
6628 to 4 and \c{AX} (for 16-bit code) is set to 5, bits 5-8 of \c{src} will
6629 be copied to bits 0-3 of \c{dst}. This instruction is very poorly
6630 documented, and I have been unable to find any official source of
6631 documentation on it.
6632
6633 \c{XBTS} is supported only on the early Intel 386s, and conflicts with
6634 the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM supports it
6635 only for completeness. Its counterpart is \c{IBTS} (see \k{insIBTS}).
6636
6637
6638 \S{insXCHG} \i\c{XCHG}: Exchange
6639
6640 \c XCHG reg8,r/m8                ; 86 /r                [8086]
\c XCHG reg16,r/m16              ; o16 87 /r            [8086]
6642 \c XCHG reg32,r/m32              ; o32 87 /r            [386]
6643
6644 \c XCHG r/m8,reg8                ; 86 /r                [8086]
6645 \c XCHG r/m16,reg16              ; o16 87 /r            [8086]
6646 \c XCHG r/m32,reg32              ; o32 87 /r            [386]
6647
6648 \c XCHG AX,reg16                 ; o16 90+r             [8086]
6649 \c XCHG EAX,reg32                ; o32 90+r             [386]
6650 \c XCHG reg16,AX                 ; o16 90+r             [8086]
6651 \c XCHG reg32,EAX                ; o32 90+r             [386]
6652
6653 \c{XCHG} exchanges the values in its two operands. It can be used
6654 with a \c{LOCK} prefix for purposes of multi-processor
6655 synchronisation.
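
As one way of using this, a simple spin lock might be sketched as
follows. The labels and the variable \c{lock_byte} are hypothetical,
and 32-bit code is assumed.

\c         bits 32
\c
\c acquire:
\c         mov     al,1
\c         lock xchg [lock_byte],al ; AL := old value, lock_byte := 1
\c         test    al,al
\c         jnz     acquire         ; it was already held: try again
\c         ret                     ; lock acquired
\c
\c release:
\c         mov     byte [lock_byte],0
\c         ret
\c
\c lock_byte: db   0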
6656
6657 \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the \c{BITS}
6658 setting) generates the opcode \c{90h}, and so is a synonym for
6659 \c{NOP} (\k{insNOP}).
6660
6661
6662 \S{insXLATB} \i\c{XLATB}: Translate Byte in Lookup Table
6663
6664 \c XLAT                          ; D7                   [8086]
6665 \c XLATB                         ; D7                   [8086]
6666
6667 \c{XLATB} adds the value in \c{AL}, treated as an unsigned byte, to
6668 \c{BX} or \c{EBX}, and loads the byte from the resulting address (in
6669 the segment specified by \c{DS}) back into \c{AL}.
6670
6671 The base register used is \c{BX} if the address size is 16 bits, and
6672 \c{EBX} if it is 32 bits. If you need to use an address size not
6673 equal to the current \c{BITS} setting, you can use an explicit
6674 \i\c{a16} or \i\c{a32} prefix.
6675
6676 The segment register used to load from \c{[BX+AL]} or \c{[EBX+AL]}
6677 can be overridden by using a segment register name as a prefix (for
6678 example, \c{es xlatb}).
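
For instance, a value from 0 to 15 can be converted to a hexadecimal
digit with a 16-byte table. This sketch assumes 16-bit code in which
\c{DS} already addresses the hypothetical table \c{hex_digits}:

\c         bits 16
\c
\c         mov     bx,hex_digits   ; base of the lookup table
\c         mov     al,0x0B         ; index into the table
\c         xlatb                   ; AL := [DS:BX+AL] = 'B'
\c         ret
\c
\c hex_digits: db  '0123456789ABCDEF'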
6679
6680
6681 \S{insXOR} \i\c{XOR}: Bitwise Exclusive OR
6682
6683 \c XOR r/m8,reg8                 ; 30 /r                [8086]
6684 \c XOR r/m16,reg16               ; o16 31 /r            [8086]
6685 \c XOR r/m32,reg32               ; o32 31 /r            [386]
6686
6687 \c XOR reg8,r/m8                 ; 32 /r                [8086]
6688 \c XOR reg16,r/m16               ; o16 33 /r            [8086]
6689 \c XOR reg32,r/m32               ; o32 33 /r            [386]
6690
6691 \c XOR r/m8,imm8                 ; 80 /6 ib             [8086]
6692 \c XOR r/m16,imm16               ; o16 81 /6 iw         [8086]
6693 \c XOR r/m32,imm32               ; o32 81 /6 id         [386]
6694
6695 \c XOR r/m16,imm8                ; o16 83 /6 ib         [8086]
6696 \c XOR r/m32,imm8                ; o32 83 /6 ib         [386]
6697
6698 \c XOR AL,imm8                   ; 34 ib                [8086]
6699 \c XOR AX,imm16                  ; o16 35 iw            [8086]
6700 \c XOR EAX,imm32                 ; o32 35 id            [386]
6701
6702 \c{XOR} performs a bitwise XOR operation between its two operands
6703 (i.e. each bit of the result is 1 if and only if exactly one of the
6704 corresponding bits of the two inputs was 1), and stores the result
6705 in the destination (first) operand.
6706
6707 In the forms with an 8-bit immediate second operand and a longer
6708 first operand, the second operand is considered to be signed, and is
6709 sign-extended to the length of the first operand. In these cases,
6710 the \c{BYTE} qualifier is necessary to force NASM to generate this
6711 form of the instruction.
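
A short sketch, assuming 32-bit code, shows the register form, the
sign-extended 8-bit immediate form and an ordinary 8-bit form in turn:

\c         bits 32
\c
\c         xor     eax,eax         ; EAX := 0 (a value XOR itself is zero)
\c         xor     eax,byte -1     ; 83 F0 FF: -1 extends to 0xFFFFFFFF,
\c                                 ; so EAX := 0xFFFFFFFF
\c         xor     al,0x0F         ; toggle the low four bits: AL := 0xF0
\c         ret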
6712
6713 The \c{MMX} instruction \c{PXOR} (see \k{insPXOR}) performs the same
6714 operation on the 64-bit \c{MMX} registers.
6715
6716
6717 \S{insXORPD} \i\c{XORPD}: Bitwise Logical XOR of Double-Precision FP Values
6718
6719 \c XORPD xmm1,xmm2/m128          ; 66 0F 57 /r     [WILLAMETTE,SSE2]
6720
6721 \c{XORPD} returns a bit-wise logical XOR between the source and
6722 destination operands, storing the result in the destination operand.
6723
6724
6725 \S{insXORPS} \i\c{XORPS}: Bitwise Logical XOR of Single-Precision FP Values
6726
6727 \c XORPS xmm1,xmm2/m128          ; 0F 57 /r        [KATMAI,SSE]
6728
6729 \c{XORPS} returns a bit-wise logical XOR between the source and
6730 destination operands, storing the result in the destination operand.
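
One possible use, not described above, is negating packed values by
flipping their sign bits. The sketch assumes 32-bit code, hypothetical
labels \c{values} and \c{sign_mask}, and data that is 16-byte aligned
when loaded.

\c         bits 32
\c
\c         movaps  xmm0,[values]   ; xmm0 = four single-precision values
\c         xorps   xmm0,[sign_mask] ; flip bit 31 of each element
\c         ret                     ; xmm0 = -1.0, 2.0, -3.0, 4.0
\c
\c         align   16
\c values:    dd   1.0, -2.0, 3.0, -4.0
\c sign_mask: dd   0x80000000, 0x80000000, 0x80000000, 0x80000000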
6731
6732