s390x: XC instruction: clear in 8-byte increments if possible

The XC instruction is executed frequently in many programs, mainly for
clearing memory. It operates on 1 to 256 bytes. If the length is
constant and XC is actually used for clearing memory (both operands
address the same storage), Valgrind implements it as a byte-wise loop
and unrolls the loop for lengths of up to 8 bytes.
Instead of clearing byte-wise, it is more efficient to clear in 64-bit
increments, so do this for lengths of 8 bytes or more. Unroll the loop
for lengths of up to 32 bytes. Overall, this reduces the number of insns
by a few percent and provides a slight performance improvement for some
programs.
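A minimal C sketch of the clearing strategy, modeled on a host buffer
rather than as Valgrind IR. The function name xc_clear is hypothetical,
and clearing a length that is not a multiple of 8 with byte stores for
the remaining 1 to 7 bytes is an assumption; the text above does not say
how the remainder is handled. The function returns the number of stores
it performs so the reduction relative to a byte-wise loop is visible.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of clearing an XC operand of 1..256 bytes.
 * Not Valgrind's code: the real translator emits IR, and would unroll
 * the 64-bit loop for lengths <= 32 bytes; this C model always loops.
 * Assumption: a non-multiple-of-8 tail is cleared byte by byte.
 * Returns the number of store operations performed. */
static size_t xc_clear(unsigned char *dst, size_t len)
{
    size_t stores = 0;
    size_t off = 0;

    if (len >= 8) {
        const uint64_t zero = 0;
        /* 64-bit stores while at least 8 bytes remain. */
        for (; len - off >= 8; off += 8) {
            memcpy(dst + off, &zero, sizeof zero);
            stores++;
        }
    }

    /* Remaining bytes (all of them when len < 8) are cleared one at a time. */
    for (; off < len; off++) {
        dst[off] = 0;
        stores++;
    }
    return stores;
}
```

For example, a 13-byte clear performs 6 stores (one 8-byte store and
five byte stores) instead of 13 byte stores, and a 32-byte clear needs
only 4 stores.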