various improvements to memcpy_test
- lock benchmark to CPU 0 to reduce task switches
- measure only the time of memcpy itself
- use cpuid to serialize execution, as proposed by agner.org
- add various checks that ensure the memcpy impl is correct
- warm up CPU before starting benchmarks
- display only the best result for each size/impl
- use a much higher round count to ensure we really get an
  optimal result.
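a minimal sketch of how the timing-related points above could fit
together, not the actual memcpy_test code. the names serialized_ticks,
best_ticks and memcpy_fn are made up for illustration; CPU pinning and
the warm-up phase are left out, and a non-x86 fallback clock is
included only so the sketch builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: read a timestamp. On x86, cpuid runs first so
 * that earlier instructions have completed before rdtsc samples the
 * counter. */
static uint64_t serialized_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t a = 0, b, c = 0, d, lo, hi;
    __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    (void)b; (void)d;
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback for non-x86 builds: nanoseconds from a monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical type for the implementation under test. */
typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Returns the lowest tick count seen over `rounds` copies of `n` bytes.
 * Returns UINT64_MAX if a copy produced wrong data or a wrong return
 * value (or if rounds is 0). */
static uint64_t best_ticks(memcpy_fn fn, void *dst, const void *src,
                           size_t n, unsigned rounds)
{
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < rounds; i++) {
        memset(dst, 0, n);                 /* setup, outside the timed region */
        uint64_t t0 = serialized_ticks();
        void *ret = fn(dst, src, n);       /* only the copy itself is timed */
        uint64_t t1 = serialized_ticks();
        if (ret != dst || memcmp(dst, src, n) != 0)
            return UINT64_MAX;             /* correctness check failed */
        if (t1 - t0 < best)
            best = t1 - t0;                /* keep only the best result */
    }
    return best;
}
```

a caller would invoke e.g. best_ticks(memcpy, dst, src, size, rounds)
once per size/impl pair and print only the returned value.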
with all these improvements, it's finally possible to get reliable
and repeatable numbers, provided that things like Intel SpeedStep
and turbo mode don't interfere.
it turned out that with kernel 5.4 it's possible to lock the
CPU speed by selecting the "userspace" CPU governor and setting the
clock speed to exactly 1 GHz, without having to mess with BIOS
settings or use a specialized kernel.
with other governors, or e.g. the userspace governor at 800 MHz,
which is identical to the powersave governor, the CPU would still
keep switching between clock speeds.
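a rough sketch of a pre-flight check a benchmark could run to confirm
the setup described above, assuming the standard Linux cpufreq sysfs
attributes scaling_governor and scaling_setspeed for cpu0. the helper
names read_sysfs_line and cpu0_locked_at are hypothetical, and the
check only sees the requested speed, not the measured one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read the first line of a sysfs attribute into buf, without the
 * trailing newline. Returns 0 on success, -1 if it cannot be read. */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if cpu0 uses the "userspace" governor with want_khz as the
 * requested speed, 0 if not, -1 if cpufreq info is unavailable.
 * scaling_setspeed is in kHz, so 1 GHz is 1000000. */
static int cpu0_locked_at(long want_khz)
{
    const char *gov_path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
    const char *spd_path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed";
    char gov[64], spd[64];

    if (read_sysfs_line(gov_path, gov, sizeof gov) != 0 ||
        read_sysfs_line(spd_path, spd, sizeof spd) != 0)
        return -1;
    return strcmp(gov, "userspace") == 0 &&
           strtol(spd, NULL, 10) == want_khz;
}
```

calling cpu0_locked_at(1000000) before the first round and refusing to
run unless it returns 1 would catch a forgotten governor setting.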