kernel - Refactor the TSC MP synchronization test
* Refactor the TSC MP synchronization test. Do not use cpusync.
Using cpusync results in O(N x N) overhead instead of O(N)
overhead.
Instead, have the per-cpu threads run the test simultaneously using
each other's data.
* We synchronize to the last TSC element that was saved on each cpu.
This probably needs a bit of work to ensure determinism, but at
the moment it's good in that it synchronizes all cores off of a
single cache mastership change, instead of having them all compete
for cache mastership.
* Probably needs some fine tuning. At the moment I allow a slop of
10us, which is almost certainly too much. Note, however, that
SMP interactions can create ~1us latencies on particular memory
accesses.
* Solves serious issues with the old test on 64 cpu threads.
These issues may also have been related to the ipiq fifo size
being too small.
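
For illustration only, the comparison step described above can be
sketched in plain C. This is not the kernel code: the function name
tsc_count_outliers(), the last_tsc[] array layout, and expressing the
slop in raw TSC ticks are all assumptions made for the sketch. It shows
one cpu checking its own last saved TSC sample against every other
cpu's last saved sample, which costs O(N) per cpu when all cpus do this
at the same time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the kernel sources.
 *
 * last_tsc[i] is assumed to hold the last TSC value saved by cpu i.
 * Returns the number of other cpus whose sample differs from cpu
 * `self`'s sample by more than `slop` ticks.  A caller would convert
 * the allowed slop (10us in this commit) to ticks using the measured
 * TSC frequency.
 */
size_t
tsc_count_outliers(const uint64_t *last_tsc, size_t ncpus, size_t self,
    uint64_t slop)
{
	size_t i;
	size_t bad = 0;

	assert(self < ncpus);

	for (i = 0; i < ncpus; ++i) {
		uint64_t a = last_tsc[self];
		uint64_t b = last_tsc[i];
		/* Absolute difference without signed overflow. */
		uint64_t delta = (a > b) ? a - b : b - a;

		if (i != self && delta > slop)
			++bad;
	}
	return bad;
}
```

In this sketch each per-cpu thread would call the helper with its own
cpu index once all samples have been saved, and a non-zero result on
any cpu would mark the TSCs as not synchronized within the slop.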