kernel - Reoptimize sys_pipe
* Use atomic ops for state updates, allowing us to avoid acquiring
the other side's token. This removes all remaining contention.
* Performance boosted by around 35%. On the Ryzen, bulk buffer
write->read tests between localized cpu cores went from 9.2 GB/sec
to around 13 GB/sec. Cross-die performance increased from
2.5 GB/sec to around 4.5 GB/sec.
1-byte ping-ponging (write-1/read-1/turn-around/write-back-1/
read-back-1) fell from 1.0-2.0uS to 0.7-1.7uS.
* Add kern.pipe.size, allowing the kernel pipe buffer size to be
changed (affects new pipes only). The default buffer size has
been increased from 16KB to 32KB.
* Refactor pipelining optimizations, further reducing unnecessary
tsleep/wakeup IPIs.
* Improve kern.pipe.delay operation (an IPI avoidance mechanism),
and reduce it from 5uS to 4uS.
Also add cpu_pause() in the TSC loop (suggested-by mjg_).