kernel - Reduce spinning on shared spinlocks
* Improve spinlock performance by removing unnecessary extra reads,
using atomic_fetchadd_int() to avoid a cmpxchg loop, and allowing
the SHARED flag to remain soft-set on the 1->0 transition.
* The primary improvement here is that multiple CPUs obtaining the
  same shared spinlock can now do so via a single atomic_fetchadd_int(),
  whereas before we had multiple atomics and cmpxchg loops. This does not
  remove the cacheline ping-pong but it significantly reduces unnecessary
  looping when multiple cpu cores are heavily loading the same shared
  spinlock.
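  A minimal userland sketch of the fast path described above, using C11
  atomics rather than the kernel's primitives. The SPINLOCK_SHARED bit
  position and field layout here are illustrative assumptions, not the
  actual struct spinlock definition: when the SHARED flag is already set,
  a single fetch-add acquires a shared count with no cmpxchg loop; if it
  was not set, the increment is backed out and the caller would fall
  through to a slow path (omitted).

  ```c
  #include <stdatomic.h>
  #include <stdio.h>

  /* Illustrative layout only: SHARED flag in the high bit, shared-holder
   * count in the low bits. Not the real kernel spinlock layout. */
  #define SPINLOCK_SHARED 0x80000000u

  static atomic_uint lock_word = SPINLOCK_SHARED; /* soft-set SHARED flag */

  /* Shared-acquire fast path: one atomic fetch-add, no cmpxchg loop. */
  static int
  shared_try_acquire(void)
  {
      unsigned int ov = atomic_fetch_add(&lock_word, 1);

      if (ov & SPINLOCK_SHARED)
          return 1;                       /* flag set, we hold it shared */
      atomic_fetch_sub(&lock_word, 1);    /* back out, take slow path */
      return 0;
  }

  int
  main(void)
  {
      printf("%d\n", shared_try_acquire());
      printf("%u\n", atomic_load(&lock_word) & ~SPINLOCK_SHARED);
      return 0;
  }
  ```

  Every additional CPU taking the lock shared pays only the one locked
  add, which is where the reduction in spinning comes from.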
* The trade-off is against the case where a spinlock's use-case switches
  from shared to exclusive or back again, which now requires an extra
  atomic op to deal with. This is not a common case.
* Remove the spin->countb debug code, as it interferes with hw cacheline
  operations and is no longer desirable.
Discussed-with: Mateusz Guzik (mjg_)