cpuops: Use cmpxchg for xchg to avoid lock semantics
Use cmpxchg instead of xchg to realize this_cpu_xchg.
xchg will cause LOCK overhead since LOCK is always implied but cmpxchg
will not.
Baselines:
xchg() = 18 cycles (no segment prefix, LOCK semantics)
__this_cpu_xchg = 1 cycle
(simulated using this_cpu_read/write, two prefixes. Looks like the
cpu can use loop optimization to get rid of most of the overhead)
Cycles before:
this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg))
After:
this_cpu_xchg = 11 cycle (using cmpxchg without lock semantics)
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>