Replace intrinsic with inline asm for AVX512 unit test
Without high optimization, some compilers (icc) produce
assembly that leads to lots of store-to-load forwarding
during initialization, which distorts timing results.
The modified code uses inline asm without loading from
memory, which is fine since the inline (volatile) asm
will not be optimized away. Tested to work and detect 2 FMA
units on Core i9-7920X and 1 FMA on Xeon Silver 4116,
with gcc-5.4, gcc-7.1, icc 2017 and clang-5 with
optimization levels from -O0 to -O3. We also avoid a
warning when we override the architecture with the
AVX-512 flags for the source file containing the asm.
Fixes #2340.
Change-Id: I3aea95b162c55c7773182a69f639dff1a01d0603