util/bufferiszero: Optimize SSE2 and AVX2 variants
commit f28e0bbefa41fe643cce2f107e868abff312ced9
author Alexander Monakov <amonakov@ispras.ru>
Tue, 6 Feb 2024 20:48:08 +0000 (6 23:48 +0300)
committer Richard Henderson <richard.henderson@linaro.org>
Fri, 3 May 2024 15:03:05 +0000 (3 08:03 -0700)
tree 933db7fedccb1c2590441909271db03ff8cba52f
parent 93a6085618f16fb2cd316d1e84f1a638b7e2d8ff

Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.
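
The two points above can be illustrated with a small portable sketch. This is not the code from util/bufferiszero.c: the names load64 and is_zero_unrolled are hypothetical, and plain 64-bit words stand in for SIMD registers. It shows eight independent loads per iteration combined with OR, and a loop condition written in terms of the remaining length, so no pointer beyond the end of the buffer is ever formed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes with no alignment requirement. */
static uint64_t load64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical sketch: returns true if all len bytes at buf are zero. */
static bool is_zero_unrolled(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    /*
     * 8x unrolled main loop: eight independent loads are OR-reduced and
     * tested with a single branch per 64 bytes.  The condition compares
     * the remaining byte count, so p + i never exceeds p + len.
     */
    while (len - i >= 64) {
        uint64_t a = load64(p + i)      | load64(p + i + 8);
        uint64_t b = load64(p + i + 16) | load64(p + i + 24);
        uint64_t c = load64(p + i + 32) | load64(p + i + 40);
        uint64_t d = load64(p + i + 48) | load64(p + i + 56);

        if ((a | b) | (c | d)) {
            return false;
        }
        i += 64;
    }

    /* Scalar tail for the remaining 0..63 bytes. */
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}
```

Writing the bound as `len - i >= 64` rather than comparing against a precomputed `end - 64` style pointer keeps every intermediate address inside the buffer, which is the property the commit message asks for.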

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).
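
As a rough illustration of the strategy described above, the following sketch tests 128 bytes per iteration using SSE2 intrinsics, reducing eight vectors with POR and checking the result with PCMPEQB plus PMOVMSKB instead of PTEST. It is a hypothetical example, not the patch itself: the function name is_zero_sse2_sketch, the unaligned loads, and the scalar tail are assumptions, and a plain byte loop is used when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: returns true if all len bytes at buf are zero. */
static bool is_zero_sse2_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    assert(buf != NULL || len == 0);

#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();

    /* 8 x 16 bytes per iteration; bound is checked on remaining length. */
    while (len - i >= 128) {
        const unsigned char *q = p + i;
        __m128i v0 = _mm_loadu_si128((const __m128i *)(q));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(q + 16));
        __m128i v2 = _mm_loadu_si128((const __m128i *)(q + 32));
        __m128i v3 = _mm_loadu_si128((const __m128i *)(q + 48));
        __m128i v4 = _mm_loadu_si128((const __m128i *)(q + 64));
        __m128i v5 = _mm_loadu_si128((const __m128i *)(q + 80));
        __m128i v6 = _mm_loadu_si128((const __m128i *)(q + 96));
        __m128i v7 = _mm_loadu_si128((const __m128i *)(q + 112));

        /* Tree of POR operations keeps the dependency chain short. */
        __m128i t = _mm_or_si128(
            _mm_or_si128(_mm_or_si128(v0, v1), _mm_or_si128(v2, v3)),
            _mm_or_si128(_mm_or_si128(v4, v5), _mm_or_si128(v6, v7)));

        /* PCMPEQB + PMOVMSKB: mask is 0xFFFF only if all 16 bytes are 0. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) != 0xFFFF) {
            return false;
        }
        i += 128;
    }
#endif

    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}
```

An AVX2 version following the same strategy would use 256-bit loads and the corresponding compare and movemask intrinsics; it is omitted here to keep the sketch short.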

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
util/bufferiszero.c