powerpc64: Add optimized chacha20

It adds a vectorized ChaCha20 implementation based on libgcrypt's
cipher/chacha20-ppc.c.  It targets POWER8 and is used by default
for little-endian (LE).

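For readers unfamiliar with the cipher, the sketch below shows the ChaCha20 block function from RFC 8439 in portable C.  It is not the code from this patch.  It assumes one common vectorization layout (an assumption here, not taken from the patch or from libgcrypt): each 128-bit vector holds the same state word of four independent blocks with consecutive counters, so every vector quarter round advances four blocks at once.  It uses GCC/Clang vector extensions instead of POWER intrinsics, and the names v4u32, splat, rotl, quarter_round, chacha20_init_4way and chacha20_blocks_4way are hypothetical.

```c
#include <stdint.h>

/* Hypothetical type: four 32-bit lanes in one 128-bit vector (GCC/Clang
   vector extension).  Lane j holds one state word of block j.  */
typedef uint32_t v4u32 __attribute__ ((vector_size (16)));

/* Broadcast a scalar into all four lanes.  */
static inline v4u32
splat (uint32_t x)
{
  v4u32 v = { x, x, x, x };
  return v;
}

/* Rotate every lane left by n bits, 0 < n < 32.  */
static inline v4u32
rotl (v4u32 v, unsigned int n)
{
  return (v << n) | (v >> (32 - n));
}

/* RFC 8439 quarter round, applied to four blocks at once.  */
static inline void
quarter_round (v4u32 *a, v4u32 *b, v4u32 *c, v4u32 *d)
{
  *a += *b; *d ^= *a; *d = rotl (*d, 16);
  *c += *d; *b ^= *c; *b = rotl (*b, 12);
  *a += *b; *d ^= *a; *d = rotl (*d, 8);
  *c += *d; *b ^= *c; *b = rotl (*b, 7);
}

/* Build the input state of four blocks with counters counter..counter+3.
   key holds 8 little-endian words, nonce holds 3 little-endian words.  */
static void
chacha20_init_4way (v4u32 st[16], const uint32_t key[8], uint32_t counter,
                    const uint32_t nonce[3])
{
  static const uint32_t sigma[4] =
    { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
  for (int i = 0; i < 4; i++)
    st[i] = splat (sigma[i]);
  for (int i = 0; i < 8; i++)
    st[4 + i] = splat (key[i]);
  v4u32 ctr = { counter, counter + 1, counter + 2, counter + 3 };
  st[12] = ctr;
  for (int i = 0; i < 3; i++)
    st[13 + i] = splat (nonce[i]);
}

/* 20 rounds (10 column/diagonal double rounds), then add the input state.
   out[w][j] is word w of the keystream block for counter + j.  */
static void
chacha20_blocks_4way (v4u32 out[16], const v4u32 in[16])
{
  v4u32 x[16];
  for (int i = 0; i < 16; i++)
    x[i] = in[i];
  for (int r = 0; r < 10; r++)
    {
      /* Column round.  */
      quarter_round (&x[0], &x[4], &x[8], &x[12]);
      quarter_round (&x[1], &x[5], &x[9], &x[13]);
      quarter_round (&x[2], &x[6], &x[10], &x[14]);
      quarter_round (&x[3], &x[7], &x[11], &x[15]);
      /* Diagonal round.  */
      quarter_round (&x[0], &x[5], &x[10], &x[15]);
      quarter_round (&x[1], &x[6], &x[11], &x[12]);
      quarter_round (&x[2], &x[7], &x[8], &x[13]);
      quarter_round (&x[3], &x[4], &x[9], &x[14]);
    }
  for (int i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}
```

With the key, nonce and block counter from the RFC 8439 section 2.3.2 test vector, lane 0 of out should reproduce that section's keystream block, while lanes 1 to 3 hold the blocks for the next three counters.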
On a POWER8 it shows the following improvements (using formatted
bench-arc4random data):

POWER8
                                    GENERIC MB/s   POWER8 MB/s
--------------------------------------------------------------
arc4random [single-thread]                138.77        198.06
arc4random_buf(16) [single-thread]        174.36        278.79
arc4random_buf(32) [single-thread]        228.11        448.89
arc4random_buf(48) [single-thread]        252.31        551.09
arc4random_buf(64) [single-thread]        270.11        646.12
arc4random_buf(80) [single-thread]        278.97        698.04
arc4random_buf(96) [single-thread]        287.78        756.06
arc4random_buf(112) [single-thread]       291.92        784.12
arc4random_buf(128) [single-thread]       295.25        808.04
--------------------------------------------------------------

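To read the two columns as ratios, the small C helper below divides each POWER8 figure by the matching GENERIC figure.  The throughput values are copied from the benchmark results above and nothing is re-measured; the names labels, generic_mbs, power8_mbs, speedup and print_speedups are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Throughput in MB/s, copied from the bench-arc4random results above.  */
#define N_ROWS 9

static const char *const labels[N_ROWS] = {
  "arc4random", "arc4random_buf(16)", "arc4random_buf(32)",
  "arc4random_buf(48)", "arc4random_buf(64)", "arc4random_buf(80)",
  "arc4random_buf(96)", "arc4random_buf(112)", "arc4random_buf(128)",
};
static const double generic_mbs[N_ROWS] = {
  138.77, 174.36, 228.11, 252.31, 270.11, 278.97, 287.78, 291.92, 295.25,
};
static const double power8_mbs[N_ROWS] = {
  198.06, 278.79, 448.89, 551.09, 646.12, 698.04, 756.06, 784.12, 808.04,
};

/* Speedup of the POWER8 implementation over the generic one for row i.  */
static double
speedup (size_t i)
{
  return power8_mbs[i] / generic_mbs[i];
}

/* Print one "label  speedup" line per benchmark row.  */
static void
print_speedups (void)
{
  for (size_t i = 0; i < N_ROWS; i++)
    printf ("%-22s %.2fx\n", labels[i], speedup (i));
}
```

Calling print_speedups () shows the ratio growing from about 1.43x for a single arc4random call to about 2.74x for 128-byte arc4random_buf requests.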
Checked on powerpc64-linux-gnu and powerpc64le-linux-gnu.
Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>