Bring performance estimation up to date
The code that estimates the PME/PP load and the optimal DD grid
setup used outdated numbers.
We now estimate using actual cycle counts on Haswell and estimate
for other architectures through a scaling factor that takes into
account the SIMD width and FMA.
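
As a rough illustration of the scaling idea only: the sketch below assumes
the cost scales inversely with SIMD width relative to Haswell and applies a
flat penalty when FMA is missing. The type CpuTraits, the function
scaledCycleCost, and both constants are hypothetical and are not the names
or values used in the actual code.

```cpp
#include <algorithm>

// Hypothetical description of the target CPU; not a GROMACS type.
struct CpuTraits
{
    int  simdWidth; // number of floating-point lanes per SIMD register
    bool hasFma;    // true if fused multiply-add is available
};

// Assumed reference: Haswell with 8 single-precision lanes and FMA.
constexpr int c_refSimdWidth = 8;

// Assumed, illustrative penalty for lacking FMA; not a measured value.
constexpr double c_noFmaPenalty = 1.5;

// Scale a cycle count measured on the reference architecture to another CPU.
double scaledCycleCost(double haswellCycles, const CpuTraits& cpu)
{
    // Guard against a zero or negative width by treating it as scalar code.
    const int width = std::max(cpu.simdWidth, 1);
    double    scale = static_cast<double>(c_refSimdWidth) / width;
    if (!cpu.hasFma)
    {
        scale *= c_noFmaPenalty;
    }
    return haswellCycles * scale;
}
```

With these assumed constants, a kernel costing 100 cycles on the reference
would be estimated at 200 cycles on a 4-wide CPU with FMA.
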
The DD grid automation now ignores PBC cost for exclusions with
the Verlet scheme and for angles and dihedrals with SIMD.
The effect of this is a more reliable PME load estimate that's
now a factor of 1.4 to 1.7 higher on Haswell.
The DD grid automation will now often choose a setup that better
matches the PME decomposition and reduces the PME redistribution cost.
Change-Id: I5daa6a6856f2b09ba6d17fda0eea800b816d21e4