LJ combination rule kernels for OpenCL
This change enables LJ combination rules in the OpenCL kernels for both
AMD and NVIDIA, and ports the changes to the "nowarp" test/CPU kernel.
As in the CUDA implementation, all kernels support combination rules,
but they are only used with plain cut-off LJ.
Notes:
- On AMD, tested on Hawaii, Fiji, Spectre and Oland devices:
combination rules improve performance in all cases, although when
combined with i-prefetching the improvement is typically only ~10%.
- On NVIDIA, tested on Kepler and Maxwell: in most cases the combination
rule kernels are fastest.
However, with certain inputs (e.g. pure water box, cut-off LJ, potential
shift) these kernels are 25% slower on Maxwell, but not on Kepler.
This is likely a compiler mis-optimization, so the defaults are left
the same as for AMD.
Change-Id: I05396e000cdf93c1d872729e6b477192af152495