CUDA 9/Volta support for the nonbonded module
The Volta architecture introduces independent thread scheduling. This
new architectural feature breaks implicit warp synchronous programming
and requires new intrinsics for warp wide operations like shfl.
This change implements the necessary sync point for the Volta
architecture and replaces the deprecated warp-intrinsics with their
_sycn version.
Note that the current implementation is conservative and aims for Volta
compatibility only and the implementation is likely not optimal (for
details see nbnxn_cuda_kernel.cuh).
Change-Id: I38dd572992cf14ce5a7158d0bbc3086b54f18676