enable GPU sharing among tMPI ranks
It turns out that the only issue preventing GPU sharing among
thread-MPI ranks was that the thread that arrives at free_gpu() first
destroys the CUDA context while the other thread(s) sharing the GPU
are most likely still freeing their resources - an operation that
fails as soon as the context has been destroyed by the "fast" thread.
Simply placing a barrier between GPU resource freeing and context
destruction solves the issue. However, a very unlikely concurrency
hazard remains after CUDA texture reference updates (non-bonded
parameter table and Coulomb force table initialization). To be on the
safe side, with tMPI a barrier is also placed after these operations;
see the sketch below.
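
A minimal sketch of the freeing order this change enforces, assuming a
hypothetical gpu_share_barrier() that blocks until every tMPI rank
sharing the GPU has reached it; apart from the CUDA runtime calls, all
identifiers are illustrative and not the actual GROMACS code:

    #include <cuda_runtime.h>

    /* hypothetical: blocks until every tMPI rank sharing this GPU
       has reached the barrier (e.g. implemented with thread-MPI) */
    extern void gpu_share_barrier(void);

    void free_gpu_shared(float *d_nb_params, float *d_coul_ftab)
    {
        /* each rank first frees the device buffers it owns */
        cudaFree(d_nb_params);
        cudaFree(d_coul_ftab);

        /* wait until ALL ranks sharing this GPU have finished freeing;
           without this, the first rank's context destruction below
           makes the other ranks' cudaFree() calls fail */
        gpu_share_barrier();

        /* only now is it safe to tear down the context */
        cudaDeviceReset();
    }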
Change-Id: Iac7a39f841ca31a32ab979ee0012cfc18a811d76