Improved the intra-GPU load balancing
The splitting of the pair list to improve load balancing on the GPU
was based on the number of generated lists. But this number can be
high(er) due to small lists before splitting. This lead to too few
lists for small systems and too many for large systems.
Now the splitting is based on the number of pairs in the list up till
now. This produces much more stable results.
Because of the more stable results, we increased the min_ci_balanced
factor from 40 to 44 (closer to the ideal 48).
With small systems on many threads we used to generate many more lists
than targeted. Because the algorithm is now far better, we increased
the minimum list size from 32 to 36 and still get fewer lists.
Change-Id: Id2210171a409ef1a27f7dc919fe806f0fe4d869c