gpu backend: ensure each register is accessed by a single thread

Before deciding whether to put data in private memory, the code
already checks that each element is only accessed by a single thread.
However, the point at which the private memory tile is copied
from global memory may subsequently be lifted to a shallower
depth, and in theory the mapping from threads to accessed array
elements may depend on the intermediate schedule dimensions.
At the shallower depth, an array element could then end up being
accessed by more than one thread.

Adjust the depth if needed to ensure that each array element
remains accessed by a single thread.

It is not clear whether the issue mentioned above can happen in
practice, but without the new test there was no guarantee that it
could not.

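As an illustration only, the depth adjustment can be sketched by brute
force on an explicit list of accesses. This is not the code in the
backend; the names single_thread_at_depth and adjust_depth, and the
encoding of accesses as (schedule point, thread, element) triples, are
assumptions made for this sketch.

```python
def single_thread_at_depth(accesses, depth):
    """Check that, for every fixed prefix of `depth` outer schedule
    dimensions, each element is accessed by at most one thread.

    `accesses` is an iterable of (schedule_point, thread, element)
    triples; this encoding is hypothetical.
    """
    owner = {}
    for sched, thread, element in accesses:
        key = (sched[:depth], element)
        # Remember the first thread seen for this (prefix, element)
        # pair and fail as soon as a different thread shows up.
        if owner.setdefault(key, thread) != thread:
            return False
    return True


def adjust_depth(accesses, depth, max_depth):
    """Increase `depth` until the single-thread property holds,
    without going beyond `max_depth`."""
    while depth < max_depth and not single_thread_at_depth(accesses, depth):
        depth += 1
    return depth


# Two threads and one outer schedule dimension i; thread t accesses
# element (t + i) % 2, so the owner of an element depends on i.
accesses = [((i,), t, (t + i) % 2) for i in range(2) for t in range(2)]
```

In this example the property holds once i is fixed (depth 1) but not
at depth 0, so a copy lifted to depth 0 would be moved back to depth 1.
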
Reported-by: Darte Alain <alain.darte@ens-lyon.fr>
Signed-off-by: Sven Verdoolaege <skimo@kotnet.org>