gpu backend: create single kernel for entire subtree without permutable bands
The gpu backend looks for a subtree of the schedule with permutable bands
and then maps the outermost permutable bands in that tree to the device.
There may, however, be further subtrees of the selected subtree that do
not contain any permutable band. In this case, the leaves of those
subtrees are mapped to the device. If those leaves have outer
(non-permutable) bands in the selected subtree, then a separate
kernel is invoked for each instance of those bands.
Since kernel invocation comes with some overhead, it should be
more efficient to map the entire subtree without permutable bands
to the device as a whole. The resulting kernel will not be very
efficient, but it should be better than repeatedly invoking
a smaller inefficient kernel.
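
As a rough illustration of the difference in launch counts, the
following host-only sketch simulates both mappings. All names
(leaf_kernel, subtree_kernel, launches, N) are hypothetical and are not
part of the backend; the loop over i merely stands in for an outer
non-permutable band, and the "kernels" are plain C functions that
count how often they are invoked.

```c
#define N 8

static int launches;    /* number of simulated kernel launches */
static int sum[N + 1];  /* sum[i + 1] = sum[i] + i */

/* Hypothetical kernel for a single leaf instance. */
static void leaf_kernel(int i)
{
    launches++;
    sum[i + 1] = sum[i] + i;
}

/* Hypothetical kernel for the entire subtree: the loop that stands in
 * for the non-permutable band is executed inside the kernel. */
static void subtree_kernel(void)
{
    int i;

    launches++;
    for (i = 0; i < N; ++i)
        sum[i + 1] = sum[i] + i;
}

/* Leaves mapped to the device: the host iterates over the outer band
 * and launches one kernel per band instance. */
int launches_per_leaf(void)
{
    int i;

    launches = 0;
    sum[0] = 0;
    for (i = 0; i < N; ++i)
        leaf_kernel(i);
    return launches;
}

/* Whole subtree mapped to the device: a single launch. */
int launches_per_subtree(void)
{
    launches = 0;
    sum[0] = 0;
    subtree_kernel();
    return launches;
}
```

Both functions compute the same values in sum; only the number of
simulated launches differs (N versus 1). The sketch says nothing about
the cost of an actual launch.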
Signed-off-by: Sven Verdoolaege <skimo@kotnet.org>