gpu: avoid mapping initial non-permutable bands to the device

If the outer node of the schedule tree is a sequence and the initial
children of this sequence do not have any permutable bands,
then there is no point in including these initial children
in the part that is mapped to the device.
Instead, these initial children can be run on the CPU and
any results produced can be copied to the device.
This should be cheaper than running one or more single-instance kernels
on the device.
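
As a rough illustration of the split described above, the following
sketch counts how many initial children of an outer sequence could be
left on the host. It is not the code in this commit and does not use the
isl schedule tree API: "struct node", has_permutable_band() and
n_initial_host_children() are hypothetical names over a simplified tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a schedule tree node;
 * this is an assumption for illustration, not the isl API.
 */
enum node_type { NODE_SEQUENCE, NODE_BAND, NODE_LEAF };

struct node {
	enum node_type type;
	int permutable;		/* only meaningful for NODE_BAND */
	size_t n_children;
	struct node **children;
};

/* Does the subtree rooted at "node" contain any permutable band? */
static int has_permutable_band(const struct node *node)
{
	size_t i;

	assert(node);
	if (node->type == NODE_BAND && node->permutable)
		return 1;
	for (i = 0; i < node->n_children; ++i)
		if (has_permutable_band(node->children[i]))
			return 1;
	return 0;
}

/* Return the number of initial children of "root" that do not
 * contain any permutable band and could therefore be run on the host.
 * If "root" is not a sequence, then nothing is split off.
 * If no child contains a permutable band, then all children are
 * counted; how that case should be handled is left to the caller.
 */
static size_t n_initial_host_children(const struct node *root)
{
	size_t i;

	assert(root);
	if (root->type != NODE_SEQUENCE)
		return 0;
	for (i = 0; i < root->n_children; ++i)
		if (has_permutable_band(root->children[i]))
			break;
	return i;
}
```

In this sketch, the children before the returned position would be
executed on the CPU, while the remaining children would form the part
that is mapped to the device.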

Signed-off-by: Sven Verdoolaege <skimo@kotnet.org>