gpu: also add synchronization after writes to shared memory from core

By default, the mapping to shared memory is computed at the level
that is mapped to threads and the copying is only hoisted up through
schedule dimensions that do not affect the shared memory tile offset.
It is therefore unlikely that the shared memory elements are accessed
by different threads in successive iterations of these intermediate
schedule dimensions, but it is still possible in theory.

Add synchronization after the core if it contains any writes to shared
memory that require synchronization.  This guards against such cases
and also against future cases where the mapping to shared memory could
be computed at a different level.
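
As a rough illustration of the situation described above, the following
hand-written CUDA sketch (not actual generated output; the kernel name
`rotate_accumulate`, the tile size `TILE`, the array `a` and the loop
bound `n_steps` are all hypothetical) has an intermediate loop that does
not affect the tile offset, but does change which thread touches which
shared memory element. The barrier inside the loop plays the role of the
synchronization after the core:

```cuda
#include <cuda_runtime.h>

#define TILE 32  /* hypothetical tile size; assumes blockDim.x == TILE */

/*
 * Hypothetical kernel.  Assumes "a" holds gridDim.x * TILE elements and
 * that n_steps has the same value in every thread of the block.
 */
__global__ void rotate_accumulate(float *a, int n_steps)
{
    __shared__ float shared_a[TILE];
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;  /* tile offset, independent of c */

    /* Copy-in, hoisted out of the intermediate loop. */
    shared_a[t] = a[base + t];
    __syncthreads();

    /* Intermediate schedule dimension: does not change the tile offset. */
    for (int c = 0; c < n_steps; ++c) {
        /*
         * Core: within one iteration every thread updates a distinct
         * element, but the element a given thread updates depends on c,
         * so successive iterations access the same element from
         * different threads.
         */
        shared_a[(t + c) % TILE] += 1.0f;

        /*
         * Synchronization after the core: without it, a thread could
         * start iteration c + 1 on an element whose update from
         * iteration c has not completed yet.
         */
        __syncthreads();
    }

    /* Copy-out; the last barrier in the loop also protects these reads. */
    a[base + t] = shared_a[t];
}
```

With the barrier in place, every element of the tile is incremented
exactly n_steps times; without it, the result would depend on thread
timing.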
Reported-by: Tobias Grosser <tobias@grosser.es>
Signed-off-by: Sven Verdoolaege <skimo@kotnet.org>