cuda: allow copy to/from shared memory outside of outermost loop