sh: fix endianess and optimise the SH4 memcpy
This patch fixes the big-endian code and adds a new optimization
only for little endian mode.
This optimization is based on prefetching and 64bit data transfer via FPU.
Tests shows that
----------------------------------------
Memory bandwidth | Gain
| sh4-300 | sh4-200
----------------------------------------
512 bytes to 16KiB | ~20% | ~25%
from 32KiB to 16MiB | ~190% | ~5%
----------------------------------------
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Carmelo Amoroso <carmelo.amoroso@st.com>