add optimized aarch64 memcpy and memset

these are based on the ARM optimized-routines repository v20.05
(ef907c7a799a), with macro dependencies flattened out and memmove code
removed from memcpy. this change is somewhat unfortunate since having
the branch for memmove support in the large n case of memcpy is the
performance-optimal and size-optimal way to do both, but keeping it makes
memcpy alone (static-linked) about 40% larger and suggests a policy
that use of memcpy as memmove is supported.

tabs used for alignment have also been replaced with spaces.