RISC-V: Support highpart register overlap for vwcvt
Since Richard recently added support for register filters, we can now support highpart
register overlap for widening RVV instructions.
This patch supports it for the vwcvt intrinsics.
I started from real application user code:
https://github.com/riscv/riscv-v-spec/issues/929
https://godbolt.org/z/xoeGnzd8q
This real application code uses LMUL = 8 with unrolling to get optimal performance
for a specific library.
As the codegen shows, GCC already generates optimal code for it because we support
lowpart register overlap for narrowing instructions (dest EEW < source EEW).
Starting with this patch, we also support highpart register overlap for widening
instructions (dest EEW > source EEW).
The same intrinsic code, adapted to vwcvt:
https://godbolt.org/z/1TMPE5Wfr
size_t
foo (char const *buf, size_t len)
{
  size_t sum = 0;
  size_t vl = __riscv_vsetvlmax_e8m8 ();
  size_t step = vl * 4;
  const char *it = buf, *end = buf + len;
  for (; it + step <= end;)
    {
      vint8m4_t v0 = __riscv_vle8_v_i8m4 ((void *) it, vl);
      it += vl;
      vint8m4_t v1 = __riscv_vle8_v_i8m4 ((void *) it, vl);
      it += vl;
      vint8m4_t v2 = __riscv_vle8_v_i8m4 ((void *) it, vl);
      it += vl;
      vint8m4_t v3 = __riscv_vle8_v_i8m4 ((void *) it, vl);
      it += vl;
      asm volatile("nop" ::: "memory");
      vint16m8_t vw0 = __riscv_vwcvt_x_x_v_i16m8 (v0, vl);
      vint16m8_t vw1 = __riscv_vwcvt_x_x_v_i16m8 (v1, vl);
      vint16m8_t vw2 = __riscv_vwcvt_x_x_v_i16m8 (v2, vl);
      vint16m8_t vw3 = __riscv_vwcvt_x_x_v_i16m8 (v3, vl);
      asm volatile("nop" ::: "memory");
      size_t sum0 = __riscv_vmv_x_s_i16m8_i16 (vw0);
      size_t sum1 = __riscv_vmv_x_s_i16m8_i16 (vw1);
      size_t sum2 = __riscv_vmv_x_s_i16m8_i16 (vw2);
      size_t sum3 = __riscv_vmv_x_s_i16m8_i16 (vw3);
      sum += sumation (sum0, sum1, sum2, sum3);
    }
  return sum;
}
Before this patch:
...
csrr t0,vlenb
...
vwcvt.x.x.v v16,v8
vwcvt.x.x.v v8,v28
vs8r.v v16,0(sp) ---> spill
vwcvt.x.x.v v16,v24
vwcvt.x.x.v v24,v4
nop
vsetvli zero,zero,e16,m8,ta,ma
vmv.x.s a2,v16
vl8re16.v v16,0(sp) ---> reload
...
csrr t0,vlenb
...
Note the heavy spill and reload inside the loop body.
After this patch:
...
vwcvt.x.x.v v8,v12
vwcvt.x.x.v v16,v20
vwcvt.x.x.v v24,v28
vwcvt.x.x.v v0,v4
...
The codegen is optimal after this patch.
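For reference, the overlap that makes the allocation above legal (for example, destination
group v8..v15 with source group v12..v15) can be sketched as a small standalone check.
This is only an illustration of the RVV overlap rule for widening operations, not code
from this patch: the helper name widening_overlap_ok and its parameters are hypothetical,
and it assumes the source EMUL is at least 1, so each group is a whole number of registers.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of GCC): returns true if a widening
   instruction may use destination group [dest, dest + dest_nregs) together
   with source group [src, src + src_nregs).
   Assumption: source EMUL >= 1 and src_nregs <= dest_nregs.  */
static bool
widening_overlap_ok (unsigned dest, unsigned dest_nregs,
                     unsigned src, unsigned src_nregs)
{
  /* Disjoint register groups are always allowed.  */
  if (src + src_nregs <= dest || dest + dest_nregs <= src)
    return true;

  /* Otherwise the source must sit in the highest-numbered part of the
     destination group, i.e. both groups end at the same register.  */
  return src + src_nregs == dest + dest_nregs;
}
```

With this check, (dest v8, m8; source v12, m4) is accepted, while (dest v8, m8; source v8, m4)
is rejected because the overlap is in the lowpart of the destination group.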
Tested on zvl128b with no regressions.
I am going to test zve64d/zvl256b/zvl512b/zvl1024b.
OK for trunk if the testing above shows no regressions?
Co-authored-by: kito-cheng <kito.cheng@sifive.com>
Co-authored-by: kito-cheng <kito.cheng@gmail.com>
PR target/112431
gcc/ChangeLog:
* config/riscv/constraints.md (TARGET_VECTOR ? V_REGS : NO_REGS): New register filters.
* config/riscv/riscv.md (no,W21,W42,W84,W41,W81,W82): Ditto.
(no,yes): Ditto.
* config/riscv/vector.md: Support highpart register overlap for vwcvt.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/pr112431-1.c: New test.
* gcc.target/riscv/rvv/base/pr112431-2.c: New test.
* gcc.target/riscv/rvv/base/pr112431-3.c: New test.