Thanks for the VPU insights!
I managed to implement a 64x64 transpose as four individual 32x32 transpose steps, each with its data preloaded one step ahead. By my crude measurements (is there a better way than externally timing how long the executecode mailbox call takes?), it transposes at around 350MB/s. Not sure if that's good or not, but removing the transpose-related instructions and leaving only the ld/st now makes no difference. Nice. Thanks!
There is one other trick you can use.
[...]
You should then find your code takes no longer than a vld/vst memcpy, as any other instructions are run while waiting for the loads to arrive.
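For anyone following along, here is a rough plain-C reference for the "64x64 as four 32x32 steps" decomposition mentioned above. It is only a sketch for checking results on the ARM side, not the VPU code itself: the byte-sized elements, the out-of-place layout, and the names `transpose_block32` and `transpose64` are all assumptions made for illustration. The "preload one step ahead" part has no real equivalent in portable C, so it only appears as a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: a 64x64 matrix of bytes, processed in 32x32 blocks. */
enum { N = 64, B = 32 };

/* Transpose one 32x32 block: dst[c][r] = src[r][c] within the block.
 * Both pointers address the top-left element of their block, and
 * stride is the row length of the full matrix. */
static void transpose_block32(const uint8_t *src, uint8_t *dst, size_t stride)
{
    for (size_t r = 0; r < B; r++)
        for (size_t c = 0; c < B; c++)
            dst[c * stride + r] = src[r * stride + c];
}

/* Out-of-place 64x64 transpose as four 32x32 steps.
 * The block at (br, bc) in src lands at (bc, br) in dst, so the two
 * off-diagonal blocks swap places and each block is transposed.
 * On the VPU, the loads for the next block would be issued before
 * the current block is transposed; that overlap is not modelled here. */
static void transpose64(const uint8_t *src, uint8_t *dst)
{
    assert(src != dst); /* this sketch does not support in-place use */
    for (size_t br = 0; br < N; br += B)
        for (size_t bc = 0; bc < N; bc += B)
            transpose_block32(src + br * N + bc, dst + bc * N + br, N);
}
```

Comparing the VPU output against `transpose64` on a buffer with a non-repeating pattern should catch swapped or untransposed blocks.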
Statistics: Posted by dividuum — Mon Feb 24, 2025 7:38 pm