Thanks. I'll see if it's worth getting that done too. This is all quite a deep rabbit hole to get lost in while optimizing stuff. My brain is melting, but the instruction set is quite fun. Right now I'm dynamically using both 16x16 and 16x64 transposers. A transpose now takes between 10 and 14ms, so it's fast enough to do 1080p60.
I believe VC has a 256-bit width path to memory, so 16-bit loads/stores are optimal.
With a 16-bit load, the ls-byte is loaded to H(0,0) and ms-byte to H(0,16), so you need
to shuffle before writing back.
I think you want the vinterl/vinterh pair for getting the bytes into natural order in VRF after the v16ld.
Then the veven/vodd pair for getting the bytes back into a format suitable for a v16st.
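To make the byte shuffling above a bit more concrete, here is a rough model in plain Python. It is only a sketch: the function names are made up for illustration, the lists stand in for the two 16-byte halves of a VRF row, and the exact operand semantics of the real VPU instructions are not taken from this thread. It also assumes little-endian memory, so the ls-byte of each 16-bit element sits at the even address.

```python
def load16_layout(mem):
    """Model a 16-bit vector load of 16 elements (32 bytes of memory).

    Assumption: ls-bytes land in the half at H(0,0), ms-bytes in the
    half at H(0,16), as described in the post.
    """
    assert len(mem) == 32
    lo = mem[0::2]  # least-significant bytes
    hi = mem[1::2]  # most-significant bytes
    return lo, hi


def interleave_low(a, b):
    """Interleave the first halves of a and b (hypothetical helper)."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out


def interleave_high(a, b):
    """Interleave the second halves of a and b (hypothetical helper)."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[half:], b[half:]):
        out += [x, y]
    return out


def even(a, b):
    """Even-indexed elements of the concatenation a + b."""
    return (a + b)[0::2]


def odd(a, b):
    """Odd-indexed elements of the concatenation a + b."""
    return (a + b)[1::2]


mem = list(range(32))                 # bytes 0..31 as they sit in memory
lo, hi = load16_layout(mem)           # layout after the 16-bit load
nat_lo = interleave_low(lo, hi)       # bytes 0..15 in natural order
nat_hi = interleave_high(lo, hi)      # bytes 16..31 in natural order
back_lo = even(nat_lo, nat_hi)        # back to the ls-byte half
back_hi = odd(nat_lo, nat_hi)         # back to the ms-byte half
```

The point is just that interleave followed by even/odd is a round trip, so whatever per-byte work happens in between sees the bytes in memory order.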
I've seen this mentioned a few times. Is the physical memory aliased and some addresses don't use the cache? But that's only possible on the Pi3 with its 1GB memory, correct? From looking at the addresses, right now it seems the physical addresses I get are in the 0xeXXXXXXX range. What's with those? None of the vc_sm_cma_ioctl_import_dmabuf.cached settings seem to make a difference.
And in this case you should be using the 0xCxxxxxxx alias to bypass the cache anyway (a cache miss is slower than a bypass cache address).
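For what it's worth, here is a small sketch of what "using the 0xCxxxxxxx alias" could look like as address arithmetic. It assumes the top two bits of a 32-bit bus address select the alias and the lower 30 bits are the offset into memory, which is how the BCM2835 ARM Peripherals document describes the bus address map; I have not verified this against the firmware. The helper names and the example addresses are made up.

```python
ALIAS_SHIFT = 30            # assumption: alias lives in the top two bits
OFFSET_MASK = 0x3FFFFFFF    # lower 30 bits: offset into memory
UNCACHED_ALIAS = 0xC0000000 # the 0xC alias mentioned in the post


def alias_bits(bus_addr):
    """Return the two alias bits of a 32-bit bus address."""
    return (bus_addr >> ALIAS_SHIFT) & 0x3


def to_uncached(bus_addr):
    """Rewrite a bus address so it points at the 0xC alias."""
    return (bus_addr & OFFSET_MASK) | UNCACHED_ALIAS


# Hypothetical example addresses, not taken from a real system.
print(hex(to_uncached(0x40001000)))
print(bin(alias_bits(0xE0001000)))
```

Under that assumption, an address in the 0xeXXXXXXX range already has both alias bits set, so it would be the 0xC alias with an offset in the upper half of a 1GB address space, and `to_uncached` leaves it unchanged.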
That's fine for me. It's only on the Pi3 that I struggle to get the performance back to the old firmware levels. With this VPU transpose, I think I'm basically there, and on the Pi4 and 5 the GPU is fast enough to do the rotation for me.
Pi5 has scaled back the VPU significantly. Getting hold of VPU addresses is trickier. vc_sm_cma is scaling back what is available to userspace, particularly as it gets upstreamed.
Might be interesting to have an official alternative that helps with implementing a 90/270 degree rotation for DRM planes. For now I no longer immediately need that, so I'm good.
Implementing a V4L2 rotation M2M device using the VPU for any 8bpc symmetrical (i.e. not YUV422) colour format probably wouldn't be too difficult. As Dom says, the VPU can load the VRF in nice efficient bursts, and save out with rotation with equally efficient bursts. I haven't messed with the firmware for a while, and other priorities may mean it takes a while to get implemented.
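As a reference for what such a rotation has to compute, here is a minimal CPU-side sketch of a 90 degree clockwise rotation done tile by tile. It is not the VPU code and not anything from the firmware: the function name is made up, the 16-pixel tile size just mirrors the 16x16 transposer mentioned earlier, and it assumes a single 8-bit plane in row-major order with no stride padding.

```python
def rotate90_tiled(src, width, height, tile=16):
    """Rotate a row-major 8-bit plane 90 degrees clockwise, tile by tile.

    The result has `height` columns and `width` rows.
    """
    assert len(src) == width * height
    dst = bytearray(width * height)
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Handle one tile; edge tiles may be smaller than `tile`.
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    # Source pixel (x, y) lands in output row x,
                    # output column (height - 1 - y).
                    dst[x * height + (height - 1 - y)] = src[y * width + x]
    return bytes(dst)


# 3x2 input:   1 2 3      rotated clockwise:   4 1
#              4 5 6                           5 2
#                                              6 3
small = rotate90_tiled(bytes([1, 2, 3, 4, 5, 6]), 3, 2)
```

Working in tiles keeps both the reads and the writes confined to a small block at a time, which is the same idea as loading a block into the VRF and storing it back out rotated.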
I also noticed that setting the scaling governor to 'performance' is a lot better than 'ondemand'. I guess the VPU shares a clock somewhere and is faster if the ARM side is clocked higher?
Statistics: Posted by dividuum — Fri Feb 21, 2025 7:27 pm