Interleaving your posts for reply. May be disjoint...
teh_orph wrote:tufty wrote:
Have you benchmarked the difference between doing strd (i.e. 2 registers at a time), and various number of registers at a time using stm? I assume you have, as you're doing 4-register writes rather than 8.
Yes. Eight reg stm gave about 1.1 GB/s, 2*four reg stm gives just shy of 1.4 GB/s. Store dual (strd) was 800 MB/s IIRC? I will check. I'm unsure of why eight-reg stm is slower than four-reg stm on memset, but not memcpy.
4*strd = 276 MB/s (r4, r5 * 4)
4*2 reg stmia = 199 MB/s (r4, r5 * 4)
2*4 reg stmia = 1389 MB/s (r1, r4, r5, r6 * 2)
2*4 reg stmia = 1385 MB/s (1, 4, 5, 6, 7, 8, 9, 10)
Interesting that the strd and 2-register stms are so different. According to the TRM they should be the same in the best case, i.e. 1 cycle, and have the same latencies, but with strd falling behind stm in the worst case.
Looks like 2x4 is the sweet spot for memset, but I don't see why. A single 8-register stm should beat it in purely performance terms, especially for memset where you're not locking your register values. Or maybe that's it? Two interleaved 4-register writes meaning that none of the registers are locked at any point you want to use them?
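For anyone wanting to reproduce figures like these, here's a rough sketch of a timing harness in C. It's not what teh_orph used; the names time_fill and throughput_mb_per_s are made up for illustration, and I'm assuming 1 MB = 10^6 bytes, which the numbers above don't specify. clock() measures CPU time, which is probably close enough for a single-threaded fill loop.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by memset and any replacement being compared. */
typedef void *(*fill_fn)(void *, int, size_t);

/* Assumption: 1 MB = 10^6 bytes. Returns 0 if the interval was too
   short for the clock to register. */
static double throughput_mb_per_s(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / 1e6 / seconds : 0.0;
}

/* Hypothetical harness: run `fill` over the buffer `passes` times and
   report the average throughput. */
static double time_fill(fill_fn fill, void *buf, size_t len, int passes)
{
    /* Calling through a volatile pointer stops the compiler from
       merging or dropping the repeated calls. */
    fill_fn volatile f = fill;

    clock_t start = clock();
    for (int i = 0; i < passes; i++)
        f(buf, i, len);
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return throughput_mb_per_s((double)len * passes, seconds);
}
```

Something like time_fill(memset, buf, len, 1000) with a buffer much larger than the cache should give numbers that can be compared between variants, though not necessarily with the figures quoted here.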
64 bit or 64 byte? My stuff 32 *byte* aligns the data for a definite significant win.
Yeah, 64 bit (doubleword aligned). Brainfart on my part, it's obvious your code is 32 byte aligned.
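To spell out what "32 byte aligns the data" means in practice, here's a minimal C sketch of the idea: write single bytes until the destination hits a 32-byte boundary, then do the bulk in 32-byte blocks. This is my own illustration, not teh_orph's code; head_bytes_to_align32 and memset_aligned32 are hypothetical names, and the inner byte loop just stands in for the two 4-register stmia stores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many leading bytes must be written before
   dst reaches the next 32-byte boundary (0 if already aligned). */
static size_t head_bytes_to_align32(const void *dst)
{
    uintptr_t addr = (uintptr_t)dst;
    return (size_t)((32u - (addr & 31u)) & 31u);
}

/* Sketch of a memset that aligns the destination to 32 bytes first. */
static void *memset_aligned32(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char v = (unsigned char)value;

    /* Head: byte stores up to the boundary (or until we run out). */
    size_t head = head_bytes_to_align32(p);
    if (head > n)
        head = n;
    n -= head;
    while (head--)
        *p++ = v;

    /* Body: p is now 32-byte aligned; one 32-byte block per pass.
       In the assembly version this is where the stmia pair goes. */
    while (n >= 32) {
        for (int i = 0; i < 32; i++)
            p[i] = v;
        p += 32;
        n -= 32;
    }

    /* Tail: whatever is left, fewer than 32 bytes. */
    while (n--)
        *p++ = v;
    return dst;
}
```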
memcpy :
Interesting, re: the timings. I'll re-time with these different layouts and see what I get.
What's interesting is that
ldm r1!, {stuff}
pld [r1, #12]
the pld takes a big hit if directly after the ldm...because r1 is updated quite late? Moving it far away improves the pld performance quite a bit. (I'm looking at oprofile results btw). Yet r1! can't be the culprit...as I do back-to-back ldm r1!s in memset and it's faster! Wanna guess??
My guess would be that back-to-back ldmia r1!, {...} don't need the value of r1 to be ready, but pld does.
Give this a shot, just for kicks (if you have time, of course). Brings the preload up to a point where the r1 value should be available, and gives a maximum distance between it and the next time we touch r1. The subs only needs to be 2 cycles away from the branch to get best case performance anyway, so it might as well be used to give the preload time to take effect.
Code:
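fast_loop:
ldmia r1!, {r4-r7}     @ load first 16 bytes, advance source
ldmia r1!, {r8-r11}    @ load next 16 bytes
stmia r0!, {r4-r7}     @ store first 16 bytes, advance destination
pld [r1, #128]         @ preload ahead, r1 should be settled by now
subs r3, #32           @ 32 bytes done; sets flags for the bne
stmia r0!, {r8-r11}    @ store next 16 bytes
bne fast_loop          @ loop until r3 reaches zero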
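If it helps to see the ordering without the register noise, here's roughly the same loop as a C sketch. Assumptions: the byte count is a non-zero multiple of 32 (the subs/bne pair assumes the same), __builtin_prefetch is the GCC/Clang builtin standing in for pld, and copy_blocks32 is just a name I've made up. The compiler is free to reorder all of this, so it only illustrates the intended sequence and won't reproduce the timings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: n must be a non-zero multiple of 32. */
static void copy_blocks32(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    do {
        uint32_t a[4], b[4];

        memcpy(a, s, 16);        /* ldmia r1!, {r4-r7}  */
        memcpy(b, s + 16, 16);   /* ldmia r1!, {r8-r11} */
        s += 32;

        memcpy(d, a, 16);        /* stmia r0!, {r4-r7}  */

        /* pld [r1, #128]: a hint only, never dereferenced. The address
           is formed as an integer so that running past the end of the
           source buffer is not pointer arithmetic out of bounds. */
        __builtin_prefetch((const void *)((uintptr_t)s + 128));

        n -= 32;                 /* subs r3, #32 */
        memcpy(d + 16, b, 16);   /* stmia r0!, {r8-r11} */
        d += 32;
    } while (n != 0);            /* bne fast_loop */
}
```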
Simon