pipe
Posts: 4
Joined: Mon Feb 19, 2024 10:13 pm
Location: Gothenburg, Sweden

Optimal alignment between RAM buffers for fastest DMA copy?

Mon Feb 19, 2024 11:04 pm

I would like to copy one fixed-size SRAM buffer to another SRAM buffer using one DMA channel as quickly as possible, and I was wondering what relative offset the two buffers should have to each other to avoid both DMA read and write trying to hit the same bank on the same cycle?

Code:

static uint32_t X[1000];
static uint32_t Y[1000];

dma_channel_config c = dma_channel_get_default_config(0);
channel_config_set_write_increment(&c, true);    // read increment and 32-bit size are already the defaults
dma_channel_configure(0, &c, Y, X, 1000, true);  // dest, src, word count, start immediately
From what I understand the DMA will read and write in the same cycle, but it cannot possibly read and write with zero delay, so it must have a pipeline of some sort. I could not find how long this pipeline is. Ideally I want it to run something like this:
  1. Read SRAM0
  2. Read SRAM1, Write SRAM0
  3. Read SRAM2, Write SRAM1
  4. ...
What I worry about is that with a random offset between buffers X and Y there is a 25% chance that my DMA transfer goes like this:
  1. Read SRAM0
  2. Read SRAM1, (try to write SRAM1, stall)
  3. Write SRAM1
  4. Read SRAM2, (try to write SRAM2, stall)
  5. Write SRAM2
  6. ...
I can ensure that my buffers are aligned correctly, but I don't know how much alignment is needed. I guess I could benchmark it, but is there some more detail about the inner workings?

alastairpatrick
Posts: 753
Joined: Fri Apr 22, 2022 1:39 am
Location: USA

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Tue Feb 20, 2024 2:53 am

This post explains the DMA pipeline. Screenshot below.
[Attachment: pipe.JPG (screenshot)]
Perhaps you have an application for the address range where SRAM is not striped? There is a linker script in the SDK to utilize it.
[Attachment: unstriped.JPG (screenshot)]
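If you do go that route, the rough idea (ignoring the linker-script plumbing and just taking the non-striped aliases straight from the datasheet memory map) is to put the source and destination in two different physical banks, so the DMA read and write can never target the same bank regardless of offset:

Code:

// Sketch only: 0x21000000 and 0x21010000 are the non-striped aliases of SRAM0
// and SRAM1 (64 KB each) in the RP2040 memory map. The rest of the program must
// not be using those same physical banks through the striped mapping - that is
// what the SDK linker script takes care of.
uint32_t *src = (uint32_t *)0x21000000u;  // physical bank SRAM0
uint32_t *dst = (uint32_t *)0x21010000u;  // physical bank SRAM1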

arg001
Posts: 641
Joined: Tue Jan 23, 2018 10:06 am

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Tue Feb 20, 2024 10:11 am

This should be easy enough to test - you'd expect the "wrong" alignment to take precisely twice as long as the other alignments (and with your test program running from flash and the default stack in scratch RAM, the CPU shouldn't touch the main RAM, so a simple program should give accurate answers).

There are only four alignments to test: the striping pattern repeats at intervals of 16 bytes (or to put it another way, only the last hex digit of the addresses matters), and you are obviously doing 32-bit transfers if you care about performance, so the addresses are word-aligned. Hence the four cases are offsets of 16n, 16n+4, 16n+8 and 16n+12.

My reading of the pipeline description above is that the reads and writes occur 2 cycles apart, so the 'bad' case is 16n+8.

If I'm right, then this is quite convenient: just allocate all of your buffers aligned to a multiple of 16 and you are guaranteed to avoid the clash.
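With GCC or Clang that's just an alignment attribute on the buffers from the first post, e.g.:

Code:

static uint32_t X[1000] __attribute__((aligned(16)));
static uint32_t Y[1000] __attribute__((aligned(16)));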

kilograham
Raspberry Pi Engineer & Forum Moderator
Posts: 1540
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Wed Feb 21, 2024 3:59 pm

Don't quote me on this, but I don't think it makes any difference (bar a cycle) because if they are aligned, it will stall a cycle and then they won't be.

arg001
Posts: 641
Joined: Tue Jan 23, 2018 10:06 am

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Wed Feb 21, 2024 11:00 pm

kilograham wrote:
Wed Feb 21, 2024 3:59 pm
Don't quote me on this, but I don't think it makes any difference (bar a cycle) because if they are aligned, it will stall a cycle and then they won't be.
I don't think that's right in the simple case (it might be if there are multiple DMA channels active).

The stall has to delay the whole pipeline - so although the next write will be delayed, the next read will be as well.

I was sufficiently interested to write a test program, and it seems like I'm half right!

Code:

#include "pico/stdlib.h"
#include "pico/bootrom.h"
#include "hardware/watchdog.h"
#include "hardware/clocks.h"
#include "hardware/dma.h"
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SIZE    8192        // 8K words = 32K bytes
uint32_t buf1[BUF_SIZE];
uint32_t buf2[BUF_SIZE];

// Which DMA channel to use?
#define DMA_CH  0

int main(void)
{
    dma_channel_config cfg;

    watchdog_enable(5000, true);    // Enable watchdog with 5sec timeout
 
    stdio_init_all();

    cfg = dma_channel_get_default_config(DMA_CH);
    channel_config_set_read_increment(&cfg, true);
    channel_config_set_write_increment(&cfg, true);
    channel_config_set_transfer_data_size(&cfg, DMA_SIZE_32);

    for (;;)
    {
        unsigned time1, time2, time3, time4, time5;
        int ch;
        printf("Press B for bootrom, any other key to start test\n");
        while ((ch = getchar_timeout_us(1000)) == PICO_ERROR_TIMEOUT)
            watchdog_update();
        if (ch == 'B') reset_usb_boot(1u << PICO_DEFAULT_LED_PIN, 0);

        printf("Buffers are at %p and %p\n", buf1, buf2);

        time1 = time_us_32();
        dma_channel_configure(DMA_CH, &cfg, buf1, buf2, BUF_SIZE - 4, true);
        dma_channel_wait_for_finish_blocking(DMA_CH);
        time2 = time_us_32();
        dma_channel_configure(DMA_CH, &cfg, buf1, buf2 + 1, BUF_SIZE - 4, true);
        dma_channel_wait_for_finish_blocking(DMA_CH);
        time3 = time_us_32();
        dma_channel_configure(DMA_CH, &cfg, buf1, buf2 + 2, BUF_SIZE - 4, true);
        dma_channel_wait_for_finish_blocking(DMA_CH);
        time4 = time_us_32();
        dma_channel_configure(DMA_CH, &cfg, buf1, buf2 + 3, BUF_SIZE - 4, true);
        dma_channel_wait_for_finish_blocking(DMA_CH);
        time5 = time_us_32();
        printf("Time with aligned buffers %u\n", time2 - time1);
        printf("Time with buf + 1 %u\n", time3 - time2);
        printf("Time with buf + 2 %u\n", time4 - time3);
        printf("Time with buf + 3 %u\n", time5 - time4);
    }
}
This gives results:

Code:

Time with aligned buffers 67
Time with buf + 1 66
Time with buf + 2 99
Time with buf + 3 66
This has two static buffers of 8K words (32K bytes). The DMA each time transfers 8K-4 words. The destination is always buf1, the source is variously buf2, buf2+1, buf2+2, buf2+3 (buffer of uint32_t, so +1 offsets the address by 4). CLK_SYS is at the default 125MHz.

USB is running for the printf output, and I'm only timing it with the 1us-resolution timer, so there's a bit of jitter, but in repeated runs three of the results are around 66/67 and the +1 case is always around 99/100.

If the striping is working perfectly, then the DMA should do one word per CLK_SYS, so the whole transfer of 8192-4 words should take (8192-4)/125 = 65.5us, plus a few cycles for starting the DMA and reading the clock.

So the three cases I expected to be stall-free indeed come out precisely at the expected value.

However, the 'bad' case, which I expected to take twice as long (with the read and write no longer overlapping, each transfer would take two cycles), is in fact only 50% worse.

So somehow, after the stall the two accesses get de-synchronized and the next one completes without stalling, but then they are back in step and stall again on the third transfer. I haven't exactly worked out the mechanism for this, but it seems reasonable that it could happen.

cleverca22
Posts: 8828
Joined: Sat Aug 18, 2012 2:33 pm

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Thu Feb 22, 2024 12:14 am

arg001 wrote:
Wed Feb 21, 2024 11:00 pm
The stall has to delay the whole pipeline - so although the next write will be delayed, the next read will be as well.
The DMA on the RP2040 has 3 FIFOs involved.
Each time a DMA channel wants to perform a copy, it writes a source addr to the read FIFO, and a dest addr to the write FIFO.

When the AXI master port is able to, it will take an addr off the read FIFO and issue a read.
When a read comes back, it will push the data onto a 3rd FIFO, the data FIFO.
When the AXI master port is able to, it will take an addr from the write FIFO and a word from the data FIFO, and then issue the write.

A write could stall, and just let data pile up in the FIFOs.
However, the same port does both reads and writes, and I don't think it's async in nature,
so a stalled write means no more reads.
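To make the flow concrete, here's how I picture it (a toy sketch of the above, not the real RTL; the depth and names are made up):

Code:

#include <stdint.h>

typedef struct { int count; uint32_t word[8]; } fifo_t;  // depth is a guess

typedef struct {
    fifo_t read_addr;   // source addresses queued by the channel
    fifo_t write_addr;  // destination addresses queued by the channel
    fifo_t data;        // words already read back, waiting to be written out
} dma_engine_t;

// Roughly, per cycle on the single AXI master port:
//  - if write_addr and data both have entries and the write can be issued, pop one of each
//  - otherwise, if read_addr has an entry, pop it and issue the read; the returned word
//    gets pushed onto data a couple of cycles later
// Since one port does both, a write that cannot complete also holds up further reads.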

pipe
Posts: 4
Joined: Mon Feb 19, 2024 10:13 pm
Location: Gothenburg, Sweden

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Fri Feb 23, 2024 10:59 am

cleverca22 wrote: The DMA on the RP2040 has 3 FIFOs involved.
Each time a DMA channel wants to perform a copy, it writes a source addr to the read FIFO, and a dest addr to the write FIFO.

When the AXI master port is able to, it will take an addr off the read FIFO and issue a read.
When a read comes back, it will push the data onto a 3rd FIFO, the data FIFO.
When the AXI master port is able to, it will take an addr from the write FIFO and a word from the data FIFO, and then issue the write.

A write could stall, and just let data pile up in the FIFOs.
However, the same port does both reads and writes, and I don't think it's async in nature,
so a stalled write means no more reads.
Thanks everyone, this certainly gave me a lot more insight into the details of the DMA. It solves my current problem, but ruined another project I've been thinking of! :)

arg001 wrote:
Wed Feb 21, 2024 11:00 pm
I was sufficiently interested to write a test program, and it seems like I'm half right!

This gives results:

Code:

Time with aligned buffers 67
Time with buf + 1 66
Time with buf + 2 99
Time with buf + 3 66
This has two static buffers of 8K words (32K bytes). The DMA each time transfers 8K-4 words. The destination is always buf1, the source is variously buf2, buf2+1, buf2+2, buf2+3 (buffer of uint32_t, so +1 offsets the address by 4). CLK_SYS is at the default 125MHz.

If the striping is working perfectly, then the DMA should do one word per CLK_SYS, so the whole transfer of 8192-4 words should take (8192-4)/125 = 65.5us, plus a few cycles for starting the DMA and reading the clock.

So the three cases I expected to be stall-free indeed come out precisely at the expected value.

However, the 'bad' case, which I expected to take twice as long (with the read and write no longer overlapping, each transfer would take two cycles), is in fact only 50% worse.

So somehow, after the stall the two accesses get de-synchronized and the next one completes without stalling, but then they are back in step and stall again on the third transfer. I haven't exactly worked out the mechanism for this, but it seems reasonable that it could happen.
This is of course more than enough to demonstrate the issue, but I wanted to write my own benchmark since I need the practice working with the Pico. I set up a set of automatic transfers using DMA chaining, repeatedly taking a snapshot of the time between interrupts using the sysclock and calculating the average over 100000 runs for each alignment:

Code:

Starting run with 100000×60000 words from 20000830 to 20000820
100000 entries between 59805 and 60008, mean = 60007.7 cycles, 1.00013 cycles/word
Starting run with 100000×60000 words from 20000834 to 20000820
100000 entries between 60007 and 60021, mean = 60007.7 cycles, 1.00013 cycles/word
Starting run with 100000×60000 words from 20000838 to 20000820
100000 entries between 90005 and 90007, mean = 90005.7 cycles, 1.50009 cycles/word
Starting run with 100000×60000 words from 2000083C to 20000820
100000 entries between 60009 and 60019, mean = 60009.7 cycles, 1.00016 cycles/word
This result matches yours exactly. Copying between buffers with an offset of 8 bytes takes exactly 1.5 cycles per word; every other alignment takes 1.0 cycles.

I'm still not 100% sure about why it's 1.5 cycles, but I don't really care enough to go through every step - I'll just keep my buffers aligned to the same 16-byte offset and I'll be fine!
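For reference, the core of the measurement loop looks roughly like this (a stripped-down sketch, not the actual benchmark: it retriggers the channel from its completion IRQ instead of chaining, times with the microsecond timer instead of a cycle counter, and uses smaller buffers so two separate copies fit in RAM; the real code is linked below):

Code:

#include <stdio.h>
#include "pico/stdlib.h"
#include "hardware/dma.h"
#include "hardware/irq.h"

#define WORDS 16384                        // words copied per run

static uint32_t src[WORDS], dst[WORDS];
static int chan;
static volatile uint32_t last_us, run_us;

static void dma_handler(void) {
    dma_hw->ints0 = 1u << chan;            // acknowledge this channel's interrupt
    uint32_t now = time_us_32();           // the real code samples a cycle counter here
    run_us = now - last_us;                // duration of the run that just finished
    last_us = now;
    dma_channel_set_read_addr(chan, src, false);
    dma_channel_set_trans_count(chan, WORDS, false);
    dma_channel_set_write_addr(chan, dst, true);   // writing the trigger alias restarts it
}

int main(void) {
    stdio_init_all();

    chan = dma_claim_unused_channel(true);
    dma_channel_config cfg = dma_channel_get_default_config(chan);
    channel_config_set_read_increment(&cfg, true);
    channel_config_set_write_increment(&cfg, true);
    channel_config_set_transfer_data_size(&cfg, DMA_SIZE_32);

    dma_channel_set_irq0_enabled(chan, true);
    irq_set_exclusive_handler(DMA_IRQ_0, dma_handler);
    irq_set_enabled(DMA_IRQ_0, true);

    last_us = time_us_32();
    dma_channel_configure(chan, &cfg, dst, src, WORDS, true);  // dest, src, count, start

    for (;;) {
        sleep_ms(1000);
        printf("last run: %u us for %d words\n", (unsigned)run_us, WORDS);
    }
}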

Edit: Here is the (now modified) code to run the tests for the above: https://github.com/pipatron/pico-dma-co ... -alignment
Last edited by pipe on Sat Feb 24, 2024 1:29 am, edited 1 time in total.

arg001
Posts: 641
Joined: Tue Jan 23, 2018 10:06 am

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Fri Feb 23, 2024 3:22 pm

I think I now understand it. The DMA control logic feeds pairs of addresses (src,dest) to the engine that actually does the transfers, and that engine has a 4-stage pipeline - read-address, read-data, write-address, write-data.

If we assume that the write gets priority over the read, then the following will happen (each row is one clock cycle, each column is one of the four pipeline stages):

Code:

(read S0) (idle)    (idle)     (idle)
(read S1) (read S0) (idle)     (idle)
(read S2) (read S1) (write D0) (idle)      -- Read stalls if S2 and D0 are in the same bank
(read S2) (idle)    (write D1) (write D0)
(read S3) (read S2) (idle)     (write D1)
(read S4) (read S3) (write D2) (idle)      -- Read stalls if S4 and D2 are in the same bank
(read S4) (idle)    (write D3) (write D2)
(read S5) (read S4) (idle)     (write D3)
(read S6) (read S5) (write D4) (idle)      -- Read stalls if S6 and D4 are in the same bank
(read S6) (idle)    (write D5) (write D4)
(read S7) (read S6) (idle)     (write D5)
(read S8) (read S7) (write D6) (idle)      -- Read stalls if S8 and D6 are in the same bank
A classic pipeline bubble. That matches my test results - two transfers per three clock cycles (compared to 1 transfer per cycle if no stalling).
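As a sanity check, here is a toy host-side simulation of that four-stage model (my own sketch, nothing from the datasheet): each read and its matching write-address are two cycles apart, the write-address stage wins any bank clash, and a stalled read simply retries on the next cycle.

Code:

#include <stdio.h>
#include <stdint.h>

#define N      1000   // words per simulated transfer
#define BANKS  4      // striped SRAM: bank = (address >> 2) & 3

static int simulate(uint32_t src, uint32_t dst) {
    int write_due[N];                 // cycle on which each write-address is issued
    int cycle = 0, read_idx = 0, write_idx = 0;
    while (write_idx < N) {
        int write_bank = -1;
        if (write_idx < read_idx && write_due[write_idx] == cycle) {
            write_bank = ((dst + 4u * write_idx) >> 2) & (BANKS - 1);
            write_idx++;              // write-address stage claims its bank this cycle
        }
        if (read_idx < N) {
            int read_bank = ((src + 4u * read_idx) >> 2) & (BANKS - 1);
            if (read_bank != write_bank) {        // no clash: the read proceeds
                write_due[read_idx] = cycle + 2;  // its write-address follows 2 cycles later
                read_idx++;
            }                                     // same bank: the read stalls this cycle
        }
        cycle++;
    }
    return cycle;
}

int main(void) {
    for (uint32_t off = 0; off < 16; off += 4)
        printf("offset %2u: %.2f cycles/word\n", (unsigned)off,
               (double)simulate(0x20000000u, 0x20008000u + off) / N);
    return 0;
}

It prints 1.00 cycles/word for offsets 0, 4 and 12 and 1.50 for offset 8, which is exactly the measured behaviour.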

The case where read takes priority is harder to model. However, we have control over the priority in the bus_ctrl registers. I added this to my test program:

Code:

#include "hardware/structs/bus_ctrl.h"
...
    // Give priority to writes
    bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_W_BITS;
...
    // Give priority to reads
    bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_R_BITS;
The write priority had no effect on the test results - presumably that was the effective priority with the default setting (all equal).

The read priority was more interesting - the offset-by-two case is now always fast, but the offset-by-one case is sometimes slow (usually by the same 50%). Adding a delay in my program between the diagnostic printf()s and the start of the DMA made this go away, and all four cases became fast.

What I think is going on here is that there has to be an optional holding register between the read and write sides of the DMA: the read side has already committed to doing the read before it knows if the write side will be ready to take the word 2 cycles later, so it has to be able to stash the word somewhere (and then stall the next read if the holding register is full).

So in the offset-by-two case, it's as kilograham had in mind: there is obviously a collision at exactly the same place, but now it's resolved in favour of the read, so the holding register fills up; the offset between read and write is then different and no further collisions occur.

In the offset-by-one case, there shouldn't be a collision at all - and indeed under favourable circumstances there isn't. However, if a stall does occur for some other reason, such as a clash with the CPU (probably USB interrupts in my test program), then the holding register fills up and the continual DMA pressure never allows it to empty again - so again the offset between read and write changes for the rest of the transfer, except that this time it's harmful rather than helpful.


I don't think any of this changes the conclusions for normal use of the chip: ideally align all your buffers on 16-byte boundaries to guarantee avoiding this problem, but at least if it does happen it's only a 50% penalty rather than the 100% that we initially feared.

Setting the bus priority register can have significant impact on performance, but I suspect that it's almost impossible to use it in real life because the conditions to get an improvement are so specific (and a real program is likely to hit the opposite condition just as often).

(edit: removed some left-over text that accidentally got posted after I'd written the improved text above).
Last edited by arg001 on Sat Feb 24, 2024 10:23 am, edited 1 time in total.

dthacher
Posts: 1025
Joined: Sun Jun 06, 2021 12:07 am

Re: Optimal alignment between RAM buffers for fastest DMA copy?

Sat Feb 24, 2024 12:41 am

If you are using the non-striped mapping, 64KB is the offset. If the transfers are sequential you should be okay with any offset with striped RAM. Promote the DMA read and write priority to minimize conflicts. (This is somewhat dangerous.) Jitter is still somewhat possible. However, I am under the impression that sequential transfers should work in a pipeline. (This is what kilograham was getting at.)

DMA in the RP2040 is a singleton which is multiplexed. IO should almost always be a singleton. Most of the IO on the RP2040 is a singleton. However the memory bus supports concurrency. On many 32-bit microcontrollers the memory and IO are singletons. Many 8 and 16 bit controllers are concurrent.

Interesting that the DMA does not schedule the striping. However, you can promote the DMA channels slightly to sort this out.
