tankist
Posts: 17
Joined: Thu Oct 31, 2019 8:31 pm

Trouble with ping-pong DMA chaining

Thu Sep 16, 2021 7:31 am

I tried to make a continuous DMA transaction with two channels (also known as 'ping-pong' mode), but when I analyzed the transactions on my oscilloscope I was surprised: there are some extra clocks (32 ns, i.e. 4 clock ticks if the chip runs at 125 MHz) between adjacent DMA transactions (between the 1st channel's transaction and the 2nd channel's transaction). What is the problem?

[Image: oscilloscope capture showing the gap between the two channels' transfers]

Code:

#include "pico/stdlib.h"
#include "hardware/pio.h"
#include "hardware/dma.h"
#include "hardware/irq.h"
#include "build/pio_conf.pio.h"

#define SM_NUM      0

// Output arrays; aligned so the DMA address rings below wrap correctly
uint16_t outArr[16] __attribute__((aligned(32))) = {(1<<0), (1<<0), (1<<0), (1<<0), 0, 0, 0, 0, (1<<0), (1<<0), (1<<0), (1<<0), 0, 0, 0, 0};	// wide pulses - 1st channel
uint16_t secArr[16] __attribute__((aligned(32))) = {(1<<0), (1<<0), 0, 0, (1<<0), (1<<0), 0, 0, (1<<0), (1<<0), 0, 0, (1<<0), (1<<0), 0, 0};	// narrow pulses - 2nd channel

// DMA channels in use
int dma_chan, dma_chan2;
// PIO structure
PIO pio;

// DMA init function
void dmaInit()
{
    dma_chan = dma_claim_unused_channel(true);
    dma_chan2 = dma_claim_unused_channel(true);
    dma_channel_config c = dma_channel_get_default_config(dma_chan);
    dma_channel_config c2 = dma_channel_get_default_config(dma_chan2);
    channel_config_set_transfer_data_size(&c, DMA_SIZE_16);
    channel_config_set_transfer_data_size(&c2, DMA_SIZE_16);
    // We need to move SRC address but not DST
    channel_config_set_write_increment(&c, false);
    channel_config_set_read_increment(&c, true);
    channel_config_set_write_increment(&c2, false);
    channel_config_set_read_increment(&c2, true);
    
    // Wrap the read pointer so each buffer repeats: 16 x 2 bytes = 32 bytes,
    // so the ring is (1 << 5) bytes; the arrays must be 32-byte aligned
    channel_config_set_ring(&c, false, 5);
    channel_config_set_ring(&c2, false, 5);
    // Chaining DMA channels to each other
    channel_config_set_chain_to(&c, dma_chan2);
    channel_config_set_chain_to(&c2, dma_chan);

    // Pace both channels from DMA timer 0 (DREQ_DMA_TIMER0 == 0x3b)
    channel_config_set_dreq(&c, DREQ_DMA_TIMER0);
    channel_config_set_dreq(&c2, DREQ_DMA_TIMER0);
    // Timer rate is the X/Y fraction of sys_clk, packed as (X << 16) | Y
    dma_hw->timer[0] = (1 << 16) | 4;   // 125 MHz * 1/4 = 31.25 M transfers/s

    dma_channel_configure(
        dma_chan,
        &c,
        &pio->txf[SM_NUM],      // Destination is the PIO TX FIFO
        outArr,                 // Source is a memory array
        16,                     // Number of transfers (not bytes)
        false                   // Don't start yet
    );

    dma_channel_configure(
        dma_chan2,
        &c2,
        &pio->txf[SM_NUM],      // Destination is the PIO TX FIFO
        secArr,                 // Source is a memory array
        16,                     // Number of transfers (not bytes)
        false                   // Don't start yet
    );

    // Let's start the 1st DMA channel
    dma_start_channel_mask(1u << dma_chan);
}

int main()
{
    pio = pio0;

    // Init PIO program
    uint offset = pio_add_program(pio, &pio_conf_program);
    hello_program_init(pio, SM_NUM, offset, 0, 16);

    dmaInit();

    while (true)
    {
        tight_loop_contents();
    }
}

jayben
Posts: 356
Joined: Mon Aug 19, 2019 9:56 pm

Re: Trouble with ping-pong DMA chaining

Fri Sep 17, 2021 2:34 pm

Wouldn't you expect some delay when the hardware is switching from one control block to another?

If it is only 32 ns, then that sounds pretty quick; on a full-size Pi it is around 200 ns. That is one of the reasons for peripherals having FIFOs; they accommodate the irregularities in DMA operation.
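
If you want to confirm whether the gap ever actually starves the state machine, the PIO keeps a sticky per-state-machine "TX stall" flag you can poll. A minimal check (my sketch, not code from the post above; it assumes the pio/sm handles from the first post):

Code:

#include "hardware/pio.h"

// Returns true if the state machine has tried to pull from an empty
// TX FIFO since the flag was last cleared
static bool tx_fifo_underflowed(PIO pio, uint sm)
{
    uint32_t mask = 1u << (PIO_FDEBUG_TXSTALL_LSB + sm);
    bool stalled = pio->fdebug & mask;  // sticky flag, set by hardware
    pio->fdebug = mask;                 // write 1 to clear
    return stalled;
}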

cleverca22
Posts: 4665
Joined: Sat Aug 18, 2012 2:33 pm

Re: Trouble with ping-pong DMA chaining

Fri Sep 17, 2021 3:02 pm

jayben wrote:
Fri Sep 17, 2021 2:34 pm
That is one of the reasons for peripherals having FIFOs; they accommodate the irregularities in DMA operation.
but if your PIO code is very hungry, eating one 32-bit sample every clock cycle, you have zero room for error

if you miss even 1 sample during the flip between buffers, you're never going to catch up, and eventually the fifo will run dry after enough mistakes
also, if the dma has to contend with a cpu core for a given ram bank, somebody will lose and have to wait a cycle
tankist wrote:
Thu Sep 16, 2021 7:31 am
channel_config_set_transfer_data_size(&c, DMA_SIZE_16);
channel_config_set_transfer_data_size(&c2, DMA_SIZE_16);
i think this code is moving 16 bits per transfer
if you modify things to instead move 32 bits per transfer, it will be doing one transfer every 2 clock cycles, and then it has spare cycles in which to catch up
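
something like this, roughly (untested sketch; sm_cfg stands in for the pio_sm_config built inside your program's init function, and it assumes the PIO uses autopull with an 'out pins, 16' per sample):

Code:

// move the same 32 bytes as 8 x 32-bit words instead of 16 half-words;
// each FIFO word then carries two 16-bit samples
channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
channel_config_set_transfer_data_size(&c2, DMA_SIZE_32);

// PIO side: autopull 32 bits at a time, so 'out pins, 16' uses half a word
sm_config_set_out_shift(&sm_cfg, true, true, 32);

// halve the transfer count - each transfer now moves two samples
dma_channel_configure(dma_chan,  &c,  &pio->txf[SM_NUM], outArr, 8, false);
dma_channel_configure(dma_chan2, &c2, &pio->txf[SM_NUM], secArr, 8, false);

and remember each transfer now carries two samples, so the pacing timer would have to tick half as often (e.g. (1 << 16) | 8) to keep the same output rate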

tankist
Posts: 17
Joined: Thu Oct 31, 2019 8:31 pm

Re: Trouble with ping-pong DMA chaining

Sat Sep 18, 2021 9:48 am

jayben wrote:
Fri Sep 17, 2021 2:34 pm
Wouldn't you expect some delay when the hardware is switching from one control block to another?

If it is only 32 ns, then that sounds pretty quick; on a full-size Pi it is around 200 ns. That is one of the reasons for peripherals having FIFOs; they accommodate the irregularities in DMA operation.
It is not "ONLY" 32 ns. I'm going to do transactions at 125 MHz, and at that speed I lose 4 samples. I really need a seamless flow, which is what everyone expects from 'ping-pong' mode.
I wonder if it's my fault in the module configuration, or a hardware issue? I'm still waiting for an answer from the Pico engineers.

tankist
Posts: 17
Joined: Thu Oct 31, 2019 8:31 pm

Re: Trouble with ping-pong DMA chaining

Sat Sep 18, 2021 9:53 am

cleverca22 wrote:
Fri Sep 17, 2021 3:02 pm
jayben wrote:
Fri Sep 17, 2021 2:34 pm
That is one of the reasons for peripherals having FIFOs; they accommodate the irregularities in DMA operation.
but if your PIO code is very hungry, eating one 32-bit sample every clock cycle, you have zero room for error

if you miss even 1 sample during the flip between buffers, you're never going to catch up, and eventually the fifo will run dry after enough mistakes
also, if the dma has to contend with a cpu core for a given ram bank, somebody will lose and have to wait a cycle
tankist wrote:
Thu Sep 16, 2021 7:31 am
channel_config_set_transfer_data_size(&c, DMA_SIZE_16);
channel_config_set_transfer_data_size(&c2, DMA_SIZE_16);
i think this code is moving 16 bits per transfer
if you modify things to instead move 32 bits per transfer, it will be doing one transfer every 2 clock cycles, and then it has spare cycles in which to catch up
I actually don't understand how 32-bit transfers would be faster. Maybe it's not clear from my source code, but the two DMA channels don't work simultaneously - they work one after another, since they are chained.

cleverca22
Posts: 4665
Joined: Sat Aug 18, 2012 2:33 pm

Re: Trouble with ping-pong DMA chaining

Sat Sep 18, 2021 1:25 pm

tankist wrote:
Sat Sep 18, 2021 9:53 am
I actually don't understand how 32-bit transfers would be faster. Maybe it's not clear from my source code, but the two DMA channels don't work simultaneously - they work one after another, since they are chained.
the hardware is limited to doing 1 transaction per clock cycle, for a given combination of src & dest

a transaction can be a maximum of 32 bits wide

if you configure it to do 16-bit transactions, it's wasting half the data bus
if you configure it to do 32-bit transactions, it can do half as many transactions while moving the same amount of data

and if you're only consuming 16 bits per clock in the PIO, then the dma only needs to do a transaction on every 2nd clock
so when the dma does fall behind, the fifo can do its job, and the dma can do a quick burst on every clock cycle and refill the fifo
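
deepening the fifo helps with this too - joining the RX fifo onto the TX fifo gives the state machine 8 entries of slack instead of 4 (sketch; sm_cfg again stands in for the config built in the program's init function):

Code:

// merge the RX FIFO into the TX FIFO: 8 entries of buffering, so several
// cycles of DMA stall can be absorbed and refilled in a burst
// (the RX FIFO becomes unusable - fine for an output-only program)
sm_config_set_fifo_join(&sm_cfg, PIO_FIFO_JOIN_TX);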

LukeW
Raspberry Pi Engineer & Forum Moderator
Posts: 54
Joined: Tue Jul 07, 2015 2:19 pm

Re: Trouble with ping-pong DMA chaining

Mon Sep 20, 2021 9:16 am

Short answer: channel A is not considered "finished" until the last write completes. After this point, channel B performs its first read. Reads happen before writes, so there is a stall of a few cycles when switching between channels, to get write-to-read ordering.

It's not possible to sustain 1 word/cycle through multiple control blocks.

Longer answer -- every individual transfer has to go through four phases in the DMA's bus pipeline:
  • Read address phase (issue from the read address FIFO, RAF)
  • Read data phase (commit to the transfer data FIFO, TDF)
  • Write address phase (issue from the write address FIFO, WAF, after the TDF commit)
  • Write data phase (issue from the TDF)
These four phases are overlapped, so at any point the DMA may have four different transfers, potentially from four different channels, in different stages of completion. The overlapping allows a throughput of one read and one write per cycle, even though the overall bus pipeline has a latency of four. When swapping between two channels you have to expose this latency; otherwise you could have reads from the later channel occurring before writes from the earlier channel, which can cause surprising inconsistencies.
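
For a concrete case where that ordering matters: in a scatter-gather arrangement, one channel's writes *are* another channel's control block, so a read issued too early would launch the data channel with stale configuration. A sketch of that pattern (illustrative only, not the code from this thread; ctrl_chan, data_chan and cc are hypothetical, and cc must be set up for 32-bit transfers):

Code:

// ctrl_chan writes a new buffer pointer into data_chan's read-address
// register; the _trig alias also starts data_chan. If data_chan's first
// read could be issued before this write committed, it would launch
// with a stale read address.
static const uint16_t *control_block[1] = { secArr };

dma_channel_configure(
    ctrl_chan,
    &cc,                                        // 32-bit, 1-transfer config
    &dma_hw->ch[data_chan].al3_read_addr_trig,  // dest: data_chan's config
    control_block,                              // src: next buffer pointer
    1,                                          // one control word
    false                                       // trigger later
);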

tankist
Posts: 17
Joined: Thu Oct 31, 2019 8:31 pm

Re: Trouble with ping-pong DMA chaining

Mon Sep 20, 2021 1:24 pm

Thank you.
I wonder if it's possible to loop one DMA channel on itself. Would that be seamless?

dthacher
Posts: 116
Joined: Sun Jun 06, 2021 12:07 am

Re: Trouble with ping-pong DMA chaining

Wed Sep 22, 2021 11:18 pm

Here is some background on the PIC32's DMA for reference, which explains the behavior of that implementation. See the DMA performance section in the top link:
https://people.ece.cornell.edu/land/cou ... x_DMA.html
https://people.ece.cornell.edu/land/cou ... chine.html

In summary, there are a couple of things which can cause this delay:
1. Memory/Bus Arbitration
2. DMA channel Arbitration
3. DMA state machine/pipeline
4. Bus/IO/Event latency

Most if not all of these issues can be resolved; however, they take additional hardware, which increases cost and power consumption. In many implementations the DMA gets special access to prevent most of these issues, but they are rarely completely solved. Many DMAs are optimized for burst or processing needs. Many IO controllers do not support this without a FIFO and control logic.

In this case you could use the RP2040's second core, with careful consideration. The Cortex-M0+ may be slower, but it could potentially be more consistent.

Note this problem is in the domain of the system bus. If you lower the speed of the signal, the delay/overhead/latency should remain the same, meaning the error percentage will decrease (the same 32 ns gap that costs 4 samples at 125 MS/s is only a fraction of one sample period at lower rates). This means the issue really bites for high-speed signals. There could be a crossover point for CPU vs DMA depending on the context.

More than likely this is caused by a delay in the state machine's implementation (which may be pointless to avoid, depending on the IO/bus design), so there is nothing you can do to fix it. A single channel will likely not help, and may actually be worse. The PIC32 has issues like this. The RP2040 uses a DMA much closer to the PIC32's than to the Pi's. The Pi's kind of sucks, though it still supports some things.

The PIC32 is capable of transport triggering; I think at least some of this is possible on the RP2040. This would be pointless on the Pi. The RP2040 supports scatter-gather, which is another sign of an FPGA/Pi-based design. This enables some ping-pong but lacks the timing-closure component, if I had to guess. The PIC32 is basically in the same boat, but uses multiple channels for this.

Take it with a grain of salt.

dthacher
Posts: 116
Joined: Sun Jun 06, 2021 12:07 am

Re: Trouble with ping-pong DMA chaining

Wed Sep 22, 2021 11:39 pm

The PIC32 supports ping-pong in a few places, but if I remember correctly this is only on specific IO peripherals where the effects of this latency would not cause a problem. The same would be true for the RP2040. FIFOs can be very useful for pipelining/timing/producer-consumer problems/etc.

The RP2040 overclocks, so if the overclock is stable this would also potentially lower the error. Usually there is some tolerance, so if you can get into that window it should be fine. However, this is on the system bus, and I cannot speak to how to ensure stability. The point is that, if nothing else, you can play with the clock ratios and divide the problem.

Several solutions to this problem.
