DMA transfer taking more time than CPU transfer

Asked 14/5, 2019 at 4:14 Answered 28/10, 2020 at 10:2

Our task is intended to demonstrate the benefit of using DMA to copy a large amount of data versus relying on the processor to directly handle the copying. The processor is an STM32F407 on the ST discovery board.

In order to measure the copying time, a GPIO pin must be turned ON during copying and OFF once it has been copied.

The code appears to work, but the CPU copy currently takes about 2.15 ms and the DMA copy about 4.5 ms, which is the opposite of what is intended. Perhaps there simply isn't enough data for the speed of the DMA to offset the overhead of setting it up?

I have tried both copying the array elements in a loop on the CPU and using the memcpy function, which yielded very similar times.

The function code is shown below:

void DMASpeed(void)
{
    #define elementNum 32000
    int *ptr = NULL;
    ptr = (int*)malloc(elementNum * sizeof(int));
    int *ptr2 = NULL;
    ptr2 = (int*)malloc(elementNum * sizeof(int));
    for (int i = 0; i < elementNum; i++)
    {
        ptr[i] = 4;
    }
    LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
    LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
    // Initial value
    // printf("BEFORE: dst = '%s'\n", dst);

    // Transfer
    printf("Initiate DMA Transfer...\n");
    HAL_DMA_Start(&hdma_memtomem_dma2_stream0, (uint32_t)ptr, (uint32_t)ptr2, (elementNum * sizeof(int)));
    LD5_GPIO_Port->BSRR = LD5_Pin;
    printf("DMA Transfer initiated.\n");


    // Poll for DMA completion
    printf("Poll for DMA completion.\n");
    HAL_DMA_PollForTransfer(&hdma_memtomem_dma2_stream0,
        HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
    LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
    printf("DMA complete.\n");

    // Print result
    // printf("AFTER: dst = '%s'\n", dst);
    free(ptr);
    free(ptr2);

    ptr = (int*)malloc(elementNum * sizeof(int));
    ptr2 = (int*)malloc(elementNum * sizeof(int));
    for (int i = 0; i < elementNum; i++)
    {
        ptr[i] = i;
    }

    printf("Initiate CPU Transfer...\n");
    LD6_GPIO_Port->BSRR = LD6_Pin;
    //  for (int i = 0; i<512; i++)
    //  {
    //  ptr2[i] = ptr[i];
    //  }
    memcpy(ptr2, ptr, (elementNum * sizeof(int)));
    printf("CPU Transfer Complete.\n");
    LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;

    free(ptr);
    free(ptr2);
}

Thanks in advance for any assistance

Accurate answered 14/5, 2019 at 4:14 Comment(3)

DMA exists to take load off from the CPU and get rid of interrupts. It is not necessarily faster and may very well be slower. – Iatric 14/5, 2019 at 7:27

Anyway your benchmarking is complete hogwash since you are benchmarking printf calls. Remove all printf, remove all heap allocation (it's not a PC) then measure with an oscilloscope. – Iatric 14/5, 2019 at 7:28

Voting to close this as "cannot be reproduced", since the concept to prove is based on a misconception and the benchmarking code posted to prove the misconception is incorrect in itself. – Iatric 14/5, 2019 at 7:34

You are trying to prove something that is not true. A memory-to-memory DMA transfer will always be slower than a direct CPU copy. DMA was not intended to be faster than the CPU. It is there to perform the transfer in the background without CPU involvement. The core always has priority over the DMA.

A MEM-to-MEM DMA transfer will always be slower than the CPU one.

There is another problem as well. Many STM32 devices have memory areas that the DMA cannot access (for example CCMRAM).
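On the STM32F407 the CCM data RAM is, as far as I know, the 64 KB block starting at 0x10000000. A minimal sketch of a guard that rejects buffers overlapping that block before they are handed to the DMA. The helper name `dma_can_reach` and the two constants are illustrative assumptions and should be checked against the reference manual for the exact part:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the STM32F407 memory map: 64 KB of CCM data RAM. */
#define CCM_BASE 0x10000000u
#define CCM_SIZE (64u * 1024u)

/* Hypothetical helper: true if [addr, addr + len) does not touch CCM RAM. */
static bool dma_can_reach(uint32_t addr, size_t len)
{
    uint64_t start = addr;
    uint64_t end = start + len;                      /* one past the last byte */
    uint64_t ccm_end = (uint64_t)CCM_BASE + CCM_SIZE;

    if (len == 0) {
        return true;                                 /* nothing to transfer */
    }
    return end <= CCM_BASE || start >= ccm_end;
}
```

On the target this could be called as `dma_can_reach((uint32_t)ptr, nbytes)` before starting the transfer. Where `malloc` places its buffers depends on the linker script, so the check is only relevant if the heap or the buffers are mapped into CCM.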

Cam answered 14/5, 2019 at 5:37 Comment(7)

What is also important here is what you mean by faster. It is not just about CPU clock speed versus DMA clock speed. It also involves CPU overheads in interrupt processing, task switching (if any), etc. – Arch 14/5, 2019 at 5:45

For the microcontroller in question, DMA is on the AHB bus. st.com/content/ccc/resource/technical/document/application_note/… – Arch 14/5, 2019 at 5:54

This is technically correct, but in all fairness, if we have something like a frequent incoming interrupt from a peripheral, then the interrupt overhead must be taken into account. The CPU must stack the return address, it must process the ISR, we must clear interrupt source flags, we must handle re-entrancy. DMA has the same re-entrancy cache issues but doesn't otherwise come with a lot of overhead like interrupts. – Iatric 14/5, 2019 at 7:40

@Iatric First of all, ARM Cortex does not stack the return address, it only sets the link register (it is not like the x86 architecture). Instead it stacks a set of registers (sometimes the FPU ones as well). If we consider that the CPU is busy, we should also take into account that other DMA transfers may take place in the background, significantly slowing the one we are discussing here. So the comparison only makes sense if the CPU is doing just one thing at the moment and there are no other DMA transfers in the background when we compare the speed of the mem-to-mem transfer. – Cam 14/5, 2019 at 8:44

Yes well a generic CPU stacks "stuff", at the very least condition code registers etc. That's not the big performance bottleneck though, but rather the re-entrancy mechanism. If atomicity of shared objects can be guaranteed, then that's not a big problem (that is, either C11 _Atomic or inline asm). If not, then in an ISR-based solution, the caller will either have to solve re-entrancy by waiting for an on-going interrupt to finish, which costs the whole overhead of executing that ISR, or it can shut off the interrupt temporarily, at the cost of potentially lost data. – Iatric 14/5, 2019 at 8:53

My point is: these kinds of performance calculations aren't trivial. – Iatric 14/5, 2019 at 8:54

Thanks for the feedback all. I did eventually (probably longer than it should have taken) realise that the printf statements were slowing the process down. I commented these all out and the DMA process is down to about 0.2ms. The CPU process appears to be so quick that it can't be measured with the logic analyser however. So essentially we have been given a task based on a misconception which can't actually be demonstrated? Regarding the heap allocation it was specified that this should be used for the exercise, for what reason I can only guess. – Accurate 15/5, 2019 at 0:50
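With heap-allocated buffers it is also worth checking that the numbers fit the part at all. A rough sanity check, assuming the STM32F407 figures of 192 KB of total on-chip SRAM (64 KB of which is CCM) and a 16-bit DMA item counter (NDTR). Both figures should be confirmed against the datasheet and reference manual, and the helper names are made up for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ELEMENT_NUM   32000u           /* elementNum from the question */
#define SRAM_TOTAL    (192u * 1024u)   /* assumed: all on-chip SRAM incl. CCM */
#define DMA_MAX_ITEMS 65535u           /* assumed: NDTR is a 16-bit counter */

/* Bytes needed for the source and destination buffers together. */
static size_t bytes_needed(size_t elements)
{
    return 2u * elements * sizeof(uint32_t);
}

/* True if both buffers could fit in the assumed total SRAM. */
static bool buffers_fit_in_sram(size_t elements)
{
    return bytes_needed(elements) <= SRAM_TOTAL;
}

/* The DMA length counts items (bytes, half-words or words), not bytes. */
static bool fits_one_dma_transfer(size_t items)
{
    return items >= 1u && items <= DMA_MAX_ITEMS;
}
```

Under these assumptions, two 32000-element buffers need 256000 bytes, more than the 196608 bytes on the chip, so at least one `malloc` call cannot succeed and its return value should be checked. Passing `elementNum * sizeof(int)` (128000) as the transfer length also exceeds the counter limit, whereas 32000 word-sized items would fit if the stream is configured for word transfers.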

Remove the printf calls in the code segment below:

LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n");  // <--Remove this


// Poll for DMA completion
printf("Poll for DMA completion.\n"); // <--Remove this

You are turning the pin ON and then printing a lot of text, which adds to your measured time.

Remove all printf calls, or at least do not print anything between the pin toggles.
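The idea of keeping prints outside the timed window can be sketched on a host. Here `pin_on` and `pin_off` are hypothetical stand-ins for the BSRR writes, and they only log the order of events so the sketch runs without hardware. `timed_copy` and `copy_under_test` are likewise illustrative names, not part of the original code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Event log: 'H' = pin high, 'C' = copy done, 'L' = pin low. */
static char trace[16];
static size_t trace_len;

static void mark(char c)
{
    if (trace_len < sizeof trace - 1) {
        trace[trace_len++] = c;
    }
}

/* Hypothetical stand-ins; on the target these would write GPIOx->BSRR. */
static void pin_on(void)  { mark('H'); }
static void pin_off(void) { mark('L'); }

static uint32_t src[256], dst[256];

/* The operation being timed; a DMA start plus poll would go here instead. */
static void copy_under_test(void)
{
    memcpy(dst, src, sizeof src);
    mark('C');
}

/* Only the copy itself sits between the two pin writes. */
static void timed_copy(void (*copy)(void))
{
    printf("Initiate transfer...\n");   /* before the window */
    pin_on();
    copy();
    pin_off();
    printf("Transfer complete.\n");     /* after the window */
}
```

Calling `timed_copy(copy_under_test)` leaves the log reading high, copy, low, with both prints outside the window. Note that in the question the DMA pin goes high only after `HAL_DMA_Start` returns, so part of the transfer may already have happened before the measurement begins.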

EDIT:

To be precise, you are printing about 49 characters (newlines included) during the DMA measurement and 23 characters during the CPU measurement.
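The character counts can be checked mechanically with `strlen`. A small sketch, with the strings copied from the question and the array and function names made up for illustration. It assumes the output path sends `\n` as a single character; a retargeted `printf` that expands it to `\r\n` would add one character per line:

```c
#include <stddef.h>
#include <string.h>

/* Strings printed while LD5 is high (DMA case). */
static const char *const dma_window[] = {
    "DMA Transfer initiated.\n",
    "Poll for DMA completion.\n",
};

/* String printed while LD6 is high (CPU case). */
static const char *const cpu_window[] = {
    "CPU Transfer Complete.\n",
};

/* Total characters, newlines included, in an array of n strings. */
static size_t total_chars(const char *const *msgs, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += strlen(msgs[i]);
    }
    return sum;
}
```

Multiplying each total by the per-character time of the output channel (unknown here, since the question does not say how `printf` is retargeted) gives a rough idea of how much of each measurement is printing.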

Arch answered 14/5, 2019 at 4:22 Comment(8)

DMA transfer is always slower than the CPU one – Cam 14/5, 2019 at 5:38

@P__J__ please Refer: #43437062 – Arch 14/5, 2019 at 5:50

What you are assuming is DMA clock is slower than CPU clock. – Arch 14/5, 2019 at 5:51

For the microcontroller in question, DMA is on the AHB bus. st.com/content/ccc/resource/technical/document/application_note/… – Arch 14/5, 2019 at 5:54

No, the DMA transfer takes many clocks and the CPU always has priority over the DMA. – Cam 14/5, 2019 at 5:59

BTW, the Stack Overflow link you provided is equally wrong. The answer there is incorrect. – Cam 14/5, 2019 at 6:1

@P__J__ could you please be more specific about which processor/platform you are talking about? DMA is independent of CPU operations once triggered hence no question of priority. – Arch 14/5, 2019 at 6:3

About STM32, which I have programmed as my day job for more than a decade. When the core and the DMA try to access the same bus, the CPU always has priority over the DMA. I end this discussion on my side. – Cam 14/5, 2019 at 6:59

For those who google "How to speed up DMA memory-to-memory transfer?", here is a piece of advice: force your compiler to place all HAL code related to your DMA transfer in RAM, ideally in the RAM exclusively coupled to the core. The compiler will generate function code that is copied to that RAM at startup, and all those functions will then be called from RAM and run faster because of it. The same is also true for copying "by hand". In this case, it is recommended to place the following files/functions in RAM:

stm32[whatever]_hal_dma.c
DMA[N]_Stream[M]_IRQHandler(), where N and M are the numbers of your DMA and stream used for the transfer respectively.
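With GCC, one way to place a single function in RAM is a section attribute. A minimal sketch, assuming the linker script collects a `.RamFunc` input section into RAM and the startup code copies it there (check your own script, since this is not guaranteed). The macro name `RUN_FROM_RAM` and the routine `copy_words` are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: ".RamFunc" is placed in RAM by the linker script and
   initialised by the startup code. On non-ARM builds the macro is empty. */
#if defined(__GNUC__) && defined(__arm__)
#define RUN_FROM_RAM __attribute__((section(".RamFunc"), noinline))
#else
#define RUN_FROM_RAM /* host build: ordinary function */
#endif

/* Hypothetical word-copy routine that executes from RAM on the target. */
RUN_FROM_RAM void copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Whether this actually helps depends on flash wait states, the prefetch and cache settings, and bus contention between instruction fetches and data accesses, so it is worth measuring both placements rather than assuming a gain.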

Newspaper answered 28/10, 2020 at 10:2 Comment(0)
