FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Tue Jun 21, 2022 4:16 pm

@mschnell, to better illustrate my point about rethinking your algorithm in terms of tasks, here is simplified task-based code for a matrix-matrix multiply
that could run in parallel on the Pico (I am assuming there is a Matrix type with an API similar to the Eigen library):

First the basic tool

Code: Select all

task<>
base_multiply(Matrix &C, const Matrix &A, const Matrix &B)
{
    C = A*B;
    co_return; // Necessary to make this a coroutine.
}
Then the usage, in which when_both_cores schedules the two tasks for immediate resumption, one on each core, and resumes the caller when
both argument coroutines are done:

Code: Select all

task<>
matrix_matrix_multiply(Matrix &C, const Matrix &A, const Matrix &B)
{
    co_await when_both_cores(
          base_multiply(upper_half(C), upper_half(A), B),
          base_multiply(lower_half(C), lower_half(A), B) 
   );
}
There you have a working parallel matrix-matrix multiply on the Pico. It is not very cooperative towards the task system, but if that's what you need....

Making it cooperative may not take too much work. Modifying base_multiply into:

Code: Select all

// A friendlier version.
task<>
base_multiply(Matrix &C, const Matrix &A, const Matrix &B)
{
    for (int i = 0; i < C.rows(); ++i) {
        C.row(i) = A.row(i) * B;  // Row i of C needs all of B, not a single column.
        co_await yield();  // This makes it a coroutine; no need for co_return anymore.
    }
}
PS: If you want to experiment with this, let me know. I wish the board allowed PMing people.

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Tue Jun 21, 2022 9:56 pm

FunMiles wrote:
Tue Jun 21, 2022 4:05 pm
Parallel loops are the hardest thing to move from a local machine to a cluster. Tasks are the opposite. Each task is conceptually local. It's the transfer of data between nodes on the cluster that is difficult.
Sorry but not true. In many applications, and with proper design, you only need to transfer data between nodes at the end of the signal processing chains. This is true both in audio using Matlab and in semiconductor process simulation using parallelised versions of Spice, the two areas of HPC I have been involved with using clusters. Obviously there are problem spaces where this doesn't apply, notably in video processing, but that doesn't make parallel loops invalid in applications where they work very well indeed.
Always interested in innovative audio startups needing help and investment. Look for InPoSe Ltd or Future Horizons on LinkedIn to find me (same avatar photograph)

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jun 22, 2022 1:20 am

MikeDB wrote:
Tue Jun 21, 2022 9:56 pm
FunMiles wrote:
Tue Jun 21, 2022 4:05 pm
Parallel loops are the hardest thing to move from a local machine to a cluster. Tasks are the opposite. Each task is conceptually local. It's the transfer of data between nodes on the cluster that is difficult.
Sorry but not true. In many applications, and with proper design, you only need to transfer data between nodes at the end of the signal processing chains. This is true both in audio using Matlab and in semiconductor process simulation using parallelised versions of Spice, the two areas of HPC I have been involved with using clusters. Obviously there are problem spaces where this doesn't apply, notably in video processing, but that doesn't make parallel loops invalid in applications where they work very well indeed.
I am not very familiar with Matlab distributed computing, so I had to go look at the MathWorks website. Looking at their examples, all I see is that parfor creates and distributes totally independent, embarrassingly parallel tasks with scalar results (i.e. negligible communication) :P
What do you think the following really does?

Code: Select all

function a = MyCode(A)
    tic
    parfor i = 1:200
        a(i) = max(abs(eig(rand(A))));
    end
    toc
end
Sorry, but it creates 200 tasks, each of which does a(i) = max(abs(eig(rand(A))));. Each iteration is one task, and a task can be dispatched anywhere.

Take OpenMP parallel loops, and there is no way you are going to move them from local to distributed without a huge effort simply to distribute the data and then gather it back.

My HPC work solves coupled fluid/structure problems on some of the largest computer systems in the world, and I can assure you, you will never write a fluid or structural solver code as a parallel for loop of that kind.
This is so true for the majority of HPC codes that OpenMP introduced the notion of tasks, and most codes now use them as the major building block.

Your experience is different because using Matlab or Spice for distributed HPC hides the dirty work from you.
The programmers of those systems have done the work for you, and in the above example I can assure you that it goes through a task mechanism (it shows up as what they call the overhead of parfor). The line inside the parfor can be abstracted as a lambda called with i and A as arguments. Or, since i varies and A does not, you can even put A in the capture (one communication), so that the lambda takes only i as an argument.
A lambda is a unit of task. If you think of it that way, I hope you'll understand my point of view.

mschnell
Posts: 211
Joined: Wed Jul 28, 2021 10:33 am
Location: Krefeld, Germany

Re: Inter-core communication and memory model for C/C++

Wed Jun 22, 2022 9:15 am

FunMiles wrote:
Tue Jun 21, 2022 3:10 pm
I don't think the pico is a good architecture on which to do parallel loops, because of the lack of processor assisted atomic type operations, making constant synchronization between code running on both cores very inefficient.
Right. But a matrix*vector multiplication does not need any synchronization, as the algorithm is strictly parallel.
Using both cores would double the speed on a Pico;
using all cores would greatly improve the speed on a more advanced multicore chip (embedded or "desktop").
Allowing the same code to run regardless of core count and OS, via C++ language features and a library, would be viable.
...
Of course optimizing the problem in user code can give the best results. But with this argument you might propose doing everything in ASM :) :)
IMHO a perceived "simple" problem such as matrix multiplication should (usually) be done in straightforward user code: a simple loop for single-core systems and, when optimizing for speed, a "parallel loop" to set multiple cores to work. The (simplest) user code in C++ should be ignorant of core count and OS.

-Michael

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Wed Jun 22, 2022 10:24 am

FunMiles wrote:
Wed Jun 22, 2022 1:20 am
I am not very familiar with Matlab distributed computing, so I had to go look on the mathworks web site. Looking at their examples, all I see is that parfor creates and distributes totally independent, embarrassingly parallelizable tasks with scalar results (i.e. negligible communication)
It's just a trivial example. Real Matlab parfor loops can encompass large blocks of code. The only thing to remember is not to embed a parfor inside another one, so it's often worth writing the whole code using plain for loops and then trying out whether the outer loop or several inner loops should become parfors. Unfortunately that isn't always obvious, and the best choice is often found with a timer.

My HPC work solves coupled fluid/structure problems on some of the largest computer systems in the world, and I can assure you, you will never write a fluid or structural solver code as a parallel for loop of that kind.
I totally agree that fluid dynamics comes in the same category as video, in that each calculation block constantly needs values from adjacent blocks, so communication needs go through the roof. But this is not the same in all problems, and for those where data is distributed out and then pulled back together after the calculations are complete, the Matlab or Spice solutions make very efficient use of clusters. I certainly wouldn't expect to see any worthwhile gain by recoding these using the methods you use on fluid dynamics, and the code would probably be less readable. Plus, of course, Matlab doesn't directly support your techniques.
Always interested in innovative audio startups needing help and investment. Look for InPoSe Ltd or Future Horizons on LinkedIn to find me (same avatar photograph)

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jun 22, 2022 2:39 pm

@MikeDB finally we are in agreement :D
One moral of the story is that the tool you use can change the way you think. And when you think in a new way, you discover ways of solving problems that you had not thought about or would have outright dismissed.
It sometimes takes an epiphany to switch thinking. E.g. I started programming in C++ in 1986 or '87, but I was a bad C++ programmer then because I saw it as C with a ++ :lol: .
While doing my MSc thesis I used Smalltalk, because my advisor was using it. It switched my paradigm, and when I came back to C++ I became a good C++ programmer.
I would encourage anyone to try C++20 coroutines, because they let you easily do things you would not have thought of doing before.

One example: in scientific/mathematical computing, it is common to see things very linearly and then apply parallelization to each step.
With an asynchronous, composable way of programming, I now commonly take a step that is perhaps not a major amount of computation but is particularly difficult to parallelize, and run it while other steps are taking place. As soon as the data it needs is available, that process starts; meanwhile, an easily parallelizable part of the work proceeds, so the not-so-parallelizable part is no longer running on a system starved for work. By the time its result is needed, it is available, and the system has run at almost 100% occupancy, instead of that step running at <10% occupancy as a single step on its own.
This kind of thinking becomes routine because the tool, C++20 coroutine-based asynchronous/parallel computing with strong composition capabilities, makes it easy.
Other ways of composing work have become routine too, again because the tool makes them easy to think about, and I am writing a document for teaching such approaches.

It is not an absolute solution to all problems, but it is a solution to many common problems.
I will continue to add hardware support to the library and once I think there's a sufficient amount and that it is stable enough, I'll put the link to it in the list of libraries sticky thread.

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Wed Jun 22, 2022 3:27 pm

Good to hear. When you've got your library complete, I'll give it a go on our CM4 bare-metal project and see how it performs compared to our current solution, as that can run three of the four cores nearly flat out.
Always interested in innovative audio startups needing help and investment. Look for InPoSe Ltd or Future Horizons on LinkedIn to find me (same avatar photograph)

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Wed Jun 29, 2022 11:49 am

FunMiles wrote:
Tue Jun 21, 2022 3:10 pm
Personally, in HPC code, I have moved away from the "parallel loop" paradigm over twenty years ago. It is rare that you should program that way.
Libraries do that for you, but even if you use such libraries, I find that not letting them do that but modifying your algorithm is usually a lot more efficient.
The entire NVIDIA programming model for HPC relies on parallel loops. I'm not sure how you manage to get away from them.

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jun 29, 2022 3:23 pm

HPCguy wrote:
Wed Jun 29, 2022 11:49 am
FunMiles wrote:
Tue Jun 21, 2022 3:10 pm
Personally, in HPC code, I have moved away from the "parallel loop" paradigm over twenty years ago. It is rare that you should program that way.
Libraries do that for you, but even if you use such libraries, I find that not letting them do that but modifying your algorithm is usually a lot more efficient.
The entire NVIDIA programming model for HPC relies on parallel loops. I'm not sure how you manage to get away from them.
Your statement extends the way things work inside a GPU to the whole system. I don't think anybody would do that. Yes, a GPU kernel is driven as a data-parallel for loop, but what about the rest?

Typically you will drive the GPU with three types of tasks (and I am purposely using the word task, because it should show why you still shouldn't view things as a parallel for loop):
- Data preparation/postprocessing on the CPU
- Data transfer between CPU and GPU (this may be less of an issue with the latest Nvidia hardware with a unified memory space)
- Running a kernel on the GPU

Are you really going to put all that in one big for loop?
You'd be wasting cycles. Optimized GPU-utilizing code should be constructed as an asynchronous task system. While the data-transfer tasks take place, the CPU should be running other tasks that it does better than the GPU, and similarly while the kernel runs. Some transfer tasks and CPU tasks can run concurrently and asynchronously.
It is extremely hard, except for toy examples, to optimize all this with a parallel-for approach.

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Tue Jul 05, 2022 12:40 pm

Protothreads can sometimes be (much) cleaner than coroutines:

http://dunkels.com/adam/pt/

Example of same code with/without protothreads:

http://dunkels.com/adam/pt/examples.html#driver

Or in C++:

https://github.com/benhoyt/protothreads-cpp
Last edited by HPCguy on Wed Jul 06, 2022 1:07 am, edited 1 time in total.

dthacher
Posts: 419
Joined: Sun Jun 06, 2021 12:07 am

Re: Inter-core communication and memory model for C/C++

Tue Jul 05, 2022 2:39 pm

Protothreads are very clean. However, they have one little issue: they don't play well with libraries, because keeping state in static memory does not work well there. To get around this you need to use a struct in C; in C++ you need multiple objects or structs. This may cause more overhead than desired.

The nice thing about protothreads is that you can avoid dynamic memory allocation if desired. I am guessing just about everyone here knows this, but some may not.

I do not recall this being covered in college, but it is an outgrowth of super loops. (RTOS and interrupts were poorly covered also. CS is a joke, full of perversion.) This is starting to bother me a little. The point of ABET certification is to provide a body of knowledge and regulation for the state. The predictable result of poor education and sole reliance on experience is trouble verifying competence and gaining experience. Aka the state pays the price.
There is more I could say here, but I am going to leave it at that.

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 1:23 am

I feel I need to speak out here, rather than mislead a generation of young minds concerning HPC. Task models like HPX are no panacea. Often they are demonstrated on toy problems with specific core kernels, but are rarely used in production codes with many constraints. If you can name a counterexample in a code containing more than 250000 lines with a lifetime of more than ten years, please point it out.

The lack of task models in large long-lived multiphysics codes is not for lack of trying: theory applied to small or focused problems is one thing, but maintaining this model in large production codes often exposes its inefficiencies and drawbacks. In my opinion, the only task-based programming environment that has teeth and potential for HPC multiphysics is the Loci model from Mississippi State University. All the others that get a lot of PR are not widely used in spite of the massive hype, and have severe drawbacks.

As for my credentials: the last time I was asked to review task-based academic papers, I declined, saying I wasn't qualified, and was surprised to get pushback from the source, saying they knew full well I was an expert in precisely the subject matter, and would I please review the paper rather than shirk my responsibilities to the academic community.

That said, for *many* non-production scale HPC and other applications, event based threads, aka tasks, are wonderful, and are in fact often the best way to go.

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 2:55 am

HPCguy wrote:
Wed Jul 06, 2022 1:23 am
I feel I need to speak out here, rather than mislead a generation of young minds concerning HPC. Task models like HPX are no panacea. Often they are demonstrated on toy problems with specific core kernels, but are rarely used in production codes with many constraints. If you can name a counterexample in a code containing more than 250000 lines with a lifetime of more than ten years, please point it out. The lack of task models in large long lived multiphysics codes is not for lack of trying, but because theory applied to small or focused problems is one thing, but maintaining this model in large production codes often exposes their inefficiencies and drawbacks. In my opinion, the only task based programming environment that has teeth and potential for HPC multiphysics is the Loci model from Mississippi State University. All the others that get a lot of PR are not widely used in spite of the massive hype, and have severe drawbacks. As for my credentials, the last time I was asked to review task based academic papers, I declined saying I wasn't qualified, and was surprised to get pushback from the source, saying they knew full well I was an expert in precisely the subject matter, and would I please review the paper rather than shirk my responsibilities to the academic community.

That said, for *many* non-production scale HPC and other applications, event based threads, aka tasks, are wonderful, and are in fact often the best way to go.
I have over 30 years of multiphysics HPC codes based on a task model. I started and directed the writing of two codes (one for structures and one for fluids) that are still in use and have hundreds of thousands of lines of code. Others have taken over, but the codes still live and still run on some of the largest HPC systems in the world.
You will find public versions of one of those codes at https://bitbucket.org/frg/aero-s/downloads/
I started this code over 30 years ago. Many students have contributed, and it is not the cleanest code out there, but it still lives and satisfies your requirement for showing a code that does so. I'll also mention that I am a co-winner of a Gordon Bell prize with a derivative of this code.
As for HPX, they run it on extremely large astrophysics problems.
And in my more recent work, I have written for a client what the client has measured as the fastest parallel multifrontal solver on the market (along with other sparse solvers). I could not have written it without tasks; parallel for loops are clunky for such a program.
If anybody thinks a parallel multi-frontal solver is a toy problem, I challenge them to write one with parallel for loops.
One of my works after that was a different type of sparse solver (so-called left-looking) using C++20 coroutine-based tasks. I have been amazed at how much faster I reached milestones comparable to those of the multifrontal solver, because the language-supported coroutine approach helps you think more naturally about asynchronous tasks and write the operations much more compactly. That makes writing, but even more importantly re-reading and reviewing, code much easier.

One thing that has bothered me is that several people have pontificated here about coroutines without what seems like basic knowledge of the details of C++20 coroutines. Most statements seem to be based on library-based coroutines such as Boost's, which are so far from the C++20 mechanism that you cannot draw conclusions from one and extend them to the other.
In fact, since HPCguy mentioned protothreads, I'll mention that I took a look at protothreads and how they are implemented. A look at C++20 coroutines will reveal that they take a very similar approach; protothreads might have been called proto-coroutines. But C++20 coroutines are language supported, support local variables (protothreads by design cannot save local variables), and offer many more tools. For example, on @dthacher's point about memory allocation, you are given the tools to avoid dynamic memory allocation as well.
I have said they are complex to understand because they were designed as an extremely flexible building tool for library writers. I am writing a library so others can use them on the Pico. If you don't want to use it, that's none of my business, but I don't see a reason to attack my attempts as futile.

This will be my last post in this thread and I think it should be closed.

ejolson
Posts: 9752
Joined: Tue Mar 18, 2014 11:47 am

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 6:21 am

FunMiles wrote:
Wed Jul 06, 2022 2:55 am
I'll also mention that i am a co-winner of a Gordon Bell prize with a derivative of this code.
Wow! It must have been fun to receive such recognition.

I've been reading this thread and wondering how exascale supercomputing relates to the Raspberry Pi Pico. Then I realized how nice it would be to have a portable parallel programming interface that is as useful on dual-core microcontrollers as on million-core supercomputers.

Compared to what has been mentioned so far, my own parallel processing experiences are much smaller scale. At one point I was quite happy with MIT Cilk and invested time to get Intel's libcilkrts working on a Raspberry Pi 2B. Details are at

viewtopic.php?t=102743

Unfortunately, the graphs in that thread are offline due to the shields-up directive. Since the webserver is now in the DMZ, those graphs should soon be back.

I was saddened when Cilk was deprecated, but at about the same time OpenMP in GCC and clang (but not IBM XL C) began offering a similar kind of task-based parallelism. Although lacking the ability of Tapir

https://cilk.mit.edu/tapir/

to optimize loop invariants within a parallel construct, I found it possible to wrap those OpenMP #pragma directives into syntactically prettier macros using _Pragma.

This obviously does not work on a Pico.

In response to
FunMiles wrote:
Wed Jul 06, 2022 2:55 am
I am writing a library so others can use them on the pico.
I would like to say, thanks. I didn't want to write my own compatibility layer around the Pico's locking functions and have been waiting for such a library. If not in this thread, please post somewhere obvious when it's stable enough, because I'd like to try it out.
Last edited by ejolson on Wed Jul 06, 2022 6:31 am, edited 1 time in total.

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 6:30 am

@Funmiles,

Nothing you said in your followup comments conflicts with what I said, so I think we are in agreement. No need to close the thread.

PS Since you are interested in aerospace and task based approaches to parallel programming, I hope you check out Loci. I have zero affiliation with the project, but am an admirer from afar:

https://web.cse.msstate.edu/~luke/chem/index.html

BTW The paper I mentioned I was asked to review concerned new task extensions to Uintah circa 2015.

@ejolson,

I am currently implementing a computer language that will be a follow on to my RAJA programming model. Wish me luck, as I am not a compiler writer, and will likely fail. :)

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 10:03 am

ejolson wrote:
Wed Jul 06, 2022 6:21 am
In response to
FunMiles wrote:
Wed Jul 06, 2022 2:55 am
I am writing a library so others can use them on the pico.
I would like to say, thanks. I didn't want to write my own compatibility layer around the Pico's locking functions and have been waiting for such a library. If not in this thread, please post somewhere obvious when it's stable enough, because I'd like to try it out.
I will put it in the first thread with the list of libraries for the Pico.
In the meantime, could you go to the GitHub site I've mentioned before and open an issue asking for support for whatever device/feature you're most interested in? That will help direct my efforts.
I'll be adding support for the WiFi part of the new SDK as soon as I can get my hands on a Pico W. I've already started working on WiFi with an ESP-1 via UART, but it is not published yet as it is still too basic.

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Wed Jul 06, 2022 11:52 am

On the protothreads' similarity with C++20 tasks, I saw this C++ version: https://github.com/benhoyt/protothreads-cpp.
The code shown in the readme as being generated is, from my understanding, not too far from what a C++20 compiler generates for a coroutine. The big difference is that rather than having to poll with PT_WAIT_UNTIL, a C++20 co_await works with a temporary awaitable object, which lets you create separately the logic for continuing once what is being awaited becomes available. Another difference is the temporaries: in the library referenced above, you can put all the temporaries in the protothread's class. Effectively, the C++20 coroutine system does that for you, but with one big advantage for modern, clean programming: it deals with the lifetime of arbitrarily located temporaries, thus allowing temporaries that have no default constructor and whose constructor arguments only become available during execution. Similarly, if a temporary appears within a restricted scope (inside an if or loop block), its destructor will be called correctly at the end of that scope. It is possible to achieve the same with the protothreads-cpp library, but the manual care required may become quite a burden, and the risk of making mistakes could quickly become high.

Thanks to HPCguy for mentioning it. It's a clever trick. I only became aware two or three years ago that one could write a case label of a switch statement inside a loop. (It was in a CppCon video about loop unrolling and how you can handle the loop remainder with a switch that jumps right into one of the unrolled statements.)

If you like what protothreads do, then C++20 coroutines are the right tool for you, especially if, like me, you are allergic to macros.

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Thu Jul 07, 2022 12:41 am

FunMiles wrote:
Wed Jul 06, 2022 11:52 am
I only became aware that one could write a case: of a switch statement inside a loop two or three years ago. (It was in a cppcon video talking about loop unrolling and how you could deal with the loop remainder with a switch right into one of the unrolled statements).
For those who want to follow up on this concept: the construct is called Duff's Device. The idea is that switch statements are just an encapsulation construct for labels; like goto labels, case labels can be placed almost anywhere, so they provide a way to jump around in complex code rather than merely an enumeration construct.

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Thu Jul 07, 2022 8:17 am

HPCguy wrote:
Thu Jul 07, 2022 12:41 am
FunMiles wrote:
Wed Jul 06, 2022 11:52 am
I only became aware that one could write a case: of a switch statement inside a loop two or three years ago. (It was in a cppcon video talking about loop unrolling and how you could deal with the loop remainder with a switch right into one of the unrolled statements).
For those who want to follow up on this concept, the name of this construct is Duff's Device.
And it shows up one of the most horrible things in the C language: the fact that switch statements fall through unless a break is in place, and that you can jump into the middle of one of them :-) Suffice it to say, it's prohibited in many company coding guidelines.
Always interested in innovative audio startups needing help and investment. Look for InPoSe Ltd or Future Horizons on LinkedIn to find me (same avatar photograph)

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Thu Jul 07, 2022 2:16 pm

MikeDB wrote:
Thu Jul 07, 2022 8:17 am
HPCguy wrote:
Thu Jul 07, 2022 12:41 am
FunMiles wrote:
Wed Jul 06, 2022 11:52 am
I only became aware that one could write a case: of a switch statement inside a loop two or three years ago. (It was in a cppcon video talking about loop unrolling and how you could deal with the loop remainder with a switch right into one of the unrolled statements).
For those who want to follow up on this concept, the name of this construct is Duff's Device.
And shows up one of the most horrible things in the C language - the fact that switch statements fall through unless there is a break in place, and you can jump into the middle of one of them :-) Suffice it to say, it's prohibited in many company coding guidelines.
In C++, newer compilers will warn you (and you can turn the warning into an error with a compiler flag). If the fall-through really is your intention, you can make it clear to the compiler by adding the fallthrough attribute. See https://en.cppreference.com/w/cpp/langu ... allthrough

ejolson
Posts: 9752
Joined: Tue Mar 18, 2014 11:47 am

Re: Inter-core communication and memory model for C/C++

Thu Jul 07, 2022 4:12 pm

MikeDB wrote:
Thu Jul 07, 2022 8:17 am
HPCguy wrote:
Thu Jul 07, 2022 12:41 am
FunMiles wrote:
Wed Jul 06, 2022 11:52 am
I only became aware that one could write a case: of a switch statement inside a loop two or three years ago. (It was in a cppcon video talking about loop unrolling and how you could deal with the loop remainder with a switch right into one of the unrolled statements).
For those who want to follow up on this concept, the name of this construct is Duff's Device.
And shows up one of the most horrible things in the C language - the fact that switch statements fall through unless there is a break in place, and you can jump into the middle of one of them :-) Suffice it to say, it's prohibited in many company coding guidelines.
Duff's device is used in the Newlib C library on the Pico and other MCUs, and I'd be surprised if similar examples didn't appear in other libraries. Although people use switch in place of a chain of else-if statements, I think it's more appropriate to consider it a computed goto.

In my opinion, rigid coding rules are not that effective at mitigating programmers who lack common sense with C. In such cases mandating Basic, Python or even Go might be a better guideline. One could also promote such programmers to management positions.

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Thu Jul 07, 2022 4:40 pm

ejolson wrote:
Thu Jul 07, 2022 4:12 pm
Duff's device is used in the Newlib C library on the Pico and other MCUs. I'd be surprised if similar examples didn't appear in other libraries. Although people use switch in place of a bunch of else if statements, I think it's more appropriate to consider it a computed goto.

In my opinion rigid coding rules are not that effective in mitigating programmers that lack common sense with C. In such cases mandating Basic, Python or even Go might be a better guideline. One could also promote such programmers to management positions.
Well hopefully one filters out those without common sense before hiring them. And we never use languages other than Javascript, as we're always after high performance.

Biggest thing I have to drum into people coding for us is to use Allman (BSD) style layout as that's been our standard since the 1970s. They always start out using it but then slip into K&R format.
Always interested in innovative audio startups needing help and investment. Look for InPoSe Ltd or Future Horizons on LinkedIn to find me (same avatar photograph)

FunMiles
Posts: 99
Joined: Thu Jun 02, 2022 4:51 pm
Location: Colorado

Re: Inter-core communication and memory model for C/C++

Fri Jul 08, 2022 2:38 am

I found an interesting read that addresses several subjects from the last posts. It predates the C++20 coroutines, as it was meant to guide their design so that they would be well fitted for embedded systems: https://arxiv.org/pdf/1906.00367.pdf

In it, I will note a few amusing things:
  • Agreeing with MikeDB: Tatham noted that "this trick violates every coding standard in the book".
  • From Duff himself: he called the method a "revolting way to use switches to implement interrupt driven state machines".
  • Limitation number one: switch statements cannot be used safely in programs that use Protothreads; they may cause errors that are not detected by the compiler but cause unpredictable behaviour at run-time.
  • And the second (which I think the C++ version solves, more or less): they do not manage local variable state on behalf of the programmer; any variable within the coroutine whose state should be maintained between calls must be declared as static (global).
I really think the C++20 coroutine design is a success in providing the right tools. They are not easy to understand, because it is a complex problem, but they do provide all the tools to be extremely efficient while making it possible to write clean, understandable code whose correctness can be checked fairly easily, as it separates concerns well.

PS: The paper mentions the issue of dynamic memory allocation and the issue of exceptions.
For dynamic memory allocation, the fact that C++20 coroutines offer several mechanisms for the user to control the placement of the coroutine's memory does, I believe, address this issue. And for exceptions, watch the following presentation, which offers a way to avoid the 'normal' C++ exception mechanism and yet still handle exceptional cases: https://www.youtube.com/watch?v=TsXYqnUXrwM

HPCguy
Posts: 169
Joined: Fri Oct 09, 2020 7:08 pm

Re: Inter-core communication and memory model for C/C++

Fri Jul 08, 2022 6:00 am

MikeDB wrote:
Thu Jul 07, 2022 4:40 pm
ejolson wrote:
Thu Jul 07, 2022 4:12 pm
Duff's device is used in the Newlib C library on the Pico and other MCUs. I'd be surprised if similar examples didn't appear in other libraries. Although people use switch in place of a bunch of else if statements, I think it's more appropriate to consider it a computed goto.

In my opinion rigid coding rules are not that effective in mitigating programmers that lack common sense with C. In such cases mandating Basic, Python or even Go might be a better guideline. One could also promote such programmers to management positions.
Well hopefully one filters out those without common sense before hiring them. And we never use languages other than Javascript, as we're always after high performance.

Biggest thing I have to drum into people coding for us is to use Allman (BSD) style layout as that's been our standard since the 1970s. They always start out using it but then slip into K&R format.
I encourage you to actually look at the example on Ben Hoyt's page before disparaging it. If you want to rip into me, do so after actually understanding the context. You'll find it cleaner, more maintainable, and better encapsulated than what you could produce otherwise. And unlike C++20 coroutines, there is no "spooky action at a distance" done by the compiler, since all variables used by the protothread object are created and encapsulated in the object, as good programming standards dictate:

https://github.com/benhoyt/protothreads-cpp

Just because something is "frowned upon" in college without context does not mean it does not have its place as a useful, highly effective programming technique in the right context. The same goes for Duff's device, for example wherever you need precise control over loop unrolling to meet strict hard real-time requirements, or to balance Icache use against performance.

Finally, you can't benchmark *anything* against some general comment found in a paper or a forum thread. What matters is how well it performs for your particular use case, across *all* factors/requirements: maintainability, complexity, performance, etc. For example, task graphs can work extremely well for a MUMPS or one-sided factorization solution technique applied to a linear system.

User avatar
MikeDB
Posts: 1236
Joined: Sun Oct 12, 2014 8:27 am

Re: Inter-core communication and memory model for C/C++

Fri Jul 08, 2022 9:01 am

HPCguy wrote:
Fri Jul 08, 2022 6:00 am

I encourage you to actually look at the example on Ben Hoyt's page before disparaging it. If you want to rip into me, do so after actually understanding the context.
I think you've got me wrong here. I've certainly not ripped into anybody, other than those creating unreadable code. Both co-routines and protothreads potentially do a lot of tidying up of the code and I stated above we will give both a try.

One limitation we have is that the compiler and libraries we use to generate VST applications, about a third of our work, don't support coroutines yet, but I'm sure they will in due course. The other problem is that using either creates a disconnect with the original Matlab prototype, which is a potential entry point for errors.
