I want to use multithreading to speed up my code, but the multithreaded version is slower. Does anyone have experience with multithreaded programming on the Raspi2 who can give me some advice?
- iinnovations
- Posts: 621
- Joined: Thu Jun 06, 2013 5:17 pm
Re: multithreaded program in Raspi2
With such a general question, google will be more helpful. If you have a specific question, about an application or particular piece of code, this would be a good place to give some more detail about what you are trying to do. There are definitely times where threading does and does not make sense.
CuPID Controls :: Open Source browser-based sensor and device control
interfaceinnovations.org/cupidcontrols.html
cupidcontrols.com
Re: multithreaded program in Raspi2
webber007,
What language are you using or wanting to use?
If it's C then OpenMP or Cilk might be an easier approach than the traditional pthreads approach.
Some languages have threading built in. Like Google's Go language.
If you want to distribute tasks around different machines, not just cores on the same machine, then Erlang might be the way to go.
What is the problem you are trying to speed up?
You really have to give us some clues as to what you want to do before anyone can advise.
Of course, going threaded and using more than a single core has its overheads. Some tasks are easier to parallelize than others: some can be divided up to make efficient use of many cores whilst others cannot. See Amdahl's Law.
Here is my parallel FFT in C using OpenMP and also in Go: https://github.com/ZiCog/fftbench
Note: That is not a particularly smart or fast FFT. It uses integer only maths. It was designed to run on a Parallax Inc, 8 core micro-controller.
Memory in C++ is a leaky abstraction.
- Posts: 413
- Joined: Fri Apr 12, 2013 9:27 am
- Location: Essex, UK
Re: multithreaded program in Raspi2
webber007 wrote:
I want to use multithreading to speed up my code, but the multithreaded version is slower. Does anyone have experience with multithreaded programming on the Raspi2 who can give me some advice?
My examples, some good, some bad - the main problems were threading overheads and updating shared data:
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
Links to the code used are included.
Re: multithreaded program in Raspi2
Heater wrote:
What language are you using or wanting to use?
Thanks for your reply. The language I use is C++. I have a C++ program on Windows 7, and I ported it to the Raspi2.
On Windows 7 the multithreaded version is faster than the normal version; on the Raspi2, however, the multithreaded version is slower.
Re: multithreaded program in Raspi2
"A program in Windows7" does not define whether it does a sequential task (like "compute successive decimal digits of pi", which won't multithread) or a fully parallel task (like compute the number up to 256 and colour of each pixel in an 800x800 px region from [-2,-2] to [+2,+2] of the Mandelbrot Set), or a quadcore task (like core1 does something with data in memory while core2 does something involving loading from SD to memory while core3 is waiting for a wakeup off the internet while core4 is doing the desktop rendering). The mandelbrot task in particular is so massively parallel that it would certainly work a lot faster on a graphics card.
So, "a program in Windoze" is doing what?
So, "a program in Windoze" is doing what?
Re: multithreaded program in Raspi2
OK. C++. What are you using to do the threading? pthreads? Something else?
What does this code do?
Memory in C++ is a leaky abstraction.
Re: multithreaded program in Raspi2
Heater wrote:
OK. C++. What are you using to do the threading? pthreads? Something else?
What does this code do?
My program is a software renderer like mesa3d. I use pthreads to create the threads. The structure of my program is like this:
Main thread:
pthread_mutex_lock(&gLock);
gNumWorkingThreads = gNumThreads;
gRenderData.state = TRANSFORM_VERTEX;
pthread_cond_broadcast(&gTransformCond); // wake the worker threads
pthread_mutex_unlock(&gLock);
TransformVertices(0);
pthread_mutex_lock(&gLock);
--gNumWorkingThreads;
// wait for all threads to finish the transform and binning stage
if (gNumWorkingThreads == 0) {
    gNumWorkingThreads = gNumThreads;
    gRenderData.state = RASTER_TRIANGLE;
    pthread_cond_broadcast(&gRasterCond);
} else {
    while (gRenderData.state == TRANSFORM_VERTEX) {
        pthread_cond_wait(&gRasterCond, &gLock);
    }
}
// here we are sure the transform is finished and the others are waiting for the next stage
pthread_mutex_unlock(&gLock);
Rasterize(0);
ShadeFragments(0);
pthread_mutex_lock(&gLock);
--gNumWorkingThreads;
while (gNumWorkingThreads > 0) {
    pthread_cond_wait(&gEndCond, &gLock);
}
pthread_mutex_unlock(&gLock);
Worker thread:
pthread_mutex_lock(&gLock);
while (gRenderData.state == RASTER_TRIANGLE) {
    pthread_cond_wait(&gTransformCond, &gLock);
}
pthread_mutex_unlock(&gLock);
TransformVertices(threadId);
pthread_mutex_lock(&gLock);
--gNumWorkingThreads;
if (gNumWorkingThreads == 0) {
    gRenderData.state = RASTER_TRIANGLE;
    gNumWorkingThreads = gNumThreads;
    pthread_cond_broadcast(&gRasterCond);
} else {
    while (gRenderData.state == TRANSFORM_VERTEX) {
        pthread_cond_wait(&gRasterCond, &gLock);
    }
}
pthread_mutex_unlock(&gLock);
Rasterize(threadId);
ShadeFragments(threadId);
pthread_mutex_lock(&gLock);
--gNumWorkingThreads;
if (gNumWorkingThreads == 0) {
    pthread_cond_signal(&gEndCond);
}
pthread_mutex_unlock(&gLock);
Re: multithreaded program in Raspi2
webber007 wrote:
My program is a software renderer like mesa3d. I use pthreads to create the threads.
It appears you are using pthread mutex locks to synchronize work done in parallel by multiple threads. It was recently reported on the Intel developer forum that locks can behave very differently on Windows and Linux:
mosek.com wrote (on the Intel forum):
It turns [out] that a large part [of the] problems is caused by some locks/mutexes that cause the issue. For some reason the penalty with locks on Windows is small whereas on Linux the locks kill the performance.
I'm not sure that mutex lock speed is the reason your code shows a speed improvement on Windows but not on Raspbian, but it is worth looking into. You might also check whether your worker threads are calling any library functions that are serialized behind the scenes; in particular, any library function with hidden state is likely to be a problem. If you have the time and interest, you might also consider the MIT/Intel Cilk parallel programming extensions for C/C++, which have recently been ported to the Raspberry Pi 2B.
Re: multithreaded program in Raspi2
To test the parallel performance of the Raspi2, I designed code like this:
code A:
for(tid=0;tid<4;tid++){
for(i=0;i<MAX_COUNT;i++){
data[tid]=tid;
for(x=0;x<MAX_VALUE;x++)
{
data[tid]+=(float)x/MAX_VALUE;
}
}
}
code B:
#pragma omp parallel for
for(tid=0;tid<4;tid++){
for(i=0;i<MAX_COUNT;i++){
data[tid]=tid;
for(x=0;x<MAX_VALUE;x++)
{
data[tid]+=(float)x/MAX_VALUE;
}
}
}
code C:
void *add(void *x) {
    long tid = (long)x;  /* cast via long avoids pointer-truncation warnings */
    int i, sum;
    for (i = 0; i < MAX_COUNT; i++) {
        data[tid] = tid;
        for (sum = 0; sum < MAX_VALUE; sum++) {
            data[tid] += (float)sum / MAX_VALUE;
        }
    }
    return NULL;
}

for (i = 1; i < 4; i++) {
    if (pthread_create(&threads[i - 1], NULL, add, (void *)(long)i)) {
        printf("Creation of thread %d failed.\n", i);
        abort();
    }
}
/* the main thread handles slice 0 itself */
for (i = 0; i < MAX_COUNT; i++) {
    data[0] = 0;
    for (x = 0; x < MAX_VALUE; x++) {
        data[0] += (float)x / MAX_VALUE;
    }
}
/* without these joins, timing can stop before the workers finish */
for (i = 1; i < 4; i++) {
    pthread_join(threads[i - 1], NULL);
}
The time that codes A, B and C took was nearly the same. I was confused by the result.
Re: multithreaded program in Raspi2
webber007 wrote:
The time that codes A, B and C took was nearly the same. I was confused by the result.
I tried your code with the MIT/Cilk parallel extensions to gcc-5.0 and obtained a parallel speedup of about 3.97 fold. Since the code executes 2 floating point operations in the inner loop, this translates to a parallel performance of about 1585.5 million single-precision floating point operations per second. If you have NEON enabled and proper parallelization going on, you should be able to reproduce these timings with OpenMP, POSIX threads or Linux threads and futex locks.
Code: Select all
./webber -- Compute Webber's Iterations
Using MAX_COUNT 4096 and MAX_VALUE 4096.
CodeA elapsed time 3.363390176544189e-01 seconds
CodeB elapsed time 8.465195846557617e-02 seconds
Parallel speedup factor 3.9732 fold
Code: Select all
$ /usr/local/gcc-5.0/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
-mfloat-abi=hard -ffast-math -Wall -std=gnu99 -fcilkplus \
-o webber webber.c -lcilkrts -lm
Code: Select all
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#define MAX_COUNT 4096
#define MAX_VALUE 4096
float data[4];
static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=1.0e-6*tp.tv_usec+tp.tv_sec;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return 1.0e-6*tp.tv_usec+tp.tv_sec-tic_time;
}
void codeA(){
for(int tid=0;tid<4;tid++){
for(int i=0;i<MAX_COUNT;i++){
data[tid]=tid;
for(int x=0;x<MAX_VALUE;x++){
data[tid]+=(float)x/MAX_VALUE;
}
}
}
}
void codeB(){
cilk_for(int tid=0;tid<4;tid++){
for(int i=0;i<MAX_COUNT;i++){
data[tid]=tid;
for(int x=0;x<MAX_VALUE;x++){
data[tid]+=(float)x/MAX_VALUE;
}
}
}
}
typedef void (*codetype)();
double bench(codetype code){
double tmin=100.0,ttot=0.0;
for(int j=0;;j++){
tic();
code();
double t=toc();
putchar('.'); fflush(stdout);
tmin=(t<tmin)?t:tmin;
ttot+=t;
if(ttot>5 && j>3) break;
}
printf("\n");
return tmin;
}
int main(int argc, char *argv[]){
printf(
"%s -- Compute Webber's Iterations\n"
"Using MAX_COUNT %d and MAX_VALUE %d.\n\n",
argv[0],MAX_COUNT,MAX_VALUE);
double tserial=bench(codeA);
double tparallel=bench(codeB);
printf(
"\n%24s %24.15e seconds\n"
"%24s %24.15e seconds\n"
"%24s %24g fold\n",
"CodeA elapsed time",tserial,
"CodeB elapsed time",tparallel,
"Parallel speedup factor",tserial/tparallel);
return 0;
}
Re: multithreaded program in Raspi2
ejolson wrote:
I tried your code with the MIT/Cilk parallel extensions to gcc-5.0 and obtained a parallel speedup of about 3.97 fold. Since the code executes 2 floating point operations in the inner loop, this translates to a parallel performance of about 1585.5 million single-precision floating point operations per second. If you have NEON enabled and proper parallelization going on, you should be able to reproduce these timings with OpenMP, POSIX threads or Linux threads and futex locks.
Thanks for your reply. When I compiled my test code using cilk_for with gcc-5.0 and ran it, I got an error: while loading shared libraries ...so.5: cannot open shared object file: No such file or directory.
And I'm confused about why OpenMP or pthreads can't speed up my test code on the Raspi2. Have you tried it?
Re: multithreaded program in Raspi2
webber007 wrote:
Thanks for your reply. When I compiled my test code using cilk_for with gcc-5.0 and ran it, I got an error: while loading shared libraries ...so.5: cannot open shared object file: No such file or directory.
And I'm confused about why OpenMP or pthreads can't speed up my test code on the Raspi2. Have you tried it?
I have successfully used pthreads for parallel processing, but it is difficult. POSIX threads were created at a time before multi-core CPUs were common, to solve problems of input/output concurrency in user-level programs that were arising from mouse interfaces, games and other things. The semantics were designed to be easy to implement as user-level libraries for the influential Unix operating systems of the time---Solaris, AIX, Ultrix, Irix and so forth. In Linux, POSIX threads are also implemented as a library abstraction on top of native Linux threads. Not only is the implementation inefficient, but the design is inconvenient for parallel processing.
The advent of multi-core CPU architectures has led to an increasing focus on improving computational performance through parallel processing. This is a distinctly different task from writing multi-threaded programs to handle concurrent input/output and, as a result, is best accomplished with different programming tools. The main reason to use POSIX threads for parallel processing was so the resulting code would easily compile and run in any POSIX-compliant environment. Now that programming languages designed for parallel processing, such as Cilk, OpenMP and OpenCL, are becoming widely available, there is much less reason to use POSIX threads for parallel processing.
Did you download my binary, Doug's binary from github or compile gcc-5.1 with the Cilk parallel extensions yourself?
In any case, to solve your missing library problem, update the environment variable LD_LIBRARY_PATH to point to the shared libraries. Instructions for doing this are in the thread on Intel/MIT Cilkplus as well as on the github site. Namely, you need to enter a command like
Code: Select all
$ export LD_LIBRARY_PATH=/usr/local/gcc-XXXXX/lib
Alternatively, you can more thoroughly integrate the new compiler into the system so you don't have to set an environment variable to run your programs. To do this, place a file gcc-local.conf in the subdirectory /etc/ld.so.conf.d containing the following lines
Code: Select all
# This is the file /etc/ld.so.conf.d/gcc-local.conf
/usr/local/gcc-XXXXX/lib
and then refresh the dynamic linker cache as root:
Code: Select all
# ldconfig