webber007
Posts: 19
Joined: Fri Apr 10, 2015 11:51 am

multithreaded program in Raspi2

Sun Apr 26, 2015 2:59 am

I want to use multithreading to speed up my code, but the multithreaded version is slower. Does anyone have experience with multithreaded programming on the Raspberry Pi 2 who can give me some advice?

iinnovations
Posts: 621
Joined: Thu Jun 06, 2013 5:17 pm

Re: multithreaded program in Raspi2

Sun Apr 26, 2015 6:37 am

With such a general question, Google will be more helpful. If you have a specific question about an application or a particular piece of code, this would be a good place to give more detail about what you are trying to do. There are definitely cases where threading does and does not make sense.

C
CuPID Controls :: Open Source browser-based sensor and device control
interfaceinnovations.org/cupidcontrols.html
cupidcontrols.com

Heater
Posts: 19056
Joined: Tue Jul 17, 2012 3:02 pm

Re: multithreaded program in Raspi2

Sun Apr 26, 2015 8:10 am

webber007,

What language are you using or wanting to use?

If it's C then OpenMP or Cilk might be an easier approach than the traditional pthreads approach.

Some languages, such as Google's Go, have threading built in.

If you want to distribute tasks around different machines, not just cores on the same machine, then Erlang might be the way to go.

What is the problem you are trying to speed up?

You really have to give us some clues as to what you want to do before anyone can advise.

Of course, going threaded and using more than a single core has its overheads. Some tasks are easier to parallelize than others, and some tasks can be divided up to make efficient use of many cores whilst others cannot. See Amdahl's Law.

Here is my parallel FFT in C using OpenMP and also in Go: https://github.com/ZiCog/fftbench

Note: that is not a particularly smart or fast FFT. It uses integer-only maths and was designed to run on a Parallax Inc. 8-core microcontroller.
Memory in C++ is a leaky abstraction.

RoyLongbottom
Posts: 413
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: multithreaded program in Raspi2

Sun Apr 26, 2015 9:35 am

webber007 wrote: I want to use multithreading to speed up my code, but the multithreaded version is slower. Does anyone have experience with multithreaded programming on the Raspberry Pi 2 who can give me some advice?
My examples, some good, some bad - main problems were threading overheads and updating shared data:

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

A link to the code used is included.

webber007
Posts: 19
Joined: Fri Apr 10, 2015 11:51 am

Re: multithreaded program in Raspi2

Sun Apr 26, 2015 1:51 pm

Heater wrote: What language are you using or wanting to use? [...] What is the problem you are trying to speed up? You really have to give us some clues as to what you want to do before anyone can advise.
Thanks for your reply. The language I use is C++. I have a C++ program on Windows 7 that I ported to the Raspberry Pi 2. On Windows 7 the multithreaded version is faster than the single-threaded version; however, on the Raspberry Pi 2 the multithreaded version is slower.

r4049zt
Posts: 113
Joined: Sat Jul 21, 2012 1:36 pm

Re: multithreaded program in Raspi2

Sun Apr 26, 2015 3:19 pm

"A program in Windows7" does not tell us whether it is a sequential task (like "compute successive decimal digits of pi", which won't multithread), a fully parallel task (like "compute the iteration count, up to 256, and colour of each pixel in an 800x800 px region covering [-2,-2] to [+2,+2] of the Mandelbrot set"), or a quad-core task (like core 1 works on data in memory while core 2 loads from SD card to memory, core 3 waits for a wakeup off the internet, and core 4 renders the desktop). The Mandelbrot task in particular is so massively parallel that it would certainly run a lot faster on a graphics card.

So, "a program in Windoze" is doing what?

Heater
Posts: 19056
Joined: Tue Jul 17, 2012 3:02 pm

Re: multithreaded program in Raspi2

Mon Apr 27, 2015 11:15 am

OK. C++. What are you using to do the threading? pthreads? Something else?

What does this code do?
Memory in C++ is a leaky abstraction.

webber007
Posts: 19
Joined: Fri Apr 10, 2015 11:51 am

Re: multithreaded program in Raspi2

Tue Apr 28, 2015 11:22 am

Heater wrote:OK. C++. What are you using to do the threading? pthreads? Something else?

What does this code do?
My program is a software renderer, like Mesa3D. I use pthreads to create the threads. The frame structure looks like this:

Main thread:

    pthread_mutex_lock(&gLock);
    gNumWorkingThreads = gNumThreads;
    gRenderData.state = TRANSFORM_VERTEX;
    pthread_cond_broadcast(&gTransformCond); // wake the worker threads
    pthread_mutex_unlock(&gLock);

    TransformVertices(0);

    pthread_mutex_lock(&gLock);
    --gNumWorkingThreads;
    // wait for all threads to finish the transform and binning stage
    if (gNumWorkingThreads == 0) {
        gNumWorkingThreads = gNumThreads;
        gRenderData.state = RASTER_TRIANGLE;
        pthread_cond_broadcast(&gRasterCond);
    } else {
        while (gRenderData.state == TRANSFORM_VERTEX) {
            pthread_cond_wait(&gRasterCond, &gLock);
        }
    }
    // here the transform is finished and the others are waiting for the next stage
    pthread_mutex_unlock(&gLock);

    Rasterize(0);
    ShadeFragments(0);

    pthread_mutex_lock(&gLock);
    --gNumWorkingThreads;
    while (gNumWorkingThreads > 0) {
        pthread_cond_wait(&gEndCond, &gLock);
    }
    pthread_mutex_unlock(&gLock);


Worker thread:

    pthread_mutex_lock(&gLock);
    while (gRenderData.state == RASTER_TRIANGLE) {
        pthread_cond_wait(&gTransformCond, &gLock);
    }
    pthread_mutex_unlock(&gLock);

    TransformVertices(threadId);

    pthread_mutex_lock(&gLock);
    --gNumWorkingThreads;
    if (gNumWorkingThreads == 0) {
        gRenderData.state = RASTER_TRIANGLE;
        gNumWorkingThreads = gNumThreads;
        pthread_cond_broadcast(&gRasterCond);
    } else {
        while (gRenderData.state == TRANSFORM_VERTEX) {
            pthread_cond_wait(&gRasterCond, &gLock);
        }
    }
    pthread_mutex_unlock(&gLock);

    Rasterize(threadId);
    ShadeFragments(threadId);

    pthread_mutex_lock(&gLock);
    --gNumWorkingThreads;
    if (gNumWorkingThreads == 0) {
        pthread_cond_signal(&gEndCond);
    }
    pthread_mutex_unlock(&gLock);

ejolson
Posts: 9483
Joined: Tue Mar 18, 2014 11:47 am

Re: multithreaded program in Raspi2

Tue Apr 28, 2015 7:20 pm

webber007 wrote: My program is a software renderer, like Mesa3D. I use pthreads to create the threads.
It appears you are using pthread mutex locks to synchronize work done in parallel by multiple threads. It was recently reported on the Intel developer forum that
a user from mosek.com on the Intel forum wrote: It turns [out] that a large part [of the] problems is caused by some locks/mutexes that cause the issue. For some reason the penalty with locks on Windows is small whereas on Linux the locks kills the performance.
I'm not sure that mutex lock speed is the reason your code shows a speed improvement on Windows but not on Raspbian, but it is worth looking into. You might also check whether your worker threads are calling any library functions that are serialized behind the scenes. In particular, any library function with hidden state is likely to be a problem. If you have time and are interested, you might also consider the MIT/Intel Cilk parallel programming extensions for C/C++, which have recently been ported to the Raspberry Pi 2B.

webber007
Posts: 19
Joined: Fri Apr 10, 2015 11:51 am

Re: multithreaded program in Raspi2

Wed Apr 29, 2015 6:58 am

To test parallel performance on the Raspberry Pi 2, I wrote test code like this:

code A:

    for (tid = 0; tid < 4; tid++) {
        for (i = 0; i < MAX_COUNT; i++) {
            data[tid] = tid;
            for (x = 0; x < MAX_VALUE; x++) {
                data[tid] += (float)x / MAX_VALUE;
            }
        }
    }

code B:

    /* i and x are declared outside the loop, so they must be private
       per thread or the iterations race on them */
    #pragma omp parallel for private(i, x)
    for (tid = 0; tid < 4; tid++) {
        for (i = 0; i < MAX_COUNT; i++) {
            data[tid] = tid;
            for (x = 0; x < MAX_VALUE; x++) {
                data[tid] += (float)x / MAX_VALUE;
            }
        }
    }

code C:

    void *add(void *x) {
        int tid = (int)(long)x;
        int i, sum;
        for (i = 0; i < MAX_COUNT; i++) {
            data[tid] = tid;
            for (sum = 0; sum < MAX_VALUE; sum++) {
                data[tid] += (float)sum / MAX_VALUE;
            }
        }
        return NULL;
    }

    for (i = 1; i < 4; i++) {
        if (pthread_create(&threads[i - 1], NULL, add, (void *)(long)i)) {
            printf("Creation of thread %d failed.\n", i);
            abort();
        }
    }

    for (i = 0; i < MAX_COUNT; i++) {
        data[0] = 0;
        for (x = 0; x < MAX_VALUE; x++) {
            data[0] += (float)x / MAX_VALUE;
        }
    }

    /* join the workers, otherwise the timing stops before they finish */
    for (i = 1; i < 4; i++) {
        pthread_join(threads[i - 1], NULL);
    }



Codes A, B and C all took nearly the same time. I am confused by this result.

ejolson
Posts: 9483
Joined: Tue Mar 18, 2014 11:47 am

Re: multithreaded program in Raspi2

Wed Apr 29, 2015 9:04 am

webber007 wrote: Codes A, B and C all took nearly the same time. I am confused by this result.
I tried your code with the MIT/Cilk parallel extensions to gcc-5.0 and obtained a parallel speedup of about 3.97 fold.

Code: Select all

./webber -- Compute Webber's Iterations
Using MAX_COUNT 4096 and MAX_VALUE 4096.

      CodeA elapsed time    3.363390176544189e-01 seconds
      CodeB elapsed time    8.465195846557617e-02 seconds
 Parallel speedup factor                   3.9732 fold
Since the code executes 2 floating point operations in the inner loop, this translates to a parallel performance of about 1585.5 million single-precision floating point operations per second. If you have NEON enabled and proper parallelization going on, you should be able to reproduce these timings with OpenMP, POSIX threads or Linux threads and futex locks. For reference the compilation command was

Code: Select all

$ /usr/local/gcc-5.0/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
    -mfloat-abi=hard -ffast-math -Wall -std=gnu99 -fcilkplus \
    -o webber webber.c -lcilkrts -lm
and the code is

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>

#define MAX_COUNT 4096
#define MAX_VALUE 4096

float data[4];

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=1.0e-6*tp.tv_usec+tp.tv_sec;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return 1.0e-6*tp.tv_usec+tp.tv_sec-tic_time;
}

void codeA(){
    for(int tid=0;tid<4;tid++){
        for(int i=0;i<MAX_COUNT;i++){
            data[tid]=tid;
            for(int x=0;x<MAX_VALUE;x++){
                data[tid]+=(float)x/MAX_VALUE;
            }
        }
    }
}
void codeB(){
    cilk_for(int tid=0;tid<4;tid++){
        for(int i=0;i<MAX_COUNT;i++){
            data[tid]=tid;
            for(int x=0;x<MAX_VALUE;x++){
                data[tid]+=(float)x/MAX_VALUE;
            }
        }
    }
}

typedef void (*codetype)();
double bench(codetype code){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        code();
        double t=toc();
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Compute Webber's Iterations\n"
        "Using MAX_COUNT %d and MAX_VALUE %d.\n\n",
        argv[0],MAX_COUNT,MAX_VALUE);
    double tserial=bench(codeA);
    double tparallel=bench(codeB);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "CodeA elapsed time",tserial,
        "CodeB elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}

webber007
Posts: 19
Joined: Fri Apr 10, 2015 11:51 am

Re: multithreaded program in Raspi2

Sun May 03, 2015 2:16 am

ejolson wrote: I tried your code with the MIT/Cilk parallel extensions to gcc-5.0 and obtained a parallel speedup of about 3.97 fold. [...] If you have NEON enabled and proper parallelization going on, you should be able to reproduce these timings with OpenMP, POSIX threads or Linux threads and futex locks.
Thanks for your reply. When I compile my test code using cilk_for with gcc-5.0 and run it, it fails with the error "while loading shared libraries ...so.5: cannot open shared object file: No such file or directory".
I am still confused about why OpenMP and pthreads cannot speed up my test code on the Raspberry Pi 2. Did you give them a try?

ejolson
Posts: 9483
Joined: Tue Mar 18, 2014 11:47 am

Re: multithreaded program in Raspi2

Sun May 03, 2015 5:30 pm

webber007 wrote: Thanks for your reply. When I compile my test code using cilk_for with gcc-5.0 and run it, it fails with the error "while loading shared libraries ...so.5: cannot open shared object file: No such file or directory".
I am still confused about why OpenMP and pthreads cannot speed up my test code on the Raspberry Pi 2. Did you give them a try?
I have successfully used pthreads for parallel processing, but it is difficult. POSIX threads were created in a time before multi-core CPUs were common, to solve problems of input/output concurrency in user-level programs that were arising from mouse interfaces, games and other things. The semantics were designed to be easy to implement as user-level libraries for the influential Unix operating systems of the time---Solaris, AIX, Ultrix, Irix and so forth. In Linux, POSIX threads are likewise implemented as a library abstraction on top of native Linux threads. Not only is the implementation inefficient, but the design is inconvenient for parallel processing.

The advent of multi-core CPU architectures has led to an increasing focus on improving computational performance through parallel processing. This is a distinctly different task from writing multi-threaded programs to handle concurrent input/output and, as a result, is best accomplished with different programming tools. The main reason to use POSIX threads for parallel processing was so the resulting code would easily compile and run in any POSIX-compliant environment. Now that programming languages designed for parallel processing, such as Cilk, OpenMP and OpenCL, are becoming widely available, there is much less reason to use POSIX threads for parallel processing.

Did you download my binary, Doug's binary from github or compile gcc-5.1 with the Cilk parallel extensions yourself?

In any case, to solve your missing library problem, update the environment variable LD_LIBRARY_PATH to point to the shared libraries. Instructions for doing this are in the thread on Intel/MIT Cilk Plus as well as on the GitHub site. Namely, you need to enter a command like

Code: Select all

$ export LD_LIBRARY_PATH=/usr/local/gcc-XXXXX/lib
where the XXXXX is different depending on which compiler you installed, before trying to run the programs compiled with the new compiler.

Alternatively, you can more thoroughly integrate the new compiler into the system so you don't have to set an environment variable to run your programs. To do this, place a file gcc-local.conf in the subdirectory /etc/ld.so.conf.d containing the following lines

Code: Select all

# This is the file /etc/ld.so.conf.d/gcc-local.conf
/usr/local/gcc-XXXXX/lib
where again XXXXX depends on which compiler you installed. After you have added this file, as root, execute the following

Code: Select all

# ldconfig
to include the new library directories into the system.
