scruss
Posts: 5161
Joined: Sat Jun 09, 2012 12:25 pm
Location: Toronto, ON

Re: A Pi Pie Chart

Sun Jun 26, 2022 3:24 pm

There's a really old one here: pichart-current.tgz - 2021-02-05. I don't have anything newer
‘Remember the Golden Rule of Selling: “Do not resort to violence.”’ — McGlashan.
Pronouns: he/him

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sun Jun 26, 2022 4:18 pm

scruss wrote:
Sun Jun 26, 2022 3:24 pm
There's a really old one here: pichart-current.tgz - 2021-02-05. I don't have anything newer
That is, in fact, the newest version. Note also I've tried to keep changes minimal so the first Pi pie charts are comparable to the latest.

Due to the Shields Up initiative

https://www.cisa.gov/shields-up

access to fractal.math.unr.edu is currently through a VPN. The website is being migrated to a virtual machine in the DMZ. The VM is already provisioned but I expect it to be a few more weeks before the links are working again.

lurk101
Posts: 1976
Joined: Mon Jan 27, 2020 2:35 pm
Location: Cumming, GA (US)

Re: A Pi Pie Chart

Sun Jun 26, 2022 7:22 pm

Initial RISCV pi format SBC not very impressive. Lands somewhere between Pi Zero and Pi Zero 2.

Code: Select all

star@VisionFive:~/pichart-36$ gcc --version
gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

star@VisionFive:~/pichart-36$ make
gcc -std=gnu99 -O3 -Wall -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -O3 -Wall -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
star@VisionFive:~/pichart-36$ ./pichart-openmp -t "Dual-Core 64-bit RISCV64GC 2MB L2 @ 1.0GHz"
pichart -- Raspberry Pi Performance OPENMP version 36

Prime Sieve          P=14630843 Workers=2 Sec=3.60139 Mops=259.435
Merge Sort           N=16777216 Workers=2 Sec=5.28436 Mops=76.1971
Fourier Transform    N=4194304 Workers=2 Sec=7.67046 Mflops=60.1493
Lorenz 96            N=32768 K=16384 Workers=2 Sec=12.2998 Mflops=261.892

The Dual-Core 64-bit RISCV64GC 2MB L2 @ 1.0GHz has Raspberry Pi ratio=3.73012
Making pie charts...done.
star@VisionFive:~/pichart-36$
Screenshot_1.png
The Linux Foundation is like loggers who claim to speak for the trees.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Tue Jun 28, 2022 7:08 am

lurk101 wrote:
Sun Jun 26, 2022 7:22 pm
Initial RISCV pi format SBC not very impressive. Lands somewhere between Pi Zero and Pi Zero 2.

Thanks for posting! Is this

https://www.cnx-software.com/2021/11/27 ... -computer/

the system you tested? The specs in that post indicate the CPU runs at 1.5 GHz. Were you running it underclocked or is there a typo somewhere?

AndrewPiEater
Posts: 141
Joined: Wed Jul 16, 2014 4:45 pm

Re: A Pi Pie Chart

Tue Jun 28, 2022 8:37 am

lurk101 wrote:
Sun Jun 26, 2022 7:22 pm
Initial RISCV pi format SBC not very impressive. Lands somewhere between Pi Zero and Pi Zero 2.
That's interesting. I have an Allwinner D1-based SBC with a single-core RISC-V CPU. It doesn't even reach Pi Zero performance. But it was much cheaper than the StarFive SBC and good enough for my curiosity and to give me a first look.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 31871
Joined: Sat Jul 30, 2011 7:41 pm

Re: A Pi Pie Chart

Tue Jun 28, 2022 9:52 am

I think it is going to be a while before RISCV is solid enough for real work, but looks like things are slowly moving in that area.
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.

lurk101
Posts: 1976
Joined: Mon Jan 27, 2020 2:35 pm
Location: Cumming, GA (US)

Re: A Pi Pie Chart

Tue Jun 28, 2022 1:09 pm

ejolson wrote:
Tue Jun 28, 2022 7:08 am
the system you tested? The specs in that post indicate the CPU runs at 1.5 GHz. Were you running it underclocked or is there a typo somewhere?
Unclear. Most sites carry early evaluations of the VisionFive and state 1.5 GHz, but the current documentation says 1.0 GHz. This unit was borrowed without documentation and runs at a fixed clock rate, so there's no Ubuntu command to display it.
The Linux Foundation is like loggers who claim to speak for the trees.

lurk101
Posts: 1976
Joined: Mon Jan 27, 2020 2:35 pm
Location: Cumming, GA (US)

Re: A Pi Pie Chart

Tue Jun 28, 2022 1:10 pm

jamesh wrote:
Tue Jun 28, 2022 9:52 am
I think it is going to be a while before RISCV is solid enough for real work, but looks like things are slowly moving in that area.
Agreed, meanwhile I'm suffering from ARM fatigue.
The Linux Foundation is like loggers who claim to speak for the trees.

scruss
Posts: 5161
Joined: Sat Jun 09, 2012 12:25 pm
Location: Toronto, ON

Re: A Pi Pie Chart

Tue Jun 28, 2022 3:27 pm

jamesh wrote:
Tue Jun 28, 2022 9:52 am
I think it is going to be a while before RISCV is solid enough for real work, but looks like things are slowly moving in that area.
It's decent in the micro-controller field. The ESP32-C3 is a single-core RISC-V platform running at up to 160 MHz with 400 KB of SRAM and solid C and MicroPython support. It's in the WEMOS C3 mini, which is around the same price as a Raspberry Pi Pico.
‘Remember the Golden Rule of Selling: “Do not resort to violence.”’ — McGlashan.
Pronouns: he/him

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sat Sep 24, 2022 7:00 pm

As described in

viewtopic.php?p=2040484#p2040484

the school where I teach has
partnered with Apple [to provide] a common learning platform and equal access to technology and digital tools for new, undergraduate degree-seeking students and faculty.
I now have access to an iPad.

The iSH i386 emulator allows one to run Alpine Linux on an iPad. In this thread I'm testing a recent iPad Air 5.

According to what
iSH wrote: Possibly the most interesting thing I wrote as part of iSH is the JIT. It's not actually a JIT since it doesn't target machine code. Instead it generates an array of pointers to functions called gadgets, and each gadget ends with a tailcall to the next function; like the threaded code technique used by some Forth interpreters. The result is a speedup of roughly 3-5x compared to pure emulation
https://github.com/ish-app/ish

Amazingly, iSH is available from the App Store so one doesn't have to go through any strange steps to install it.

After installation I added gcc and the development tools with

Code: Select all

# apk add gcc
# apk add musl-dev
after which it was possible to compile the pichart program.

Unfortunately, the first run ended with an Invalid Instruction. Upon removing "-mtune=native -march=native" from CFLAGS the pichart program ran. However, I further had to force the timing routines in util.c to use gettimeofday as otherwise the reported run times were on the order of 8.16189e+12 seconds.
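For anyone hitting the same timing problem, here is a minimal sketch of a wall-clock timer built only on gettimeofday. The names wall_seconds and elapsed_seconds are made up for illustration; the actual routines in util.c are not shown in this thread and may well be organised differently.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper, not the code in util.c: wall-clock time in seconds
   since the Unix epoch, read with gettimeofday. */
double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    /* tv_usec holds microseconds, so scale it to seconds before adding. */
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Elapsed wall time between two readings taken with wall_seconds. */
double elapsed_seconds(double start, double stop) {
    return stop - start;
}
```

Taking one reading before a test and one after, then dividing the operation count by the difference, would give a rate in the same spirit as the Mops and Mflops figures, assuming the clock is not adjusted during the run.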

The final result was

Code: Select all

$ time ./pichart-serial -t "iSH on iPad Air 5"
pichart -- Raspberry Pi Performance Serial version 36

Prime Sieve          P=14630843 Workers=2 Sec=22.8542 Mops=40.8822
Merge Sort           N=16777216 Workers=1 Sec=17.5209 Mops=22.9813
Fourier Transform    N=4194304 Workers=1 Sec=21.1913 Mflops=21.7719
Lorenz 96            N=32768 K=16384 Workers=1 Sec=73.4931 Mflops=43.8303

The iSH on iPad Air 5 has Raspberry Pi ratio=0.864047
Making pie charts...done.
real    24m 46.78s
user    24m 43.64s
sys     0m 3.13s
Image

This is slightly slower than the original Raspberry Pi, but still fast enough for simple software development. I wonder if the Pico SDK can be installed.

scruss
Posts: 5161
Joined: Sat Jun 09, 2012 12:25 pm
Location: Toronto, ON

Re: A Pi Pie Chart

Sun Sep 25, 2022 12:28 am

ejolson wrote:
Sat Sep 24, 2022 7:00 pm
Amazingly, iSH is available from the App Store so one doesn't have to go through any strange steps to install it.
… for now. It's only a matter of time before Apple notices and kicks it out. I mean, they don't allow emulators that run arbitrary BASIC, so this one's gonna go splat soon.
‘Remember the Golden Rule of Selling: “Do not resort to violence.”’ — McGlashan.
Pronouns: he/him

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sun Sep 25, 2022 1:09 am

scruss wrote:
Sun Sep 25, 2022 12:28 am
ejolson wrote:
Sat Sep 24, 2022 7:00 pm
Amazingly, iSH is available from the App Store so one doesn't have to go through any strange steps to install it.
… for now. It's only a matter of time before Apple notices and kicks it out. I mean, they don't allow emulators that run arbitrary BASIC, so this one's gonna go splat soon.
Apparently Apple kicked it out around October 2020

https://ish.app/blog/app-store-removal

and that was successfully appealed. I don't know the details.

Since everything is running in an emulated i386, maybe it doesn't violate the rules any more than those programmable HP calculator applications. It's also interesting that much of the web would not work without being able to execute arbitrary JavaScript or WebAssembly these days.

In my opinion, for Apple to have long-term success selling iPads with a keyboard folio and Apple Pencil as university-recommended educational tools, they need applications that allow locally-hosted user programmability. The representative mentioned Swift Playgrounds and Remote Desktop, but what the teachers I know wanted was Python, R and Julia.

From that point of view something like a PiPad running Linux with long battery life could tick more of the checkboxes.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Tue Sep 27, 2022 5:57 pm

ejolson wrote:
Sat Sep 24, 2022 7:00 pm
This is slightly slower than the original Raspberry Pi, but still fast enough for simple software development. I wonder if the Pico SDK can be installed.
I discovered another scripting application for the iPad which allows one to compile and execute C code. The a-shell app uses clang to create WebAssembly which can be executed using wasm. Note a-shell is also available from the App Store and can be installed without any complications.

To get things to compile I had to remove the use of clock_gettime from util.c and further exclude sys/time.h and sys/resource.h from pichart.h. Compiling and linking all the files on a single line crashed, so I split the build up as

Code: Select all

$ make
clang -std=gnu99 -O3 -Wall -c pichart.c
clang -std=gnu99 -O3 -Wall -c util.c
clang -std=gnu99 -O3 -Wall -c sieve.c
clang -std=gnu99 -O3 -Wall -c merge.c
clang -std=gnu99 -O3 -Wall -c fourier.c
clang -std=gnu99 -O3 -Wall -c lorenz.c
clang -std=gnu99 -O3 -Wall -o pichart-serial pichart.o util.o sieve.o merge.o fourier.o lorenz.o -lm
After this the output was

Code: Select all

$ wasm pichart-serial -t "iPad Air 5 Wasm"
pichart -- Raspberry Pi Performance Serial version 36

Prime Sieve          P=14630843 Workers=2 Sec=1.075 Mops=869.143
Merge Sort           N=16777216 Workers=1 Sec=1.543 Mops=260.955
Fourier Transform    N=4194304 Workers=2 Sec=0.356 Mflops=1295.99
Lorenz 96            N=32768 K=16384 Workers=2 Sec=0.857 Mflops=3758.72

The iPad Air 5 Wasm has Raspberry Pi ratio=28.7884
Making pie charts...done.
and the pichart looked like

Image

According to what
Wikipedia wrote: The fifth-generation iPad Air uses the M1 SoC
https://en.wikipedia.org/wiki/IPad_Air_(5th_generation)

I wonder how close the WebAssembly JIT compiler gets to native speed. It's a lot faster than the emulated i386.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Thu Sep 29, 2022 3:54 am

The dog developer has been running some single-core jobs on a 128-core dual-socket EPYC 7702 server and complaining that, when all the cores in a NUMA node are busy, performance is about three times slower than when only one core is in use. After pondering what a factor-of-three performance difference means for shared-tenant hardware in the cloud, I wondered how much the four cores on a Raspberry Pi interfere with each other when running pichart.

To measure this I first ran the serial version of pichart with the other cores idle to find the baseline performance. I then ran separate copies of the single-threaded pichart program in stress mode on one to three of the other cores while testing the fourth core for possible performance degradation. This was done for each of the four different computations.

Note that only the serial version of pichart was used, but multiple copies were run on different processor cores.

The script

Code: Select all

#!/bin/bash

# The name of the system being tested
sys="Pi4B (1500MHz)"
# The number of cores
let numcpu=4

let smx=numcpu-1
echo pkill pichart-serial
pkill pichart-serial
sleep 5
for r in 1 2 4 8
do
    echo Running test $r...
    let n=0
    while test $n -le $smx
    do
        sync
        sleep 5
        if test $n -gt 0
        then
            let z=n-1
            echo "Starting stress on CPU 0 through $z..."
        fi
        let l=0
        while test $l -lt $n
        do
            echo taskset -c $l ./pichart-serial -t"$sys" -r$r -w10
            taskset -c $l ./pichart-serial -t"$sys" -r$r -w10 \
                >run${r}_${n}_${l}.out &
            let l=l+1
        done
        f="waiting"
        while test $f = "waiting"
        do
            sleep 5
            let l=0
            let s=0
            while test $l -lt $n
            do
                if grep -q "stress" run${r}_${n}_${l}.out
                then
                    let s=s+1
                fi
                let l=l+1
            done
            if test $l -eq $s
            then
                f="stress"
            fi
        done
        if test $n -gt 0
        then
            let z=n-1
            echo "...CPU 0 through $z are running stress"
        fi
        sleep 5
        echo taskset -c $smx ./pichart-serial -t"$sys" -r$r
        taskset -c $smx ./pichart-serial -t"$sys" -r$r \
            >run${r}_${n}.out
        sleep 5
        echo pkill pichart-serial
        pkill pichart-serial
        sleep 5
        let n=n+1
    done
    echo ...done with test $r
done
produced the output

Code: Select all

$ ./doload
pkill pichart-serial
Running test 1...
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r1
pkill pichart-serial
Starting stress on CPU 0 through 0...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
...CPU 0 through 0 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r1
pkill pichart-serial
Starting stress on CPU 0 through 1...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
...CPU 0 through 1 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r1
pkill pichart-serial
Starting stress on CPU 0 through 2...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
taskset -c 2 ./pichart-serial -tPi4B (1500MHz) -r1 -w10
...CPU 0 through 2 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r1
pkill pichart-serial
...done with test 1
Running test 2...
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r2
pkill pichart-serial
Starting stress on CPU 0 through 0...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
...CPU 0 through 0 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r2
pkill pichart-serial
Starting stress on CPU 0 through 1...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
...CPU 0 through 1 are running stress   
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r2
pkill pichart-serial
Starting stress on CPU 0 through 2...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
taskset -c 2 ./pichart-serial -tPi4B (1500MHz) -r2 -w10
...CPU 0 through 2 are running stress  
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r2
pkill pichart-serial
...done with test 2
Running test 4...
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r4
pkill pichart-serial
Starting stress on CPU 0 through 0...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
...CPU 0 through 0 are running stress   
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r4
pkill pichart-serial
Starting stress on CPU 0 through 1...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
...CPU 0 through 1 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r4
pkill pichart-serial
Starting stress on CPU 0 through 2...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
taskset -c 2 ./pichart-serial -tPi4B (1500MHz) -r4 -w10
...CPU 0 through 2 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r4
pkill pichart-serial
...done with test 4
Running test 8...
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r8
pkill pichart-serial
Starting stress on CPU 0 through 0...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
...CPU 0 through 0 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r8 
pkill pichart-serial
Starting stress on CPU 0 through 1...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
...CPU 0 through 1 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r8
pkill pichart-serial
Starting stress on CPU 0 through 2...
taskset -c 0 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
taskset -c 1 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
taskset -c 2 ./pichart-serial -tPi4B (1500MHz) -r8 -w10
...CPU 0 through 2 are running stress
taskset -c 3 ./pichart-serial -tPi4B (1500MHz) -r8
pkill pichart-serial
...done with test 8
along with a bunch of runX_Y.out files where X indicates the test type and Y the number of other cores that were busy.

Graphing the data yielded

Image

Depending on the test, the slowdown ranged from none at all to more than 60 percent. For example, prime sieve experienced no degradation when other cores were busy while the Fourier transform and Lorenz 96 dynamical simulation showed significant effects.
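To turn the runX_Y.out files into a percentage like the one quoted here, something along the following lines could be used. This is only a sketch: it assumes each result line in those files carries a trailing "Mops=" or "Mflops=" field in the same layout as the pichart output shown earlier in the thread, and the function names parse_rate and percent_slower are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Extract the rate after "Mflops=" or "Mops=" from one pichart result line.
   Returns 0 and stores the value in *rate on success, -1 if no rate found. */
int parse_rate(const char *line, double *rate) {
    const char *p = strstr(line, "Mflops=");
    if (p != NULL) {
        p += strlen("Mflops=");
    } else {
        p = strstr(line, "Mops=");
        if (p == NULL) {
            return -1;
        }
        p += strlen("Mops=");
    }
    return sscanf(p, "%lf", rate) == 1 ? 0 : -1;
}

/* Percentage by which the loaded rate falls below the baseline rate. */
double percent_slower(double baseline, double loaded) {
    return 100.0 * (1.0 - loaded / baseline);
}
```

With the baseline taken from the run with no other cores busy and the loaded rate from a run with one to three stress copies, percent_slower gives the drop in throughput. Note that this measures lost throughput; a figure based on run time instead would come out larger for the same data.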

These differences likely result from sharing the bandwidth of main memory and from how memory intensive each type of test is. In my opinion the results also shed light on why the parallel versions of the same calculation scale better in some cases and not others. When I showed the results to Fido, the canine coder growled something about Amdog's Law and then started chasing a rabbit across the field.

I suspect the memory contention between cores would be even more significant with a processor clocked at 1800MHz but haven't checked.

jahboater
Posts: 8370
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: A Pi Pie Chart

Thu Sep 29, 2022 9:17 am

ejolson wrote:
Thu Sep 29, 2022 3:54 am
Depending on the test the performance varied from not at all to more than 60 percent slower. For example, prime sieve experienced no degradation when other cores were busy while the Fourier transform and Lorenz 96 dynamical simulation showed significant effects.
Very interesting!

There is also the random interrupt load and perhaps OS housekeeping tasks.
With four CPU cores, there should always be a spare core to execute this other stuff, allowing the benchmark program 100% access to its core.

Could you use isolcpus and then taskset to the isolated CPU?

Also, there might be something running that spreads the interrupt load evenly around all the cores?

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Thu Sep 29, 2022 8:02 pm

jahboater wrote:
Thu Sep 29, 2022 9:17 am
ejolson wrote:
Thu Sep 29, 2022 3:54 am
Depending on the test the performance varied from not at all to more than 60 percent slower. For example, prime sieve experienced no degradation when other cores were busy while the Fourier transform and Lorenz 96 dynamical simulation showed significant effects.
Very interesting!

There is also the random interrupt load and perhaps OS housekeeping tasks.
With four CPU cores, there should always be a spare core to execute this other stuff, allowing the benchmark program 100% access to its core.

Could you use isolcpus and then taskset to the isolated CPU?

Also there might be something running that spreads the interrupt load evenly around all the cores ?
After checking it would appear I forgot to set the CPU governor to performance before performing the tests. I reran the script and now the up and down behavior for the Lorenz equations between 2 and 3 is gone. The graph in the previous post has been updated.

I think people running clouds reserve a few CPUs for housekeeping tasks and don't assign everything to a VM. It's so difficult to separate operating system effects from hardware interactions that my preference is to view tests like this one as a systems test that measures combined effects of all performance characteristics as a whole.

In the end, it's what the user experiences that's important. This is also why I report wall time rather than, for example, CPU performance counters. Performance counters along with more targeted tests play an important role in tuning software and hardware; however, the idea here is to measure what happens for some realistic computational tasks coded in C without such tuning.

Having said that, I find it quite reassuring that the prime sieve does not experience any contention when four independent copies are running one per core. Note that the code for prime sieve is full of CPU-intensive bit operations used to conserve memory. Thus, it's not surprising it is the least memory intensive.

On the other hand Lorenz 96 consists of a floating-point kernel that sequentially updates the dynamical state in memory. I think the reason the Fourier transform doesn't degrade as fast is because it tends to be stalled on a cache miss even when the other cores are idle.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Fri Sep 30, 2022 1:07 am

ejolson wrote:
Thu Sep 29, 2022 3:54 am
For example, prime sieve experienced no degradation when other cores were busy while the Fourier transform and Lorenz 96 dynamical simulation showed significant effects.
Here is the same test run on a 6-core Ryzen 4650G APU.

Image

In this case only the Fourier transform shows significant effects when other cores are busy independently performing the same task.

Notably the Lorenz 96 simulation experienced almost no degradation when other cores were busy. While the cache
  • L1 Cache: 64KB (per core)
  • L2 Cache: 512KB (per core)
  • L3 Cache: 8MB (shared)
of the 4650G is not huge by modern standards, the cache appears large enough to allow much better scaling for the Lorenz 96.

For reference the cache on the Pi 4B is
  • L1 I-Cache: 48KB (per core)
  • L1 D-Cache: 32KB (per core)
  • L2 Cache: 1MB (shared)
Note that the Lorenz 96 kernel

Code: Select all

        z[-1]=z[N-1]; z[-2]=z[N-2]; z[N]=z[0];
        for(int i=0;i<N;i++){
            y[i]=(z[i]+dt*(z[i+1]-z[i-2])*z[i-1])*expmdt+forcdt;
        }
touches 512KB of RAM each iteration.
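As a rough check on that figure: the benchmark output reports N=32768, and if y and z are arrays of 8-byte doubles (an assumption, since the declarations are not shown here), one sweep reads z and writes y for 2 × 32768 × 8 = 524288 bytes, which is 512 KB. The helper names below are hypothetical and the cache test deliberately ignores code, stack and everything else competing for the cache.

```c
#include <stddef.h>

/* Bytes touched by one sweep of the kernel: z is read and y is written,
   so two arrays of n elements each. The halo entries z[-2], z[-1] and
   z[n] are ignored because they add only a few bytes. */
size_t lorenz_sweep_bytes(size_t n, size_t elem_size) {
    return 2 * n * elem_size;
}

/* Returns 1 if `copies` independent sweeps fit in a cache of
   `cache_bytes`, 0 otherwise. Ignores code, stack and other data. */
int sweeps_fit_in_cache(size_t cache_bytes, size_t copies,
                        size_t n, size_t elem_size) {
    return copies * lorenz_sweep_bytes(n, elem_size) <= cache_bytes;
}
```

Under these assumptions a single copy fits in the 1 MB shared L2 of the Pi 4B, while four independent copies would need 2 MB. On the 4650G each copy matches the 512 KB per-core L2 and six copies need 3 MB, well inside the 8 MB shared L3. That is consistent with the graphs, though it is back-of-envelope arithmetic rather than a measurement.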

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sat Oct 01, 2022 7:58 pm

ejolson wrote:
Thu Sep 29, 2022 3:54 am
The dog developer has been running some single-core jobs on a 128-core dual-socket EPYC 7702 server and complaining that, when all the cores in a NUMA node are busy, performance is about three times slower than when only one core is in use.
For what it's worth I ran the pichart test on the system Fido's been complaining about and obtained

Image

While the performance of the prime sieve, merge sort and Lorenz 96 shows little degradation when other cores are busy, the Fourier transform falls to about 30 percent of the original performance.
Last edited by ejolson on Sun Oct 02, 2022 5:15 am, edited 1 time in total.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sun Oct 02, 2022 2:15 am

ejolson wrote:
Thu Sep 29, 2022 3:54 am
When I showed the results to Fido, the canine coder growled something about Amdog's Law and then started chasing a rabbit across the field.
Here are some results from running multiple copies of the serial version of pichart in the Oracle cloud. I was a bit hesitant because the resulting stress might slow down other jobs running in VMs provisioned on the same hardware, but in the name of science here are the results:

Image

This shows a similar pattern to the Epyc and Ryzen processors where the per-core cache is large enough to handle prime sieve, merge sort and Lorenz 96 but not the Fourier transform.

Although the performance loss for the Fourier transform was only 20 percent, I only tested a four-core instance since that’s the one which is free.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Fri Oct 21, 2022 9:00 pm

ejolson wrote:
Mon Nov 19, 2018 7:25 am
I'm still working on polishing the code enough to post it. Currently the code is written using the Cilk parallel programming extensions to the C programming language and compiled with gcc version 6.4. As versions of gcc which support Cilk on ARM architectures are rare, I'm making an OpenMP version as well. Hopefully the performance will not be too different.
There is a new Cilk called OpenCilk

https://www.opencilk.org/

based on clang LLVM. According to what
OpenCilk wrote: Our mission: OpenCilk aims to make it easy for developers to write fast and correct multicore code, for researchers to pioneer technologies to do so, and for educators to teach and students to learn software performance engineering.
Thus, there is an educational aspect to the OpenCilk project.

Interestingly, the new version targets x86 as well as 64-bit ARM. Maybe test runs of the pichart program using OpenCilk on the Raspberry Pi could be possible. For now I installed OpenCilk on a Ryzen 4650G and obtained

Code: Select all

$ ./pichart-opencilk
pichart -- Raspberry Pi Performance OPENCILK version 37

Prime Sieve          P=14630843 Workers=12 Sec=0.07371 Mops=12675.7
Merge Sort           N=16777216 Workers=24 Sec=0.11233 Mops=3584.57
Fourier Transform    N=4194304 Workers=24 Sec=0.106151 Mflops=4346.38
Lorenz 96            N=32768 K=16384 Workers=24 Sec=0.0403074 Mflops=79916.4

My Computer has Raspberry Pi ratio=314.728
Making pie charts...done.
Compared to the gcc OpenMP results reported in

viewtopic.php?p=1992120#p1992120

it would appear OpenCilk is about 35 percent faster on average for these tests. I wonder if the Pi will experience a similar increase in performance.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Fri Oct 28, 2022 4:41 am

ejolson wrote:
Fri Oct 21, 2022 9:00 pm
it would appear OpenCilk is about 35 percent faster on average for these tests. I wonder if the Pi will experience a similar increase in performance.
The download has been updated to version 37 which supports OpenCilk.

Given how much faster OpenCilk ran, I also created a literal translation of the Pi pie chart program into the Go language. Though Go is slower, it appears the relative performance of the different Raspberry Pi models is similar.

Image

The source codes for the Go version as well as the new C version are now available via their respective links from the first post of this thread.

jahboater
Posts: 8370
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: A Pi Pie Chart

Fri Oct 28, 2022 6:54 pm

By the way, I ran pichart-serial on a very early Pi1 Revision 1 256MB running at stock speed.
The compiler is the latest GCC 12.2 (don't ask me how long it took to build GCC on such a Pi!)
In case it's of interest, the Pi Ratio is less than one.

Code: Select all

pi@pione:~/pichart-37 $ pichart-serial                        
pichart -- Raspberry Pi Performance Serial version 37

Prime Sieve          P=14630843 Workers=1 Sec=15.8008 Mops=59.1316
Merge Sort           N=16777216 Workers=1 Sec=23.4843 Mops=17.1457
Fourier Transform    N=4194304 Workers=1 Sec=20.469 Mflops=22.5401
Lorenz 96            N=32768 K=16384 Workers=1 Sec=57.802 Mflops=55.7286

My Computer has Raspberry Pi ratio=0.943289
Making pie charts...done.
pi@pione:~/pichart-37 $ 

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Fri Oct 28, 2022 9:47 pm

jahboater wrote:
Fri Oct 28, 2022 6:54 pm
By the way, I ran pichart-serial on a very early Pi1 Revision 1 256MB running at stock speed.
The compiler is the latest GCC 12.2 (don't ask me how long it took to build GCC on such a Pi!)
In case it's of interest, the Pi Ratio is less than one.

According to my notes a Pi B+ with gcc 6.3 was used for the baseline timings. I don’t think I’ve changed anything in the C code that would affect the results. I recall, however, there was a regression around gcc version 8.x with merge sort that was not cleared up until version 10. Maybe it’s still a bit slower than before.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sun Oct 30, 2022 8:27 pm

ejolson wrote:
Fri Oct 21, 2022 9:00 pm
it would appear OpenCilk is about 35 percent faster on average for these tests. I wonder if the Pi will experience a similar increase in performance.
After thrashing all night the Pi 4B managed to install OpenCilk. More details are at

viewtopic.php?p=2050313#p2050313

It doesn't work. The output looks like

Code: Select all

$ ./pichart-opencilk 
pichart -- Raspberry Pi Performance OPENCILK version 37

Prime Sieve          Illegal instruction
I couldn't get a stack trace on the Pi, but on the Oracle cloud I obtained

Code: Select all

(gdb) run
Starting program: /x/hipa/ejolson/code/pichart/pichart-37/pichart-opencilk 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
pichart -- Raspberry Pi Performance OPENCILK version 37

Prime Sieve          [New Thread 0xfffff4adf1d0 (LWP 184070)]
[New Thread 0xffffecc9f1d0 (LWP 184071)]
[New Thread 0xffffe4c9f1d0 (LWP 184072)]

Thread 2 "pichart-opencil" received signal SIGILL, Illegal instruction.
[Switching to Thread 0xfffff4adf1d0 (LWP 184070)]
worker_scheduler (w=<optimized out>, w@entry=0xfffff0001000)
    at /x/hipa/ejolson/work/cilk/opencilk/cheetah/runtime/scheduler.c:1614
1614                    start = gettime_fast();
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-189.5.0.1.el8_6.aarch64
(gdb) bt
#0  worker_scheduler (w=<optimized out>, w@entry=0xfffff0001000)
    at /x/hipa/ejolson/work/cilk/opencilk/cheetah/runtime/scheduler.c:1614
#1  0x0000fffff7e4fcf8 in scheduler_thread_proc (arg=<optimized out>)
    at /x/hipa/ejolson/work/cilk/opencilk/cheetah/runtime/scheduler.c:1719
#2  0x0000fffff7c87908 in start_thread () from /lib64/libpthread.so.0
#3  0x0000fffff7ce429c in thread_start () from /lib64/libc.so.6
(gdb)
Interestingly, the code for gettime_fast reads

Code: Select all

static inline __attribute__((always_inline)) uint64_t gettime_fast(void) {
#ifdef APPLE_ARM64
    // __builtin_readcyclecounter triggers "illegal instruction" runtime errors
    // on Apple M1s.
    return clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW);
#else
    return __builtin_readcyclecounter();
#endif // #if APPLE_ARM64
}
According to Fido the problem is clearly not enough ifdef's in the source. While I'm not sure about that, it seems reasonable to try a similar fix on the Pi to see if the illegal instruction error goes away.

ejolson
Posts: 10264
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Sun Oct 30, 2022 9:07 pm

ejolson wrote:
Wed May 26, 2021 6:51 pm
ejolson wrote:
Sun Feb 07, 2021 5:15 am
Interestingly, the single-core speed of the Graviton2 was 38 percent slower than the M1 while the 8-core parallel benchmark was about 7 percent faster.
For some, Big Red refers to the supercomputers at Indiana University

https://kb.iu.edu/d/alde#br200

which are now in the 4th generation. On the other hand, Oracle just announced that free Ampere ARM instances are available in their cloud.

The free tier consists of
  • 4 Ampere Altra ARM cores with 24 GB RAM.
I ran the Pi pichart program and discovered this virtual machine is the equivalent of 107 original Raspberry Pi model B computers.

For reference the output is

Code: Select all

$ ./pichart-openmp -t "4-core Altra" # Free Oracle A1 instance
pichart -- Raspberry Pi Performance OPENMP version 36

Prime Sieve          P=14630843 Workers=4 Sec=0.269277 Mops=3469.76
Merge Sort           N=16777216 Workers=8 Sec=0.432369 Mops=931.272
Fourier Transform    N=4194304 Workers=8 Sec=0.194219 Mflops=2375.53
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.11563 Mflops=27858.1

The 4-core Altra has Raspberry Pi ratio=107.378
Making pie charts...done.
and the resulting Pi chart is

[Image: Pi pie chart for the 4-core Altra]

Since a 4B has Pi ratio 31.7 as seen in

https://www.raspberrypi.org/forums/view ... 4#p1657704

this makes the Oracle 4-core N1-based Altra A1 instance more than 3 times faster on average than the Cortex-A72 instances in the Pi cloud.

https://www.raspberrypi.org/forums/view ... 6&t=279176

Moreover, as the RAM is also three times an 8GB Pi 4B, the dog developer has been trying to convince me that Oracle is, in fact, providing everyone with the equivalent of three Raspberry Pi computers for free. It seems I may have to change the charging algorithm used to bill Fido for use of the Pi cloud. Do you think a free Pico would be enough to retain my only subscriber?
The dog developer suggested patching as

Code: Select all

static inline __attribute__((always_inline)) uint64_t gettime_fast(void) {
#ifdef APPLE_ARM64
    // __builtin_readcyclecounter triggers "illegal instruction" runtime errors
    // on Apple M1s.
    return clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW);
#else
    struct timespec tic_now;
    clock_gettime(CLOCK_MONOTONIC_RAW,&tic_now);
    return tic_now.tv_sec*1000000000ull+tic_now.tv_nsec;
//    return __builtin_readcyclecounter();
#endif // #if APPLE_ARM64
}
and now the Oracle cloud reports

Code: Select all

$ ./pichart-opencilk 
pichart -- Raspberry Pi Performance OPENCILK version 37

Prime Sieve          P=14630843 Workers=4 Sec=0.252489 Mops=3700.47
Merge Sort           N=16777216 Workers=4 Sec=0.282035 Mops=1427.67
Fourier Transform    N=4194304 Workers=4 Sec=0.17842 Mflops=2585.88
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.208142 Mflops=15476.1

My Computer has Raspberry Pi ratio=107.073
Making pie charts...done.
Every computation except the Lorenz 96 simulation was faster compared to the previous results using OpenMP and gcc. However, unlike the Ryzen 4650G, the slowdown with Lorenz was so significant as to cancel any gains made to the Pi ratio. For some reason clang always seems slow on Lorenz 96.
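
As a rough check of that comparison, the per-test speedup can be computed straight from the Sec values in the two Altra runs quoted above. This is only a sketch: the names timing, altra_runs and speedup are made up for illustration and are not part of pichart. Note also that the OpenMP run was version 36 and used 8 workers for merge sort and the Fourier transform, so the two runs are not perfectly like for like.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark row: wall-clock seconds copied from the two runs above. */
struct timing {
    const char *name;
    double openmp_sec;   /* gcc + OpenMP, pichart version 36 */
    double opencilk_sec; /* clang + OpenCilk, pichart version 37 */
};

/* Hypothetical table; the numbers are the Sec= fields from the outputs. */
static const struct timing altra_runs[] = {
    {"Prime Sieve",       0.269277, 0.252489},
    {"Merge Sort",        0.432369, 0.282035},
    {"Fourier Transform", 0.194219, 0.17842},
    {"Lorenz 96",         0.11563,  0.208142},
};

/* Speedup of OpenCilk over OpenMP; a value above 1 means OpenCilk was faster. */
static double speedup(const struct timing *t) {
    assert(t != NULL && t->opencilk_sec > 0.0);
    return t->openmp_sec / t->opencilk_sec;
}
```

Calling speedup on each row gives roughly 1.07, 1.53, 1.09 and 0.56. Merge sort gained the most, while Lorenz 96 took about 1.8 times as long, which is consistent with the Pi ratio ending up essentially unchanged.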

I wonder if Fido's patch will work on the Pi 4B. Does anyone know a better replacement for clock_gettime_nsec_np? I'm not very happy with the multiply.
