No, I have not looked at any of this, just did the charts with the default compiler on the pi itself. There's probably some mileage in using the latest compilers.
ejolson wrote: ↑Tue Jun 25, 2019 4:36 pm
My understanding is that the task parallel constructs in modern OpenMP implementations fork a pool of threads at the beginning of the run (which isn't measured by the timing routines) and then use either work stealing or some sort of grand central dispatch to assign parcels of work to the threads in the pool. Maybe the cost of Linux thread synchronization primitives goes up when LPAE is enabled; however, it is strange that the serial version also runs slower.
jamesh wrote: ↑Tue Jun 25, 2019 4:18 pm
From Eben when I showed him the results for the merge, "Could be expensive line moves between L1s, but I suspect it's actually measuring the cost of forking processes in LPAE."
ejolson wrote: ↑Tue Jun 25, 2019 3:40 pm
Thanks for posting. Your results are similar, though perhaps slightly faster, compared to the graphs that James uploaded which I converted to portable network graphics:
Here "My Computer" refers to the new Raspberry Pi 4B.
Compared to the original Pi B the Pi 4B is 26.8474 times faster. That's about double the performance of the 3B+ overall; however, I find it surprising that the merge sort timings are actually slower than the 3B+. I wonder if this result is related to the compiler version or an optimization setting. It would be nice to find a set of compiler flags for which the merge-sort timings were faster.
Which is why some of the other Pie charts were comparing LPAE kernels on the Pi3B+.
I wonder if this is a gcc version 8.x compiler regression. Have you tried any compiler flags to remedy the situation?
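The point about the thread pool being created before the timer starts can be illustrated with a small sketch. This is Python rather than OpenMP, and it is only an analogy: the function names and the `parcels` parameter are made up for illustration, and the pool hands out fixed slices instead of doing real work stealing.

```python
import heapq
import time
from concurrent.futures import ThreadPoolExecutor


def parallel_merge_sort(data, pool, parcels=8):
    """Sort `data` by sorting up to `parcels` slices in the pool, then merging."""
    if not data:
        return []
    step = max(1, -(-len(data) // parcels))  # ceiling division
    slices = [data[i:i + step] for i in range(0, len(data), step)]
    # Each slice is one parcel of work handed to a thread in the existing pool.
    sorted_slices = list(pool.map(sorted, slices))
    # A k-way merge combines the sorted parcels into one sorted list.
    return list(heapq.merge(*sorted_slices))


def timed_sort(data, workers=4):
    """Return (sorted list, seconds), excluding pool start-up from the timing."""
    # The pool exists before the clock starts, so the cost of creating
    # threads is not part of the measured time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        result = parallel_merge_sort(data, pool)
        elapsed = time.perf_counter() - start
    return result, elapsed
```

If the start-up cost were the problem, moving the `perf_counter()` call above the `with` line would be the way to see it in a sketch like this.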
Re: A Pi Pie Chart
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.
Re: A Pi Pie Chart
jahboater wrote: ↑Thu Jun 27, 2019 6:41 pm
Here it is (4GB version if it makes any difference).
Code: Select all
pi@pi4:~/pichart-30 $ ./pichart-openmp
pichart -- Raspberry Pi Performance OPENMP version 30
Prime Sieve P=14630843 Workers=4 Sec=0.514499 Mops=1815.99
Merge Sort N=16777216 Workers=8 Sec=1.07414 Mops=374.861
Fourier Transform N=4194304 Workers=8 Sec=1.77694 Mflops=259.645
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.598412 Mflops=5382.96
My Computer has Raspberry Pi ratio=27.7325
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial
pichart -- Raspberry Pi Performance Serial version 30
Prime Sieve P=14630843 Workers=2 Sec=2.0762 Mops=450.018
Merge Sort N=16777216 Workers=2 Sec=3.94588 Mops=102.044
Fourier Transform N=4194304 Workers=2 Sec=2.95565 Mflops=156.099
Lorenz 96 N=32768 K=16384 Workers=1 Sec=2.1619 Mflops=1490
My Computer has Raspberry Pi ratio=9.027
Making pie charts...done.
pi@pi4:~/pichart-30 $
pi@pi4:~/pichart-30 $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/arm-linux-gnueabihf/9.1.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../configure --enable-languages=c,d,c++,fortran --with-cpu=cortex-a72 --with-fpu=neon-fp-armv8 --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 9.1.0 (GCC)
pi@pi4:~/pichart-30 $
A summary of the merge sort timings for the Pi 4B and 3B+ is
Code: Select all
Merge Sort Mops (larger is better)
          Serial   OpenMP  Compiler
Pi 3B+   114.622  362.529  gcc 6.4
Pi 4B    102.044  374.861  gcc 9.1
While the OpenMP speeds now show the Pi 4B to be about 3.4% faster than the Pi 3B+ on the parallel merge sort, the single-core speeds for the serial version are still slower.
Is it possible you compiled with -mtune=native -march=native as those flags are in the Makefile and that they reset the --with-cpu and --with-fpu settings to some randomly wrong thing?
Have you verified that no throttling occurred?
Last edited by ejolson on Fri Jun 28, 2019 9:23 pm, edited 1 time in total.
Re: A Pi Pie Chart
Yes, I used your Makefile as-is. Recent versions of GCC get the "native" types right.
To check I recompiled with
-mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8
and compared the binary with "cmp", and they were identical.
Of course - using "vcgencmd get_throttled".
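For anyone who wants to script the same check, here is a minimal sketch of a byte-for-byte comparison like the one "cmp" does. The function name is just for illustration; `filecmp.cmp` with `shallow=False` is the standard-library call that compares file contents.

```python
import filecmp


def binaries_identical(path_a, path_b):
    """Return True when both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Calling `binaries_identical("pichart-openmp", "xopenmp")` would correspond to the cmp run in the listing, where no output means no difference.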
Code: Select all
pi@pi4:~/pichart-30 $ make
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
pi@pi4:~/pichart-30 $ cmp pichart-openmp xopenmp
pi@pi4:~/pichart-30 $
pi@pi4:~/pichart-30 $ ./pichart-openmp
pichart -- Raspberry Pi Performance OPENMP version 30
Prime Sieve P=14630843 Workers=4 Sec=0.515054 Mops=1814.04
Merge Sort N=16777216 Workers=8 Sec=1.07105 Mops=375.942
Fourier Transform N=4194304 Workers=8 Sec=1.79254 Mflops=257.386
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.598754 Mflops=5379.88
My Computer has Raspberry Pi ratio=27.6805
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial
pichart -- Raspberry Pi Performance Serial version 30
Prime Sieve P=14630843 Workers=2 Sec=2.07626 Mops=450.005
Merge Sort N=16777216 Workers=2 Sec=3.94452 Mops=102.079
Fourier Transform N=4194304 Workers=2 Sec=2.92829 Mflops=157.557
Lorenz 96 N=32768 K=16384 Workers=1 Sec=2.16188 Mflops=1490.01
My Computer has Raspberry Pi ratio=9.04874
Making pie charts...done.
pi@pi4:~/pichart-30 $
pi@pi4:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
pi@pi4:~/pichart-30 $
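The "throttled=0x0" line is a bit mask, so a non-zero value can be decoded rather than eyeballed. The sketch below assumes the bit meanings as I understand them from the Raspberry Pi documentation for vcgencmd; treat the table as an assumption and check it against the current docs. The function name is made up.

```python
# Bit positions and meanings are an assumption based on the vcgencmd docs.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(line):
    """Turn a line such as 'throttled=0x50005' into a list of flag names."""
    value = int(line.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]
```

An empty list, as for the 0x0 shown above, means none of the listed flags is set.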
Code: Select all
pi@pi4:~/pichart-30 $ ./pichart-openmp
pichart -- Raspberry Pi Performance OPENMP version 30
Prime Sieve P=14630843 Workers=4 Sec=0.548289 Mops=1704.08
Merge Sort N=16777216 Workers=8 Sec=1.13798 Mops=353.833
Fourier Transform N=4194304 Workers=8 Sec=1.73828 Mflops=265.42
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.628486 Mflops=5125.37
My Computer has Raspberry Pi ratio=26.7226
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial
pichart -- Raspberry Pi Performance Serial version 30
Prime Sieve P=14630843 Workers=1 Sec=2.21255 Mops=422.285
Merge Sort N=16777216 Workers=2 Sec=4.20747 Mops=95.6995
Fourier Transform N=4194304 Workers=2 Sec=2.98104 Mflops=154.769
Lorenz 96 N=32768 K=16384 Workers=1 Sec=2.30599 Mflops=1396.89
My Computer has Raspberry Pi ratio=8.58486
Making pie charts...done.
pi@pi4:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
pi@pi4:~/pichart-30 $ vcgencmd get_config int
arm_freq=1500
audio_pwm_mode=514
config_hdmi_boost=5
core_freq=500
core_freq_min=250
disable_commandline_tags=2
disable_l2cache=1
disable_splash=1
display_hdmi_rotate=-1
display_lcd_rotate=-1
enable_gic=1
force_eeprom_read=1
force_pwm_open=1
framebuffer_depth=16
framebuffer_ignore_alpha=1
framebuffer_swap=1
gpu_freq=500
gpu_freq_min=500
init_uart_clock=0x2dc6c00
lcd_framerate=60
max_framebuffers=1
pause_burst_frames=1
program_serial_random=1
hdmi_force_cec_address:0=65535
hdmi_force_cec_address:1=65535
hdmi_pixel_freq_limit:0=0x11e1a300
hdmi_pixel_freq_limit:1=0x11e1a300
pi@pi4:~/pichart-30 $
Re: A Pi Pie Chart
Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).
So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610
Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
Re: A Pi Pie Chart
Thanks for running those tests. The results make me more confident in the engineering behind the Pi 4B and the Cortex-A72. I'm sorry it didn't help with your problem.
hvz wrote: ↑Fri Jun 28, 2019 10:52 am
So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610
Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
The fact that merge sort compiled with gcc version 8.3 has only 60% of the performance of the same code compiled with gcc version 6.3 makes one imagine too much time has been spent optimizing 64-bit at the expense of 32-bit targets. Maybe the regression is due to the mitigation of Spectre-like side-channel vulnerabilities that might leak information about the numbers being sorted when the test is running. Maybe in-kernel side-channel information leakage mitigations are responsible for the slowdown you are experiencing.
It's not all bad. Even though the performance of merge sort decreased, the performance of prime sieve and the Lorenz 96 simulation increased. Therefore the final Pi ratio didn't decrease as much as might be expected. In summary:
- The Pi 4B Pi ratio is 27.4 with gcc version 6.3.
- The Pi 4B Pi ratio is 26.6 with gcc version 8.3.
- The Pi 4B Pi ratio is 26.7 with gcc version 9.1.
- The Pi 4B Pi ratio is 30.6 using best compiler for each test.
Re: A Pi Pie Chart
Also interesting: Has anyone done a 64 bit test? It's too bad that Raspbian is still 32 bit, but there are other images that are 64 bits. On my Pi 3B I saw (on one specific program, and I think we were using different gcc versions too) a 10% increase in performance vs 32 bit.
I'll do some tests next week.
(Btw the other problem is solved, was a sound card speed issue in the older image, it apparently ran at a lower sample rate than selected).
Re: A Pi Pie Chart
More numbers, this time with Ubuntu Mate 64 bit, Pi 3B (because there's no Pi 4 version yet). And unfortunately with gcc 7.4, because that's the one that's delivered with Ubuntu Mate...
hvz wrote: ↑Fri Jun 28, 2019 10:52 am
Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).
So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610
Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
Prime Sieve: 632 (30% faster than 32 bit, note that gcc 8.3 was already 19% faster than 6.3 so it could be that, leaving about 9% difference assuming that 7.4 was already this fast)
Merge Sort: 262 (Can't really compare this one, looks like it already broke in gcc 7.4)
Fourier: 167 (12% faster than 32 bit)
Lorenz 96: 1292 (42% faster than 32 bit, but gcc 8.3 was 25% faster than 6.3, leaving about 13% difference assuming that 7.4 was already this fast).
My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
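The reasoning above, separating the compiler gain from the 64-bit gain, can be written out as a small sketch. It assumes, as the post does, that gcc 7.4 is already as fast as gcc 8.3, and it divides the ratios instead of subtracting percentages, so the results may differ by a point or two from the rounded figures quoted. The function names are only for illustration.

```python
def percent_faster(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0


def residual_percent(total_new, total_old, compiler_new, compiler_old):
    """Gain left over after dividing out the compiler-only speedup."""
    total_ratio = total_new / total_old
    compiler_ratio = compiler_new / compiler_old
    return (total_ratio / compiler_ratio - 1.0) * 100.0


# Prime Sieve: 64-bit Pi 3B (632) vs 32-bit Pi 3B (483);
# compiler effect estimated from the Pi 4 runs, gcc 8.3 (1690) vs 6.3 (1411).
sieve_total = percent_faster(632, 483)
sieve_residual = residual_percent(632, 483, 1690, 1411)

# Lorenz 96: 1292 vs 905, compiler effect from 5747 vs 4610.
lorenz_total = percent_faster(1292, 905)
lorenz_residual = residual_percent(1292, 905, 5747, 4610)

if __name__ == "__main__":
    print(f"Prime Sieve: total {sieve_total:.1f}%, residual {sieve_residual:.1f}%")
    print(f"Lorenz 96:   total {lorenz_total:.1f}%, residual {lorenz_residual:.1f}%")
```

The residuals are only as good as the assumption that the compiler effect measured on the Pi 4 carries over to the Pi 3B.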
Re: A Pi Pie Chart
Thanks for the report. It's good to know that running pichart using 64-bit ARM doesn't make the surprising difference it does with sysbench. In a way I share your sentiment about only being interested in how fast 64-bit affects the software you wrote yourself. The only difference is that pichart is my software.
hvz wrote: ↑Mon Jul 01, 2019 3:57 pm
My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
I'm somewhat disappointed that switching to 64-bit didn't solve the performance problems with newer compilers and merge sort. From this post it looks like the Pi 4B will soon run a 64-bit version of Gentoo Linux. I wonder if there is anything that can be done with the current C code to make merge sort run faster with the newer versions of gcc.
Re: A Pi Pie Chart
Sakaki has also done a dual 32/64 nspawn version of Raspbian that works on the 3B+.
Not sure if that would give a difference between 32 and 64bit .
https://github.com/sakaki-/raspbian-nspawn-64
Will be interesting once Gentoo64 is compiled for A72 to compare that against the A53 code on a Pi4.
How to tune OS's for A72 cores? Going to need benchmarks..
Wonder how the Pi4 now compares against those other SBC's.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges
Re: A Pi Pie Chart
Quick first test at the desktop with temps being displayed in the top corner.
(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping a WebGL2 aquarium for over an hour, room temp probably 24ish C)
1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.
Buster Raspbian as of today.
GCC 8.3.0-6+rpi1
OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9
Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2
Re: A Pi Pie Chart
These seem to be the best scores yet for a single run of pichart on a Raspberry Pi 4B. For the record, there is a nice comparison to a Rock64 here which shows that single-board computer has roughly the same performance as the Pi 3B+. Since both machines use a quad-core Cortex-A53 processor this is not unexpected. However, I still find it interesting.
bensimmo wrote: ↑Mon Jul 15, 2019 3:31 pm
Quick first test at the desktop with temps being displayed in the top corner.
(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping an webgl2 aquarium for over an hour, room temp probably 24ishC)
1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.
Buster Raspbian as of today.
GCC 8.3.0-6+rpi1
OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9
Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2
Re: A Pi Pie Chart
Just note the overclock though. I assume most boards will run at it; it's the one Tom's Hardware used in their announcement and I just copied it straight off. It's a 17% frequency increase.
It'll probably go faster, as it's not sweating with just a gentle fan blowing over it.
No doubt we'll see these faster speeds if they build in active cooling solutions in, say, a + board in the future. The room is there in the SoC.
Re: A Pi Pie Chart
I clicked with my mouse and created an Amazon EC2 instance with 4 Graviton processors running 64-bit ARM Ubuntu Linux. For reference this is an a1.xlarge instance with 8GB RAM that costs US$ 0.102 per hour on demand to run. I ran the Pi pie-chart program using gcc versions 6.5, 7.4 and 8.3. With
CFLAGS=-march=native -mtune=native -O3 -ffast-math
the best results were obtained with version 8.3 as follows:
Code: Select all
$ ./pichart-openmp ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance OPENMP version 30
Prime Sieve P=14630843 Workers=4 Sec=0.395686 Mops=2361.29
Merge Sort N=16777216 Workers=8 Sec=0.725249 Mops=555.193
Fourier Transform N=4194304 Workers=8 Sec=0.650543 Mflops=709.213
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.298065 Mflops=10807.1
My Computer has Raspberry Pi ratio=49.9934
$ ./pichart-serial ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance Serial version 30
Prime Sieve P=14630843 Workers=2 Sec=1.57494 Mops=593.246
Merge Sort N=16777216 Workers=2 Sec=2.85815 Mops=140.879
Fourier Transform N=4194304 Workers=2 Sec=1.95764 Mflops=235.679
Lorenz 96 N=32768 K=16384 Workers=1 Sec=1.11644 Mflops=2885.27
My Computer has Raspberry Pi ratio=13.7101
Making pie charts...done.
In order to figure out how much per hour to charge my little brother for using the Pi 4B, I calculated 0.102 * 28 / 50 to obtain US$ 0.05712 per hour.
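The little-brother pricing is just a scaling of the cloud price by the ratio of the two Pi ratios. A minimal sketch, using the rounded ratios 28 and 50 from the post (the function and parameter names are made up for illustration):

```python
def equivalent_hourly_price(cloud_price, cloud_ratio, local_ratio):
    """Scale an hourly cloud price by relative pichart performance."""
    return cloud_price * local_ratio / cloud_ratio


if __name__ == "__main__":
    # a1.xlarge at US$ 0.102 per hour with Pi ratio ~50; Pi 4B with Pi ratio ~28.
    price = equivalent_hourly_price(0.102, cloud_ratio=50, local_ratio=28)
    print(f"US$ {price:.5f} per hour")
```

This treats the Pi ratio as a fair measure of what an hour of compute is worth, which is of course the joke.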
Re: A Pi Pie Chart
To see how far things have progressed since I first started using Linux, I ran the Pi Pie Chart program on a 66MHz 486DX2. Since there was only 32MB of RAM on that machine, I divided the array sizes by 256 for the prime sieve and by 128 for the merge sort and Fourier transforms. Because the machine was slow, I also divided the number of time steps for the Lorenz 96 simulation by 256. Other modifications were made so the program would compile with gcc version 2.7.2.3. The results were
Code: Select all
$ ./pichart-serial
pichart -- Raspberry Pi Performance Serial version L31
Prime Sieve P=82025 Workers=2 Sec=2.29954 Mops=1.42527
Merge Sort N=131072 Workers=1 Sec=1.31166 Mops=1.69878
Fourier Transform N=32768 Workers=1 Sec=1.47467 Mflops=1.66655
Lorenz 96 N=128 K=16384 Workers=1 Sec=2.69302 Mflops=4.67242
My Computer has Raspberry Pi ratio=0.0585114
Making pie charts...done.
This implies the original Raspberry Pi is about 17 times faster than the 486 PC and the Pi 4B is 478 times faster.
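The two factors quoted in this post follow from the reported Pi ratio. The sketch below assumes the original Raspberry Pi has a ratio of 1 by definition and that the Pi 4B figure was rounded to 28; neither is stated in the post, so both are assumptions.

```python
def times_faster(ratio_fast, ratio_slow):
    """How many times faster one machine is, given two Pi ratios."""
    return ratio_fast / ratio_slow


RATIO_486 = 0.0585114        # from the run above
RATIO_ORIGINAL_PI = 1.0      # assumption: the baseline machine
RATIO_PI_4B = 28.0           # assumption: rounded Pi 4B ratio

if __name__ == "__main__":
    print(f"Original Pi vs 486: {times_faster(RATIO_ORIGINAL_PI, RATIO_486):.1f}x")
    print(f"Pi 4B vs 486:       {times_faster(RATIO_PI_4B, RATIO_486):.1f}x")
```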
Re: A Pi Pie Chart
I spent a little time polishing the Pi Pie Chart program and have fixed the scalable vector graphics output routine so the resulting pichart.svg file can be viewed in a browser. For good measure I then reran the program on the Raspberry Pi 4B with the CPU governor set to performance as
Code: Select all
# echo performance >/sys/devices/system/cpu/cpufreq/policy0/scaling_governor
Compiling with gcc version 8.3.0 on Raspbian obtained
Code: Select all
$ ./pichart-openmp -t"Pi 4B gcc"
pichart -- Raspberry Pi Performance OPENMP version 32
Prime Sieve P=14630843 Workers=4 Sec=0.548457 Mops=1703.56
Merge Sort N=16777216 Workers=8 Sec=1.18613 Mops=339.467
Fourier Transform N=4194304 Workers=4 Sec=1.7113 Mflops=269.605
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.646306 Mflops=4984.05
The Pi 4B has Raspberry Pi ratio=26.3638
Making pie charts...done.
$ ./pichart-serial -t"Pi 4B gcc"
pichart -- Raspberry Pi Performance Serial version 32
Prime Sieve P=14630843 Workers=1 Sec=2.34654 Mops=398.172
Merge Sort N=16777216 Workers=2 Sec=4.58568 Mops=87.8066
Fourier Transform N=4194304 Workers=1 Sec=3.25404 Mflops=141.785
Lorenz 96 N=32768 K=16384 Workers=1 Sec=2.30612 Mflops=1396.82
The Pi 4B has Raspberry Pi ratio=8.09999
Making pie charts...done.
and then with clang version 9.0.0 on Raspbian obtained
Code: Select all
$ ./pichart-openmp -t"Pi 4B clang"
pichart -- Raspberry Pi Performance OPENMP version 32
Prime Sieve P=14630843 Workers=4 Sec=0.53926 Mops=1732.61
Merge Sort N=16777216 Workers=4 Sec=0.879777 Mops=457.676
Fourier Transform N=4194304 Workers=4 Sec=1.7769 Mflops=259.651
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.692296 Mflops=4652.96
My Computer has Raspberry Pi ratio=27.7803
Making pie charts...done.
$ ./pichart-serial -t"Pi 4B clang"
pichart -- Raspberry Pi Performance Serial version 32
Prime Sieve P=14630843 Workers=1 Sec=2.12951 Mops=438.753
Merge Sort N=16777216 Workers=2 Sec=2.93432 Mops=137.222
Fourier Transform N=4194304 Workers=1 Sec=3.0257 Mflops=152.485
Lorenz 96 N=32768 K=16384 Workers=2 Sec=2.24929 Mflops=1432.11
The Pi 4B clang has Raspberry Pi ratio=9.50832
Making pie charts...done.
Not surprisingly, due to how badly the optimizer in recent versions of gcc works for the merge sort algorithm, the clang runs are a bit faster. The reason, however, to run all those tests was to obtain new Pi charts using the updated output routines.
Please let me know if you have any difficulty displaying the above scalable vector graphics images directly in your browser. The current version of the code may be downloaded at
http://fractal.math.unr.edu/~ejolson/pi ... urrent.tgz
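Before a benchmark run it can be handy to confirm which governor is actually active. Here is a minimal sketch that reads the same sysfs files the echo command in this post writes to. The function name and the `cpufreq_root` parameter are made up; the default path is the one from the post, and other kernels may lay things out differently.

```python
from pathlib import Path


def read_governors(cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Return {policy name: governor} for every policy directory found."""
    governors = {}
    for policy in sorted(Path(cpufreq_root).glob("policy*")):
        governor_file = policy / "scaling_governor"
        if governor_file.is_file():
            governors[policy.name] = governor_file.read_text().strip()
    return governors


if __name__ == "__main__":
    for name, governor in read_governors().items():
        print(f"{name}: {governor}")
```

Reading the files needs no special privileges, unlike writing "performance" into them.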
Re: A Pi Pie Chart
I made a change to the background color of the pie charts to distinguish the parallel performance from the serial and also updated the default systems appearing in the chart to include the clang timings for the Raspberry Pi 4B as a replacement for the 3B.
Here are the results for a 12-core Ryzen 1920X Threadripper:
Code: Select all
$ ./pichart-openmp -t"Ryzen 1920X"
pichart -- Raspberry Pi Performance OPENMP version 33
Prime Sieve P=14630843 Workers=48 Sec=0.0717278 Mops=13026
Merge Sort N=16777216 Workers=48 Sec=0.100511 Mops=4006.07
Fourier Transform N=4194304 Workers=24 Sec=0.0830656 Mflops=5554.33
Lorenz 96 N=32768 K=16384 Workers=48 Sec=0.0653516 Mflops=49290.7
The Ryzen 1920X has Raspberry Pi ratio=306.99
Making pie charts...done.
$ ./pichart-serial -t"Ryzen 1920X"
pichart -- Raspberry Pi Performance Serial version 33
Prime Sieve P=14630843 Workers=1 Sec=0.762902 Mops=1224.7
Merge Sort N=16777216 Workers=1 Sec=1.57662 Mops=255.39
Fourier Transform N=4194304 Workers=1 Sec=0.553161 Mflops=834.068
Lorenz 96 N=32768 K=16384 Workers=1 Sec=0.194706 Mflops=16544
The Ryzen 1920X has Raspberry Pi ratio=40.4726
Making pie charts...done.
The resulting pie charts are
Re: A Pi Pie Chart
A friend gave me his old desktop computer, an AMD A6-5400K. I decided to make a pie chart to see whether the Pi 4B could be a desktop replacement. The run
Code: Select all
$ ./pichart-openmp -t"A6-5400K"
pichart -- Raspberry Pi Performance OPENMP version 33
Prime Sieve P=14630843 Workers=4 Sec=1.18758 Mops=786.749
Merge Sort N=16777216 Workers=4 Sec=1.33993 Mops=300.504
Fourier Transform N=4194304 Workers=2 Sec=0.969594 Mflops=475.842
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.524484 Mflops=6141.71
The A6-5400K has Raspberry Pi ratio=25.6007
Making pie charts...done.
and resulting chart
indicate that the Pi 4B was faster with the integer tests Prime Sieve and Merge Sort but slower with the floating-point tests Fourier Transform and Lorenz 96. Overall, the Pi 4B was slightly faster, quieter and more power efficient. I guess it would make a good desktop replacement, at least in this case.
Re: A Pi Pie Chart
Woohoo! It looks like gcc version 10.1 fixes the regression with the merge sort on the Pi 4B. The new compiler gives
Code: Select all
$ ./pichart-openmp
pichart -- Raspberry Pi Performance OPENMP version 32
Prime Sieve P=14630843 Workers=4 Sec=0.552507 Mops=1691.07
Merge Sort N=16777216 Workers=8 Sec=0.742115 Mops=542.575
Fourier Transform N=4194304 Workers=8 Sec=1.33157 Mflops=346.487
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.630267 Mflops=5110.89
My Computer has Raspberry Pi ratio=31.7025
Making pie charts...done.
which shows merge sort being about 61 percent faster with version 10.1.
I wonder whether the new version of gcc creates executables which run that much faster on any other architecture or if the improvements are mostly a Pi thing.
Re: A Pi Pie Chart
The Graviton2 processors reviewed at
https://www.anandtech.com/show/15578/cl ... el-and-amd
are now widely available on the EC2 cloud. I decided to make some Pi pie charts for the 4 processor instance which costs US$ 0.154 per hour. I used Ubuntu 20.04LTS with the gcc 9.3.0 compiler. The single-core results were
Code: Select all
$ ./pichart-serial -t Graviton2
pichart -- Raspberry Pi Performance Serial version 33
Prime Sieve P=14630843 Workers=1 Sec=1.26827 Mops=736.695
Merge Sort N=16777216 Workers=2 Sec=2.01905 Mops=199.427
Fourier Transform N=4194304 Workers=2 Sec=0.597464 Mflops=772.22
Lorenz 96 N=32768 K=16384 Workers=1 Sec=0.518345 Mflops=6214.44
The Graviton2 has Raspberry Pi ratio=25.7304
Making pie charts...done.
with the chart

The parallel results were
Code: Select all
$ ./pichart-openmp -t "Graviton2 (4-core)"
pichart -- Raspberry Pi Performance OPENMP version 33
Prime Sieve P=14630843 Workers=4 Sec=0.318541 Mops=2933.15
Merge Sort N=16777216 Workers=8 Sec=0.508454 Mops=791.917
Fourier Transform N=4194304 Workers=4 Sec=0.155877 Mflops=2959.85
Lorenz 96 N=32768 K=16384 Workers=4 Sec=0.136869 Mflops=23535.2
The Graviton2 (4-core) has Raspberry Pi ratio=100.148
Making pie charts...done.
with the chart

With the recent improvements to compiler technology, the Raspberry Pi 4B now has a Pi ratio of 31.7025 as described in the previous post. This means the 4 processor Graviton2 instance is about 3.16 times faster than the 4B. Compared to the original Graviton instances, the Graviton2 is about 2 times faster but the prices are only 1.5 times more.
An updated version of the calculation given in
viewtopic.php?f=63&t=227177&start=100#p1537398
implies the price I can charge my little brother for using the Pi 4B has been reduced from US$ 0.05712 to about US$ 0.04873 per hour.
Woohoo! That's still plenty.
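The comparisons in this post can be checked with a few divisions. The sketch below takes the Pi ratios and hourly prices quoted in the thread; the dictionary layout and the function name are only for illustration, and "Pi ratio per dollar" is my own shorthand, not a figure from the posts.

```python
GRAVITON = {"ratio": 49.9934, "price": 0.102}    # a1.xlarge, gcc 8.3 run
GRAVITON2 = {"ratio": 100.148, "price": 0.154}   # 4-core instance, gcc 9.3 run
PI_4B_RATIO = 31.7025                            # gcc 10.1 run


def ratio_per_dollar(instance):
    """Pi ratio delivered per US dollar of hourly cost."""
    return instance["ratio"] / instance["price"]


speedup = GRAVITON2["ratio"] / GRAVITON["ratio"]          # Graviton2 vs Graviton
price_increase = GRAVITON2["price"] / GRAVITON["price"]   # cost ratio
vs_pi_4b = GRAVITON2["ratio"] / PI_4B_RATIO               # Graviton2 vs Pi 4B
pi_4b_price = GRAVITON2["price"] / vs_pi_4b               # equivalent hourly price

if __name__ == "__main__":
    print(f"Graviton2 vs Graviton: {speedup:.2f}x the speed at {price_increase:.2f}x the price")
    print(f"Graviton2 vs Pi 4B:    {vs_pi_4b:.2f}x")
    print(f"Pi 4B equivalent:      US$ {pi_4b_price:.5f} per hour")
    print(f"Ratio per dollar:      {ratio_per_dollar(GRAVITON):.0f} -> {ratio_per_dollar(GRAVITON2):.0f}")
```

Using the unrounded 3.159 instead of 3.16 moves the last digits of the price slightly, which is why the sketch may not match the post to five decimals.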
Last edited by ejolson on Sun May 31, 2020 4:01 pm, edited 1 time in total.
Re: A Pi Pie Chart
One of those toy power meters that plug into the wall arrived in the mail.
https://www.amazon.com/Poniie-PN1500-El ... 07VPTN8FZ/
Since the pie chart program hasn't been experiencing feature creep and code bloat for a while, I added a new option that allows one to perform a stress test of specified duration for all or any of the four benchmark problems. The archive available from the download link in the first post has been updated.
I took an average of 50 readings of the Pi 4B using both the volt-ampere and watt settings. The results using the new -w10 option were
Code: Select all
                      VA   Watt  Efficiency
Pi 4B Idle          6.04   3.06    0.00 Mops/W
Prime Sieve        10.51   6.01  281.38 Mops/W
Merge Sort          9.75   5.59   97.06 Mops/W
Fourier Transform   8.80   4.96   69.86 Mflops/W
Lorenz 96          12.38   7.45  686.02 Mflops/W
The above measurements were collected with a fan running. The power to run the fan was not included. No heat or power-related throttling occurred. The Pi had a network cable, USB keyboard and monitor connected. WiFi was turned off. Mains voltage was 120V and I used the official power supply. Note that the first time I performed the test, I didn't run the fan and the system went into throttling. While I may have been imagining things, it is possible that hot raspberry pies consume more power just before they throttle.
In all cases power usage remained under 8 watts. I wonder how much electricity the Graviton2 processors consume.
The variation in the measurements is likely due, in part, to non-uniform scheduling efficiency for the parallel work units during some phases of the computation and the fact that the stress test itself runs in a loop that repeatedly initializes memory using a single thread and then runs the parallel computation.
For reference, the data I took (by hand) from the readout on the power meter was
Code: Select all
PSVA PSW MSVA MSW FTVA FTW L96VA L96W
MIN 10.29 5.00 9.49 5.28 8.55 4.68 12.17 7.28
AVG 10.51 6.01 9.75 5.59 8.80 4.96 12.38 7.45
STD 0.15 0.26 0.16 0.17 0.15 0.17 0.14 0.14
MAX 10.77 6.33 10.09 5.93 9.04 5.47 12.62 7.70
DATA 10.46 6.18 9.57 5.83 8.88 5.00 12.49 7.30
10.53 5.62 9.78 5.72 8.82 4.79 12.60 7.50
10.65 5.93 9.76 5.45 8.66 4.80 12.60 7.56
10.36 6.22 9.49 5.78 8.99 5.11 12.31 7.29
10.74 6.18 9.87 5.62 8.55 4.83 12.50 7.30
10.56 6.10 9.51 5.50 8.74 4.75 12.24 7.29
10.61 5.95 9.77 5.44 8.63 5.09 12.32 7.54
10.36 6.09 9.70 5.85 8.61 4.83 12.41 7.52
10.76 5.96 9.62 5.61 8.89 5.10 12.17 7.30
10.38 6.28 9.93 5.41 8.73 4.84 12.56 7.60
10.58 6.03 9.60 5.38 8.77 4.79 12.21 7.58
10.56 5.95 9.85 5.63 8.95 5.15 12.37 7.48
10.77 6.27 9.71 5.45 8.63 4.83 12.38 7.29
10.52 6.00 9.73 5.44 9.03 5.21 12.18 7.30
10.36 5.96 9.93 5.68 8.71 4.79 12.57 7.29
10.69 6.31 9.56 5.44 8.81 4.83 12.17 7.30
10.31 5.95 9.97 5.28 8.90 5.16 12.41 7.40
10.59 6.14 9.65 5.63 8.62 4.70 12.36 7.64
10.43 6.07 9.78 5.51 9.03 4.99 12.21 7.31
10.40 6.32 9.94 5.33 8.67 5.01 12.58 7.37
10.64 5.97 9.52 5.84 8.86 4.82 12.20 7.32
10.29 5.97 9.97 5.53 8.88 5.09 12.46 7.55
10.65 5.96 9.61 5.93 8.66 4.77 12.28 7.62
10.41 6.27 9.76 5.50 9.04 5.16 12.54 7.56
10.47 6.24 9.86 5.58 8.62 4.79 12.22 7.30
10.60 5.97 9.64 5.73 8.90 5.07 12.53 7.32
10.31 5.27 10.09 5.52 8.82 4.87 12.31 7.60
10.67 5.99 9.61 5.45 8.65 5.12 12.34 7.61
10.35 5.96 9.86 5.42 9.03 5.00 12.51 7.34
10.49 6.33 9.83 5.60 8.91 4.81 12.21 7.32
10.57 6.10 9.61 5.80 8.64 4.91 12.55 7.35
10.33 5.96 9.98 5.66 9.00 5.12 12.28 7.59
10.73 6.15 9.57 5.57 8.60 4.80 12.37 7.48
10.35 5.97 9.59 5.45 8.93 5.05 12.47 7.34
10.54 6.25 9.71 5.50 8.73 5.09 12.22 7.60
10.51 5.00 9.68 5.49 8.70 4.68 12.62 7.36
10.31 5.97 10.01 5.81 8.95 5.15 12.25 7.36
10.72 5.96 9.55 5.39 8.60 4.82 12.43 7.66
10.34 6.25 9.97 5.73 9.01 4.83 12.46 7.67
10.58 5.96 9.78 5.48 8.70 5.08 12.23 7.66
10.48 6.00 9.71 5.48 8.79 4.71 12.62 7.65
10.35 6.25 9.95 5.51 8.90 5.08 12.24 7.37
10.70 5.96 9.61 5.83 8.62 5.11 12.49 7.70
10.32 6.06 9.94 5.76 9.01 4.79 12.40 7.39
10.62 6.02 9.69 5.69 8.67 5.07 12.21 7.28
10.44 5.96 9.64 5.83 8.83 5.47 12.44 7.37
10.41 5.96 9.86 5.77 8.91 4.78 12.18 7.66
10.66 5.20 9.56 5.33 8.62 5.11 12.55 7.38
10.33 5.97 10.00 5.79 9.03 5.12 12.20 7.37
10.65 6.27 9.62 5.38 8.68 5.05 12.36 7.39
Re: A Pi Pie Chart
This year the pea plants in the back yard have already grown taller than anytime last summer. The Earth is healing.
While relating this promising fact to the canine coder, there was an interruption. At first it sounded like barking but eventually I understood that since energy efficiency is part of the new normal, subsequent research should focus on developing a PET-on-a-chip that can be used to scale legacy 8-bit code to sub-milliwatt levels. I wonder if a POC could be advantageous in other ways.
Along different lines, I plugged a Ryzen 1920X Threadripper into the power meter and measured an idle power usage of 90 watts with the powersave governor and 100 with performance. Note that this measurement included eight spinning hard disks, that crazy Radeon VII GPU, multiple fans and an NVMe SSD. I then compared the efficiency with the Pi 4B.
Since the Pi 4B has 4 cores, I divided the 1920X into three cpusets each consisting of 4 cores (8 threads), simultaneously ran three copies of the Pi chart program and totaled the output. The results were
Code: Select all
                   Mops/MFlops    Watt   Efficiency
Ryzen 1920X Idle          0.00   100.0     0.00 Mops/W
Prime Sieve           14852.56   243.1    61.09 Mops/W
Merge Sort             4349.82   246.2    17.96 Mops/W
Fourier Transform      7014.84   192.8    36.38 Mflops/W
Lorenz 96             107614.5   265.7   405.02 Mflops/W
Not surprisingly, the efficiency results are lower than those of the Pi. At the same time, since the Pi was running from an SD card, it didn't have to power 8 hard disks or a heavy graphics card.
This brings up an interesting point about using desktop computers for distributed computing projects such as BOINC--the costs of running all the peripherals decrease the energy efficiency. On the other hand, if the desktop is being used for other things anyway, then the power taken by the peripherals is amortized and only the additional power used for BOINC needs to be counted.
In particular, by subtracting the 100 watt idle power one can find the efficiency obtained when scavenging cycles from an already running machine. While better, it astonishingly tells the same story.
Code: Select all
                   Mops/MFlops   Extra   Scavenging
Ryzen 1920X                      Watts   Efficiency
Prime Sieve           14852.56   143.1   103.79 Mops/W
Merge Sort             4349.82   146.2    29.75 Mops/W
Fourier Transform      7014.84    92.8    75.59 Mflops/W
Lorenz 96             107614.5   165.7   649.45 Mflops/W
Deploying a stand-alone Pi 4B provisioned to run only BOINC appears to be more energy efficient than scavenging cycles from an already running desktop computer. It's also possible my power meter toy doesn't accurately measure power in the 3 watt range. For example, there was a noticeable difference in the volt-ampere readings compared to watts for the Pi while these two quantities were essentially the same for the Ryzen.
Of course, if you have the new 8GB model, there is enough memory to scavenge CPU cycles while the Pi is used for something else. In this case the efficiency is
Code: Select all
                   Mops/MFlops   Extra   Scavenging
Pi 4B                            Watts   Efficiency
Prime Sieve            1691.07    3.01   561.82 Mops/W
Merge Sort             542.575    2.59   209.49 Mops/W
Fourier Transform      346.487    1.96   176.78 Mflops/W
Lorenz 96              5110.89    4.45  1148.51 Mflops/W
This result may explain why people are interested in ARM processors both for cloud and high-performance computing. It further makes the 4B attractive for building clusters as well as running microservices in a data center.
For reference, the performance data for the 1920X is
Code: Select all
Prime Sieve P=14630843 Workers=8 Sec=0.188313 Mops=4961.56
Prime Sieve P=14630843 Workers=8 Sec=0.188105 Mops=4967.05
Prime Sieve P=14630843 Workers=8 Sec=0.189752 Mops=4923.95
Merge Sort N=16777216 Workers=16 Sec=0.278942 Mops=1443.5
Merge Sort N=16777216 Workers=16 Sec=0.278015 Mops=1448.32
Merge Sort N=16777216 Workers=16 Sec=0.276168 Mops=1458
Fourier Transform N=4194304 Workers=4 Sec=0.218448 Mflops=2112.06
Fourier Transform N=4194304 Workers=4 Sec=0.185614 Mflops=2485.65
Fourier Transform N=4194304 Workers=8 Sec=0.190876 Mflops=2417.13
Lorenz 96 N=32768 K=16384 Workers=16 Sec=0.0878499 Mflops=36667.4
Lorenz 96 N=32768 K=16384 Workers=16 Sec=0.0861906 Mflops=37373.3
Lorenz 96 N=32768 K=16384 Workers=16 Sec=0.0959447 Mflops=33573.8
After meditating on the meter data, it seems possible that power consumption may have increased a bit during the run of prime sieve due to a cooling fan turning on and may have decreased during the Lorenz 96 dynamical simulation due to heat-related reduced turbo boost.
The data collected from the meter was
Code: Select all
PSW MSW FTW L96W
MIN 239.1 245.0 190.1 258.2
AVG 243.1 246.2 192.8 265.7
STD 2.9 0.6 1.7 4.5
MAX 251.5 247.6 199.2 275.5
DATA 241.2 245.8 190.8 273.4
239.7 245.2 191.5 271.9
240.1 246.5 192.0 272.7
239.9 246.4 193.8 271.8
239.6 246.0 193.7 271.0
240.4 246.2 196.9 271.1
239.7 246.3 199.2 271.2
239.1 245.8 197.0 272.4
239.2 246.5 194.2 274.2
239.5 246.3 192.6 275.5
239.4 246.0 191.7 273.4
239.9 245.7 191.1 268.4
240.8 246.4 193.0 268.3
241.3 246.3 192.5 266.5
241.1 246.0 191.4 267.1
240.9 246.1 192.7 268.2
241.3 246.1 193.1 267.3
242.0 245.7 193.6 267.0
241.8 245.2 193.7 267.5
242.0 245.7 195.3 265.4
242.6 246.2 193.8 266.2
242.4 245.5 191.6 267.5
242.1 245.7 192.8 267.2
242.5 246.8 192.1 264.1
243.1 245.5 191.6 264.7
242.9 245.0 191.0 263.2
242.2 246.0 191.5 262.3
243.0 246.5 191.7 263.3
242.1 245.7 192.5 264.5
241.9 245.8 191.8 265.0
243.1 246.4 191.5 265.4
243.2 246.5 192.2 263.3
242.9 245.6 192.8 261.9
243.4 245.7 192.7 263.5
246.5 246.2 192.3 261.8
246.4 246.6 190.1 261.9
247.7 246.1 191.0 260.4
251.5 247.0 192.3 262.9
248.8 247.6 193.8 262.4
246.1 246.4 193.1 261.7
245.1 245.9 194.4 260.0
245.9 246.8 194.7 263.1
246.0 246.9 195.8 262.5
246.1 246.2 191.3 263.2
246.3 247.3 191.1 260.4
246.0 247.4 192.0 259.9
246.4 246.3 191.3 260.2
246.5 246.3 192.4 261.5
246.3 247.2 192.8 260.3
246.3 247.5 193.0 258.2
Re: A Pi Pie Chart
As mentioned in
viewtopic.php?f=63&t=271121&p=1669795#p1669795
I had an opportunity to run some tests remotely on the newly announced ODROID-C4 single-board computer. Here are the pie chart results
I find it interesting that the C4 is faster for merge sort and the Fourier transform but slower for prime sieve and the Lorenz 96 simulation. This appears to reflect the combination of a slower processor with more bandwidth compared to the 4B.
Based on the Pi ratio, the 4B is about 38 percent faster on average.
For reference, the output was
Code: Select all
$ ./pichart-openmp -t ODROID-C4
pichart -- Raspberry Pi Performance OPENMP version 34
Prime Sieve P=14630843 Workers=4 Sec=0.881419 Mops=1060.03
Merge Sort N=16777216 Workers=8 Sec=0.853603 Mops=471.71
Fourier Transform N=4194304 Workers=4 Sec=1.4914 Mflops=309.356
Lorenz 96 N=32768 K=16384 Workers=4 Sec=1.12385 Mflops=2866.25
The ODROID-C4 has Raspberry Pi ratio=22.9131
Making pie charts...done.
Last edited by ejolson on Fri Jun 12, 2020 3:34 pm, edited 2 times in total.
Re: A Pi Pie Chart
2GHz A55's and DDR4 1.32GHz
just looked as being nosey so popped to their
they have their own benchmarks against the Pi4 too etc.
Re: A Pi Pie Chart
In preparation for working with Raspberry Pi OS images on an x86 server, I've been studying the user-mode QEMU binary emulation described in
https://github.com/sakaki-/gentoo-on-rp ... infmt_misc
It occurred to me, as it has to many other people, that the same technique could be used to run 64-bit AMD x86 binaries on a Raspberry Pi. So, I decided to check the performance of the Pi pie chart program while performing such an emulation. To do this I installed
Code: Select all
# apt-get install qemu-user-static
on the Pi 4B. Then, on my PC, I compiled a statically linked 64-bit x86 hello world executable with the command
Code: Select all
$ gcc -static -O3 -s -o hello hello.c
where hello.c contained the lines
Code: Select all
#include <stdio.h>
int main(){
    printf("Hello World!\n");
    return 0;
}
and copied the binary over to the Pi. Finally, I typed
Code: Select all
$ file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, BuildID[sha1]=b703362341d57ea5a503241b4ba62987a6552981, stripped
$ ./hello ; # Running on the Pi 4B
Hello World!
and thought, this is amazingly simple.
Returning to the PC I downloaded the latest copy of the Pi pie chart program from the link listed in the first post
viewtopic.php?p=1393365#p1393365
changed the Makefile so it read
Code: Select all
CFLAGS=-std=gnu99 -static -O3 -mtune=native -march=native -Wall -s
and typed make.
Things are never quite as simple as one would hope. The output
Code: Select all
$ make
gcc -std=gnu99 -static -O3 -mtune=native -march=native -Wall -s -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -static -O3 -mtune=native -march=native -Wall -s -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/8/libgomp.a(target.o): in function `gomp_target_init':
(.text+0x328): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
indicated a problem with OpenMP.
The serial version seemed good enough to determine the relative speed between emulated and native code, so I copied it over. Then, back on the 4B I obtained
Code: Select all
$ file pichart-serial-x86_64
pichart-serial-x86_64: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, BuildID[sha1]=3304010ef0e97272118f32c94ba74cf9f6f516bd, stripped
$ ./pichart-serial-x86_64 ; # qemu-x86_64 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34
Prime Sieve P=14630843 Workers=1 Sec=25.126 Mops=37.1857
Merge Sort N=16777216 Workers=1 Sec=23.4029 Mops=17.2052
Fourier Transform N=4194304 Workers=1 Sec=47.0241 Mflops=9.81142
Lorenz 96 N=32768 K=16384 Workers=1 Sec=190.244 Mflops=16.9321
My Computer has Raspberry Pi ratio=0.507005
Making pie charts...done.
As indicated by the Pi ratio, emulated x86 code executes at about half the speed of the original Raspberry Pi running native code.
For completeness, and thinking that part of the slowness might come from emulating a 64-bit architecture using the 32-bit Raspberry Pi OS, I again compiled the Pi pie chart program, this time for the 32-bit Intel architecture. The results when running the 32-bit Intel i386 code using QEMU emulation on a Pi 4B were
Code: Select all
$ file pichart-serial-i386
pichart-serial-i386: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=5585668ec69506e97728b7f7cb3d35297e9d4611, stripped
$ ./pichart-serial-i386 ; # qemu-i386 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34
Prime Sieve P=14630843 Workers=1 Sec=10.3452 Mops=90.3148
Merge Sort N=16777216 Workers=1 Sec=18.3665 Mops=21.9233
Fourier Transform N=4194304 Workers=2 Sec=131.468 Mflops=3.5094
Lorenz 96 N=32768 K=16384 Workers=1 Sec=323.73 Mflops=9.95035
My Computer has Raspberry Pi ratio=0.45533
Making pie charts...done.
$ sudo vcgencmd get_throttled
throttled=0x0
Note that the emulated 32-bit code for the prime sieve was much faster, but the floating-point codes for the Fourier transform and Lorenz 96 dynamical simulation were slower. On average, emulating the 32-bit Intel architecture was slightly slower than emulating the 64-bit AMD architecture.
For comparison, runs using native ARM 32-bit and 64-bit binaries were
Code: Select all
$ ./pichart-serial ; # Native 32-bit ARM on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34
Prime Sieve P=14630843 Workers=1 Sec=2.18637 Mops=427.343
Merge Sort N=16777216 Workers=2 Sec=2.43822 Mops=165.142
Fourier Transform N=4194304 Workers=2 Sec=3.01506 Mflops=153.023
Lorenz 96 N=32768 K=16384 Workers=1 Sec=2.30807 Mflops=1395.64
My Computer has Raspberry Pi ratio=9.83861
Making pie charts...done.
and
Code: Select all
$ ./pichart-serial-aarch64 ; # Native 64-bit AARCH64 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34
Prime Sieve P=14630843 Workers=2 Sec=2.3852 Mops=391.719
Merge Sort N=16777216 Workers=1 Sec=4.20305 Mops=95.8003
Fourier Transform N=4194304 Workers=2 Sec=2.92823 Mflops=157.561
Lorenz 96 N=32768 K=16384 Workers=1 Sec=1.84992 Mflops=1741.27
My Computer has Raspberry Pi ratio=8.94451
Making pie charts...done.
In conclusion, the results of using QEMU emulation to run Intel-compatible code on the Pi 4B may be summarized as
Code: Select all
Single-core Pi 4B        Pi Ratio  Percent
Native 32-bit ARM           9.839      100
Native 64-bit AARCH64       8.945       91
Emulated 32-bit i386        0.455      4.6
Emulated 64-bit x86_64      0.507      5.2
Last edited by ejolson on Fri Jul 10, 2020 10:56 pm, edited 3 times in total.
Re: A Pi Pie Chart
Just because I was reading the thread
New processor run (in a laptop)
Code: Select all
root@G3ntleGiraffe:~/pichart-34# ./pichart-openmp -t "Ubuntu WSL2 Win10 i5-9300H"
pichart -- Raspberry Pi Performance OPENMP version 34
Prime Sieve P=14630843 Workers=8 Sec=0.196982 Mops=4743.21
Merge Sort N=16777216 Workers=16 Sec=0.307859 Mops=1307.92
Fourier Transform N=4194304 Workers=8 Sec=0.182756 Mflops=2524.53
Lorenz 96 N=32768 K=16384 Workers=8 Sec=0.0630457 Mflops=51093.5
The Ubuntu WSL2 Win10 i5-9300H has Raspberry Pi ratio=149.345
Making pie charts...done.