Re: Raspberry Pi Benchmarks
The best/easiest things you can do to reduce the temperature on the Pi are to set it on a heat conducting surface, or to orient it so that the board is vertical instead of flat (i.e. lay the case on its side). I had my Pi hit 80 degrees while sitting in an Adafruit PiCase on carpet. I tipped it up on edge (still on the carpet, still in the case) and now I haven't seen it pass 65 degrees (Celsius, of course). The thermal transfer to air is much better with a vertical surface (it causes convective cooling, actually creating a small breeze as the hot air rises past the surface).
It's not just about what kernel and overclock settings you run or what heat sink you put on it. The orientation and environment (even outside the case) make a difference.
-Jon, Aerospace Engineer
Re: Raspberry Pi Benchmarks
Hello there, I'm kind of new to the area of benchmarking, so excuse the lack of knowledge. I ran all of the benchmarks Roy posted on his website on my Raspberry Pi, and then compiled them myself on the Pi. But I have a question: I don't know why, when I ran the Whetstone benchmark without compiling it on the Pi, I get higher results than Roy. He got 390.5 MWIPS at 1000 MHz and I got 400 MWIPS at 1000 MHz. Why is this happening? I would appreciate any answer; I am writing a research paper based on these benchmarks.
Enida Casanova
Barquisimeto, Venezuela.
Re: Raspberry Pi Benchmarks
eniccm wrote: I don't know why, when I ran the Whetstone benchmark without compiling it on the Pi, I get higher results than Roy: he got 390.5 MWIPS at 1000 MHz and I got 400 MWIPS.
Particularly with just one CPU core, speed can vary if the CPU has other things to do. Speed can also vary depending on memory address alignment and on what programs were run before. I just looked at some of my old results at 700 MHz and they varied between 250 and 270 MWIPS. That would suggest 357 to 386 MWIPS at 1000 MHz, indicating that you can't really rely on the MHz claims. If you compare the 1000/700 MHz speeds (ratio 1.43) of all Whetstone tests in my report, the ratio varies between 1.43 and 1.61.
For your project, I suggest that you run the benchmarks a few times.

Re: Raspberry Pi Benchmarks
Thank you very much for the prompt reply!
I have done what you suggested and have run all benchmarks at least 3 times, but I have not found any literature or book that tells me how many executions are needed for statistical confidence, or whether to then take the arithmetic, geometric or harmonic mean. Do you have any suggestions?
PS: my results do not depart greatly from those you obtained.
Enida Casanova
Barquisimeto, Venezuela.
Re: Raspberry Pi Benchmarks
As the benchmarks generally measure speeds, the harmonic mean would seem to be appropriate. Then, there can always be exceptional external events that lead to unrepresentative conclusions. For statistical confidence, you would probably have to quote such as 95 percentiles, requiring hundreds of measurements, and that is not on. Traditionally, on asking a supplier to provide benchmark results, they would supply best results. I suppose that I quote typical maximum speeds, or a range of results if they have significant regular variation.
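For what it is worth, the harmonic mean of n speed results is just n divided by the sum of their reciprocals; a minimal sketch in C (illustrative only, not from the benchmarks):
Code: Select all
/* Harmonic mean of n speed results, e.g. MWIPS from repeated runs.
   Appropriate when averaging rates rather than times. */
double harmonic_mean(const double *speed, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += 1.0 / speed[i];
    return (double)n / sum;
}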
Re: Raspberry Pi Benchmarks
Raspberry Pi 2 Performance
I have just started playing with my new Raspberry Pi 2, on which I will be running all my benchmarks, hopefully starting next week. As a sweetener, I thought that I should demonstrate MP performance.
MP-MFLOPS arithmetic operations executed are of the form x = (x + a) * b - (x + c) * d + (x + e) * f with 2 or 32 operations per input data word. Array sizes used are 12.8 KB, 128 KB and 12.8 MB, to test with data in L1 cache, L2 cache and RAM. Each of 1, 2, 4 and 8 threads use the same calculations but accessing different segments of the data.
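As an indication of the threading arrangement, here is a minimal sketch (illustrative constants and names, with the 2 operations per word case reduced to one add and one multiply; not the actual benchmark source):
Code: Select all
/* Each thread runs the same calculation over its own segment of a
   shared single precision array, MP-MFLOPS style. Compile with
   gcc -O3 ... -lpthread */
#include <pthread.h>
#include <stdio.h>

#define WORDS  3200            /* 12.8 KB of floats - L1 cache sized */
#define PASSES 100000

static float x[WORDS];
typedef struct { int first, last; } seg_t;

static void *worker(void *arg)
{
    seg_t *s = (seg_t *)arg;
    for (int p = 0; p < PASSES; p++)
        for (int i = s->first; i < s->last; i++)
            x[i] = (x[i] + 0.000020f) * 0.999980f;  /* 2 FP ops per word */
    return NULL;
}

int main(void)
{
    enum { NT = 4 };           /* 1, 2, 4 or 8 in the benchmark */
    pthread_t tid[NT];
    seg_t seg[NT];
    for (int i = 0; i < WORDS; i++) x[i] = 0.1f;
    for (int t = 0; t < NT; t++) {
        seg[t].first = t * WORDS / NT;
        seg[t].last  = (t + 1) * WORDS / NT;
        pthread_create(&tid[t], NULL, worker, &seg[t]);
    }
    for (int t = 0; t < NT; t++) pthread_join(tid[t], NULL);
    printf("check %f\n", x[0]); /* use the data so loops are kept */
    return 0;
}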
Results below demonstrate gains in line with the number of cores used, and performance gains of 8.3 to 12.2 times the original RPi speed.
Code: Select all
Raspberry Pi 2 900 MHz
Features: half thumb fastmult vfp edsp neon vfpv3
tls vfpv4 idiva idivt vfpd32 lpae evtstrm
MP-MFLOPS Linux/ARM v1.0 Sun Feb 8 10:37:43 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 102 147 128 407 406 390
2T 295 289 250 814 810 778
4T 418 554 360 1597 1612 1520
8T 488 450 378 1459 1548 1436
#####################################################
Raspberry Pi 700 MHz
Features: swp half thumb fastmult vfp edsp java tls
MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 43 33 31 191 170 161
2T 44 42 31 192 174 160
4T 44 43 31 192 176 159
8T 43 51 31 192 184 160
Re: Raspberry Pi Benchmarks
I have implemented a DVFS mechanism in a bash script, with governors. I'm trying to measure the run-time power consumption of the Pi, or of its processor. How should I do it?
Re: Raspberry Pi Benchmarks
misal sanjay wrote: I'm trying to measure the run-time power consumption of the Pi, or of its processor. How should I do it?
Sorry, I can't help. The nearest I have been is measuring CPU MHz and temperature:
http://www.roylongbottom.org.uk/Raspber ... m#anchor28
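For reference, both readings come from standard Linux sysfs files; a minimal sketch in C (illustrative, not the monitoring program itself):
Code: Select all
/* Read CPU clock and temperature from the usual Raspbian sysfs paths. */
#include <stdio.h>

static long read_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%ld", &v); fclose(f); }
    return v;
}

int main(void)
{
    long khz = read_long("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    long mC  = read_long("/sys/class/thermal/thermal_zone0/temp"); /* millidegrees */
    printf("%ld scaling MHz, temp=%.1f'C\n", khz / 1000, mC / 1000.0);
    return 0;
}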
Googling for “measure power consumption of raspberry pi” seems to suggest that external meters have to be used.
Re: Raspberry Pi Benchmarks
Hi Roy,
Have you done any benchmarks of Thumb2 on the Pi2? It seems to give me about 25% reduction in code size together with a slight increase in speed (perhaps because more instructions fit in the I-cache).
If you have not, just adding "-mthumb" is enough, and the program runs as before - it's a complete instruction set (unlike Thumb 1).
Re: Raspberry Pi Benchmarks
jahboater wrote: Have you done any benchmarks of Thumb2 on the Pi2? It seems to give me about a 25% reduction in code size together with a slight increase in speed.
I just tried it with the last results posted for MP-MFLOPS, compiled with:
gcc mpmflops.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -lpthread -o MP-MFLOPSPiA7
Adding -mthumb produced a slightly smaller file that was slightly slower. Maybe it would be faster with an integer benchmark.
Later: the Dhrystone benchmark, 17.6 KB and 1649 VAX MIPS, from
gcc dhry_1.c dhry_2.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -o dhrystonePiA7
Adding -mthumb: 16.9 KB and 1630 VAX MIPS.
Re: Raspberry Pi Benchmarks
Yes, I guess the short Thumb instructions are all integer. I think the VFP/NEON floating point instructions remain as they are.
Incidentally, when running a benchmark I increase the priority with
"sudo nice --20" and make sure at least one core is free to handle interrupts.
Thanks
Re: Raspberry Pi Benchmarks
I know this subject is a bit passé, but I thought I would report on Einstein@Home performance. E@H benchmarks each new CPU before sending work to it. It rates my Pi 3 at 748/2461 floating point/integer operations per second, compared with 441/1695 respectively for a Pi 2. The measured CPU temperature of my Pi 3 shoots up to 82 degrees and stays there (regulated by throttling, no doubt). Given another report that measurement by either IR or thermocouple probe exceeds this considerably, I'm not sure what the actual temperature would be. I am not overclocking, and I don't have heat sinks installed. The Pi is installed in a closed box (a PiDP-8 enclosure). I'm going to let it run forever, or until it fails.
Re: Raspberry Pi Benchmarks
NeilAlexanderHiggins wrote: E@H rates my Pi 3 at 748/2461 floating point/integer operations per second, compared with 441/1695 respectively for a Pi 2. The measured CPU temperature of my Pi 3 shoots up to 82 degrees and stays there.
I am currently running all my benchmarks on the RPi 3. Many have been run and reported on by others, so I will shortly be including a summary of results here.
I started a while ago, initially concentrating on my new OpenGL GLUT benchmark. This included stress testing, measuring temperatures and CPU MHz at the same time. See the following thread, which includes examples of throttling and display failures. My RPi 3 has a self-adhesive heatsink, and that made little difference.
viewtopic.php?f=68&t=145374
Re: Raspberry Pi Benchmarks
Raspberry Pi 3 Benchmarks
I am currently running all my benchmarks on my Raspberry Pi 3. Some have already been run by others, with reports in various places. My results can be found via the following link, and brief summaries of the gcc 4.8 compiled tests are provided here, including comparisons with the 900 MHz Raspberry Pi 2. For those interested in historic comparisons, links are provided to my results on Windows/Linux PCs, Android devices and RPis, plus original data starting in the 1970s/80s.
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
The Classic Benchmarks are the first programs that set standards of performance for computers in the 1970s and 1980s. They are Whetstone, Dhrystone, Linpack and Livermore Loops.
Whetstone - comprises eight tests measuring speeds of floating point, integers and mathematical functions, efficient compilation of the latter often determining the overall rating in MWIPS.
In the various areas, average RPi 3 speed was 40% to 47% faster than the RPi 2. Floating point and integer tests were faster than a 3 GHz Pentium 4. For detailed results on Windows and Linux based PCs, Android devices and RPis, and speeds of computers from year dot, see:
http://www.roylongbottom.org.uk/whetstone%20results.htm
http://www.roylongbottom.org.uk/whetstone.htm
Dhrystone - is a later sort of Whetstone benchmark without floating point. Results are in VAX MIPS or DMIPS (relative to the DEC VAX 11/780 minicomputer) and these are highly dependent on optimisation in a particular compiler.
The RPi 3 was 48% faster than the RPi 2 and clearly faster than Pentium 3 CPUs. My results and historic speeds are in the following. The latter provides ratings in Dhrystones Per Second, which need dividing by 1757 for DMIPS (a later variation of the VAX 11/780 score is shown in the PDF file).
http://www.roylongbottom.org.uk/dhrystone%20results.htm
http://www.cs.virginia.edu/~mk2z/cs654/ ... chmark.pdf
ARM reports results in DMIPS/MHz; the RPi 3 rating on this benchmark is 2.06. ARM's rating is nearly always higher than via my benchmark, but this might be a throwback to the origins, where hardware and software were designed together for the highest benchmark speeds.
Linpack - This has floating point calculations, as in the original, using 100x100 matrices of double precision (DP) numbers, normally L2 cache sized data. Performance depends almost entirely on a function calculating dy = dy + da*dx, suitable for vector type linked add and multiply. A version using NEON intrinsic functions is provided. As this uses single precision (SP), a standard compilation of this is also provided. Speed is measured in Millions of Floating Point Operations Per Second (MFLOPS).
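That function is the well known daxpy loop which, in essence, is just:
Code: Select all
/* The daxpy kernel that dominates Linpack time: one add and one
   multiply per element. */
void daxpy(int n, double da, double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}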
Performance improvements over RPi 2 were 17% DP, 24% SP and 62% NEON, with RPi 3 measurements of 180, 194 and 486 MFLOPS. The DP result can be vaguely compared to a Pentium III E at 185 MFLOPS.
http://www.roylongbottom.org.uk/linpack%20results.htm
http://netlib.org/benchmark/performance.pdf
Livermore Loops - These comprise 24 kernels of numerical application, with speed measured in MFLOPS. The original was used to verify performance of the first Cray 1 supercomputer (cost $7 Million). The official average is geometric mean.
Average RPi 3 speed of 186 MFLOPS was 48% faster than the RPi 2, 15.6 times that of a Cray 1 supercomputer, and similar to a 1700 MHz Pentium 4. The original results were in "The Livermore Fortran Kernels: A Computer Test Of The Numerical Performance Range" by F.H. McMahon. This appears to be available for downloading but, in the case of researchgate.net, you will need approval from the authors (I am still waiting for it). My report includes a few summary results for some CDC and Cray computers.
http://www.roylongbottom.org.uk/livermo ... esults.htm
Re: Raspberry Pi Benchmarks
Raspberry Pi 3 Memory Benchmarks - up to 3.66 times faster
These benchmarks measure performance of processing data from caches and RAM. Performance improvements over the Raspberry Pi 2, using RAM, can be expected, as the clock speed is double. The benchmarks covered here use ten or eleven data sizes between 8 KB and 65 MB, with results in MB/second. A summary, some detail and comparisons (with the 900 MHz RPi 2) are provided below, with full details in:
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
MemSpeed - measures speeds carrying out floating point and integer calculations, with one and two operations per word. As shown below, the best RPi 3 improvement is from RAM, at 3.25 times. Relative speeds using cached data can be similar to the CPU MHz difference when calculating with double precision numbers, with single precision and integer tests better, particularly using L2 cache.
NEON MemSpeed - This is MemSpeed with the compiler instructed to use NEON instructions. These are currently not applicable for double precision working, as reflected in similar speeds to MemSpeed. The main advantage is on single precision floating point calculations, particularly using the RPi 2.
BusSpeed - this uses data streaming, ANDing integers. It has variable address incrementing to show where burst reading occurs and possibly help to identify maximum speeds. See the above HTM file for details. Here, RPi 3 improvements are shown for reading all data. Although RAM MB/second measurements are the fastest amongst these tests, RPi 3 performance is not much better than the RPi 2 CPU clock difference of 1.33. Gains using L1 and L2 caches were 2.55 and 2.12 times.
NeonSpeed - This carries out the same single precision and integer calculations as MemSpeed, where Norm is compiled to use NEON instructions and Neon is from intrinsic functions (that the compiler might translate into faster code). Raspberry Pi 2 results show little difference between the two methods, but Pi 3 shows faster speeds via intrinsics, leading to faster relative performance. All RAM tests are at least 3.3 times faster on the RPi 3.
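As an indication of the intrinsics route, a minimal sketch of one such calculation (a hypothetical loop using arm_neon.h, not the benchmark source; n is assumed to be a multiple of 4):
Code: Select all
/* Four single precision elements per operation via NEON intrinsics.
   Compile with -mfpu=neon (or neon-vfpv4). */
#include <arm_neon.h>

void triad_neon(int n, float s, float *x, float *y)
{
    float32x4_t vs = vdupq_n_f32(s);         /* broadcast s to 4 lanes */
    for (int m = 0; m < n; m += 4) {
        float32x4_t vx = vld1q_f32(&x[m]);
        float32x4_t vy = vld1q_f32(&y[m]);
        vx = vmlaq_f32(vx, vy, vs);          /* vx = vx + vy * vs */
        vst1q_f32(&x[m], vx);
    }
}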
Fast Fourier Transforms - Original and optimised single and double precision FFT calculations, real applications, from 1K to 1024K. See the HTM file for details. Results are in milliseconds (the lower the better). Performance depends on random or skipped sequential access, which might be why RPi 3 performance gains are similar to the CPU clock speed ratio.
Code: Select all
Memory Speed Tests Calculating and Copying
x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
Cache Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
RAM MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
memspeedPiA7
RPi 2
L1 1197 1041 1955 1955 1320 2667 1926 2570 2622
L2 1096 1005 1549 1556 1115 1859 1245 1226 1225
RAM 343 333 384 379 349 404 952 693 693
RPi 3
L1 1606 1790 3383 2344 2203 3575 2703 3127 3147
L2 1560 1708 3223 2233 2069 3462 2614 2985 2958
RAM 893 1043 1250 1146 1089 1238 1038 925 927
RPi3/2
L1 1.34 1.72 1.73 1.20 1.67 1.34 1.40 1.22 1.20
L2 1.42 1.70 2.08 1.44 1.86 1.86 2.10 2.43 2.41
RAM 2.60 3.13 3.25 3.02 3.12 3.07 1.09 1.33 1.34
memSpdPiNEON
RPi 2
L1 1229 1776 2029 2028 2367 2832 2024 2832 2827
L2 1056 1321 1458 1460 1621 1726 1448 1091 1092
RAM 329 352 357 355 369 378 771 530 531
RPi 3
L1 1608 2346 3387 2348 3112 3717 2691 3144 3140
L2 1547 2198 3144 2198 2889 3388 2618 3009 3009
RAM 931 1155 1233 1142 1167 1241 1028 949 954
RPi3/2
L1 1.31 1.32 1.67 1.16 1.31 1.31 1.33 1.11 1.11
L2 1.46 1.66 2.16 1.51 1.78 1.96 1.81 2.76 2.75
RAM 2.83 3.28 3.45 3.21 3.16 3.29 1.33 1.79 1.80
Bus/Cache/RAM Reading Speed Test - busspeedPiA7
Reading Speed 4 Byte Words in MBytes/Second
Cache Inc32 Inc16 Inc8 Inc4 Inc2 Read Read
RAM Words Words Words Words Words All All
X RPi 2
RPi 2
L1 1095 1414 1535 1721 1684 1710
L2 377 405 697 1203 1573 1630
RAM 72 79 159 317 643 1264
RPi 3
L1 2650 2985 3431 4321 4348 4362 2.55
L2 556 559 1015 1781 2747 3462 2.12
RAM 119 128 246 492 974 1789 1.42
Vector Reading Speed - NeonSpeed - MBytes/Second
Float v=v+s*v Int v=v+v+s Neon v=v+v NEON/Normal
Norm 1 Neon Norm 2 Neon Float Int 1 2
Rpi 2
L1 1906 1965 2041 2273 2326 2771 1.03 1.11
L2 1449 1470 1543 1611 1635 1826 1.01 1.04
RAM 358 350 365 314 345 354 0.98 0.86
Rpi 3
L1 2659 3854 3364 4052 4283 4535 1.45 1.20
L2 2495 3457 3159 3591 3724 3909 1.39 1.14
RAM 1198 1249 1240 1148 1241 1236 1.04 0.93
RPi3/2
L1 1.40 1.96 1.65 1.78 1.84 1.64
L2 1.72 2.35 2.05 2.23 2.28 2.14
RAM 3.34 3.56 3.39 3.66 3.59 3.50
Re: Raspberry Pi Benchmarks
RoyLongbottom wrote: Performance improvements over RPi 2 were 17% DP, 24% SP and 62% NEON, with RPi 3 measurements of 180, 194 and 486 MFLOPS.
As discussed in a different thread, versions of Linpack compiled with an ARM optimized subroutine library score 1.4 GFLOPS on the Raspberry Pi 2B and 6.4 GFLOPS on the Pi 3 for problem sizes around 8000 by 8000. This is a 4.5 times speedup when switching between models. Note that 100 by 100 is an extremely small problem size for current computers. Also note that without proper cooling the speedup is only about 2.2 times.
Re: Raspberry Pi Benchmarks
That thread is not very convincing that the RPi 3 achieves 6.4 Linpack GFLOPS, 4.5 times faster than the RPi 2, especially as the title is "Pi3 incorrect results under load". Those residual checks are there for a purpose and must be correct (consistently near expectations) for the reported performance to be trusted. Then, can the clock measurement be trusted over a long time, with overheating occurring? I would like to see a series of complete results with decreasing values of N, to the point where consistent speeds are produced, and some setting affinity to use one CPU core (see my example below).
On the other hand, I am just about to include results for my MP benchmarks, demonstrating more than 6 GFLOPS on the RPi 3. This is using single precision NEON instructions (were those results via SP NEON?). The RPi 2 was up to 2.7 GFLOPS, the former being 2.2 times faster.
The source code for that Linpack is not the same as my Linpack 1, which is completely unsuitable for multi-threading, so results cannot be compared. Linpack 1 also depends on large cache size. Results for my NEON Linpack MP benchmark are below, unthreaded and with 1, 2 and 4 threads. The source code has some slight changes, with threading for selected parts (somebody might be able to do better), but it checks that results such as residuals are the same with and without threading.
Performance via multi-threading is much the same with 1, 2 or 4 threads, as they use shared data (but different segments) and one core at a time. At N=100, thread start/stop overheads are more significant, producing the worst performance, so N=100 is fastest without threading.
From that reference topic:
"3. What level of under clocking is safe for running optimized NEON code on a Pi 3B without a heat sink?"
Most important issues are running time and number of cores used. See details of my stress tests:
viewtopic.php?f=68&t=145374
Mine has a simple stick-on heatsink that makes little difference.
Code: Select all
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Mon Aug 15 19:44:30 2016
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 538.46 116.24 113.61 113.47
N 500 467.73 335.53 338.61 338.97
N 1000 363.87 336.10 336.72 336.22
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
Thread
0 - 4 Same Results Same Results Same Results
Re: Raspberry Pi Benchmarks
RoyLongbottom wrote: On the other hand, I am just about to include results for my MP benchmarks, demonstrating more than 6 GFLOPS on the RPi 3. This is using single precision NEON instructions (were those results via SP NEON?).
The ARM optimized Linpack runs include the diagnostic report
Code: Select all
N : 8000
NB : 256
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Right
BCAST : 2ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
While there are multiple reports in that thread of people running the benchmark on broken hardware, there are also many reports of people who resolved their hardware issues with proper cooling and overvolting of the CPU. I think some changes were eventually made to the driver code that switches the CPU in and out of turbo mode, to improve stability. Thus, 6.4 double precision GFLOPS appears to be the correct speed of Linpack for 8000 by 8000 sized problems on a properly cooled Pi 3.
For your single precision MP results, getting only a 2.2 times speedup hints at an overheated Pi 3 running fully throttled at 600 MHz rather than the usual 1200 MHz. It would be interesting to see how the numbers change with a better heatsink and fan.
Re: Raspberry Pi Benchmarks
The benchmark execution time is only 5 seconds, around 1.25 seconds using four cores, and it increases CPU temperature by less than 2°C. I compiled it to run up to 10 times longer and the results are below. You will see that they meet my minimum requirements for believing benchmark results. That is evidence that the maximum speeds might be correct: in this case, up to four times the speed with four threads, and the threaded calculations all produce the same numeric answers.
Recorded CPU MHz and temperatures are also shown, at up to 70.9°C with no performance degradation. So I compiled another version aiming for 500 seconds. As shown below, this eventually led to throttling and degraded performance. As I said before, the most important issues are running time and number of cores used.
Code: Select all
################### Test 47 seconds ##################
MP-MFLOPS NEON Intrinsics v2.0 Wed Aug 17 00:59:30 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 595 586 421 1637 1642 1596
2T 1179 1164 426 3270 3266 3153
4T 2024 2005 429 6244 6455 5892
8T 1938 2129 430 6235 6379 5820
Results x 100000, 12345 indicates ERRORS
1T 40392 76406 99700 35218 66014 99520
2T 40392 76406 99700 35218 66014 99520
4T 40392 76406 99700 35218 66014 99520
8T 40392 76406 99700 35218 66014 99520
End of test Wed Aug 17 01:00:17 2016
#####################################################
Temperature and CPU MHz Measurement
Start at Wed Aug 17 00:59:30 2016
Using 50 samples at 1 second intervals
Seconds
0.0 1200 scaling MHz, 1199 ARM MHz, temp=54.8'C
1.0 1200 scaling MHz, 1200 ARM MHz, temp=56.4'C
2.0 1200 scaling MHz, 1200 ARM MHz, temp=56.9'C
3.1 1200 scaling MHz, 1200 ARM MHz, temp=57.5'C
4.1 1200 scaling MHz, 1200 ARM MHz, temp=56.9'C
5.2 1200 scaling MHz, 1200 ARM MHz, temp=57.5'C
6.2 1200 scaling MHz, 1200 ARM MHz, temp=56.9'C
7.2 1200 scaling MHz, 1200 ARM MHz, temp=56.9'C
8.3 1200 scaling MHz, 1199 ARM MHz, temp=57.5'C
9.3 1200 scaling MHz, 1200 ARM MHz, temp=58.0'C
10.3 1200 scaling MHz, 1199 ARM MHz, temp=58.0'C
11.4 1200 scaling MHz, 1199 ARM MHz, temp=58.5'C
12.4 1200 scaling MHz, 1200 ARM MHz, temp=58.5'C
13.4 1200 scaling MHz, 1199 ARM MHz, temp=58.5'C
14.5 1200 scaling MHz, 1200 ARM MHz, temp=58.5'C
15.5 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
16.6 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
17.6 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
18.6 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
19.7 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
20.7 1200 scaling MHz, 1200 ARM MHz, temp=59.1'C
21.7 1200 scaling MHz, 1200 ARM MHz, temp=59.6'C
22.8 1200 scaling MHz, 1200 ARM MHz, temp=60.1'C
23.9 1200 scaling MHz, 1200 ARM MHz, temp=62.3'C
25.1 1200 scaling MHz, 1199 ARM MHz, temp=61.2'C
26.2 1200 scaling MHz, 1200 ARM MHz, temp=61.2'C
27.2 1200 scaling MHz, 1200 ARM MHz, temp=61.8'C
28.2 1200 scaling MHz, 1200 ARM MHz, temp=61.2'C
29.3 1200 scaling MHz, 1200 ARM MHz, temp=62.3'C
30.3 1200 scaling MHz, 1200 ARM MHz, temp=62.3'C
31.3 1200 scaling MHz, 1200 ARM MHz, temp=62.8'C
32.4 1200 scaling MHz, 1199 ARM MHz, temp=62.3'C
33.4 1200 scaling MHz, 1199 ARM MHz, temp=63.4'C
34.5 1200 scaling MHz, 1200 ARM MHz, temp=65.5'C
36.0 1200 scaling MHz, 1200 ARM MHz, temp=64.5'C
37.0 1200 scaling MHz, 1200 ARM MHz, temp=65.5'C
38.0 1200 scaling MHz, 1199 ARM MHz, temp=65.5'C
39.1 1200 scaling MHz, 1200 ARM MHz, temp=67.7'C
40.2 1200 scaling MHz, 1200 ARM MHz, temp=68.8'C
41.2 1200 scaling MHz, 1200 ARM MHz, temp=67.7'C
42.7 1200 scaling MHz, 1200 ARM MHz, temp=67.7'C
43.8 1200 scaling MHz, 1200 ARM MHz, temp=68.2'C
44.8 1200 scaling MHz, 1200 ARM MHz, temp=68.8'C
45.9 1200 scaling MHz, 1200 ARM MHz, temp=69.8'C
46.9 1200 scaling MHz, 1200 ARM MHz, temp=70.9'C
Test Finished
48.0 1200 scaling MHz, 1199 ARM MHz, temp=67.7'C
49.0 1200 scaling MHz, 1200 ARM MHz, temp=66.6'C
50.1 1200 scaling MHz, 1200 ARM MHz, temp=65.5'C
51.1 600 scaling MHz, 600 ARM MHz, temp=64.5'C
52.1 600 scaling MHz, 600 ARM MHz, temp=63.4'C
53.2 600 scaling MHz, 600 ARM MHz, temp=62.8'C
End at Wed Aug 17 01:00:23 2016
################# Test 502 seconds ####################
MP-MFLOPS NEON Intrinsics v2.0 Wed Aug 17 21:42:12 2016
1T 594 584 420 1637 1632 1603
2T 1175 1171 421 3269 3264 3152
4T 2234 2227 421 5894 5224 4192
8T 1801 1921 416 4719 4519 3899
Results x 100000, 12345 indicates ERRORS
1T 40015 40392 97075 35216 35218 95363
2T 40015 40392 97075 35216 35218 95363
4T 40015 40392 97075 35216 35218 95363
8T 40015 40392 97075 35216 35218 95363
End of test Wed Aug 17 21:50:34 2016
#####################################################
Temperature and CPU MHz Measurement
Start at Wed Aug 17 21:42:12 2016
Seconds
0.0 1200 scaling MHz, 1200 ARM MHz, temp=56.4'C
1.0 1200 scaling MHz, 1199 ARM MHz, temp=56.9'C
2.0 1200 scaling MHz, 1200 ARM MHz, temp=58.0'C
1200 to
356.2 1200 scaling MHz, 1200 ARM MHz, temp=79.5'C
357.3 1200 scaling MHz, 1200 ARM MHz, temp=80.1'C
358.3 1200 scaling MHz, 1167 ARM MHz, temp=80.6'C
359.4 1200 scaling MHz, 1154 ARM MHz, temp=80.6'C
360.4 1200 scaling MHz, 1136 ARM MHz, temp=81.1'C
Down To
380.5 1200 scaling MHz, 983 ARM MHz, temp=81.7'C
381.5 1200 scaling MHz, 992 ARM MHz, temp=82.7'C
Down To
400.7 1200 scaling MHz, 857 ARM MHz, temp=83.3'C
401.8 1200 scaling MHz, 832 ARM MHz, temp=82.7'C
Test Finished
503.8 1200 scaling MHz, 1012 ARM MHz, temp=81.7'C
504.8 1200 scaling MHz, 1116 ARM MHz, temp=80.1'C
505.9 1200 scaling MHz, 1184 ARM MHz, temp=80.1'C
506.9 600 scaling MHz, 600 ARM MHz, temp=79.0'C
508.0 600 scaling MHz, 600 ARM MHz, temp=78.4'C
Re: Raspberry Pi Benchmarks
RoyLongbottom wrote: Recorded CPU MHz and temperatures are also shown, at up to 70.9°C with no performance degradation.
It does look like you can run for quite a while before the system starts throttling. I'm looking forward to the final MP results.
Re: Raspberry Pi Benchmarks
Raspberry Pi High Performance Linpack Benchmark
I downloaded a precompiled version of High Performance Linpack for Raspberry Pi from the link below, and installed it on my Raspberry Pi 2 and 3 systems:
https://www.howtoforge.com/tutorial/hpl ... pberry-pi/
I ran it on the RPi 2 at N = 1000, 2000, 4000 and 8000, using 1, 2 and 4 threads (or cores, via the taskset command), and all ran successfully. The 1 and 2 thread tests ran on the RPi 3 without any problems, but not when attempting to use four cores: occasional correct operation occurred (or appeared to); otherwise errors were reported or the system crashed, using the larger data sizes.
Below are the MFLOPS results, behaving as expected by doubling single thread performance when using two cores. Then the improvement of four cores to one was up to 13.6 times on the RPi 2 and 34.9 times on the RPi 3, with the highest gains at N=1000. Based on previous MP performance, no better than 3.9 would be expected.
Performance gains of the RPi 3 over the RPi 2 were up to 3.8 times, or 3.15 times ignoring 4 thread speeds. With only 4 cores, the performance improvements at larger data sizes are (to me) rather surprising.
Does anyone have explanations for the strange performance and what else I could try? My only suspect area is something to do with the shared L2 cache.
For one test that crashed, I ran with N=4000 from a remote PC via the PuTTY program. I opened three terminals: one to run the HPL benchmark, one to run vmstat to show system utilisation, and one to run my CPU MHz and temperature monitoring program. The results are below, up to where the RPi CPU stopped running, after testing for 6 seconds. There was little increase in temperature and no clock throttling, no strange vmstat performance was recorded, and 4 cores were only fully active for the last two seconds.
I have a 2008 version of the benchmark from Intel. I ran it on a Windows 10 based tablet, with a 1.44 to 1.84 GHz quad core Atom x5-Z8300 processor. Following are results, with the sort of performance differences I would expect but, of course, I can’t say that the Raspberry Pi should follow that behaviour pattern.
Code: Select all
MFLOPS RPi 3/ Gains vs 1 Thread
N Threads Rpi 3 Rpi 2 RPi 2 Rpi 3 Rpi 2
1000 1 80 74 1.08
2 159 159 1.00 2.0 2.1
4 2794 1009 2.77 34.9 13.6
2000 1 224 162 1.38
2 505 355 1.42 2.3 2.2
4 4029 1229 3.28 18.0 7.6
4000 1 612 284 2.15
2 1317 584 2.26 2.2 2.1
4 5425 1429 3.80 8.9 5.0
8000 1 1119 356 3.14
2 2268 720 3.15 2.0 2.0
4 N/A 1514 4.3
Code: Select all
Start at Fri Aug 19 21:24:07 2016
Using 40 samples at 1 second intervals
Boot Settings
dtparam=audio=on
Seconds
0.0 600 scaling MHz, 600 ARM MHz, temp=52.6'C
1.0 600 scaling MHz, 1200 ARM MHz, temp=52.6'C
2.0 1200 scaling MHz, 1200 ARM MHz, temp=52.6'C
3.1 1200 scaling MHz, 1200 ARM MHz, temp=53.7'C
4.1 600 scaling MHz, 600 ARM MHz, temp=52.6'C
Start
5.2 1200 scaling MHz, 1200 ARM MHz, temp=53.7'C
6.2 1200 scaling MHz, 1200 ARM MHz, temp=55.8'C
7.2 1200 scaling MHz, 1200 ARM MHz, temp=55.8'C
8.3 1200 scaling MHz, 1200 ARM MHz, temp=56.9'C
9.3 1200 scaling MHz, 1200 ARM MHz, temp=61.2'C
10.4 1200 scaling MHz, 1200 ARM MHz, temp=62.8'C
#############################################################################
pi@raspberrypi:~ $ vmstat 1 40
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 524292 36648 236528 0 0 45 3 175 69 2 1 97 0 0
0 0 0 524276 36648 236528 0 0 0 0 612 152 0 0 100 0 0
0 0 0 524292 36648 236528 0 0 0 72 631 195 0 0 100 0 0
0 0 0 524324 36648 236528 0 0 0 0 572 104 0 0 100 0 0
0 0 0 524004 36784 236568 0 0 158 16 964 697 1 2 97 0 0
0 0 0 523616 36784 236568 0 0 0 12 822 493 2 0 98 0 0
0 0 0 523376 36784 236568 0 0 0 12 776 417 1 1 98 0 0
0 0 0 523376 36784 236568 0 0 0 12 794 447 2 1 98 0 0
0 0 0 523128 36784 236568 0 0 0 12 805 471 2 2 97 0 0
Start
1 0 0 499504 36784 236568 0 0 0 128 1035 669 18 15 67 0 0
1 0 0 458228 36784 236568 0 0 0 12 874 434 25 2 74 0 0
1 0 0 417184 36784 236568 0 0 0 12 818 370 25 2 73 0 0
4 0 0 394444 36784 236564 0 0 0 12 928 382 45 18 37 0 0
4 0 0 393700 36784 236564 0 0 0 12 1080 411 95 6 0 0 0
4 0 0 393396 36784 236568 0 0 0 56 1126 514 92 8 0 0 0
Used 129732 KB
or about 4 x 4 x 8 MB for N = 4000
Code: Select all
N Threads MFLOPS Seconds Residual Resid(norm)
1000 1 1350 0.50 1.161404E-12 3.960687E-02
2 2466 0.27 1.161404E-12 3.960687E-02
4 4120 0.16 1.161404E-12 3.960687E-02
2000 1 1540 3.47 4.756195E-12 4.137307E-02
2 2800 1.91 4.756195E-12 4.137307E-02
4 4601 1.16 4.756195E-12 4.137307E-02
4000 1 1633 26.15 1.702119E-11 3.709929E-02
2 3008 14.19 1.702119E-11 3.709929E-02
4 5345 7.99 1.702119E-11 3.709929E-02
8000 1 1641 208.04 5.967551E-11 3.282671E-02
2 3088 110.58 5.967551E-11 3.282671E-02
4 4982 68.55 5.967551E-11 3.282671E-02
Re: Raspberry Pi Benchmarks
Raspberry Pi 3 Multithreading Benchmarks
The first ones are attempts to obtain better performance running the Classic Benchmarks. For detailed descriptions and results of all multithreading benchmarks see:
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
MP Whetstone Benchmarks
As with other multithreading benchmarks, this one runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program code, each thread having dedicated variables. These should all be stored in L1 cache. With no conflicts, as shown below, doubling the number of threads leads to a near doubling of measured performance.
Raspberry Pi 3 overall MWIPS ratings are 1.37 times RPi 2 speeds, with ratios for other tests in the range 1.19 to 1.79, except the last copy test average of 2.73.
Code: Select all
MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 723.1 517.2 517.0 254.9 12.1 8.8 5853.9 1181.8 1189.8
2T 1464.7 960.5 1025.1 511.3 24.1 18.5 11899.0 2381.2 2385.7
4T 2902.3 1696.4 1867.3 1013.4 47.8 36.8 19754.6 4541.3 4687.1
8T 3004.0 2747.8 2569.0 1066.4 48.6 38.0 25502.9 6075.2 5610.8
Overall Seconds 4.77 1T, 4.74 2T, 4.88 4T, 9.76 8T
Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33
1T 1.37 1.43 1.42 1.38 1.21 1.57 1.77 1.33 2.67
2T 1.39 1.33 1.41 1.39 1.21 1.65 1.79 1.34 2.68
4T 1.37 1.23 1.28 1.37 1.19 1.64 1.49 1.27 2.62
8T 1.37 1.44 1.39 1.32 1.19 1.65 1.45 1.26 2.96
MP Dhrystone Benchmark
This uses shared program code and dedicated memory for arrays, but some read/write variables are shared. This can result in multithreaded performance providing little improvement, or even being worse than a single thread.
Raspberry Pi 3 performance using a single thread is not much faster than model 2, at 1.43 times, compared with a CPU MHz ratio of 1.33. Then it appears to perform much better using threads, at up to 3.49 times faster.
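The sort of mechanism involved can be shown with a small sketch (illustrative code, not from the benchmark): even variables that are each only written by one thread can slow everything down when they share a cache line, as the line is passed back and forth between cores.
Code: Select all
/* Two threads update adjacent read/write variables that share a
   cache line, so multithreaded speed hardly improves. */
#include <pthread.h>
#include <stdio.h>

static struct { long a; long b; } shared;    /* same cache line */

static void *bump_a(void *p) { for (long i = 0; i < 100000000L; i++) shared.a++; return NULL; }
static void *bump_b(void *p) { for (long i = 0; i < 100000000L; i++) shared.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.a, shared.b);
    return 0;
}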
Code: Select all
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.95 1.12 1.59 3.04
Dhrystones per Second 4229473 7124952 10091677 10523432
VAX MIPS rating 2407 4055 5744 5989
Internal pass count correct all threads
End of test Mon Aug 15 19:48:04 2016
Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33
VAX MIPS rating 1.43 1.51 3.49 2.42
MP Linpack Benchmark
The original Linpack benchmark operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as that for single precision floating point. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. Multiple threads each use different segments of shared data arrays.
The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for the latter is the same as for 100x100, except users are allowed to employ their own linear equation solver.
Performance of this MP benchmark is limited by the overhead of creating and closing threads too frequently, resulting in slower speeds using multiple threads. At 100x100, data size is 40 KB, L2 cache based. With larger matrices, performance becomes more dependent on RAM, but multi-threading overheads have less influence.
Raspberry Pi 3 - At N=100, average speed was 1.73 times that of the RPi 2, with 1.52 to 1.59 times using the larger matrices. These can be compared with a CPU MHz ratio of 1.33.
Code: Select all
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Mon Aug 15 19:44:30 2016
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 538.46 116.24 113.61 113.47
N 500 467.73 335.53 338.61 338.97
N 1000 363.87 336.10 336.72 336.22
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
Thread
0 - 4 Same Results Same Results Same Results
Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33
Threads None 1 2 4
N 100 1.67 1.75 1.75 1.76
N 500 1.69 1.55 1.57 1.57
N 1000 1.55 1.52 1.51 1.50
Re: Raspberry Pi Benchmarks
Memory MP Benchmarks
Next, we have benchmarks that use caches and RAM, with data sizes 12.3 KB for L1 cache, 122.9 KB for L2 cache and 12288 KB for RAM. Details and results in:
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
MP-BusSpeed Benchmark
This is read only, using AND instructions, with varying address increments from 32 words to 1 word, to identify where burst reading occurs. All threads read the same data, but each thread starts from different addresses, avoiding too high RAM performance due to data being in the shared L2 cache.
Raspberry Pi 3 results are shown below. Just considering 1 word address increments (RdAll), with comparisons against the RPi 2 in the last column, the best RAM speed improvements were the same as the memory bus speed difference. Cache speed improvements were around 1.9 times, compared with the CPU MHz ratio of 1.33. MP gains of 4/1 threads averaged 3.65 from caches, but RAM improvements were disappointing.
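The reading loop is essentially of the following form (a sketch, not the actual source), where Inc32 touches one word in 32 and RdAll reads every word:
Code: Select all
/* AND words at a given address increment; the result is returned so
   the compiler cannot remove the loop. */
unsigned int read_and(const unsigned int *data, int words, int inc)
{
    unsigned int x = 0xFFFFFFFFu;
    for (int i = 0; i < words; i += inc)
        x &= data[i];
    return x;
}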
Code: Select all
MP-BusSpd ARM V7A v2 Tue Aug 30 13:45:43 2016
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
RPi3/RPi2
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll RdAll
12.3 1T 1565 3749 3718 4078 4385 4160 1.87
2T 5041 6829 7066 7813 8584 7839 1.89
4T 5480 11958 13330 15256 16863 15614 1.92
8T 6006 8477 8873 7777 8918 8315
122.9 1T 566 566 1062 1822 2831 3907 1.91
2T 899 906 1742 2395 5433 7638 1.88
4T 907 935 1876 3757 7241 13871 1.76
8T 863 919 1789 3491 6411 9403
12288 1T 130 136 263 513 1047 2080 1.81
2T 185 138 276 554 1108 2149 1.71
4T 131 137 269 536 1169 2383 2.01
8T 125 133 224 513 1038 2142
End of test Tue Aug 30 13:45:55 2016
MP RandMem Benchmark
The benchmark has serial and random address selections, using the same program indexing structure, with read and read/write tests involving 32 bit integers. It uses data from the same array for all threads, but starting at different points. The use of shared data, with write back, leads to no increase in throughput using multiple threads. Also, random access speed can be considerably influenced by reading and writing in bursts.
Results below show average Raspberry Pi 3 vs Pi 2 performance ratios. Some are not much better than the CPU MHz increase of 1.33 times, somewhat better on L2 cache serial activities, and surprisingly high with serial reading from RAM. Average MP 4/1 thread gains were around 3.8 times on reading L1 cache data and on L2 serial read tests, but lower via random reading from L2. Multiple thread random reading from RAM was particularly good, probably due to some data being in the shared L2 cache. Unexpectedly, serial reading from RAM was better than in MP BusSpeed above, with a 4/1 thread improvement of 1.58 times.
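The serial and random cases use the same indexing structure, roughly as below (a sketch, not the actual source); only the contents of the index array differ:
Code: Select all
/* index[] holds 0,1,2,... for serial access, or a shuffled
   permutation of the same values for random access. */
long read_indexed(const int *data, const int *index, int n)
{
    long x = 0;
    for (int i = 0; i < n; i++)
        x += data[index[i]];
    return x;
}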
Code: Select all
MP-RandMem Linux/ARM V7A v1.0 Tue Aug 30 14:13:08 2016
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2930 3791 2918 3791
2T 5571 3766 5194 3776
4T 11196 3722 11205 3722
8T 10063 3685 10051 3702
122.9 1T 2675 3398 681 893
2T 5124 3387 1256 886
4T 10041 3387 1916 891
8T 9593 3367 1952 890
12288 1T 2120 979 54 71
2T 3255 980 107 71
4T 3346 979 138 70
8T 2226 979 143 71
End of test Tue Aug 30 14:13:54 2016
RPi3/RPi2 Average
L1 cache 1.53 1.36 1.47 1.35
L2 cache 1.86 2.29 1.24 1.31
RAM 4.46 1.04 1.17 1.25
Returning to MP-MemSpeed, the average full OpenMP results are below for Raspberry Pi 2 and 3, then for the RPi 3 with OpenMP using one thread and compiled without OpenMP.
Code: Select all
x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
Cache Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
RAM MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
Rpi2 4 Threads
L1 3131 2508 259 3273 2906 281 1188 1705 441
L2 2805 1781 232 4278 2846 236 1114 1297 290
RAM 747 646 283 930 964 272 1037 1071 288
RPi3 4 Threads
L1 5475 3134 1312 10182 5104 1435 15879 8025 1227
L2 4772 2916 1247 7935 4607 1329 8277 6223 1294
RAM 2788 1903 1324 4013 2794 1099 1065 1063 1073
RPi3 1 Thread
L1 1539 789 996 2582 1303 1022 4177 2357 653
L2 1380 745 922 2145 1186 945 3356 2061 633
RAM 995 653 798 1226 924 813 1189 1184 614
RPi3 Not Threaded
L1 1582 2511 3733 2360 3405 3733 2724 2722 2722
L2 1416 2071 2875 1978 2707 2888 2439 2337 2346
RAM 1032 1242 1295 1216 1291 1288 1030 1021 1021
Code: Select all
MP MemSpeed Performance gains Summary
Average Min Max
Raspberry Pi 3 v Raspberry Pi 2 OpenMP 3.77 0.99 13.37
Raspberry Pi 3 OpenMP v Not Threaded 1.97 0.35 5.83
Raspberry Pi 3 Not Threaded v OMP 1 Thread 1.92 0.65 4.17
Re: Raspberry Pi Benchmarks
Maximum Floating Point Speed
The last series of multithreading benchmarks was intended to measure maximum floating point speed. These benchmarks execute functions where the arithmetic operations are of the form x = (x + a) * b - (x + c) * d + (x + e) * f, with 2 or 32 operations per input single precision floating point data word. Array sizes used cover L1 cache, L2 cache and RAM as separate tests, all at 1, 2, 4 and 8 threads. Each thread uses the same calculations but accesses different segments of the data.
When compiled by GCC and run on Intel based PCs, near maximum performance could be demonstrated. Using parameters to generate SSE instructions, maximum single core MFLOPS could be CPU MHz x 4 (SSE 128 bit registers) x 2 (linked multiply and add). Then, AVX 1 would be twice as fast, so we have 32 and 64 times CPU MHz for a quad core processor. Running on a quad core i7, 23 out of 32 and 45.6 out of 64 times CPU MHz were demonstrated - 90 and 178 GFLOPS via a 3.9 GHz CPU.
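As a worked example of those limits (my arithmetic, using the figures quoted above and the Cortex-A53 figure quoted below):
Code: Select all
#include <stdio.h>

/* peak GFLOPS = GHz x vector lanes x 2 (linked multiply and add) x cores */
static double peak(double ghz, int lanes, int cores)
{
    return ghz * lanes * 2.0 * cores;
}

int main(void)
{
    printf("SSE  4 cores at 3.9 GHz: %5.1f GFLOPS\n", peak(3.9, 4, 4)); /* 124.8 */
    printf("AVX  4 cores at 3.9 GHz: %5.1f GFLOPS\n", peak(3.9, 8, 4)); /* 249.6 */
    printf("NEON 4 cores at 1.2 GHz: %5.1f GFLOPS\n", peak(1.2, 4, 4)); /*  38.4 */
    return 0;
}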
Raspberry Pi 3 Cortex A53
The Raspberry Pi 3 Cortex-A53 processor is also said to have a maximum speed of 32 x CPU GHz (as Intel SSE), or 38.4 GFLOPS at 1.2 GHz, with double precision at a quarter of this speed.
The later benchmarks are MP-MFLOPSPiA7 and MP-MFLOPSDP, compiled for Cortex-A7, and MP-MFLOPSPiNeon, compiled from the same code for Cortex-A7 with NEON SIMD, where the latter, with fused multiply and add, would be expected to produce maximum speeds. Note that NEON, at this time, only deals with 32 bit single precision operations. Another variation is MP-NeonMFLOPS, this time produced using manually inserted intrinsic functions, with results virtually the same as the compiled C version. Finally, OpenMP-MFLOPS, an established PC version, was produced. This has run time parameters for the starting number of data words and repeat passes. As it happens, it has a useful extra set of tests with 8 operations per word. In order to provide a benchmark with no OpenMP or threading overheads, this was recompiled as notOpenMP-MFLOPS, to test a single core. Full details of all of these benchmarks and results are in:
http://www.roylongbottom.org.uk/Raspber ... hmarks.htm
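For reference, a minimal sketch of the calculation using NEON intrinsics is below. This is hypothetical illustration code, not MP-NeonMFLOPS itself; n is assumed to be a multiple of 4, and it needs to be compiled for ARM with NEON enabled (e.g. -mfpu=neon).
Code: Select all
#include <arm_neon.h>

/* x[m] = (x[m]+a)*b - (x[m]+c)*d + (x[m]+e)*f, four floats at a time,
   with the constants broadcast to vector registers outside the loop */
void calc(float *x, int n, float a, float b, float c,
          float d, float e, float f)
{
    float32x4_t va = vdupq_n_f32(a), vb = vdupq_n_f32(b);
    float32x4_t vc = vdupq_n_f32(c), vd = vdupq_n_f32(d);
    float32x4_t ve = vdupq_n_f32(e), vf = vdupq_n_f32(f);
    int m;
    for (m = 0; m < n; m += 4)
    {
        float32x4_t vx = vld1q_f32(x + m);
        float32x4_t t  = vmulq_f32(vaddq_f32(vx, va), vb); /* (x+a)*b   */
        t = vmlsq_f32(t, vaddq_f32(vx, vc), vd);           /* - (x+c)*d */
        t = vmlaq_f32(t, vaddq_f32(vx, ve), vf);           /* + (x+e)*f */
        vst1q_f32(x + m, t);
    }
}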
Results are below for MP-MFLOPSPiNeon, plus some one thread speeds from the other programs. Except when limited by RAM speed, MP gains were quite respectable. The average speed improvement over the RPi 2 is 1.92 times, with an RPi 3 NEON/non-NEON ratio of 2.76. MP-MFLOPS single and double precision speeds were essentially the same, but remember that there are no NEON functions for DP.
Code: Select all
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 419 782 437 1672 1660 1637
2T 1324 1529 442 3331 3308 3212
4T 1903 1574 439 5040 6073 5738
8T 1613 2204 433 5543 5780 5445
Results x 100000
1T 76406 97075 99969 66008 95367 99951
2T 76406 97075 99969 66008 95367 99951
4T 76406 97075 99969 66008 95367 99951
8T 76406 97075 99969 66008 95367 99951
End of test Mon Aug 15 19:09:52 2016
1 Thread
RPi 2 357 451 337 690 688 657
RPi 3 MP-MFLOPS
1T SP 168 182 171 691 693 684
1T DP 143 182 171 678 680 674
The main issue is the maximum speed of 1.67 GFLOPS from one core, 17% of that thought to be possible. Disassembled code of the main calculations is included in Raspberry Pi Multithreading Benchmarks.htm, indicating an insufficient number of instructions in the loop at 2 operations per word. With the higher instruction count, the compiler unrolls the loop to execute 128 calculations, four at a time using quad word registers, but there are 32 unnecessary instructions loading variables that limit maximum performance (not enough registers?).
Running the default non-threaded notOpenMP-MFLOPS indicated a slightly faster speed than above, 1.7 GFLOPS at 8 operations per word. The source code was modified to manually unroll this loop to include 128 arithmetic operations. Experimenting with different data sizes produced a maximum of just over 3 GFLOPS. Here, up to 6.6 GFLOPS might be expected, via 4 way vectors, with 16 add or multiply instructions plus 8 using linked multiply and add or subtract. As shown below, there were also 4 vector loads, 4 vector stores, 4 scalar adds and 3 instructions for loop control.
Code: Select all
L27:
vld1.32 {q11}, [r3]
vld1.32 {q14}, [lr]
vld1.32 {q12}, [r2]
vadd.f32 q15, q11, q1
vld1.32 {q13}, [ip]
vadd.f32 q10, q14, q1
vadd.f32 q8, q12, q1
vmul.f32 q15, q15, q4
vadd.f32 q7, q0, q14
vadd.f32 q9, q13, q1
vst1.f32 {d30-d31}, [sp:64]
vmul.f32 q10, q10, q4
vadd.f32 q15, q0, q12
vmul.f32 q8, q8, q4
vadd.f32 q6, q0, q13
vfma.f32 q10, q7, q2
vfma.f32 q8, q15, q2
vadd.f32 q7, q0, q11
vmul.f32 q9, q9, q4
vld1.64 {d30-d31}, [sp:64]
vfma.f32 q9, q6, q2
vadd.f32 q14, q3, q14
vfma.f32 q15, q7, q2
vadd.f32 q13, q3, q13
vadd.f32 q12, q3, q12
vadd.f32 q11, q3, q11
vfms.f32 q10, q14, q5
vfms.f32 q9, q13, q5
vfms.f32 q8, q12, q5
vfms.f32 q15, q11, q5
add r4, r4, #1
cmp r4, r5
vst1.32 {q10}, [lr]
vst1.32 {q9}, [ip]
add lr, lr, #64
add ip, ip, #64
vst1.32 {q8}, [r2]
vst1.32 {q15}, [r3]
add r2, r2, #64
add r3, r3, #64
bcc .L27
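For anyone trying to reproduce the manual unrolling, the idea is simply to repeat the expression for several elements per loop pass, giving the compiler more independent work per iteration. The sketch below shows the shape of it (hypothetical code, not the modified benchmark source, and unrolled by four rather than the full 128 operations).
Code: Select all
/* four elements' worth of x = (x+a)*b - (x+c)*d + (x+e)*f per pass;
   n is assumed to be a multiple of 4 */
void calc_unrolled(float *x, int n, float a, float b, float c,
                   float d, float e, float f)
{
    int m;
    for (m = 0; m < n; m += 4)
    {
        x[m]   = (x[m]   + a) * b - (x[m]   + c) * d + (x[m]   + e) * f;
        x[m+1] = (x[m+1] + a) * b - (x[m+1] + c) * d + (x[m+1] + e) * f;
        x[m+2] = (x[m+2] + a) * b - (x[m+2] + c) * d + (x[m+2] + e) * f;
        x[m+3] = (x[m+3] + a) * b - (x[m+3] + c) * d + (x[m+3] + e) * f;
    }
}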
Re: Raspberry Pi Benchmarks
Raspberry Pi 3 Stress Tests
I have been running my stress tests on the Raspberry Pi 3. These can comprise four number crunching programs plus one that measures CPU MHz and temperatures, each running in its own terminal window, with typical running times of 15 minutes. Full details and program download links are in:
http://www.roylongbottom.org.uk/Raspber ... 0Tests.htm
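For anyone writing their own monitor, the current temperature and CPU speed can be read from sysfs on Raspbian. A minimal sketch is below; this is not RPiHeatMHz itself, just an illustration of where the numbers come from.
Code: Select all
#include <stdio.h>

/* temperature is reported in millidegrees C, clock speed in kHz */
int main(void)
{
    long milli_c = 0, khz = 0;
    FILE *f;

    f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    if (f) { fscanf(f, "%ld", &milli_c); fclose(f); }

    f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (f) { fscanf(f, "%ld", &khz); fclose(f); }

    printf("%5.1f C %5ld MHz\n", milli_c / 1000.0, khz / 1000);
    return 0;
}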
The tests can also comprise a graphics program, instead of one of the number crunchers. Such an exercise was carried out on the RPi 3, in conjunction with a new OpenGL GLUT benchmark. Details can be found here:
viewtopic.php?p=958209#p958209
The tests demonstrate the apparently well known fact that the RPi 3 Cortex-A53 CPU can overheat and reduce CPU MHz (throttling) to avoid even higher temperatures. I have run the same compiled code on an Android tablet, with a Snapdragon Cortex-A53, and that continued running at full speed. The main reason suggested for this difference is that the RPi 3 Broadcom chip is manufactured on a 40 nm process, the tablet having a cooler Snapdragon implementation with 28 nm lithography.
It should be pointed out that it is unlikely that many people will want to execute such demanding code, using all cores, for extended periods.
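For anyone repeating this, the temperature and current ARM clock can also be checked from a terminal with the firmware's vcgencmd utility, standard on Raspbian, via vcgencmd measure_temp and vcgencmd measure_clock arm.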
The stress tests were originally run using a script file, but this did not work. Instead, the script was copied and pasted to a terminal prompt, for example:
Code: Select all
lxterminal --geometry=80x15 -e ./RPiHeatMHz passes 63, seconds 15
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 11
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 12
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 13
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 14
This test specified a revised floating point stress test, where four cores can execute nearly 12 GFLOPS. The last exercise was to check out different heatsinks. In the following, Black is the latest heatsink from Pi Hut in September 2016, and Copper is the rather swish Enzotech BMR-C1, kindly supplied by Doc Watson in September 2016. The third test is with the system's plastic cover removed. Room temperature was 22°C.
Throttling started at around the reported 80°C, with a maximum reduction of about 34% in CPU MHz and recorded MFLOPS for both heatsinks, and still 21% with the cover removed.
Code: Select all
Revised Benchmark Max MFLOPS > 2900 Per Core - New OS Driver Enabled
Black Heatsink Copper Heatsink Copper No Cover
4 Core 4 Core 4 Core
Minute °C MHz MFLOPS °C MHz MFLOPS °C MHz MFLOPS
0 49.9 1200 41.9 1200 46.2 1200
1 73.6 1200 11699 65.0 1200 11706 67.1 1200 11720
2 81.7 1124 11282 73.6 1200 11709 74.1 1200 11709
3 82.7 977 9489 79.0 1200 11726 79.0 1200 11682
4 82.7 917 8954 81.7 1038 10322 80.6 1118 11059
5 83.8 867 8545 82.2 963 9629 81.7 1048 10296
6 83.8 846 8252 82.7 932 9165 81.7 1015 10073
7 83.8 830 8085 83.8 876 8832 81.7 991 9812
8 83.8 809 7991 83.3 867 8558 81.7 991 9684
9 83.8 816 7860 83.8 842 8318 82.2 963 9556
10 83.8 795 7738 83.8 824 8146 82.7 965 9369
11 84.4 782 7663 83.8 821 8051 82.7 968 9342
12 84.4 787 7625 83.8 813 7966 82.7 953 9241
13 83.8 844 8212 83.8 812 7879 82.2 956 9203
14 83.8 827 8177 84.4 796 7780 82.7 948 9194
15 84.4 830 8133 84.4 794 7710 82.7 949 9109
min 73.6 782 7625 65.0 794 7710 67.1 948 9109
max 84.4 1200 11699 84.4 1200 11726 82.7 1200 11720
Loss
% 34.8 34.8 33.8 34.2 21.0 22.3
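The Loss % row is just (max - min)/max from the columns above: for the black heatsink, (1200 - 782)/1200 gives the 34.8% MHz loss, and (11699 - 7625)/11699 the matching 34.8% MFLOPS loss, with the copper heatsink and no-cover figures calculated the same way.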