Re: Benchmarking a raspberrypi compared to my own PC
Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
I don't believe the RPI3 needs a 64-bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64-bit floating-point registers.
Here is the link http://infocenter.arm.com/help/topic/co ... le_2_0.pdf that got me thinking about a 64bit gcc compiler for the RPI3 to allow access to the 64 bit floating point registers.
What do you all think?
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
IIRC 300k Pi3's made before launch, mostly sold.
Quite a few lucky people I would say!
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
...
What do you all think?
I don't think a 32-bit kernel can load or execute a 64-bit binary. However, being able to cross compile 64-bit binaries on a 32-bit platform may be the first step in creating a 64-bit kernel. It seems for now a 64-bit kernel will have to run headless as the GPU binary blob is 32-bit. That would still be enough for many applications.
Re: Benchmarking a raspberrypi compared to my own PC
Interesting thread. I was wondering where ARMs are today in terms of speed. I usually say 10 years behind x86. ARM chips used to be faster than x86 in the 1990's!
Thought I'd compile and run the primes.
2.3 GHz Core i7 Crystalwell (i7-4850HQ).
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
L4 Cache: 128 MB
Memory: 16 GB
$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
Found a total of 664579 primes!
real 0m1.866s
user 0m1.830s
sys 0m0.005s
Found a total of 664579 primes!
real 0m1.850s
user 0m1.842s
sys 0m0.006s
Found a total of 664579 primes!
real 0m1.875s
user 0m1.868s
sys 0m0.007s
Re: Benchmarking a raspberrypi compared to my own PC
May be all worth adding to the wiki.
Re: Benchmarking a raspberrypi compared to my own PC
Tested on Raspberry Pi 3 and Core i7-6700K, just for fun. For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.
Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)
Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit
So the i7 is 500% faster running this (rather simple) load. I'd expect the difference to be even larger in more taxing and memory intensive loads. Using all eight hardware threads on the i7 would of course extend that lead further. Anyway, always fun to compare, even if it's not very useful in this case.
Last edited by Mikael on Sat Mar 05, 2016 8:04 am, edited 1 time in total.
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Tested on Raspberry Pi 3 and Core i7-6700K, just for fun. For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.
Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)
Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
So the i7 is 500% faster running this (rather simple) load.
Interestingly, the i7 TDP is 95 watts. Total board consumption for the Pi 3 is about 4 watts under multicore benchmarks, so for a 500% gain you're expending 2375% of the power.
Rockets are loud.
https://astropi.org
Re: Benchmarking a raspberrypi compared to my own PC
Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
Found a total of 664579 primes!
real 0m4.400s
user 0m4.389s
sys 0m0.004s
Found a total of 664579 primes!
real 0m4.406s
user 0m4.398s
sys 0m0.000s
Found a total of 664579 primes!
real 0m4.397s
user 0m4.393s
sys 0m0.000s
Almost exactly the same as the Pi3!
Probably need a new desktop, that CPU speed seems a bit low.
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!
Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat.
Pi2B MiniPC/Media Centre: ARM=1GHz (+3), Core=500MHz, v3d=500MHz, h264=333MHz, RAM=DDR21200 (+6/+4/+4+schmoo). Sandisk Ultra HCI 32GB microSD card on '50=100' OCed slot (42MB/s read) running Raspbian/KODI16, Seagate 3.5" 1.5TB HDD mass storage.

Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: I don't believe RPI3 needs a 64 bit OS to access its 1GB of memory or 16GB SD card filesystem, but it would be nice to have access to the 64 bit floating-point registers.
ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Making Smalltalk on ARM since 1986; making your Scratch better since 2012
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!
Do you recall what compiler version and optimization switches you used?
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!
That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency. Probably a decent amount faster still in many loads.
I just tested my old Core 2 Duo T8100 (2.1GHz, 45nm, dual core, Penryn core) with the following results:
Core 2 Duo T8100 (2.1GHz): 1.588 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit
Given that AMD's K10 is in the same class, maybe 10% slower per clock, a more reasonable score for the Athlon II X2 215 @ 1.5GHz would be around the 2.5 second mark.
GTR2Fan wrote: Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat.
As said above, I think something's up with that result. Average real world performance of the K10 core in the Athlon II can be expected to be at least 70% higher than the Cortex-A7 in the Pi 2, clock-for-clock.
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency.
...
There's a lot running on the machine which probably doesn't help, and the 1.5GHz speed seems low since the 215 should be good for 2.5GHz.
Build line:
cc -O3 prime.c -lm
Compiler
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
Re: Benchmarking a raspberrypi compared to my own PC
timrowledge wrote: ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Tim, thank you for pointing this out to me. I wrongly assumed that double-precision floating-point operations were being simulated using single-precision floating-point instructions. I also verified the use of the d registers by dumping out the assembly (gcc -S) of a simple double-precision floating-point program.
This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Tim, thank you for pointing this out to me. I wrongly assumed that double-precision floating-point operations were being simulated using single-precision floating-point instructions.
...
This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.
Perhaps you should be using NEON, which is very fast on the Pi3 (full quad issue I think, compared to dual issue on the Pi2).
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: There's a lot running on the machine which probably doesn't help, and the 1.5Ghz speed seems low since the 215 should be good for 2.5GHz.
The Athlon II X2 215 is a 2.7GHz part. The number reported by /proc/cpuinfo is the current speed of the processor that the frequency governor has set. I have a similar vintage 3.1 GHz CPU and get timings consistent with yours. Under load, the governor is supposed to increase the speed as needed. The following script sets each core's minimum frequency to its maximum, so the governor cannot clock the CPU down:
Code: Select all
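#!/bin/bash
# Raise each core's minimum frequency to its maximum so the governor
# cannot clock down during the benchmark. Must be run as root.
for i in /sys/devices/system/cpu/cpu?/cpufreq
do
    echo "$i"
    cat "$i/cpuinfo_max_freq" > "$i/scaling_min_freq"
done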
Re: Benchmarking a raspberrypi compared to my own PC
Just posted this in another thread, but it's really important for this thread as well:
gcc seems to generate suboptimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":
Core i7-6700K: 1.82 s (compared to 0.60 s in 32-bit)
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
So, pretty much exactly 1/3 the performance. To compile a 32-bit binary in a 64-bit environment, give the -m32 option to gcc when compiling. Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: gcc seems to generate suboptimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":
...
Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...
Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit, as I think the timing differences result from the size of the integers and not the kernel.
It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.
Re: Benchmarking a raspberrypi compared to my own PC
ejolson wrote: Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header.
...
However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.
Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:
Core i5-5300U:
32-bit binary:
uint32: 1.100 s
uint64: 2.192 s
64-bit binary:
uint32: 1.152 s
uint64: 2.772 s
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
ejolson wrote: It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.
64-bit integers should be much slower than 32-bit ones when executed in a 32-bit binary. You'd need to run a 64-bit OS and binary to speed things up. However, the thing I'm not getting here is the strange results on x86 CPUs like the ones above. The results for the 32-bit binary look plausible, I think (i.e. 64-bit calculations are much slower). The results for 32-bit integers in the 64-bit binary also look okay. 64-bit mode has twice as many general purpose registers compared to 32-bit mode, which may speed up some loads. However, it also increases bandwidth requirements. For the result to remain unchanged when going from 32-bit to 64-bit mode is not uncommon.
That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.
Does anyone have any theories?
Re: Benchmarking a raspberrypi compared to my own PC
Result for i7-3770 using 32-bit "Integer" (gcc 5.3).
Found a total of 664579 primes (32bit)
real 0m1.063s
user 0m1.060s
sys 0m0.000s
Found a total of 664579 primes (32bit)
real 0m0.975s
user 0m0.972s
sys 0m0.000s
Found a total of 664579 primes (32bit)
real 0m0.976s
user 0m0.976s
sys 0m0.000s
Mikael wrote: That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.
Does anyone have any theories?
I have found in the past that for larger programs on x86, 64-bit mode gives a modest speed increase, say around 15% or so. I suspect in this case the program is tiny, mostly in a small inner loop, and some minor effect, such as the reduced number of clock cycles to do the 32-bit divide, could be dominant.
"Integer" set to 64 bits:
Code: Select all
movq prime(%rip), %rcx
movslq %r8d, %r8
cmpq %r8, %rcx
ja .L4
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L5
"Integer" set to 32 bits:
Code: Select all
movl prime(%rip), %ecx
cmpl %r8d, %ecx
ja .L4
xorl %edx, %edx
movl %ebx, %eax
divl %ecx
testl %edx, %edx
je .L5
Note the divq for 64 bits and the divl for 32 bits. Also note the extra sign-extend instruction at the start, which must be completed before the cmp. The register move and the xor will be eliminated by the decoder in both modes.
Sadly, we can't (yet) run the Pi3 in aarch64 mode.
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:
...
Does anyone have any theories?
Did you perform each timing using:
Code: Select all
time ./a.out; time ./a.out; time ./a.out
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Does anyone have any theories?
No theories, however, I can now confirm your timings. The 64-bit integers with 32-bit compatible binary surprisingly run about 20% faster than 64-bit integers with 64-bit binary using an i3 550. On the other hand, the situation is reversed for the exact same binaries using an AMD Athlon II X2 255.
Code: Select all
                    32-bit binary   64-bit binary
i3 550              2.058           2.430
Athlon II X2 255    5.301           3.830
Code: Select all
$ gcc -O3 -msse2 -mfpmath=sse prime.c -lm
Code: Select all
$ grep "model name" /proc/cpuinfo | sort -u
model name : Intel(R) Core(TM) i3 CPU 550 @ 3.20GHz
$ file prime64on32
prime64on32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x9a55be5cdecd3251f5407682d64c6bcd079bea26, not stripped
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64bit)
real 0m2.066s
user 0m2.056s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m2.073s
user 0m2.064s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m2.058s
user 0m2.040s
sys 0m0.008s
$ file prime64on64
prime64on64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x6837b7d0137cbd9448c7976b1e49b98759215b5e, not stripped
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64bit)
real 0m2.495s
user 0m2.484s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m2.430s
user 0m2.420s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m2.434s
user 0m2.424s
sys 0m0.000s

$ grep "model name" /proc/cpuinfo | sort -u
model name : AMD Athlon(tm) II X2 255 Processor
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64bit)
real 0m5.317s
user 0m5.304s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m5.308s
user 0m5.296s
sys 0m0.004s
Found a total of 664579 primes (64bit)
real 0m5.301s
user 0m5.292s
sys 0m0.000s
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64bit)
real 0m3.835s
user 0m3.828s
sys 0m0.000s
Found a total of 664579 primes (64bit)
real 0m3.830s
user 0m3.824s
sys 0m0.004s
Found a total of 664579 primes (64bit)
real 0m3.836s
user 0m3.828s
sys 0m0.000s
Re: Benchmarking a raspberrypi compared to my own PC
Amazed at how speedy the i7-4850HQ is, not bad for a laptop. Wonder if the eDRAM has any effect?
ejolson wrote: Updated to distinguish timings using 32-bit integers from 64-bit integers.
Code: Select all
CPU                       32-bit(s)  64-bit(s)  Compiler
----------------------------------------------------------------
i7-6700K 4.0GHz               0.600      1.820  gcc-5.2.1 -O3
i7-4850HQ 2.3GHz                         1.850  LLVM-7.0.2 -O3
i7-3770k 4GHz                 0.975      2.146  gcc-5.2 -O3
i5-3570k 3.4GHz                          2.497  gcc-4.8.4 -O3
Xeon E5-2620v3 2.4GHz         1.135      2.545  gcc-4.7.2 -O3
Xeon E5-2650v2 2.6GHz         1.155      2.592  gcc-4.4.7 -O3
AMD A6-5400K 3.6GHz           2.023      2.095  gcc-4.7.2 -O3
Opteron 6212 2.6GHz           3.407      3.421  gcc-5.1.0 -O3
Phenom II X4 3.4GHz           3.473      3.479  gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz           3.611             gcc-4.9.2 -O3 \
                                                  -mcpu=cortex-a53
Pentium 4 3.4Ghz              3.759      5.181  gcc-5.2.1 -O3
AthlonII X2 255 3.1GHz        3.828      3.836  gcc-4.7.2 -O3
Athlon64 X2 5400+ 2.8GHz      4.601      7.893  gcc-4.6.3 -O3
Pentium 4D CPU 2.80GHz        4.612      6.271  gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz            7.187     74.790  gcc-5.2 -O3 \
                                                  -mcpu=cortex-a7
Pentium III 866MHz           14.999     20.169  gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz          17.670             gcc-4.9.2 -O3
Pentium III 650MHz           19.891     26.735  gcc-4.7.2 -O3
AMD-K6 3D 350MHz             26.726     45.469  gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz           27.741     74.987  gcc-4.6.3 -O3
ARMv6 Pi B+ 700MHz           74.027             gcc-4.6.3 -O3
i586 Pentium 75MHz          155.804    303.710  gcc-2.7.2.3 -O3
i486 DX/2 66MHz             282.180    919.130  gcc-2.6.3 -O3
Re: Benchmarking a raspberrypi compared to my own PC
Here is another data point, based on the simple prime-finding program posted above, for an ARMv8 processor running in 64-bit mode. I ran the program on a single board computer called the NanoPi T3, which uses the same clock speed as the Raspberry Pi 3B+ but a slightly slower memory speed. The results are listed after these remarks.
This places the 1400 MHz ARMv8 processor running in 64-bit mode right above the Opteron 6212 in the previous table when computing with either 32-bit or 64-bit integers. At the moment my Pi 3B+ is serving as a WiFi access point and unavailable for 64-bit testing, but I would expect it to run even faster because of the faster memory speed. If anyone has a 3B+ running a 64-bit operating system and wants to verify this, that would be greatly appreciated.
Code: Select all
$ gcc -o prime32 -O3 prime32.c -lm
$ gcc -o prime64 -O3 prime64.c -lm
$ time ./prime32; time ./prime32; time ./prime32
Found a total of 664579 primes (32bit)
real 0m2.902s
user 0m2.896s
sys 0m0.008s
Found a total of 664579 primes (32bit)
real 0m2.876s
user 0m2.864s
sys 0m0.012s
Found a total of 664579 primes (32bit)
real 0m2.878s
user 0m2.872s
sys 0m0.004s
$ time ./prime64; time ./prime64; time ./prime64
Found a total of 664579 primes (64bit)
real 0m3.114s
user 0m3.108s
sys 0m0.004s
Found a total of 664579 primes (64bit)
real 0m3.088s
user 0m3.084s
sys 0m0.004s
Found a total of 664579 primes (64bit)
real 0m3.091s
user 0m3.088s
sys 0m0.004s
$ gcc --version
gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It is interesting to note that a Pi 3B running at 1200 MHz in 32-bit mode turns in a runtime of 3.611s when using 32-bit integers, while the test above indicates a runtime using 32-bit integers of only 2.878s. We may attribute 16 percent of this performance increase to the faster clock speed; however, the actual increase is 25 percent. Therefore, the remaining 9 percent is likely due to better code optimization, possibly resulting from the richer set of available registers when running in 64-bit mode as opposed to 32-bit mode.
Re: Benchmarking a raspberrypi compared to my own PC
I've just added these results for the Raspberry Pi 3B+ running in 32-bit mode to the table. In particular, the run times indicate that the Pi 3B+ is about 8 percent faster on this benchmark than the Pi 3B using 32-bit integers. Note that the reported timings with 64-bit integers, though significantly slower, seem quite good for having been run in 32-bit mode. This reflects a dramatic improvement in compiler optimization using 64-bit integer NEON instructions since the original tests for the 3B were made. It would be interesting to revisit the 3B results.