64 Bit Raspberry Pi 4B Multithreading Benchmarks
Many of these benchmarks run using 1, 2, 4 and 8 threads, with others executing programs on all available cores via OpenMP.
MP-Whetstone Benchmark
Multiple threads each run the eight test functions at the same time, each with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions are used to avoid update conflicts, by allowing only one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, so the simpler instructions are used. Overall seconds indicates MP efficiency: each thread carries out the same work, so a near constant running time, as threads are added, shows that the extra cores are fully used.
As with the single core version, the average Pi 4B performance gain over the Pi 3B+ was just over 2 times. The Pi 4B 64 bit and 32 bit speeds were closer to each other, this time with the 32 bit version being somewhat faster on some floating point calculations.
          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Threads              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

Gentoo Pi 3B+ 64 Bits
      1    1152    383    383    328   23.2   13.0    N/A   2721   1365
      2    2312    767    767    657   46.5   26.0    N/A   5461   2738
      4    4580   1506   1526   1304   92.0   51.6    N/A  10777   5449
      8    4788   1815   1961   1382   95.0   53.3    N/A  13827   5811
Overall Seconds 4.96 1T, 4.95 2T, 5.05 4T, 10.07 8T

Gentoo Pi 4B 64 Bits
      1    2395    536    538    397   60.8   39.0    N/A   4483    997
      2    4784   1062   1079    794  121.2   77.9    N/A   8932   1990
      4    9476   2125   2080   1568  240.8  155.3    N/A  17718   3962
      8    9834   2631   2744   1630  243.6  160.1    N/A  22265   4053
Overall Seconds 4.99 1T, 5.01 2T, 5.12 4T, 10.17 8T

Pi 4B/3B+ 64 Bits
      1    2.08   1.40   1.41   1.21   2.62   3.00    N/A   1.65   0.73
      2    2.07   1.39   1.41   1.21   2.61   3.00    N/A   1.64   0.73
      4    2.07   1.41   1.36   1.20   2.62   3.01    N/A   1.64   0.73
      8    2.05   1.45   1.40   1.18   2.56   3.00    N/A   1.61   0.70

Raspbian Pi 4B 32 Bits
      1    2059    673    680    311   55.6   33.1   7462   2245    995
      2    4117   1342   1391    624  110.7   65.9  14887   4467   1986
      4    7910   2652   2722   1180  208.5  132.6  29291   8952   3832
      8    8652   3057   2971   1268  233.2  149.6  38368  11923   3942
Overall Seconds 4.99 1T, 5.01 2T, 5.29 4T, 10.71 8T

Pi 4B 64 bits/32 bits
      1    1.16   0.80   0.79   1.28   1.09   1.18    N/A   2.00   1.00
      2    1.16   0.79   0.78   1.27   1.09   1.18    N/A   2.00   1.00
      4    1.20   0.80   0.76   1.33   1.15   1.17    N/A   1.98   1.03
      8    1.14   0.86   0.92   1.28   1.04   1.07    N/A   1.87   1.03
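The threading arrangement described for MP-Whetstone can be sketched as follows. This is a minimal illustration, not the benchmark source: the names timed_run, worker and shared_total are hypothetical, and the arithmetic is a stand-in for the eight test functions. Each thread works on its own variable, a mutex guards the one shared value, and the elapsed time runs until the last thread has been joined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 8

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static double shared_total;             /* common data, guarded by total_lock */

typedef struct { long passes; } job_t;

static void *worker(void *arg)
{
    const job_t *job = arg;
    double x = 1.0;                      /* dedicated to this thread */
    for (long i = 0; i < job->passes; i++)
        x = (x + 1.0) * 0.5;             /* stand-in calculation, stays at 1.0 */
    pthread_mutex_lock(&total_lock);     /* one thread at a time updates the total */
    shared_total += x;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Runs the same work on each thread and returns elapsed seconds,
   measured up to the point where the last thread has finished. */
double timed_run(int threads, long passes, double *total_out)
{
    pthread_t tid[MAX_THREADS];
    job_t job = { passes };
    struct timespec t0, t1;

    assert(threads >= 1 && threads <= MAX_THREADS);
    shared_total = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < threads; t++)
        pthread_create(&tid[t], NULL, worker, &job);
    for (int t = 0; t < threads; t++)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *total_out = shared_total;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling timed_run with 1, 2, 4 and 8 threads and the same passes value gives the kind of Overall Seconds comparison shown above, where the time should stay roughly constant until there are more threads than cores.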
MP Dhrystone Benchmark
This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, with little gain from using multiple cores.
The single thread speeds were similar to the earlier Dhrystone results, with Pi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that from the single core tests.
MP-Dhrystone Benchmark armv8 64 Bit Fri Aug 23 00:44:05 2019
Using 1, 2, 4 and 8 Threads

Threads                           1         2         4         8
Seconds                        0.54      0.67      1.23      2.46
Dhrystones per Second       7391586  11954301  11300304  13028539
VAX MIPS Pi 3B+ 64 bits        4207      6804      7401      7415

VAX MIPS Pi 4B 64 bits         8880      7828      8303      8314
Pi 4B/3B+ 64 bits              2.11      1.15      1.12      1.12

VAX MIPS Pi 4B 32 bits         5539      5739      6735      7232
Pi 4B 64 bits/32 bits          1.60      1.36      1.23      1.15
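For reference, the VAX MIPS rating is conventionally Dhrystones per second divided by 1757, the score of the VAX 11/780 reference machine. The sketch below assumes that convention applies here; the function names are hypothetical. The one thread figure in the log above, 7391586, works out at about 4207, which agrees with the Pi 3B+ entry.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780 reference machine (rated 1 MIPS). */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Converts a Dhrystones per second score to a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}

/* Ratio used in the comparison rows, for example Pi 4B over Pi 3B+. */
double speed_ratio(double faster, double slower)
{
    assert(slower > 0.0);
    return faster / slower;
}
```

As a usage example, speed_ratio(8880, 4207) gives the 2.11 shown for one thread in the Pi 4B/3B+ row.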
MP Linpack Benchmark (Single Precision NEON)
This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be the updating of data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.
This benchmark uses the same NEON Intrinsic Functions as the single core program. Without the threading overheads, speeds at N = 100 were similar, but they decreased with the larger data sizes, which involve RAM accesses.
The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.
Gentoo Pi 3B+ 64 Bits
Linpack Single Precision MultiThreaded Benchmark
64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 642.56 66.69 66.05 65.54
N 500 479.48 274.36 274.85 269.07
N 1000 363.77 316.17 310.37 316.71
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 1.97 5.40 13.51
RE 4.69621336e-05 6.44138840e-04 3.22485110e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -1.31130219e-05 5.79357147e-05 -3.08930874e-04
XN -1.30534172e-05 3.51667404e-05 1.90019608e-04
Thread
0 - 4 Same Results Same Results Same Results
Gentoo Pi 4B 64 Bits
N 100 2252.70 97.25 97.43 97.41
N 500 1628.24 665.21 646.63 674.38
N 1000 399.87 406.80 405.84 399.54
Pi 4B/3B+ 64 Bits
N 100 3.51 1.46 1.48 1.49
N 500 3.40 2.42 2.35 2.51
N 1000 1.10 1.29 1.31 1.26
Raspbian Pi 4B 32 Bits
N 100 1921.53 108.66 101.88 102.46
N 500 1548.81 530.23 714.37 733.09
N 1000 399.94 378.11 364.78 398.21
Pi 4B 64 bits/32 bits
N 100 1.17 0.89 0.96 0.95
N 500 1.05 1.25 0.91 0.92
N 1000 1.00 1.08 1.11 1.00
32 bit numeric results
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
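To illustrate where the thread start/stop overheads come from, here is a minimal sketch of a daxpy style loop (dy = dy + da * dx) split across threads. It assumes that threads are created and joined on every call, which is only a guess at the benchmark's arrangement, and the names axpy_threaded, axpy_slice and slice_t are hypothetical. At N = 100 each call handles at most 100 words, so the create and join costs can easily exceed the arithmetic.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 4

typedef struct {
    size_t start, end;      /* half open range of elements for this thread */
    float da;
    const float *dx;
    float *dy;
} slice_t;

static void *axpy_slice(void *arg)
{
    const slice_t *s = arg;
    for (size_t i = s->start; i < s->end; i++)
        s->dy[i] += s->da * s->dx[i];    /* one multiply and one add per word */
    return NULL;
}

/* Splits the vector into equal slices, one per thread. Threads are
   started and stopped on every call (an assumption for illustration). */
void axpy_threaded(size_t n, float da, const float *dx, float *dy, int threads)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];

    assert(threads >= 1 && threads <= MAX_THREADS);
    for (int t = 0; t < threads; t++) {
        slice[t].start = n * (size_t)t / (size_t)threads;
        slice[t].end   = n * (size_t)(t + 1) / (size_t)threads;
        slice[t].da = da;
        slice[t].dx = dx;
        slice[t].dy = dy;
        pthread_create(&tid[t], NULL, axpy_slice, &slice[t]);
    }
    for (int t = 0; t < threads; t++)
        pthread_join(tid[t], NULL);
}
```

The slices do not overlap, so no locking is needed and the numeric results do not depend on the number of threads, in line with the Same Results entries in the log.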
MP BusSpeed (read only) Benchmark
In this version, each thread accesses all of the data, in separate sections covering caches and RAM, with each thread starting at a different point. See the single processor BusSpeed details regarding burst reading, which can indicate significant differences.
Comparisons are provided for RdAll, at 1, 2, 4 and 8 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads, on the Pi 4B, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.
Particularly note that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit code employed scalar operation, using 32 bit w registers, with the 32 bit code benefiting from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below. Alternatives were also tried on the Pi 4B at 64 bits, but failed to implement SIMD operation.
Gentoo Pi 3B+ 64 Bits
MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Threads Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3138 2822 3044 2383 1708 1737
2T 5354 4865 5647 4519 3303 3362
4T 7922 7504 9717 6794 6216 6597
8T 5125 4159 6987 6696 5350 5195
122.9 1T 640 666 1191 1864 1627 1712
2T 1008 1018 1926 3496 3268 3387
4T 962 1042 2157 4259 6427 4372
8T 1031 1047 2147 3952 6317 6514
12288 1T 124 114 260 527 1016 1363
2T 137 138 275 487 946 2182
4T 105 118 240 409 975 2158
8T 108 117 236 504 1077 2051
RdAll
Gentoo Pi 4B 64 Bits Pi 4B/3B+
12.3 1T 4864 4879 5378 4379 4115 4221 2.43
2T 8159 6924 9179 8006 7689 7837 2.33
4T 12677 11531 14850 12554 13807 14794 2.24
8T 7398 6927 10881 11675 11497 13075 2.52
122.9 1T 665 926 1869 2714 3557 4152 2.43
2T 610 696 1549 4898 7188 8184 2.42
4T 476 865 1885 4107 8058 14617 3.34
8T 474 883 1848 3919 7939 13633 2.09
12288 1T 202 210 514 1044 2033 3616 2.65
2T 258 425 853 1551 3693 6228 2.85
4T 217 346 497 1024 2181 3789 1.76
8T 220 275 540 1030 1937 3577 1.74
Raspbian Pi 4B 32 Bits 64b/32b
12.3 1T 5263 5637 5809 5894 5936 13445 0.31
2T 9412 10020 10567 11454 11604 24980 0.31
4T 16282 15577 16418 21222 20000 45530 0.32
8T 11600 13285 16070 18579 20593 36837 0.35
122.9 1T 739 956 1888 3153 5008 9527 0.44
2T 629 1158 1568 5058 9509 16489 0.50
4T 600 1093 2134 4527 8732 16816 0.87
8T 593 1104 2121 4382 8629 17158 0.79
12288 1T 238 258 518 1005 2001 4029 0.90
2T 278 228 453 1690 1826 3628 1.72
4T 269 257 740 1019 1790 4145 0.91
8T 233 292 532 926 2186 3581 1.00
MP BusSpeed Disassembly
Source Code 64 AND instructions in main loop
for (i=start; i<end; i=i+64)
{
andsum1[t] = andsum1[t]
& array[i ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
& array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
To
& array[i+56] & array[i+57] & array[i+58] & array[i+59]
& array[i+60] & array[i+61] & array[i+62] & array[i+63];
}
Pi 32 Bit Raspbian Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
-mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7
Pi 64 Bit Gentoo Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -march=armv8-a -o MP-BusSpd2Pi64
Parameters also tried
-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe
-fomit-frame-pointer
Pi 32 Bit Disassembly           Pi 64 Bit Disassembly

vld1.32 {q6}, [lr]              ldp w30, w17, [x0, 52]
vld1.32 {q7}, [r6]              and w18, w18, w30
vand q10, q10, q6               and w1, w1, w18
vld1.32 {q6}, [r0]              ldp w18, w30, [x0, 60]
vand q9, q9, q7                 and w17, w17, w18
vand q12, q12, q6               and w1, w1, w17
vld1.32 {q7}, [ip]              ldp w17, w18, [x0, 68]
vld1.32 {q6}, [r7]              and w30, w30, w17
add r1, r3, #96                 and w1, w1, w30
add r6, r3, #144                ldp w30, w17, [x0, 76]
vand q11, q11, q7               and w18, w18, w30
vand q14, q14, q6               and w1, w1, w18
vld1.32 {q7}, [r1]              ldp w18, w30, [x0, 84]
vld1.32 {q6}, [r6]              and w17, w17, w18
MP RandMem Benchmark
This benchmark potentially reads and writes all data, in sections covering caches and RAM, with each thread starting at a different address. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.
The Pi 4B provided variable gains over the Pi 3B+ at 64 bits, but smaller gains, on the Pi 4B, from 64 bits over 32 bits.
MP-RandMem armv8 64 Bit Aug 2019 Using 1, 2, 4 and 8 Threads
Serial Serial Random Random Serial Serial Random Random
KB+Thread Read RdWr Read RdWr Read RdWr Read RdWr
Gentoo Pi 4B 64 Bits
12.3 1T 5922 7871 5892 7857
2T 11856 7882 11902 7923
4T 22964 7821 22276 7832
8T 23225 7751 22082 7717
122.9 1T 5827 7276 2052 1921
2T 10965 7258 1754 1924
4T 10969 7232 1848 1929
8T 10896 7158 1834 1909
12288 1T 3879 1052 188 170
2T 4848 935 218 168
4T 4684 943 332 170
8T 3982 1049 340 171
Gentoo Pi 3B+ 64 Bits Raspbian Pi 4B 32 Bits
12.3 1T 4901 3587 4912 3585 5860 7905 5927 7657
2T 8749 3564 8719 3556 11747 7908 11182 7746
4T 17108 3504 17160 3505 21416 7626 17382 7731
8T 16885 3475 16650 3485 20649 7528 20431 7378
122.9 1T 3921 3339 1010 974 5479 7269 1826 1923
2T 7360 3350 1814 972 10355 6964 1667 1920
4T 12199 3313 2281 969 9808 7177 1715 1908
8T 12089 3313 2279 968 11677 7058 1697 1919
12288 1T 2024 828 83 67 3438 1271 179 152
2T 2169 820 142 67 4176 1204 213 167
4T 2178 818 154 67 4227 1117 337 161
8T 2219 821 161 67 3479 1093 287 168
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
12.3 4T 1.34 2.23 1.30 2.23 1.07 1.03 1.28 1.01
122.9 4T 0.90 2.18 0.81 1.99 1.12 1.01 1.08 1.01
12288 4T 2.15 1.15 2.16 2.54 1.11 0.84 0.99 1.06
MP-MFLOPS Benchmarks
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word, of the form x = (x+a)*b-(x+c)*d+(x+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, which should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
There can be wide variations in speeds, affected by the short running times and by factors such as variations in cached data. In order to help in interpreting results, comparisons are provided of results using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.
The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit version compile included -mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have another version with compile including -funsafe-math-optimizations, that compiles NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.
The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.
The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, with single word d registers being compiled. On the other hand, performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.
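The structure of the calculations can be sketched as follows, using only the three terms of the formula quoted above (eight operations per word), since the remaining terms are not given here. The segment split, the form of the sumcheck (a simple mean) and the names calc_segment, calc_in_segments and sumcheck are assumptions for illustration. In the benchmark each segment would be handled by its own thread; here the segments are processed one after another to show that the results do not depend on how the data is divided.

```c
#include <assert.h>
#include <stddef.h>

/* Applies x = (x+a)*b - (x+c)*d + (x+e)*f to one segment of the data.
   k[0..5] hold a, b, c, d, e and f. */
void calc_segment(float *x, size_t start, size_t end, const float k[6])
{
    for (size_t i = start; i < end; i++) {
        float v = x[i];
        x[i] = (v + k[0]) * k[1] - (v + k[2]) * k[3] + (v + k[4]) * k[5];
    }
}

/* Divides the data into equal segments, as used with one segment per thread. */
void calc_in_segments(float *x, size_t words, int segments, const float k[6])
{
    assert(segments > 0);
    for (int s = 0; s < segments; s++) {
        size_t start = words * (size_t)s / (size_t)segments;
        size_t end   = words * (size_t)(s + 1) / (size_t)segments;
        calc_segment(x, start, end, k);
    }
}

/* Hypothetical sumcheck: the mean of all results. */
double sumcheck(const float *x, size_t words)
{
    double sum = 0.0;
    assert(words > 0);
    for (size_t i = 0; i < words; i++)
        sum += x[i];
    return sum / (double)words;
}
```

Each word is updated independently of the others, so one segment or four should produce identical values and an identical sumcheck.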
Single Precision
MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 2908 2854 459 5778 5734 5405
2T 5700 5311 457 10935 11212 7968
4T 10375 5588 490 18181 21842 7637
8T 9675 8460 511 20128 20567 8568
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 792 806 373 1780 1783 1724 987 993 606 2816 2794 2804
2T 1482 1596 382 3542 3509 3380 1823 1837 567 5610 5541 5497
4T 2861 2742 429 5849 7013 5465 2119 3349 647 9884 10702 9081
8T 2770 2877 429 6434 6700 6101 3136 3783 609 10230 10504 9240
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.67 3.54 1.23 3.25 3.22 3.14 2.95 2.87 0.76 2.05 2.05 1.93
4T 3.63 2.04 1.14 3.11 3.11 1.40 4.90 1.67 0.76 1.84 2.04 0.84
Double Precision
MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 1464 1386 225 3398 3386 3182
2T 2837 2792 228 6720 6741 4547
4T 5172 3414 251 10405 12762 4763
8T 4774 4353 275 11506 12118 4865
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 415 386 206 1400 1403 1333 1187 1220 309 2682 2714 2701
2T 820 813 209 2804 2767 2597 2420 2416 282 5379 5415 4780
4T 1328 1323 212 5433 5340 2465 4665 2381 317 10256 10336 5242
8T 1343 1308 214 5090 5006 3280 4385 3114 310 9721 10340 5131
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.53 3.59 1.09 2.43 2.41 2.39 1.23 1.14 0.73 1.27 1.25 1.18
4T 3.89 2.58 1.18 1.92 2.39 1.93 1.11 1.43 0.79 1.01 1.23 0.91
NEON Single Precision
MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 3311 3192 535 6442 6548 6198
2T 4607 6186 552 13030 13012 8468
4T 6279 5725 562 23798 24128 9374
8T 7815 12044 486 22725 21712 9395
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 830 823 406 2989 2986 2792 2491 2399 615 4325 4285 4261
2T 1575 1498 414 5981 5872 5445 5629 5520 591 8602 8463 8308
4T 2217 2650 431 11661 11644 6061 10580 5594 553 16991 16493 9124
8T 2733 3197 437 10505 10637 6708 7047 10785 513 14325 16219 8867
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.99 3.88 1.32 2.16 2.19 2.22 1.33 1.33 0.87 1.49 1.53 1.45
4T 2.83 2.16 1.30 2.04 2.07 1.55 0.59 1.02 1.02 1.40 1.46 1.03
MP-MFLOPS Disassembly
On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four results of calculations might be expected per clock cycle, or 6 GFLOPS per core and up to 24 GFLOPS using all four cores. Then instructions such as fused multiply and add, which carry out two operations, could double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
SP NEON 24.1 GFLOPS - 6.55 1 core   DP 12.7 GFLOPS - 3.39 1 core

.L41:                               .L84:
ldr q1, [x1]                        ldr q16, [x2, x0]
ldr q0, [sp, 64]                    add w3, w3, 1
fadd v18.4s, v20.4s, v1.4s          cmp w3, w6
fadd v17.4s, v22.4s, v1.4s          fadd v15.2d, v16.2d, v14.2d
fadd v0.4s, v0.4s, v1.4s            fadd v17.2d, v16.2d, v12.2d
fadd v16.4s, v24.4s, v1.4s          fmul v15.2d, v15.2d, v13.2d
fadd v7.4s, v26.4s, v1.4s           fmls v15.2d, v17.2d, v11.2d
fadd v6.4s, v28.4s, v1.4s           fadd v17.2d, v16.2d, v10.2d
fadd v5.4s, v30.4s, v1.4s           fmla v15.2d, v17.2d, v9.2d
fmul v0.4s, v0.4s, v19.4s           fadd v17.2d, v16.2d, v8.2d
fadd v4.4s, v10.4s, v1.4s           fmls v15.2d, v17.2d, v31.2d
fadd v3.4s, v12.4s, v1.4s           fadd v17.2d, v16.2d, v30.2d
fadd v2.4s, v14.4s, v1.4s           fmla v15.2d, v17.2d, v29.2d
fadd v1.4s, v8.4s, v1.4s            fadd v17.2d, v16.2d, v28.2d
fmls v0.4s, v21.4s, v18.4s          fmls v15.2d, v17.2d, v0.2d
fmla v0.4s, v23.4s, v17.4s          fadd v17.2d, v16.2d, v27.2d
fmls v0.4s, v25.4s, v16.4s          fmla v15.2d, v17.2d, v26.2d
fmla v0.4s, v27.4s, v7.4s           fadd v17.2d, v16.2d, v25.2d
fmls v0.4s, v29.4s, v6.4s           fmls v15.2d, v17.2d, v24.2d
fmla v0.4s, v31.4s, v5.4s           fadd v17.2d, v16.2d, v23.2d
fmls v0.4s, v9.4s, v1.4s            fmla v15.2d, v17.2d, v22.2d
fmla v0.4s, v4.4s, v11.4s           fadd v17.2d, v16.2d, v21.2d
fmls v0.4s, v3.4s, v13.4s           fadd v16.2d, v16.2d, v19.2d
fmla v0.4s, v2.4s, v15.4s           fmls v15.2d, v17.2d, v20.2d
str q0, [x1], 16                    fmla v15.2d, v16.2d, v18.2d
cmp x1, x0                          str q15, [x2, x0]
bne .L41                            add x0, x0, 16
                                    bcc .L84
                  32 bit    64 bit    32 bit    64 bit    32 bit    64 bit
                  SP        SP        DP        DP        NEON SP   NEON SP

Maximum GFLOPS    10.7      21.8      10.3      12.7      17.0      24.1

Instructions
 Total            27        39        26        27        67        27
 Floating point   22        32        22        32        32        22

FP operations
 Total            32        128       32        64        128       128
 Add or subtract  11        44        11        22        21        44
 Multiply         1         4         1         2         11        4
 Fused            20        80        20        40        0         80

Add example       fadds     fadd      faddd     fadd      vadd.f32  fadd
                  s16,      v15.4s,   d25,      v15.2d,   q9,       v1.4s,
                  s23,      v16.4s,   d17,      v16.2d,   q8,       v8.4s,
                  s2        v15.4s    d15       v14.2d    q14       v1.4s

Multiply example  fnmuls    fmul      fmuld     fmul      vmul.f32  fmul
                  s16,      v15.4s,   d16,      v15.2d,   q9,       v0.4s,
                  s3,       v15.4s,   d16,      v15.2d,   q9,       v0.4s,
                  s16       v17.4s    d5        v13.2d    q12       v19.4s

Fused example     vfma.f32  fmla      vfma.f64  fmla      N/A       fmla
                  s16,      v15.4s,   d16,      v15.2d,             v0.4s,
                  s29,      v17.4s,   d22,      v17.2d,             v4.4s,
                  s9        v0.4s     d28       v22.2d              v11.4s

FP registers used 32        4         32        25        16        32
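The expected speed arithmetic in the disassembly notes can be written out as a short sketch. It assumes a 1.5 GHz clock, which is what 6 GFLOPS at four results per cycle implies, and the function names are hypothetical. The instruction mix is taken from the SP NEON loop at .L41, with 11 fadd and 1 fmul instructions (one operation each) and 10 fmla or fmls instructions (two operations each).

```c
#include <assert.h>

/* Peak GFLOPS per core: one SIMD instruction per clock cycle, with the
   given number of words per register and operations per word. */
double peak_gflops(double clock_ghz, int words_per_register, int ops_per_word)
{
    assert(clock_ghz > 0.0 && words_per_register > 0 && ops_per_word > 0);
    return clock_ghz * words_per_register * ops_per_word;
}

/* Fraction of the all-fused peak reached by a mix of single operation
   instructions and fused (two operation) instructions. */
double mix_fraction(int single_op_instructions, int fused_instructions)
{
    int instructions = single_op_instructions + fused_instructions;
    int operations = single_op_instructions + 2 * fused_instructions;
    assert(instructions > 0);
    return (double)operations / (2.0 * instructions);
}
```

With these, peak_gflops(1.5, 4, 1) gives 6 and peak_gflops(1.5, 4, 2) gives 12, while mix_fraction(12, 10) comes to about 0.73, in line with the rough 70% estimate. For double precision, words_per_register is 2, halving the figures.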
MP-MFLOPS Sumchecks
Different instructions, such as those for SP and DP, may not produce identical numeric results. Variations also depend on the number of passes. Here, results were close to 1.0 as data size increased. The only anomaly is the one marked -X below.
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66015 95363 99951
DP
4B/64 1T 76384 97072 99969 66065 95370 99951
3B/64 1T 76384 97072 99969 66065 95370 99951
4B/32 1T 76384 97072 99969 66065 95370 99951
NEON Bit SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66014-X 95363 99951
OpenMP-MFLOPS Benchmark
This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, includes calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA.
https://www.webarchive.org.uk/wayback/a ... hmarks.htm
The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems led to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.
OpenMP MFLOPS64 Thu Aug 22 19:54:59 2019
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.092836 5386 0.929538 Yes
Data in & out 1000000 2 250 0.887743 563 0.992550 Yes
Data in & out 10000000 2 25 0.917173 545 0.999250 Yes
Data in & out 100000 8 2500 0.129858 15401 0.957117 Yes
Data in & out 1000000 8 250 0.899561 2223 0.995518 Yes
Data in & out 10000000 8 25 0.847036 2361 0.999549 Yes
Data in & out 100000 32 2500 0.391602 20429 0.890215 Yes
Data in & out 1000000 32 250 0.989877 8082 0.988088 Yes
Data in & out 10000000 32 25 0.944493 8470 0.998796 Yes
End of test Thu Aug 22 19:55:05 2019*
--------------- MFLOPS -------------- -------- Compare --------
Mbytes/ Pi 3B+ Pi 4B Pi 4B Pi 4B
Threads 64b 64b 32b 4b/3b 64/32b
All 1CP All 1CP All 1CP All 1CP All 1CP
0.4/2 2674 755 5386 2780 4716 2850 2.01 3.68 1.14 0.98
4/2 411 404 563 557 556 429 1.37 1.38 1.01 1.30
40/2 419 408 545 588 544 632 1.30 1.44 1.00 0.93
0.4/8 7029 1886 15401 5555 7981 5191 2.19 2.95 1.93 1.07
4/8 1656 1495 2223 2116 2389 2082 1.34 1.42 0.93 1.02
40/8 1725 1507 2361 2310 2199 2003 1.37 1.53 1.07 1.15
0.4/32 6648 1699 20429 5647 8147 5449 3.07 3.32 2.51 1.04
4/32 5977 1616 8082 5445 7951 5385 1.35 3.37 1.02 1.01
40/32 6027 1616 8470 5479 8030 5379 1.41 3.39 1.05 1.02
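As a rough sketch of the OpenMP arrangement, the loop below carries a single directive. Built with gcc -fopenmp the iterations are shared between the cores, and built without that option the pragma is ignored, giving a single core program from the same code. The exact form of the two operations per word calculation is an assumption, and the names add_multiply and mflops are hypothetical. The second function shows how the MFLOPS column follows from the other columns in the log.

```c
#include <assert.h>

/* Two operations per word: one add and one multiply (assumed form). */
void add_multiply(float *x, long words, float a, float b)
{
    assert(words >= 0);
#pragma omp parallel for
    for (long i = 0; i < words; i++)
        x[i] = (x[i] + a) * b;
}

/* MFLOPS from words, operations per word, repeat passes and seconds. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first line of the log, mflops(100000, 2, 2500, 0.092836) comes to about 5386, matching the reported figure.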
Next More Multi threading Benchmarks