RoyLongbottom
Posts: 436
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Thu Jun 18, 2020 2:57 pm

I have compiled and run my benchmarks via the 64 bit Beta Raspberry Pi OS, with tests including variations to exercise the Pi 4B with 8 GB RAM. Full results, with comparisons to 32 bit working, are included in my report at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

Both 32 bit and 64 bit programs were mainly produced with the supplied gcc 8 compilers. The 64 bit benchmarks and source codes are available from ResearchGate in a tar.xz file for anyone to play with:

https://www.researchgate.net/profile/Ro ... UpdatesLog

For my regular benchmarks, I am providing the results without detailed descriptions. More details can be found in the above report and in my previous posts at:

viewtopic.php?f=31&t=44080&hilit=longbottom#p350968

This first new post covers single core CPU benchmarks, all providing 64 bit performance gains.

Whetstone Benchmark - Mainly floating point operation with overall rating in MWIPS

Code: Select all

 System    MHz   MWIPS  ------MFLOPS------   -------------MOPS---------------
                          1      2      3     COS    EXP  FIXPT     IF  EQUAL

 32 bit   1500   1883    522    471    313   54.9   26.4   2496   3178    998
 64 bit   1500   2085    524    535    398   57.6   27.3   2493   2979    997

 64/32 bit       1.11   1.00   1.14   1.27   1.05   1.03   1.00   0.94   1.00
Dhrystone Benchmark - Integer operation, rating in VAX MIPS aka DMIPS

Code: Select all

                            DMIPS
 System     MHz    DMIPS     /MHz

 32 bit    1500     5077     3.38
 64 bit    1500     7814     5.21

 64/32 bit          1.54         
Linpack Benchmarks - Floating point with MFLOPS rating, Double and Single Precision floating point

Code: Select all

                                    NEON
 System     MHz      DP      SP      SP

 32 bit    1500   957.1  1068.8  1819.9
 64 bit    1500  1111.5  1938.2  2030.9

 64/32 bit         1.16    1.81    1.12
Livermore Loops Benchmark - Double Precision floating point, MFLOPS rating for 24 loops and overall averages etc.

Code: Select all

 MFLOPS for 24 loops

  32 bit
  1480  1017   974   930   383   657  1624  1861  1664   617   498   741
   221   320   803   640   737  1003   451   378  1047   411   763   187

  64 bit
  2108   936   960   965   383   809  2313  2488  2066   669   500   981
   181   405   815   644   727  1190   450   397  1716   367   818   313

  64 bit / 32 bit gain range - 0.82 to 1.67                             

 Overall Ratings

 System    MHz   Maximum Average Geomean Harmean Minimum

 32 bit   1500    1860.8   800.4   679.0   564.1   179.5
 64 bit   1500    2616.7   959.8   766.7   613.0   169.7

 64/32 bit          1.41    1.20    1.13    1.09    0.95
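
As a rough illustration of how the overall ratings are formed (not the benchmark's own code), the sketch below computes the same five summary statistics from the 24 per-loop 64 bit figures listed above. The names `loops_64` and `summarise` are made up for this example. The published Maximum of 2616.7 is higher than any value in the 24 loop list, so the official ratings evidently draw on more measurements than the list shown, and the numbers printed here will not match the table exactly.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# 64 bit MFLOPS for the 24 loops, copied from the table above.
loops_64 = [2108, 936, 960, 965, 383, 809, 2313, 2488, 2066, 669, 500, 981,
            181, 405, 815, 644, 727, 1190, 450, 397, 1716, 367, 818, 313]


def summarise(mflops):
    """Return the five summary ratings for a list of per-loop MFLOPS."""
    return {
        "Maximum": max(mflops),
        "Average": fmean(mflops),
        "Geomean": geometric_mean(mflops),
        "Harmean": harmonic_mean(mflops),
        "Minimum": min(mflops),
    }


summary = summarise(loops_64)
for name, value in summary.items():
    print(f"{name:8s} {value:8.1f}")
```

For any set of positive speeds the harmonic mean is the lowest of the three averages and the arithmetic mean the highest, which is the ordering seen in the Overall Ratings table.
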
Next will be single core cache and RAM benchmarks.

RoyLongbottom
Posts: 436
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Fri Jun 19, 2020 1:25 pm

Single Thread Cache and RAM Benchmarks

These measure performance using data from caches and RAM. There appears to be a compiler issue in this area: single precision (SP) speeds are reported as much slower than Double Precision (DP), when the opposite should apply. Integer speeds can also be reported as too slow. Other gcc 8 compiler options can produce more appropriate results for cache speeds, but slower ones from RAM. Older 64 bit versions, compiled on other platforms, demonstrate acceptable performance and are included in the later tables. The following overall assessment ignores the false SP figures, so that it reflects hardware and Operating System performance.

There are four benchmarks, each with between 60 and 100 measurements. A bottom line assessment is that 64 bit and 32 bit speeds from RAM were the same, as were around half of the CPU dependent routines, with the other half averaging near 30% faster at 64 bits.

Fast Fourier Transforms Benchmarks - Running times in milliseconds for single and double precision were, on balance, the same at 32 bits and 64 bits. Note that DP running times are generally much longer than SP.

Code: Select all

            32 bit FFT 1    32 bit FFT 3    64 bit FFT 1    64 bit FFT 3  

             SP      DP      SP      DP      SP      DP      SP      DP
 Size K
      1     0.04    0.04    0.05    0.04    0.04    0.04    0.04    0.04
      2     0.08    0.13    0.10    0.10    0.08    0.14    0.08    0.10
      4     0.29    0.34    0.24    0.23    0.23    0.40    0.21    0.24
      8     0.79    0.82    0.57    0.51    0.74    0.99    0.47    0.51
     16     1.65    1.85    1.32    1.19    1.88    2.67    1.15    1.20
     32     3.76    4.71    2.69    3.30    5.04    5.16    2.26    3.31
     64     8.82   30.64    6.60    9.47    8.72   32.58    5.72   10.19
    128    58.54  132.41   16.92   23.85   49.92  160.12   15.92   24.43
    256   275.44  373.12   37.61   55.97  293.06  389.40   37.85   54.60
    512   780.89  751.27   81.54  128.13  559.88  780.79   82.06  119.23
   1024  1578.70 1812.20  186.45  288.27 1376.28 1890.46  178.37  262.30

    Ratios > 1.0 64 bit faster  Average     1.05    0.89    1.13    1.02
                                Minimum     0.75    0.69    0.99    0.93
                                Maximum     1.39    0.96    1.26    1.12
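
Because these results are running times, lower is better, so a "64 bit faster" ratio is the 32 bit time divided by the 64 bit time. The sketch below, with hypothetical names `ms_32`, `ms_64` and `speed_ratios`, applies that to the FFT 1 SP columns. It works from the rounded times printed above, so the average comes out slightly different from the published 1.05, while the minimum and maximum agree.

```python
# FFT 1 single precision running times in milliseconds, sizes 1K to 1024K,
# copied from the table above.
ms_32 = [0.04, 0.08, 0.29, 0.79, 1.65, 3.76, 8.82, 58.54, 275.44, 780.89, 1578.70]
ms_64 = [0.04, 0.08, 0.23, 0.74, 1.88, 5.04, 8.72, 49.92, 293.06, 559.88, 1376.28]


def speed_ratios(old_ms, new_ms):
    """Ratio above 1.0 means the new (64 bit) run took less time."""
    return [old / new for old, new in zip(old_ms, new_ms)]


ratios = speed_ratios(ms_32, ms_64)
average = sum(ratios) / len(ratios)
print(f"Average {average:.2f}  Minimum {min(ratios):.2f}  Maximum {max(ratios):.2f}")
```
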
BusSpeed Benchmark MB/second - Inc columns can be used to identify bus speed. Comparison is for Read All, where the 64 bit version is shown as between 30% and 60% faster from caches but a little slower from RAM.

Code: Select all

    Reading Speed 4 Byte Words in MBytes/Second         

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read       
  KBytes  Words  Words  Words  Words  Words    All       

                        32 bit                           

      16   4880   5075   5612   5852   5877   5864       
      32    846   1138   2153   3229   4908   5300       
      64    746   1019   2035   3027   4910   5360       
     128    728    983   1952   2908   4888   5389       
     256    683    934   1901   2794   4874   5431       
     512    656    900   1760   2625   4585   5259       
    1024    301    410    870   1356   2846   4238       
    4096    233    248    531    996   2151   4045       
   16384    236    258    511    891   2143   4011       
   65536    237    257    508    881   2172   4015       

                         64 bit                   64 bit/
                                                   32 bit
      16   4898   5109   5626   5860   5879   9238   1.58
      32   1109   1389   2485   3804   5026   8435   1.59
      64    804   1030   2025   3285   4871   8312   1.55
     128    737    951   1877   3130   4908   8556   1.59
     256    732    953   1897   3147   4941   8617   1.55
     512    701    939   1766   2902   4601   8150   1.31
    1024    323    494    986   1807   3060   5553   1.31
    4096    242    259    486    964   1932   3856   0.95
   16384    236    268    493    971   1939   3878   0.97
   65536    242    271    494    973   1942   3884   0.97
MemSpeed Benchmark MB/Second - calculating different floating point and integer functions as shown. Maximum MFLOPS are also shown. Note slow and constant SP and integer speeds at 64 bits, with integer calculations also slow at 32 bits (from gcc 8). An example of older results shows what the performance pattern from caches should look like.

Code: Select all

 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 32 bit
       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
  MFLOPS    1471   2470                                                 

 64 bit
       8   15531   3999   3957  15576   4387   4358  11629   9313   9314
      16   15717   3992   3922  15770   4355   4377  11799   9444   9446
      32   12020   3818   3814  12043   4179   4198  11549   9496   9497
      64   12228   3816   3887  12220   4166   4195   8935   8506   8506
     128   12265   3869   3941  12157   4182   4206   8080   8193   8196
     256   12230   3873   3932  12073   4199   4216   8129   8224   8223
     512    9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024    3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048    1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096    1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192    1996   3523   3555   1988   3643   3630   2668   2661   2663
  MFLOPS    1964   1000                                                 

64 bit / 32 bit

      16    1.32   0.40   1.03   1.33   0.43   1.00   1.13   1.20   1.20
     256    1.22   0.40   1.05   1.20   0.43   1.01   1.00   1.06   1.07
    8192    1.03   0.93   1.04   1.03   0.95   1.00   1.08   2.72   2.74


 ########################### Earlier Version ###########################

     Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Wed Jun 10 10:04:22 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   15504  13974  12580  15552  14024  15534  11521   9313   7791
      16   15707  14173  12747  15758  14183  15746  11751   9445   7890
      32   13356  11998  11123  13372  12300  12836  11450   9500   7937
      64   12340  11302  10651  12156  11698  12044   9415   8937   7910
     128   12253  11384  10707  12207  11861  12083   8260   8299   7821
     256   12259  11408  10694  12089  11896  12091   8101   8220   7894
     512    9855   9593   9246  10264   9482   9801   7917   8057   7754
    1024    3317   3613   3571   3640   3602   3600   5885   5833   5616
    2048    1881   1885   1881   1890   1879   1879   2911   2999   3015
    4096    1950   1946   1949   1952   1941   1925   2672   2666   2661
    8192    1952   1964   1964   1968   1962   1961   2546   2536   2537
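
The MFLOPS rows in the gcc 8 tables can be related to the MB/S figures for x[m]=x[m]+s*y[m]. The sketch below assumes one floating point operation per data word (a multiply and an add for each pair of words x[m] and y[m]). That reading is inferred from the figures, not taken from the benchmark source, and the function name `mflops_from_mbytes` is made up for this example.

```python
def mflops_from_mbytes(mbytes_per_sec, bytes_per_word, flops_per_word=1.0):
    """Convert a memory speed in MB/s into MFLOPS.

    flops_per_word=1.0 is an assumption inferred from the tables above.
    """
    mwords_per_sec = mbytes_per_sec / bytes_per_word
    return mwords_per_sec * flops_per_word


# 32 bit DP at 8 KB, 32 bit SP at 16 KB and 64 bit SP at 8 KB.
for label, mbs, width in (("32 bit DP", 11768, 8),
                          ("32 bit SP", 9880, 4),
                          ("64 bit SP", 3999, 4)):
    print(f"{label}: {mflops_from_mbytes(mbs, width):.0f} MFLOPS")
```

On that assumption the three cases come out at about 1471, 2470 and 1000 MFLOPS, in line with the MFLOPS rows, which makes the halving of 64 bit SP speed easy to see.
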
 
NeonSpeed Benchmark - This carries out the same SP and integer calculations as MemSpeed, with the same compiling problem affecting the SP and integer results. The other results use NEON Intrinsic Functions, where 32 bit and 64 bit performance can be the same. Results from an older compilation are also provided.

Code: Select all

      Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

                      32 bit                  

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

                      64 bit                  

      16   3629  14987   3925  13643  14457  16642
      32   3475  10933   3821   9970  11029  11055
      64   3447  11749   3845  11098  11802  12079
     128   3332  11392   3912  10813  11430  11513
     256   3325  11565   3926  10981  11598  11699
     512   3313  10553   3917  10269  10755  10740
    1024   3239   3331   3737   3291   3302   3321
    4096   2987   1888   3331   1777   1881   1878
   16384   3150   1821   3347   1814   1812   1834
   65536   2747   1954   3132   2017   1904   2021

64 bit / 32 bit

      16   0.37   1.16   1.00   1.07   1.10   1.10
     256   0.36   0.97   1.16   0.98   0.97   0.95
   65536   0.70   0.93   0.92   1.09   0.90   0.95


 ########################### Earlier Version ###########################

  NEON Speed Test armv8 64 Bit V 1.0 Wed Jun 10 10:06:03 2020

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16  13999  16429  12687  15238  16213  17194
      32  12384  13367  11232  12767  14406  14493
      64  10736  11870  10305  10790  11940  11976
     128  10728  11826  10393  10739  11951  11956
     256  10760  11908  10386  10816  12026  12064
     512  10697  11911  10404  10781  12070  12006
    1024   3854   3941   3810   4015   4315   4402
    4096   2007   2000   2018   1985   1995   1999
   16384   2002   2008   1997   1927   1997   1997
   65536   2030   2027   2022   2020   2012   2023
Next will be multithreading benchmarks.
Last edited by RoyLongbottom on Sat Jun 20, 2020 2:30 pm, edited 1 time in total.

RoyLongbottom
Posts: 436
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Sat Jun 20, 2020 9:23 am

Multithreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. Some are specifically designed to show what type of programs should not be multithreaded. 64 bit and 32 bit comparisons are not provided for these.

This section includes another benchmark where the compiler gets it wrong, plus the usual mixture of comparisons, some showing 64 bit the same as 32 bit and others showing 64 bit better, the highlight being the maximum GFLOPS demonstrations.


MP-Whetstone Benchmark - Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency. The MWIPS performance rating indicated that 64 bit code was 13% faster than that at 32 bits.

Code: Select all

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS
 
                            32 bit                                 
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                             64 bit                                
 
 1T  2147.8  530.7  530.0  397.8  60.5  27.3  7462.8  2237.7  998.2
 2T  4294.1 1058.4 1059.5  795.8 120.9  54.6 14877.9  4457.8 1994.8
 4T  8558.2 2093.8 2112.2 1590.3 241.8 108.3 29221.8  8909.9 3982.1
 8T  8987.0 2689.8 2721.9 1641.0 254.1 112.0 37422.9 10873.9 4122.3

   Overall Seconds   5.00 1T,   5.00 2T,   5.05 4T,  10.13 8T

                4 Thread 64 bit/32 bit Performance ratios          

       1.13   1.00   0.98   1.27  1.07  1.04    0.99   1.00    1.50
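
To put a number on "proportional to the number of cores used", the sketch below divides the MWIPS at each thread count by the 1 thread figure. The dictionary and function names are made up for this example. The Pi 4B has four cores, so 8 threads can add little beyond the 4 thread result, which fits with the overall seconds doubling at 8T.

```python
# MWIPS by thread count, copied from the MP-Whetstone table above.
mwips_32 = {1: 1889.5, 2: 3782.7, 4: 7564.1, 8: 8003.6}
mwips_64 = {1: 2147.8, 2: 4294.1, 4: 8558.2, 8: 8987.0}


def speedups(mwips):
    """Return MWIPS at each thread count relative to the 1 thread run."""
    base = mwips[1]
    return {threads: value / base for threads, value in mwips.items()}


for label, table in (("32 bit", mwips_32), ("64 bit", mwips_64)):
    line = "  ".join(f"{t}T x{s:.2f}" for t, s in speedups(table).items())
    print(label, line)
```
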
MP-Dhrystone Benchmark - This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. The single thread speeds were similar to the earlier Dhrystone results, with 44% 64 bit performance gains. The other results don’t mean much.

Code: Select all

                    Using 1, 2, 4 and 8 Threads            
            
                              32 bit                        

 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459

                              64 bit                        

 Seconds                     0.55     1.08     2.15     4.30
 Dhrystones per Second   14531390 14791730 14896723 14872767
 VAX MIPS rating             8271     8419     8478     8465

64 bit / 32 bit              1.44     1.12     1.22     1.13
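
The VAX MIPS rating follows from Dhrystones per second by dividing by 1757, the score of the VAX 11/780 reference machine. A minimal sketch, with a made-up function name, checked against the 1 thread rows above:

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # reference machine rated at 1 MIPS


def vax_mips(dhrystones_per_sec):
    """Convert Dhrystones per second into a VAX MIPS (DMIPS) rating."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


print(f"32 bit 1T: {vax_mips(10126308):.0f} VAX MIPS")
print(f"64 bit 1T: {vax_mips(14531390):.0f} VAX MIPS")
print(f"64/32 bit: {vax_mips(14531390) / vax_mips(10126308):.2f}")
```
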
MP SP NEON Linpack Benchmark - This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes. The unthreaded results are of interest but, using NEON functions, the 64 bit program cannot improve performance much.

Code: Select all

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

                       32 bit                 

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

                       64 bit                 

 N  100    2167.70    91.82    89.65    89.96 
 N  500    1438.27   644.85   635.89   635.33 
 N 1000     394.99   376.97   383.92   384.19 

                   64 bit / 32 bit            

 N  100       1.08     0.82     0.83     0.84 
 N  500       1.08     0.94     0.93     0.92 
 N 1000       0.98     0.87     0.89     0.89 
MP BusSpeed (read only) Benchmark - Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the latter to avoid misrepresentation of performance using shared L2 cache. Each set of results shows appropriate performance gains on increasing the number of threads used. But the 64 bit compiler gets it wrong again, after Inc8. Part of another version is provided to show how it should behave.

Code: Select all

 KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

                         32 bit                          

 12.3 1T   5310   5616   5801   5898   5940  13425       
      2T   9393  10008  11293  11293  11368  24932       
      4T  15781  15015  17606  19034  22279  40736       
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281       
      2T    564    726   1523   5376   9387  18985       
      4T    486    919   1886   4289   8337  16979       
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975       
      2T    202    421    450   1765   3307   7396       
      4T    261    288    825   1332   1772   5014       
      8T    218    273    496   1041   2571   4021       

                         64 bit                   Rd All 
                                                  64 bit/
                                                  32 bit 
 12.3 1T   5168   5542   5641   4205   4095   4230   0.32
      2T   8968  10728  10161   8110   8058   8368   0.34
      4T   7874  13255  15586  13641  15485  16533   0.41
      8T   8186  13386  15239  13469  14431  16372   0.44
122.9 1T    598    927   1876   2792   3746   4059   0.39
      2T    514    719   1538   4846   7596   8083   0.43
      4T    486    933   2060   4126   8175  13690   0.81
      8T    483    937   2059   4160   8166  13817   0.82
12288 1T    224    257    488    964   1933   3579   0.90
      2T    219    427    889   1832   3493   5371   0.73
      4T    280    353    562    859   2168   3286   0.66
      8T    229    230    527   1075   1880   4480   1.11


 ###################### gcc 9 Version ###################

 MP-BusSpd 64 Bit gcc 9 Fri May 29 09:56:08 2020         

 12.3 4T   7317  13937  15720  18355  20549  33244
122.9 4T    492    937   1883   4009   7820  16423
MP RandMem Benchmark - The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected, and essentially the same at 32 bits and 64 bits, with no scope for vectorisation. Random access is dependent on the impact of burst reading and writing, producing those slow speeds. Read only performance increased, as expected, relative to the thread count, with that for read/write remaining constant at a particular data size, probably due to write back to shared data space.

Code: Select all

  KB       SerRD SerRDWR   RndRD RndRDWR
 
                    32 bit              

 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

                    64 bit              

 12.3 1T    5945    7898    5948    7895
      2T   11913    7937   11905    7929
      4T   23601    7875   23385    7867
      8T   23139    7777   23016    7770
122.9 1T    5785    7090    2026    1977
      2T   10941    7074    1654    1968
      4T   10364    7052    1854    1970
      8T   10256    7031    1844    1973
12288 1T    3861    1244     180     169
      2T    3793    1242     220     171
      4T    3941    1100     343     170
      8T    4065    1247     351     171


                64 bit / 32 bit         

 12.3 4T    1.01    1.01    1.00    1.01
122.9 4T    1.04    0.97    1.02    1.02
12288 4T    1.16    1.03    1.00    1.03
MP-MFLOPS Benchmarks - MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word that should be suitable to use fused multiply and add instructions. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data.

Results from three different versions are shown below, all behaving as expected. At 2 operations per word, the 12800 KB results represent RAM speeds that were not much different at 32 bits and 64 bits. NEON 32 bit performance was much faster than normal 32 bit code and nearly as fast as that at 64 bits. Most show appropriate performance gains, between 1, 2 and 4 threads. As intended, maximum MFLOPS were for cache based data with 4 threads at 32 operations per word, 64 bit SP running at near 26 GFLOPS and DP at around 12.5 GFLOPS. 64 bit performance gains were highest in the first set of results.

Code: Select all

                Single Precision Version         

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
                        32 bit                      

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165

                        64 bit                      

 1T     3303    3113     526    6750    6713    6429
 2T     6410    4860     540   13378   13373    9005
 4T    11696    6413     571   25479   25917   10126
 8T    10262   10054     571   23140   23427    8726

Max                                                 
64b/32b 2.83    2.18    1.06    2.31    2.43    1.21

            NEON Intrinsic Functions Version      

                       32 bit                    

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516

                       64 bit                    

 1T     3319    3245     527    6569    6538    6294
 2T     5737    5333     556   12810   12784    9565
 4T     8497   11088     572   24775   24885    9570
 8T     8037   11330     573   22658   21773    9443

Max                                                 
64b/32b 1.08    1.07    0.89    1.45    1.45    0.99

              Double Precision Version            

                       32 bit                    

 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197

                       64 bit                    

 1T     1637    1553     273    3356    3351    3220
 2T     3180    3031     278    6664    6676    4531
 4T     5778    3102     283   12522   12675    4791
 8T     3927    4272     286   12304   11351    4875

Max                                                 
64b/32b 1.24    1.20    0.91    1.21    1.22    0.93
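
One way to read the maximum figures is as floating point operations per clock cycle per core. The sketch below assumes the 1500 MHz clock quoted in the earlier tables and the four cores of the Pi 4B. The function name is made up for this example.

```python
CLOCK_MHZ = 1500  # CPU clock quoted in the single core tables
CORES = 4         # the Pi 4B has four cores


def flops_per_clock_per_core(mflops, cores=CORES, clock_mhz=CLOCK_MHZ):
    """Average floating point operations per clock cycle on each core."""
    return mflops / (cores * clock_mhz)


# 64 bit, 4 threads, 32 operations per word, 128 KB.
sp_rate = flops_per_clock_per_core(25917)  # single precision
dp_rate = flops_per_clock_per_core(12675)  # double precision
print(f"SP {sp_rate:.2f}  DP {dp_rate:.2f} operations per clock per core")
```

The SP figure works out at a little over 4 operations per clock per core and DP at about half that, consistent with twice as many SP values fitting in a vector register.
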
OpenMP-MFLOPS Benchmarks - This was an instant port from a Linux/PC program, with larger data sizes than MP-MFLOPS, only the one with 100000 SP words fitting in L2 cache and an extra test with 8 operations per word. Besides automatic parallelisation through OpenMP, a single core notOpenMP version was produced by omitting the OpenMP compiling directive. As can be seen, the 64 bit version generated a 4 core score of 24 GFLOPS, 21% faster than at 32 bits and 3.5 times faster than notOpenMP.

Code: Select all

Output format used for a CUDA benchmark

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

                                OpenMP MFLOPS 32 bit 
 
 Data in & out     100000     2     2500   0.098043     5100    0.929538   Yes       
 Data in & out    1000000     2      250   0.810084      617    0.992550   Yes
 Data in & out   10000000     2       25   0.922891      542    0.999250   Yes

 Data in & out     100000     8     2500   0.144870    13805    0.957126   Yes
 Data in & out    1000000     8      250   0.922568     2168    0.995524   Yes
 Data in & out   10000000     8       25   0.918226     2178    0.999550   Yes

 Data in & out     100000    32     2500   0.401577    19921    0.890282   Yes
 Data in & out    1000000    32      250   0.935064     8556    0.988096   Yes
 Data in & out   10000000    32       25   0.916277     8731    0.998806   Yes

                                 OpenMP MFLOPS 64 bit                           64b/
                                                                                 32b
 Data in & out     100000     2     2500   0.092784     5389    0.929538   Yes  1.06
 Data in & out    1000000     2      250   0.794744      629    0.992550   Yes  1.02
 Data in & out   10000000     2       25   0.784255      638    0.999250   Yes  1.18

 Data in & out     100000     8     2500   0.114583    17455    0.957117   Yes  1.26
 Data in & out    1000000     8      250   0.797846     2507    0.995518   Yes  1.16
 Data in & out   10000000     8       25   0.879850     2273    0.999549   Yes  1.04

 Data in & out     100000    32     2500   0.332392    24068    0.890215   Yes  1.21
 Data in & out    1000000    32      250   0.849420     9418    0.988088   Yes  1.10
 Data in & out   10000000    32       25   0.933336     8571    0.998796   Yes  0.98

                                 notOpenMP MFLOPS 32 bit                                           

 Data in & out     100000     2     2500   0.220277     2270    0.929538   Yes
 Data in & out    1000000     2      250   0.791373      632    0.992550   Yes
 Data in & out   10000000     2       25   0.792594      631    0.999250   Yes

 Data in & out     100000     8     2500   0.362916     5511    0.957126   Yes
 Data in & out    1000000     8      250   0.902125     2217    0.995524   Yes
 Data in & out   10000000     8       25   0.786859     2542    0.999550   Yes

 Data in & out     100000    32     2500   1.497859     5341    0.890282   Yes
 Data in & out    1000000    32      250   1.518747     5267    0.988096   Yes
 Data in & out   10000000    32       25   1.516393     5276    0.998806   Yes

                                 notOpenMP MFLOPS 64 bit                        64b/
                                                                                 32b                      
 Data in & out     100000     2     2500   0.152535     3278    0.929538   Yes  1.44     
 Data in & out    1000000     2      250   0.965797      518    0.992550   Yes  0.82
 Data in & out   10000000     2       25   0.781680      640    0.999250   Yes  1.01

 Data in & out     100000     8     2500   0.356388     5612    0.957117   Yes  1.02
 Data in & out    1000000     8      250   0.925742     2160    0.995518   Yes  0.97
 Data in & out   10000000     8       25   0.840113     2381    0.999549   Yes  0.94

 Data in & out     100000    32     2500   1.176455     6800    0.890215   Yes  1.27
 Data in & out    1000000    32      250   1.227945     6515    0.988088   Yes  1.24
 Data in & out   10000000    32       25   1.225311     6529    0.998796   Yes  1.24
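
The MFLOPS column can be reproduced from the other columns as words x operations per word x repeat passes, divided by the running time. A minimal sketch, with a made-up function name, checked against rows in the tables above:

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test row."""
    total_operations = words * ops_per_word * passes
    return total_operations / seconds / 1e6


openmp_32 = mflops(100000, 2, 2500, 0.098043)       # first 32 bit OpenMP row
openmp_64 = mflops(100000, 32, 2500, 0.332392)      # 64 bit OpenMP, 32 ops
not_openmp_64 = mflops(100000, 32, 2500, 1.176455)  # 64 bit notOpenMP, 32 ops

print(f"{openmp_32:.0f} {openmp_64:.0f} {not_openmp_64:.0f}")
print(f"OpenMP gain over notOpenMP: {openmp_64 / not_openmp_64:.1f}x")
```

These come out at about 5100, 24068 and 6800 MFLOPS, matching the table, with the last two giving the 3.5 times gain quoted above.
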
OpenMP-MemSpeed and notOpenMP Version - With the directives I used, OpenMP failed to compile this into a sensible benchmark. The 32 bit results are not shown here, as they looked to be the same as these. The notOpenMP version results were similar to those for the earlier MemSpeed benchmark.

Code: Select all

                  Memory Reading Speed Test OpenMP                      

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                               64 bit                                   

       4    7749   8500   8716   7451   8520   8533  39508  18586  18589
       8    8198   8669   8874   8148   8678   8691  38972  18863  18861
      16    8023   8499   8335   7895   8355   8507  38305  19003  19004
      32    9034   8517   8619   9127   8550   8522  37928  19071  18409
      64    8652   8201   8178   8565   8223   8093  25191  17494  17508
     128   11397  11616  11715  11345  11649  11029  13861  14097  14170
     256   18242  18745  18195  17417  18605  18019  12535  12637  12623
     512   17580  18467  18787  18010  18414  18321  12900  13180  13121
    1024    8043  10172  11540  12510  10220  12082   9800   9586   9857
    2048    4816   6807   6850   6922   6805   6666   3137   3372   3369
    4096    7029   6846   6881   7017   5145   6801   2776   3124   3112
    8192    2428   7085   7124   7068   7134   6904   2571   3092   3112
   16384    7133   7152   7328   7008   3445   7178   2473   3099   3104
   32768    2656   7643   7669   7802   7616   7559   2043   3112   3104
   65536    7995   6523   2572   7059   6514   6485   2431   2955   3036
  131072    1981   7273   7327   1878   3615   7267   2538   2968   2976

 Not OMP                                                                
       8   15532   3990   4394  15567   4386   4394  11629   9315   9314
     256   12318   3871   4219  12134   4206   4219   8092   8231   8229
   65536    2005   2588   2937   2011   2930   2621   2577   2565   2566
Next will be I/O benchmarks.

RoyLongbottom
Posts: 436
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Sat Jun 20, 2020 2:41 pm

I/O Benchmarks

This section relates to a benchmark that measures drive and network speeds, covering large files, random access and small files. There are two varieties, one that allows caching and the other that uses Direct I/O, without caching. Originally, the former was used for network testing and the latter for local drives, but not anymore, as indicated below.

The main purpose of running these benchmarks was to confirm that they could run as 64 bit programs. Performance comparisons between 64 bit and 32 bit working are generally not provided, as speeds are likely to be the same, limited by the I/O device. What is highlighted, at least with these particular programs, is the huge increase in the size of files that can be written using the 64 bit versions.

LanSpeed Benchmark - WiFi - These all use the default large file sizes of 8 and 16 MB. The tests demonstrate operation at 2.4 and 5 GHz using 32 bit and 64 bit programs. Speeds can vary widely, particularly at 5 GHz where, at least with my setup, consistent operation was extremely difficult to achieve with either program. Data transfers were to and from a Windows based PC.

Code: Select all

 ********************* 32 bit 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** 32 bit 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz


 ********************** 64 bit 2.4 GHz *******************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     5.48     5.14     5.39     6.86     6.61     5.30
  16     5.62     5.64     5.69     5.17     5.02     5.18

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.666    4.035    5.131     4.82     4.67     3.90

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.24     0.52     0.95     0.34     0.60     1.14
 ms/file    17.10    15.73    17.20    12.00    13.68    14.35    2.437


 ********************** 64 bit 5 GHz *********************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    11.43    11.70    11.57     8.21     3.64     7.05
  16    10.96     7.30    11.84     8.40     6.24     7.94

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.38     0.73     1.12     0.39     0.73     0.98
 ms/file    10.82    11.15    14.62    10.55    11.23    16.73    2.618

 Random similar to 2.4 GHz
LanSpeed Benchmark - 1 Gbit/s Ethernet - These tests were again run connected to the Windows PC, demonstrating Gigabit speeds of more than 100 MB/second at both 32 bit and 64 bit working.

Larger Files - Previously, at 32 bit working, it had been noted that 2 GB files could not be written. As shown below, the build up to this was examined by specifying the size of the first set of large files, which doubles for the second set. The first 2 GB file was nearly fully written: the file properties indicated a size of 2,147,483,647 bytes (2^31 - 1, or 2 GB less 1 byte). A benchmark with the same parameters ran successfully using the 64 bit program.

Accessing a large Windows drive provided sufficient free space to try larger files at 64 bits, where three 16 GB files were written and read successfully (at an average speed of around 100 MB/second).
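
The 2,147,483,647 byte figure is what would be expected if the 32 bit build held file sizes or offsets in a signed 32 bit integer. That is an assumption about the cause, not something confirmed from the benchmark source. A minimal sketch of the arithmetic, with MB taken as 2^20 bytes (also an assumption):

```python
# Assumption: the 32 bit program keeps file sizes/offsets in a signed 32 bit integer.
INT32_MAX = 2**31 - 1   # largest value a signed 32 bit integer can hold
MB = 1024 * 1024        # benchmark "MB" assumed to be 2^20 bytes


def fits_in_int32(size_mb):
    """Return True if a file of size_mb megabytes has a size representable in int32."""
    return size_mb * MB <= INT32_MAX


for size_mb in (1024, 2048, 16384):
    print(f"{size_mb:6d} MB fits in signed 32 bits: {fits_in_int32(size_mb)}")

print(f"Largest representable size: {INT32_MAX:,} bytes")
```

On that assumption, a 2048 MB file is exactly one byte too large, which matches the size shown in the file properties.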

Code: Select all

 ************************ 32 bit ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29

    1024   96.13   93.34   94.98  114.51  112.16  114.91
    2048   Error writing file  Segmentation fault
           Wrote 2,147,483,647 bytes

 ************************ 64 bit ************************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    93.63    93.17    96.38   108.02   109.36   109.30
2048    98.41    96.54    99.18   111.26   111.89   111.83

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.005    0.014     0.81     0.75     1.23

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.42     2.82     5.24     2.30     4.56     8.09
 ms/file     2.89     2.90     3.13     1.78     1.80     2.02    0.288


 Much Larger Files

 8192   89.77    89.98    91.86   117.29   117.21   117.17
16384   90.64    89.47    91.10   116.58   117.24   117.13
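
The MB/sec and ms/file rows of the small file tests appear to be two views of the same measurement. The sketch below is inferred from the printed numbers rather than taken from the benchmark source: it assumes a KB is 1024 bytes and that the reported MB is 10^6 bytes. The function name is hypothetical.

```python
def small_file_mb_per_sec(file_kb, ms_per_file):
    """MB/second (MB assumed = 10^6 bytes) for one file of file_kb KB (1024 bytes each)."""
    bytes_per_file = file_kb * 1024
    seconds_per_file = ms_per_file / 1000.0
    return bytes_per_file / seconds_per_file / 1e6


# (file KB, ms/file, reported MB/sec) from the 32 bit Ethernet small file write columns
rows = [(4, 2.78, 1.47), (8, 2.92, 2.8), (16, 3.19, 5.14)]

for file_kb, ms, reported in rows:
    derived = small_file_mb_per_sec(file_kb, ms)
    print(f"{file_kb:2d} KB: derived {derived:.2f} MB/sec, reported {reported}")
```

The derived values agree with the reported ones to within the rounding of the printed ms/file figures.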
USB 3 Benchmarks - Following are Direct I/O DriveSpeed results at 32 bits and 64 bits, accessing the same USB 3 flash drive. Note the differences in performance between the various test procedures (they might not be the same next time). The 32 bit system again failed on attempting to write a 2 GB file (2^31 - 1 limit).

The 64 bit DriveSpeed version successfully handled the 2 GB files, but failed to write 4 GB (note that files are written and read using a 1 MB buffer, avoiding space occupancy restrictions). I decided to try LanSpeed, with caching rather than Direct I/O, on a USB 3 hard drive mounted in place of a network source. This enabled successful runs using three files of nearly 12 GB each. The vmstat recordings show that there was no serious memory swapping, with around 7.5 GB of RAM used for caching. They also confirm the benchmark’s recorded speeds, with reading apparently not much affected by caching.

Later, results from a new variation of the benchmark are provided. This only deals with large files, where between 1 and 3 files can be used via separate write/read and read only programs.

Code: Select all

 ********************* 32 bit USB 3 *********************

   DriveSpeed RasPi 1.1 Sat May 30 15:31:20 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  112565, Used MB    8267

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    73.43    74.88    74.88   217.60   219.98   218.02
1024    63.03    76.64    74.46   220.72   220.60   219.97
 Cached
   8    38.07    41.95    39.95   700.06   693.26   677.20

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.982    0.981    1.001     6.81     6.31     6.31

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14     2.58     5.23    10.32
 ms/file   120.08   120.06   120.00     1.59     1.57     1.59    2.491

 
 Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2000    75.14    74.93    74.93   216.19   217.22   216.53
2048 Error writing file Segmentation fault

 ********************* 64 bit USB 3 *********************

   DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  114614, Used MB    6218

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    27.78    21.39    21.43   270.32   278.81   274.98
2048    21.40    21.14    21.44   275.79   273.14   319.95
 Cached
   8    40.27    42.81    42.81  1206.64  1068.72  1031.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.004    0.184     4.33     4.00     4.04

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14   261.45    11.19    84.39
 ms/file   119.60   119.05   119.64     0.02     0.73     0.19    2.477

 
Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2048    23.77    19.89    20.64   320.34   272.90   271.96 
4096    Write failure

2000    21.72    22.38    26.57   275.40   273.85   309.57
4000    37.38    36.30    37.67   297.09   299.91   286.94


Caching Benchmark - USB 3 Hard Drive - 3 files up to near 36 GB capacity used  

 6000  169.80   136.20   126.26    90.43   146.13   144.05
12000  146.65   108.83    67.14   108.13   146.84   143.76

 swpd    free   buff  cache    si   so     bi     bo   vmstat memory and I/O activity

  768 7417668 102040  250844    0    0   1299   1329   Start
  768 1970544  94436 5704132    0    0      0 132723   Writing 12000 MB
  768  107908  92712 7568500    0    0 140339      0   Reading 12000 MB
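
The vmstat bi and bo columns above can be compared with the benchmark's own MBytes/Second figures. This is a rough sketch that assumes vmstat reports blocks of 1024 bytes per second and that the benchmark's MB is 10^6 bytes. The function name is hypothetical.

```python
BLOCK_BYTES = 1024  # assumption: vmstat bi/bo are counted in 1024 byte blocks per second


def vmstat_blocks_to_mb_per_sec(blocks_per_sec):
    """Convert a vmstat bi or bo sample to MB/second, with MB taken as 10^6 bytes."""
    return blocks_per_sec * BLOCK_BYTES / 1e6


write_rate = vmstat_blocks_to_mb_per_sec(132723)  # bo sample while writing 12000 MB
read_rate = vmstat_blocks_to_mb_per_sec(140339)   # bi sample while reading 12000 MB

print(f"vmstat write sample: {write_rate:.1f} MB/second")
print(f"vmstat read sample:  {read_rate:.1f} MB/second")
```

The read sample is close to the 146.84 and 143.76 MB/second recorded for Read2 and Read3 at 12000 MB. The write sample falls within the spread of the three recorded write speeds.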
Main Drive Benchmark - The DriveSpeed benchmark failed to execute on the 64 bit system, giving the message “Error writing file Segmentation fault”, so I decided to try the LanSpeed caching version again. DriveSpeed ran successfully at 32 bits but again failed to write 2 GB files, as did LanSpeed.

Below are default results from running LanSpeed on the Pi 4 at 64 bits, initially intended to verify that the main drive could be accessed by one of my programs. Note the caching effects. Initially, I could not specify large files, as there was limited free space on the OS drive. After cloning the card to a 32 GB version, 19 GB of free space was indicated. I then ran the program to write three 6000 MB files. This was followed by specifying 16000 MBytes, where one file was written and the second generated an error after writing around 2500 MB. The good news was that the test did not crash the system.

Later tests, using a 64 GB SD card, write and read one file of nearly 40 GB.

Code: Select all

 ************************ 32 bit ************************
  
 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 Files  Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014

Larger Files
        
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3


1024    13.38    13.35    13.39    42.68    42.59    42.36
2048    Error writing file Segmentation fault

LanSpeed
 
1024    11.65    13.46    13.48   560.78   574.76   617.67
2048    Error writing file Segmentation fault


 ************************ 64 bit ************************

   LanSpeed RasPi 64 Bit gcc 8 Wed May 27 10:36:54 2020
 
 Current Directory Path: /home/pi/Raspberry-Pi-4-64-Bit-Benchmarks
 Total MB   14637, Free MB    8724, Used MB    5913

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8   265.13   281.30   292.28  1270.88  1286.35  1329.42
  16   246.59   277.53   299.05  1201.20  1327.24  1095.78

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.002    0.002    0.002     7.68     9.01     7.14

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec     56.52    64.92    94.20   303.96   549.54   538.32
 ms/file     0.07     0.13     0.17     0.01     0.01     0.03    0.014


Larger File - 32 GB SD card  

Total MB   29643, Free MB   19776, Used MB    9868         

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 6000    24.14    18.80    19.39    31.07    45.60    45.76
16000    21.12      Error writing file Segmentation fault

File 1  15.6 GiB (16,777,216,000 bytes)
File 2   2.5 GiB ( 2,645,176,320 bytes) - Not enough free space
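
The file sizes listed above follow from the requested 16000 MB if the benchmark's MB is 2^20 bytes. The sketch below makes that assumption and also treats the reported Free MB as being in the same unit:

```python
MIB = 1024 * 1024   # assumed size of the benchmark's "MB"
GIB = 1024 ** 3

file1_bytes = 16000 * MIB     # requested size of each file
file2_bytes = 2_645_176_320   # bytes written to the second file before the error
free_mb = 19776               # "Free MB" reported before the run

print(f"File 1: {file1_bytes:,} bytes = {file1_bytes / GIB:.1f} GiB")
print(f"File 2: {file2_bytes:,} bytes = {file2_bytes / GIB:.1f} GiB "
      f"({file2_bytes / MIB:,.0f} MB)")
print(f"Two full files need {2 * 16000} MB against {free_mb} MB free")
```

Two 16000 MB files need 32000 MB, well above the free space reported, so the second write could not complete.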
Next are Java and OpenGL Benchmarks

ejolson
Posts: 10972
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Sat Jun 20, 2020 2:51 pm

RoyLongbottom wrote:
Fri Jun 19, 2020 1:25 pm
Single Thread Cache and RAM Benchmarks

These measure performance using data from caches and RAM. There appears to be a compiler issue, in this area, where single precision (SP) speeds are shown to be much slower than using Double Precision (DP), when the opposite should apply. Integer speeds can also be shown as being too slow.
On a parallel merge sort coded in C, the version 10.1 gcc compiler produced a binary executable which ran 1.61 times faster than the one built with the system compiler.

viewtopic.php?f=63&t=227177&start=100#p1657704

This fixes a compiler regression for the Pi introduced in gcc 7.0 that has persisted through the 8.x versions. It would be interesting to know how the benchmarks tabulated here are affected by the 10.1 version of gcc.

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Sun Jun 21, 2020 10:01 am

Java and OpenGL Benchmarks

The Java benchmarks comprise class files that were produced some time ago, but the source code is available to rebuild them. Performance can vary significantly between different Java Virtual Machines, so comparisons might not be appropriate.

Java Whetstone Benchmark - The results below suggest that 32 bit overall performance, in MWIPS, was faster than at 64 bits. This was due to the most time consuming functions (N5 and N6, marked x) taking less time.

Code: Select all

 ************************* 32 bit *************************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900 x
  N6 floating point   0.999999821    345.95             1.5592 x
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

    Whetstone Benchmark Java Version, May 22 2020, 14:24:09

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    520.61             0.0369
  N2 floating point  -1.131330490    481.38             0.2792
  N3 if then else     1.000000000             236.41    0.4378
  N4 fixed point     12.000000000            1320.20    0.2386
  N5 sin,cos etc.     0.499110132              47.96    1.7348 x
  N6 floating point   0.999999821    276.33             1.9520 x
  N7 assignments      3.000000000             320.17    0.5772
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1487.99             6.7205

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
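
The contribution of N5 and N6 can be checked from the millisecond columns of the two listings above. A small sketch, using only the printed times (the function name is hypothetical):

```python
# Per-test times in milliseconds for N1..N8, copied from the two listings
ms_32 = [0.0366, 0.2720, 0.3570, 0.2882, 1.3900, 1.5592, 0.5574, 1.4640]
ms_64 = [0.0369, 0.2792, 0.4378, 0.2386, 1.7348, 1.9520, 0.5772, 1.4640]


def n5_n6_share(faster, slower):
    """Fraction of the total time difference that comes from tests N5 and N6."""
    total_delta = sum(slower) - sum(faster)
    n5_n6_delta = (slower[4] + slower[5]) - (faster[4] + faster[5])
    return n5_n6_delta / total_delta


print(f"Total ms, 32 bit: {sum(ms_32):.4f}")
print(f"Total ms, 64 bit: {sum(ms_64):.4f}")
print(f"Share of the difference from N5 and N6: {n5_n6_share(ms_32, ms_64):.0%}")
```

The totals reproduce the 5.9244 and 6.7205 ms printed on the MWIPS lines, and N5 plus N6 account for most of the gap between them.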
JavaDraw Benchmark - The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. For this to run at maximum speed, it was necessary to disable the experimental GL driver. In this case, performance at 32 bits and 64 bits was quite similar.

Code: Select all

 ************************* 32 bit *************************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

   Java Drawing Benchmark, May 22 2020, 14:25:15
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      833    83.26
  Display PNG Bitmap Twice Pass 2     1001   100.05
  Plus 2 SweepGradient Circles         994    99.39
  Plus 200 Random Small Circles        836    83.54
  Plus 320 Long Lines                  380    37.98
  Plus 4000 Random Small Circles        95     9.44

         Total Elapsed Time  60.1 seconds
        
  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
OpenGL GLUT Benchmark - The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.


The benchmark could not be recompiled, at 64 bits, as certain freeglut functions were not readily available. So, an earlier version was used. In this case, the 64 bit version, at the higher pixel settings, appeared to be slower on the graphics speed dependent tests, but faster elsewhere.

As indicated below, the dual monitor connections enabled this option to be tested at 64 bits.

Code: Select all

 ************************ 32 bit ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0

 ************************ 64 bit ************************

 GLUT OpenGL Benchmark 64 Bit gcc 9, Fri May 22 13:50:00 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    753.4    414.5    258.3    152.0     42.7     30.0
   320   240    644.5    385.9    243.9    145.6     41.5     29.1
   640   480    320.6    270.6    217.9    136.8     43.0     29.4
  1024   768    140.6    135.1    122.6    114.1     41.8     28.5
  1920  1080     57.7     56.4     55.7     52.4     40.5     26.7
Dual Monitor - Monitor and TV each at 1920 x 1080 pixels.

Code: Select all

 ****************** 64 bit Dual Monitor ******************

  3840  1080     26.9     26.7     27.0     26.0     27.5     21.6

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Sun Jun 21, 2020 5:35 pm

Usable RAM Measurements

On running various benchmarks, it became clear that there were restrictions on how much RAM could be used by my C based benchmarks.

Usable RAM - MALLOC - A simple program was written that allocated a specified amount of memory using malloc, filled it with data, freed the space, then repeated the sequence with incrementally larger sizes until an allocation failure was indicated. Both 32 bit and 64 bit versions were produced and each was run on 4 GB and 8 GB systems. Except at 64 bits with 8 GB, all were restricted to less than 4,000,000,000 bytes. For the 64 bit 8 GB case, vmstat memory utilisation details are provided, showing the low points of free memory during each pass and samples between passes, confirming that memory space had been freed.

Code: Select all

 ############################### 32 Bit OS ###############################
 
                                 4 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3050000000 (2.84 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3060000000 (2.85 GB)

 ############################### 64 Bit OS ###############################
 
                                 4 GB RAM 
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3700000000 (3.45 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  1000000000 words allocated  1000000000 written finished 
 Bytes 5000000000  1250000000 words allocated  1250000000 written finished 
 Bytes 6000000000  1500000000 words allocated  1500000000 written finished 
 Bytes 7000000000  1750000000 words allocated  1750000000 written finished 
 Bytes 8000000000  Memory allocation failed - Exit Later OK to 7900000000 (7.36 GB)

pass swpd    free   buff  cache    pass swpd    free   buff  cache

        0 7412260  85908 274472            0 7234852  85908 278140
 1      0 6615688  85908 277608     5      0 2600856  85908 277096
        0 7385388  85908 277264            0 7184736  85908 277612
 2      0 5671192  85908 277612     6      0 1571436  85908 277096
        0 7210328  85908 277264            0 7257464  85908 277096
 3      0 4526104  85908 277096     7      0 624436   86228 281456
        0 7324312  85908 277096            0 7402400  86228 283200
 4      0 3665272  85908 277264                                   
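
The GB figures quoted beside the largest successful allocations above are binary gigabytes (2^30 bytes), and each pass writes 4 byte words. A short sketch of the conversions, using only the byte counts from the listings:

```python
GIB = 1024 ** 3
WORD_BYTES = 4  # 1000000000 bytes is reported as 250000000 words

# Largest successful allocation in bytes for each configuration
largest_ok = {
    "32 bit OS, 4 GB RAM": 3_050_000_000,
    "32 bit OS, 8 GB RAM": 3_060_000_000,
    "64 bit OS, 4 GB RAM": 3_700_000_000,
    "64 bit OS, 8 GB RAM": 7_900_000_000,
}

for label, size_bytes in largest_ok.items():
    words = size_bytes // WORD_BYTES
    print(f"{label}: {size_bytes:,} bytes = {words:,} words = {size_bytes / GIB:.2f} GB")
```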
Usable RAM - Specified Dimensions - Where array dimensions were specified in the programs, rather than using malloc, some differences were apparent. Using the 32 bit system, a compile error was indicated when the dimensions required 2 GB (2^31 bytes), with 1 word (or 1 byte) less being accepted. As shown below, with 64 bits, at both 4 GB and 8 GB, arrays of close to these sizes could be used.

Code: Select all

 ######################## 32 Bit OS 4 GB and 8 GB ########################
  int    array[536870912]; size of array 'array' is too large 2 GB
  int    array[536870911]; compiles
  float  array[536870912]; size of array 'array' is too large 2 GB
  float  array[536870911]; compiles
  double array[268435456]; size of array 'array' is too large 2 GB
  double array[268435455]; compiles

 ############################# 64 Bit OS 4 GB ############################
 int    array[920000000];  OK 3.43 GB
 int    array[1073741824]; Segmentation fault 4 GB
 float  array[920000000];  OK 3.43 GB
 float  array[1073741824]; Segmentation fault 4 GB
 double array[460000000];  OK 3.43 GB
 double array[536870912];  Segmentation fault 4 GB

 ############################# 64 Bit OS 8 GB ############################
 int    array[1950000000]; OK 7.9 GB paging
 int    array[2147483648]; Segmentation fault 8 GB
 float  array[1950000000]; OK 7.9 GB
 float  array[2147483648]; Segmentation fault 8 GB 
 double array[975000000];  OK 7.9 GB
 double array[1073741824]; Segmentation fault 8 GB
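
The 32 bit compile limit and the 64 bit 4 GB sizes above can be reproduced from the element counts. The sketch assumes the usual element sizes on both systems: 4 bytes for int and float, 8 bytes for double.

```python
GIB = 1024 ** 3
ELEMENT_BYTES = {"int": 4, "float": 4, "double": 8}  # assumed sizes on both systems


def array_bytes(ctype, elements):
    """Total bytes needed for a statically dimensioned array of the given C type."""
    return ELEMENT_BYTES[ctype] * elements


# 32 bit OS: the first rejected sizes are exactly 2^31 bytes
print(array_bytes("int", 536870912) == 2**31)
print(array_bytes("double", 268435456) == 2**31)

# 64 bit OS, 4 GB: largest sizes reported OK, and the first failing size
print(f"{array_bytes('int', 920000000) / GIB:.2f} GB")
print(f"{array_bytes('double', 460000000) / GIB:.2f} GB")
print(f"{array_bytes('int', 1073741824) / GIB:.2f} GB")
```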
Next is High Performance Linpack Benchmark

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Mon Jun 22, 2020 11:00 am

High Performance Linpack Benchmark

I ported my ATLAS version of HPL, which I have run on earlier Raspberry Pi systems, to both the 64 bit and 32 bit SD cards. See my report at ResearchGate:

https://www.researchgate.net/publicatio ... ce_Linpack

The GFLOPS performance of this benchmark increases with higher memory demands, controlled by the variable N. Memory use appears to be the same as for the original scalar Linpack Benchmark, at N x N x 8 bytes for double precision operations. Maximum values of N used below are 20000 for 4 GB RAM and 30000 for 8 GB, requiring 2.98 and 6.71 GB respectively (plus more for code). The following is an example of results using N = 30000.
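
The memory figures quoted here follow from N x N x 8 bytes, expressed in binary gigabytes (2^30 bytes). A minimal sketch of that calculation, covering the matrix only and not code or workspace:

```python
GIB = 1024 ** 3


def hpl_matrix_gb(n, bytes_per_element=8):
    """Memory for an n x n double precision matrix, in GB of 2^30 bytes."""
    return n * n * bytes_per_element / GIB


for n in (8000, 16000, 20000, 30000, 31000):
    print(f"N = {n:5d}: {hpl_matrix_gb(n):.2f} GB")
```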

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       30000   128     2     2            1584.59              1.136e+01
HPL_pdgesv() start time Wed Jun  3 16:53:40 2020

HPL_pdgesv() end time   Wed Jun  3 17:20:04 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008750 ...... PASSED
================================================================================
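
The Gflops figure in the result line can be cross-checked from N and the elapsed time. The sketch assumes the operation count normally credited to HPL, 2/3 N^3 + 3/2 N^2 floating point operations. That count is not stated in this thread.

```python
def hpl_gflops(n, seconds):
    """GFLOPS from problem size and time, assuming 2/3 N^3 + 3/2 N^2 operations."""
    operations = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return operations / seconds / 1e9


# N and Time taken from the WR11C2R4 result line
print(f"{hpl_gflops(30000, 1584.59):.3e} GFLOPS")
```

With these inputs the result agrees with the 1.136e+01 reported by the benchmark.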
Using N slightly higher than this causes an error condition. The following is from trying N=31000, where vmstat demonstrated that swapping out and in could not save the day, after 5 minutes.

Code: Select all

================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1463 RUNNING AT localhost
=   EXIT CODE: 9
================================================================================

vmstat 60 6
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free    buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  0  80084 7719904   4808 119828   11   67   292    78  311  382 27  3 69  1  0
4  0  75068 5784024  11944 133564   92    0   464    29  987  945 39  2 59  1  0
4  0  75004  984636  11952 133800    2    0     6     0 1102   92 97  3  0  0  0
1  9 102396   44916    384  67176   31  481 31096   481 2680 2716 61 17  1 21  0
1 12 102396   45616    376  66804   42   33 66125    33 4402 5375  7 28 13 51  0
1 12 102396   45036    388  67600   17   14 60099    14 3472 4748  2 28  6 63  0
0  0  88216 7755132   3188  98236  107    5 23260     7 1820 2504  2 16 48 34  0
Following are results from tests run without and with a cooling fan in place. The first were for the original Pi 4 with 4 GB RAM, carried out in June 2019. The others, with 8 GB, were run using recent 32 bit and 64 bit Operating System versions, in 2020. With the fan in place, clock speeds were effectively constant at 1500 MHz on all three test rigs, with similar GFLOPS performance at each problem size. In this case, the 4 GB system appeared to be running at a higher temperature, but not high enough to introduce CPU MHz throttling. Note that performance does not improve that much on increasing N.

With no fan in use, throttling occurred on all systems, at N=16000. From then on, the 4 GB system suffered from more of this than the 8 GB models, reflected in higher temperatures and slower performance. The difference is thought to be due to the improvements that have been made in thermal management.

These tests show that the HPL benchmark is an excellent stress testing application that can demonstrate using most of the available RAM while running at high performance levels. The double precision speed approached the 12.6 GFLOPS achieved by one of my benchmarks. The 64 bit version does not appear to benefit from using advanced vector operations, but I could not identify whether other compiling parameters could be included.

Code: Select all

                 No Fan                           Fan

 RAM at bits    N  GFLOPS Seconds  Max °C  Min MHz  GFLOPS Seconds  Max °C Min MHz

 4 GB 32b    8000     8.6      40     81      1500     9.3      37     61     1500
 8 GB 32b    8000     9.7      35     58      1500     9.6      35     57     1500
 8 GB 64b    8000     8.8      39     76      1500     8.7      39     55     1500

 4 GB 32b   16000     6.8     404     86   750/600    10.4     263     70     1500
 8 GB 32b   16000     8.6     319     83      1000    10.4     263     63     1500
 8 GB 64b   16000     8.1     338     84      1000    10.0     273     61     1500

 4 GB 32b   20000     6.2     856     87   750/600    10.8     494     71     1500
 8 GB 32b   20000     8.8     604     85      1000    10.7     497     63     1500
 8 GB 64b   20000     8.5     625     85  1000/600    10.3     519     63     1500

 4 GB 32b   30000     N/A                              N/A
 8 GB 32b   30000     8.2    2195     85  1000/600    11.3    1590     64     1500
 8 GB 64b   30000     7.6    2370     86  1000/600    11.4    1584     63     1500
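
The cost of throttling can be expressed as the percentage of GFLOPS lost without the fan. A small sketch using a few rows from the table above (the function name is hypothetical):

```python
def percent_lost(no_fan_gflops, fan_gflops):
    """Percentage drop in GFLOPS when running without the fan."""
    return (1.0 - no_fan_gflops / fan_gflops) * 100.0


# (system, N, GFLOPS no fan, GFLOPS with fan) taken from the table
rows = [
    ("4 GB 32b", 20000, 6.2, 10.8),
    ("8 GB 32b", 20000, 8.8, 10.7),
    ("8 GB 32b", 30000, 8.2, 11.3),
    ("8 GB 64b", 30000, 7.6, 11.4),
]

for system, n, no_fan, fan in rows:
    print(f"{system} N={n}: {percent_lost(no_fan, fan):.0f}% slower without fan")
```

At N = 20000, the 4 GB system loses a noticeably larger share of its performance than the 8 GB system, in line with its lower minimum clock speeds.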


Below are vmstat details, showing that most of the RAM was in use and all four cores were running at 100% utilisation. Then there are examples of environmental differences between older 32 bit and later 64 bit operation, particularly MHz throttling variations, core voltage and pmic temperature differences.

Code: Select all

 8 GB 64b 30000 
procs  -----------memory--------- ---swap--  -----io---- -system-- ------cpu-----
r  b   swpd    free   buff  cache  si   so    bi    bo   in   cs  us sy id wa st

0  0      0 7422216  83712 264952   0    0   213     4  211  345   2  2 96  1  0
4  0      0 5366940  83720 269572   0    0   144     2 1130  483  82  3 15  0  0
4  0      0 2974924  83728 271960   0    0     0     3 1287  585  97  3  0  0  0
4  0      0  637296  83960 275704   0    0     0    48 1859 2130  96  4  0  0  0
4  0   3072  246724  43176 207604   1   83   141    95 1663 1402  97  3  0  0  0
4  0   3584  243388  32412 191932   3   17    11    23 1110  131 100  0  0  0  
6  0   3584  247168  32420 187520   0    0     0     2 1085   59 100  0  0  0  0
Later
5  0   3584  238580  34324 193432   0    0     4     2 1196  361  99  1  0  0  0
5  0   7936  238124  26356 193392   0  140   386   193 1993 2064  97  3  0  0  0
4  0   7936  247408  27264 194160   1    0    70    11 1889 1888  98  2  0  0  0

 4 GB 32b 20000 No Fan
  485.3   ARM MHz=1000, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C
  506.6   ARM MHz= 750, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  528.0   ARM MHz= 750, core volt=0.8771V, CPU temp=86.0'C, pmic temp=74.1'C
  549.2   ARM MHz= 600, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  570.6   ARM MHz=1000, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  591.9   ARM MHz= 750, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C

 8 GB 64b 30000 No Fan 
 1546.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1577.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1608.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1639.9   ARM MHz=1000, core volt=0.8350V, CPU temp=85.0'C, pmic temp=70.3'C
 1670.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1701.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1732.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
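
For anyone wanting to summarise monitor logs of this kind, here is a minimal Python sketch that parses lines in the format shown above and reports the lowest clock, the highest CPU temperature and the fraction of throttled samples. The regular expression is derived only from the sample lines, and the 1500 MHz nominal clock is an assumption taken from the tables; neither comes from the monitor program itself.

```python
import re

NOMINAL_MHZ = 1500  # assumed full speed clock, as reported in the tables

# Field layout inferred from the sample log lines; an assumption about the full log.
LINE_RE = re.compile(
    r"^\s*(?P<secs>[\d.]+)\s+ARM MHz=\s*(?P<mhz>\d+),\s*core volt=(?P<volt>[\d.]+)V,"
    r"\s*CPU temp=(?P<cpu>[\d.]+)'C,\s*pmic temp=(?P<pmic>[\d.]+)'C"
)

def parse_monitor(text):
    """Return one dict of floats per recognised log line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append({k: float(v) for k, v in m.groupdict().items()})
    return rows

def summarise(rows):
    """Lowest MHz, highest CPU temperature and share of samples below nominal MHz."""
    return {
        "min_mhz": min(r["mhz"] for r in rows),
        "max_cpu_c": max(r["cpu"] for r in rows),
        "throttled_fraction": sum(r["mhz"] < NOMINAL_MHZ for r in rows) / len(rows),
    }

SAMPLE = """\
 8 GB 64b 30000 No Fan
 1546.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1577.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1608.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
"""
print(summarise(parse_monitor(SAMPLE)))
```

Note that these are instantaneous samples, so a short throttling episode between samples would not be seen.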


Next are Stress Tests

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Mon Jun 22, 2020 4:15 pm

Stress Tests

Floating Point Stress Tests Benchmarking Mode - These stress tests have a benchmarking mode that provides choices for a long running test. The options cover number of threads, floating point operations carried out on each data word, and memory size to cover caches and RAM. Numeric sumchecks are carried out, where the same number of calculations apply at different thread counts, in each section. Below are results for both 64 bit and 32 bit compilations, where sumchecks were identical. The average 64 bit/32 bit performance ratio, for one to four cores, was 1.40, with a maximum of 1.98 and a minimum of 0.84, ratios below 1.0 only being apparent at 8 operations per word.

Code: Select all

                    64 Bits MFLOPS       Numeric Results      32 Bits MFLOPS
             Ops/   KB    KB    MB      KB     KB     MB      KB    KB    MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8    12.8   128  12.8

  Single Precision
   0.9    T1   2  3845  4032  1232   40394  76395  99700    2134  2607   656
   1.6    T2   2  7947  7992  1083   40394  76395  99700    5048  5156   621
   2.3    T4   2 14295 14760  1145   40394  76395  99700    7536  9939   681
   3.0    T8   2 13427 14985  1166   40394  76395  99700    7934  9839   639
   4.9    T1   8  4665  4740  3200   54764  85092  99820    5535  5420  2569
   6.0    T2   8  9334  9453  4143   54764  85092  99820   10757 10732  2454
   6.9    T4   8 17902 18462  4693   54764  85092  99820   18108 20703  2444
   7.7    T8   8 17473 18460  4570   54764  85092  99820   19236 20286  2245
  13.0    T1  32  5827  5869  5861   35206  66015  99520    5309  5270  5262
  15.6    T2  32 11712 11729 11524   35206  66015  99520   10551 10528  9753
  17.2    T4  32 23149 22887 16343   35206  66015  99520   20120 20886 11064
  18.7    T8  32 22202 23048 16411   35206  66015  99520   19415 20464  9929

  Double Precision
   1.8    T1   2  1802  1878   587   40395  76384  99700     921   998   326
   3.4    T2   2  3716  3741   527   40395  76384  99700    1968  1995   308
   4.8    T4   2  6814  7335   547   40395  76384  99700    3465  3925   342
   6.1    T8   2  6633  7011   588   40395  76384  99700    3646  3702   301
   9.2    T1   8  2738  2796  2014   54805  85108  99820    2377  2446  1283
  11.4    T2   8  5598  5582  2114   54805  85108  99820    4916  4860  1326
  13.0    T4   8 10545 11132  2196   54805  85108  99820    9202  9510  1391
  14.7    T8   8 10693 10849  2149   54805  85108  99820    9090  9006  1298
  24.1    T1  32  3280  3296  3279   35159  66065  99521    2695  2725  2707
  28.8    T2  32  6583  6588  6430   35159  66065  99521    5416  5441  5121
  31.6    T4  32 12785 13162  8477   35159  66065  99521   10666 10831  5275
  34.4    T8  32 12718 12781  8816   35159  66065  99521   10427 10602  4832
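
The 64 bit/32 bit ratios quoted for this table are simple quotients of the MFLOPS columns. The Python sketch below shows the calculation for three single precision, 12.8 KB, one thread entries copied from the table; the function name is illustrative only.

```python
def speed_ratio(mflops_64: float, mflops_32: float) -> float:
    """64 bit speed divided by 32 bit speed; above 1.0 means 64 bit is faster."""
    return mflops_64 / mflops_32

# (threads, ops/word, 64 bit MFLOPS, 32 bit MFLOPS), single precision, 12.8 KB
rows = [
    ("T1", 2, 3845, 2134),
    ("T1", 8, 4665, 5535),
    ("T1", 32, 5827, 5309),
]

for threads, ops, m64, m32 in rows:
    print(threads, ops, round(speed_ratio(m64, m32), 2))
```

The 8 operations per word entry is the one that falls below 1.0, matching the minimum of 0.84 quoted above.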
Floating Point Stress Tests - Below are results from 10 minute stress tests, showing measured GFLOPS and CPU temperatures, for fanless operation. CPU MHz variations were between 1500/1000/750 at 32 bits and 1500/1000 for all 64 bit tests. This again indicates improved thermal management, with lower temperatures, faster average speeds and less degradation over the testing period.

Code: Select all

          Original 32 Bits     ------------------ 64 Bits ------------------
             
               8 Ops/word      8 Ops/word      32 Ops/Word     32 Ops/Word DP
    Seconds    °C  GFLOPS      °C  GFLOPS       °C  GFLOPS      °C  GFLOPS

        0      61              59               58              58
       20      76    19.2      65    18.4       71    22.9      73    12.9
       40      81    19.0      74    18.2       74    23.1      77    12.9
       60      82    17.8      76    18.4       76    22.9      78    12.9
       80      83    15.5      78    18.1       78    23.0      80    13.0
      100      84    15.0      78    18.1       79    23.0      83    12.4
      120      83    14.0      82    18.2       81    23.0      82    11.7
      140      84    13.3      82    17.6       82    22.5      82    11.2
      160      84    13.3      81    16.8       82    21.6      82    10.9
      180      86    12.9      82    16.3       82    21.0      83    10.9
      200      85    13.0      82    16.2       82    20.7      83    10.5
      220      84    12.8      82    15.8       82    20.4      82    10.2
      240      84    12.6      83    15.6       82    20.1      83    10.2
      260      83    12.6      83    15.9       83    19.9      82    10.2
      280      85    12.2      83    15.3       82    19.9      83    10.0
      300      84    12.1      83    15.4       81    19.6      83     9.9
      320      85    12.0      83    15.5       82    19.5      82     9.7
      340      84    11.6      82    15.2       82    19.5      82     9.9
      360      85    11.6      83    14.7       83    19.3      83     9.8
      380      85    11.3      82    14.7       82    19.2      83     9.6
      400      85    11.6      83    14.8       82    19.0      83     9.6
      420      84    11.6      83    14.9       82    18.9      82     9.5
      440      85    11.5      82    14.6       83    18.8      82     9.6
      460      84    11.5      83    14.9       83    18.7      82     9.5
      480      85    11.5      83    14.6       82    18.8      83     9.5
      500      84    11.1      83    14.7       83    18.8      83     9.5
      520      85    11.3      82    14.6       82    18.6      83     9.4
      540      84    11.4      83    14.7       82    18.7      83     9.4
      560      84    11.3      83    14.6       82    18.7      83     9.6
      580      85    11.3      83    14.6       83    18.4      83     9.6
      600      85    11.3      83    14.5       83    18.5      83     9.7

 Average     83.9    12.9    81.2    15.9     81.1    20.2    81.9    10.5
 Min/max             0.58            0.78             0.80            0.72
Integer Stress Tests Benchmarking Mode - This program has variables for number of threads, memory required and running time. The test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Performance is measured in MBytes per second. Full results show the varying hexadecimal data patterns used and the verification comparisons, which are not included in the summary benchmarking mode log below. Here, it can be seen that performance was much slower with the latest gcc 8 64 bit compilation. Earlier 64 bit results (the 2019 columns) confirm that the poor performance was due to a compiling issue.

Code: Select all

                              Benchmark MBytes/second

         ------ 32 Bits ------    ------ 64 Bits ------    --- 2019 64 Bits ---
            KB      KB      MB      KB       KB      MB      KB      KB      MB
 Threads    16     160      16      16      160      16      16     160      16

     1    5956    5754    3977    2878     2936    2602    5928    6786    3903
     2   11861   11429    3763    5855     5817    3641   14468   13292    3772
     4   22998   21799    3464   11403    11416    3564   27146   25103    3425
     8   22695   21128    3490   10853    11297    3557   27576   24844    3432
    16   22835   23491    3485   11069    11612    3548   27365   28511    3434
    32   22593   23485    3591   10790    11646    3758   26377   28527    3455
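
To illustrate the idea of the test loop described above (8 subtracts followed by 8 adds that restore the original pattern), here is a minimal Python sketch using 32 bit wraparound arithmetic. It is not the benchmark's own code: the operand values, word count and function names are hypothetical, chosen only to show why an unchanged pattern at the end indicates correct operation.

```python
MASK = 0xFFFFFFFF  # emulate 32 bit unsigned wraparound

# Eight hypothetical operands; the real benchmark's values are not given in the post.
DELTAS = [(0x11111111 * k) & MASK for k in range(1, 9)]

def exercise(words, deltas=DELTAS):
    """Apply 8 subtracts then 8 adds to every word, with 32 bit wraparound."""
    out = list(words)
    for d in deltas:
        out = [(w - d) & MASK for w in out]
    for d in deltas:
        out = [(w + d) & MASK for w in out]
    return out

def verify(pattern, n_words=1024, repeats=2):
    """Two sub/add sequences give 32 operations; the pattern must be restored."""
    words = [pattern] * n_words
    for _ in range(repeats):
        words = exercise(words)
    return all(w == pattern for w in words)

for p in (0x00000000, 0xFFFFFFFF, 0x5A5A5A5A, 0xAAAAAAAA, 0xCCCCCCCC, 0x0F0F0F0F):
    print(f"{p:08X}", "OK" if verify(p) else "ERROR")
```

Any bit flipped in memory or in the arithmetic units during the sequence would leave a word different from the starting pattern, which is what the final comparison detects.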

Integer Stress Tests RAM and CPU - Next are details from two stress tests, running without an operational fan. The first represents one user demanding 7600000 KB (7.25 GB) of memory space. Performance throughout was effectively the same as the memory speed indicated by the benchmark (1 thread 16 MB), CPU MHz being constant, with little change in temperatures. As shown by vmstat details, some data was swapped out, to make room for that of the application.

The second stress test involved 8 threads and cache based data, initially running at maximum CPU speed (for this code). This time, there was CPU clock throttling, down to 1000 MHz, with CPU temperature rising to 84°C and a 31% decrease in measured MBytes per second.

Code: Select all

                 Stress Test 1 Log at Start  
           
              Data                          Same All
 Seconds      Size Threads   MB/sec Sumcheck Threads

   20.0 7600000 KB      1      2606 00000000  Yes   
   57.8 7600000 KB      1      2604 FFFFFFFF  Yes   
   91.0 7600000 KB      1      2575 5A5A5A5A  Yes   
  129.5 7600000 KB      1      2608 AAAAAAAA  Yes   

                 Stress Test 1 vmstat 10 second samples

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

0  0      0 7433336  83140 266140    0    0   222     2  177  273  1  1 97  1  0
1  0      0 5535964  83152 268248    0    0     2     7  501  707  2  7 92  0  0
1  8  69888   63404   1048 106744   16 6943   515  6951  664  506  3 18 54 25  0
1  0  67072   63916   4548 123920    3    0     3     8  468  260 26  1 74  0  0
 Later to end
1  0  95336   62748   4868 135672    4    0     4     6  475  274 26  1 73  0  0


                  Stress Test 1                       Stress Test 2
       ------- 7600000 KB 1 Thread -------  ------- 1280 KB 8 Threads --------
Secs   MB/sec   MHz   Volts °C CPU °C PMIC  MB/sec   MHz   Volts °C CPU °C PMIC
 
   0           1500  0.8500    59    55.2           1500  0.8600    57    54.3
   20    2606  1500  0.8500    61    55.2   10902   1500  0.8600    70    55.2
   40    2599  1500  0.8500    63    55.2   10267   1500  0.8600    73    58.0
   60    2604  1500  0.8500    63    56.2   10150   1500  0.8600    75    59.0
   80    2575  1500  0.8500    65    57.1   11046   1500  0.8600    79    61.8
  100    2566  1500  0.8500    65    57.1   11039   1500  0.8600    80    62.8
  120    2605  1500  0.8500    66    58.0   10503   1000  0.8600    81    64.6
  140    2608  1500  0.8500    66    58.0    8780   1500  0.8600    82    65.6
  160    2583  1500  0.8500    67    59.0    8501   1500  0.8600    82    66.5
  180    2605  1500  0.8500    66    59.0    8704   1500  0.8600    83    66.5
  200    2604  1500  0.8500    66    59.0    8507   1500  0.8600    83    66.5
  220    2608  1500  0.8500    67    59.0    8829   1000  0.8600    83    67.5
  240    2608  1500  0.8500    68    59.0    8749   1000  0.8600    82    67.5
  260    2605  1500  0.8500    68    59.0    8542   1500  0.8600    83    68.4
  280    2573  1500  0.8500    67    59.0    8500   1000  0.8600    82    67.5
  300    2601  1500  0.8500    68    59.0    8434   1000  0.8600    83    68.4
  320    2607  1500  0.8500    68    59.0    8360   1500  0.8600    83    68.4
  340    2605  1500  0.8500    68    59.0    8302   1000  0.8600    83    68.4
  360    2575  1500  0.8500    67    59.0    8179   1000  0.8600    82    68.4
  380    2608  1500  0.8500    68    59.0    8102   1000  0.8600    84    68.4
  400    2584  1500  0.8500    68    59.0    8215   1500  0.8600    84    68.4
  420    2575  1500  0.8500    68    59.0    8070   1000  0.8600    82    69.4
  440    2574  1500  0.8500    66    59.0    8042   1500  0.8600    82    69.4
  460    2608  1500  0.8500    67    59.0    7945   1500  0.8600    82    69.4
  480    2581  1500  0.8500    68    59.0    8100   1000  0.8600    84    69.4
  500    2583  1500  0.8500    67    59.0    8024   1000  0.8600    84    69.4
  520    2609  1500  0.8500    69    59.0    7933   1000  0.8600    82    69.4
  540    2602  1500  0.8500    67    60.9    7813   1000  0.8600    84    69.4
  560    2606  1500  0.8500    68    59.0    7988   1500  0.8600    83    69.4
  580    2606  1500  0.8500    69    60.9    7882   1000  0.8600    83    69.4
  600    2704  1500  0.8500    69    60.9    7597   1500  0.8600    83    69.4
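
The vmstat samples in the log above can be picked apart with a few lines of Python. This is a minimal sketch that assumes the 17 column layout shown in the header lines (r, b, swpd ... wa, st) and vmstat's default units of KiB; the helper name is illustrative.

```python
VMSTAT_FIELDS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                 "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]

def parse_vmstat_line(line):
    """Return a dict for a 17 column numeric vmstat line, or None for anything else."""
    parts = line.split()
    if len(parts) != len(VMSTAT_FIELDS) or not all(p.isdigit() for p in parts):
        return None
    return dict(zip(VMSTAT_FIELDS, map(int, parts)))

# The sample where swapping started in Stress Test 1.
sample = "1  8  69888   63404   1048 106744   16 6943   515  6951  664  506  3 18 54 25  0"
row = parse_vmstat_line(sample)
print("swapped KiB:", row["swpd"], "swap out/s:", row["so"], "I/O wait %:", row["wa"])
```

Header lines and notes such as "Later to end" return None, so a whole log can be passed through line by line.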
Next 64 GB SD Card

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Tue Jun 23, 2020 1:06 pm

64 GB SD Card

My initial 64 bit Raspberry Pi OS was installed on a 16 GB SD card, later cloned (by Windows Win32DiskImager) to one with 32 GB capacity. It soon became apparent that this was too small to handle extra large files on the main drive. So I bought a 64 GB higher speed version, which, surprisingly, resized free space after booting. I then ran some tests to see how much of this could be used.

USB Drive - The first exercise was to compare performance of 64 GB and 32 GB SanDisk cards, using a USB 3 card reader, via DriveSpeed Direct I/O. The former has maximum MB/second specifications of read 160 and write 60, the latter only specifying read at 98.

For the large file tests, handling near 6 GB (3 x 2000 MB), reading speeds were similar, with the 64 GB card being much faster on writing (as the specification might suggest). Random access and small file performance were also similar.

Code: Select all

############################ USB 3 ############################

64 GB Total MB   59639, Free MB   48318, Used MB   11321
32 GB Total MB   29643, Free MB   19707, Used MB    9936

                            MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  2000    58.77    59.24    59.10    68.68    69.18    68.84
32 GB  2000    21.23    21.14    21.16    70.22    70.27    70.33
Main Drive - As indicated earlier, DriveSpeed benchmark would not run on the main drive under the 64 bit OS. So, LanSpeed, with caching, was used instead, to test the 64 GB drive. Large files were specified to minimise caching effects, up to nearly 24 GB (3 x 8), in this case. As for DriveSpeed, writing and reading is in 1 MB blocks, with little impact on free memory space.

Writing and reading speeds were different to those using USB, but results for the 64 GB drive were still faster on writing than on the 32 GB card.

Using the 64 GB card, output from vmstat, with 10 second sampling shown below, indicates that most of the memory was used then released, repeating the activity for the second three files. As observed in other tests, it seems that writing of cached data is deferred, overlapped with reading.

Code: Select all

########################## Main Drive #########################

                        MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  4000    54.53    38.01    38.91    32.91    45.90    45.88
64 GB  8000    43.16    36.73    36.63    38.34    45.90    45.91
32 GB  6000    24.14    18.80    19.39    31.07    45.60    45.76

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo
Start/Write 4K
       0 6430000 1024660  317064    0    0   270     3
       0 4232720 1024696 2511212    0    0     0 27790
       0 3138388 1024744 3605256    0    0     0 37690
Write/Read
     512  258740  427420 7089616    0    0  8336 30214
     512  67632   400000 7309488    0    0 24475 14101
     512  61368   340176 7376464    0    0 44800     0
Delete/Read/Write
     512   56868  121324 7600856    0    0 44817     0
     512 5605880  115092 2057148    0    0 18298 17233
     512 4472096  115140 3191272    0    0     0 36872
Write/Read 8K
     512  267968   17524 7492716    0    0     3 33253
     512   75996   17596 7684276    0    0  8107 31443
     512   63056   17652 7698440    0    0 44817     0
End  512 7521128   18700  238356    0    0 37260     0
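
The benchmarks above write and read files in 1 MB blocks and report MBytes/second. The Python sketch below shows that general measurement approach on a small temporary file. It is not the DriveSpeed or LanSpeed code; the function names and the tiny file size are assumptions for illustration, and with a file this small the read will come from the file cache, which is exactly the effect the large files above are intended to minimise.

```python
import os
import tempfile
import time

BLOCK = 1024 * 1024  # 1 MB blocks, as described for the benchmarks

def timed_write(path, n_blocks):
    """Write n_blocks of 1 MB and return MB/second, including the flush to disk."""
    buf = bytes(BLOCK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # include deferred writes in the timing
    elapsed = max(time.perf_counter() - start, 1e-9)
    return n_blocks / elapsed

def timed_read(path):
    """Read the file back in 1 MB blocks and return MB/second."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / BLOCK / elapsed

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sketch.bin")
    print("write MB/s:", round(timed_write(path, 8), 1))
    print("read  MB/s:", round(timed_read(path), 1))
```

Without the fsync call, the write timing would mostly measure copying into the cache, in line with the deferred writing noted in the vmstat output.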
Huge File - Finally, an example of results from separate write/read and read only benchmarks, with caching enabled, is provided below. This just deals with large files, where up to three can be selected. In this case, one file of near 40 GB was written. The read only test loads the data into an array in RAM, where the maximum size that could be used, in this case, was restricted to be around 6 GB. In some cases, the reading speed can only be measured following a reboot. It made no difference, in this case.

Code: Select all

 Before  Total MB   59639, Free MB   48324, Used MB   11315
 After   Total MB   59639, Free MB    8325, Used MB   51314

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

40000    36.65                      45.89
Read only
 6000     N/A      N/A      N/A     45.74

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo

Example write
     256  270432   33192 7473084    0    0     1 36069
Example read
     256   62384   31332 7681720    0    0 44809     0
Example read only after reboot
     256  272032   25052 3041320    0    0 44812     0 
Next is a system stress test

ejolson
Posts: 10972
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Tue Jun 23, 2020 2:35 pm

I think SanDisk may manufacture more than one grade of 32GB SD card. Do you know which SanDisk card you tested?

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Tue Jun 23, 2020 8:58 pm

ejolson wrote:
Tue Jun 23, 2020 2:35 pm
I think SanDisk may manufacture more than one grade of 32GB SD card. Do you know which SanDisk card you tested?
It is SanDisk Ultra that I happened to have a bunch of. The 64 GB drive is a SanDisk Extreme that I deliberately bought to test handling large files that are more applicable for 64 bit working. I just measured performance of a SanDisk Ultra on my PC via USB 3. It was the same speed on writing large files as on the Pi, at 21 MB/sec, but faster on reading at 95 MB/sec, near the spec of 98 MB/sec.

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Wed Jun 24, 2020 3:17 pm

System Stress Tests

These stress tests were run twice, once with a cooling fan in use and then with the fan disabled. Using a script file, six applications were started at the same time, each set to run for around 15 minutes. They covered known stressful floating point calculations, exercising near 6 GB of RAM, a demanding full screen OpenGL test and a main drive stress tester, along with vmstat measuring system utilisation, and my CPU MHz, voltage and temperature monitor.

On running these, as indicated in the environmental monitor, the system ran at much higher temperatures, with no fan in use, but with no indication of CPU MHz throttling in the periodic instantaneous measurement samples. Vmstat recordings were virtually the same, with and without cooling, starting with MP-IntStress64g8 grabbing near 6 GB of RAM, with continuing CPU utilisation of around 82% (3 cores at 100%, one at 28%) and, after a short write phase, the main drive being read at 30 MB/second.

Code: Select all

############## With Cooling #############  ############### No Cooling ##############

================== CPU MHz CPU Voltage and Temperature Measurement =================

Secs  Start at Wed Jun 10 12:56:49 2020    Secs  Start at Wed Jun 10 13:19:58 2020
  0 ARM MHz=1500 0.85V CPU=39'C pmic=34'C    0 ARM MHz=1500 0.85V CPU=40'C pmic=35'C
 60 ARM MHz=1500 0.85V CPU=47'C pmic=39'C   60 ARM MHz=1500 0.85V CPU=58'C pmic=46'C
120 ARM MHz=1500 0.85V CPU=50'C pmic=41'C  120 ARM MHz=1500 0.85V CPU=65'C pmic=53'C
180 ARM MHz=1500 0.85V CPU=50'C pmic=42'C  180 ARM MHz=1500 0.85V CPU=68'C pmic=55'C
241 ARM MHz=1500 0.85V CPU=49'C pmic=41'C  241 ARM MHz=1500 0.85V CPU=71'C pmic=59'C
301 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  301 ARM MHz=1500 0.85V CPU=74'C pmic=60'C
362 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  362 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
422 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  422 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
483 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  482 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
543 ARM MHz=1500 0.85V CPU=51'C pmic=41'C  543 ARM MHz=1500 0.85V CPU=77'C pmic=64'C
604 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  603 ARM MHz=1500 0.85V CPU=78'C pmic=65'C
664 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  664 ARM MHz=1500 0.85V CPU=81'C pmic=66'C
725 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  724 ARM MHz=1500 0.85V CPU=80'C pmic=67'C
785 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  785 ARM MHz=1500 0.85V CPU=81'C pmic=67'C
846 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  845 ARM MHz=1500 0.85V CPU=76'C pmic=66'C
906 ARM MHz=1500 0.85V CPU=46'C pmic=42'C  905 ARM MHz=1500 0.85V CPU=73'C pmic=65'C
966 ARM MHz=1500 0.85V CPU=40'C pmic=37'C  966 ARM MHz=1500 0.85V CPU=65'C pmic=60'C
End at   Wed Jun 10 13:12:56 2020          End at   Wed Jun 10 13:36:04 2020

============================== vmstat 60 second samples =============================

  Memory MB        Swap MB/sec %utilise      Memory MB        Swap MB/sec %utilise
swpd free buf cach si so bi bo us sy id wa swpd free buf cach si so bi bo us sy id wa

   0 7231  45  486  0  0  1  0 14  2 81  3    0 7231  45  486  0  0  1  0 14  2 81  3
   0 1147  45  533  0  0 11 11 71 11  1 17    0 1147  45  533  0  0 11 11 71 11  1 17
   0 1145  45  535  0  0 29  0 76  8  1 16    0 1145  45  535  0  0 29  0 76  8  1 16
   0 1142  45  538  0  0 30  0 75  8  1 17    0 1142  45  538  0  0 30  0 75  8  1 17
   0 1142  45  536  0  0 30  0 75  7  1 17    0 1142  45  536  0  0 30  0 75  7  1 17
   0 1143  45  536  0  0 30  0 75  7  1 17    0 1143  45  536  0  0 30  0 75  7  1 17
   0 1141  45  539  0  0 30  0 75  7  1 17    0 1141  45  539  0  0 30  0 75  7  1 17
   0 1141  45  538  0  0 30  0 75  8  1 16    0 1141  45  538  0  0 30  0 75  8  1 16
   0 1138  45  541  0  0 30  0 75  7  1 17    0 1138  45  541  0  0 30  0 75  7  1 17
   0 1141  45  536  0  0 30  0 76  7  0 17    0 1141  45  536  0  0 30  0 76  7  0 17
   0 1139  45  540  0  0 30  0 75  7  1 16    0 1139  45  540  0  0 30  0 75  7  1 16
   0 1140  46  539  0  0 30  0 74  7  2 17    0 1140  46  539  0  0 30  0 74  7  2 17
   0 1143  46  536  0  0 30  0 75  7  2 17    0 1143  46  536  0  0 30  0 75  7  2 17
   0 1139  46  537  0  0 30  0 75  7  1 16    0 1139  46  537  0  0 30  0 75  7  1 16
   0 1143  46  537  0  0 31  0 61  7 13 18    0 1143  46  537  0  0 31  0 61  7 13 18
   0 1142  46  537  0  0 31  0 52  7 21 20    0 1142  46  537  0  0 31  0 52  7 21 20
Livermore Loops - A variation of the Livermore Loops Benchmark has options to change the running time of each of the 72 floating point kernels (24 loops x 3), so that overall running time can be controlled for stress testing purposes. Results are also checked for correctness, and log numbers are assigned to enable multiple copies to be run.

Code: Select all

======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020

Numeric results were as expected           Numeric results were as expected
MFLOPS for 24 loops                        MFLOPS for 24 loops
2061.5  944.0  950.8  946.9  362.4  646.6  1498.8  991.4  920.0  733.5  370.3  561.1
2073.5 2695.3 1403.8  547.2  493.9  959.9  2202.2 2453.3 1991.9  711.4  473.4  676.4
 206.5  362.3  794.9  634.4  721.9 1143.2   178.3  349.0  766.6  601.3  641.1 1007.9
 411.8  367.7 1469.5  389.4  739.6  306.1   435.3  376.9 1530.5  365.2  801.5  309.5

Maximum Average Geomean Harmean Minimum     Maximum Average Geomean Harmean Minimum
 2698.1   912.3   737.2   602.3   187.7      2654.4   924.2   742.1   597.9   158.9

End of test Wed Jun 10 13:11:53 2020       End of test Wed Jun 10 13:33:21 2020 
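
The summary line above reports maximum, average, geometric mean, harmonic mean and minimum MFLOPS. As a minimal sketch of how such figures can be derived, the Python below uses the standard statistics module (Python 3.8 or later). The sample input is just the first six with cooling values from the table, so the output will not match the logged summary, which covers all 72 kernel results.

```python
from statistics import geometric_mean, harmonic_mean, mean

def summarise_mflops(values):
    """Return the five summary figures reported by the benchmark log."""
    return {
        "maximum": max(values),
        "average": mean(values),
        "geomean": geometric_mean(values),
        "harmean": harmonic_mean(values),
        "minimum": min(values),
    }

# Sample input only: first six "With Cooling" loop results from the table above.
sample = [2061.5, 944.0, 950.8, 946.9, 362.4, 646.6]
for name, value in summarise_mflops(sample).items():
    print(f"{name:8s} {value:8.1f}")
```

The harmonic mean is always the lowest of the three averages and is weighted towards the slowest loops, which is why it is useful for judging sustained performance.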
MP Integer RAM Exerciser and OpenGL Benchmark - These both report results as the tests progress. Performance for both is provided together below. The former is testing near 6 GB of RAM and the latter running the OpenGL kitchen display test at 1920 x 1080 pixels. Performance varied over the whole period, probably due to the influence of the other programs, but averages over the 15 minutes were no different, with and without cooling.

Code: Select all

MP Integer RAM and OpenGL Tests    With Cooling    No Cooling

                                   Integer    OGL  Integer    OGL
 Seconds     KB Threads  Pattern   MB/sec     FPS  MB/sec     FPS

      30 6000000   1    00000000     1978      21    1999      21
      60 6000000   1    FFFFFFFF     1976      21    1864      21
      90 6000000   1    FFFFFFFF     2053      20    1979      21
     120 6000000   1    5A5A5A5A     1918      18    1762      20
     150 6000000   1    AAAAAAAA     1867      19    2066      20
     180 6000000   1    CCCCCCCC     2113      19    1974      21
     210 6000000   1    0F0F0F0F     1841      20    1995      20
     240 6000000   1    FFFFFFFF     1902      20    1928      21
     270 6000000   1    FFFFFFFF     1971      20    2089      20
     300 6000000   1    00000000     2033      20    2084      19
     330 6000000   1    5A5A5A5A     1863      21    1840      21
     360 6000000   1    AAAAAAAA     1974      21    1966      22
     390 6000000   1    AAAAAAAA     2012      21    1956      19
     420 6000000   1    CCCCCCCC     1929      20    1860      20
     450 6000000   1    0F0F0F0F     1964      20    1911      22
     480 6000000   1    00000000     1954      20    2007      21
     510 6000000   1    FFFFFFFF     2019      21    2010      20
     540 6000000   1    FFFFFFFF     1987      21    1999      21
     570 6000000   1    5A5A5A5A     1836      21    1981      21
     600 6000000   1    AAAAAAAA     1991      21    2551      18
     630 6000000   1    CCCCCCCC     1837      21    1996      20
     660 6000000   1    0F0F0F0F     2025      12    1824      21
     690 6000000   1    FFFFFFFF     2017      20    1870      21
     720 6000000   1    FFFFFFFF     2017      20    1843      21
     750 6000000   1    00000000     1858      21    1847      21
     780 6000000   1    5A5A5A5A     2100      21    1905      21
     810 6000000   1    AAAAAAAA     2008      20    1963      21
     840 6000000   1    CCCCCCCC     1966      21    1962      21
     870 6000000   1    CCCCCCCC     1983      21    1980      21
     900 6000000   1    0F0F0F0F     1970      20    1897      21

                        Average      1965    20.1    1964    20.6
BurnInDrive uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this case 16 blocks per pattern for 164 MB files (164 x 16 x 64 KB). Four of these are written then read by random selection for a specified time. Finally, blocks are read continuously for a specified number of seconds. For more information see:

https://www.researchgate.net/publicatio ... ce_Linpack

Again, there was no real difference with and without cooling. Measured performance, such as 33 x 4 x 164 MB read in 12.32 minutes, equates to 29.3 MB/second, of the same order as that measured by vmstat.

Code: Select all

Current Path: /home/pi/0test/morestress  Total MB 59639 Free MB 20353, Used MB 39286

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020
File 1 164 MB written 9.19 seconds         File 1 164 MB written in 9.15 seconds
File 2 164 MB written 9.05 seconds         File 2 164 MB written in 8.94 seconds
File 3 164 MB written 9.63 seconds         File 3 164 MB written in 9.67 seconds
File 4 164 MB written 8.91 seconds         File 4 164 MB written in 8.97 seconds
Total                36.78 seconds         Total                   36.74 seconds

Start Reading Wed Jun 10 12:57:26 2020     Start Reading Wed Jun 10 13:20:35 2020
Passes  1 x 4 Files x 164 MB  0.38 minutes Passes  1 x 4 Files x 164 MB  0.38 minutes
Passes  2 x 4 Files x 164 MB  0.76 minutes Passes  2 x 4 Files x 164 MB  0.75 minutes
Passes  3 x 4 Files x 164 MB  1.13 minutes Passes  3 x 4 Files x 164 MB  1.14 minutes
To
Passes 31 x 4 Files x 164 MB 11.58 minutes Passes 31 x 4 Files x 164 MB 11.56 minutes
Passes 32 x 4 Files x 164 MB 11.95 minutes Passes 32 x 4 Files x 164 MB 11.93 minutes
Passes 33 x 4 Files x 164 MB 12.32 minutes Passes 33 x 4 Files x 164 MB 12.31 minutes

Start Repeat Read Wed Jun 10 13:09:45 2020 Start Repeat Read Wed Jun 10 13:32:53 2020
Passes in 1 second for 164 blocks of 64KB  Passes in 1 second for 164 blocks of 64KB

460 500 540 540 520 440 420 480 540 520    560 560 480 440 440 500 540 540 520 440
520 440 440 460 540 540 520 460 440 440    440 440 540 540 540 420 420 420 520 540
540 540 540 440 440 420 540 540 540 440    540 480 440 460 500 540 540 480 440 460
To                                         To
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580

83300 Passes of 64KB blocks  2.78 minutes  83900 Passes of 64KB blocks 2.78 minutes
No errors found during reading tests       No errors found during reading tests
End of test Wed Jun 10 13:12:32 2020       End of test Wed Jun 10 13:35:40 2020
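
The 29.3 MB/second figure quoted for the random read phase follows directly from the pass counts and elapsed minutes in the log above. A minimal Python sketch of the arithmetic, with an illustrative function name:

```python
def read_throughput_mb_s(passes: int, files: int, file_mb: float, minutes: float) -> float:
    """Total MB read divided by elapsed seconds."""
    return passes * files * file_mb / (minutes * 60.0)

# With cooling: 33 passes x 4 files x 164 MB in 12.32 minutes
print(round(read_throughput_mb_s(33, 4, 164, 12.32), 1))
# No cooling: same data volume in 12.31 minutes
print(round(read_throughput_mb_s(33, 4, 164, 12.31), 1))
```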
Next is Power Over Ethernet

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Thu Jun 25, 2020 9:44 am

Power Over Ethernet

I recently carried out tests of Raspberry Pi 4 systems using power supplied over LAN cables. My report is at ResearchGate in:

https://www.researchgate.net/publicatio ... r_Ethernet

This covers using long, short, thick and thin cables, measuring data transmission speeds and the ability to run using my most power consuming benchmarks, particularly with the only wire connected to the Pi being the Ethernet cable. Screenshots of remote control via Windows, Linux and Android are provided. PoE requires additional hardware that injects high voltage power on to the cable and, at the other end, converts it to that normally used by the destination device. For Raspberry Pi, there is a PoE HAT, with a fan, for this purpose, or separate fanless connectors can be obtained.

A few simple tests were run on the configuration being considered here, simply to verify that the facility was operational. In this case, 48 metres of CAT 6 cables were used and a fanless connector (the 8 GB Pi was fitted with an inexpensive fan). A hard disk and a USB flash drive were plugged in to USB 3 sockets, but not in use. The tests were executed via remote control Terminals, using PuTTY on a Windows 7 based PC. After the first one, the only wire plugged in to the Pi was the power connector, from the PoE converter, with communication via WiFi. Results below were all copied from the Windows PuTTY displays.

Network Speed - The first tests were run using the LAN Benchmark, with only large file results shown. The Ethernet performance was at the same 1 Gbps speeds identified earlier. WiFi was from a greater distance, apparently mainly at 2.4 GHz speeds.

Code: Select all

################ Data Transmission Speeds ################

                       MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

Ethernet
 512    80.81    81.27    83.18   112.53   111.69   112.38
1024    93.91    91.64    88.02   112.68   112.64   112.68

WiFi
  50     7.28     8.55     8.15     5.51     6.10     6.37
 100     5.95     7.97     7.14     6.58     5.26     6.75
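As a rough cross-check of the 1 Gbps statement, the MBytes/second figures above can be converted to megabits per second. This is a minimal sketch, assuming 1 MByte means 1,000,000 bytes (if the benchmark counts 1,048,576 bytes the results would be about 5% higher); the function name is made up for illustration.

```python
def mbytes_to_mbits(mbytes_per_second):
    """Convert MBytes/second to Mbits/second, assuming decimal megabytes."""
    return mbytes_per_second * 8

# Best large-file results copied from the table above.
results = {
    "Ethernet write": 93.91,
    "Ethernet read": 112.68,
    "WiFi write": 8.55,
    "WiFi read": 6.75,
}

for name, speed in results.items():
    print(f"{name:15s} {speed:7.2f} MB/s = {mbytes_to_mbits(speed):7.1f} Mbps")
```

On that assumption, the best Ethernet read of 112.68 MB/s corresponds to about 901 Mbps of payload, which is consistent with a 1 Gbps link once protocol overheads are allowed for.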
Power - The other example is from running a Floating Point Stress Test for 10 minutes, with 8 threads running continuously at the same speed of nearly 24 GFLOPS (about 23.4 GFLOPS in the log below). The vmstat report indicates 8 processes in use and 100% CPU utilisation (of 4 cores) over the whole period. With the fan in use, temperature increases were modest, with the CPU rising from 34°C at idle to between 51°C and 54°C under load. Core voltage did not change between idle and full speed operation.

Code: Select all

############ High Power Demanding CPU Stress Test ###########
 
           Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

    9.3  1280 KB        8      32   23435   50160      19677
   18.2  1280 KB        8      32   23274   50160      19677
   27.0  1280 KB        8      32   23375   50160      19677
   35.8  1280 KB        8      32   23374   50160      19677
   44.7  1280 KB        8      32   23357   50160      19677
 To
  566.3  1280 KB        8      32   23396   50160      19677
  575.1  1280 KB        8      32   23406   50160      19677
  583.9  1280 KB        8      32   23424   50160      19677
  592.7  1280 KB        8      32   23359   50160      19677
  601.7  1280 KB        8      32   23145   50160      19677


############################# vmstat Activity Monitor #############################

 procs  -----------memory---------- ---swap-- -----io---- -system--  ------cpu-----
  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

  8  0      0 7723004  15720 148020    0    0    11     5  975  421  91  0  9  0  0
  8  0      0 7722396  15780 148140    0    0     0     4 1048  428 100  0  0  0  0
  8  0      0 7725468  15844 148044    0    0     0     5 1059  447 100  0  0  0  0
  8  0      0 7725720  15892 148052    0    0     0     3 1052  431 100  0  0  0  0
  8  0      0 7725404  15948 148072    0    0     0     3 1051  432 100  0  0  0  0
  8  0      0 7725368  16004 148072    0    0     0     3 1040  413 100  0  0  0  0
  8  0      0 7725984  16060 148076    0    0     0     4 1050  431 100  0  0  0  0
  9  0      0 7725908  16116 148076    0    0     0     3 1040  409 100  0  0  0  0
  8  0      0 7725656  16164 148084    0    0     0     3 1044  415 100  0  0  0  0
  8  0      0 7725372  16220 148092    0    0     0     4 1067  437 100  0  0  0  0


##################### CPU MHz, Voltage and Temperatures ####################
Seconds
    0.0   ARM MHz= 600, core volt=0.8500V, CPU temp=34.0'C, pmic temp=33.5'C
   60.0   ARM MHz=1500, core volt=0.8500V, CPU temp=51.0'C, pmic temp=38.2'C
  120.4   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  180.8   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  241.3   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  301.7   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  362.1   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  422.4   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  482.8   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  543.2   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  603.6   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
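The monitor lines above have a regular layout, so they can be summarised with a short script. The following is only a sketch of one way to parse them, not part of the benchmark programs; the regular expression and function name are assumptions based on the line format shown in this post.

```python
import re

# Pattern for one line of the monitor log, in the format shown above.
LINE_RE = re.compile(
    r"\s*(?P<seconds>\d+\.\d+)\s+ARM MHz=\s*(?P<mhz>\d+),"
    r"\s*core volt=(?P<volts>\d+\.\d+)V,"
    r"\s*CPU temp=(?P<cpu>\d+\.\d+)'C,"
    r"\s*pmic temp=(?P<pmic>\d+\.\d+)'C"
)

def parse_monitor_line(line):
    """Return the numeric fields of one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}

# Three of the lines copied from the log above.
log = """\
    0.0   ARM MHz= 600, core volt=0.8500V, CPU temp=34.0'C, pmic temp=33.5'C
   60.0   ARM MHz=1500, core volt=0.8500V, CPU temp=51.0'C, pmic temp=38.2'C
  603.6   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
"""

samples = [s for s in map(parse_monitor_line, log.splitlines()) if s]
cpu_rise = max(s["cpu"] for s in samples) - samples[0]["cpu"]
volts = {s["volts"] for s in samples}
print(f"CPU temperature rise: {cpu_rise:.1f} C, core voltages seen: {sorted(volts)}")
```

Feeding in the full log rather than the three sample lines would give the overall temperature rise and confirm whether the core voltage ever changed.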
Next are CPU Performance Throttling Effects

RoyLongbottom

Re: Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Fri Jun 26, 2020 9:15 am

CPU Performance Throttling Effects

Another of my reports covered this in:

https://www.researchgate.net/publicatio ... ce_Effects

This was demonstrated by forcing the CPU clock speed to run continuously at 600 MHz, by setting the frequency scaling governor to powersave mode.
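One way to select the powersave governor is through the Linux cpufreq sysfs files. The sketch below is an assumption about how this could be done, not necessarily the method used for these tests; the function names and the cpu_root parameter are illustrative, and writing to the real files requires root privileges.

```python
from pathlib import Path

GOVERNOR_GLOB = "cpu[0-9]*/cpufreq/scaling_governor"

def set_governor(governor, cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the list of files written. Needs root on a real system.
    """
    written = []
    for gov_file in sorted(Path(cpu_root).glob(GOVERNOR_GLOB)):
        gov_file.write_text(governor + "\n")
        written.append(gov_file)
    return written

def current_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu directory name: governor currently selected}."""
    return {
        gov_file.parent.parent.name: gov_file.read_text().strip()
        for gov_file in sorted(Path(cpu_root).glob(GOVERNOR_GLOB))
    }

if __name__ == "__main__":
    # Read-only by default; uncomment the next line and run as root to change it.
    # set_governor("powersave")
    print(current_governors())
```

With powersave selected, the governor holds the clock at the minimum frequency, which is 600 MHz on the Pi 4B used here.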

This exercise involved using BBC iPlayer for two and a half hours, the main reason being to see whether it would survive on the minimum available resources.

The Raspberry Pi was connected to a TV with a 1920 x 1080 display, using WiFi communication and the CPU at 600 MHz. A drama programme was watched for two hours, with no apparent buffering and, in my opinion, a perfectly good display, where the activity report showed a 960 x 540 picture size at 1700 kbps. A second programme, a wildlife documentary, did produce the occasional short delay, with buffering, reporting the same size but with the bitrate down to 923 kbps. The tests were run without an active cooling fan.

Following are vmstat details, showing CPU utilisation (us plus sy) of around 47% across the four cores, roughly equivalent to two CPU cores running at 100% for most of the time. Then, results from the environment monitor are provided, showing constant MHz and voltage, without significant rises in temperature.

Code: Select all

 vmstat
    -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
     swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 Early  0 6475260 109296 736232    0    0     0   242 2795 3640  40  7 52  0  0
 End    0 6467036 111324 740656    0    0     0   248 2867 3752  39  7 54  0  0


 RPiHeatMHzVolts2 Program - Room at 27°C
  
 Hot start ARM MHz= 600, core volt=0.8500V, CPU temp=69.0'C, pmic temp=62.8'C
 Later     ARM MHz= 600, core volt=0.8500V, CPU temp=70.0'C, pmic temp=64.6'C
 Near End  ARM MHz= 600, core volt=0.8500V, CPU temp=72.0'C, pmic temp=66.5'C
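The vmstat percentages are averaged over all four cores, so the step from around 47% utilisation to roughly two busy cores is simple arithmetic. This is a minimal sketch of that calculation; the function name is made up for illustration.

```python
def busy_cores(user_pct, system_pct, total_cores=4):
    """Convert vmstat us + sy percentages into an equivalent number of busy cores."""
    return (user_pct + system_pct) / 100 * total_cores

# us and sy values copied from the two vmstat lines above.
for label, us, sy in [("Early", 40, 7), ("End", 39, 7)]:
    print(f"{label:5s} {us + sy}% busy = {busy_cores(us, sy):.2f} of 4 cores")
```

Both samples work out at a little under two cores' worth of CPU time, without showing how that load was actually spread across the cores.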
This was the last entry for my initial 64 bit benchmarking activity.

