deater
Posts: 34
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 4:08 pm

Hello

Testing out my new Pi3 with Linpack/OpenBLAS at N=8000 gets an impressive 3.6 GFLOPS, but 3 out of 4 times the verification checks at the end fail. Occasionally it segfaults halfway through.

Smaller values of N are fine, probably because they aren't stressing the system as much.

Has anyone else seen this? Possibly it is heat related. The Pi3 is not in a case and has a small heatsink on it, and I've upgraded to the most recent firmware, but none of those things have helped much.

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 5:56 pm

deater wrote:Hello

testing out my new Pi3 with Linpack/OpenBLAS with N=8000 gets an impressive 3.6GFLOPS, but 3 out of 4 times the verification checks at the end fail. Occasionally it will segfault halfway through.

Smaller values of N are fine, probably because they aren't stressing the system as much.

Has anyone else seen this? Possibly it is heat related. The Pi3 is not in a case, has a small heatsink on, and I've upgraded to the most recent firmware and none of those things have helped much.

Smaller values of N working also points to a possible out-of-memory condition. Assuming you are using double precision arithmetic, an 8000x8000 matrix takes 512MB, which is already half the physical memory. Have you tried reducing the memory split for the GPU? It is possible that the run sometimes fails because system processes that run periodically need memory, and the kernel's out-of-memory killer terminates your test to make room for them. If you can describe exactly how you built your test, I might be able to run it on a Pi 2B to see if there are stability issues on this older and well-tested hardware.
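
As a rough check of that 512MB figure, here is a minimal sketch of the arithmetic. It assumes 8 bytes per double precision value and counts only the matrix itself, ignoring whatever extra workspace the benchmark allocates. The function name matrix_bytes is hypothetical.

```shell
#!/bin/bash
# Bytes needed for an N x N matrix of 8-byte (double precision) values.
# Assumption: only the matrix is counted, not the solver's workspace.
matrix_bytes() {
    local n="$1"
    echo $(( n * n * 8 ))
}

# Print the footprint in decimal megabytes for a few problem sizes.
for n in 6000 8000 10000; do
    echo "N=$n: $(( $(matrix_bytes "$n") / 1000000 )) MB"
done
```

Comparing these numbers with the output of free -m while the test runs would show how much headroom is actually left.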

Another way to determine if stress is the issue is to clamp the CPU governor's maximum frequency to the minimum frequency by running the following shell script as root

Code: Select all

#!/bin/bash
# Clamp each core's maximum frequency to its minimum (must be run as root).
for i in /sys/devices/system/cpu/cpu?/cpufreq
do
    cat "$i/cpuinfo_min_freq" > "$i/scaling_max_freq"
done

and then checking whether your Linpack test runs slower but more accurately.
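
To undo the clamp after testing, something like the following should work. It is only a sketch: the function name restore_max_freq and its optional directory argument are hypothetical, and it assumes the same cpufreq sysfs layout that the script above relies on.

```shell
#!/bin/bash
# Restore scaling_max_freq to the hardware maximum for every CPU found under
# the given sysfs root. The root defaults to the real location; the argument
# exists only so the function can be tried against a scratch directory.
restore_max_freq() {
    local root="${1:-/sys/devices/system/cpu}"
    local dir
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip entries without cpufreq support (or an unmatched glob).
        [ -r "$dir/cpuinfo_max_freq" ] || continue
        cat "$dir/cpuinfo_max_freq" > "$dir/scaling_max_freq"
    done
}

# On a real system, call restore_max_freq with no argument, as root.
```

A reboot should have the same effect, since the clamp is written to sysfs and not to any configuration file.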

deater
Posts: 34
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 6:26 pm

I've run this test on all models of Pi (a+, b+, 2b, compute-node) and it only fails on my pi-3.

It's definitely not out-of-memory. The benchmark finishes; it's just that the correctness (residual) checks fail.

For a while N=6000 was running fine, but now it's failing for me too.

With further tests the N=8000 case sometimes actually completely locks up the system.

I'll try clamping the cpufreq to see if it makes the problem go away.

jojopi
Posts: 3862
Joined: Tue Oct 11, 2011 8:38 pm

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 6:51 pm

May be worth trying:

Code: Select all

sudo apt-get install memtester
memtester 768M

It would be good to provide a link to "Linpack/OpenBLAS", and details of how you built or installed it, if you want others to be able to repeat your experiment.

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 7:21 pm

jojopi wrote:May be worth trying:

Code: Select all

sudo apt-get install memtester
memtester 768M
It would be good to provide a link to "Linpack/OpenBLAS", and details of how you built or installed it, if you want others to be able to repeat your experiment.

It definitely appears you have a hardware malfunction, as the same code is reliable on the other Pi models. This reminds me of the 1.13 GHz Pentium III CPU that was recalled because of stability issues. About that same time I was debugging a number of unstable dual-CPU servers which turned out to be super-stable when underclocked. No matter how comprehensive, manufacturer testing occasionally misses something that is later discovered in the field. I'm not sure the CPU governor is the best way to underclock the Pi, as there may be a boot setting that reduces memory speed as well. Perhaps it would be good to look at the threads on overclocking and then do the opposite.

Another thought is that your power supply is marginal.

If you provide the exact test conditions under which your Pi 3B fails, then other users might be able to help determine how widespread the problem is.

deater
Posts: 34
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 9:58 pm

The setup is a bit involved.

I run Linpack on all my machines (you can see a summary here http://web.eece.maine.edu/~vweaver/group/machines.html )

A quick runthrough on how I compile it can be found here, but just to warn you it takes a while to get it going.
https://github.com/deater/performance_r ... structions

deater
Posts: 34
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 11, 2016 11:50 pm

I realize those directions are probably too hard.

Alternately, you can install the gfortran and libmpich-dev packages and then the following binary should work:
http://web.eece.maine.edu/~vweaver/junk/pi3_hpl.tar.gz

I have been having trouble isolating the problem, other than that N=8000 in the HPL.dat file almost always has issues, while N=6000 and lower often (but not always) work.

It is quite possible that it's power supply related. The supply I am using is rated 5V@2A. I have a power meter hooked up and with N=8000 and 1.2GHz it maxes out at around 4.8W. If I hold it to 600MHz it maxes out at about 2.7W. I will test with another supply, but that will have to wait until Monday.
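
For reference, here is a small sketch converting those meter readings into supply current. It assumes the meter reports average power at a nominal 5 V, so it says nothing about short spikes or voltage droop; the function name supply_amps is hypothetical.

```shell
#!/bin/bash
# Average current in amps for a given power (watts) and voltage (volts).
# Assumption: steady-state draw only; transients are not captured.
supply_amps() {
    awk -v w="$1" -v v="$2" 'BEGIN { printf "%.2f\n", w / v }'
}

supply_amps 4.8 5   # N=8000 at 1.2 GHz
supply_amps 2.7 5   # clamped to 600 MHz
```

On average, both readings sit well inside the supply's 2 A rating, which is why a brief transient or cable drop would be the more likely culprit if power is involved at all.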

jojopi
Posts: 3862
Joined: Tue Oct 11, 2011 8:38 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sat Mar 12, 2016 3:43 pm

I see the same issues on a Pi3, so it is not just your board.

At the larger values of N, the residual test almost always fails. The actual numbers are different every time. I spent a while looking at the source, and convinced myself that the PRNG is always seeded with the same constant. I also disabled kernel ASLR. At the very least, the same results should be produced on every run.

On one occasion I also observed a complete system lock up, with no response to Magic SysRq. I do not believe I had any power warnings.

I do not have a Pi2. Can you confirm that the same binary that works consistently on Pi2 fails on Pi3? That would help to rule out kernel bugs such as not restoring VFP3/NEON state correctly after context switch or interrupt.

(I built an ARMv6 version of OpenBLAS, and it runs consistently on a Pi1, but the resulting xhpl appears to use 400% CPU forever on Pi3. I am not sure if that is my fault or another issue.)

jojopi
Posts: 3862
Joined: Tue Oct 11, 2011 8:38 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sat Mar 12, 2016 10:16 pm

UPDATE: With very aggressive cooling (80mm fan blowing towards the existing heatsink), N=8000 now passes reliably for me at ~6.4Gflops, 53s.

Previously "vcgencmd measure_temp" never exceeded 75°C, but that may not tell the whole story since the sensors are presumably in the GPU and not the ARM. With high airflow the maximum is now 53°C, which is 33°C above ambient.

My heatsink was an oversized but very low profile (5mm) anodised aluminium job that I had salvaged from some old board, attached using cheap thermal tape.

It would be interesting to hear reports of success or failure, and maximum temperature on this benchmark from others. Especially if we can find working configurations that are passive cooled and not incompatible with HATs.

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 12:11 am

jojopi wrote: UPDATE: With very aggressive cooling (80mm fan blowing towards the existing heatsink), N=8000 now passes reliably for me at ~6.4Gflops, 53s.

This is a very interesting data point. Not only do the heat sink and fan make the Pi 3B run nearly twice as fast (6.4 versus 3.6 GFLOPS), but it shows that without extra cooling the CPU doesn't throttle down fast enough to prevent errors when doing linear algebra. Maybe there are no 3Bs that can do this calculation reliably at 1.2 GHz without a heat sink. I wonder if there is an underclock setting that would work without the fan.

pxgator
Posts: 105
Joined: Mon Feb 16, 2015 6:45 pm
Location: Southern Colorado, USA

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 3:48 am

Very interesting... thanks for the information. Maybe this will be an opportunity for someone to create some oversize copper heatsinks that would eliminate the need for a fan. I'm sure there would be a market for such a device.

deater
Posts: 34
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 5:33 am

I just wanted to answer that yes, the hpl binary provided does run just fine on a Pi2 with no problems (you can run it with N=10000 and it will still run). Currently I'm only getting 1 GFLOPS out of it though, which is odd, because I know I've gotten 1.4 GFLOPS in the past.

It's also nice to see that the pi3 crashing results aren't specific to my board, although ideally it wouldn't be happening at all.

lb
Posts: 301
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 9:50 am

I have seen non-deterministic compiler crashes (and system crashes, too) while compiling big packages like OpenCV on one of my Pi 3 boards. This is with a passive heatsink and an open case. I also did some experiments with active cooling, which kept the temperature below 60°C according to the sensor, but it did not eradicate the problems completely.

This is a big issue, obviously. If throttling doesn't work correctly and/or there is a general stability issue, that is a major hardware fault of the Pi 3! Maybe the foundation should investigate this? As a first step I'm going to try to RMA the problematic Pi.

joyrider3774
Posts: 56
Joined: Sun Mar 13, 2016 12:21 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 12:43 pm

Hey,

I just tested this as well. I had a simple sh script that kept echoing the CPU temperature in one ssh connection, and was running the program you supplied in another ssh connection over wifi.

My Pi3 locked up while the last shown value of CPU temp / GPU temp was 73°C. It could be that only the wifi or ssh connection locked up due to the CPU being stressed, but I can't confirm that.

I also tried running with N=6000. My Pi3 did not lock up, and the max temp reported was 83°C, BUT it finished saying
"1 tests completed and failed residual checks" .. "||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 44393470.5663437 ...... FAILED"

The only thing I changed was going from N=8000 to N=6000 in the dat file. So for me N=6000 also fails, although it does not crash the Pi.

Also, it seems there are 2 revisions that can be shown in /proc/cpuinfo, at least according to what I saw here: https://github.com/RetroPie/RetroPie-Se ... f190fde76d
On the one hand there is an "a02082" revision, which I have, and on the other hand there also seems to be an "a22082" revision.

I don't know if anyone has that other revision (a22082), but it would be interesting to know the difference between them, and whether people running a22082 have the same problem.
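
For anyone collecting results over many runs, here is a minimal sketch that tallies the residual verdicts in saved benchmark output. It assumes each verdict line starts with "||Ax-b||" and ends in PASSED or FAILED, as in the line quoted above; the function name hpl_residual_summary and the log file name are hypothetical.

```shell
#!/bin/bash
# Count PASSED and FAILED residual lines in HPL output read from stdin.
# Assumption: verdict lines begin with "||Ax-b||" and the verdict is the
# last whitespace-separated field on the line.
hpl_residual_summary() {
    awk '
        index($0, "||Ax-b||") == 1 && $NF == "PASSED" { passed++ }
        index($0, "||Ax-b||") == 1 && $NF == "FAILED" { failed++ }
        END { printf "passed=%d failed=%d\n", passed, failed }
    '
}

# Example with a saved log (file name is illustrative):
#   hpl_residual_summary < hpl_run.log
```

Pairing each tally with the revision shown in /proc/cpuinfo would make it easier to see whether one board revision behaves differently.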

clivem
Posts: 114
Joined: Sun Aug 03, 2014 11:18 am

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 1:10 pm

lb wrote: This is a big issue, obviously. If throttling doesn't work correctly and/or there is a general stability issue, that is a major hardware fault of the Pi 3! Maybe the foundation should investigate this? As a first step I'm going to try to RMA the problematic Pi.

I've now "thrown" 2 of the 8 Pi3Bs I purchased in the junk drawer, as described in my comments towards the end of the issue I opened, "Pi3B thermal throttling". One was totally unreliable fresh from the box, regardless of load or temperature, and would lock up as soon as you tried to execute any NEON optimised code. The other was fine until I stress tested it, exceeding temperatures of 100°C. It now suffers from random hard lock-ups, far too frequently to use it for any practical purpose.

Heater
Posts: 19682
Joined: Tue Jul 17, 2012 3:02 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 1:30 pm

clive,

If this is as serious a problem as you say, I think it needs addressing. Perhaps the best thing to do is fish those malfunctioning Pi 3s out of your junk drawer and return them to the supplier.

The second best thing to do is forward them to me for, err..., evaluation. P.M. me for my address.

joyrider3774
Posts: 56
Joined: Sun Mar 13, 2016 12:21 pm

Re: Pi3 incorrect results under load (possibly heat related)

Sun Mar 13, 2016 1:49 pm

I've been testing some more.

I'm using this script in one ssh session to show temps and CPU frequency, based on part of the RetroPie login message:

Code: Select all


#!/bin/bash
# Print CPU temperature (sysfs), GPU temperature and ARM clock (vcgencmd).
cputemp() {
    local cpuTempC gpuTempC cpuFreq
    if [[ -f "/sys/class/thermal/thermal_zone0/temp" ]]; then
        # sysfs reports millidegrees Celsius
        cpuTempC=$(($(cat /sys/class/thermal/thermal_zone0/temp)/1000))
    fi
    if [[ -f "/opt/vc/bin/vcgencmd" ]]; then
        if gpuTempC=$(/opt/vc/bin/vcgencmd measure_temp); then
            # output looks like temp=53.0'C; keep the two digits after "temp="
            gpuTempC=${gpuTempC:5:2}
        else
            gpuTempC=""
        fi
        cpuFreq=$(/opt/vc/bin/vcgencmd measure_clock arm)
    fi

    echo "cpu temp=$cpuTempC°C gpu temp=$gpuTempC°C $cpuFreq"
}

while true; do cputemp; sleep 2; done

I then underclocked my Pi3 to 1150 MHz in config.txt

Code: Select all

arm_freq=1150

and reran the test with N=8000.
It has been working fine so far: it did not lock up, nor did it give any failed tests. I still have to test some more though.

With these settings I got about 4 Gflops, but I can see it's still throttling a lot (going down to 600 MHz).

Code: Select all

T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02R2L2        8000   256     1     1              85.22              4.006e+00

Edit:
I just did an rpi-update and, running again at the default clock of 1200 MHz, reran with N=8000. The test failed again, but my Pi3 did not lock up. Another interesting fact is that the Gflops is only around 3 now.

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02R2L2        8000   256     1     1             113.01              3.021e+00

I think I might run it slightly underclocked and will look for a heatsink (I'm running a stock Pi 3 without a case).
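
The Gflops column can be cross-checked from N and the run time. The sketch below assumes the benchmark counts 2/3*N^3 + 3/2*N^2 floating point operations for a problem of order N, which reproduces the figures in both tables above; the function name hpl_gflops is hypothetical.

```shell
#!/bin/bash
# Gflops for a run of order N that finished in the given wall time (seconds).
# Assumption: the operation count is 2/3*N^3 + 3/2*N^2.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flops = (2.0 / 3.0) * n * n * n + 1.5 * n * n
        printf "%.3f\n", flops / t / 1e9
    }'
}

hpl_gflops 8000 85.22    # run with arm_freq=1150
hpl_gflops 8000 113.01   # run at the default 1200 MHz
```

Because the operation count is fixed for a given N, a lower Gflops figure just means the run took longer, which fits heavier throttling at the default clock.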

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 5:16 am

joyrider3774 wrote:just did a rpi-update running again at default clock of 1200mhz i reran with N=8000 the test failed again but my p3 did not lock up also intresting fact is the Gflops is only around 3 ish now
joyrider3774 wrote: i think i might be running it slightly underclocked and will search for a heatsink (i'm running stock rpi3 without case)

A number of years ago an MMX2 optimized memory copy was added to the Linux kernel and a number of VIA KT133A motherboard based systems started to fail when the optimization was turned on. At first people blamed the new code and then the hardware. Eventually the cause was traced to a faulty BIOS which initialized the chipset in a way that worked for legacy code but which broke when MMX2 instructions were used.

The case with the Pi 3B seems similar. The faster CPU clock speed gives a slight performance increase when running ARMv6 compatible code, but when NEON instructions are used the higher clock speed makes the hardware unreliable. It is amazing that only a 5% reduction in speed restores stability. I would have thought a 10% or 20% reduction might be necessary. As no Pi 3B seems able to reliably run the Linpack linear algebra benchmark at 1200 MHz without a heat sink, I wonder how many can do it at 1150 MHz.

PeterO
Posts: 6228
Joined: Sun Jul 22, 2012 4:14 pm

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 7:37 am

ejolson wrote: It is amazing that only a 5% reduction in speed restores stability. I would have thought a 10% or 20% reduction might be necessary.

I doubt the relation between clock speed and power is linear (esp. at the top end), so a 10% reduction in clock speed may get you more than a 10% reduction in power consumption.
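
One way to see why is the usual first-order model for dynamic CMOS power, which scales with voltage squared times frequency. The sketch below applies only that textbook model: it ignores static power, and nothing in this thread says how the Pi firmware adjusts core voltage with clock speed. The function name relative_dynamic_power is hypothetical.

```shell
#!/bin/bash
# Dynamic power relative to baseline under the model P ~ V^2 * f.
# Arguments are the frequency and voltage as fractions of their baselines.
relative_dynamic_power() {
    awk -v f="$1" -v v="$2" 'BEGIN { printf "%.3f\n", v * v * f }'
}

relative_dynamic_power 0.9 1.0   # clock down 10%, voltage unchanged
relative_dynamic_power 0.9 0.9   # clock and voltage both down 10%
```

With the voltage unchanged the saving is linear, so the better-than-linear saving only appears when a lower clock also allows a lower voltage.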

PeterO

lb
Posts: 301
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 12:50 pm

The Linpack benchmark also fails for me on both of my Pi 3 boards. On one board, I either get hard system crashes or incorrect results, even with over_voltage=2. On the other, I see system crashes or "illegal instruction" aborts. Haven't tested over_voltage yet. In both cases, this was with passive heatsinks and open case, so adequate cooling. The crashes happen so quickly that the SoC doesn't actually have time to heat up anyway.

TL;DR Pi 3 seems broken by design.

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 8:20 pm

lb wrote:The Linpack benchmark also fails for me on both of my Pi 3 boards. On one board, I either get hard system crashes or incorrect results, even with over_voltage=2. On the other, I see system crashes or "illegal instruction" aborts. Haven't tested over_voltage yet. In both cases, this was with passive heatsinks and open case, so adequate cooling. The crashes happen so quickly that the SoC doesn't actually have time to heat up anyway.

TL;DR Pi 3 seems broken by design.

Rather than overvolting, have you tried underclocking, say to 1000 MHz, to see if your Pi becomes reliable at a slower speed? As this reliability and stability problem seems to affect every 3B tested so far, it would be nice if someone on the engineering team at the Raspberry Pi Foundation could look into the following questions:

1. Has anyone noticed similar stability issues with the Pi 3B running ARMv6 Pi B+ compatible code?

2. Can multi-threaded applications like the Chromium web browser create the same reliability and stability issues?

3. What level of underclocking is safe for running optimized NEON code on a Pi 3B without a heat sink?

4. Are there Pi 3B specific heat sinks that will allow the NEON optimized Linpack, FFTs and neural networks to run reliably at 1200 MHz?

5. Is it possible to enter a mode on the Pi 3B that disables the execution of all NEON and ARMv8 specific code?

jahboater
Posts: 8829
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 8:27 pm

This stress test code is all NEON and uses all four cores.
On a Pi3 with no heatsink it either crashes the system within a few seconds (at about 75°C) or carries on and throttles back. The Pi3 is at stock frequency settings. With a heatsink it always successfully throttles.

Code: Select all

wget https://raw.githubusercontent.com/ssvb/cpuburn-arm/master/cpuburn-a53.S
gcc -o cpuburn-a53 cpuburn-a53.S
./cpuburn-a53

ejolson
Posts: 11759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 8:50 pm

jahboater wrote:This stress test code is all NEON and uses all four cores.
On a Pi3 with no heatsink it either crashes the system within a few seconds (and about 75C) or carries on and throttles back.

Code: Select all

wget https://raw.githubusercontent.com/ssvb/cpuburn-arm/master/cpuburn-a53.S
gcc -o cpuburn-a53 cpuburn-a53.S
./cpuburn-a53

I looked at the code but unfortunately don't know ARM assembler very well. It appears to be an infinite loop that doesn't compute any residual to detect errors. If this understanding is correct, then this test provides no way to know whether computational errors were made in the cases when the system successfully throttles back.

The fact that the system crashes before the CPU has a chance to overheat may implicate a faulty power supply or too high a resistance in the USB cable delivering the power. If the power surge when the linear algebra solver starts creates enough of a voltage drop, this could cause errors right at the beginning of the computation, before the CPU throttles back because of heat.
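
To get a feel for the size of such a drop, here is a small Ohm's law sketch. The 0.3 ohm cable resistance is purely an illustrative guess, not a measurement from this thread, and the 0.96 A figure is simply 4.8 W divided by 5 V from the power readings reported earlier. The function name board_volts is hypothetical.

```shell
#!/bin/bash
# Voltage left at the board after the drop across the cable: V - I * R.
# Arguments: supply voltage (V), current (A), cable resistance (ohms).
board_volts() {
    awk -v v="$1" -v i="$2" -v r="$3" 'BEGIN { printf "%.3f\n", v - i * r }'
}

board_volts 5 0.96 0.3   # steady draw through a hypothetical 0.3 ohm cable
board_volts 5 2.0 0.3    # a brief surge to the supply's rated 2 A
```

Measuring the voltage at the board's 5V and GND pins under load would replace these guesses with real numbers.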

lb
Posts: 301
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 10:41 pm

I just did some more testing with the second Pi 3 board I mentioned: some overvolting (over_voltage=2) fixes stability and correctness on that one. Of course that's still not a solution, though. On the first Pi 3 board, no amount of overvolting fixes things; only downclocking to 1100 MHz or lower makes it stable.

I'm not sure how to proceed. I could definitely RMA the boards, but I don't think there's a good chance that I'll get back new ones that actually work as advertised.

clivem
Posts: 114
Joined: Sun Aug 03, 2014 11:18 am

Re: Pi3 incorrect results under load (possibly heat related)

Mon Mar 14, 2016 10:48 pm

ejolson wrote: The fact that the system crashes before the CPU has a chance to overheat may implicate a faulty power supply or a too high resistance in the USB cable delivering the power.

Looks like I need to return 4x of the new "official" 2.5A power supplies to RS as well then......
