Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

[RESOLVED] XIP flush affects code running from RAM - NOT

Mon Sep 27, 2021 10:07 am

Edit: Finally found the explanation :oops:

I am currently attempting to debug a fairly complex program which, given a fixed sequence of commands is failing randomly in many different ways. For specific details see viewtopic.php?f=144&t=316761&p=1917894#p1917811 and https://github.com/raspberrypi/pico-extras/issues/26.

The following is the record of my gdb session attempting to understand one such failure:

Code: Select all

Thread 2 received signal SIGTRAP, Trace/breakpoint trap.
unhandled_user_irq_num_in_r0 () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:155
155	    bkpt #0
(gdb) bt
#0  unhandled_user_irq_num_in_r0 () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:155
#1  0x100001d0 in isr_irq9 () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:151
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0xfffffff3          -13
r1             0x2000b090          536916112
r2             0x2000b126          536916262
r3             0x10000             65536
r4             0x2000e650          536929872
r5             0x80037c14          -2147255276
r6             0x2000df05          536928005
r7             0x1003605c          268656732
r8             0x2000cbc8          536923080
r9             0x2b5421b4          726933940
r10            0xffffffff          -1
r11            0xffffffff          -1
r12            0x10000             65536
sp             0x20041fb0          0x20041fb0
lr             0x100001d1          268435921
pc             0x100001d2          0x100001d2 <unhandled_user_irq_num_in_r0>
xPSR           0x81000003          -2130706429
msp            0x20041fb0          0x20041fb0
psp            0xfffffffc          0xfffffffc
primask        0x0                 0
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
(gdb) f 1
#1  0x100001d0 in isr_irq9 () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:151
151	    bl __get_current_exception
(gdb) info registers
r0             0xfffffff3          -13
r1             0x2000b090          536916112
r2             0x2000b126          536916262
r3             0x10000             65536
r4             0x2000e650          536929872
r5             0x80037c14          -2147255276
r6             0x2000df05          536928005
r7             0x1003605c          268656732
r8             0x2000cbc8          536923080
r9             0x2b5421b4          726933940
r10            0xffffffff          -1
r11            0xffffffff          -1
r12            0x10000             65536
sp             0x20041fb0          0x20041fb0
lr             0x100001d1          268435921
pc             0x100001d0          0x100001d0 <isr_irq9+4>
xPSR           0x81000003          -2130706429
msp            0x20041fb0          0x20041fb0
psp            0xfffffffc          0xfffffffc
primask        0x0                 0
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
(gdb) x/8xw $sp
0x20041fb0:	0x2b5421b4	0x2000b090	0x2000b090	0x00010000
0x20041fc0:	0x00010000	0x2000064d	0x2000061a	0x01000000
(gdb) x/i 0x2000061a
   0x2000061a <render_loop+230>:	stmia	r0!, {r3}
(gdb) x/i 0x2000064d
   0x2000064d <render_loop+280>:	ldr	r3, [pc, #44]	; (0x2000067c <render_loop+328>)
(gdb) x/8xw 0x10000100
0x10000100 <__vectors>:	0x20041800	0x100001f7	0x100001c3	0x100001c5
0x10000110 <__vectors+16>:	0x100001c1	0x100001c1	0x100001c1	0x100001c1
(gdb) x/20i 0x100001c0
   0x100001c0 <isr_invalid>:	bkpt	0x0000
   0x100001c2 <isr_nmi>:	bkpt	0x0000
   0x100001c4 <isr_hardfault>:	bkpt	0x0000
   0x100001c6 <isr_svcall>:	bkpt	0x0000
   0x100001c8 <isr_pendsv>:	bkpt	0x0000
   0x100001ca <isr_systick>:	bkpt	0x0000
   0x100001cc <isr_irq9>:	bl	0x10000266 <__get_current_exception>
=> 0x100001d0 <isr_irq9+4>:	subs	r0, #16
   0x100001d2 <unhandled_user_irq_num_in_r0>:	bkpt	0x0000
   0x100001d4 <binary_info_header>:			; <UNDEFINED> instruction: 0xebf27188
   0x100001d8 <binary_info_header+4>:	strh	r0, [r3, #22]
   0x100001da <binary_info_header+6>:	asrs	r3, r0, #32
   0x100001dc <binary_info_header+8>:	strh	r4, [r2, #24]
   0x100001de <binary_info_header+10>:	asrs	r3, r0, #32
   0x100001e0 <binary_info_header+12>:	lsls	r4, r6, #8
   0x100001e2 <binary_info_header+14>:	asrs	r0, r0, #32
   0x100001e4 <binary_info_header+16>:	add	r3, pc, #576	; (adr r3, 0x10000428 <lrange+76>)
   0x100001e6 <binary_info_header+18>:	b.n	0x1000001e <__flash_binary_start+30>
   0x100001e8 <_entry_point>:	movs	r0, #0
   0x100001ea <_entry_point+2>:	ldr	r1, [pc, #132]	; (0x10000270 <__get_current_exception+10>)
From the top of the listing, the program hit a breakpoint in routine unhandled_user_irq_num_in_r0(). From the register dump r0 = -13. This implies that the interrupt was not a user interrupt, but a system exception. Adding 16 to r0 gives interrupt 3 (hardfault). This is confirmed by xPSR = 0x81000003, the least significant byte of which is the exception number.

But if the exception is hardfault, why did the code hit the breakpoint in unhandled_user_irq_num_in_r0() ? The dump of the vector table gives a vector of 0x100001c5 for exception 3. The last bit of that is the thumb bit, so on a hardfault exception the code should vector to 0x100001c4. And from the disassembly, that should be an immediate break at isr_hardfault. There are four consecutive breaks between the address the code should vector to, and the breakpoint actually taken. How is this possible?

There is also another anomaly. On the occurrence of an exception the processor is supposed to push onto the stack: R0, R1, R2, R3, R12, LR, Return address, XPSR<31:10>:frameptralign:XPSR<8:0>. From above, dumping the stack gave:

Code: Select all

(gdb) x/8xw $sp
0x20041fb0:	R0 = 0x2b5421b4	R1 = 0x2000b090	R2 = 0x2000b090	R3 = 0x00010000
0x20041fc0:	R12 = 0x00010000	LR = 0x2000064d	Rtn = 0x2000061a	Sta = 0x01000000
R0 is known to have been changed. R1, R3 and R12 agree between the stack dump and the register dump. However, in the stack dump R2 = 0x2000b090 (same as R1), whereas the register dump gives a value of 0x2000b126. This register should not have been changed, so why the difference?

Is there anything wrong with my interpretation of the debug above? Or any explanation of what happened?

As mentioned at the start I am many different random failures, and my attempts to debug them often lead to similar inconsistencies.
Last edited by Memotech Bill on Sat Oct 02, 2021 10:24 am, edited 2 times in total.

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Tue Sep 28, 2021 10:41 am

Another inexplicable (to me) debug session:

Code: Select all

Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
isr_hardfault () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:99
99	decl_isr_bkpt isr_hardfault
(gdb) bt
#0  isr_hardfault () at /home/pi/pico/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:99
#1  <signal handler called>
#2  flipcsr () at /home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c:737
#3  0x1001d37e in flashcsr () at /home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c:828
#4  video_periodic () at /home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c:2937
#5  0x100198f2 in keyboard_periodic (prt=<optimized out>)
    at /home/pi/pico/BBCSDL_Work/mcu/pico_vga/picokbd.c:451
#6  0x1001dbca in repeating_timer_callback (id=<optimized out>, user_data=0x20002c88 <s_kbd_timer>)
    at /home/pi/pico/pico-sdk/src/common/pico_time/time.c:289
#7  0x1001de9a in alarm_pool_alarm_callback (alarm_num=3)
    at /home/pi/pico/pico-sdk/src/common/pico_time/time.c:155
#8  0x1001e356 in hardware_alarm_irq_handler ()
    at /home/pi/pico/pico-sdk/src/rp2_common/hardware_timer/timer.c:140
#9  <signal handler called>
#10 lfs_bbc_prog (cfg=0x20001fd8 <lfs_root_cfg>, block=98, off=768, buffer=0x20012768, size=256)
    at /home/pi/pico/BBCSDL_Work/mcu/lfsmcu.c:99
#11 0x100135c2 in lfs_bd_flush (lfs=lfs@entry=0x2000e390 <lfs_root>, pcache=pcache@entry=0x2001257c, 
    rcache=rcache@entry=0x2000e390 <lfs_root>, validate=validate@entry=true)
    at /home/pi/pico/BBCSDL_Work/mcu/littlefs/lfs.c:149
#12 0x100136c0 in lfs_bd_prog (lfs=lfs@entry=0x2000e390 <lfs_root>, pcache=pcache@entry=0x2001257c, 
    rcache=rcache@entry=0x2000e390 <lfs_root>, validate=validate@entry=true, block=98, off=1024, 
    buffer=<optimized out>, buffer@entry=0x20018700, size=1757, size@entry=2777)
    at /home/pi/pico/BBCSDL_Work/mcu/littlefs/lfs.c:219
#13 0x10014d12 in lfs_file_rawwrite (lfs=lfs@entry=0x2000e390 <lfs_root>, file=file@entry=0x2001253c, 
    buffer=buffer@entry=0x20017700, size=size@entry=6873) at /home/pi/pico/BBCSDL_Work/mcu/littlefs/lfs.c:3030
#14 0x1001657a in lfs_file_write (lfs=lfs@entry=0x2000e390 <lfs_root>, file=file@entry=0x2001253c, 
    buffer=buffer@entry=0x20017700, size=size@entry=6873) at /home/pi/pico/BBCSDL_Work/mcu/littlefs/lfs.c:5181
#15 0x1001283a in myfwrite (ptr=ptr@entry=0x20017700, size=size@entry=1, nmemb=nmemb@entry=6873, 
    fp=fp@entry=0x20012538) at /home/pi/pico/BBCSDL_Work/mcu/lfswrap.c:555
#16 0x100117a8 in ossave (p=p@entry=0x20016004 " test\r", addr=addr@entry=0x20017700, len=6873)
    at /home/pi/pico/BBCSDL_Work/src/bbpico.c:1276
#17 0x100023d8 in basic (ecx=<optimized out>, edx=<optimized out>, prompt=prompt@entry=0x1)
--Type <RET> for more, q to quit, c to continue without paging--c
    at /home/pi/pico/BBCSDL_Work/src/bbmain.c:1749
#18 0x10011e32 in entry (immediate=immediate@entry=0x1) at /home/pi/pico/BBCSDL_Work/src/bbpico.c:1569
#19 0x1001208a in main (argc=1, argv=<synthetic pointer>) at /home/pi/pico/BBCSDL_Work/src/bbpico.c:2063
(gdb) info registers
r0             0x0                 0
r1             0x0                 0
r2             0x20003218          536883736
r3             0x2000338d          536884109
r4             0x10036088          268656776
r5             0x1                 1
r6             0x3                 3
r7             0xff                255
r8             0x0                 0
r9             0x7                 7
r10            0x1                 1
r11            0x0                 0
r12            0x20003250          536883792
sp             0x20041430          0x20041430
lr             0xfffffff1          -15
pc             0x100001c4          0x100001c4 <isr_hardfault>
xPSR           0x21000003          553648131
msp            0x20041430          0x20041430
psp            0xfffffffc          0xfffffffc
primask        0x1                 1
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
(gdb) x/32xw $sp
0x20041430:	0x00000000	0x00000000	0x20003218	0x2000338d
0x20041440:	0x20003250	0x1001d37f	0x10019d86	0x21000213
0x20041450:	0x00000007	0x20003210	0x00000000	0x20002070
0x20041460:	0x20002c88	0x1001d37f	0x2000df08	0x100198f3
0x20041470:	0x20002c88	0x1001dbc1	0x20002070	0x1001dbcb
0x20041480:	0x00000000	0x1001de9b	0x1001dbc1	0x20002ca0
0x20041490:	0x00000008	0x00000001	0x00000000	0x00000001
0x200414a0:	0x0279e3a2	0x00000000	0x0003d090	0x00000000
(gdb) x/10i 0x10019d80
   0x10019d80 <flipcsr+200>:	ble.n	0x10019d9e <flipcsr+230>
   0x10019d82 <flipcsr+202>:	movs	r1, #0
   0x10019d84 <flipcsr+204>:	movs	r6, #3
   0x10019d86 <flipcsr+206>:	ldr	r2, [r3, #0]
   0x10019d88 <flipcsr+208>:	eors	r2, r7
   0x10019d8a <flipcsr+210>:	str	r2, [r3, #0]
   0x10019d8c <flipcsr+212>:	ldr	r2, [r3, #4]
   0x10019d8e <flipcsr+214>:	eors	r2, r0
   0x10019d90 <flipcsr+216>:	str	r2, [r3, #4]
   0x10019d92 <flipcsr+218>:	ldr	r2, [r4, #32]
A hardfault occurred, this time correctly caught at the hardfault breakpoint.

The exception pushes on the stack R0-R3, R12, LR, PC, XPSR. So from the stack dump, at the time of the exception PC = 0x10019d86. From the disassembly, the instruction at that location is:

Code: Select all

0x10019d86 <flipcsr+206>:	ldr	r2, [r3, #0]
 
And from either the register dump or the stack dump R3 = 0x2000338d, which is a perfectly valid RAM address, so why the hardfault?

For reference, a few extracts from the program map:

Code: Select all

 .text.testkey  0x10019c78       0x40 CMakeFiles/bbcbasic.dir/home/pi/pico/BBCSDL_Work/mcu/pico_vga/picokbd.c.obj
                0x10019c78                testkey
 .text.flipcsr  0x10019cb8      0x140 CMakeFiles/bbcbasic.dir/home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c.obj
 .text.getpix   0x10019df8       0x48 CMakeFiles/bbcbasic.dir/home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c.obj
    : 
    :
  .bss.filter    0x2000329c        0x4 CMakeFiles/bbcbasic.dir/home/pi/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c.obj
 .bss.framebuf  0x200032a0     0x9600 CMakeFiles/bbcbasic.dir/home/pi/pico/BBCSDL_Work/mcu/pico_vga/picovdu.c.obj
 .bss.fswcwd    0x2000c8a0      0x104 CMakeFiles/bbcbasic.dir/home/pi/pico/BBCSDL_Work/mcu/lfswrap.c.obj
From which it can be seen that R3 lies within framebuf[]

And from an obdump of picovdu.c:

Code: Select all

    for (int i = 0; i < csrhgt; ++i)
  c2:	4a1e      	ldr	r2, [pc, #120]	; (13c <flipcsr+0x13c> (File Offset: 0x170))
  c4:	6815      	ldr	r5, [r2, #0]
  c6:	2d00      	cmp	r5, #0
  c8:	dd0d      	ble.n	e6 <flipcsr+0xe6> (File Offset: 0x11a)
  ca:	2100      	movs	r1, #0
        fb += pmode->nbpl / 4;
  cc:	2603      	movs	r6, #3
        *fb ^= msk1;                                                    <== picovdu.c:737
  ce:	681a      	ldr	r2, [r3, #0]
  d0:	407a      	eors	r2, r7
  d2:	601a      	str	r2, [r3, #0]
        *(fb + 1) ^= msk2;
  d4:	685a      	ldr	r2, [r3, #4]
  d6:	4042      	eors	r2, r0
  d8:	605a      	str	r2, [r3, #4]
        fb += pmode->nbpl / 4;
  da:	6a22      	ldr	r2, [r4, #32]
  dc:	43b2      	bics	r2, r6
  de:	189b      	adds	r3, r3, r2
So the instruction apparently causing the hardfault is at line 737 of picovdu.c as reported in the backtrace.

kilograham
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 975
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Please check my understanding of Debug

Tue Sep 28, 2021 1:39 pm

Memotech Bill wrote:
Tue Sep 28, 2021 10:41 am

Code: Select all

0x10019d86 <flipcsr+206>:	ldr	r2, [r3, #0]
 
And from either the register dump or the stack dump R3 = 0x2000338d, which is a perfectly valid RAM address, so why the hardfault?
That is an unaligned 32 bit read, which will cause a hard fault on CM0+

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Tue Sep 28, 2021 1:53 pm

Yet another debug session:

Code: Select all

rp2040.core0 -- clearing lockup after double fault
target halted due to debug-request, current mode: Thread 
xPSR: 0x21000000 pc: 0x200014e2 msp: 0x20041fc4

Thread 1 received signal SIGINT, Interrupt.
0xfffffffe in ?? ()
(gdb) bt
#0  0xfffffffe in ?? ()
#1  <signal handler called>
#2  0x14000030 in ?? ()
#3  0x20003166 in boot2_copyout ()
#4  0x00061000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0x200009d1          536873425
r1             0x200009d1          536873425
r2             0x61                97
r3             0x80002744          -2147473596
r4             0xbd                189
r5             0x27440000          658767872
r6             0x0                 0
r7             0x1840000           25427968
r8             0xd00               3328
r9             0x61                97
r10            0x61                97
r11            0x10012c69          268512361
r12            0x4001801c          1073840156
sp             0x200414f8          0x200414f8
lr             0xfffffff9          -7
pc             0xfffffffe          0xfffffffe
xPSR           0x1000003           16777219
msp            0x200414f8          0x200414f8
psp            0xfffffffc          0xfffffffc
primask        0x1                 1
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
(gdb) x/32xw $sp
0x200414f8:	0x200009d1	0x200009d1	0x00000061	0x200009d1
0x20041508:	0x4001801c	0x20003167	0x14000030	0x01000000
0x20041518:	0x00061000	0x20012768	0x00000000	0x00000200
0x20041528:	0x00000000	0x00000100	0x2001257c	0x2000e390
0x20041538:	0x20001fd8	0x100135c3	0x00000100	0x00000c00
0x20041548:	0x20012768	0x00000100	0x00000001	0x2000e390
0x20041558:	0x20012544	0x00000100	0x20017700	0x2001257c
0x20041568:	0x00000e00	0x00000061	0x2000e390	0x100136c1
(gdb) x/10i 0x20003160
   0x20003160 <boot2_copyout+128>:	str	r1, [r3, #96]	; 0x60
   0x20003162 <boot2_copyout+130>:	bl	0x2000318a <boot2_copyout+170>
   0x20003166 <boot2_copyout+134>:	movs	r1, #0
   0x20003168 <boot2_copyout+136>:	str	r1, [r3, #8]
   0x2000316a <boot2_copyout+138>:	ldr	r1, [pc, #88]	; (0x200031c4 <boot2_copyout+228>)
   0x2000316c <boot2_copyout+140>:	ldr	r0, [pc, #80]	; (0x200031c0 <boot2_copyout+224>)
   0x2000316e <boot2_copyout+142>:	str	r1, [r0, #0]
   0x20003170 <boot2_copyout+144>:	movs	r1, #1
   0x20003172 <boot2_copyout+146>:	str	r1, [r3, #8]
   0x20003174 <boot2_copyout+148>:	pop	{r0}
(gdb) x/10i 0x20003188
   0x20003188 <boot2_copyout+168>:	bx	r1
   0x2000318a <boot2_copyout+170>:	push	{r0, r1, lr}
   0x2000318c <boot2_copyout+172>:	ldr	r1, [r3, #40]	; 0x28
   0x2000318e <boot2_copyout+174>:	movs	r0, #4
   0x20003190 <boot2_copyout+176>:	tst	r1, r0
   0x20003192 <boot2_copyout+178>:	beq.n	0x2000318c <boot2_copyout+172>
   0x20003194 <boot2_copyout+180>:	movs	r0, #1
   0x20003196 <boot2_copyout+182>:	tst	r1, r0
   0x20003198 <boot2_copyout+184>:	bne.n	0x2000318c <boot2_copyout+172>
   0x2000319a <boot2_copyout+186>:	pop	{r0, r1, pc}
(gdb) x/8xw 0x10000100
0x10000100 <__vectors>:	0x00000000	0x00000000	0x00000000	0x00000000
0x10000110 <__vectors+16>:	0x00000000	0x00000000	0x00000000	0x00000000
From the stack dump, at the time the hardfault exception occurred LR = 0x20003167. The last bit of that is the thumb bit, so this indicates a return address of 0x20003166. This is consistent with entry #3 in the back trace. From the first disassembly listing, the instruction preceding this address is a subroutine call to 0x2000318a <boot2_copyout+170>, which would be expected to set the link register to the value observed.

The second disassembly shows the called routine, which is quite small 0x2000318a <boot2_copyout+170> to 0x2000319a <boot2_copyout+186>. Nowhere in that code is there anything that would jump to 0x14000030. So how did the program get there?

The address 0x14000030 is in the range of the XIP control registers, but is beyond the last defined register, so it is not surprising that attempting to access that address should result in a hardfault. Dumping the exception vectors gives all zeros. No idea how that should have happened. So the exception should have lead to executing the start of ROM. No idea how that resulted in a PC of 0xfffffffe.

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Tue Sep 28, 2021 1:55 pm

kilograham wrote:
Tue Sep 28, 2021 1:39 pm
Memotech Bill wrote:
Tue Sep 28, 2021 10:41 am

Code: Select all

0x10019d86 <flipcsr+206>:	ldr	r2, [r3, #0]
 
And from either the register dump or the stack dump R3 = 0x2000338d, which is a perfectly valid RAM address, so why the hardfault?
That is an unaligned 32 bit read, which will cause a hard fault on CM0+
Thank you. Good catch, that is why I need some other eyes on these sessions. I now need to try and work out why it should be unaligned.

Any explanation as to the first post?

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Tue Sep 28, 2021 3:34 pm

Memotech Bill wrote:
Tue Sep 28, 2021 1:53 pm
Dumping the exception vectors gives all zeros. No idea how that should have happened. So the exception should have lead to executing the start of ROM.
Very strange. Exiting GDB, re-entering it and examining the interrupt vectors still gives:

Code: Select all

(gdb) x/8xw 0x10000100
0x10000100 <__vectors>:	0x00000000	0x00000000	0x00000000	0x00000000
0x10000110 <__vectors+16>:	0x00000000	0x00000000	0x00000000	0x00000000
However, exiting GDB, power-cycling the Pico, re-entering GDB and examining the vectors gives:

Code: Select all

0x10000100 <__vectors>:	0x20041800	0x100001f7	0x100001c3	0x100001c5
0x10000110 <__vectors+16>:	0x100001c1	0x100001c1	0x100001c1	0x100001c1
Which are the expected values. Note that nothing was done to reload the program into Flash. So presumably the Flash itself was always correct but GDB (and the the program?) was seeing spurious zero values.

What could cause that? Corrupt cache?

kilograham
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 975
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Please check my understanding of Debug

Tue Sep 28, 2021 8:26 pm

Memotech Bill wrote:
Tue Sep 28, 2021 3:34 pm
Memotech Bill wrote:
Tue Sep 28, 2021 1:53 pm
Dumping the exception vectors gives all zeros. No idea how that should have happened. So the exception should have lead to executing the start of ROM.
Very strange. Exiting GDB, re-entering it and examining the interrupt vectors still gives:

Code: Select all

(gdb) x/8xw 0x10000100
0x10000100 <__vectors>:	0x00000000	0x00000000	0x00000000	0x00000000
0x10000110 <__vectors+16>:	0x00000000	0x00000000	0x00000000	0x00000000
This all smacks of weirdness with the debugger, and disabling XIP mode (which I assume all of this strangeness occurs near/after)
if XIP is disabled when you stop are in the debugger, then it won't be able to read the flash. I suppose also that (to your earlier post) that maybe there is a missing memory synchronization on restoring XIP access which causes both a fault, and the actual hardfault handler bkpt #0 to be read as a different instruction.

I'll point someone who might know more this way.

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Wed Sep 29, 2021 6:53 am

kilograham wrote:
Tue Sep 28, 2021 8:26 pm
This all smacks of weirdness with the debugger, and disabling XIP mode (which I assume all of this strangeness occurs near/after)
if XIP is disabled when you stop are in the debugger, then it won't be able to read the flash. I suppose also that (to your earlier post) that maybe there is a missing memory synchronization on restoring XIP access which causes both a fault, and the actual hardfault handler bkpt #0 to be read as a different instruction.

I'll point someone who might know more this way.
Yes, all these posts relate to trying to debug the Pico port of BBC Basic, with VGA generation running on core 1, and saving files to Flash on core 0.

I am reasonably confident I have now got the entire video generation code on core 1 running from RAM. On core 0 the Flash erase and Flash write routines are bracketed in interrupt disable and restore.

Code: Select all

	// program data
#ifdef PICO
    printf ("Start saving to Flash at %p, size %d.\n", &bd->buffer[block*cfg->block_size + off], size);
    if ( *((uint32_t *) 0x10000100) == 0 ) __asm__("bkpt #0");
#ifdef PICO_VGA
    // multicore_lockout_start_blocking ();
#endif
	uint32_t ints = save_and_disable_interrupts();
	flash_range_program(&bd->buffer[block*cfg->block_size + off]
		-(uint8_t *)XIP_BASE, buffer, size);
	restore_interrupts(ints);
#ifdef PICO_VGA
    // multicore_lockout_end_blocking ();
#endif
    if ( *((uint32_t *) 0x10000100) == 0 ) __asm__("bkpt #0");
    printf ("Complete saving to Flash.\n");
#else

Code: Select all

#ifdef PICO
    printf ("Start erase Flash at %p, size %d.\n", &bd->buffer[block*cfg->block_size], cfg->block_size);
    if ( *((uint32_t *) 0x10000100) == 0 ) __asm__("bkpt #0");
#ifdef PICO_VGA
    // multicore_lockout_start_blocking ();
#endif
	uint32_t ints = save_and_disable_interrupts();
	flash_range_erase(&bd->buffer[block*cfg->block_size]
		-(uint8_t *)XIP_BASE, cfg->block_size);
	restore_interrupts(ints);
#ifdef PICO_VGA
    // multicore_lockout_end_blocking ();
#endif
    if ( *((uint32_t *) 0x10000100) == 0 ) __asm__("bkpt #0");
    printf ("Completed erase Flash.\n");
#else
Doing the save with video running is resulting in random corruption or crashes of core 0 and / or core 1.

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: Please check my understanding of Debug

Thu Sep 30, 2021 12:59 pm

Having found out how to step core 0 while core 1 is running, I have been trying to determine exactly when in the process of updating Flash is video generation being corrupted.

Code: Select all

(gdb) monitor reset init
target halted due to debug-request, current mode: Thread 
xPSR: 0xf1000000 pc: 0x000000ee msp: 0x20041f00
target halted due to debug-request, current mode: Thread 
xPSR: 0xf1000000 pc: 0x000000ee msp: 0x20041f00
(gdb) c
Continuing.

Breakpoint 4, lfs_bbc_erase (cfg=0x20001fd8 <lfs_root_cfg>, block=237)
    at /home/pi/pico/BBCSDL_Work/mcu/lfsmcu.c:119
119	    printf ("Start erase Flash at %p, size %d.\n", &bd->buffer[block*cfg->block_size], cfg->block_size);
1: x/i $pc
=> 0x10012cec <lfs_bbc_erase+8>:	ldr	r1, [r0, #28]
(gdb) c
Continuing.

Breakpoint 1, flash_range_erase (flash_offs=2019328, count=4096)
    at /home/pi/pico/pico-sdk/src/rp2_common/hardware_flash/flash.c:63
63	void __no_inline_not_in_flash_func(flash_range_erase)(uint32_t flash_offs, size_t count) {
1: x/i $pc
=> 0x200009dc <flash_range_erase>:	push	{r3, r4, r5, r6, r7, lr}
(gdb) c
Continuing.

Breakpoint 9, flash_range_erase (flash_offs=2019328, count=4096)
    at /home/pi/pico/pico-sdk/src/rp2_common/hardware_flash/flash.c:83
83	    flash_flush_cache(); // Note this is needed to remove CSn IO force as well as cache flushing
1: x/i $pc
=> 0x20000a26 <flash_range_erase+74>:	blx	r6
(gdb) si
0x00002360 in ?? ()
1: x/i $pc
=> 0x2360:	movs	r2, #160	; 0xa0
(gdb) 
0x00002362 in ?? ()
1: x/i $pc
=> 0x2362:	movs	r3, #1
(gdb) 
0x00002364 in ?? ()
1: x/i $pc
=> 0x2364:	push	{r4, lr}
(gdb) 
0x00002366 in ?? ()
1: x/i $pc
=> 0x2366:	lsls	r2, r2, #21
(gdb) 
0x00002368 in ?? ()
1: x/i $pc
=> 0x2368:	str	r3, [r2, #4]
(gdb) 
0x0000236a in ?? ()
1: x/i $pc
=> 0x236a:	ldr	r2, [r2, #4]
(gdb) info registers
r0             0x300               768
r1             0x0                 0
r2             0x14000000          335544320
r3             0x1                 1
r4             0x1ed000            2019328
r5             0x1000              4096
r6             0x2361              9057
r7             0x237d              9085
r8             0x23f5              9205
r9             0x24a1              9377
r10            0x20016400          536962048
r11            0x20017700          536966912
r12            0x4001801c          1073840156
sp             0x20041500          0x20041500
lr             0x20000a29          536873513
pc             0x236a              0x236a
xPSR           0x1000000           16777216
msp            0x20041500          0x20041500
psp            0xfffffffc          0xfffffffc
primask        0x1                 1
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
In the above debug log video generation in core 1 was continuing successfully until the very last step, when it suddenly blanked.

Note that XIP mode is exited at line 81 of flash.c, so by the time line 83 is reached XIP is disabled. So the fact that video generation was continuing at that point proves that the video generation code is entirely within RAM.

The instruction that killed video generation was a write of 1 to register XIP_FLUSH, i.e. requesting flushing the XIP cache. This has clearly affected the other core, even though that is running entirely from RAM.

It will require someone with information on the RP2040 internals to explain that.

kilograham
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 975
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: XIP flush affects code running from RAM

Fri Oct 01, 2021 1:31 pm

my guess would be that not everything is actually in RAM, but you're getting by with what is already in the XIP cache, but then flushing that breaks things.

Are you still able to catch the exception on the other core in the debugger?

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: XIP flush affects code running from RAM

Fri Oct 01, 2021 6:35 pm

kilograham wrote:
Fri Oct 01, 2021 1:31 pm
my guess would be that not everything is actually in RAM, but you're getting by with what is already in the XIP cache, but then flushing that breaks things.

Are you still able to catch the exception on the other core in the debugger?
I am as confident as I can be that all my video code is in RAM. I have looked through the generated bbcbasic.dis file fairly thoroughly looking for references to code in Flash.

Often the video code remains running, no exception, but something has happened to the addresses such that the display is blank or showing random patterns rather than the expected text. While a corruption of some sort is probable (but not certain) on any given trial, the exact nature is variable and not generally repeatable, making it very difficult to debug.

If I add a bit of code to Core 0 which sets Bit 3 (POWER_DOWN) on the XIP CTRL register for a few seconds and the video generation continues to run on Core 1, would that be conclusive evidence that the video code is entirely in RAM?

ejolson
Posts: 8305
Joined: Tue Mar 18, 2014 11:47 am

Re: XIP flush affects code running from RAM

Sat Oct 02, 2021 12:49 am

Memotech Bill wrote:
Fri Oct 01, 2021 6:35 pm
kilograham wrote:
Fri Oct 01, 2021 1:31 pm
my guess would be that not everything is actually in RAM, but you're getting by with what is already in the XIP cache, but then flushing that breaks things.

Are you still able to catch the exception on the other core in the debugger?
I am as confident as I can be that all my video code is in RAM. I have looked through the generated bbcbasic.dis file fairly thoroughly looking for references to code in Flash.
Since the code has been checked, could there be some data that appears in flash? Is it possible the compiler is placing some stuff there automatically even if it does not have a const declaration?

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: XIP flush affects code running from RAM

Sat Oct 02, 2021 9:10 am

Memotech Bill wrote:
Fri Oct 01, 2021 6:35 pm
If I add a bit of code to Core 0 which sets Bit 3 (POWER_DOWN) on the XIP CTRL register for a few seconds and the video generation continues to run on Core 1, would that be conclusive evidence that the video code is entirely in RAM?
Tried the experiment but it was not conclusive. Not only did core 1 continue running after setting the POWER_DOWN bit, but so did core 0, presumably more slowly, accessing Flash directly not via cache.

Can anyone suggest a sequence of commands that will (temporarily) completely disable XIP access to Flash?

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: XIP flush affects code running from RAM

Sat Oct 02, 2021 9:14 am

ejolson wrote:
Sat Oct 02, 2021 12:49 am
Since the code has been checked, could there be some data that appears in flash? Is it possible the compiler is placing some stuff there automatically even if it does not have a const declaration?
Again, I am as certain as it is possible to be that this is not the case for the framebuffer modes (everything except mode 7).

For mode 7, it was referencing the font definition in flash. I have now revised the code to copy the font into an unused part of the framebuffer, and access it from there.

Memotech Bill
Posts: 330
Joined: Sun Nov 18, 2018 9:23 am

Re: XIP flush affects code running from RAM

Sat Oct 02, 2021 10:22 am

Memotech Bill wrote:
Sat Oct 02, 2021 9:14 am
ejolson wrote:
Sat Oct 02, 2021 12:49 am
Since the code has been checked, could there be some data that appears in flash? Is it possible the compiler is placing some stuff there automatically even if it does not have a const declaration?
Again, I am as certain as it is possible to be that this is not the case for the framebuffer modes (everything except mode 7).

For mode 7, it was referencing the font definition in flash. I have now revised the code to copy the font into an unused part of the framebuffer, and access it from there.
Talk about missing the obvious :oops:

The mode definitions are in Flash!!!

kilograham
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 975
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: XIP flush affects code running from RAM

Sat Oct 02, 2021 7:08 pm

Memotech Bill wrote:
Sat Oct 02, 2021 10:22 am

Talk about missing the obvious :oops:

The mode definitions are in Flash!!!
Ha; i think i have been hit by that b4!

Return to “General”