Opened 14 years ago
Closed 10 years ago
#324 closed defect (fixed)
sun4v: Boot failure on a real Sun SPARC Enterprise T1000
Reported by: | Martin Decky | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | major | Milestone: | 0.6.0 |
Component: | helenos/kernel/sparc64 | Version: | mainline |
Keywords: | sashimi_regression | Cc: | |
Blocker for: | | Depends on: | |
See also: | | | |
Description
HelenOS revision 898 (w/o SMP) does not boot on a real-world Sun SPARC Enterprise T1000:
SPARC Enterprise T1000, No Keyboard
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.30.0, 3968 MB memory available, Serial #78494854.
Ethernet address 0:14:4f:ad:bc:86, Host ID: 84adbc86.
Boot device: /pci@7c0/pci@0/pci@8/scsi@2/disk@0,0:d File and args:
SILO Version 1.4.13
boot: HelenOS
Allocated 8 Megs of memory at 0x40000000 for kernel
Kernel doesn't support loading to high memory, relocating...done.
Loaded kernel version 0.0.0
HelenOS bootloader, release 0.4.2 (Skewer), revision 898M (jakub@jermar.eu-20110324225033-83yrugtwobx927l9)
Built on 2011-03-25 14:53:02 for sparc64
Copyright (c) 2001-2010 HelenOS project
Memory statistics (total 3964 MB, starting at 0x0000000008400000)
 0x000000000000c730|0x000000000840c730: boot info structure
 0x0000000000400000|0x0000000008800000: kernel entry point
 0x0000000000004000|0x0000000008404000: loader entry point
 0x000000000000d1fc|0x000000000840d1fc: kernel image (720080/125653 bytes)
 0x000000000002bcd1|0x000000000842bcd1: ns image (99497/40678 bytes)
 0x0000000000035bb7|0x0000000008435bb7: loader image (98086/40413 bytes)
 0x000000000003f994|0x000000000843f994: init image (100315/40621 bytes)
 0x0000000000049841|0x0000000008449841: devmap image (101892/42130 bytes)
 0x0000000000053cd3|0x0000000008453cd3: rd image (95298/38857 bytes)
 0x000000000005d49c|0x000000000845d49c: vfs image (111923/46438 bytes)
 0x0000000000068a02|0x0000000008468a02: fat image (137457/57951 bytes)
 0x0000000000076c61|0x0000000008476c61: initrd image (7012352/2380573 bytes)
Inflating components ... initrd fat vfs rd devmap init loader ns kernel .
Setting up boot allocator ...
Setting up screens ...
Canonizing OpenFirmware device tree ...
Booting the kernel ...
ERROR: Last Trap: Data Access Exception
The boot process ends at the OBP prompt. The output of ctrace is:
No saved state
.regs:
.regs ?
.locals:
   INs       LOCALs            OUTs
0: 0         c730              8400001
1: fee33c30  b918              c730
2: bfd8      400000            0
3: c24000    c400              400000
4: 0         ffffffffffffc000  6b0000
5: 3fff      b920              24531d
6: 5861      5fd0              5721
7: 4048      c400              60f4
Attachments (1)
Change History (18)
by , 14 years ago
Attachment: | t1_fail.zip added |
---|
comment:1 by , 14 years ago
Ok, those commands work on sun4u, even though ctrace works for me on sun4v too.
Instead of .regs, could you try:
ok .registers
ok .globals
ok .pc
ok .window
(this should work on sun4v).
comment:2 by , 14 years ago
Here is a diff between the boot on the emulated sun4v and the real one which fails:
--- emulated 2011-03-25 16:47:53.652526995 +0100
+++ t1000 2011-03-25 16:45:14.942526996 +0100
@@ -2,22 +2,22 @@
 Kernel doesn't support loading to high memory, relocating...done.
 Loaded kernel version 0.0.0
 HelenOS bootloader, release 0.4.2 (Skewer), revision 898M (jakub@jermar.eu-20110324225033-83yrugtwobx927l9)
-Built on 2011-03-25 16:20:18 for sparc64
+Built on 2011-03-25 14:53:02 for sparc64
 Copyright (c) 2001-2010 HelenOS project
-Memory statistics (total 252 MB, starting at 0x0000000080400000)
- 0x000000000000c730|0x000000008040c730: boot info structure
- 0x0000000000400000|0x0000000080800000: kernel entry point
- 0x0000000000004000|0x0000000080404000: loader entry point
- 0x000000000000d1fc|0x000000008040d1fc: kernel image (720080/125653 bytes)
- 0x000000000002bcd1|0x000000008042bcd1: ns image (99497/40678 bytes)
- 0x0000000000035bb7|0x0000000080435bb7: loader image (98086/40413 bytes)
- 0x000000000003f994|0x000000008043f994: init image (100315/40621 bytes)
- 0x0000000000049841|0x0000000080449841: devmap image (101892/42130 bytes)
- 0x0000000000053cd3|0x0000000080453cd3: rd image (95298/38857 bytes)
- 0x000000000005d49c|0x000000008045d49c: vfs image (111923/46438 bytes)
- 0x0000000000068a02|0x0000000080468a02: fat image (137457/57951 bytes)
- 0x0000000000076c61|0x0000000080476c61: initrd image (7012352/2379035 bytes)
+Memory statistics (total 3964 MB, starting at 0x0000000008400000)
+ 0x000000000000c730|0x000000000840c730: boot info structure
+ 0x0000000000400000|0x0000000008800000: kernel entry point
+ 0x0000000000004000|0x0000000008404000: loader entry point
+ 0x000000000000d1fc|0x000000000840d1fc: kernel image (720080/125653 bytes)
+ 0x000000000002bcd1|0x000000000842bcd1: ns image (99497/40678 bytes)
+ 0x0000000000035bb7|0x0000000008435bb7: loader image (98086/40413 bytes)
+ 0x000000000003f994|0x000000000843f994: init image (100315/40621 bytes)
+ 0x0000000000049841|0x0000000008449841: devmap image (101892/42130 bytes)
+ 0x0000000000053cd3|0x0000000008453cd3: rd image (95298/38857 bytes)
+ 0x000000000005d49c|0x000000000845d49c: vfs image (111923/46438 bytes)
+ 0x0000000000068a02|0x0000000008468a02: fat image (137457/57951 bytes)
+ 0x0000000000076c61|0x0000000008476c61: initrd image (7012352/2380573 bytes)
 Inflating components ... initrd fat vfs rd devmap init loader ns kernel .
 Setting up boot allocator ...
comment:3 by , 14 years ago
Summary: | sun4v: Boot failure on a real Sun SPARC Enterprise T1000mc → sun4v: Boot failure on a real Sun SPARC Enterprise T1000 |
---|
comment:4 by , 14 years ago
Keywords: | sashimi_regression added |
---|
comment:5 by , 14 years ago
I collected the output of Solaris prtconf -pv.
Here is the memory node:
Node 0xf023f4ec
    reg: 00000000.08000000.00000000.f8000000
    available: 00000000.fa4ec000.00000000.058cc000.00000000.fa306000.00000000.001d0000.00000000.fa2ce000.00000000.00034000.00000000.fa286000.00000000.00002000.00000000.fa032000.00000000.00252000.00000000.fa024000.00000000.0000c000.00000000.f97b8000.00000000.00048000.00000000.f0000000.00000000.01ca0000.00000000.08400000.00000000.d7c00000
    name: 'memory'
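The reg and available properties above can be read as (base, size) pairs. What follows is a minimal sketch, assuming each value is a 64-bit quantity printed as two dot-separated 32-bit cells with the high cell first (the four-cell reg entry suggests this, but the ticket does not state it). The type, function and array names are made up for illustration and are not HelenOS or OpenBoot identifiers.

```c
#include <stdint.h>

/* Hypothetical helper type; not part of HelenOS or OpenBoot. */
typedef struct {
	uint64_t base;  /* physical start address */
	uint64_t size;  /* length in bytes */
} mem_range_t;

/* Join a high and a low 32-bit cell into one 64-bit value. */
uint64_t cells_to_u64(uint32_t hi, uint32_t lo)
{
	return ((uint64_t) hi << 32) | lo;
}

/* Decode one range from four cells: base (2 cells), then size (2 cells). */
mem_range_t decode_range(const uint32_t cells[4])
{
	mem_range_t range = {
		.base = cells_to_u64(cells[0], cells[1]),
		.size = cells_to_u64(cells[2], cells[3]),
	};
	return range;
}

/* The reg property from the prtconf output. */
const uint32_t t1000_reg[4] = {
	0x00000000, 0x08000000, 0x00000000, 0xf8000000
};

/* The last entry of the available property. */
const uint32_t t1000_last_available[4] = {
	0x00000000, 0x08400000, 0x00000000, 0xd7c00000
};
```

Under that reading, reg describes a single bank of 0xf8000000 bytes (3968 MB, the figure in the OpenBoot banner of the original report) at 0x08000000, and the last available range starts at 0x08400000, the address the boot loader prints as the start of memory.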
comment:6 by , 13 years ago
Milestone: | 0.5.0 → 0.5.1 |
---|
comment:7 by , 11 years ago
Sun Fire(TM) T1000, No Keyboard
Copyright (c) 1998, 2011, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.30.4.d, 16256 MB memory available, Serial #81907054.
Ethernet address 0:14:4f:e1:cd:6e, Host ID: 84e1cd6e.
Boot device: /pci@7c0/pci@0/network@4:bootp,10.163.102.27,image.boot,10.163.101.55,10.163.100.1 File and args:
1000 Mbps full duplex Link up
HelenOS bootloader, release 0.5.0 (Fajtl), revision 1903M (martin@decky.cz-20130719173805-d7pm0z90qp8nzzqm)
Built on 2013-07-24 11:14:38 for sparc64
Copyright (c) 2001-2013 HelenOS project
Memory statistics (total 3964 MB, starting at 0x0000000008400000)
 0x000000000000c1f0|0x000000000840c1f0: boot info structure
 0x0000000000400000|0x0000000008800000: kernel entry point
 0x0000000000004000|0x0000000008404000: loader entry point
 0x000000000000cccc|0x000000000840cccc: kernel image (781248/137767 bytes)
 0x000000000002e6f3|0x000000000842e6f3: ns image (152485/64394 bytes)
 0x000000000003e27d|0x000000000843e27d: loader image (151124/64348 bytes)
 0x000000000004ddd9|0x000000000844ddd9: init image (152882/64815 bytes)
 0x000000000005db08|0x000000000845db08: locsrv image (160002/68204 bytes)
 0x000000000006e574|0x000000000846e574: rd image (150352/63580 bytes)
 0x000000000007ddd0|0x000000000847ddd0: vfs image (167172/71632 bytes)
 0x000000000008f5a0|0x000000000848f5a0: logger image (155917/66305 bytes)
 0x000000000009f8a1|0x000000000849f8a1: ext4fs image (235854/98507 bytes)
 0x00000000000b796c|0x00000000084b796c: initrd image (8388608/1866909 bytes)
Inflating components ... initrd ext4fs logger vfs rd locsrv init loader ns kernel .
Setting up boot allocator ...
Setting up screens ...
Canonizing OpenFirmware device tree ...
Booting the kernel ...
ERROR: Last Trap: Data Access Exception
{0} ok .registers
   Normal  GL=1
0: 0       0
1: 3fe0    1
2: 1       1
3: 1       1
4: a       3fff20000
5: a       3fff20000
6: 0       3fff20600
7: 0       f0243cec
%PC 6114 %nPC 6118
%TBA f0200000 %CCR 99000004 XCC:nzvc ICC:nZvc
{0} ok .globals
0: 0
1: 3fe0
2: 1
3: 1
4: a
5: a
6: 0
7: 0
{0} ok .pc
6114
{0} ok 0 .window
   INs     LOCALs  OUTs
0: c800    e1bfff  8400001
1: 6008    e18000  c1f0
2: 400000  e18000  0
3: 0       bb38    400000
4: c1f0    b3e0    b800
5: b3d8    0       5f20
6: 5861    0       5761
7: 4048    0       60f4
{0} ok 1 .window
   INs  LOCALs  OUTs
0: 0    ccc0    c800
1: 0    0       6008
2: 0    0       400000
3: 0    0       0
4: 0    0       c1f0
5: 0    0       b3d8
6: 0    0       5861
7: 0    0       4048
{0} ok
From disassembly of image.boot:
000000000000610c <icache_flush>:
610c: 03 00 00 0f sethi %hi(0x3c00), %g1
6110: 82 10 63 e0 or %g1, 0x3e0, %g1 ! 3fe0 <start-0x20>
XXX 6114: c0 f0 4c e0 stxa %g0, [ %g1 ] #ASI_IC_TAG
6118: 81 43 e0 40 membar #Sync
611c: 82 a0 60 20 subcc %g1, 0x20, %g1
6120: 12 6f ff fe bne %xcc, 6118 <icache_flush+0xc>
6124: c0 f0 4c e0 stxa %g0, [ %g1 ] #ASI_IC_TAG
6128: 81 43 e0 40 membar #Sync
612c: 81 c3 e0 08 retl
6130: 01 00 00 00 nop
00000000000060e8 <jump_to_kernel>:
60e8: 80 a2 a0 03 cmp %o2, 3
60ec: 02 68 00 04 be %xcc, 60fc <jump_to_kernel+0x14>
60f0: 01 00 00 00 nop
XXX 60f4: 40 00 00 06 call 610c <icache_flush>
60f8: 01 00 00 00 nop
60fc: 81 43 e0 08 membar #StoreStore
6100: 81 df c0 00 flush %i7
6104: 81 c2 c0 00 jmp %o3
6108: 01 00 00 00 nop
0000000000004000 <start>:
4000: 10 68 00 08 b %xcc, 4020 <start+0x20>
4004: 01 00 00 00 nop
4008: 48 64 72 53 call 21920954 <ofw_claim_phys_internal.part.1+0x216a132c>
400c: 00 00 00 00 illtrap 0
…
4020: 8d 90 20 04 wrpr 4, %pstate
4024: 95 90 20 06 wrpr 6, %cansave
4028: 97 90 20 00 wrpr 0, %canrestore
402c: 9b 90 20 00 wrpr 0, %otherwin
4030: 99 90 20 07 wrpr 7, %cleanwin
4034: 1d 00 00 18 sethi %hi(0x6000), %sp
4038: 9c 13 a0 60 or %sp, 0x60, %sp ! 6060 <initial_stack>
403c: 9c 03 b8 01 add %sp, -2047, %sp
4040: 21 00 00 33 sethi %hi(0xcc00), %l0
4044: a0 14 20 c0 or %l0, 0xc0, %l0 ! ccc0 <ofw_cif>
XXX 4048: 40 00 0b ae call 6f00 <ofw_init>
404c: d8 74 00 00 stx %o4, [ %l0 ]
4050: 10 68 08 4c b %xcc, 6180 <bootstrap>
4054: 01 00 00 00 nop
4058: 01 00 00 00 nop
405c: 01 00 00 00 nop
…
and so the stack trace is:
xxxx 6114=icache_flush+0x8()
xxxx 60f4=jump_to_kernel+0xc()
xxxx 4048=start+0x48()
The faulting instruction is
6114: c0 f0 4c e0 stxa %g0, [ %g1 ] #ASI_IC_TAG
where
{0} ok %g1 . 3fe0
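To make the faulting loop easier to follow, here is a rough C model of the offsets that icache_flush walks, read off the disassembly above (start at 0x3fe0, step down by 0x20, one final store at offset 0 from the branch delay slot). It only counts the stores; the function and macro names are hypothetical, and no ASI access is performed.

```c
#include <stddef.h>
#include <stdint.h>

#define FLUSH_START_OFFSET 0x3fe0u  /* sethi %hi(0x3c00) combined with or 0x3e0 */
#define FLUSH_STEP         0x20u    /* subcc %g1, 0x20, %g1 */

/*
 * Walk the same offsets as the loop in the listing and count the stores.
 * If first/last are non-NULL, they receive the first and last offset written.
 */
size_t icache_flush_model(uint64_t *first, uint64_t *last)
{
	uint64_t offset = FLUSH_START_OFFSET;
	size_t stores = 0;

	if (first != NULL)
		*first = offset;

	for (;;) {
		/* Stands in for: stxa %g0, [%g1] ASI_IC_TAG; membar #Sync */
		stores++;
		if (last != NULL)
			*last = offset;
		if (offset == 0)
			break;
		offset -= FLUSH_STEP;
	}

	return stores;
}
```

The model issues 512 stores in total, but on the T1000 the very first one (offset 0x3fe0, the instruction at 6114) already traps, which matches the %PC and %g1 values printed by OBP.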
comment:8 by , 11 years ago
From UA2005:
10.2 ASI Values
The range of address space identifiers (ASIs) is 00₁₆-FF₁₆. That range is divided into restricted and unrestricted portions. ASIs in the range 80₁₆-FF₁₆ are unrestricted; they may be accessed by software running in any privilege mode.
ASIs in the range 00₁₆-7F₁₆ are restricted; they may only be accessed by software running in a mode with sufficient privilege for the particular ASI. ASIs in the range 00₁₆-2F₁₆ may only be accessed by software running in privileged or hyperprivileged mode and ASIs in the range 30₁₆-7F₁₆ may only be accessed by software running in hyperprivileged mode.
SPARC V9 Compatibility Note: In SPARC V9, the range of ASIs was evenly divided into restricted (00₁₆-7F₁₆) and unrestricted (80₁₆-FF₁₆) halves.
An attempt by nonprivileged software to access a restricted (privileged or hyperprivileged) ASI (00₁₆-7F₁₆) causes a privileged_action trap.
An attempt by privileged software to access a hyperprivileged ASI (30₁₆-7F₁₆) also causes a privileged_action trap.
So it seems we cannot write to the ICACHE ASI (0x67) on sun4v. This problem was introduced by mainline,422; before that, it seems, we did not flush the I-cache on sun4v. Not sure if we need to do it another way (e.g. some hypercall?).
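The ranges quoted from UA2005 can be turned into a small classifier, which makes it easy to see where 0x67 lands. This is only a sketch of the quoted rule; the enum and function names are invented for illustration and do not come from the HelenOS sources.

```c
#include <stdint.h>

/* Hypothetical names; only the ranges come from the UA2005 text. */
typedef enum {
	ASI_CLASS_PRIVILEGED,       /* 0x00-0x2f: privileged or hyperprivileged */
	ASI_CLASS_HYPERPRIVILEGED,  /* 0x30-0x7f: hyperprivileged only */
	ASI_CLASS_UNRESTRICTED      /* 0x80-0xff: any privilege mode */
} asi_class_t;

/* Classify an ASI number by the range it falls into. */
asi_class_t asi_classify(uint8_t asi)
{
	if (asi <= 0x2f)
		return ASI_CLASS_PRIVILEGED;
	if (asi <= 0x7f)
		return ASI_CLASS_HYPERPRIVILEGED;
	return ASI_CLASS_UNRESTRICTED;
}

/* Nonzero if software running in privileged mode may not use the ASI. */
int asi_denied_in_privileged_mode(uint8_t asi)
{
	return asi_classify(asi) == ASI_CLASS_HYPERPRIVILEGED;
}
```

With these ranges, 0x67 falls into the hyperprivileged-only class, so a guest kernel running in privileged mode under the hypervisor is not expected to be able to use it.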
comment:9 by , 11 years ago
Note that mainline,1920 #ifdefs the ASI_ICACHE accesses out for sun4v, which enables HelenOS to boot to kconsole.
comment:10 by , 11 years ago
Owner: | changed from | to
---|---|
Priority: | critical → major |
Status: | new → assigned |
comment:11 by , 11 years ago
After mainline,1921, mainline,1922 and mainline,1923, HelenOS/sun4v can make it quite far into userspace initialization. As far as stability is concerned, the only problem seems to be around this area in fibril.c:
fibril_t *srcf = __tcb_get()->fibril_data;
if (stype != FIBRIL_FROM_DEAD) {
	/* Save current state */
	if (!context_save(&srcf->ctx)) {
		if (serialization_count)
			srcf->flags &= ~FIBRIL_SERIALIZED;
		if (srcf->clean_after_me) {    <========== HERE
			/*
			 * Cleanup after the dead fibril from which we
			 * restored context here.
			 */
			void *stack = srcf->clean_after_me->stack;    <=========== or HERE
			if (stack) {
				/*
				 * This check is necessary because a
				 * thread could have exited like a
				 * normal fibril using the
				 * FIBRIL_FROM_DEAD switch type. In that
				 * case, its fibril will not have the
				 * stack member filled.
				 */
Either srcf->clean_after_me or srcf->clean_after_me->stack contains some garbage (unaligned or unmapped).
comment:12 by , 11 years ago
The corresponding disasm is here:
c6b4: 82 10 00 07 mov %g7, %g1
c6b8: c4 5f a8 7f ldx [ %fp + 0x87f ], %g2
c6bc: c2 58 60 08 ldx [ %g1 + 8 ], %g1
c6c0: 80 a0 a0 03 cmp %g2, 3
c6c4: 02 40 00 73 be,pn %icc, c890 <fibril_switch+0x250>
c6c8: c2 77 a7 f7 stx %g1, [ %fp + 0x7f7 ]
c6cc: 40 00 53 f5 call 216a0 <context_save>
c6d0: 90 00 60 10 add %g1, 0x10, %o0
c6d4: 80 a2 20 00 cmp %o0, 0
c6d8: 12 40 00 a1 bne,pn %icc, c95c <fibril_switch+0x31c>
c6dc: 03 00 00 00 sethi %hi(0), %g1
c6e0: 82 18 7f e8 xor %g1, -24, %g1
c6e4: c2 01 c0 01 ld [ %g7 + %g1 ], %g1
c6e8: 80 a0 60 00 cmp %g1, 0
c6ec: 12 48 00 2f bne %icc, c7a8 <fibril_switch+0x168>
c6f0: c8 5f a7 f7 ldx [ %fp + 0x7f7 ], %g4
c6f4: ca 5f a7 f7 ldx [ %fp + 0x7f7 ], %g5
c6f8: fa 59 60 c8 ldx [ %g5 + 0xc8 ], %i5 <======== here %g5 is misaligned
c6fc: 22 c7 40 10 brz,a,pn %i5, c73c <fibril_switch+0xfc>
c700: b0 10 20 01 mov 1, %i0
c704: d0 5f 60 a8 ldx [ %i5 + 0xa8 ], %o0
c708: 02 c2 00 06 brz,pn %o0, c720 <fibril_switch+0xe0>
c70c: 01 00 00 00 nop
c710: 7f ff e8 b4 call 69e0 <as_area_destroy>
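For context on why a misaligned %g5 is fatal at c6f8: ldx is a 64-bit load, and SPARC V9 requires its effective address to be a multiple of 8. The displacement 0xc8 is itself a multiple of 8, so the access is aligned exactly when the base register is. A small sketch of that check follows; the helper names are hypothetical and not taken from HelenOS.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of alignment (alignment must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
	return (addr & (alignment - 1)) == 0;
}

/* True if a doubleword load from base + displacement is properly aligned. */
bool ldx_address_ok(uintptr_t base, uintptr_t displacement)
{
	return is_aligned(base + displacement, 8);
}
```

A debugging assertion of this kind on srcf right after context_save() returns would be one way to catch the corrupted pointer before the load traps.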
comment:13 by , 11 years ago
For me not only vfs, but also getterm crashed. It turned out the getterm crash is a consequence of the vfs crash. Getterm got NULL for stdout and attempted to call setvbuf() on stdout before checking it. I fixed that in mainline,2011.
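The getterm problem described above is the usual "use before NULL check" pattern. A minimal sketch of the guarded form follows, assuming a helper that receives the stream as a parameter; the helper name and the choice of line buffering are illustrative and are not what mainline,2011 actually does.

```c
#include <stdio.h>

/*
 * Switch a stream to line buffering, but only if the stream exists.
 * Returns 0 on success and -1 if the stream is NULL or setvbuf() fails.
 */
int set_line_buffered(FILE *stream)
{
	if (stream == NULL)
		return -1;

	/* A NULL buffer lets the C library allocate one of the given size. */
	return (setvbuf(stream, NULL, _IOLBF, BUFSIZ) == 0) ? 0 : -1;
}
```

A caller can then report the missing stream, or exit cleanly, instead of dereferencing NULL inside the C library.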
follow-up: 15 comment:14 by , 11 years ago
I was trying to debug this and that made the bug go away Heisenbug style:
- setting optimization level to -O1
- adding line debugging information
Compile sparc64/Niagara with line debugging information enabled, and VFS does not crash. I then hit another bug: the kernel Niagara output driver does not set the "fb" sysinfo node. Fixing that got me a fully working shell.
Patch below:
=== modified file 'kernel/arch/sparc64/src/drivers/niagara.c'
--- kernel/arch/sparc64/src/drivers/niagara.c 2012-06-20 16:18:37 +0000
+++ kernel/arch/sparc64/src/drivers/niagara.c 2013-09-26 15:37:36 +0000
@@ -205,6 +205,7 @@
  * buffers.
  */
+ sysinfo_set_item_val("fb", NULL, true);
  sysinfo_set_item_val("fb.kind", NULL, 5);
  sysinfo_set_item_val("niagara.outbuf.address", NULL,
=== modified file 'uspace/Makefile.common'
--- uspace/Makefile.common 2013-09-14 15:41:36 +0000
+++ uspace/Makefile.common 2013-09-24 13:40:56 +0000
@@ -173,9 +173,9 @@
 endif
 ifeq ($(CONFIG_OPTIMIZE_FOR_SIZE),y)
- OPTIMIZATION = s
+ OPTIMIZATION = 1
 else
- OPTIMIZATION = 3
+ OPTIMIZATION = 1
 endif
 .PHONY: all clean
=== modified file 'uspace/lib/c/arch/sparc64/_link.ld.in'
--- uspace/lib/c/arch/sparc64/_link.ld.in 2012-09-05 14:47:41 +0000
+++ uspace/lib/c/arch/sparc64/_link.ld.in 2013-09-26 15:04:36 +0000
@@ -9,6 +9,7 @@
  text PT_LOAD FLAGS(5);
 #endif
  data PT_LOAD FLAGS(6);
+ debug PT_NOTE;
 }
 SECTIONS {
@@ -62,6 +63,19 @@
  *(.bss);
  } :data
+#ifdef CONFIG_LINE_DEBUG
+ .comment 0 : { *(.comment); } :debug
+ .debug_abbrev 0 : { *(.debug_abbrev); } :debug
+ .debug_aranges 0 : { *(.debug_aranges); } :debug
+ .debug_info 0 : { *(.debug_info); } :debug
+ .debug_line 0 : { *(.debug_line); } :debug
+ .debug_loc 0 : { *(.debug_loc); } :debug
+ .debug_pubnames 0 : { *(.debug_pubnames); } :debug
+ .debug_pubtypes 0 : { *(.debug_pubtypes); } :debug
+ .debug_ranges 0 : { *(.debug_ranges); } :debug
+ .debug_str 0 : { *(.debug_str); } :debug
+#endif
+
  /DISCARD/ : {
  *(*);
  }
(END)
follow-up: 16 comment:15 by , 10 years ago
There is a fix in the CHT preintegration branch for a bug, which may have been causing this random memory corruption:
http://bazaar.launchpad.net/~jakub/helenos/cht-preintegration/revision/2290
Replying to svoboda (comment:14).
comment:16 by , 10 years ago
Replying to jermar:
There is a fix in the CHT preintegration branch for a bug, which may have been causing this random memory corruption:
http://bazaar.launchpad.net/~jakub/helenos/cht-preintegration/revision/2290
Unfortunately, I am still able to reproduce this from time to time even with the above fix.
comment:17 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I am closing this ticket and going to create new tickets for the pending issues.
boot image and kernel binary