Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#782 closed defect (fixed)

HelenOS does not boot on Raspberry Pi

Reported by: Jakub Jermář
Owned by:
Priority: major
Milestone: 0.9.1
Component: helenos/boot/arm32
Version: mainline
Keywords:
Cc:
Blocker for:
Depends on:
See also:

Description

As of commit 4bb4cf88f506ddc6012f655a28835fe8872e9f71, the console output stops upon entering the kernel:

HelenOS bootloader, release 0.7.2 (Boosted Effort), revision 4bb4cf88f
Built on 2018-12-17 21:01:02 for arm32
Copyright (c) 2001-2018 HelenOS project
Boot loader: 0x00008000 -> 0x00015d40

Memory statistics
 0x00015000|0x00015000: bootstrap stack
 0x00010000|0x00010000: bootstrap page table
 0x00015838|0x00015838: boot info structure
 0x80a08000|0x00a08000: kernel entry point
Boot loader: 0x00008000 -> 0x00015d40
Payload: 0x00015d40 -> 0x00263d40
Kernel load address: 0x00a08000
Kernel start: 0x80a08000
RAM end: 0x01a08000 (16777216 bytes available)

Inflating components ... 
 0x80a08000|0x00a08000: kernel.elf.gz image (550396/154178 bytes)
 0x80a8f000|0x00a8f000: ns.gz image (107112/50188 bytes)
 0x80aaa000|0x00aaa000: loader.gz image (107240/50411 bytes)
 0x80ac5000|0x00ac5000: init.gz image (134476/62563 bytes)
 0x80ae6000|0x00ae6000: locsrv.gz image (122668/58017 bytes)
 0x80b04000|0x00b04000: rd.gz image (113604/53331 bytes)
 0x80b20000|0x00b20000: vfs.gz image (136864/63692 bytes)
 0x80b42000|0x00b42000: logger.gz image (119284/55647 bytes)
 0x80b60000|0x00b60000: fat.gz image (182444/86286 bytes)
 0x80b8d000|0x00b8d000: initrd.img.gz image (5345280/1769857 bytes)
Done.
Booting the kernel...

Change History (7)

comment:1 by Jakub Jermář, 6 years ago

This behavior (no kernel messages printed) has occurred since:

4621d2311994bf63dea425ed923239d4ca1babc9 is the first bad commit
commit 4621d2311994bf63dea425ed923239d4ca1babc9
Author: Jiří Zárevúcky <jiri.zarevucky@nic.cz>
Date:   Mon Aug 13 05:00:17 2018 +0200

    Use compiler builtins for kernel atomics

:040000 040000 ac60c15132946569adf411d62658b0c5d133d5ac 556da68f1448a4d71cf4e6c6c159fe050a640994 M	abi
:040000 040000 ca0e6b391de9b4fd2f196b5f1b6be33072a352e1 ba3cea6956d002fb330343d54451a4b923305d8f M	kernel
:040000 040000 4522e5b77223c8b4e0dbb4dc2e914e90f861ce3d 717b54d24b9459244fec8ea8df7538d8e4da789f M	uspace

However, even before this commit, as far back as the switch to the HelenOS-specific toolchain in commit bbe5e34956da986df4d32357c697e539e8cfec0d, the boot was failing with:

SPARTAN kernel, release 0.7.2 (Boosted Effort), revision ffa73c60d
Built on 2018-12-17 21:51:01 for arm32
Copyright (c) 2001-2018 HelenOS project
Detected 1 CPU(s), 196576 KiB free memory

######> Kernel panic on cpu0 due to a failed assertion: <######
waitq_sleep_timeout() at generic/src/synch/waitq.c:242:
(!PREEMPTION_DISABLED) || (PARAM_NON_BLOCKING(flags, usec))

THE=0x80544000: pe=1 thread=0x80543000 task=0x8053f000 cpu=0x80536400 as=0x80009000 magic=0xfacefeed
thread="kinit"
task="kernel"
0x80545eac: generic/src/debug/stacktrace.o:stack_trace()+0x0000001c
0x80545edc: generic/src/debug/panic.o:panic_common()+0x00000178
0x80545f24: generic/src/synch/waitq.o:waitq_sleep_timeout()+0x00000100
0x80545f7c: generic/src/mm/as.o:as_page_fault()+0x00000060
0x80545fb4: generic/src/interrupt/interrupt.o:exc_dispatch()+0x000000c8
cpu0: halted

This kernel panic is more difficult to bisect due to the toolchain change.

comment:2 by Jakub Jermář, 6 years ago

The kernel panic started with this commit:

edc64c03b91257aecae0d60886bd274aea300bf9 is the first bad commit
commit edc64c03b91257aecae0d60886bd274aea300bf9
Author: Jakub Jermar <jakub@jermar.eu>
Date:   Wed Jul 18 00:42:57 2018 +0200

    Zero out new thread's register context
    
    This removes the information leak in which the new thread inherited some
    register values from the thread which created it. Also, now each thread
    begins execution with a well-defined register state.

:040000 040000 00a5a6a1f0af764b7222a75ae8d5c5b472a9f4f9 06d1f1b58faa1025b6c39f5089ac29a686ebf744 M	kernel

comment:3 by Jakub Jermář, 6 years ago

Commit 336b7393ec3e072439a0e045724088e669be87d4 fixed the panic caused by edc64c03b91257aecae0d60886bd274aea300bf9 (zero cpu_mode in context_t), but the crash due to 4621d2311994bf63dea425ed923239d4ca1babc9 (switch to compiler builtins for atomics) still remains.

comment:4 by Jakub Jermář, 6 years ago

Milestone: 0.8.0 → 0.9.1

comment:5 by Jakub Jermář, 6 years ago

I ran a couple of experiments that helped me narrow down the problem. It looks like the following test procedure executes as expected when called after the kernel's call to as_switch() in page_arch_init() and misbehaves if executed before it:

80a47b10:       e1a0c00d        mov     ip, sp
80a47b14:       e92dd800        push    {fp, ip, lr, pc}
80a47b18:       e24cb004        sub     fp, ip, #4
80a47b1c:       e24dd008        sub     sp, sp, #8

80a47b20:       ee070fba        mcr     15, 0, r0, cr7, cr10, {5}   <= DMB
80a47b24:       e24b3010        sub     r3, fp, #16

80a47b28:       e1932f9f        ldrex   r2, [r3]
80a47b2c:       e2822001        add     r2, r2, #1
80a47b30:       e1831f92        strex   r1, r2, [r3]
80a47b34:       e3510000        cmp     r1, #0
80a47b38:       1afffffa        bne     80a47b28                    <= atomic_inc()

80a47b3c:       e3a00000        mov     r0, #0
80a47b40:       ee070fba        mcr     15, 0, r0, cr7, cr10, {5}   <= DMB
80a47b44:       e24bd00c        sub     sp, fp, #12
80a47b48:       e89da800        ldm     sp, {fp, sp, pc}

As for what exactly "misbehaves" means, I suspect STREX always returns 1, thus forming an infinite loop. It's as if the system was not yet ready to execute the LDREX-ADD-STREX atomic sequence and calling page_arch_init() fixed that.
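The retry loop in the listing above can be sketched in portable C. This is a minimal illustration using C11 <stdatomic.h>, not the HelenOS implementation: sketch_atomic_inc() is a hypothetical name, and the assumption is that an ARM compiler typically lowers such a compare-and-swap loop to an LDREX/STREX sequence like the one shown. If the store never succeeds (the suspected STREX == 1 case), the while loop below never exits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the HelenOS atomic_inc()): increments *counter
 * with a retry loop that has the same shape as LDREX / ADD / STREX / BNE.
 * Returns the value stored by the successful attempt.
 */
static unsigned int sketch_atomic_inc(atomic_uint *counter)
{
	assert(counter != NULL);

	/* Counterpart of LDREX: read the current value. */
	unsigned int old = atomic_load_explicit(counter, memory_order_relaxed);

	/*
	 * Counterpart of STREX + CMP + BNE: try to store old + 1 and retry
	 * on failure. On failure, `old` is refreshed with the current value.
	 * If the store could never succeed, this loop would spin forever.
	 */
	while (!atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
	    memory_order_seq_cst, memory_order_relaxed)) {
		/* Nothing to do here; just retry with the refreshed value. */
	}

	return old + 1;
}
```

On a hosted system the loop terminates after one or a few attempts; calling sketch_atomic_inc() on a zero-initialized atomic_uint a thousand times leaves it at 1000.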

Last edited 6 years ago by Jakub Jermář

comment:6 by Jakub Jermář, 6 years ago

Ok, I figured this out.

The problem is that the loader installs a 1:1 mapping between the virtual and physical address spaces (and wickedly assumes physical memory mirrors at 2G). Virtual addresses that map identically to physical memory are mapped as cacheable (both inner and outer write-back, write-allocate), and everything else is mapped as noncacheable because it is assumed to be a device. Unfortunately, this "everything else" also includes the kernel virtual addresses that use a PA2KA() mapping (i.e. identity with a shift to 2G). So until the kernel installs its own page tables, the LDREX/STREX instructions operate on memory that is mapped as noncacheable device memory. No wonder it doesn't work. Previous versions were not affected because they used a different mechanism that is not sensitive to the memory attributes of the memory used.
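The attribute decision described above can be modelled with a few lines of C. This is only a sketch under stated assumptions: every sketch_* name is hypothetical, the RAM description is a simplified stand-in for whatever the loader really consults, and the 2G offset is taken from the "shift to 2G" wording and the addresses in the boot log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shift applied by PA2KA(): identity plus 2G. */
#define SKETCH_KA_OFFSET  UINT32_C(0x80000000)

/* Hypothetical description of physical RAM as [base, base + size). */
typedef struct {
	uint32_t base;
	uint32_t size;
} sketch_ram_t;

/* Model of PA2KA(): kernel virtual address of a physical address. */
static uint32_t sketch_pa2ka(uint32_t pa)
{
	return pa + SKETCH_KA_OFFSET;
}

/*
 * Model of the loader's original decision: a virtual address is mapped
 * cacheable only if, read as a physical address, it lies inside RAM.
 * Everything else is treated as device memory (noncacheable).
 */
static bool sketch_loader_cacheable(const sketch_ram_t *ram, uint32_t va)
{
	assert(ram != NULL && ram->size > 0);
	return va >= ram->base && va - ram->base < ram->size;
}
```

With RAM assumed to start at physical address 0, the kernel load address 0x00a08000 from the log is classified as cacheable, while its kernel virtual address sketch_pa2ka(0x00a08000) == 0x80a08000 is not, which is the situation the comment describes.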

Splitting the loader's page table into two halves, the first with a 1:1 mapping and the second with a PA2KA mapping, fixes the problem on Raspberry Pi. Unfortunately, it breaks bbone, most likely because its physical memory already starts at 2G (which might also be the reason it was not affected by the issue in the first place).
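A rough model of the split table helps to see both effects. The sketch below assumes the split is exactly at 2G and uses hypothetical sketch_* names; it is not the loader's code. The lower half translates identically, the upper half subtracts the 2G offset, so no virtual address translates to a physical address at or above 2G any more, which would be consistent with the suspected bbone breakage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed split point and PA2KA() shift: exactly 2G. */
#define SKETCH_KA_OFFSET  UINT32_C(0x80000000)

static_assert(SKETCH_KA_OFFSET == (UINT32_C(1) << 31),
    "the sketch assumes a 2G split");

/*
 * Model of the split page table: the lower half is a 1:1 mapping, the
 * upper half is the inverse of PA2KA() (virtual minus 2G).
 */
static uint32_t sketch_split_va2pa(uint32_t va)
{
	if (va < SKETCH_KA_OFFSET)
		return va;
	return va - SKETCH_KA_OFFSET;
}

/*
 * Under the split table every translation result is below 2G, so a
 * physical address is reachable only if it lies in the lower half.
 */
static bool sketch_split_reachable(uint32_t pa)
{
	return pa < SKETCH_KA_OFFSET;
}
```

For the Raspberry Pi addresses in the log, both 0x00a08000 and 0x80a08000 translate to physical 0x00a08000, so the kernel now runs from memory that can carry the cacheable attributes. A board whose RAM is assumed to begin at physical 0x80000000 has no virtual address left that reaches it.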

I am now looking into ways to fix this so that nothing breaks.

comment:7 by Jakub Jermář, 6 years ago

Component: helenos/kernel/arm32 → helenos/boot/arm32
Resolution: fixed
Status: new → closed