Opened 8 months ago

Closed 8 months ago

#871 closed defect (fixed)

Tasks crash during boot of 486 build

Reported by: Jiri Svoboda Owned by:
Priority: major Milestone: 0.14.2
Component: helenos/unspecified Version: mainline
Keywords: Cc:
Blocker for: Depends on:
See also:

Description

When I configure ia32 preset and change processor to 486, build the system and boot it using tools/ew.py, the system starts to boot, but a number of tasks crash.

Change History (19)

comment:1 by Jiri Svoboda, 8 months ago

Bisecting the changeset history revealed that this problem starts with changeset a635535739c9e7901c0b9b5b1890f38965383f17:

commit a635535739c9e7901c0b9b5b1890f38965383f17 (HEAD)
Author: Jiri Svoboda <jiri@wiwaxia>
Date:   Thu Nov 30 21:29:05 2023 +0100

    Move console/con_srv out of libc into a separate library

comment:2 by Jiri Svoboda, 8 months ago

The first task to crash is /srv/hid/input:

Task /srv/hid/input (55) killed due to an exception at program counter 0x00000000.
cs =0x0000001b	eip=0x00000000	efl=0x00210202	err=0x00000004
ds =0x00000023	es =0x00000023	fs =0x00000023	gs =0x00000030
ss =0x00000023
eax=0x701705c4	ebx=0x7016f6dc	ecx=0x00000001	edx=0x0804fc20
esi=0x70014ac0	edi=0x70014ac0	ebp=0x7012ee48	esp=0x7012ee1c
0x7012ee48: 0x00000000()
0x7012ee68: 0x7014cea0()
0x7012ee98: 0x70144e0c()
0x7012eeb8: 0x0804b151()
0x7012eec4: 0x0804b0c2()
Kill message: Page fault: 0x00000000.
[/srv/taskmon(16)] taskmon: Task 55 fault in thread 0x836d73b0.
[/srv/taskmon(16)] taskmon: Executing /app/taskdump -t 55
[/app/taskdump(56)] Task Dump Utility
[/app/taskdump(56)] Dumping task '/srv/hid/input' (task ID 55).
[/app/taskdump(56)] Loaded symbol table from /srv/hid/input
[/app/taskdump(56)] Threads:
[/app/taskdump(56)]  [1] hash: 0x836d73b0
[/app/taskdump(56)] Thread 0x836d73b0: PC = 0x00000000 (loc_category_get_id+0). FP = 0x7012ee48
[/app/taskdump(56)]   0x7012ee48: 0x00000000 (loc_category_get_id+0)
[/app/taskdump(56)]   0x7012ee68: 0x7014cea0 (_end+1745867220)
[/app/taskdump(56)]   0x7012ee98: 0x70144e0c (_end+1745834304)
[/app/taskdump(56)]   0x7012eeb8: 0x0804b151 (__c_start+129)
[/app/taskdump(56)]   0x7012eec4: 0x0804b0c2 (_start+21)
[/app/taskdump(56)] Address space areas:
[/app/taskdump(56)]  [1] flags: R-XC base: 0x08048000 size: 16384
[/app/taskdump(56)]  [2] flags: RW-C base: 0x0804c000 size: 16384
[/app/taskdump(56)]  [3] flags: R-XC base: 0x70000000 size: 81920
[/app/taskdump(56)]  [4] flags: RW-C base: 0x70014000 size: 69632
[/app/taskdump(56)]  [5] flags: RW-C base: 0x70025000 size: 4096
[/app/taskdump(56)]  [6] flags: RW-C base: 0x70026000 size: 4096
[/app/taskdump(56)]  [7] flags: RW-C base: 0x70028000 size: 4096
[/app/taskdump(56)]  [8] flags: RW-C base: 0x7002a000 size: 4096
[/app/taskdump(56)]  [9] flags: R--C base: 0x7002c000 size: 4096
[/app/taskdump(56)]  [10] flags: RW-C base: 0x7002d000 size: 4096
[/app/taskdump(56)]  [11] flags: RW-C base: 0x7002f000 size: 1048576
[/app/taskdump(56)]  [12] flags: RW-C base: 0x70130000 size: 258048
[/app/taskdump(56)]  [13] flags: RW-C base: 0x7016f000 size: 73728
[/app/taskdump(56)]  [14] flags: RW-C base: 0x7ff00000 size: 1048576
[/app/taskdump(56)] Failed dumping fibrils.

comment:3 by Jiri Svoboda, 8 months ago

I extracted core file from /srv/input and started analyzing it:

core55

(gdb) print /x *0x70130e48
$4 = 0x70146e0c
(gdb) print /x *0x70130e68
$5 = 0x70130e98
(gdb) print /x *0x70130e98
$6 = 0x70130eb8
(gdb) print /x *0x70130eb8
$7 = 0x70130ec4
(gdb) print /x *0x70130ec4
$8 = 0x0

Stack trace constructed by hand:
0x70130e48 0x0
0x70130e68 0x7014ef70
0x70130e98 0x70146e0c
0x70130eb8 0x804b161
0x70130ec4 0x804b0d2

(gdb) backtrace
#0  0x00000000 in ?? ()
#1  0x701593d8 in ?? ()
#2  0x7014ef70 in ?? ()
#3  0x70146e0c in ?? ()
#4  0x0804b161 in __c_start ()
#5  0x0804b0d2 in _start ()

(gdb) disas 0x0804b0d2
Dump of assembler code for function _start:
   0x0804b0bd <+0>:	mov    %ss,%ax
   0x0804b0c0 <+3>:	mov    %eax,%ds
   0x0804b0c2 <+5>:	mov    %eax,%es
   0x0804b0c4 <+7>:	mov    %eax,%fs
   0x0804b0c6 <+9>:	push   $0x0
   0x0804b0c8 <+11>:	push   $0x0
   0x0804b0ca <+13>:	mov    %esp,%ebp
   0x0804b0cc <+15>:	push   %edi
   0x0804b0cd <+16>:	call   0x804b0e0 <__c_start>
   0x0804b0d2 <+21>:	xchg   %ax,%ax
   0x0804b0d4 <+23>:	xchg   %ax,%ax
   0x0804b0d6 <+25>:	xchg   %ax,%ax
   0x0804b0d8 <+27>:	xchg   %ax,%ax
   0x0804b0da <+29>:	xchg   %ax,%ax
   0x0804b0dc <+31>:	xchg   %ax,%ax
   0x0804b0de <+33>:	xchg   %ax,%ax
End of assembler dump.
(gdb) 

(gdb) disas 0x804b0e0
Dump of assembler code for function __c_start:
   0x0804b0e0 <+0>:	push   %ebp
   0x0804b0e1 <+1>:	mov    %esp,%ebp
   0x0804b0e3 <+3>:	sub    $0x14,%esp
   0x0804b0e6 <+6>:	movl   $0x8048dc0,0x804fc80
   0x0804b0f0 <+16>:	movl   $0x8048000,0x804fc84
   0x0804b0fa <+26>:	movl   $0x804fd0c,0x804fc88
   0x0804b104 <+36>:	movl   $0x804c7ec,0x804fc8c
   0x0804b10e <+46>:	mov    $0x804c7ec,%eax
   0x0804b113 <+51>:	sub    $0x804c7ec,%eax
   0x0804b118 <+56>:	sar    $0x2,%eax
   0x0804b11b <+59>:	mov    %eax,0x804fc90
   0x0804b120 <+64>:	movl   $0x804c7ec,0x804fc94
   0x0804b12a <+74>:	mov    $0x804c7ec,%eax
   0x0804b12f <+79>:	sub    $0x804c7ec,%eax
   0x0804b134 <+84>:	sar    $0x2,%eax
   0x0804b137 <+87>:	mov    %eax,0x804fc98
   0x0804b13c <+92>:	movl   $0x804c7ec,0x804fc9c
   0x0804b146 <+102>:	mov    $0x804c7ec,%eax
   0x0804b14b <+107>:	sub    $0x804c7ec,%eax
   0x0804b150 <+112>:	sar    $0x2,%eax
--Type <RET> for more, q to quit, c to continue without paging--
   0x0804b153 <+115>:	mov    %eax,0x804fca0
   0x0804b158 <+120>:	mov    0x8(%ebp),%eax
   0x0804b15b <+123>:	push   %eax
   0x0804b15c <+124>:	call   0x8048d20 <__libc_main@plt>
End of assembler dump.
(gdb) 

(gdb) disas 0x8048d20
Dump of assembler code for function __libc_main@plt:
   0x08048d20 <+0>:	jmp    *0x804c940
   0x08048d26 <+6>:	push   $0x140
   0x08048d2b <+11>:	jmp    0x8048a90
End of assembler dump.

gdb) print /x *0x804c940
$27 = 0x70146df0

Where is this?

comment:4 by Jiri Svoboda, 8 months ago

If we objdump srv/loader, we can see its segments:

    LOAD off    0x00000000 vaddr 0x70000000 paddr 0x70000000 align 2**12
         filesz 0x00013460 memsz 0x00013460 flags r-x
    LOAD off    0x00014000 vaddr 0x70014000 paddr 0x70014000 align 2**12
         filesz 0x0000023c memsz 0x00010f28 flags rw-
     TLS off    0x00014000 vaddr 0x70014000 paddr 0x70014000 align 2**2
         filesz 0x00000000 memsz 0x00000010 flags r--


[loader(55)] libc_main: main=0x70000120
[loader(55)] Load segment v_addr=0x8048000 at addr 0x08048000, size 0x37ec, flags r-x
[loader(55)] Load segment v_addr=0x804c7ec at addr 0x0804c7ec, size 0x3520, flags rw-
[loader(55)] Parse program .dynamic section at 0x0804c7ec
[loader(55)] prog tdata at 0x00000000 size 0, tbss size 0
[loader(55)] module_load_deps('[program]')
[loader(55)] [program] needs libc.so.0
[loader(55)] Load segment v_addr=0x0 at addr 0x70132000, size 0x3e3b4, flags r-x
[loader(55)] Load segment v_addr=0x3f3b4 at addr 0x701713b4, size 0x118f4, flags rw-
[loader(55)] loaded '/lib/libc.so.0' at 0x70132000
[loader(55)] tdata at 0x701713b4 size 0, tbss size 80
[loader(55)] module_load_deps('libc.so.0')
Task /srv/hid/input (55) killed due to an exception at program counter 0x00000000.

0x70130e48 0x0
0x70130e68 0x7014ef70 in __kio_init
0x70130e98 0x70146e0c in __libc_main
0x70130eb8 0x804b161 in __c_start
0x70130ec4 0x804b0d2 in _start
0001cf50 <__kio_init>:
   1cf50:       55                      push   %ebp
   1cf51:       89 e5                   mov    %esp,%ebp
   1cf53:       53                      push   %ebx
   1cf54:       83 ec 0c                sub    $0xc,%esp
   1cf57:       e8 43 e7 fe ff          call   b69f <__x86.get_pc_thunk.bx>
   1cf5c:       81 c3 e0 2a 02 00       add    $0x22ae0,%ebx
   1cf62:       6a 01                   push   $0x1
   1cf64:       8d 83 e4 0e 00 00       lea    0xee4(%ebx),%eax
   1cf6a:       50                      push   %eax
   1cf6b:       e8 d0 d0 fe ff          call   a040 <futex_initialize@plt>
   1cf70:       83 c4 10                add    $0x10,%esp
   1cf73:       85 c0                   test   %eax,%eax
   1cf75:       75 07                   jne    1cf7e <__kio_init+0x2e>
   1cf77:       8b 5d fc                mov    -0x4(%ebp),%ebx
   1cf7a:       89 ec                   mov    %ebp,%esp
   1cf7c:       5d                      pop    %ebp
   1cf7d:       c3                      ret
   1cf7e:       e8 dd d3 fe ff          call   a360 <abort@plt>
   1cf83:       66 90                   xchg   %ax,%ax
   1cf85:       66 90                   xchg   %ax,%ax
   1cf87:       66 90                   xchg   %ax,%ax
   1cf89:       66 90                   xchg   %ax,%ax
   1cf8b:       66 90                   xchg   %ax,%ax
   1cf8d:       66 90                   xchg   %ax,%ax
   1cf8f:       90                      nop

then it calls futex_initialize@plt

comment:5 by Jiri Svoboda, 8 months ago

0001cf50 <__kio_init>:
   1cf50:       55                      push   %ebp
   1cf51:       89 e5                   mov    %esp,%ebp
   1cf53:       53                      push   %ebx
   1cf54:       83 ec 0c                sub    $0xc,%esp
   1cf57:       e8 43 e7 fe ff          call   b69f <__x86.get_pc_thunk.bx>  -- ebx := 1cf5c+0x70132000  == 0x7014ef5c
   1cf5c:       81 c3 e0 2a 02 00       add    $0x22ae0,%ebx ... ebx := 0x7014ef5c + 0x22ae0 == 0x70171a3c
   1cf62:       6a 01                   push   $0x1
   1cf64:       8d 83 e4 0e 00 00       lea    0xee4(%ebx),%eax
   1cf6a:       50                      push   %eax
   1cf6b:       e8 d0 d0 fe ff          call   a040  <futex_initialize@plt>

0000a040 <futex_initialize@plt>:
    a040:       ff a3 d0 01 00 00       jmp    *0x1d0(%ebx)
    a046:       68 88 03 00 00          push   $0x388
    a04b:       e9 d0 f8 ff ff          jmp    9920 <failed_asserts+0x98d4>

(gdb) print /x *(0x7014ef5c+0x22ae0+0x1d0)
$16 = 0x701593a0

(gdb) print /x 0x701593a0 - 0x70132000
$17 = 0x273a0

000273a0 <futex_initialize>:
   273a0:       55                      push   %ebp
   273a1:       89 e5                   mov    %esp,%ebp
   273a3:       83 ec 0c                sub    $0xc,%esp
   273a6:       e8 8d 46 fe ff          call   ba38 <__x86.get_pc_thunk.dx>
   273ab:       81 c2 91 86 01 00       add    $0x18691,%edx
   273b1:       8b 45 08                mov    0x8(%ebp),%eax
   273b4:       8b 4d 0c                mov    0xc(%ebp),%ecx

So we can see that we should have loaded the correct address of futex_initialize from the PLT

comment:6 by Jiri Svoboda, 8 months ago

(gdb) info registers
eax            0x70172924          1880566052
ecx            0x1                 1
edx            0x804fc60           134544480
ebx            0x70171a3c          1880562236 .... contains correct address of GOT entry for futex_initialize
esp            0x70130e1c          0x70130e1c
ebp            0x70130e48          0x70130e48
esi            0x70014ac0          1879132864
edi            0x70014ac0          1879132864
eip            0x0                 0x0           ... WHAT IS GOING ON ???
eflags         0x210206            [ PF IF RF ID ]
cs             0x1b                27
ss             0x23                35
ds             0x23                35
es             0x23                35
fs             0x23                35
gs             0x30                48
(gdb) 

comment:7 by Jiri Svoboda, 8 months ago

The problem is equally reproducible with Qemu and KVM disbled, and in VirtualBox.

comment:8 by Jiri Svoboda, 8 months ago

Stack trace might not be complete. Let's look at the top of the stack for a return address:

(gdb) print /x 0x701593d8 - 0x70132000
$2 = 0x273d8

000273a0 <futex_initialize>:

273a0: 55 push %ebp
273a1: 89 e5 mov %esp,%ebp
273a3: 83 ec 0c sub $0xc,%esp
273a6: e8 8d 46 fe ff call ba38 <x86.get_pc_thunk.dx>
273ab: 81 c2 91 86 01 00 add $0x18691,%edx
273b1: 8b 45 08 mov 0x8(%ebp),%eax
273b4: 8b 4d 0c mov 0xc(%ebp),%ecx
273b7: 89 08 mov %ecx,(%eax)
273b9: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
273c0: 8b 92 e0 ff ff ff mov -0x20(%edx),%edx
273c6: 6a 0b push $0xb
273c8: 6a 00 push $0x0
273ca: 6a 00 push $0x0
273cc: 6a 00 push $0x0
273ce: 6a 00 push $0x0
273d0: 6a 00 push $0x0
273d2: 83 c0 04 add $0x4,%eax
273d5: 50 push %eax
273d6: ff 12 call *(%edx)
273d8: 89 ec mov %ebp,%esp ←-return here

comment:9 by Jiri Svoboda, 8 months ago

Note that we ran call *(%edx).

edx contains 0x804fc60
which is

$ objdump -D input
....
0804fc60 <__syscall_fast_func>:
        ...

and:

(gdb) print /x *0x804fc60
$16 = 0x0
(gdb) 

So syscall_fast_func is NULL which seems to be the reason for the core dump.

comment:10 by Jakub Jermář, 8 months ago

I think it's because of how we define syscalls for 486 in uspace/lib/c/arch/ia32/include/libarch/syscall.h:

#define __syscall0  __syscall_fast_func                                          
#define __syscall1  __syscall_fast_func                                          
#define __syscall2  __syscall_fast_func                                          
#define __syscall3  __syscall_fast_func                                          
#define __syscall4  __syscall_fast_func                                          
#define __syscall5  __syscall_slow                                               
#define __syscall6  __syscall_slow

On 486 we can only use syscall_slow.

comment:11 by Jiri Svoboda, 8 months ago

syscall_fast_func is a global variable / function pointer that is initialized to syscall_slow. On processors > 486 we then detect if we can use SYSENTER/SYSEXIT and if so, we set syscall_fast_func to syscall_fast.

Note however that 0x804fc60 is inside the main program. While libc is at 0x70130000.

[jirka@omelette dist]$ objdump -T lib/libc.so.0 | grep syscall_fast
0003fd60 g    DO .data	00000004 __syscall_fast_func
0000b577 g    DF .text	00000034 __syscall_fast
[jirka@omelette dist]$ objdump -T srv/hid/input | grep syscall_fast
0804fc60 g    DO .bss	00000004 __syscall_fast_func

The problem is that __syscall_fast_func is defined both in libc.so.0 and input.

In the binary/library the original value of the variable is zero and it needs to be fixed by the dynamic linker using R_386_32 relocation. It looks like we access the main executable's version of the variable, which is not correctly relocated.

The solution is to make sure we have just one instance of the variable.

comment:12 by Jiri Svoboda, 8 months ago

To make things even more confusing, not all binaries contain a definition of __syscall_fast_func, only some!

Specificaly, only these three binaries:

/app/bdsh
/srv/hid/console
/srv/hid/input

comment:13 by Jiri Svoboda, 8 months ago

It seems that what these have in common is that they depend on the console library. IN case of /app/bdsh, it depends on libclui, which itself depends on libconsole.

comment:14 by Jiri Svoboda, 8 months ago

Here's the culprit in libconsole's console.c:

bool console_kcon(void)
{
        return __SYSCALL0(SYS_DEBUG_CONSOLE);
}

Removing this line removes the definition of __syscall_fast_func variable from /srv/hid/input. So it was the move of this function outside libc that caused the problem. Now why does this create a definition instead of a declaration? Need to look at the preprocessed version of this file.

comment:15 by Jiri Svoboda, 8 months ago

OK so this is even more complicated. There is nothing wrong with this code. It simply accesses an extern variable that happens to be defined in a shared library. However, variables and shared libraries work differently than one might think.

For functions it is straightforward. If we call a function foo that is defined in the main program, we will call foo. If foo is defined in a shared library, we will call foo@plt. So far so good.

If there is a (non-static) variable "v" defined in a shared library, the library code will always access it via its address in GOT, for which there is a relocation entry (e.g R_386_GLOB_DAT) with "v" symbol name. The dynamic linker will patch the GOT entry with the address of "v" in the shared library. This is how it works if the main program never refers to the variable "v".

If the main program contains a reference to "v", something surprising will happen. The main program will *also* contain a definition of "v" in bss. The main program will refer to the variable directly just as to any other variable defined in the main program.

Now the dynamic linker, when relocating the library's GOT, will find the definition of "v" in the main program first (due to the symbol lookup rules) and will patch it in. So now both the main program and the library will use the defition of "v" in the main program and the instance in the library is unused.

What if "v" is initialized? The shared library will contain a COPY relocation that copies the value of library's "v" to the main program's "v".

Now we are getting to the meat of things.

comment:16 by Jiri Svoboda, 8 months ago

__syscall_fast_func is initialized, but not with a constant. It's initialized with the value of the symbol __syscall_slow. This means there must be a relocation entry that initializes __syscall_fast_func.

So first we must relocate libc's __syscall_fast_func to contain the address of __syscall_slow.
Then we must COPY value of libc's __syscall_fast_func to the main program's __syscall_fast_func.

But we can see from the debug output that in our case the relocations were performed in the opposite order:

[loader(55)] name '__syscall_fast_func', value 0x804f840, size 0x4
[loader(55)] rel_type=5 r_ptr=0x0804f840
[loader(55)] dest name: '[program]'
[loader(55)] dest bias: 0x0
[loader(55)] symbol definition found, addr=0x804f840, value=804f840, size=4, bind=1, type=1
[loader(55)] copy to 0x0804f840 from 70160de0 size 4 bytes
[loader(55)] byte[0]=0x0
[loader(55)] byte[1]=0x0
[loader(55)] byte[2]=0x0
[loader(55)] byte[3]=0x0
[loader(55)] name '__syscall_slow', value 0xb5cd, size 0x27
[loader(55)] rel_type=7 r_ptr=0x70160bc8
[loader(55)] dest name: 'libc.so.0'
[loader(55)] dest bias: 0x7012d000
[loader(55)] symbol definition found, addr=0x701385cd, value=b5cd, size=39, bind=1, type=2
[loader(55)] name '__syscall_fast_func', value 0x33de0, size 0x4
[loader(55)] rel_type=6 r_ptr=0x701606c4
[loader(55)] dest name: '[program]'
[loader(55)] dest bias: 0x0
[loader(55)] symbol definition found, addr=0x804f840, value=804f840, size=4, bind=1, type=1
[loader(55)] name '__syscall_slow', value 0xb5cd, size 0x27
[loader(55)] rel_type=1 r_ptr=0x70160de0
[loader(55)] dest name: 'libc.so.0'
[loader(55)] dest bias: 0x7012d000
[loader(55)] symbol definition found, addr=0x701385cd, value=b5cd, size=39, bind=1, type=2
[loader(55)] fixup R_386_32 (b+v+a) sym_addr=701385cd
[loader(55)] now *rptr=701385cd

Which means they appear in the opposite order in the relocation table.

What is not clear to me is whether the relocation table is incorrect, or whether the dynamic linker is supposed to handle this situation and figure out the correct relocation order.

comment:17 by Jiri Svoboda, 8 months ago

I am not entirely sure whether I want to pursue this or just work around it. I think I would need to ask in the linker community as I didn't find any information in this regard in the System V ABI (generic or 386 supplement).

comment:18 by Jiri Svoboda, 8 months ago

Alright, as was pointed out to me the COPY relocation is in the main executable, not the shared library! So the two relocations are in different relocation tables. All we need is to make sure we process the main program's relocations LAST (i.e. after the shared libraries).

comment:19 by Jiri Svoboda, 8 months ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.