Opened 14 years ago

Closed 14 years ago

#260 closed defect (fixed)

Booting process sometimes gets stuck while starting shells on VCs

Reported by: Jiri Svoboda
Owned by: Jiri Svoboda
Priority: major
Milestone: 0.4.3
Component: helenos/fs/fat
Version: mainline
Keywords:
Cc: jakub@…
Blocker for:
Depends on:
See also:

Description

Booting sometimes gets stuck at the point where the first four VCs show the getterm banner. No further banners are printed (no other VCs are active) and the command line is not reached on any VC. Keyboard and mouse input on the console still work, and it is possible to enter the kernel console.

Reproduced on revision: mainline,644
Config: defaults/ia32
Qemu version: 0.10.3
Qemu command line: qemu -m 32 -cdrom image.iso -boot d
Reproducibility: non-deterministic, in about 50% of attempts

Change History (11)

comment:1 by Jiri Svoboda, 14 years ago

Owner: set to Jiri Svoboda
Status: new → accepted

comment:2 by Jiri Svoboda, 14 years ago

Running 'tasks' in kcon shows four tasks with the name 'getterm' and five tasks with the name 'loader'. On a non-debug build we can see 7x getterm and 7x loader.

comment:3 by Martin Decky, 14 years ago

Just a guess: What about available memory? The non-determinism could simply be caused by race conditions when allocating memory (mapping and unmapping of address space areas). The second question would then be why the tasks don't all get unblocked eventually.

comment:4 by Jiri Svoboda, 14 years ago

That was my first guess as well, but no, it's not an OOM. Increasing memory does not help.

A tiny bit of further investigation: The loader tasks are waiting for VFS (1x vfs_in_read, 4x vfs_in_open), VFS is waiting for FAT (1x vfs_out_read, 4x vfs_out_lookup). FAT is not waiting for any other server.

With tmpfs root filesystem, the problem does not occur.

comment:5 by Jakub Jermář, 14 years ago

Cc: jakub@… added

comment:6 by Jakub Jermář, 14 years ago

Component: unspecified → fs/fat

I have quickly prototyped a deadlock detection mechanism for fibril synchronization primitives (only mutexes as of now). It can be found in the lp:~jakub/helenos/deadlock-detection branch, where it will soak until it is ready for mainline.

Nevertheless, the detection mechanism is useful even now as it detected the following deadlock between two fibrils in fat:

fibril A:
fibril_mutex_lock()
fat_idx_get_by_pos()
fat_match()
libfs_lookup()
fat_lookup()

fibril B:
fibril_mutex_lock()
fat_idx_get_by_index()
fat_root_get()
libfs_lookup()
fat_lookup()
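
For readers unfamiliar with this kind of check, here is a minimal sketch of how a deadlock between two fibrils blocked in fibril_mutex_lock() could be detected by walking a wait-for chain. It is not the code from the deadlock-detection branch: the struct layouts, the names would_deadlock() and demo_two_fibril_deadlock(), the MAX_CHAIN bound and the lock names lock_x and lock_y are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not the HelenOS definitions. */
struct mutex;

struct fibril {
    struct mutex *waits_for;  /* mutex this fibril is blocked on, or NULL */
};

struct mutex {
    struct fibril *owner;     /* fibril currently holding the mutex, or NULL */
};

/* Arbitrary bound so the walk terminates even if a cycle already exists. */
#define MAX_CHAIN 64

/*
 * Return true if letting `self` block on `wanted` would close a cycle:
 * wanted -> its owner -> the mutex that owner waits for -> ... -> self.
 */
bool would_deadlock(const struct fibril *self, const struct mutex *wanted)
{
    assert(self != NULL && wanted != NULL);

    const struct mutex *m = wanted;
    for (size_t step = 0; step < MAX_CHAIN && m != NULL && m->owner != NULL; step++) {
        if (m->owner == self)
            return true;
        m = m->owner->waits_for;
    }
    return false;
}

/* Two fibrils, two locks, each fibril holding one lock and wanting the other. */
bool demo_two_fibril_deadlock(void)
{
    struct fibril a = { NULL };
    struct fibril b = { NULL };
    struct mutex lock_x = { &a };   /* held by fibril A */
    struct mutex lock_y = { &b };   /* held by fibril B */

    /* B blocks on lock_x: no cycle yet, because A is not waiting for anything. */
    if (would_deadlock(&b, &lock_x))
        return false;
    b.waits_for = &lock_x;

    /* A now asks for lock_y: lock_y -> B -> lock_x -> A closes the cycle. */
    return would_deadlock(&a, &lock_y);
}
```

The check runs before a fibril blocks, so the cycle is reported at the moment the second fibril requests its second lock, which is when both call stacks can still be printed.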

comment:7 by Jakub Jermář, 14 years ago

Ok, I think I know what the problem is, based on the above stacks. In fat_match(), we first lock parent->idx->lock and then call fat_idx_get_by_pos(), which wants to lock used_lock. Another fibril, however, manages to lock used_lock first in fat_idx_get_by_index() and then cannot get the idx lock for parent, because the first fibril already holds it.
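
The lock-order inversion described here can be made concrete with a small sketch that compares the order in which two code paths acquire the same locks. The enum labels, the two path arrays and the helper names position_of() and has_order_inversion() are hypothetical and serve only as illustration; they do not reproduce the FAT server code or the fix in the branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Labels for the two locks named in the comment; the numbering is arbitrary. */
enum lock_id { IDX_LOCK, USED_LOCK };

/* Acquisition order on the fat_match() -> fat_idx_get_by_pos() path. */
const enum lock_id path_match[] = { IDX_LOCK, USED_LOCK };

/* Acquisition order on the fat_root_get() -> fat_idx_get_by_index() path. */
const enum lock_id path_root[] = { USED_LOCK, IDX_LOCK };

/* Position of `id` in an acquisition sequence, or -1 if it is never taken. */
static int position_of(const enum lock_id *seq, size_t n, enum lock_id id)
{
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == id)
            return (int)i;
    }
    return -1;
}

/* True if some pair of locks is taken in opposite order by the two paths. */
bool has_order_inversion(const enum lock_id *a, size_t na,
                         const enum lock_id *b, size_t nb)
{
    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < na; i++) {
        for (size_t j = i + 1; j < na; j++) {
            /* Path a takes a[i] before a[j]; does path b take them reversed? */
            int pi = position_of(b, nb, a[i]);
            int pj = position_of(b, nb, a[j]);
            if (pi >= 0 && pj >= 0 && pi > pj)
                return true;
        }
    }
    return false;
}
```

Calling has_order_inversion(path_match, 2, path_root, 2) reports the inversion. One common remedy, which is not necessarily what the fix does, is to make every path take the two locks in the same order, or to release the first lock before requesting the second.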

comment:8 by Jakub Jermář, 14 years ago

Jiri, I have just (hopefully) fixed this in lp:~jakub/helenos/fs. Can you merge from there and verify the issue is no longer reproducible?

Thanks,
Jakub

comment:9 by Jiri Svoboda, 14 years ago

Yes, I can confirm it fixes the issue :-)

comment:10 by Jiri Svoboda, 14 years ago

Nice work!

comment:11 by Jakub Jermář, 14 years ago

Resolution: fixed
Status: accepted → closed