Opened 2 years ago
Last modified 10 months ago
#856 new defect
XHCI driver does not start reliably on amd64
Reported by: | Colin Parker | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.14.2 |
Component: | helenos/drv/other | Version: | mainline |
Keywords: | xhci | Cc: | |
Blocker for: | | Depends on: | |
See also: | | | |
Description
When starting the XHCI driver, there appears to be a race condition that prevents it from starting in some cases. Steps to reproduce:
Configure HelenOS for amd64 with UEFI and SMP.
Launch QEMU with UEFI firmware (EDK2):
qemu-system-x86_64 -enable-kvm --bios OVMF.fd -drive file=hdisk.img,index=0,media=disk,format=raw -device e1000 -device qemu-xhci,id=xhci -device usb-mouse -device usb-kbd -device intel-hda -device hda-duplex -serial mon:stdio -boot d -cdrom image.iso
OVMF.fd is the firmware image produced by the EDK2 build.
The symptom is that the mouse still works (it falls back to the PS/2 mouse), but the usb-kbd is inoperable and you cannot type. The xhci driver does not register the devices, yet it does not report an error either.
The issue is sensitive to timing or sequencing and does not occur on every boot. To help trigger it, add "fibril_usleep(10000);" at line 357 of trb_ring.c in the xhci driver code (at the bottom of xhci_event_ring_init, before "fibril_mutex_initialize(&ring->guard);"). A value around 10000 usually causes the issue for me, but you might need to try 50000, 100000, etc.; if the delay is too long, the driver might start working again. Note that changing the log level to get debug output also alters the timing, which makes the issue challenging to investigate.
This problem can also be triggered on Mac Mini hardware, although in that case there is no PS/2 fallback.
Attachments (2)
Change History (5)
by , 2 years ago
Attachment: | Makefile.config added |
---|
by , 2 years ago
Attachment: | trb_ring.c added |
---|
comment:1 by , 2 years ago
Investigating this more, I have found two possible causes, both related to interrupt handling.
The first is that interrupts are enabled in the XHCI HC driver (and in all USB HC drivers) before the device is reset. Hence, if the boot process left the HC in a state where it can raise interrupts, one might fire between installing the interrupt handler and resetting the device. I have made an effort to fix that here: https://github.com/cvparker/helenos/commit/ddf2de9e7a49f7bae48542e87755244fd9d1d784
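The ordering problem described above can be illustrated with a small standalone C model. This is only a sketch under stated assumptions: mock_hc_t and every mock_* function are hypothetical stand-ins, not the HelenOS xhci driver API, and the model assumes that a reset clears any interrupt condition the firmware left behind. It does not reproduce the linked commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a host controller; not the HelenOS xhci types. */
typedef struct {
	bool irq_pending;  /* device is asserting its interrupt */
	bool irq_enabled;  /* handler installed and interrupt unmasked */
	bool rings_ready;  /* driver-side structures are initialized */
	int spurious;      /* interrupts handled before the driver was ready */
} mock_hc_t;

/* Handler: counts any interrupt that arrives before the rings exist. */
static void mock_irq_handler(mock_hc_t *hc)
{
	assert(hc != NULL);
	if (!hc->rings_ready)
		hc->spurious++;
	hc->irq_pending = false;
}

/* Enabling delivery immediately delivers an already pending interrupt. */
static void mock_enable_irq(mock_hc_t *hc)
{
	hc->irq_enabled = true;
	if (hc->irq_pending)
		mock_irq_handler(hc);
}

/* Assumption of this model: reset clears state left over from firmware. */
static void mock_reset(mock_hc_t *hc)
{
	hc->irq_pending = false;
}

static void mock_init_rings(mock_hc_t *hc)
{
	hc->rings_ready = true;
}

/* Problematic order: interrupts are enabled before the device is reset. */
int start_irq_first(void)
{
	mock_hc_t hc = { .irq_pending = true };  /* firmware left it pending */
	mock_enable_irq(&hc);
	mock_reset(&hc);
	mock_init_rings(&hc);
	return hc.spurious;
}

/* Safer order: reset and initialize first, enable interrupts last. */
int start_reset_first(void)
{
	mock_hc_t hc = { .irq_pending = true };
	mock_reset(&hc);
	mock_init_rings(&hc);
	mock_enable_irq(&hc);
	return hc.spurious;
}
```

In this model, start_irq_first() lets the handler run once against uninitialized state, while start_reset_first() never does. Whether a real controller behaves the same way depends on what the firmware left enabled.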
The second concerns reliability of the interrupt delivery when two devices share the same IRQ. I'm not an expert on this, but my understanding is that with edge-triggered interrupts there's no guarantee that the irq handler will be entered once for every time a device raises the interrupt line. I see 3 ways to make the system resilient if the boot process assigns shared IRQs:
1) Make each driver resilient against missed interrupts. For example, a periodic timer could be set to poll the device at a reasonably slow interval (like 100 ms), so that the outcome of missed interrupts is a small delay rather than permanent lockup.
2) Improve the interrupt reliability, either by better using the existing framework (e.g. reassigning interrupts to avoid conflicts) or switching to something like MSI-X.
3) Have the kernel call every registered irq handler that claims an irq, rather than descending through a list and calling the first one. I have made an effort in that direction here: https://github.com/cvparker/helenos/commit/ea156b8e4fee7914337739feb12eb55820f7f37f
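The difference between option 3 and first-match dispatch can be sketched in standalone C. Everything here is an assumption made for illustration: mock_irq_t, dispatch_first and dispatch_all are hypothetical names, not the HelenOS kernel IRQ interface, and the sketch does not reproduce the linked commit. It models two devices sharing one IRQ that both raise it before the handler is entered once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registration record; not the HelenOS kernel irq_t. */
typedef struct mock_irq {
	int inr;                            /* interrupt number */
	bool (*claim)(struct mock_irq *);   /* does this device need service? */
	void (*handler)(struct mock_irq *); /* services the device */
	bool pending;                       /* device still needs service */
} mock_irq_t;

static bool mock_claim(mock_irq_t *irq)
{
	return irq->pending;
}

static void mock_handler(mock_irq_t *irq)
{
	irq->pending = false;
}

/* First-match dispatch: stop at the first handler that claims the IRQ. */
size_t dispatch_first(mock_irq_t *table, size_t n, int inr)
{
	assert(table != NULL);
	for (size_t i = 0; i < n; i++) {
		if (table[i].inr == inr && table[i].claim(&table[i])) {
			table[i].handler(&table[i]);
			return 1;
		}
	}
	return 0;
}

/* Option 3: call every registered handler that claims the IRQ. */
size_t dispatch_all(mock_irq_t *table, size_t n, int inr)
{
	assert(table != NULL);
	size_t called = 0;
	for (size_t i = 0; i < n; i++) {
		if (table[i].inr == inr && table[i].claim(&table[i])) {
			table[i].handler(&table[i]);
			called++;
		}
	}
	return called;
}

/*
 * Two devices on IRQ 11 both need service, but only one interrupt entry
 * occurs. Returns how many devices are left unserviced afterwards.
 */
int leftover_after(size_t (*dispatch)(mock_irq_t *, size_t, int))
{
	mock_irq_t table[2] = {
		{ .inr = 11, .claim = mock_claim, .handler = mock_handler, .pending = true },
		{ .inr = 11, .claim = mock_claim, .handler = mock_handler, .pending = true },
	};
	dispatch(table, 2, 11);

	int left = 0;
	for (size_t i = 0; i < 2; i++) {
		if (table[i].pending)
			left++;
	}
	return left;
}
```

With a single interrupt entry, leftover_after(dispatch_first) leaves one device unserviced, which matches the permanent lockup described above if no further edge arrives. leftover_after(dispatch_all) leaves none, at the cost of running every claim check on each interrupt.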
Combined, these two commits do get the system fully functional on both QEMU (using UEFI/EDK2) and on my Mac Mini hardware.
comment:3 by , 10 months ago
Milestone: | 0.14.1 → 0.14.2 |
---|
uspace/drv/bus/usb/trb_ring.c showing timing modifications to trigger the issue