Opened 7 years ago
Closed 7 years ago
#700 closed defect (fixed)
Deadlock between devman and locsrv
Reported by: | Ondra Hlavatý | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | helenos/srv/locsrv | Version: | |
Keywords: | deadlock ipc | Cc: | |
Blocker for: | Depends on: | ||
See also: |
Description
In yet undiscovered conditions, a deadlock while booting occurs. The kernel is still alive, but devman is stuck and the compositor won't start. It does happen quite frequently on "bad" builds, but "bad" build happens very sparsely. It seems to be a result of a race condition. Also, level of optimization is related - it happens more on optimized builds, never happened with -O0.
In the attached image, you can see the waiting IPC calls. We already discussed on this on IRC and mailing list, but then it happened once again, so here you have the bug report.
There is a cycle, which I haven't noticed at first. It is probably the cause of the deadlock, but I'm not able to identify the root cause myself.
Of course, there is a possibility that our xhci driver is causing it. But in every case, faulty driver should not block devman.
Attachments (4)
Change History (10)
by , 7 years ago
Attachment: | helenos-deadlock.svg added |
---|
comment:1 by , 7 years ago
Component: | helenos/unspecified → helenos/srv/loc |
---|---|
Owner: | set to |
Priority: | minor → major |
Status: | new → assigned |
With some effort, I was able to reproduce this issue. I get a similar picture of the waiting processes, see the attached log from kconsle for details.
by , 7 years ago
kconsole log with the output of the ipc command for the essential tasks
comment:2 by , 7 years ago
From the ipc log, I have reconstructed the wait graph.
devman
sends DRIVER_DEV_ADD(1024) tops2mouse
ps2mouse
sends DEVMAN_ADD_FUNCTION(1025) todevman
devman
grabs device_tree.rwlock in fun_online()devman
enters loc_register_tree_function() and loc_service_register(); sends LOC_SERVICE_REGISTER(1027) tolocsrv
followed by IPC_M_DATA_WRITE(7); the connection tolocsrv
is "serialized" and so it uses only a single phonelocsrv
does not reply to the LOC_SERVICE_REGISTER request because the fibril which is handling connections fromdevman
is itself blockedlocsrv
itself has 4 active LOC_EVENT_CAT_CHANGE(1024) calls toethip
; these had all been sent one after another by drivers that already added themselves into a category, but we don't see them now because event change notifications happen after enlisting in a category is completed (ie. after the request is answered).ethip
is itself not answering any of the four LOC_EVENT_CAT_CHANGE calls fromlocsrv
and blocks on NIC_SET_STATE(NIC_STATE_ACTIVE) toe1k
e1k
sends HW_RES_ENABLE_INTERRUPT for IRQ 11 topciintel
pciintel
sends IPC_CONNECT_ME_TO(2) tons
which in turn forwards it todevman
devman
tries to handle the call in devman_connection_device(), but gets blocked on the device_tree.rwlock which was locked in step 3 by fun_online()
comment:3 by , 7 years ago
So it looks like a driver registered itself in a category at locsrv
. This resulted in a "category changed event" notification which transitively depended on devman
and its ability to grab the device_tree.rwlock, but in the meantime devman
started another request which grabbed device_tree.rwlock and was waiting for the locsrv
connection fibril to handle its LOC_SERVICE_REGISTER request. locsrv
was thus waiting for devman
and devman
was waiting for locsrv
.
comment:4 by , 7 years ago
Summary: | Randomly occuring IPC deadlock on boot → Deadlock between devman and locsrv |
---|
comment:5 by , 7 years ago
I suspect this became reproducible when the interrupt controller drivers were DDFified. Before that pciintel
would not go through devman
when enabling the interrupt.
In order to fix this, three things are needed:
locsrv
needs to be able to process multiple exchanges fromdevman
in parallel, so the devman-to-locsrv exchange type will need to be upgraded to parallel from the current serial;devman
needs to be careful not to end exchanges withlocsrv
before it receives the answer, otherwise another request might reuse the old phone instead of using a new one, which would again lead to the present deadlock;- the
locsrv
connection fibril can get blocked even after answering the call fromdevman
in loc_category_change_event() if the limit of maximum async calls per phone (currently 4, see IPC_MAX_ASYNC_CALLS) was reached; we basically need to make sure that when the call is answered the fibril will not block otherwise we will still be getting this deadlock.
As for 3., we might either change the code to return error instead of blocking the fibril if there are more than IPC_MAX_ASYNC_CALLS per phone or try to answer LOC_SERVICE_ADD_TO_CAT only after loc_category_change_event() returns.
comment:6 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Fixed in mainline,2873, mainline,2874, mainline,2875, mainline,2876 and mainline,2877.
IPC state in the deadlock