Opened 7 years ago

Closed 7 years ago

#700 closed defect (fixed)

Deadlock between devman and locsrv

Reported by: Ondra Hlavatý
Owned by: Jakub Jermář
Priority: major
Milestone:
Component: helenos/srv/locsrv
Version:
Keywords: deadlock ipc
Cc:
Blocker for:
Depends on:
See also:

Description

Under conditions not yet identified, a deadlock occurs while booting. The kernel is still alive, but devman is stuck and the compositor won't start. It happens quite frequently on "bad" builds, but "bad" builds occur only rarely; it seems to be the result of a race condition. The level of optimization also matters: it happens more often on optimized builds and has never happened with -O0.

In the attached image, you can see the waiting IPC calls. We already discussed this on IRC and on the mailing list, but it has happened once again, so here is the bug report.

There is a cycle which I didn't notice at first. It is probably the cause of the deadlock, but I'm not able to identify the root cause myself.

Of course, there is a possibility that our xhci driver is causing it. But in any case, a faulty driver should not be able to block devman.

Attachments (4)

helenos-deadlock.svg (12.5 KB ) - added by Ondra Hlavatý 7 years ago.
IPC state in the deadlock
ipc.log (10.6 KB ) - added by Jakub Jermář 7 years ago.
kconsole log with the output of the ipc command for the essential tasks
tasks.log (3.0 KB ) - added by Jakub Jermář 7 years ago.
List of all running tasks
console.log (11.8 KB ) - added by Jakub Jermář 7 years ago.
Console output up to the hang


Change History (10)

by Ondra Hlavatý, 7 years ago

Attachment: helenos-deadlock.svg added

IPC state in the deadlock

comment:1 by Jakub Jermář, 7 years ago

Component: helenos/unspecified → helenos/srv/loc
Owner: set to Jakub Jermář
Priority: minor → major
Status: new → assigned

With some effort, I was able to reproduce this issue. I get a similar picture of the waiting processes; see the attached log from kconsole for details.

by Jakub Jermář, 7 years ago

Attachment: ipc.log added

kconsole log with the output of the ipc command for the essential tasks

by Jakub Jermář, 7 years ago

Attachment: tasks.log added

List of all running tasks

by Jakub Jermář, 7 years ago

Attachment: console.log added

Console output up to the hang

comment:2 by Jakub Jermář, 7 years ago

From the ipc log, I have partially reconstructed the wait graph of the system.

  1. devman sends DRIVER_DEV_ADD(1024) to ps2mouse
  2. ps2mouse sends DEVMAN_ADD_FUNCTION(1025) to devman
  3. devman grabs device_tree.rwlock in fun_online()
  4. devman enters loc_register_tree_function() and loc_service_register(); sends LOC_SERVICE_REGISTER(1027) to locsrv followed by IPC_M_DATA_WRITE(7)
  5. locsrv for some reason does not reply to the LOC_SERVICE_REGISTER request; since there is no IPC on which LOC_SERVICE_REGISTER depends directly, this could mean that the request in locsrv is blocking on a synchronization primitive
  6. locsrv itself has 4 (!) active calls (method=1024) to ethip; had it been a single request, I would say this is LOC_EVENT_CAT_CHANGE(1024), but in this case I am not sure.
  7. ethip is itself not answering any of the four 1024 calls from locsrv and sends NIC_SET_STATE(NIC_STATE_ACTIVE) to e1k; it blocks in ethip_nic_check_new()→ethip_nic_open()→nic_set_state() and holds ethip_discovery_lock to prevent other invocations of ethip_nic_check_new()
  8. e1k sends HW_RES_ENABLE_INTERRUPT for IRQ 11 to pciintel
  9. pciintel sends IPC_CONNECT_ME_TO(2) to ns which in turn forwards it to devman
  10. devman tries to handle the call in devman_connection_device(), but gets blocked on the device_tree.rwlock, which was locked in step 3 by fun_online(); this closes the cycle (see the sketch below)
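
To make the cycle easier to see in isolation, below is a minimal pthread model of the devman side of the loop (steps 3, 4 and 10). This is plain C rather than actual HelenOS code: the names fun_online, devman_connection_device and device_tree_rwlock are borrowed only as labels, and the whole locsrv → ethip → e1k → pciintel → ns leg of the cycle is collapsed into a single condition variable that is never signalled. Running it hangs by design.

    #include <pthread.h>
    #include <stdio.h>

    /* The writer lock taken in step 3. */
    static pthread_rwlock_t device_tree_rwlock = PTHREAD_RWLOCK_INITIALIZER;

    /* Stand-in for the LOC_SERVICE_REGISTER answer from locsrv; in the real
     * system it can only arrive after the pciintel -> ns -> devman connection
     * has been served (steps 5-9). */
    static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t loc_reply = PTHREAD_COND_INITIALIZER;
    static int loc_replied = 0;

    /* Steps 3-4: register the function while holding the tree lock. */
    static void *fun_online(void *arg)
    {
        (void) arg;
        pthread_rwlock_wrlock(&device_tree_rwlock);
        printf("fun_online: tree locked, waiting for locsrv's answer\n");

        pthread_mutex_lock(&reply_lock);
        while (!loc_replied)            /* models waiting for the IPC answer */
            pthread_cond_wait(&loc_reply, &reply_lock);
        pthread_mutex_unlock(&reply_lock);

        pthread_rwlock_unlock(&device_tree_rwlock);
        return NULL;
    }

    /* Step 10: the incoming connection that the answer transitively depends on. */
    static void *devman_connection_device(void *arg)
    {
        (void) arg;
        pthread_rwlock_rdlock(&device_tree_rwlock);  /* blocks behind the writer */
        /* Only here could the chain unwind and loc_replied be set. */
        pthread_rwlock_unlock(&device_tree_rwlock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, fun_online, NULL);
        pthread_create(&t2, NULL, devman_connection_device, NULL);
        pthread_join(t1, NULL);          /* never returns: the cycle is closed */
        pthread_join(t2, NULL);
        return 0;
    }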

Some notes about the things I don't know; what follows is speculation:

If the four messages sent by locsrv to ethip in step 6 are indeed LOC_EVENT_CAT_CHANGE, then they are sent from loc_category_change_event(), which takes callback_sess_mutex. Currently, there can be at most IPC_MAX_ASYNC_CALLS = 4 active calls per phone. This means that if there are more than 4 callback sessions in the callback_sess_list all going to a single destination, the whole fibril will be blocked inside the callback_sess_mutex-protected critical section. This would correspond to the blocked nconfsrv, which is waiting for the LOC_CALLBACK_CREATE reply from locsrv; that message cannot be answered at this point because callback_sess_mutex is held. I don't, however, see how ethip could have over 4 callbacks registered at locsrv. A toy model of this scenario follows.
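
The following toy model (again plain pthreads, not locsrv code) captures only the speculation above: a counting semaphore with 4 slots stands in for the per-phone limit of IPC_MAX_ASYNC_CALLS in-flight calls, the answers never arrive because the destination is itself blocked (as ethip is in step 7), and callback_sess_mutex is held for the whole notification loop, as in loc_category_change_event().

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define IPC_MAX_ASYNC_CALLS 4

    static sem_t phone_slots;            /* free call slots on a single phone */
    static pthread_mutex_t callback_sess_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* One category-change notification; the slot would be released when the
     * answer arrives, but in this model the destination never answers. */
    static void send_notification(int i)
    {
        sem_wait(&phone_slots);          /* the 5th call blocks here ... */
        printf("notification %d in flight\n", i);
    }

    /* Models loc_category_change_event(): the whole loop runs under the mutex. */
    static void *category_change_event(void *arg)
    {
        (void) arg;
        pthread_mutex_lock(&callback_sess_mutex);   /* ... while this is held */
        for (int i = 0; i < IPC_MAX_ASYNC_CALLS + 1; i++)
            send_notification(i);
        pthread_mutex_unlock(&callback_sess_mutex);
        return NULL;
    }

    /* Models the fibril serving nconfsrv's LOC_CALLBACK_CREATE: it needs the
     * same mutex, so it hangs together with the notification loop. */
    static void *callback_create(void *arg)
    {
        (void) arg;
        pthread_mutex_lock(&callback_sess_mutex);
        pthread_mutex_unlock(&callback_sess_mutex);
        return NULL;
    }

    int main(void)
    {
        sem_init(&phone_slots, 0, IPC_MAX_ASYNC_CALLS);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, category_change_event, NULL);
        pthread_create(&t2, NULL, callback_create, NULL);
        pthread_join(t1, NULL);          /* hangs on the 5th send */
        pthread_join(t2, NULL);
        return 0;
    }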

Version 1, edited 7 years ago by Jakub Jermář

comment:3 by Jakub Jermář, 7 years ago

So it looks like a driver registered itself in a category at locsrv. This resulted in a "category changed event" notification, which transitively depended on devman and its ability to grab the device_tree.rwlock; but in the meantime devman had started another request, which grabbed device_tree.rwlock and was then waiting for the locsrv connection fibril to handle its LOC_SERVICE_REGISTER request. locsrv was thus waiting for devman, and devman was waiting for locsrv.

comment:4 by Jakub Jermář, 7 years ago

Summary: Randomly occurring IPC deadlock on boot → Deadlock between devman and locsrv

comment:5 by Jakub Jermář, 7 years ago

I suspect this became reproducible when the interrupt controller drivers were DDFified. Before that, pciintel would not go through devman when enabling the interrupt.

In order to fix this, three things are needed:

  1. locsrv needs to be able to process multiple exchanges from devman in parallel, so the devman-to-locsrv exchange type will need to be upgraded from the current serial type to a parallel one;
  2. devman needs to be careful not to end exchanges with locsrv before it receives the answer; otherwise another request might reuse the old phone instead of opening a new one, which would again lead to the present deadlock;
  3. the locsrv connection fibril can get blocked even after answering the call from devman in loc_category_change_event() if the limit on async calls per phone (currently 4, see IPC_MAX_ASYNC_CALLS) is reached; we basically need to make sure that once the call is answered, the fibril will not block, otherwise we will keep getting this deadlock.

As for 3., we might either change the code to return an error instead of blocking the fibril when there are already IPC_MAX_ASYNC_CALLS active calls on the phone, or try to answer LOC_SERVICE_ADD_TO_CAT only after loc_category_change_event() returns. Both options are sketched below.
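
Expressed against the same toy model as above (plain C, not a patch against locsrv; the names are illustrative only), the two directions for 3. would look roughly like this:

    #include <semaphore.h>
    #include <stdio.h>

    #define IPC_MAX_ASYNC_CALLS 4

    /* (a) Refuse instead of blocking when all per-phone slots are taken, so
     *     the notification loop can never park the fibril while it still
     *     holds callback_sess_mutex. */
    static int send_notification_try(sem_t *phone_slots)
    {
        if (sem_trywait(phone_slots) != 0)
            return -1;                   /* report an error to the caller */
        /* ... issue the async call; the slot is freed when it is answered ... */
        return 0;
    }

    int main(void)
    {
        sem_t phone_slots;
        sem_init(&phone_slots, 0, IPC_MAX_ASYNC_CALLS);

        /* (b) Alternatively, keep the blocking send but reorder the handler:
         *     do the category bookkeeping, run loc_category_change_event()
         *     first, and answer LOC_SERVICE_ADD_TO_CAT only after it returns,
         *     so that once the answer is out the fibril cannot block anymore. */

        for (int i = 0; i < IPC_MAX_ASYNC_CALLS + 1; i++)
            printf("send %d -> %s\n", i,
                send_notification_try(&phone_slots) == 0 ? "ok" : "refused");
        return 0;
    }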

comment:6 by Jakub Jermář, 7 years ago

Resolution: fixed
Status: assigned → closed