Possible (risk of a) deadlock in ros::Timer impl?
Hi all,
I am using ros::Timer
as illustrated in the following snippet:
Called from a thread A:
std::lock_guard<std::mutex> lock(my_mutex);
if (some_condition) {
  tm = nh.createTimer(ros::Duration(...), &MyClass::timerCallback, this);
}
Called from a thread B:
std::lock_guard<std::mutex> lock(my_mutex);
if (some_other_condition) {
  tm.stop();
}
Called from a thread C:
void MyClass::timerCallback(const ros::TimerEvent& ev) {
  std::lock_guard<std::mutex> lock(my_mutex);
  // ...
}
All my threads are basically spinners from my pool of (async) spinner threads that handle events as they occur. However, nearly every thread needs to lock my_mutex first (as illustrated above). Thread A doesn't really matter here; I've added it just for completeness.
Problem: what happens is that thread B locks (owns) my_mutex, the above-mentioned condition is true, so it calls tm.stop() (verified with a debugger). Meanwhile (most probably shortly before tm.stop() but after B acquired my_mutex), thread C enters timerCallback() and tries to lock my_mutex (first line of that callback). Since my_mutex is already locked by thread B, thread C enters a passive wait, so timerCallback() cannot return. Unfortunately, tm.stop() then blocks forever (deadlock), so my_mutex is never released by thread B and thread C can never continue. I've tracked this down (with the debugger) by following the ROS code, to (ROS Noetic):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fbc10002950) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fbc10002900, cond=0x7fbc10002928) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7fbc10002928, mutex=0x7fbc10002900) at pthread_cond_wait.c:638
#3 0x00007fbc3bb552e3 in ros::CallbackQueue::removeByID(unsigned long) () from /opt/ros/noetic/lib/libroscpp.so
#4 0x00007fbc3bb2eac0 in ros::TimerManager<ros::Time, ros::Duration, ros::TimerEvent>::remove(int) () from /opt/ros/noetic/lib/libroscpp.so
#5 0x00007fbc3bb2d28b in ros::Timer::Impl::stop() () from /opt/ros/noetic/lib/libroscpp.so
and it seems that it's stuck on an internal mutex of the ROS timer implementation. So my tm.stop() calls ros::Timer::Impl::stop(), which, finally, calls ros::CallbackQueue::removeByID(). (Inside that function, there's a lot of locking going on, and it appears to me that a lot of corner cases are being handled as well.)
Is this intended behavior? Does it mean that I am not allowed to do the above locking scheme? Or is this a bug?
Thanks!
Since I got no answer here yet, I also posted it as an issue on Github: https://github.com/ros/ros_comm/issue...
So this is not an answer to your question but more an observation: if all your threads need to lock my_mutex first, wouldn't a single-threaded spinner be easier? This way you could remove all your mutexes and guarantee that everything is thread-safe.

Thx for your hint! I was simplifying a bit: not really ALL threads lock that mutex. There are indeed some callbacks that do not use that 'global' mutex, and some other mutexes as well. But I don't think that this is relevant to my question here, and that's why I omitted it. :-)
So, although there's some serialization enforced by that mutex, some events are (and even have to be) processed in parallel.
Any ideas on this? :-(
You could try to lock in your timer callback and return from the callback function if someone else holds the mutex, allowing thread B to stop the timer.
Thanks for your idea! Unfortunately, there are many situations where the mutex is locked and waiting for it to be released is the intended behavior (i.e., it would be wrong not to execute the code in the callback). In other words, a pending timer.stop() isn't the only reason why my mutex is locked. And it's not even a problem if the callback gets executed although a timer.stop() has been issued in parallel.
More ideas?
If your timer can be stopped during callback execution, I don't see a reason to lock the timer mutex in your callback in the first place. Have you considered splitting the mutex's responsibilities, i.e. timer_mutex, resourceA_mutex, resourceB_mutex...?
Thanks for your thoughts! my_mutex is not intended to protect the timer itself (because I thought ROS timers are thread-safe anyway) but rather to protect against data races within the callbacks (both the callback calling timer.stop() and the timer callback itself). Note that all my callbacks have to do a lot more stuff that I excluded in my examples above for readability.

So, I am already using multiple mutexes for a more fine-grained locking scheme, but the somewhat generic mutex my_mutex must be locked in both places here.

IMHO, it boils down to: why is it necessary (in the ROS impl.) that timer.stop() must wait for the callback to return? I think this is a bug (or a very bad feature). In fact, the stop() docs (only) say "Once this call returns, no more callbacks will be called." So to my understanding, it must ...