ROS Resources: Documentation | Support | Discussion Forum | Index | Service Status | ros @ Robotics Stack Exchange
Ask Your Question
0

Possible (risk of a) deadlock in ros::Timer impl?

asked 2020-06-23 10:45:23 -0500

CodeFinder gravatar image

Hi all,

I am using ros::Timer as illustrated in the following snippet:

Called from a thread A:

std::lock_guard<std::mutex> lock(my_mutex);
if (some_condition) {
  tm = nh.createTimer(ros::Duration(...), MyClass::timerCallback);
}

Called from a thread B:

std::lock_guard<std::mutex> lock(my_mutex);
if (some_other_condition) {
  tm.stop();
}

Called from a thread C:

void MyClass::timerCallback(ros::TimerEvent ev) {
  std::lock_guard<std::mutex> lock(my_mutex);
    // ...
}

All my threads are basically spinner from my pool of (async.) spinner threads that will handle events when they occur. However, nearly every thread needs to lock my_mutex first (as illustrated above). Thread A doesn't really matter here and I've added it just for completeness.

Problem: what happens is that the thread B locks (owns) my_mutex, the above mentioned condition is true so that it calls lm.stop() (verified with a debugger). Meanwhile (or most probably shortly before lm.stop() but after B acquired my_mutex) thread C enters the timerCallback() and tries to lock my_mutex (first line of that callback). Since my_mutex is already locked by thread B, this enters a passive wait so that timerCallback() cannot return. Unfortunately, lm.stop() will block forever (deadlock) so that my_mutex is never released in thread B and thread C can never continue. I've tracked this down (debugger) by following the ROS code, to (ROS Noetic):

#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fbc10002950) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fbc10002900, cond=0x7fbc10002928) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7fbc10002928, mutex=0x7fbc10002900) at pthread_cond_wait.c:638
#3  0x00007fbc3bb552e3 in ros::CallbackQueue::removeByID(unsigned long) () from /opt/ros/noetic/lib/libroscpp.so
#4  0x00007fbc3bb2eac0 in ros::TimerManager<ros::Time, ros::Duration, ros::TimerEvent>::remove(int) () from /opt/ros/noetic/lib/libroscpp.so
#5  0x00007fbc3bb2d28b in ros::Timer::Impl::stop() () from /opt/ros/noetic/lib/libroscpp.so

and it seems that its stuck in an internal mutex of the ROS timer implemenation. So my lm.stop() calls ros::Timer::Impl::stop() which, finally, calls ros::CallbackQueue::removeByID(). (Inside that function, there's a lot of locking going on and it appears to me that there's also a lot of corner cases being handled.)

Is this intended behavior? Does it mean that I am not allowed to do the above locking scheme? Or is this a bug?

Thanks!

edit retag flag offensive close merge delete

Comments

Since I got no answer here yet, I also posted it as an issue on Github: https://github.com/ros/ros_comm/issue...

CodeFinder gravatar image CodeFinder  ( 2020-06-26 08:52:17 -0500 )edit

So this is not an answer to your question but more an observation: if all your threads need to lock my_mutex first, wouldn't a single threaded spinner be more easy? This way you can remove all your mutexes and guaranty that everything is thread-safe.

Rayman gravatar image Rayman  ( 2020-06-26 09:14:32 -0500 )edit

Thx for your hint! I was a bit simplifying in that not really ALL threads are locking that mutex. There are indeed some callbacks that are not using that 'global' mutex or some other mutexes as well. But I don't think that this is relevant to my question here and that's way I omitted it. ๐Ÿ™ˆ

So, although there's some serialization enforced by that mutex, some events are (and even have to be) processed in parallel.

CodeFinder gravatar image CodeFinder  ( 2020-06-26 09:57:40 -0500 )edit

Any ideas on this? :-(

  • push *
CodeFinder gravatar image CodeFinder  ( 2020-09-04 09:27:18 -0500 )edit

You could try to lock in your timer callback and return from the callback function if someone else holds the mutex, allowing thread B to stop the timer.

pcoenen gravatar image pcoenen  ( 2020-09-07 09:43:33 -0500 )edit

Thanks for your idea! Unfortunately, there are many) situations where the mutex is locked and waiting for it to be released is the intended behavior (i.e., it would be wrong to not execute the code in the callback). In other words, calling timer.stop() isn't the only reason why my mutex is locked. And it's even not a problem if it would get executed even if a timer.stop() has been issued in parallel.

More ideas?

CodeFinder gravatar image CodeFinder  ( 2020-09-07 09:48:24 -0500 )edit

If your timer can be stopped during callback execution, I don't see a reason to lock the timer mutex in your callback in the first place. Have you considered splitting the mutex' responsibilities, i.e. timer_mutex, ressourceA_mutex, ressourceB_mutex... ?

pcoenen gravatar image pcoenen  ( 2020-09-08 02:57:03 -0500 )edit

Thanks for your thoughts! my_mutex is not intended to protect the timer itself (because I thought ROS timers are thread-safe anyway) but rather to protect against data race within the callbacks (both the callback calling timer.stop() as well as the timer callback itself). Note that all my callbacks have to do a lot of more stuff that I excluded in my examples above for improved readability.

So, I am already using multiple mutexes for a more fine-grained locking scheme but the somewhat generic mutex my_mutex must be locked in both places here.

IMHO, it boils down to why is it necessary (in the ROS impl.) that timer.stop() must wait for the callback to return? I think this is a bug (or a very bad feature). In fact, the stop() docs (only) say "Once this call returns, no more callbacks will be called." So to my understanding, it must ...(more)

CodeFinder gravatar image CodeFinder  ( 2020-09-09 06:29:51 -0500 )edit

1 Answer

Sort by ยป oldest newest most voted
0

answered 2020-09-09 07:20:41 -0500

Rayman gravatar image

updated 2020-09-09 07:21:39 -0500

You have to really think about what resource your mutex is protecting, it was not clear what your intention is. I guess you have some protected process that you want to start/stop. In that case, it would look something like this:

Called from a thread A:

tm = nh.createTimer(ros::Duration(...), MyClass::timerCallback);

Called from a thread B:

my_mutex.lock();
// at this point, we are guaranteed that no protected process is running, but a timerCallback could be blocked
if (some_condition) {
  need_to_process = false
  my_mutex.unlock();

  // don't lock around timer because it has an internal mutex
  tm.stop(); // wait for finishing of the timerCallback

  // at this point in the code, the timer is timerCallback not running anymore
} else {
  my_mutex.unlock();
}

Called from a thread C:

void MyClass::timerCallback(ros::TimerEvent ev) {
  std::lock_guard<std::mutex> lock(my_mutex);
  if (need_to_process)
    // protected process ...
}
edit flag offensive delete link more

Comments

Sorry for not elaborating more on the background and making that clear(er). And thanks for your reply! :-)

In my use-case, I have several callbacks that are called when ROS msgs are received. Such messages are used as events for processing state changes in a finite state machine (FSM). So, once my_mutex.lock() has been called, the entire callback (= processing of that particular event) must run until return to not break the FSM logic due to race conditions. That's why doing my_mutex.unlock() in between is not an option. Consider this "called from a thread B" as the logic to be executed as part of a particular event of my FSM and such logic/code must always be executed without interruption (thus, the mutex).

I currently ended up in spawning a detachted thread in all my timer callbacks so that after thread creation, the callback can return w/o ...(more)

CodeFinder gravatar image CodeFinder  ( 2020-09-14 06:47:44 -0500 )edit

However, I am still not sure if this is a bug or intended behavior of the ROS timer logic. And I doubt that it's really intended...

CodeFinder gravatar image CodeFinder  ( 2020-09-14 06:48:24 -0500 )edit

Question Tools

2 followers

Stats

Asked: 2020-06-23 10:45:23 -0500

Seen: 705 times

Last updated: Sep 09 '20