Robotics StackExchange | Archived questions

[ROS 2 Humble] Add Publisher in runtime with intra process communication yields deadlock

Hi all,

i am chasing a strange deadlock in my program for a while now. Sadly i am also not a experienced gdb user so any advice for additional gdb commands to get more info is appreciated.

So for one of my threads which wants to add a new publisher i got this backtrace (removed the last parts which show the call trace through my program as i think it isn't relevant)

#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x55cf5859ca8c) at ./nptl/futex-internal.c:103
#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x55cf5859ca8c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:128
#2  0x00007f869ec6a2cf in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x55cf5859ca80) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x55cf5859ca80) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x00007f869fa8bca1 in std::__glibcxx_rwlock_wrlock (__rwlock=0x55cf5859ca80) at /usr/include/c++/11/shared_mutex:80
#5  0x00007f869fa8bcf6 in std::__shared_mutex_pthread::lock (this=0x55cf5859ca80) at /usr/include/c++/11/shared_mutex:193
#6  0x00007f869fa8bda2 in std::shared_timed_mutex::lock (this=0x55cf5859ca80) at /usr/include/c++/11/shared_mutex:473
#7  0x00007f869fa8d3b1 in std::unique_lock<std::shared_timed_mutex>::lock (this=0x7f865b7fae20) at /usr/include/c++/11/bits/unique_lock.h:139
#8  0x00007f869fa8c6f7 in std::unique_lock<std::shared_timed_mutex>::unique_lock (this=0x7f865b7fae20, __m=...) at /usr/include/c++/11/bits/unique_lock.h:69
#9  0x00007f869fa8aa74 in rclcpp::experimental::IntraProcessManager::add_publisher (this=0x55cf5859c9d0, publisher=std::shared_ptr<rclcpp::PublisherBase> (use count 2, weak count 1) = {...}) at /home/sven/colcon_ws/src/rclcpp/rclcpp/src/rclcpp/intra_process_manager.cpp:37
#10 0x00007f86a011925a in rclcpp::Publisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> >::post_init_setup (this=0x7f864000fd50, node_base=0x55cf585aed30, topic=..., qos=..., options=...) at /opt/ros/humble/include/rclcpp/rclcpp/publisher.hpp:202
#11 0x00007f86a014ad6f in rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}::operator()(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&) const (qos=..., topic_name="/alom/multi_ukf_tracker/track_3/lidar", node_base=0x55cf585aed30, __closure=0x7f86400044c0) at /opt/ros/humble/include/rclcpp/rclcpp/publisher_factory.hpp:80
#12 std::__invoke_impl<std::shared_ptr<rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >, rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}&, rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&>(std::__invoke_other, rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}&, rclcpp::node_interfaces::NodeBaseInterface*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#13 std::__invoke_r<std::shared_ptr<rclcpp::PublisherBase>, rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}&, rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&>(rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}&, rclcpp::node_interfaces::NodeBaseInterface*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:116
#14 std::_Function_handler<std::shared_ptr<rclcpp::PublisherBase> (rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&), rclcpp::create_publisher_factory<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >(rclcpp::PublisherOptionsWithAllocator<std::allocator<void> > const&)::{lambda(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)#1}>::_M_invoke(std::_Any_data const&, rclcpp::node_interfaces::NodeBaseInterface*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&) (__functor=..., __args#0=<optimized out>, __args#1="/alom/multi_ukf_tracker/track_3/lidar", __args#2=...) at /usr/include/c++/11/bits/std_function.h:291
#15 0x00007f869fafe3d1 in std::function<std::shared_ptr<rclcpp::PublisherBase> (rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&)>::operator()(rclcpp::node_interfaces::NodeBaseInterface*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&) const (this=0x7f865b7fb120, __args#0=0x55cf585aed30, __args#1="/alom/multi_ukf_tracker/track_3/lidar", __args#2=...) at /usr/include/c++/11/bits/std_function.h:590
#16 0x00007f869fafd724 in rclcpp::node_interfaces::NodeTopics::create_publisher (this=0x55cf58703e50, topic_name="/alom/multi_ukf_tracker/track_3/lidar", publisher_factory=..., qos=...) at /home/sven/colcon_ws/src/rclcpp/rclcpp/src/rclcpp/node_interfaces/node_topics.cpp:50
#17 0x00007f86a010d3ad in rclcpp::detail::create_publisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> >, rclcpp_lifecycle::LifecycleNode, rclcpp_lifecycle::LifecycleNode> (node_parameters=..., node_topics=..., topic_name="/alom/multi_ukf_tracker/track_3/lidar", qos=..., options=...) at /opt/ros/humble/include/rclcpp/rclcpp/create_publisher.hpp:65
#18 0x00007f86a010d5bf in rclcpp::create_publisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, rclcpp_lifecycle::LifecyclePublisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> >, rclcpp_lifecycle::LifecycleNode&> (node=..., topic_name="/alom/multi_ukf_tracker/track_3/lidar", qos=..., options=...) at /opt/ros/humble/include/rclcpp/rclcpp/create_publisher.hpp:104
#19 0x00007f86a010d603 in rclcpp_lifecycle::LifecycleNode::create_publisher<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > (this=this@entry=0x55cf585bdc80, topic_name="/alom/multi_ukf_tracker/track_3/lidar", qos=..., options=...) at /opt/ros/humble/include/rclcpp_lifecycle/rclcpp_lifecycle/lifecycle_node_impl.hpp:52

With source installation of rclcpp i got the already enhanced info of the std:shared_timed_mutex::lock so it tries lock a mutex which is currently in shared (read) lock (otherwise it would be able to lock it, i assume). i checked the mutex data

(gdb) p mutex_
$2 = {<std::__shared_mutex_pthread> = {_M_rwlock = {__data = {__readers = 10, __writers = 0, __wrphase_futex = 2, __writers_futex = 3, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, 
    __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = "\n\000\000\000\000\000\000\000\002\000\000\000\003", '\000' <repeats 42 times>, 
  __align = 10}}, <No data fields>}
(gdb) p mutex_.__data.__readers

This confirmed that the mutex is locked by 10 readers. But i searched through my threads all apply bt for a call of IntraProcessManagerso if any process is hanging within some of the shared lock functions, but i didn't find anything.

I examend the IntraProcessManager Code quickly and found that publishing locks the mutex as shared. Could it be that if we publish many things in parallel and also with a high frequency might it be possible that the mutex never comes unlocked for writing anymore (only by chance)?

is there a method to see the processes which hold the shared lock, similar to the mutex owner? as i would assume that i would find in the backtrace some calls of IntraProcessManager i think it also could be that deleted processes didn't unlock their mutex (which i would think is unlikely but i am new with mutex locks)

Thanks for any idea

Asked by wienans on 2023-02-13 04:42:43 UTC

Comments

Answers