ROS2 publisher dies when subscriber segfaults.

asked 2019-10-22 12:47:50 -0500

csdgn

updated 2019-10-22 14:19:52 -0500

When running a publisher/subscriber pair in ROS 2 Dashing (update 3) with FastRTPS, a subscriber that faults or crashes will, on rare occasions, cause the publisher to fail as well, with the following error:

[component_container-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[component_container-1]   what():  failed to publish message: cannot publish data, at /tmp/binarydeb/ros-dashing-rmw-fastrtps-shared-cpp-0.7.5/src/rmw_publish.cpp:52, at /tmp/binarydeb/ros-dashing-rcl-0.7.7/src/rcl/publisher.c:257
[ERROR] [component_container-1]: process has died [pid 27020, exit code -6, cmd '/opt/ros/dashing/lib/rclcpp_components/component_container __node:=example __ns:=/'].

OpenSplice does NOT have this problem, but has other issues that make it unsuitable (e.g. missing messages).

Each of our modules runs in its own Docker container, with its own copy of ROS 2. Each module stays up for a very long time and subscribes and publishes to many other modules, which makes this rare occurrence much more common and causes significant reliability issues.

I was able to reproduce this (or at least something with the same error message) by running a publisher in a tight loop and segfaulting the subscriber (it sometimes took a few tries). I have uploaded this example to GitLab.

Is there anything I can do to prevent this issue?
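
The "terminate called after throwing an instance of" line in the log above indicates that the exception escaped the publish call uncaught, which is what aborts the process. One possible mitigation, shown here only as a minimal sketch and not as a confirmed fix, is to catch the exception around the publish call and drop the message. The helper name try_publish is hypothetical and not part of rclcpp; the sketch assumes the error derives from std::runtime_error, as rclcpp::exceptions::RCLError does.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical helper (not part of rclcpp): runs a publish call and reports
// failure instead of letting the exception terminate the process.
// Assumption: the publish error derives from std::runtime_error, as
// rclcpp::exceptions::RCLError does.
bool try_publish(const std::function<void()> & publish_call)
{
  try {
    publish_call();
    return true;
  } catch (const std::runtime_error & e) {
    // Log the failure and drop the message; the caller decides what to do next.
    std::cerr << "publish failed, dropping message: " << e.what() << std::endl;
    return false;
  }
}
```

In a node, the call might look like try_publish([&] { pub->publish(msg); }), where pub and msg stand in for an existing publisher and message. This only keeps the process alive; whether the publisher remains usable after such a failure under FastRTPS is not established here.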


Comments

I've seen this as well. I think folks are aware of it, but the best option I can give you is to move to another RMW in the meantime.

stevemacenski ( 2019-10-22 14:03:04 -0500 )

Thanks. I fixed a few typos. I am going to look into OpenSplice's configuration a bit more; it might have a way to reduce the message loss issues.

csdgn ( 2019-10-22 14:21:04 -0500 )