1

rosbag randomly missing topics on record

asked 2016-05-03 10:46:44 -0500

dejanpan

Hi all. We've got a very weird case. Our configuration:

  • main_computer: running roscore
  • slave_A_computer (ROS_MASTER_URI=main_computer)
  • slave_B_computer (ROS_MASTER_URI=main_computer)
  • slave_C_computer (ROS_MASTER_URI=main_computer)
  • ROS: indigo
  • Ubuntu 14.04 (x86)

All computers are connected in one GigE network and time-synced using chrony. On main_computer we run https://github.com/ros-drivers/nmea_n... and on each slave computer we run a GigE camera driver and record the image and GPS topics (along with some other low-bandwidth stuff: diagnostics, tf, etc.) with rosbag (using the C++ program, i.e. rosrun rosbag record). Recordings on all 3 computers are done simultaneously, but locally on each computer.

Now the mysterious thing is that in roughly 1 out of 100 bags recorded on the slave computers, ONE of the GPS topics (out of the 4 that nmea_navsat_driver publishes) is missing. For instance, slave_A and slave_B would have all the topics, but on slave_C /gps/fix would be missing. This all happens on a field robot during operation, where it is impossible to actually pause and debug.

So my question is: how do I debug an issue like this? Clearly the topic is being advertised and is active, since 2 computers get it. Also, since the 3rd computer gets 3 of the 4 topics from nmea_navsat_driver, the network link is up. Is it then the rosbag tool that fails to establish all of its socket connections? Can I somehow constantly log a list of active topics for every computer?
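On the last question, one option is to poll the master periodically and log what it reports. Below is a minimal sketch, assuming the standard ROS Master XML-RPC call `getSystemState` and that `rosbag record` nodes register under names starting with `/record`. The function names, the caller id `/topic_logger` and the `EXPECTED_TOPICS` list are hypothetical. Note that this only shows what is registered with the master, not whether data actually flows over each connection, and node names alone do not say which computer a recorder runs on.

```python
import logging
import time

try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy      # Python 2 (ROS Indigo)

# Hypothetical: topics every recorder is expected to subscribe to.
EXPECTED_TOPICS = ["/gps/fix"]


def missing_by_recorder(system_state, node_prefix, expected_topics):
    """Map each recorder node to the expected topics it is NOT subscribed to.

    system_state is the [publishers, subscribers, services] triple returned
    by getSystemState; each entry has the form [topic, [node, ...]].
    """
    _publishers, subscribers, _services = system_state
    seen = {}
    for topic, nodes in subscribers:
        for node in nodes:
            if node.startswith(node_prefix):
                seen.setdefault(node, set()).add(topic)
    return {node: sorted(set(expected_topics) - topics)
            for node, topics in seen.items()}


def poll_master(master_uri, node_prefix, period_s=5.0):
    """Log every period_s seconds which expected topics each recorder lacks."""
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    master = ServerProxy(master_uri)
    while True:
        code, status, state = master.getSystemState("/topic_logger")
        if code != 1:
            logging.warning("master returned %s: %s", code, status)
        else:
            report = missing_by_recorder(state, node_prefix, EXPECTED_TOPICS)
            for node, missing in sorted(report.items()):
                logging.info("%s missing: %s", node, missing)
        time.sleep(period_s)
```

Something like `poll_master("http://main_computer:11311", "/record")` could then run on any machine for the whole session (11311 is the default master port; adjust if yours differs).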

thx upfront


Comments

Hi Dejan, curious problem. It happening in only 1 of 100 cases is really weird. I guess you know about rostopic list and rostopic hz /gps/fix?

Felix Endres ( 2016-05-04 09:47:49 -0500 )

Thx @Felix Endres. Yes, I know, and I could for instance write a program that would monitor that. I am pretty sure it would tell me that the topic is missing and when it happens, but then what do I do next?

Meanwhile I also found out that we are missing some other topics from the master computer as well.

dejanpan ( 2016-05-05 14:00:04 -0500 )

2 Answers

1

answered 2016-11-02 01:48:40 -0500

dejanpan

@tfoote: we were partitioning the bags based on size inside a single run.

What fixed the above issue was NOT recording the /rosout and /diagnostics topics. Instead we recorded /rosout_agg and /diagnostics_agg. Since we had around 30 nodes in the system, all of them publishing on /rosout, there were apparently just too many connections. Not sure if that fully makes sense, but we have not had this issue since.
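To check which topics have that kind of fan-in on a given system, the publisher count per topic can be read from the master. This is a minimal sketch, again assuming the standard ROS Master XML-RPC call `getSystemState`; the function names and the caller id `/fan_in_check` are hypothetical.

```python
try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy      # Python 2 (ROS Indigo)


def publishers_per_topic(system_state):
    """Return (topic, publisher_count) pairs, highest count first.

    system_state is the [publishers, subscribers, services] triple returned
    by getSystemState; each publishers entry is [topic, [node, ...]].
    """
    publishers, _subscribers, _services = system_state
    counts = [(topic, len(nodes)) for topic, nodes in publishers]
    # Sort by descending count, then by topic name for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


def fetch_system_state(master_uri, caller_id="/fan_in_check"):
    """Query the master once and return its system state triple."""
    code, status, state = ServerProxy(master_uri).getSystemState(caller_id)
    if code != 1:
        raise RuntimeError("getSystemState failed: %s" % status)
    return state
```

A subscriber connects to each publisher of a topic separately, so a topic listed here with around 30 publishers means around 30 inbound connections per recorder, while an aggregated topic such as /rosout_agg normally has a single publisher.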

2

answered 2016-05-07 17:06:56 -0500

tfoote

That is odd. From the fact that it is not available for the whole bag, I'm a little suspicious that the connection is getting dropped for some networking/kernel reason. We don't have robust reconnect logic for broken TCP connections. For that problem, the workaround that can force a reconnect is to bring up a new publisher on the topic. This causes the master to notify all subscribers on the topic, and they will attempt to connect to all publishers, both the new one and the one with the broken connection.
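A rough sketch of how that workaround could be scripted, assuming rospy and the `rostopic` Python module are available on the machine. `stale_topics` and `nudge_topic` are hypothetical helper names, the timeout and hold time are arbitrary, and this has not been verified to make `rosbag record` reconnect in every case.

```python
def stale_topics(last_seen, now, timeout_s):
    """Return topics whose most recent message is older than timeout_s seconds.

    last_seen maps topic name -> time of the last received message (seconds).
    """
    return sorted(topic for topic, stamp in last_seen.items()
                  if now - stamp > timeout_s)


def nudge_topic(topic, hold_s=2.0):
    """Briefly advertise an extra publisher on topic.

    Registering a publisher makes the master send a publisher update to the
    topic's subscribers, which is the reconnect trigger described above.
    Call this only after rospy.init_node() has been called.
    """
    # Imported lazily so stale_topics() works without a ROS installation.
    import rospy
    import rostopic

    msg_class, real_topic, _ = rostopic.get_topic_class(topic)
    if msg_class is None:
        raise RuntimeError("cannot determine message type of %s" % topic)
    pub = rospy.Publisher(real_topic, msg_class, queue_size=1)
    rospy.sleep(hold_s)  # leave time for the update to reach subscribers
    pub.unregister()     # no message is ever published on the topic
```

A watchdog node could subscribe to the GPS topics, record arrival times in `last_seen`, and call `nudge_topic` for whatever `stale_topics` returns.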

Related to that, how are you partitioning your bags? Are you restarting rosbag, or is it partitioning the bags based on size inside a single run?

If you have some extra space/bandwidth for logging, you could turn up the roscpp logging level for rosbag to debug or superdebug, which will give you details about the connection states etc. on rosout.



Stats

Asked: 2016-05-03 10:46:44 -0500

Seen: 1,400 times

Last updated: Nov 02 '16