roscore, publisher, subscriber recovery after failures

asked 2019-11-07 03:02:15 -0500

rickvanderzwet

I am facing a failure to recover after a restart of the roscore or a publisher, which requires me to restart the processes in a specific order. That is sometimes not possible. Two sample cases: a) when I am driving and making a recording using rosbag and one of my publishers crashes (and restarts), I would like it to continue working automatically; b) the roscore is running on a different host, which is sometimes rebooted mid-flight.

The roscore, listener and publisher are all running on different hosts, yet for the sake of simplicity (the behaviour is the same) the example below has them all running on the same host.

I am running Melodic with the listener.py and talker.py sample code found in the https://wiki.ros.org/rospy_tutorials/...., using the following sequence to start/stop roscore, listener and talker (legend: @t = sequence number, - gives action, >> gives state changes):

  @t=1;
  - start roscore
  @t=2;
  - start talker.py
  >> talker shows output
  >> /chatter published

  @t=3;
  - start listener.py
  >> listener shows output

  @t=4;
  - stop roscore
  >> talker shows output
  >> listener shows output
  >> /chatter removed

  @t=5;
  - start roscore
  >> talker shows output
  >> listener shows output

  @t=6;
  - stop listener.py
  >> listener empty

  @t=7;
  - start listener.py

  @t=8;
  - stop talker.py
  >> talker empty

  @t=9;
  - start talker.py
  >> talker shows output
  >> /chatter published

  @t=10;
  - stop listener.py

  @t=11;
  - start listener.py
  >> listener shows output

which is graphically represented like this: [image]

I am looking for ways to:

  • a) Have the publisher 'republish' its state after detecting the roscore has been rebooted, and/or make the topic persistent in roscore.
  • b) Make the listener reconnect to the talker after a roscore reboot.
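
To make the "detecting the roscore has been rebooted" part of a) concrete, here is a minimal sketch of a watchdog, written as plain Python so it does not depend on any ROS package. The class name `MasterWatchdog`, the `get_pid` probe and the `on_restart` callback are all hypothetical names introduced for illustration. With rospy the probe could plausibly wrap the master's `getPid` call, but that wiring is an assumption and is not shown or verified here. What `on_restart` should do (for example restarting the node so that it registers again) is left open.

```python
class MasterWatchdog:
    """Detect a master restart by polling a probe for the master's pid.

    get_pid:    hypothetical callable returning the master's current pid and
                raising an exception while the master is unreachable.
    on_restart: hypothetical callback invoked once per detected restart.
    """

    def __init__(self, get_pid, on_restart):
        self._get_pid = get_pid
        self._on_restart = on_restart
        self._last_pid = None   # pid seen on the last successful poll
        self._outage = False    # True if the master vanished after being seen

    def poll(self):
        """Probe the master once; return True if a restart was detected."""
        try:
            pid = self._get_pid()
        except Exception:
            # Master unreachable. Only count it as an outage if we had
            # already seen the master alive before.
            if self._last_pid is not None:
                self._outage = True
            return False

        # A changed pid means a new master process. An outage followed by
        # the same pid is also treated as a restart, because a rebooted
        # host can hand out the same pid again.
        restarted = self._last_pid is not None and (
            pid != self._last_pid or self._outage
        )
        self._last_pid = pid
        self._outage = False
        if restarted:
            self._on_restart()
        return restarted
```

Calling `poll()` from a timer in each node would be one way to use it. Note that a short network interruption is indistinguishable from a restart in this sketch, so whatever `on_restart` does should be safe to repeat.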

Any suggestions appreciated.


Comments

Two comments:

  1. the master disappearing should not result in nodes dropping connections to each other. Unless you also stop your nodes, things should just continue working (as the master is only involved in setting up connections, not in actual data exchange). If not, something else is going wrong.
  2. re: recoverable master: take a look at vapor_master or DMTCP: Fixing the Single Point of Failure of the ROS Master (slides).
gvdhoorn ( 2019-11-07 03:27:21 -0500 )
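
To illustrate point 1 against the sequence in the question, here is a toy model in plain Python. It is not the rosmaster API: `ToyMaster` and `ToyListener` are hypothetical names, and the model assumes a restarted master comes back with an empty registry, which matches "/chatter removed" at t=4. It only shows why an established link survives a master restart (t=5), while a listener started afterwards finds no publisher to connect to (t=7).

```python
class ToyMaster:
    """Holds registrations only; it never carries message traffic."""

    def __init__(self):
        self.publishers = {}  # topic name -> set of publisher names

    def register_publisher(self, topic, name):
        self.publishers.setdefault(topic, set()).add(name)

    def lookup_publishers(self, topic):
        # Return a copy so callers cannot modify the registry.
        return set(self.publishers.get(topic, set()))


class ToyListener:
    """A subscriber that asks the master for publishers once, at startup."""

    def __init__(self, name):
        self.name = name
        self.connected_to = set()  # direct peer-to-peer links

    def start(self, master, topic):
        self.connected_to = master.lookup_publishers(topic)


master = ToyMaster()                              # t=1: start roscore
master.register_publisher("/chatter", "talker")   # t=2: start talker
listener = ToyListener("listener")
listener.start(master, "/chatter")                # t=3: start listener
links_t3 = set(listener.connected_to)

master = ToyMaster()                              # t=4, t=5: master restarted
links_t5 = set(listener.connected_to)             # existing link untouched

listener = ToyListener("listener")                # t=6, t=7: listener restarted
listener.start(master, "/chatter")
links_t7 = set(listener.connected_to)

print(links_t3, links_t5, links_t7)
```

In this model the talker is still running at t=7, but nothing has told the new master about it, which is the gap that a) and b) in the question are asking to close.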