Remote node doesn't shut down when connection with master is lost

asked 2021-10-20 13:48:35 -0500

aa-tom

I'm working in a ROS environment where the master runs on one host and a node runs on a remote host. I'm trying to modify the remote node so that it shuts down automatically when the connection to the master is severed (in this case, by physically unplugging the Ethernet cable). I've tried the following approaches, with no luck so far:

  • Publishing a custom heartbeat message which the remote node listens for, calling cleanup if one is not received in 1 second, the idea being that if master is unreachable then the node will not be able to read from the topic. I've tried variations where the remote node itself is the heartbeat sender and where a node on the master host is the sender instead.

  • Using rosgraph.is_master_online() with a Timer set to 1 second, calling cleanup when it returns False. I've tried using both rospy.Timer and threading.Timer objects to schedule the callback.

In all cases, the node stays running when the ethernet cable is disconnected, as indicated by a light on the remote host showing the state of the node.

I don't think the issue is with the shutdown, as the same code is called when the node is terminated through Ctrl+C or rosnode kill, which both work fine. Moreover, when the cable is reconnected, the node detects the stopping condition and shuts down.

This is just speculation, but my gut feeling is that when the connection is severed, the remote node is getting paused and never reaching the connection checking code. When the connection is restored, the node unpauses and retroactively hits the check. I haven't seen this behaviour documented anywhere, though. It may just be that I'm overlooking something obvious.

I'd much appreciate any help resolving this issue!

Comments

The issue seems to stem from rosgraph.is_master_online() blocking the node. I've raised a PR with rosgraph here adding an optional timeout parameter to rosgraph.is_master_online(). In the meantime, the following workaround seems to be sufficient, despite feeling a little hacky:

import socket
...
socket.setdefaulttimeout(timeout)
rosgraph.is_master_online()
socket.setdefaulttimeout(None)
...
aa-tom ( 2021-10-20 17:01:43 -0500 )
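
One way to make the workaround above a little less fragile is to restore the previous default timeout even if the check raises. The following is only a sketch in Python 3: the helper names default_socket_timeout and check_with_timeout are made up for illustration and are not part of rosgraph. Note that socket.setdefaulttimeout() only affects sockets created while it is set, and it applies to every thread in the process during that window.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default timeout for new sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if the check raised.
        socket.setdefaulttimeout(previous)


def check_with_timeout(check, seconds):
    """Run check() with the temporary timeout; treat any error as 'offline'."""
    with default_socket_timeout(seconds):
        try:
            return bool(check())
        except Exception:
            return False
```

In a node, the call would presumably look like check_with_timeout(rosgraph.is_master_online, 1.0), run from the timer callback.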

Thank you for sharing. I enjoyed reading your troubleshooting steps and how you found a solution and raised a PR

osilva ( 2021-10-20 18:53:54 -0500 )

I doubt the PR will get merged as-is (still: +1000 for submitting it of course).

Getting that timeout setting where it should be (ie: xmlrpclib) would have a much higher chance I believe.

Or: monkey-patch the specific socket object that is being used to make the request. Not as elegant as it could be, but probably better than setting a program-wide default timeout value.

gvdhoorn ( 2021-10-21 03:39:30 -0500 )
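
For illustration, the idea of putting the timeout on the XML-RPC connection itself, rather than on the whole program, could look roughly like the sketch below. It assumes Python 3 (where xmlrpclib is xmlrpc.client) and an http:// master URI, and it talks to the master directly instead of going through rosgraph. TimeoutTransport, master_is_online and the caller ID "/master_check" are hypothetical names, not existing rosgraph API.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose HTTP connection carries its own timeout."""

    def __init__(self, timeout, **kwargs):
        super().__init__(**kwargs)
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        # http.client.HTTPConnection applies this timeout when it connects.
        conn.timeout = self._timeout
        return conn


def master_is_online(master_uri, timeout=1.0, caller_id="/master_check"):
    """Return True if the master answers getPid within `timeout` seconds."""
    proxy = xmlrpc.client.ServerProxy(
        master_uri, transport=TimeoutTransport(timeout)
    )
    try:
        # The Master API replies with (code, statusMessage, pid); 1 is success.
        code, _message, _pid = proxy.getPid(caller_id)
        return code == 1
    except Exception:
        # Timeouts, refused connections and XML-RPC faults all count as offline.
        return False
```

Only the socket used for this one request gets the timeout, so other sockets in the process keep their normal behaviour.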

Agree that the workaround feels a bit extreme. I'll take a deeper look into xmlrpclib when I get time. Had a brief look yesterday, but couldn't seem to find the ServerProxy.getPid() method being called from rosgraph

aa-tom ( 2021-10-21 05:30:38 -0500 )

the idea being that if master is unreachable then the node will not be able to read from the topic.

Note btw also: it's perfectly possible for the master to not be reachable, while the rest of your application still works. It's not a "normal" situation, but technically master-not-reachable != nodes-are-down. As long as no new connections need to be created, nodes can keep communicating with what they have. It's all peer-to-peer, so as long as connections are established, the master is not needed.

gvdhoorn ( 2021-10-21 05:38:26 -0500 )

Publishing a custom heartbeat message which the remote node listens for, calling cleanup if one is not received in 1 second,

have you seen wiki/bond?

gvdhoorn ( 2021-10-21 05:39:24 -0500 )
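
As a rough illustration of the heartbeat approach quoted above, a watchdog can run in its own thread and fire when no beat has arrived within the timeout. This is a plain-Python sketch, not the bond API: HeartbeatWatchdog is a hypothetical class, and it assumes the subscriber callback for the heartbeat topic calls beat(). The on_timeout callback should avoid anything that can block on the network, otherwise the shutdown stalls in the same way as the connection check.

```python
import threading
import time


class HeartbeatWatchdog:
    """Calls on_timeout once if beat() is not called for `timeout` seconds."""

    def __init__(self, timeout, on_timeout):
        self._timeout = timeout
        self._on_timeout = on_timeout
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.beat()  # start counting from now
        self._thread.start()

    def beat(self):
        """Record that a heartbeat was received."""
        with self._lock:
            self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Poll at a fraction of the timeout so expiry is noticed promptly.
        while not self._stop.wait(self._timeout / 4.0):
            with self._lock:
                silent_for = time.monotonic() - self._last_beat
            if silent_for > self._timeout:
                self._on_timeout()
                return
```

Whether the heartbeat actually stops when the cable is pulled still depends on where the publisher runs: a publisher on the remote host itself keeps delivering messages locally.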

That would explain the behaviour I was seeing when the heartbeat sender was also the remote node, as I presume the p2p connection to itself would've already been established (and would stay alive) when the connection was cut. Still a little unsure about the behaviour when the sender was located on the master host though. Do ROS topics also use xmlrpc under the hood? Possibly a similar blocking issue to the rosgraph example.

The remote node contains a controller for the robot. If the network goes down, we want everything to halt ASAP for safety. It also causes issues when trying to spin up the node again since it's already running (though there's probably options to kill and reboot in that case). Either way, I think the first point rules out that option.

I haven't seen bond before. Thanks for letting me know about it! Looks ...

aa-tom ( 2021-10-21 06:25:15 -0500 )

1 Answer

answered 2021-10-22 04:52:03 -0500

aa-tom

updated 2021-10-22 05:42:21 -0500

rosgraph.is_master_online() was blocking when the network was down. However, I'd also missed a rospy.logwarn() call in the remote node's cleanup code, which blocked under the same conditions.

I've used the following lines of code to force rosgraph.is_master_online() to return False after a given timeout period:

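import socket
import rosgraph
...
socket.setdefaulttimeout(timeout)  # seconds; applies to sockets created from here on
try:
    online = rosgraph.is_master_online()  # False once the timeout expires
finally:
    socket.setdefaulttimeout(None)  # back to blocking sockets
...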

I also replaced the rospy.logwarn() call with a call to warn() on a standard logging.Logger object.
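
For future readers, the logging change might look something like the sketch below. It assumes a logger that writes only to stderr is acceptable for the cleanup path; make_local_logger, cleanup and the logger name are hypothetical, and the actual cleanup steps are application specific. In Python 3, Logger.warning() is the preferred spelling, since warn() is a deprecated alias.

```python
import logging
import sys


def make_local_logger(name="remote_node.cleanup"):
    """Build a logger that writes to stderr only."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Assumption: keeping records away from parent handlers avoids any
    # handler that might touch the network.
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def cleanup(logger):
    """Hypothetical cleanup hook: log locally, then stop the hardware."""
    logger.warning("Connection to master lost; shutting down")
    # ... application-specific shutdown steps would go here ...
```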

Comments

So what is "fully fixed now"?

It would be good to describe your full solution / work-around in your answer, so future readers don't have to guess what you did in the end.

gvdhoorn ( 2021-10-22 05:04:34 -0500 )

Good point! Edited with the specifics

aa-tom ( 2021-10-22 05:42:45 -0500 )
