Remote node doesn't shut down when connection with master is lost
I'm currently working in a ROS environment where master is running on one host and another node is running on a remote host. I'm attempting to modify the remote node to shut down automatically when the connection to master is severed (by physically removing the ethernet cable, in this case). I've tried the following approaches with no luck so far:
Publishing a custom heartbeat message which the remote node listens for, calling cleanup if one is not received in 1 second, the idea being that if master is unreachable then the node will not be able to read from the topic. I've tried variations where the remote node itself is the heartbeat sender and where a node on the master host is the sender instead.
Using rosgraph.is_master_online() with a Timer set to 1 second, calling cleanup when it returns False. I've tried using both rospy.Timer and threading.Timer objects to schedule the callback.
In all cases, the node stays running when the ethernet cable is disconnected, as indicated by a light on the remote host showing the state of the node.
I don't think the issue is with the shutdown, as the same code is called when the node is terminated through Ctrl+C or rosnode kill, which both work fine. Moreover, when the cable is reconnected, the node detects the stopping condition and shuts down.
This is just speculation, but my gut feeling is that when the connection is severed, the remote node is getting paused and never reaching the connection checking code. When the connection is restored, the node unpauses and retroactively hits the check. I haven't seen this behaviour documented anywhere, though. It may just be that I'm overlooking something obvious.
I'd much appreciate any help resolving this issue!
The issue seems to stem from rosgraph.is_master_online() blocking the node. I've raised a PR with rosgraph here adding an optional timeout parameter to rosgraph.is_master_online(). In the meantime, the following solution seems to be sufficient, despite feeling a little hacky.
Thank you for sharing. I enjoyed reading your troubleshooting steps and how you found a solution and raised a PR.
I doubt the PR will get merged as-is (still: +1000 for submitting it of course). Getting that timeout setting where it should be (ie: xmlrpclib) would have a much higher chance I believe.
Or: monkey-patch the specific socket object that is being used to make the request. Not as elegant as it could be, but probably better than setting a program-wide default timeout value.
Agree that the workaround feels a bit extreme. I'll take a deeper look into xmlrpclib when I get time. Had a brief look yesterday, but couldn't seem to find the ServerProxy.getPid() method being called from rosgraph.
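For anyone weighing the options in the comments above, here is a minimal sketch of a per-request timeout at the XML-RPC level, as opposed to a program-wide default. It is not the code from the PR or the workaround in the answer. The names TimeoutTransport and master_reachable, the caller ID, and the 1 second default are made up for illustration. It relies on two things: xmlrpc.client (xmlrpclib on Python 2) builds remote methods such as getPid dynamically on attribute access, so there is no literal getPid definition to find on ServerProxy, and the master API exposes getPid(caller_id).

```python
import http.client
import xmlrpc.client  # this module is called xmlrpclib on Python 2


class TimeoutTransport(xmlrpc.client.Transport):
    """Hypothetical transport that applies a socket timeout to its own connection only."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        # The parent class returns an http.client.HTTPConnection; setting its
        # timeout attribute before the request is sent bounds connect and read.
        conn = super().make_connection(host)
        conn.timeout = self._timeout
        return conn


def master_reachable(master_uri, caller_id="/connection_check", timeout=1.0):
    """Return True if the master answers getPid within `timeout` seconds.

    Any reply counts as reachable; timeouts, refused connections and
    protocol errors all count as unreachable.
    """
    proxy = xmlrpc.client.ServerProxy(master_uri, transport=TimeoutTransport(timeout))
    try:
        proxy.getPid(caller_id)  # resolved dynamically by ServerProxy
        return True
    except (OSError, http.client.HTTPException, xmlrpc.client.Error):
        return False


# Possible use inside a periodic callback (assumes the usual ROS_MASTER_URI):
#   if not master_reachable(os.environ["ROS_MASTER_URI"]):
#       cleanup()
```

Because the timeout lives on the transport, other sockets in the process keep their own settings, which is the main difference from a program-wide default.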
Note btw also: it's perfectly possible for the master to not be reachable, while the rest of your application still works. It's not a "normal" situation, but technically master-not-reachable != nodes-are-down. As long as no new connections need to be created, nodes can keep communicating with what they have. It's all peer-to-peer, so as long as connections are established, the master is not needed.
have you seen wiki/bond?
That would explain the behaviour I was seeing when the heartbeat sender was also the remote node, as I presume the p2p connection to itself would've already been established (and would stay alive) when the connection was cut. Still a little unsure about the behaviour when the sender was located on the master host though. Do ROS topics also use xmlrpc under the hood? Possibly a similar blocking issue as the rosgraph example.
The remote node contains a controller for the robot. If the network goes down, we want everything to halt ASAP for safety. It also causes issues when trying to spin up the node again since it's already running (though there are probably options to kill and reboot in that case). Either way, I think the first point rules out that option.
I haven't seen bond before. Thanks for letting me know about it! Looks ...
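For reference, the heartbeat idea from the question and the bond suggestion both come down to a watchdog that fires when no beat arrives in time. The sketch below shows that general pattern only; it is not bond's API, and HeartbeatWatchdog is a made-up name. It assumes the deadline is checked on its own thread against the local monotonic clock, so the check does not wait on any network call. Whether that would have avoided the stall described in the question has not been verified.

```python
import threading
import time


class HeartbeatWatchdog:
    """Hypothetical watchdog: calls on_timeout once if beat() is not called in time."""

    def __init__(self, timeout, on_timeout):
        self._timeout = timeout
        self._on_timeout = on_timeout
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._last_beat = time.monotonic()
        self._thread.start()

    def beat(self):
        # Call this from the heartbeat subscriber callback.
        self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Poll at a quarter of the timeout; wait() returns True once stop() is called.
        while not self._stop.wait(self._timeout / 4.0):
            if time.monotonic() - self._last_beat > self._timeout:
                self._on_timeout()
                return
```

In a rospy node, beat() would presumably be wired to the heartbeat subscriber's callback and on_timeout to the existing cleanup routine, for example one that ends with rospy.signal_shutdown("heartbeat lost").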