# Message-delivery guarantees in ROS1

Q: What are the message-delivery guarantees of ROS1? I could not find documentation on that. And a corollary question, assuming no guarantees are present, would be: are there mitigation strategies to be used for side-effecting messages, like a joint trajectory action goal not being delivered?

The context of the question is that I have a long-running process (~10 hours) that makes use of the JointTrajectoryAction server of ROS-I, and sends a goal to the action server about 3000 times during that time. I use rosbridge to connect to ROS.

The problem I see is that, of those 3000 action goals, one occasionally goes missing and never goes through: the robot doesn't move, and the code waits forever thinking the goal was sent, even though it never made it to the action server. I am trying to understand where the problem is. I have confirmed that when this happens, the goal message is never published to the goal topic of the action server. That explains most of the rest, but doesn't clarify WHY it was not published.

As far as I can see, there are two potential sources of failure:

1) rosbridge is dropping the message somewhere internally without notice (nothing in ~/.ros/log/latest/rosbridge_websocket-*.log indicates this is the case though); or...

2) ROS1 message-delivery guarantees simply do not guarantee at-least-once, in which case this is a situation that is to be expected and needs to be dealt with somehow.

This question specifically assumes the second scenario, because if it is a fact that there is no delivery guarantee (which I'm leaning towards believing, given the lack of documentation), then a mitigation strategy is necessary regardless of whether scenario one is also possible. But right now I'm failing to see any viable option to deal with it. A naive approach would involve something like: 1) send the goal, 2) wait for a status message for the sent goal id, 3) if no status is received within a certain timespan (at least a few seconds), 4) declare the goal as unpublished and publish it again. I honestly don't feel good about implementing such a thing; it doesn't feel like a safe or correct strategy to deal with this, so I'm interested in hearing alternatives.
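For reference, the naive approach described above could be sketched roughly as follows. This is a hypothetical sketch, not real actionlib wiring: `publish_goal` and `on_status` are stand-ins for the client-library calls that would publish a goal and register a status callback for a given goal id.

```python
import threading
import uuid


def send_goal_with_retry(publish_goal, on_status, timeout=5.0, max_attempts=3):
    """Publish a goal and re-publish it if no status message for its id
    arrives within `timeout` seconds.

    `publish_goal(goal_id)` and `on_status(goal_id, callback)` are
    hypothetical stand-ins for the real client-library calls: the first
    publishes a goal with the given id, the second invokes `callback`
    whenever a status for that id is observed.

    Caveat: if a goal WAS delivered but its status was lost, this retries
    a goal that is already executing -- which is exactly why the strategy
    feels unsafe without some form of server-side deduplication.
    """
    for attempt in range(1, max_attempts + 1):
        goal_id = str(uuid.uuid4())
        acked = threading.Event()
        on_status(goal_id, acked.set)   # step 2: wait for status on this id
        publish_goal(goal_id)           # step 1: send the goal
        if acked.wait(timeout):         # step 3: status seen within timeout?
            return goal_id
        # step 4: declare the goal unpublished and publish again
    raise RuntimeError("no status seen after %d attempts" % max_attempts)
```

The caveat in the docstring is the core of my discomfort: the retry is only safe if the server can recognize and ignore a duplicate of a goal it already accepted.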



AFAIK, there are no guarantees other than what TCP gives you. That's partly why the QoS provided by DDS (and similar middlewares) are so valuable in ROS 2.

re: publishing goal: this confuses me. Goals are part of actionlib, and you're not really supposed to be handling all of that yourself. I'm guessing you're doing this because you're using a non-standard client library and have rosbridge in there somewhere.

re: using rosbridge: this adds a number of additional points of failure, as messages are being transformed multiple times between different communication domains. All of that will not make it easier to provide any guarantees, if there were any.

I honestly don't feel very good about implementing such a thing, it doesn't feel like a safe or correct strategy to deal with this

could you expand on why the described approach is not "safe or ...(more)

( 2020-01-04 09:09:46 -0600 )

.. as well as the goal (i.e. task) that is being processed. If you send a goal to a server but don't see it acknowledging reception by giving you feedback that it has accepted the goal, it would seem reasonable to assume it hasn't received it (or that you haven't received the rejection feedback that was already sent).

rosbridge has been known to not be entirely stable when dealing with either long sessions or very busy / contentious sessions. I haven't used it extensively recently, so perhaps this has been improved. I would however recommend running Wireshark (for instance) in parallel to monitor the websocket connection, as well as perhaps rosbag to record the action topics. That would give you an idea of where messages could be lost.

And I'm posting this as a comment as I cannot give you an authoritative answer.

I can say however that IIRC, all of ...(more)

( 2020-01-04 09:14:14 -0600 )

Finally, I would also say that it could well be that there are (unknown) issues with the joint_trajectory_action from industrial_robot_client (I'm guessing that is the node you are referring to).

I believe you have a network and system topology something like the following:

```
roslibpy <-> websocket <-> rosbridge <-> ROSTCP <-> JTA <-> ROSTCP <-> robot_driver <-> TCP/IP
```

There are quite a few points where this could fail. Capturing the bare TCP/IP connection with Wireshark, and the rest with rosbag, could provide you with information on where things fail.

As far as I can see, there are two potential sources of failure

what about your (custom?) client lib not actually publishing (i.e. actually pushing out the bytes)?

There is a way to use Services to host actions (see actionlib/server/service_server.h), but it's no longer asynchronous.

( 2020-01-04 09:17:10 -0600 )

Thanks @gvdhoorn for the very complete answer (as usual)! I believe you should actually post it as an answer, because it's exactly the confirmation I was looking for. You mention the two things that answer it: 1) there are no guarantees other than what TCP gives you. That's partly why the QoS provided by DDS [..] are so valuable in ROS 2 and 2) all of the networking code in ROS 1 was written with the assumption of there being a "perfect network" [..]

publishing goal

I am not handling it myself, but I know that goals are effectively published to a topic, so I monitored this by recording a rosbag of all action topics and checking whether the goal is ever published.

could you expand on why the described approach is not "safe or correct"?

I was hoping for a more fundamental guarantee; but ...(more)

( 2020-01-04 09:50:39 -0600 )
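The check described above (verifying from a rosbag recording whether a goal was ever published) can be sketched as follows. This assumes the bag has already been reduced to a list of `(topic, goal_id)` records, e.g. via the rosbag Python API; the record format here is a hypothetical simplification.

```python
def missing_goals(sent_ids, records, goal_topic):
    """Given the ids of all goals the client believes it sent, return the
    ones that never appeared on the action server's goal topic.

    `records` is a hypothetical pre-extracted list of (topic, goal_id)
    tuples; in practice these would come out of a rosbag recording of
    the action topics.
    """
    seen = {gid for topic, gid in records if topic == goal_topic}
    # preserve send order so the missing goals can be correlated
    # with timestamps in the client's own logs
    return [gid for gid in sent_ids if gid not in seen]
```

Any id this returns corresponds to a goal that the client thought it sent but that never reached the goal topic.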

I believe you should actually post it as an answer because it's exactly the confirmation I was looking for

Well, it's all based on my current understanding of the code involved. I was not around when this was written, so I cannot authoritatively answer your question. Hence the comment.

As TCP/IP is pretty good at delivering data, even over lossy networks (note: not perfect, but good enough), I'm not sure whether the message(s) are actually lost during transport. I would expect it to happen somewhere inside a node, or perhaps there's even a logic error somewhere causing things not to be published at all.

I am not handling it myself

If you could give a bit more information on what you are actually doing (i.e. how things are connected, and who is sending what, and when), perhaps we can provide some more (actual) insight.

( 2020-01-04 09:54:46 -0600 )

rosbridge has been known to not be entirely stable when dealing with either long sessions or very busy / contentious sessions.

Yes, I noticed it. I initially tried to initialize the action client before every move, and that quickly failed in a myriad of ways.

I believe you have a network and system topology something like the following:

Indeed, that's exactly the system.

what about your (custom?) client lib not actually publishing (ie: actually pushing out the bytes)?

It could be. Wireshark will help there, as you suggest. It's not so custom though: the library is roslibpy, I use the released version with no customization, and the actionlib code in roslibpy is a very literal port of roslibjs. However, there might be some weird edge cases due to the differences between the JavaScript event loop and the Twisted/Autobahn one used by roslibpy.

There is a ...(more)
( 2020-01-04 09:59:37 -0600 )

It's not so custom though: the library is roslibpy, I use the released version with no customization, and the actionlib code in roslibpy is a very literal port of roslibjs. However, there might be some weird edge cases due to the differences between the JavaScript event loop and the Twisted/Autobahn one used by roslibpy.

Neither roslibjs nor any clients based on it come close to how well the behaviour of roscpp and rospy is known (note: not how good those client libraries are; they are far from faultless, but at least quite a number of their failure modes are known: see the ros_comm issue tracker, for instance).

Given the amount of time (10 hrs) and the number of requests (3000), it's most likely not going to be easy to reproduce this, so I'd add as much logging and introspection capability to nodes and infrastructure under ...(more)

( 2020-01-04 10:11:32 -0600 )
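As a simplistic illustration of the logging suggestion above, each publish attempt could be wrapped so that it leaves a timestamped trace that can later be correlated with the rosbag and Wireshark captures. `publish` here is a hypothetical stand-in for the real client-library publish call.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("goal_audit")


def audited_publish(publish, goal_id, payload):
    """Call `publish(payload)` while logging when the attempt started,
    whether it raised, and how long the call took.

    The timestamps in the log make it possible to line the client's
    view of events up with rosbag recordings and Wireshark captures,
    narrowing down where a goal was lost.
    """
    t0 = time.time()
    log.info("publishing goal %s", goal_id)
    try:
        publish(payload)
    except Exception:
        log.exception("publish of goal %s raised", goal_id)
        raise
    log.info("goal %s handed to the library after %.3fs",
             goal_id, time.time() - t0)
```

Note that a successful return only means the bytes were handed to the client library, not that they reached the server; that is why the external captures are still needed.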