Robotics StackExchange | Archived questions

Optionally disable ros publisher buffering

Hi everyone,

We are using ROS as a basis for our robotics/AI project. It's been great so far with all the tools and code it provides, however today I ran into a small (big?) problem.

A small conceptual explanation of how our system is designed and how it works. We have some input on a socket port which is basically visual data. We have a ROS node that reads from this port, does some processing, and publishes the data ready for consumption on a topic. This happens at 60Hz. We have another node, the control node, that consumes this visual data and sends instructions out over a serial port. This also happens at 60Hz. So far so good; this is standard ROS stuff and besides some minor caveats we've had no problems using this model.

Now here comes my problem. Some buffering takes place in the visual node when publishing the processed data. So at 60Hz we receive visual data from the socket, and at 60Hz we call publisher.publish(processedVisualData);. What sometimes happens is that new data from the socket arrives before the previously published data is actually sent to the control node. I know this for sure because sometimes this happens in the control node (function calls derived from print statements throughout my program):

So what happens here is that, because of the buffering in the visual node (or in the TCP layer of the subscriber? I've been looking at the ROS source code a lot lately and I've seen buffers in literally every class I look at), the control node sometimes runs without a new visual update, even though that visual update was given to the publisher in the visual node. So even though both nodes run at 60Hz (I've confirmed this with rostopic hz; the throughput of visual messages and control messages is almost 60Hz), the control node effectively runs at 30Hz, because every other cycle it has to reuse "stale" data.

There are a few possible solutions I've found:

In the best case scenario there would be an option in ROS publishers/subscribers or in the roscore program to make some topics "real-time", i.e. to make sure that some pieces of data get pumped through the system A.S.A.P., and don't lie around for several milliseconds in some buffers. However I've googled quite a bit already and read a lot of ROS source, and there doesn't seem to be a button for that (except for the earlier thing on line 215, but that's not configurable nor reliable). I really hope I'm wrong though.

We can probably work around it, but the situation as it is now is highly undesirable.

EDIT: As suggested by @gvdhoorn tcpNoDelay can be set on subscribers. This improves my control throughput from 30 to 45 "fresh" cycles per second. The problem is still not solved but it's already better with very little effort. I have a feeling that if tcpNoDelay would also be present on publishers the problem would be solved; at the time of writing however this is not the case.

Asked by bobismijnnaam on 2017-06-20 05:07:15 UTC
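
For context on the EDIT above: in roscpp the hint is requested on the subscriber with ros::TransportHints().tcpNoDelay(). At the socket level this corresponds to the TCP_NODELAY option, which disables Nagle's algorithm (the mechanism that may hold back small writes in order to coalesce them). The following is a minimal sketch of that option on a plain POSIX socket, not roscpp code; the helper names (set_tcp_nodelay, get_tcp_nodelay, probe_nodelay) are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_NODELAY
#include <sys/socket.h>   // socket, setsockopt, getsockopt
#include <unistd.h>       // close

// Enable or disable Nagle's algorithm on a TCP socket.
// Returns true if the kernel accepted the option.
bool set_tcp_nodelay(int fd, bool enable) {
    assert(fd >= 0);  // precondition: caller passes a valid descriptor
    int flag = enable ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}

// Read the current TCP_NODELAY setting back from the kernel.
// Returns 1 if set, 0 if not set, -1 on error.
int get_tcp_nodelay(int fd) {
    int flag = 0;
    socklen_t len = sizeof(flag);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0) return -1;
    return flag != 0 ? 1 : 0;
}

// Hypothetical helper: state of the option on a fresh socket, before and
// after enabling it. Fields stay at -1 if a step fails.
struct NoDelayReport {
    int before;
    int after;
};

NoDelayReport probe_nodelay() {
    NoDelayReport report{-1, -1};
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return report;
    report.before = get_tcp_nodelay(fd);
    if (set_tcp_nodelay(fd, true)) report.after = get_tcp_nodelay(fd);
    close(fd);
    return report;
}
```

Calling probe_nodelay() on a fresh socket shows that the option is off unless something turns it on, which is consistent with the hint having to be requested explicitly.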

Comments

> Change line 215 on [link] from true to false

I'm confused here: according to this and the page you link, immediate_write is already false.

Asked by gvdhoorn on 2017-06-20 05:13:41 UTC

Whoops, that's on me. Of course I meant to write true. Changing it now.

Asked by bobismijnnaam on 2017-06-20 05:14:41 UTC

And a question: have you looked into the transport hints? Especially tcpNoDelay?

Asked by gvdhoorn on 2017-06-20 05:15:18 UTC

I assumed tcpNoDelay is true always and would only be false if you want to test your application under dire circumstances. Is this not the case?

Asked by bobismijnnaam on 2017-06-20 05:16:47 UTC

As can be seen on http://docs.ros.org/api/roscpp/html/classros_1_1TransportHints.html#a03191a9987162fca0ae2c81fa79fcde9, I suppose I'm right? Please correct me if I'm wrong.

Asked by bobismijnnaam on 2017-06-20 05:19:09 UTC

And a high-level comment: I think what you are seeing is the classic event-based vs polling (or periodic) system clash. ie: on the one hand you have an async comms pattern (pub-sub) and on the other a sync, periodic 'control system'. Marrying those is always a bit of a challenge.

Asked by gvdhoorn on 2017-06-20 05:20:34 UTC
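
To illustrate the event-based vs periodic clash described in the comment above, here is a minimal, ROS-free sketch. It assumes a producer and a consumer with the same period, a consumer that polls once per period at a fixed offset, and a delivery delay that varies per message. The function name count_fresh_cycles and all numbers used with it are hypothetical; this is a toy model, not a measurement of the system in the question.

```cpp
#include <cassert>
#include <vector>

// Count how many consumer cycles see a message that is new since the
// previous cycle. All times are in microseconds.
//   period_us      : shared period of producer and consumer
//   poll_offset_us : point within each period at which the consumer polls
//   delays_us      : message k is delivered delays_us[k % size] after sending
// Assumes every delay is shorter than one period, so arrivals stay ordered.
int count_fresh_cycles(int cycles, long period_us, long poll_offset_us,
                       const std::vector<long>& delays_us) {
    assert(!delays_us.empty());
    int fresh = 0;
    int last_seen = -1;  // sequence number consumed in an earlier cycle
    for (int c = 0; c < cycles; ++c) {
        const long poll_time = c * period_us + poll_offset_us;
        // Find the newest message that has arrived by poll_time.
        int newest = -1;
        for (int k = 0; k <= c; ++k) {
            const long arrival = k * period_us + delays_us[k % delays_us.size()];
            if (arrival <= poll_time) newest = k;
        }
        if (newest > last_seen) {
            ++fresh;  // this cycle works on data it has not seen before
            last_seen = newest;
        }
    }
    return fresh;
}
```

With a 16667 us period and a poll 5000 us into each period, a constant 2000 us delay makes all 60 of 60 cycles fresh. If the delay alternates between 2000 us and 12000 us, every second message misses its poll and only 30 of 60 cycles are fresh, even though both sides still run at 60Hz.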

> I assumed tcpNoDelay is true always and would only be false if you want to test your application under dire circumstances. Is this not the case?

Afaik, tcpNoDelay is not enabled by default. That is also not what the code shows. The function arg has a true default, but that is something else.

Asked by gvdhoorn on 2017-06-20 05:22:31 UTC

About the high-level comment: I agree. With the information I have now after almost a year of development I'm not sure if I would pick ROS's publisher/subscriber system again. At the very least I would go with nodelets (at least for the critical stuff. For debugging infrastructure pub/sub is great).

Asked by bobismijnnaam on 2017-06-20 05:24:45 UTC

About the tcpNoDelay: the docs say it only works for subscribers. I will test, but I'm skeptical if it will solve my problem. Thanks for your input!

Asked by bobismijnnaam on 2017-06-20 05:25:12 UTC

Setting tcpNoDelay by default would also probably not be such a good idea: it only really works well with small messages, and in some cases could actually be detrimental to performance.

I'm also not saying it'll necessarily solve your problem btw.

Asked by gvdhoorn on 2017-06-20 05:25:39 UTC

I would suggest you look at OROCOS: there is nothing inherently 'wrong' with pub/sub, it's the scheduling of tasks and how they interact with their message queues. As ROS1 does not have a component runtime model, there is not too much that you can do about it, other than setting your buffer ..

Asked by gvdhoorn on 2017-06-20 05:27:22 UTC

.. depth to 1. This should ensure you are always working with the latest data. At least on the subscriber side (but the data must have arrived first, of course).

ROS2 should change this and should allow you much more control over node scheduling, similar to how other frameworks have done this.

Asked by gvdhoorn on 2017-06-20 05:28:07 UTC
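
The effect of a buffer depth of 1 can be sketched without ROS. The class below is a hypothetical model of a bounded queue that throws away its oldest entry when full, which is conceptually how a subscriber queue of a given depth behaves. The names DropOldestQueue and next_seen are made up for this illustration and are not part of roscpp.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bounded queue: when full, the oldest message is dropped to
// make room for the new one.
template <typename T>
class DropOldestQueue {
public:
    explicit DropOldestQueue(std::size_t depth) : depth_(depth) {
        assert(depth_ > 0);
    }

    void push(const T& msg) {
        if (buf_.size() == depth_) buf_.pop_front();  // discard the oldest
        buf_.push_back(msg);
    }

    // Copies the oldest buffered message into out and removes it.
    // Returns false if the queue is empty.
    bool pop(T& out) {
        if (buf_.empty()) return false;
        out = buf_.front();
        buf_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<T> buf_;
};

// Three messages (sequence numbers 1, 2, 3) arrive between two consumer
// cycles. Returns the sequence number the consumer processes next, or -1
// if nothing is buffered.
int next_seen(std::size_t depth) {
    DropOldestQueue<int> queue(depth);
    for (int seq = 1; seq <= 3; ++seq) queue.push(seq);
    int msg = -1;
    queue.pop(msg);
    return msg;
}
```

With a depth of 1 the consumer gets message 3, the latest one. With a depth of 3 it gets message 1 and has to work through the backlog first. As the comment notes, this only helps once the data has actually arrived on the subscriber side.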

> there is not too much that you can do about it

Not entirely true: with some effort, a custom CallbackQueue and scheduling mechanism could probably do something about this.

Asked by gvdhoorn on 2017-06-20 05:31:02 UTC

Setting tcpNoDelay to true bumps up the control output to 45 per second instead of 30. Which makes sense, because now only the publisher side is not using tcpNoDelay. If there were a tcpNoDelay for publishers the problem would be solved, I imagine.

Asked by bobismijnnaam on 2017-06-20 05:33:43 UTC

Also, I think our messages are relatively small, just a bunch of ints and floats.

Asked by bobismijnnaam on 2017-06-20 05:33:59 UTC

I'm not sure if I'm willing to go as far as implementing my own message queue. As it is now we've got enough complexity to deal with already. Maybe this is good enough.

Asked by bobismijnnaam on 2017-06-20 05:34:51 UTC

Something else I noticed:

> we've already had some performance issues in the past w.r.t. many frequently accessed ROS parameters

This seems strange: the parameter server is not meant to be accessed frequently. That is what topics are for.

Asked by gvdhoorn on 2017-06-20 05:51:41 UTC

Btw: changing a node to a nodelet isn't really that much work, depending on what your dataflow is like and how you use the msgs. In a nutshell: use the nodelet's NodeHandle instead of a regular one, a Nodelet subclass instead of main(), and ConstPtr instead of regular Ptrs for msgs.

Asked by gvdhoorn on 2017-06-20 05:53:05 UTC

There's even a page that provides an overview of what to do: wiki/nodelet/Tutorials/Porting nodes to nodelets (ignore the rosbuild references, obviously).

Asked by gvdhoorn on 2017-06-20 05:54:20 UTC

I think it is quite feasible to convert our current project to nodelets for the critical stuff. However we have a very critical deadline in less than a few weeks so we just can't afford it now. About params: hindsight is 20/20; I agree with you.

Asked by bobismijnnaam on 2017-06-20 05:58:37 UTC

One more thing to check, as you don't mention it: how is your control flow set up in your subscriber(s)? You mention spinOnce(), but it's hard to tell whether you are running a while(true) .. or are using timers or something else.

Asked by gvdhoorn on 2017-06-21 01:31:13 UTC

I'm using an ordinary while(ros::ok()) {....;spinOnce();....}. I could possibly call spinOnce() directly after publishing the "critical" data, but I'm unsure if that will solve the problem (ROS will still buffer everything).

Asked by bobismijnnaam on 2017-06-21 04:39:43 UTC

I was more interested in the consumer here (ie: subscriber). If you use a while-loop with a sleep or ros::Rate there as well, that could be limiting your throughput.

Asked by gvdhoorn on 2017-06-22 11:08:24 UTC

Ah, ok. I use a rate at 60Hz. Disabling it doesn't change anything, neither on the subscriber nor on the publisher side.

Asked by bobismijnnaam on 2017-06-22 11:29:26 UTC

Answers