subscribers in noetic seem to fail at a faster rate than in previous versions when there are a large number of subscribers

asked 2021-02-01 23:00:16 -0500

jblee123
13 ●1 ●3 ●4

updated 2021-02-02 10:05:32 -0500

gvdhoorn
86574 ●283 ●1432 ●1054 http://cor.tudelft.nl/

My organization has a large ROS-based system with on the order of almost 300 topics being transmitted at various rates. I was recently tasked with updating from Kinetic on Ubuntu 16.04 to Noetic on 20.04, and after I got it up and running, there was abnormal behavior that I eventually tracked down to some topics simply not being handled in some nodes. However, there was a probabilistic nature to which topics were not being handled. Sometimes the system would work well, and sometimes one or more topics were unhandled in certain nodes.

After much investigation, I was able to determine several data points.

I was not simply getting fewer messages of a certain topic handled, but no messages handled.
The outages were on a per-connection basis as opposed to a per-topic basis. I.e. if Node1 and Node2 both subscribed to TopicA, there were times Node1 ignored the topic while Node2 was handling it.
rosnode info and rostopic info both showed that all the connections that were supposed to be there were.
roswtf just crashes with an OsNotDetected exception.
The ss networking tool showed that data was going across the sockets and being successfully received/ACKed, which would seem to imply that data was getting there, just not being handled.
Which handlers failed to be called was somewhat probabalistic, and possibly dependent on position in code. For example, a ROS timer that was initialized at the end of the onInit() of a node with many publishers and subscribers would be handled, while if it was moved to the top of the function, it would not be handled.
Using single vs multi-threaded NodeHandlers did not solve the problem.

So I decided to make a small roscpp-based app for stress testing. I created two nodes -- one publisher and one subscriber. The publisher advertised N topics and published an Int32 message over each of them once a second. The subscriber node simply received them and kept track of which topics had been heard. The nodes were run in separate processes.

When N was in the 400-500 range, I began seeing the same issue manifest -- some topics were being ignored for the duration of the run. Which topics they were and how many there were changed from run to run, and at the lower end of the range (N=400), sometimes all topics were handled.

Ignored topics were in continuous blocks (going by the order in which they were subscribed to):

On 20.04, the ignored topics are nearly always among those first subscribed to (e.g. topics 0-120).
On 18.04, the ignored topics are nearly always among the last of those subscribed to (e.g. topics 950-990).

I understand that our system is rather large, and that any system will eventually degrade if loaded enough, but we have been running it on Ubuntu 16.04 and 18.04 systems, so I wasn't sure why this issue was suddenly becoming visible. I ran the stress test on 16 ... (more)

edit retag flag offensive close merge delete

Comments

And a suggestion: please update the title of your question with the actual observation.

Your current title does not really mean anything and seems to already suggest a possible conclusion.

gvdhoorn ( 2021-02-02 03:07:31 -0500 )edit

add a comment

Comments

Yes, they are nodelets, but they are all running in their own process (defeats the purpose of nodelets, I know, but project history... :/ ), so I mentally conflate the two. I'm betting that issue is related. I did a quick test where I set a single one-shot timer in onInit() and did the "real" initialization in its handler, and I am now getting result comparable to kinetic/melodic -- it now starts bugging out at the N=[900-1000] range, and when it does, I am getting the "couldn't resolve publisher host" error messages like I did with previous versions. I'll need to spend some time investigating in the large project to see if this workaround pans out.

jblee123 ( 2021-02-02 09:39:21 -0500 )edit

IIUC, the regression discussed in the issue I linked has been 'fixed' by reverting the PR which introduced it.

So depending on when you ran your tests, you may just need to update your ROS packages and it could already be back to how things worked in Kinetic.

I would not start changing your code just yet.

According to status_page/ros_noetic_default.html?q=nodelet, the fixed version (ie: 1.10.1-1) is waiting for a sync. To test it, you could configure the testing repository and install the updated version.

Oh, and:

Yes, they are nodelets, but they are all running in their own process [..], so I mentally conflate the two.

that may be, but when asking for help on a forum, it pays to be precise. You only use the word nodes. Because of your single mention of onInit(..) I wondered whether it were actually nodelets, and whether the ...(more)

gvdhoorn ( 2021-02-02 09:57:56 -0500 )edit

Yep -- point taken on precision. I have tested with the 1.10.1-1 version of the nodelet package, and it's behaving much better. I think we'll just have to do prelim work now and wait until this gets into mainline before making the official switch. Thanks -- you've been tons of help!

jblee123 ( 2021-02-02 15:36:52 -0500 )edit

Thanks for trying out the new version of nodelet in the testing repo. Sounds like a new Noetic sync is a good idea.

@sloretz: yes, please do a sync. This is a critical regression which has come up in multiple cases of reduced performance / non-functioning systems.

gvdhoorn ( 2021-02-03 02:52:33 -0500 )edit

ROS Noetic sync with nodelet 1.10.1 is out

sloretz ( 2021-02-09 11:42:46 -0500 )edit

add a comment

subscribers in noetic seem to fail at a faster rate than in previous versions when there are a large number of subscribers

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

subscribers in noetic seem to fail at a faster rate than in previous versions when there are a large number of subscribers edit

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

subscribers in noetic seem to fail at a faster rate than in previous versions when there are a large number of subscribers