Subscribers in Noetic seem to fail at a higher rate than in previous versions when there are a large number of subscribers
My organization has a large ROS-based system with roughly 300 topics being transmitted at various rates. I was recently tasked with updating from Kinetic on Ubuntu 16.04 to Noetic on 20.04, and after I got it up and running, there was abnormal behavior that I eventually tracked down to some topics simply not being handled in some nodes. Which topics went unhandled was probabilistic: sometimes the system would work well, and sometimes one or more topics were unhandled in certain nodes.
After much investigation, I was able to determine several data points.
- I was not simply getting fewer messages of a certain topic handled, but no messages handled.
- The outages were on a per-connection basis as opposed to a per-topic basis. I.e. if Node1 and Node2 both subscribed to TopicA, there were times Node1 ignored the topic while Node2 was handling it.
- `rosnode info` and `rostopic info` both showed that all the connections that were supposed to be there were. `roswtf` just crashes with an `OsNotDetected` exception.
- The `ss` networking tool showed that data was going across the sockets and being successfully received/ACKed, which would seem to imply that data was getting there, just not being handled.
- Which handlers failed to be called was somewhat probabilistic, and possibly dependent on position in the code. For example, a ROS timer that was initialized at the end of the `onInit()` of a node with many publishers and subscribers would be handled, while if it was moved to the top of the function, it would not be handled.
- Using single- vs. multi-threaded `NodeHandle`s did not solve the problem.
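For reference, the socket-level check with `ss` mentioned above can be done along these lines (the filtering shown is illustrative, not the exact commands I ran):

```shell
# Established TCP connections (the default for -t), numeric ports.
# Each TCPROS publisher/subscriber pair gets its own socket.
ss -t -n

# Keep the header plus any socket with unread bytes pending in Recv-Q
# (column 2). A Recv-Q that stays non-zero suggests data is arriving
# and being ACKed but never read by the subscriber.
ss -t -n | awk 'NR == 1 || $2 > 0'
```

Adding `-p` shows the owning process for each socket, which helps attribute a stuck connection to a specific node (this may require root).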
So I decided to make a small `roscpp`-based app for stress testing. I created two nodes -- one publisher and one subscriber. The publisher advertised N topics and published an `Int32` message over each of them once a second. The subscriber node simply received them and kept track of which topics had been heard. The nodes were run in separate processes.
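A minimal sketch of what the two test nodes look like (node names, topic names, and the `num_topics` parameter are illustrative, not the actual test code):

```cpp
// stress_pub.cpp -- advertises N topics, publishes an Int32 on each at 1 Hz.
#include <ros/ros.h>
#include <std_msgs/Int32.h>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "stress_pub");
  ros::NodeHandle nh("~");
  int n;
  nh.param("num_topics", n, 500);  // N, tunable per run

  std::vector<ros::Publisher> pubs;
  pubs.reserve(n);
  for (int i = 0; i < n; ++i)
    pubs.push_back(nh.advertise<std_msgs::Int32>("topic_" + std::to_string(i), 1));

  ros::Rate rate(1.0);
  std_msgs::Int32 msg;
  while (ros::ok())
  {
    for (int i = 0; i < n; ++i)
    {
      msg.data = i;
      pubs[i].publish(msg);
    }
    rate.sleep();
  }
  return 0;
}
```

```cpp
// stress_sub.cpp -- subscribes to the same N topics, tracks which were heard.
#include <ros/ros.h>
#include <std_msgs/Int32.h>
#include <set>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "stress_sub");
  ros::NodeHandle nh("~");
  int n;
  nh.param("num_topics", n, 500);

  std::set<int> heard;
  std::vector<ros::Subscriber> subs;
  subs.reserve(n);
  for (int i = 0; i < n; ++i)
  {
    subs.push_back(nh.subscribe<std_msgs::Int32>(
        "topic_" + std::to_string(i), 1,
        [&heard, i](const std_msgs::Int32::ConstPtr&) { heard.insert(i); }));
  }

  // Periodically report how many distinct topics have been heard;
  // topics never appearing here are the "ignored" ones.
  ros::Timer report = nh.createTimer(
      ros::Duration(5.0),
      [&heard, n](const ros::TimerEvent&)
      { ROS_INFO("heard %zu of %d topics", heard.size(), n); });

  ros::spin();
  return 0;
}
```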
When N was in the 400-500 range, I began seeing the same issue manifest -- some topics were being ignored for the duration of the run. Which topics they were and how many there were changed from run to run, and at the lower end of the range (N=400), sometimes all topics were handled.
Ignored topics came in contiguous blocks (going by the order in which they were subscribed to):
- On 20.04, the ignored topics are nearly always among those first subscribed to (e.g. topics 0-120).
- On 18.04, the ignored topics are nearly always among the last of those subscribed to (e.g. topics 950-990).
I understand that our system is rather large, and that any system will eventually degrade if loaded enough, but we have been running it on Ubuntu 16.04 and 18.04 systems, so I wasn't sure why this issue was suddenly becoming visible. I ran the stress test on 16 ...
And a suggestion: please update the title of your question with the actual observation.
Your current title does not really mean anything and seems to already suggest a possible conclusion.