I looked into the problem together with scgroot. I've added a dockerfile and a docker-compose to the github page, this way people familiar with docker can quickly recreate the problem stated above. I've also added one extra piece of example code which uses 1 node instead of 10.
A description of what each binary contains is given here:
Binary | Publishers| Subscribers| ROS| ROS nodes| ROS timers| DDS participants|
ros | 20| 200| yes| 10| 10| 10|
rosonenode| 20| 200| yes| 1| 1| 1|
nopub | 0| 0| yes| 10| 10| 10|
rtps | 20| 200| no| 0| 0| 1|
noros | 20*| 200*| no| 0| 0| 0|
*C++ implementation no network publishing/subscribing
The resulting CPU usage when the binaries each get their own isolated docker container is given here:
Name CPU% MEM USAGE PIDS
ros 63.36 103.70MiB 43
rosonenode 21.13 17.35MiB 7
nopub 6.55 67.77MiB 43
rtps 6.48 15.20MiB 6
noros 0.32 3.98MiB 1
Inspection with valgrind (callgrind) and perf gives:
Binary Total DDS Executor Other
rtps 3.6e9 3.0e9 NA 6.0e8
rosonenode 1.3e10 2.5e9 9.1e9 1.4e9
ros 3.8e10 1.2e10 1.6e10 1.0e10
Values given in average #CPU cycles
Binary Total DDS Executor Other
rtps 100 83.7 NA 16.3
rosonenode 100 19.6 70.0 10.4
ros 100 32.0 43.9 24.1
Values given in % of total CPU usage contribution
Conclusions:
- Both the rtps implementation and the rosonenode implementation have 1 DDS participant leading to very similar work done in the eProsima part of the appliation (3.0e9 vs 2.5e9). The SingleThreadedExecutor adds a lot of overhead (70% of the work for rosonenode is done here), this overhead increases with the number of nodes added to the executor ( 9.1e9 increases to 1.6e10 for the same runtime).
- The 1 to 1 mapping of ROS2 nodes to DDS participants increases CPU usage in the eProsima part of the application (2.5e9 vs 1.2e10).
Solution / Answer:
- The SingleThreadedExecutor needs to be optimized or implemented differently.
- The ROS2 middleware needs to be changed to allow for an option where nodes do not have a 1-to-1 mapping to DDS participants.
A more in-depth explanation of our research and measurements can be found in the README.md of the github page mentioned in the question. Since there does not appear to be a quick and easy solution to these problems (besides using 1 node for your entire application to reduce CPU usage) we opened two discussions on ROS discourse. People who are interested in discussing either problem can follow the links that will be posted on the github page.
What tool did you use to get those results?
top
I also used perf to investigate a bit. I noticed it spends a lot of time in the ros executor... (using ros dashing)
Yeah I'm looking into the executor as well. Another interesting thing is, if I modify your nopub testcase and only create 1 node with 10 timers (instead of 10 nodes with 1 timer each) I get around 2% CPU usage (vs. 14-15% with the original nopub).
The discussion on Discourse about the SingleThreadedExecutor is now live at https://discourse.ros.org/t/singlethr... : . Please post all your interesting findings there. The link to the Discourse discussion will also be added at the bottom of the github README. Hopefully we can come in contact with the right people to optimize the executor. The more data we collect and share there, the more response we can expect. More detailed results will also speed up the process of fixing/optimizing the executor.