I looked into the problem together with scgroot. I've added a Dockerfile and a docker-compose file to the GitHub page, so people familiar with Docker can quickly reproduce the problem described above. I've also added one extra piece of example code which uses 1 node instead of 10.

A description of what each binary contains is given here:


Binary     | Publishers | Subscribers | ROS | ROS nodes | ROS timers | DDS participants
ros        |         20 |         200 | yes |        10 |         10 |               10
rosonenode |         20 |         200 | yes |         1 |          1 |                1
nopub      |          0 |           0 | yes |        10 |         10 |               10
rtps       |         20 |         200 |  no |          0 |          0 |                1
noros      |        20* |        200* |  no |          0 |          0 |                0

* Plain C++ implementation with no network publishing/subscribing.

The resulting CPU usage when the binaries each get their own isolated docker container is given here:

Name         CPU%   MEM USAGE   PIDS
ros          63.36   103.70MiB   43
rosonenode   21.13    17.35MiB    7
nopub         6.55    67.77MiB   43
rtps          6.48    15.20MiB    6
noros         0.32     3.98MiB    1

Inspection with valgrind (callgrind) and perf gives:


Binary        Total     DDS   Executor   Other
rtps          3.6e9   3.0e9        NA    6.0e8
rosonenode    1.3e10  2.5e9     9.1e9    1.4e9
ros           3.8e10  1.2e10    1.6e10   1.0e10
Values given in average number of CPU cycles

Binary        Total     DDS   Executor   Other
rtps            100    83.7        NA    16.3
rosonenode      100    19.6      70.0    10.4
ros             100    32.0      43.9    24.1
Values given as % of each binary's total CPU cycles
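As a sanity check, the percentage table can be re-derived from the cycle-count table with a few lines of Python. Because the cycle counts above are rounded, the computed shares agree with the table only to within a couple of percentage points:

```python
# Rounded CPU-cycle counts taken from the callgrind table above.
cycles = {
    "rtps":       {"DDS": 3.0e9,  "Executor": 0.0,    "Other": 6.0e8},
    "rosonenode": {"DDS": 2.5e9,  "Executor": 9.1e9,  "Other": 1.4e9},
    "ros":        {"DDS": 1.2e10, "Executor": 1.6e10, "Other": 1.0e10},
}

for binary, parts in cycles.items():
    total = sum(parts.values())
    shares = {name: round(100 * count / total, 1) for name, count in parts.items()}
    print(binary, shares)
```

(The rtps binary has no executor, so its total is split between the DDS and Other columns.)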

Conclusions:

  1. Both the rtps implementation and the rosonenode implementation have 1 DDS participant, leading to very similar amounts of work in the eProsima part of the application (3.0e9 vs 2.5e9 cycles). The SingleThreadedExecutor adds a lot of overhead (70% of the work for rosonenode happens there), and this overhead grows with the number of nodes added to the executor (9.1e9 cycles for 1 node vs 1.6e10 for 10 nodes over the same runtime).
  2. The 1-to-1 mapping of ROS2 nodes to DDS participants increases CPU usage in the eProsima part of the application (2.5e9 vs 1.2e10 cycles).
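To make conclusion 1 concrete, here is a deliberately simplified, pure-Python toy model (NOT rclcpp's actual implementation) of why a single-threaded executor's bookkeeping grows with the number of nodes: on every wakeup it rebuilds its wait set over every entity of every node before dispatching the single ready callback, so adding nodes (and their timers) multiplies both the number of wakeups and the entities scanned per wakeup, even though the useful work per callback stays constant.

```python
# Toy model of a single-threaded executor. This is an illustrative sketch,
# not the real rclcpp SingleThreadedExecutor.

class ToyNode:
    def __init__(self, name, n_entities):
        self.name = name
        # timers/publishers/subscriptions, modeled as opaque entities
        self.entities = [f"{name}/e{i}" for i in range(n_entities)]

def spin_some(nodes, wakeups):
    """Count entity checks performed over `wakeups` timer firings."""
    checks = 0
    for _ in range(wakeups):
        # wait-set rebuild: touch every entity of every node
        for node in nodes:
            checks += len(node.entities)
        # ...then dispatch the one ready callback (constant work, ignored)
    return checks

# Same total entity count (30), packaged as 1 node vs 10 nodes.
# With 10 timers firing at the same rate as the single timer,
# the 10-node case wakes up 10x as often over the same runtime.
one_node  = [ToyNode("all", 30)]
ten_nodes = [ToyNode(f"n{i}", 3) for i in range(10)]

print(spin_some(one_node, wakeups=100))    # 1 timer  -> 3000 checks
print(spin_some(ten_nodes, wakeups=1000))  # 10 timers -> 30000 checks
```

The toy model exaggerates the effect (the measured overhead grows sublinearly, 9.1e9 to 1.6e10), but it shows the mechanism: wait-set maintenance is paid on every wakeup and scales with the total entity count.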

Solution / Answer:

  1. The SingleThreadedExecutor needs to be optimized or implemented differently.
  2. The ROS2 middleware needs to be changed to allow for an option where nodes do not have a 1-to-1 mapping to DDS participants.

A more in-depth explanation of our research and measurements can be found in the README.md of the GitHub page mentioned in the question. Since there does not appear to be a quick and easy solution to these problems (besides using 1 node for your entire application to reduce CPU usage), we opened two discussions on ROS Discourse. People who are interested in discussing either problem can follow the links that will be posted on the GitHub page.