ROS Resources: Documentation | Support | Discussion Forum | Index | Service Status | ros @ Robotics Stack Exchange
Ask Your Question
2

Odd Docker roscore/roslaunch delays with Linux kernel 6.1

asked 2023-03-17 21:52:25 -0500

danambrosio gravatar image

updated 2023-03-28 14:44:40 -0500

Hello, I have an odd one here and am trying to figure out a possible root cause!

The Situation

I am running a custom Linux version made using Buildroot, with application ROS code containerized. I have noticed some odd behavior with both roscore and roslaunch delaying between log messages and ultimately taking minutes to execute. This behavior occurs only with ROS kinetic, lunar, and melodic versions of roscore and roslaunch on Linux kernel 6.1. ROS noetic behaves as expected.

Note: the same configuration on Linux kernel 5.15 does not exhibit this behavior.


Test Summary

On Linux kernel 6.1, using the ros:kinetic/ros:lunar/ros:melodic Docker images from Docker Hub.

Run docker run -it --network host --entrypoint bash ros:kinetic, then from within the container run source /opt/ros/${ROS_DISTRO}/setup.bash && roscore -v

The terminal displays:

... logging to /root/.ros/log/0dfbd9a8-c533-11ed-998a-78d0042ad370/roslaunch-oc-general-68.log
Checking log directory for disk usage. This may take awhile.
 Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB. 

... loading XML file [/opt/ros/kinetic/etc/ros/roscore.xml]
... executing command param [rosversion roslaunch]
Added parameter [/rosversion]
... executing command param [rosversion -d]
Added parameter [/rosdistro]
Added core node of type [rosout/rosout] in namespace [/]
started roslaunch server http://localhost:40105/
ros_comm version 1.12.17


SUMMARY
========

PARAMETERS
* /rosdistro: kinetic
* /rosversion: 1.12.17

NODES

auto-starting new master

And it hangs on auto-starting new master for ~100 seconds. The CPU% from htop is 100%.

After the initial ~100 seconds, roscore printed:

process[master]: started with pid [78]
ROS_MASTER_URI=http://localhost:11311/

setting /run_id to 0dfbd9a8-c533-11ed-998a-78d0042ad370

And hangs for another ~100 seconds on setting /run_id .... The CPU% from htop is also 100%.

Finally roscore finished with:

process[rosout-1]: started with pid [91]
started core service [/rosout]

Similar behavior exists when running roslaunch ... with the ~100 second wait occurring on the ROS_MASTER_URI=http://localhost:11311 output.

ROS Environment Variable

ROS_ROOT=/opt/ros/kinetic/share/ros
ROS_PACKAGE_PATH=/opt/ros/kinetic/share
ROS_MASTER_URI=http://localhost:11311
ROS_PYTHON_VERSION=2
ROS_VERSION=1
ROSLISP_PACKAGE_DIRECTORIES=
ROS_DISTRO=kinetic
ROS_ETC_DIR=/opt/ros/kinetic/etc/ros

Interestingly enough, if I use rosmon to launch I do not see the ~100 second wait. Could this point to a python issue shared by roscore and roslaunch since rosmon is written in C++?

Any advice would be greatly appreciated!


Update

Using pdb running roscore in the ros:melodic container, I have narrowed it down to the following call in the roslaunch/nodeprocess.py file:

/opt/ros/melodic/lib/python2.7/dist-packages/roslaunch/nodeprocess.py(340)

Which is a call to subprocess.Popen (see https://github.com/ros/ros_comm/blob/...)

Stepping into the subprocess.Popen call it seems to be here:

> /usr/lib/python2.7/subprocess.py(394)__init__()
-> errread, errwrite)

After executing that subprocess function the output from ps -aux is:

root         160  0.1  0.0 377148 63644 pts/0    Sl+  01:35   0:00 python -m pdb roscore
root         174 99.2  0.0 377148 54332 ?        Rs ...
(more)
edit retag flag offensive close merge delete

Comments

If it hangs for multiple minutes, seeing what it's hanging on should not be too difficult with something like ptrace, or perhaps even pdb in case of Python scripts/packages.

That might lead to results faster than speculation. Might be worth a try.

gvdhoorn gravatar image gvdhoorn  ( 2023-03-18 10:30:06 -0500 )edit

@gvdhoorn using pdb running roscore in the ros:melodic container, I have narrowed it down to the following call in the roslaunch/nodeprocess.py file:

/opt/ros/melodic/lib/python2.7/dist-packages/roslaunch/nodeprocess.py(340)

Which is a call to subprocess.Popen (see here)

Stepping into the subprocess.Popen call it seems to be here:

> /usr/lib/python2.7/subprocess.py(394)__init__()
-> errread, errwrite)

After executing that subprocess function the output from ps -aux is:

root         160  0.1  0.0 377148 63644 pts/0    Sl+  01:35   0:00 python -m pdb roscore
root         174 99.2  0.0 377148 54332 ?        Rs   01:38   0:53 python -m pdb roscore

Which eventually turns into:

root         174 96.3  0.0 435416 60136 ?        Ssl  01:38   1:40 /usr/bin/python /opt/ros/melodic/bin/rosmaster --core -p 11311 -w 3 __log:=/root/.ros
danambrosio gravatar image danambrosio  ( 2023-03-18 20:44:32 -0500 )edit

Maybe unrelated, I did notice that after the rosmaster subprocess is spawn, it gradually decreases it's CPU% over time.

As regular posts don't have max lengths, please also include context around the subprocess.Popen(..) call. Especially the call chain leading to it. roscore starts multiple processes, and it'd be interesting to know which is causing the issue (could be all of them, or a specific one).

danambrosio gravatar image danambrosio  ( 2023-03-18 20:46:41 -0500 )edit

Please add this to your initial post. Please also update all links to code to permalinks.

gvdhoorn gravatar image gvdhoorn  ( 2023-03-19 02:15:15 -0500 )edit

Seeing as there don't appear to be any really interesting changes to the file you mention (git diff melodic-devel...noetic-devel -- tools/roslaunch/src/roslaunch/nodeprocess.py), I'm wondering whether this could be due to some incompatibility between Python 2.7 and kernel 6.1. You mention Noetic works, so that would be Python 3. Melodic is Python 2.

You could consider building a minimal Melodic with Python 3 support and starting roscore. There are a couple of Q&As here on ROS Answers which document how to do that. Seeing as you only need ros_comm that should not be too much work (ie: you won't be building anything which actually needs Python-bindings, which is where it gets more complex).

ptrace might show you exactly what is happening at the point of the hang though -- on the OS level, below what pdb can show you. subprocess can hang for ...(more)

gvdhoorn gravatar image gvdhoorn  ( 2023-03-19 02:47:31 -0500 )edit

Is it possible that another roscore process is already running on this machine? Only one port 11311 is available per machine.

Mike Scheutzow gravatar image Mike Scheutzow  ( 2023-03-19 08:21:48 -0500 )edit

@gvdhoorn, updated the original post with the above comments, sorry about that. Feel free to remove the comments.

danambrosio gravatar image danambrosio  ( 2023-03-19 09:04:32 -0500 )edit

@Mike Scheutzow there is only a single roscore being run. If there was multiple I would expect to see:

RLException: roscore cannot run as another roscore/master is already running. 
Please kill other roscore/master processes before relaunching.
The ROS_MASTER_URI is http://localhost:11311/
The traceback for the exception was written to the log file
danambrosio gravatar image danambrosio  ( 2023-03-19 09:11:04 -0500 )edit

1 Answer

Sort by ยป oldest newest most voted
2

answered 2023-03-25 14:33:00 -0500

danambrosio gravatar image

updated 2023-03-28 14:47:56 -0500

Update #3 in the original post is a bandaid solution for the delays in roscore and roslaunch.

Update #4 highlights the underlying issue with nofile ulimits within a docker container. To override the values docker passes to containers for nofile soft and hard limits perform one of the following:

docker run ... --ulimit nofile=1024:524288

If you want this for all containers consider adding:

"default-ulimits": {
    "nofile": {
        "Name": "nofile",
        "Hard": 524288,
        "Soft": 1024
    }
}
edit flag offensive delete link more

Comments

Wow, what a deep rabbit hole! Congrats on finding the root cause. That containerd ticket you linked seem to have impacted a lot of projects, and was an interesting read. Thanks!

If I understood correctly, your issue was an unfortunate syzygy of using an older Python2 Popen submodule, with a modern linux kernel, and the latest containerd runtime? Should I expect to see such symptoms with modern or future distros (Ubuntu >= 22.04 LTS) and ROS releases (ROS 2 >= Humble) ?

ruffsl gravatar image ruffsl  ( 2023-04-07 11:00:03 -0500 )edit
1

@ruffsl the issue was really that the newer buildroot LTS (2023.02) brought in a docker/containerd change that increased the containernofile soft and hard limits to infinity (over 1 billion). The python2 popen call with close_fd=True seemed to loop through all of these file descriptors causing it to hang as well as a XmlRpc issue that caused large memory allocation in ROS1 post kinetic. I tested this with Humble and buildroot (2023.02) and did not run into any issue. Also seems as if the fix is going into the future releases of docker, containderd and go.

danambrosio gravatar image danambrosio  ( 2023-04-10 17:44:20 -0500 )edit

Question Tools

3 followers

Stats

Asked: 2023-03-17 21:52:25 -0500

Seen: 452 times

Last updated: Mar 28 '23