Odd Docker roscore/roslaunch delays with Linux kernel 6.1
Hello, I have an odd one here and am trying to figure out a possible root cause!
The Situation
I am running a custom Linux image built with Buildroot, with the application ROS code containerized. I have noticed some odd behavior with both `roscore` and `roslaunch` delaying between log messages and ultimately taking minutes to execute. This behavior occurs only with the ROS Kinetic, Lunar, and Melodic versions of `roscore` and `roslaunch` on Linux kernel 6.1. ROS Noetic behaves as expected.
Note: the same configuration on Linux kernel 5.15 does not exhibit this behavior.
Test Summary
On Linux kernel 6.1, using the `ros:kinetic`/`ros:lunar`/`ros:melodic` Docker images from Docker Hub:
Run `docker run -it --network host --entrypoint bash ros:kinetic`, then from within the container run `source /opt/ros/${ROS_DISTRO}/setup.bash && roscore -v`.
The terminal displays:
... logging to /root/.ros/log/0dfbd9a8-c533-11ed-998a-78d0042ad370/roslaunch-oc-general-68.log
Checking log directory for disk usage. This may take awhile.
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.
... loading XML file [/opt/ros/kinetic/etc/ros/roscore.xml]
... executing command param [rosversion roslaunch]
Added parameter [/rosversion]
... executing command param [rosversion -d]
Added parameter [/rosdistro]
Added core node of type [rosout/rosout] in namespace [/]
started roslaunch server http://localhost:40105/
ros_comm version 1.12.17
SUMMARY
========
PARAMETERS
* /rosdistro: kinetic
* /rosversion: 1.12.17
NODES
auto-starting new master
And it hangs on `auto-starting new master` for ~100 seconds, with `htop` showing 100% CPU.
After the initial ~100 seconds, `roscore` printed:
process[master]: started with pid [78]
ROS_MASTER_URI=http://localhost:11311/
setting /run_id to 0dfbd9a8-c533-11ed-998a-78d0042ad370
And it hangs for another ~100 seconds on `setting /run_id ...`, again with `htop` showing 100% CPU.
Finally, `roscore` finished with:
process[rosout-1]: started with pid [91]
started core service [/rosout]
Similar behavior exists when running `roslaunch ...`, with the ~100 second wait occurring on the `ROS_MASTER_URI=http://localhost:11311` output.
ROS Environment Variables
ROS_ROOT=/opt/ros/kinetic/share/ros
ROS_PACKAGE_PATH=/opt/ros/kinetic/share
ROS_MASTER_URI=http://localhost:11311
ROS_PYTHON_VERSION=2
ROS_VERSION=1
ROSLISP_PACKAGE_DIRECTORIES=
ROS_DISTRO=kinetic
ROS_ETC_DIR=/opt/ros/kinetic/etc/ros
Interestingly enough, if I use `rosmon` to launch, I do not see the ~100 second wait. Could this point to a Python issue shared by `roscore` and `roslaunch`, since `rosmon` is written in C++?
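If the problem really is on the Python side, one cheap thing to check (speculation on my part, not something the logs above confirm) is the file-descriptor limit inside the container: Python 2.7's `subprocess.Popen` with `close_fds=True` closes every descriptor up to `SC_OPEN_MAX` in a plain loop, while Python 3 walks `/proc/self/fd` instead, so an enormous `RLIMIT_NOFILE` would slow down only the Python 2 distros. A minimal check:

```python
import resource

# Print the open-file limits the container hands to new processes.
# Python 2.7's subprocess closes every fd up to SC_OPEN_MAX when
# close_fds=True, so a huge soft limit here would make each Popen
# call CPU-bound for a long time on Kinetic/Lunar/Melodic.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE: soft=%d hard=%d" % (soft, hard))
```

On Python 2 you can also print `subprocess.MAXFD` directly. If the limit comes back absurdly large, capping it (e.g. `docker run --ulimit nofile=1024:4096 ...`) would be a quick experiment.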
Any advice would be greatly appreciated!
Update
Using `pdb` while running `roscore` in the `ros:melodic` container, I have narrowed it down to the following call in the `roslaunch/nodeprocess.py` file:
/opt/ros/melodic/lib/python2.7/dist-packages/roslaunch/nodeprocess.py(340)
which is a call to `subprocess.Popen` (see https://github.com/ros/ros_comm/blob/...)
Stepping into the subprocess.Popen call it seems to be here:
> /usr/lib/python2.7/subprocess.py(394)__init__()
-> errread, errwrite)
After executing that subprocess function, the output from `ps -aux` is:
root 160 0.1 0.0 377148 63644 pts/0 Sl+ 01:35 0:00 python -m pdb roscore
root 174 99.2 0.0 377148 54332 ? Rs ...
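To separate the cost of the `Popen` call itself from the child it starts, a small timing sketch can be run with the same interpreter (the `true` binary is just a cheap child; on an affected setup, most of the wall time would be spent inside `Popen` itself before the child even runs):

```python
import subprocess
import time

# Time a trivial Popen call with close_fds=True. The child ("true")
# exits immediately, so nearly all of the elapsed time is attributable
# to Popen's own setup work in the parent process.
start = time.time()
proc = subprocess.Popen(["true"], close_fds=True)
proc.wait()
print("Popen + wait took %.3f seconds" % (time.time() - start))
```

If this alone takes on the order of the ~100 second hangs seen above, the node being launched is irrelevant and the problem is in process spawning itself.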
If it hangs for multiple minutes, seeing what it's hanging on should not be too difficult with something like `ptrace`, or perhaps even `pdb` in the case of Python scripts/packages. That might lead to results faster than speculation. Might be worth a try.
Maybe unrelated, but I did notice that after the `rosmaster` subprocess is spawned, it gradually decreases its CPU% over time.

As regular posts don't have max lengths, please also include context around the `subprocess.Popen(..)` call, especially the call chain leading to it. `roscore` starts multiple processes, and it'd be interesting to know which one is causing the issue (could be all of them, or a specific one). Please add this to your initial post. Please also update all links to code to permalinks.
Seeing as there don't appear to be any really interesting changes to the file you mention (`git diff melodic-devel...noetic-devel -- tools/roslaunch/src/roslaunch/nodeprocess.py`), I'm wondering whether this could be due to some incompatibility between Python 2.7 and kernel 6.1. You mention Noetic works, so that would be Python 3; Melodic is Python 2.

You could consider building a minimal Melodic with Python 3 support and starting `roscore`. There are a couple of Q&As here on ROS Answers which document how to do that. Seeing as you only need `ros_comm`, that should not be too much work (ie: you won't be building anything which actually needs Python bindings, which is where it gets more complex).

`ptrace` might show you exactly what is happening at the point of the hang though -- on the OS level, below what `pdb` can show you. `subprocess` can hang for ...

Is it possible that another roscore process is already running on this machine? Only one port 11311 is available per machine.
@gvdhoorn, updated the original post with the above comments, sorry about that. Feel free to remove the comments.
@Mike Scheutzow there is only a single `roscore` being run. If there were multiple, I would expect to see: