ROS2 Nodes occasionally dying using LaunchService in a subprocess [closed]
Hello Everyone,
First of all, I am aware there is an ongoing effort to launch the LaunchService in the background via run_async (PR). However, we need something that works now without compiling from source, so I guess we must wait until ROS2 Dashing to use that implementation. Please correct me if I'm wrong.
Right now the only way we have of launching the nodes is the following (GitHub):
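    from billiard import Process  # multiprocessing-compatible Process
    from launch import LaunchService

    def startLaunchServiceProcess(launchDesc):
        # Run the LaunchService event loop in a separate daemon process.
        launchService = LaunchService()
        launchService.include_launch_description(launchDesc)
        process = Process(target=launchService.run, daemon=True)
        process.start()
        return process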
The launch description is this one. I'm not posting it here since we are likely to change the LaunchDescription a bit, so please refer to that file.
So far this approach has worked well enough for us, and we have only suffered from occasional startup errors (a node, usually the cognition node, just dies with exit code -9). Once the initial launch completes without errors, everything is okay, so if we hit an error we can just launch the script again. This is why it has not been a big deal for us until now.
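As a side note on the -9 value: if it follows Python's subprocess convention (an assumption here, not something confirmed in this thread), a negative return code -N means the child was terminated by signal N, so -9 would correspond to SIGKILL on Linux. A minimal sketch of decoding such codes; `describe_exit` is a hypothetical helper, not part of launch:

```python
import signal


def describe_exit(returncode):
    """Return a human-readable description of a subprocess-style return code."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # POSIX convention used by Python's subprocess module:
        # a negative value -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal {}".format(-returncode)
        return "killed by {}".format(name)
    if returncode == 0:
        return "exited cleanly"
    return "exited with status {}".format(returncode)


if __name__ == "__main__":
    print(describe_exit(-9))
```

A SIGKILL is sent from outside the process, so the node's own logs may show nothing; checking what else on the machine could be killing it (for example the kernel under memory pressure) may be worthwhile.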
We are setting up CI, and we have created a test to ensure our code works. The occasional errors are now a real problem, since we cannot pass the CI test reliably.
Knowing that the occasional errors only appear when using the background process, the question is: how could we capture the error and respawn failing nodes? Or even shut down the whole LaunchService and start it again automatically? I am able to capture the error of any specific failing node using the OnExit event handler, but how could I relaunch it? I also feel it would be convenient to have some simple code that takes care of this respawn process. (Lifecycle is not an option in our case, since we also launch one process via cmd.)
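One way to automate the "just launch the script again" workaround for CI is a small supervisor around the whole launch script. The sketch below uses only the standard library and rests on an assumption: the launched command itself exits when a node dies during startup (for example because an exit handler shuts the launch down). `start_with_retries`, `startup_window` and `max_attempts` are hypothetical names, not part of launch.

```python
import subprocess
import sys


def start_with_retries(cmd, startup_window=5.0, max_attempts=3):
    """Start cmd and restart it if it exits during the startup window.

    Returns (process, attempt) once the process has survived
    startup_window seconds. Raises RuntimeError if every attempt exits early.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(cmd)
        try:
            # wait() returns only if the process exits within the window.
            returncode = proc.wait(timeout=startup_window)
        except subprocess.TimeoutExpired:
            # Still alive after the window: treat startup as successful.
            return proc, attempt
        print("attempt {}: exited with {} during startup".format(attempt, returncode))
    raise RuntimeError("{!r} failed to start after {} attempts".format(cmd, max_attempts))


if __name__ == "__main__":
    # Stand-in for the real launch script: a child that simply stays alive.
    demo_cmd = [sys.executable, "-c", "import time; time.sleep(2)"]
    proc, attempt = start_with_retries(demo_cmd, startup_window=0.5)
    print("running after attempt", attempt)
    proc.wait()
```

This restarts everything rather than a single node, which is coarse but matches the observation that a launch which survives startup stays healthy.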
Suggestions are welcome. Thanks in advance.
Nestor
Why are you using billiard.Process? Can you provide the complete console output of a failure? Am I correct in my understanding that it is the executable that contains the node that is crashing and not launch itself?
For further reading on something like respawn, see: https://github.com/ros2/launch/pull/179 and https://github.com/ros2/launch/issues...
I think having a way to easily specify respawn makes sense, but as I expressed in those issues, I think it needs to be considered carefully before being implemented. That being said, a pull request would help it move along. I don't think we'll get to it very soon given the other tasks we have to do right now.
Yes, the process contains the node that is crashing, which usually is hros_congnition_mara_components. The node is launched, but dies straight away. Output in a gist file.
We use billiard.Process, which is an improved version of multiprocessing.Process. We needed to do this because in another scenario (not this one), for Reinforcement Learning, we need to launch multiple training scripts in threads. Billiard allows us to have a threading structure like:
main script
-SubThread1, train1
    SubThread1-1, LaunchService.run
-SubThread2, train2
    SubThread2-1, LaunchService.run
Default multiprocessing.Process does not allow creating child subthreads from subthreads, or something like that, if I remember correctly.
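The restriction recalled above can be reproduced with the standard library alone: a multiprocessing.Process started with daemon=True (as in the snippet in the question) is not allowed to start children of its own. A minimal sketch, assuming a POSIX system where the "fork" start method is available; the function names are hypothetical:

```python
import multiprocessing as mp


def _grandchild():
    pass


def _daemon_worker(results):
    # Runs inside a daemon process and tries to start a child of its own.
    try:
        child = mp.Process(target=_grandchild)
        child.start()
        child.join()
        results.put("child started")
    except AssertionError as exc:
        # multiprocessing refuses to start children from a daemon process.
        results.put(str(exc))


def try_nested_process():
    """Start a daemon process that attempts to spawn a child; return its report."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork start method available
    results = ctx.Queue()
    worker = ctx.Process(target=_daemon_worker, args=(results,), daemon=True)
    worker.start()
    message = results.get(timeout=30)
    worker.join()
    return message


if __name__ == "__main__":
    print(try_nested_process())
```

With the standard library the message reports that daemonic processes are not allowed to have children, which is consistent with needing billiard for the nested structure described above. A non-daemon outer process would be another way around the limit, at the cost of having to join or terminate it explicitly.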