ROS2 Nodes occasionally dying using LaunchService in a subprocess [closed]

asked 2019-04-15 05:40:19 -0500

nzlz

Hello Everyone,

First of all, I am aware there is an ongoing effort to run the LaunchService in the background via run_async (see the PR). However, we need something that works now without compiling from source, so I guess we must wait until ROS2 Dashing to use that implementation. Please correct me if I'm wrong.

Right now, the only way we have of launching the nodes is the following (see GitHub):

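from billiard import Process  # used in place of multiprocessing.Process
from launch import LaunchService

def startLaunchServiceProcess(launchDesc):
    # Run the LaunchService in a daemon child process so the caller is not blocked.
    launchService = LaunchService()
    launchService.include_launch_description(launchDesc)
    process = Process(target=launchService.run, daemon=True)
    process.start()
    return process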

The launch description is this one. I'm not posting it here since we are likely to change the LaunchDescription a bit, so please refer to that file.


So far this approach has worked well enough for us, and we have only suffered from occasional startup errors (a node just dies with exit code -9, usually the cognition node). Once the initial launch completes without errors, everything is okay, so if we get an error we can just launch the script again. This is why it has not been a big deal for us, until now.
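A note on the -9: assuming it is reported in the same convention as Python's subprocess module, a negative return code -N means the child was terminated by signal N, so -9 would be SIGKILL (the node was killed from outside rather than exiting on its own). The sketch below is POSIX-only and uses only the standard library; the helper names `describe_returncode` and `kill_and_get_returncode` are made up for illustration.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Translate a subprocess-style return code into a readable message."""
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode" on POSIX.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


def kill_and_get_returncode():
    """Start a sleeping child, send it SIGKILL and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    return child.wait()


if __name__ == "__main__":
    print(describe_returncode(kill_and_get_returncode()))
```

If the convention does apply here, it would be worth checking what sends the SIGKILL during startup.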

We are setting up CI, and we have created a test to ensure our code is working fine. Obviously the occasional errors are a real problem now, since we cannot pass the CI test in a reliable way.

Knowing that the occasional errors only appear when running the LaunchService in the background, the question is: how could we capture the error and respawn failing nodes? Or even shut down the whole LaunchService and start it again automatically? I am able to capture the error of any specific failing node using the OnExit event handler, but how could I relaunch it? I also feel it would be convenient to have some simple code that takes care of this respawn process. (Lifecycle is not an option in our case, since we also launch one process via cmd.)
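For the "shut down everything and start again" option, a minimal sketch could wrap the launch script in a retry loop. It assumes the script exits with a non-zero code whenever startup fails, which is not confirmed here; `run_until_success` and `max_attempts` are hypothetical names, and only the standard library is used (no launch API).

```python
import subprocess
import sys


def run_until_success(command, max_attempts=3):
    """Run command, rerunning it while it exits non-zero.

    Returns the number of the attempt that succeeded. Assumes the command
    signals a failed startup through a non-zero exit code.
    """
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            return attempt
        print("attempt %d failed with return code %d" % (attempt, returncode))
    raise RuntimeError("command failed %d times" % max_attempts)


if __name__ == "__main__":
    # Stand-in for the real launch script: a command that simply succeeds.
    print(run_until_success([sys.executable, "-c", "print('launched')"]))
```

This restarts the whole launch rather than a single node, so it only makes sense in a CI test if a full restart is cheap enough.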

Suggestions are welcomed, thanks in advance.

Nestor


Closed for the following reason: question is not relevant or outdated (closed by tfoote)
close date 2019-09-25 20:27:33.891949

Comments

Why are you using billiard.Process? Can you provide the complete console output of a failure? Am I correct in my understanding that it is the executable that contains the node that is crashing and not launch itself?

William (2019-04-17 15:34:14 -0500)

For further reading on something like respawn, see: https://github.com/ros2/launch/pull/179 and https://github.com/ros2/launch/issues...

I think having a way to easily specify respawn makes sense, but as I expressed in those issues, I think it needs to be considered carefully before being implemented. That being said, a pull request would help it move along. I don't think we'll get to it very soon given the other tasks we have to do right now.

William (2019-04-17 15:38:18 -0500)

Yes, the process contains the node that is crashing, which is usually hros_congnition_mara_components. The node is launched, but dies straight away. The output is in a gist file.

We use billiard.Process, which is an improved version of multiprocessing.Process. We need it because in another scenario (not this one), for Reinforcement Learning, we launch multiple training scripts in threads. Billiard allows us to have a threading structure like:

  • main script
    • SubThread1, train1
      • SubThread1-1, LaunchService.run
        • ROS2 Nodes
    • SubThread2, train2
      • SubThread2-1, LaunchService.run
        • ROS2 Nodes

The default multiprocessing.Process does not allow creating child subthreads from subthreads, or something like that, if I remember correctly.
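The limitation being recalled may be the standard-library rule that a daemonic process is not allowed to start children of its own. A small sketch of that behaviour, assuming a POSIX system where the "fork" start method is available; `try_nested`, `child` and `grandchild` are illustrative names only.

```python
import multiprocessing

# The "fork" start method is an assumption of this sketch (POSIX only).
FORK = multiprocessing.get_context("fork")


def grandchild():
    """Do nothing; it only needs to be startable."""


def child(queue):
    """Try to start a grandchild process and report what happened."""
    try:
        process = FORK.Process(target=grandchild)
        process.start()
        process.join()
        queue.put("ok")
    except AssertionError as error:
        # Raised by start() when the current process is daemonic.
        queue.put(str(error))


def try_nested(daemon):
    """Start child as a daemonic or regular process and return its report."""
    queue = FORK.Queue()
    process = FORK.Process(target=child, args=(queue,), daemon=daemon)
    process.start()
    result = queue.get(timeout=30)
    process.join()
    return result


if __name__ == "__main__":
    print("daemon=True: ", try_nested(daemon=True))
    print("daemon=False:", try_nested(daemon=False))
```

With plain multiprocessing, only the daemonic child should be refused, so the nested layout above would need the intermediate processes to be non-daemonic.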

nzlz (2019-04-17 22:30:52 -0500)