Random Node Crashes Over The Network [closed]

asked 2013-10-04 06:31:35 -0600

robotzak

updated 2013-10-10 08:42:30 -0600

I am currently running into a problem that only occurs when trying to run nodes over the network.

Let me explain the use case I have. I am currently using Groovy.

I start a roscore on my local machine (a Mac, for what it's worth). I have generated a set of launch files, each of which loads a significant number of parameters onto the Parameter Server (this input defines how the node constructs its objects and their initialization values). There are about 100 of these files. I then start the nodes by roslaunching the launch files on 4 other machines (Ubuntu 12.04.3 LTS).

The problem is that if I start all these nodes at once, a small percentage of them (2-5%) will not execute: the process is terminated before the node even starts. I suspect this has to do with the massive amount of data being processed and served by the parameter server, but I am not certain.
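One way to check whether simultaneous startup is the trigger would be to stagger the launches and see if the failure rate changes. The following is only a sketch under assumptions: `make_batches` and `launch_in_batches` are hypothetical helper names, the batch size and delay are arbitrary, and it assumes `roslaunch` is on the PATH of the machine running the script.

```python
import subprocess
import time


def make_batches(launch_files, batch_size):
    """Split the list of launch files into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        launch_files[i:i + batch_size]
        for i in range(0, len(launch_files), batch_size)
    ]


def launch_in_batches(launch_files, batch_size=10, delay_s=2.0):
    """Start one roslaunch process per file, pausing after each batch.

    batch_size and delay_s are illustrative values, not recommendations.
    Returns the list of started subprocess.Popen objects.
    """
    processes = []
    for batch in make_batches(launch_files, batch_size):
        for path in batch:
            # Assumes `roslaunch <path>` is how each file is normally started.
            processes.append(subprocess.Popen(["roslaunch", path]))
        time.sleep(delay_s)
    return processes
```

If the crash rate drops when the launches are spread out, that would point toward load on the master during startup rather than a hard limit on nodes or parameters.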

My question is: is there a maximum number of nodes that can be run under the same roscore? Furthermore, are there restrictions on the number of parameters that can be stored on the parameter server at any given time?

Thanks

EDIT: Here is the master's log file: http://www.filedropper.com/master_5

EDIT 2: It may be relevant that each node needs to load on the order of 400 parameters at launch. So for 100 nodes, this results in roughly 40,000 parameters.
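To confirm how many parameters actually end up on the server, the master can be queried over its XML-RPC interface. This is a minimal sketch: it relies on the Parameter Server API call `getParamNames`, which returns a `(code, statusMessage, names)` triple, while the helper names, the caller ID `/param_counter`, and the grouping by top-level namespace are assumptions made for illustration.

```python
import os
import xmlrpc.client
from collections import Counter


def fetch_param_names(master, caller_id="/param_counter"):
    """Ask the master for every parameter name it currently stores."""
    code, status, names = master.getParamNames(caller_id)
    if code != 1:  # 1 is the success code in the ROS master API
        raise RuntimeError("getParamNames failed: %s" % status)
    return names


def count_params_by_namespace(param_names):
    """Count parameters per top-level namespace, e.g. '/node_a'."""
    counts = Counter()
    for name in param_names:
        top_level = name.strip("/").split("/")[0]
        counts["/" + top_level] += 1
    return counts


def main():
    # Falls back to the default master URI if ROS_MASTER_URI is unset.
    uri = os.environ.get("ROS_MASTER_URI", "http://localhost:11311")
    master = xmlrpc.client.ServerProxy(uri)
    names = fetch_param_names(master)
    print("total parameters:", len(names))
    for namespace, count in sorted(count_params_by_namespace(names).items()):
        print(namespace, count)
```

Calling `main()` while the roscore is running prints the total and a per-namespace breakdown, which can be compared against the expected 400 parameters per node.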

UPDATE: I ran the same scenario as above, except this time with the roscore on an Ubuntu 12.04 machine. The number of crashed nodes was almost zero in my original test case. When I ran a larger experiment (many more parameters), 3% of my nodes crashed (15 out of 500). I also noticed that a mutex is used in the master.cpp execute function on OSX but not on Linux machines (there are include guards around it). Is this behavior expected when communicating between an OSX machine and a Linux machine?

Closed for the following reason: question is not relevant or outdated (closed by tfoote)
close date 2018-01-30 22:43:50.236845

Comments

I think it's unlikely that this is related to the number of nodes or parameters. For comparison, a clean PR2 startup launches 54 nodes and sets 1266 parameters. Can you post logs from the nodes that crash?

Dan Lazewatsky (2013-10-04 06:56:34 -0600)