# Inconsistent error using image_transport inside a Docker container

I've been trying to run a node that uses image_transport's CameraSubscriber and Publisher inside a Docker container, and I got some odd results. The node would sometimes get stuck after printing the line `[setParam] Failed to contact master at [localhost:11311]. Retrying...`, while at other times it would run just fine, without me changing anything in the system (in both cases roscore was running in the background with all of its environment variables configured the same).

Furthermore, when I ran this node as part of a bigger launch file (without a separate roscore) I saw that most of the nodes ran smoothly, while the node I wrote that uses image_transport would sometimes fail with the same error message.

I noticed that the problem only occurs when I declare an image_transport Publisher or Subscriber.

As a test, I wrote the following node:

```cpp
#include <ros/ros.h>
#include <image_transport/image_transport.h>

int main(int argc, char **argv)
{
  ros::init(argc, argv, "image_test");
  ros::NodeHandle nh("");

  image_transport::ImageTransport it(nh);

  for (auto i = 0; i < 1000; ++i) {
    ROS_INFO("%d", i);
    auto tmp = it.advertise("image" + std::to_string(i), 1);
  }

  ros::spin();
  return EXIT_SUCCESS;
}
```


Each time I ran this code it got stuck after declaring a different number of publishers (sometimes after just a few, sometimes after 800). A typical result when running the node above was:

```
[ INFO] [1610372592.047928526]: 0
[ INFO] [1610372592.206534307]: 1
[ INFO] [1610372592.212120537]: 2
...
[ INFO] [1610372595.301301137]: 358
[ INFO] [1610372595.308509332]: 359
[ERROR] [1610372595.310783823]: [setParam] Failed to contact master at [localhost:11311].  Retrying...
```


When I replaced the image_transport publisher with a normal ros::Publisher, the node ran smoothly. It also ran smoothly when I ran it under the GDB debugger.
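For comparison, the plain-publisher variant of the test loop looked roughly like this (a sketch; it assumes sensor_msgs/Image as the message type, which image_transport uses internally):

```cpp
#include <ros/ros.h>
#include <sensor_msgs/Image.h>

int main(int argc, char **argv)
{
  ros::init(argc, argv, "image_test");
  ros::NodeHandle nh;

  // Same loop as the test node above, but with a plain ros::Publisher
  // instead of image_transport -- this version never got stuck.
  for (auto i = 0; i < 1000; ++i) {
    ROS_INFO("%d", i);
    auto tmp = nh.advertise<sensor_msgs::Image>("image" + std::to_string(i), 1);
  }

  ros::spin();
  return EXIT_SUCCESS;
}
```

The key difference is that a plain ros::Publisher only advertises one topic, while an image_transport publisher additionally loads transport plugins and touches the parameter server, which is where the failing setParam call comes from.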

I tried updating the system libraries inside the Docker container (including image_transport and its plugins), removing some of the plugins, and changing the container's network settings, but I couldn't find anything that solved the problem.

Because of the inconsistency of the problem, I thought it could be related to a race condition somewhere in the program, but I haven't used anything multi-threaded that could cause such a thing (except ROS itself).

As for my system, I am using ROS Melodic, with its official image as the base for my Docker image.

If someone could help me understand the source of this problem I would really appreciate it.


Could it be your node is running out of resources? You appear to be rapidly creating and destroying a large number of publisher objects. I can imagine TCP ports and/or file handles running out while doing that.

And out of curiosity: why do you create and immediately destroy 1000 publishers?

(2021-01-11 11:10:48 -0500)

The problem happened in my node even when I had just a few subscribers and publishers, but it was hard to debug because it happened rarely. So I made this test node to understand the problem, and had it create and destroy many publishers just so the problem would be easy to see.

As for running out of memory, when I replaced the image_transport publisher with a normal ros publisher the code was fine; could it be that image_transport requires much more resources?

(2021-01-11 11:21:05 -0500)


Apparently, I had these 2 lines in my CMakeLists.txt for profiling the code, and they caused this problem:

```cmake
add_compile_options(-pg)
set(catkin_LIBRARIES ${catkin_LIBRARIES} -pg)
```


I deleted these lines and now the node is working fine.

I originally added these lines based on this guide: http://wiki.ros.org/roslaunch/Tutoria...

I'm not sure why this profiling setup would conflict with image_transport, but deleting these lines seems to solve the problem.
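If the profiling flags are still needed occasionally, one option is to gate them behind a CMake switch so that normal builds are unaffected. A sketch (the option name `ENABLE_PROFILING` is illustrative, not from the original answer; `add_link_options` requires CMake >= 3.13):

```cmake
# Only add gprof instrumentation when explicitly requested, e.g.
#   catkin_make -DENABLE_PROFILING=ON
option(ENABLE_PROFILING "Build with gprof instrumentation (-pg)" OFF)
if(ENABLE_PROFILING)
  add_compile_options(-pg)
  add_link_options(-pg)
endif()
```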


I'm having the same issue appear, without any profiling options set. For me, it happens when I run the code in question within a nodelet, and not when running it as a standalone node. Logging shows me that the issue occurs before Nodelet::onInit is called.

Because the issue appears to be a race condition, I suspect it's a bug in how setParam is implemented in https://github.com/ros/ros_comm/blob/... I'm not quite sure why the variable c exists but isn't used within the do-while loop; I suspect the code may be getting stuck in that loop.

(2021-06-04 06:10:38 -0500)