Nodes stop subscribing over time

asked 2015-02-05 17:10:55 -0500

updated 2015-03-29 02:54:01 -0500

I'm running ROS hydro on Ubuntu 12.04.

I've written a simple python script that calculates the bandwidth though a network interface and then publishes as an Int16. So there's no data acquisition from an external sensor/hardware.

Whats going on is that at the start, everything seems to be working fine. I left it running overnight, the next morning, some of my nodes stopped being able to subscribe to it, but some are still subscribing to it. rostopic echo also isn't able to subscribe to it. From running rqt_graph, both the subscribing nodes and publisher are still alive and the arrows still indicate the subscribing/publishing.

Has anyone encountered this or have an idea how I should go about debugging it? I've started to look at the .ros/log folder and I can't seem seem to make sense of it. Upgrading to Ubuntu 14.04 and to indigo is an option, but I would like to try other alternatives before an upgrade.

Edit: I've updated to indigo and running ubuntu 14.04 LTS, but I'm still seeing this same issue.

This also doesn't look like an rostopic issue. When I run into this issue, the existing nodes continue to subscribe to this topic, but new nodes or rostopic cannot. In order to fix this, I have to kill the node and let the troubled node respawn/rerun and then new nodes are able to subscribe.

Edit2:

From Ahendrix's post, I thought I would try ROS_HOSTNAME. Here was my set up:

Main computer(main-computer@192.168.1.2) : I've edited the /etc/hosts file to resolve 127.0.0.1 as main-computer. export ROS_MASTER_URI=http://main-computer:11311

export ROS_HOSTNAME=main-computer

Remote computer(remote-computer@192.168.1.3) : Edited /etc/hosts file to resolve 192.168.1.2 as main-computer. export ROS_MASTER_URI=http://main-computer:11311

export ROS_IP=192.168.1.3

What I tried to do was make sure even if my DHCP lease was up, and a new IP was assigned, I all the old nodes and new nodes would be able to resolve "main-computer" to its local ip address.

Unfortunately, this still didn't work. While "main-computer" had this problem, I logged onto "main-computer" for an rostopic echo. I saw that on master.log, it was actually adding a sub.

[rosmaster.master][INFO] 2015-03-27 12:30:47,573: +SUB [/bandwidth_msg] /rostopic_11324_1427484647396 http://main-computer:42724/

I went into the rostopic_11324_1427484647396.log file, and saw that everything looks normal, and shortly after, this error message popped up.

[rospy.internal][WARNING] 2015-03-27 12:32:32,352: Unknown error initiating TCP/IP socket to main-computer:45391 (http://main-computer:52459/): Traceback (most recent call last):
  File "/opt/ros/indigo/lib/python2.7/dist-packages/rospy/impl/tcpros_base.py", line 557, in connect
    self.read_header()
  File "/opt/ros/indigo/lib/python2.7/dist-packages/rospy/impl/tcpros_base.py", line 619, in read_header
    self._validate_header(read_ros_handshake_header(sock, self.read_buff, self.protocol.buff_size))
  File "/opt/ros/indigo/lib ...

(more)

edit retag flag offensive close merge delete

Comments

Sounds like it might be some low-level instability "thing". A first debugging would be to remove your calculations/interfacing and really just publish random data. If it still occurs it's some ROS instability. If everything works fine, it has to do something with your code (not necessarily your faul

dornhege ( 2015-02-06 08:51:40 -0500 )edit

This happened again, but unfortunately I didn't get a chance to do that setup. It might be a rostopic issue. I didn't mention that the node runs with a ROS_MASTER_URI of the local IP address and not local. I saw that the rostopic issue happened when the connection was lost and reconnect.

pwong ( 2015-02-06 20:24:31 -0500 )edit

When this happened, I did an rostopic pub on the troubled topic, and received a warning.[WARN] [WallTime: 1423217612.971032] Inbound TCP/IP connection failed: connection from sender terminated before handshake header received. 0 bytes were received. Please check sender for additional details.

pwong ( 2015-02-06 20:25:13 -0500 )edit

add a comment

answered 2015-03-20 06:29:08 -0500

ahendrix

47576 ●179 ●367 ●662 http://namniart.com/

updated 2015-03-28 14:34:52 -0500

I suspect you're using your computer's IP address instead of the loopback address in your ROS_MASTER_URI and ROS_IP, and that your computer gets assigned a different IP address when it renews its DHCP lease.

Since all of the running nodes are connected locally, they won't have any trouble when your IP address changes.

If you have nodes that advertise their topics on your machine's IP address, and the IP address changes, those nodes won't be updated. Since they're still using the old IP address, other nodes won't be able to connect.

There are a few ways to fix this, depending on your requirements:

Use localhost in your ROS_MASTER_URI and ROS_IP/ROS_HOSTNAME
If you have a hostname and it doesn't change when your IP changes, use that in your ROS_IP and ROS_HOSTNAME
Write a script which detects when your hostname changes, and restarts all of your nodes
Fix ROS so that it deals with IP address changes (this will probably be quite a bit of work)

EDIT

Give your log files, it looks like your system isn't having hostname problems.

The real problem here is that the publisher is rejecting new connections. The relevant error message from your log file is:

error: [Errno 104] Connection reset by peer

You should go look at the logs for your publisher node to see if it's getting any new connection requests, and hopefully you'll find a reason there about why it's rejecting them.

Comments

I am doing it through my computer's IP address instead of loopback. The computer gets an IP address through openVPN so it is static, and shouldn't change. (At least it was the same when I checked at launch and at troubled time). I'll give the ROS_HOSTNAME a try, thanks!

pwong ( 2015-03-20 09:53:54 -0500 )edit

I've added an update regarding the situation on my original question from "Edit2" onwards. If you have some time to take a look, please let me know what you think!

pwong ( 2015-03-27 21:06:51 -0500 )edit

Is your node leaking file descriptors?

ahendrix ( 2015-03-29 03:21:04 -0500 )edit

I've been testing/monitoring it. I want to make sure of the results before an update. Thank you for pointing me towards that direction, its something I've never heard of until you mentioned it.

pwong ( 2015-03-30 15:53:59 -0500 )edit

Turns out it wasn't a file descriptor leak. When I saw this problem happening, I did a: lsof | wc -l which showed 48170. This looks like a lot, but this stayed in control +/- 1000 files. I checked the max fds allowed for my system: /proc/sys/fs/file-max which showed 390022.

pwong ( 2015-04-21 16:56:23 -0500 )edit

I noticed that the computer has over 10gb of files in the .ros/log folder. The problem shows up much sooner than others. After I did a rosclean purge of the folder, the problem seemed to not surface so soon. Do you think there is any correlation between my problem and the log folder being so large?

pwong ( 2015-04-21 17:00:27 -0500 )edit

In addition to the system-wide limit on file descriptors, there is also a per-process limit that is usually around 1024 to 4096 open files.

ahendrix ( 2015-04-21 17:53:32 -0500 )edit

It's possible that the sheer number of files in your log directory is slowing down the file system. Unfortunately, ROS doesn't have an automatic system for cleaning these out, but it's fairly simple to use cron and tmpwatch to clean your log directory out on a regular basis.

ahendrix ( 2015-04-21 17:55:04 -0500 )edit

see more comments

Nodes stop subscribing over time

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

Nodes stop subscribing over time edit

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

Nodes stop subscribing over time