Subscribers not getting updates after publisher restart, or if they start before publisher.

asked 2018-04-13 09:34:57 -0600

AWDunstan

Here's my setup:

A Windows 10 host running Ubuntu 16.04 in WSL (Windows Subsystem for Linux) and a Hyper-V VM (also running Ubuntu 16.04). Both Ubuntu environments are running Kinetic 1.12.13.

I run roscore, a publisher and a subscriber in the VM. I run one subscriber in WSL. All my code is written in rospy. I have ROS_MASTER_URI=http://ip-addr-of-the-vm:11311 in both environments; ROS_HOSTNAME is not set. In WSL I set ROS_IP to the network address for the VM. On the VM I set ROS_IP to the VM's own IP address (i.e. not localhost).

I'm using the same code on both sides (copied via tar/sftp), and I ran catkin_make from the top of the workspace on both sides before starting my tests.

Case 1: Start the publisher, then start the two subscribers.
Result: Subscribers get topic updates as expected. Publisher reports 2 connections (Publisher.get_num_connections())

Case 2: Start the subscribers first, then start the publisher.
Result: The VM subscriber sees the topic updates, the WSL subscriber does not. Publisher reports only 1 connection. If I then restart the WSL subscriber, it does get topic updates and the publisher's number of connections goes up to 2.

Case 3: Start the publisher, then start the subscribers (same as case 1). Then stop the publisher and re-start it.
Result: Both subscribers get topic updates until the publisher stops. When the publisher starts back up only the VM subscriber gets updates. The publisher reports 2 connections until it stops. When it starts back up it only sees 1 connection.

Case 4: Same as case 3, only the publisher explicitly calls Publisher.unregister() before exiting.
Result: Same as case 3, and the publisher logs an error message "Could not process inbound connection: [/thepublisher] is not a publisher of [/thetopicname]. Topics are ..." where ... is a list of topics and the definition of the message I'm sending.

I've run "rostopic echo /thetopicname" on both the VM and WSL. It behaves the same as my subscriber (works as expected in the VM, doesn't in WSL).

I've run roswtf during the first three cases, with the following results:

Case 1, roswtf run while topic updates were coming out and after the publisher stopped:

In WSL it finds 1 error: "The following nodes should be connected but aren't: * /wsl_listener_3255_152368651380->/rosout (/rosout)". On the VM it gets stuck after "running graph rules...", and prints "unknown network error contacting node: timed out" repeatedly.

Case 2: Same as case 1, if I leave the WSL subscriber running. If I stop the WSL subscriber, the WSL roswtf output changes to a warning: "The following node subscriptions are unconnected: * /vm_listener * /thetopicname"

I haven't run roswtf during case 3 yet.
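The repeated "timed out" message on the VM side may mean that a node's advertised address cannot be reached from there. A minimal sketch for testing plain TCP reachability from one machine to another follows. The function name `can_connect` is hypothetical, and the host and port are assumed to be supplied by you (for example, the master port 11311 from ROS_MASTER_URI, or the node URI that `rosnode info` prints). It only checks that a TCP connection can be opened; it does not speak XML-RPC or TCPROS.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Running `can_connect("ip-addr-of-the-vm", 11311)` from WSL, and the equivalent call from the VM towards the address and port of the WSL node, would show whether connections can be opened in both directions.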

Questions:

  • Is this configuration even supported?

  • If so, what am I doing wrong?

Thanks!


Comments

Is this configuration even supported?

No :)

If you can reproduce this under a regular Ubuntu install, or with two nodes in the same Ubuntu VM, then it could be something to investigate.

WSL's networking is just not yet entirely there. I'm not sure it's worth it to try and diagnose this.

gvdhoorn (2018-04-13 09:51:44 -0600)

I was afraid of that. At least my question can serve as a warning to others :-) Thanks!

AWDunstan (2018-04-13 10:00:08 -0600)

Adding some nuance: I'm just one voice, and not even a ROS maintainer or OSRF employee. I have looked at WSL and ROS in the past though (see #q238646). In my experience things that don't seem to work with networking and WSL point to incomplete implementations on the WSL side. I'm not saying that ..

gvdhoorn (2018-04-13 10:58:26 -0600)

.. it's all hopeless, it's just that you have to keep in mind that WSL is still essentially beta, and MS have some work to do (take a look at the issue tracker on GH they maintain).

Summarising: this could probably work with some effort, but it's not a supported scenario at this point.

gvdhoorn (2018-04-13 10:59:37 -0600)

Just noticed this:

In WSL I set ROS_IP to the network address for the VM

This is probably not correct. ROS_IP should always be set to an IP at which the machine running the ROS node(s) is reachable. The nodes running under WSL are not reachable at the VM's IP, are they? Set ROS_IP to the ..

gvdhoorn (2018-04-13 11:01:16 -0600)

.. IP of the machine running WSL (i.e. the IP of your Windows 'host'), not the VM IP, and try again.

gvdhoorn (2018-04-13 11:02:11 -0600)
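The advice above (ROS_IP must be an address of the machine running the nodes) can be sanity-checked with a short script. This is a rough sketch, not part of ROS: the helper names `is_local_address` and `check_ros_ip` are hypothetical, it handles IPv4 literals only, and it only tests that the address can be bound locally. It cannot tell whether the address is reachable from the other machine.

```python
import os
import socket


def is_local_address(ip):
    """Return True if `ip` is an IPv4 literal that can be bound on this machine."""
    try:
        socket.inet_aton(ip)  # reject anything that is not an IPv4 literal
    except OSError:
        return False
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Binding normally succeeds only for addresses assigned to a local interface.
        sock.bind((ip, 0))
        return True
    except OSError:
        return False
    finally:
        sock.close()


def check_ros_ip(env=os.environ):
    """Describe whether ROS_IP in `env` looks like an address of this machine."""
    ros_ip = env.get("ROS_IP")
    if ros_ip is None:
        return "ROS_IP is not set"
    if is_local_address(ros_ip):
        return "ROS_IP=%s belongs to this machine" % ros_ip
    return "ROS_IP=%s is NOT an address of this machine" % ros_ip
```

Calling `print(check_ros_ip())` in the same shell that launches the nodes, once in WSL and once in the VM, reports on the value each side actually exports.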

I checked, and it already was (sorry for the confusion). My WSL has three network interfaces: localhost, the main one (eth0), and one to talk to the VM (eth3). ROS_IP was set to the WSL's address on the talk-to-the-vm network, not the address of the VM.

AWDunstan (2018-04-13 12:48:14 -0600)

Well, in that case I believe I would check whether this works with a regular/normal setup, and if it does, then it's probable that WSL is getting in the way.

gvdhoorn (2018-04-13 14:03:05 -0600)