Unable to receive large messages

asked 2022-09-08 17:06:53 -0500

Ronoman
48 ●1 ●2 ●7

updated 2022-09-08 17:20:52 -0500

I'm running into an odd problem. I'm using ROS2 Galactic inside of docker images, one of which is running on my laptop, and another running on a Husky robot. In case network topology matters, here is a diagram of the setup: network config The robot is at 192.168.131.1, and my laptop is 192.168.131.116. Both images are run with --net host --privileged.

If I run a simple talker in the Husky's docker image and echo the topic on my laptop docker image, I receive the data just fine. All data can be delivered in a single UDP packet, and so no reassembly needs to take place on either the robot or my laptop.

However, when I send larger messages over topics, I can no longer receive the data. I have tested this with VLP-16 LIDAR PC2 data, as well as L515 Realsense camera data. I again captured packets on both the robot and my laptop, both still under the same IP. I can echo these topics fine locally on the robot's docker image, but not from my laptop's docker image.

DISCLAIMER: I'm not a network expert, so the analysis below could be wrong (and please tell me if it is).

Here is a snippet of the capture from the robot: robot packet frags Packet #s 353708-353720 are fragments of the image. These are reassembled in packet #353721, and the top of the raw packet bytes at the bottom show the camera_color_optical_frame and rgb8 metadata.

Here is a snippet of the received data on my laptop: frags I believe that packets 3285-3302 should all be reassembled in 3303, but only 3294-3302 are reassembled. Here is the data for packet 3285: frag data This includes the aforementioned metadata camera_color_optical_frame and rgb8, but it never ends up in a reassembled packet.

(note that the above packet captures are not the same same message; I don't have synced clocks between robot and laptop. I hope it illustrates the point about packet reassembly, but I can update with matching sends and receives if that helps.)

I have tried:

Swapping DDS implementations from CycloneDDS to Fast RTPS
Following all debugging/tuning recommendations at https://docs.ros.org/en/rolling/How-T...
Changing publisher QOS to "Best Effort"
Tried other large message publishers (namely VLP-16 pointclouds)

to no avail. Furthermore, when subscribing to VLP-16 pointcloud topics, all network connections I have to the robot cease until ~30s after I kill the subscriber on my laptop. I believe that the issue is either a core networking driver problem, or a misconfiguration/bug of Cyclone DDS.

What should I do to debug further, and/or resolve the problem? Thank you!

edit retag flag offensive close merge delete

Comments

That is weird. What happens if you publish only a single large message? I assume the "L515" publisher is publishing at some relatively fast frequency (10-30Hz)? Maybe you could record a bag file and play it back in its place and step through it with the cli or rqt_bag? Or you could write a python script to run on the husky which subscribes to the topic, republishes only the first message on a new topic, then just idles, and then you can echo the new topic on your laptop (effectively creating a throttled topic). I'm wondering if the issue is related to resending packets, but you said best effort also fails... It's definitely strange.

William ( 2022-09-08 17:48:52 -0500 )edit

Also, this is similar to how things failed with the IP fragments issue described in https://docs.ros.org/en/rolling/How-T... (which you said you looked at). Can you confirm that these settings are changed on both ends? I don't know how docker might interact with this... but my first guess is that it shouldn't be an issue.

William ( 2022-09-08 17:50:33 -0500 )edit

add a comment

answered 2022-09-09 14:08:57 -0500

spiderkeys
36 ●2 ●2

Sharing info from our discussion on the ROS Discord here for others.

First, the user informed me that they are receiving a few samples every now and then, but it is very sporadic. This helps us rule out anything fundamentally mismatched at the DDS level (incomplete discovery, failed matching, inability to get some pieces data through in both directions on the network, etc).

The wireshark capture shows that UDP fragments are definitely being lost, but this could be at the network layer (routing over WiFi) or at the OS network stack layer (fragments dropped due to resource constraints). Because full-MTU size messages are being observed on both sides, we can also rule out most network configuration issues (routing, MTU size issues, etc).

The first step I would take is to get a pretty good estimate of the actual bandwidth your software is trying to utilize. The most common scenario I see is that people try to push full framerate uncompressed video over limited network bandwidths and are surprised when it can't get through. As an example, if your camera is producing a 1920x1080 RGB8 image at 30fps, your bandwidth requirement is going to be at a minimum around 1.5Gbps (assuming 24-bit packing of the pixel data), plus overhead, any reliability traffic/retransmissions, and any other traffic in the system. This exceeds common ethernet speeds, let alone common WiFi capabilities. You need to make sure your network can support the bandwidth requirements, regardless of what type of network it is traversing. Having adequate theoretical bandwidth won't guarantee reliability, but it is a pre-requisite.

You also have to consider other devices sharing the WiFi bandwidth, what WiFi technology you are using and how it supports concurrent usage, which frequencies you are operating on, whether you are in a noisy/saturated radio environment, etc. If the bandwidth requirement is adequately less than what your router can actually support over WiFi, then there are some optimizations that can be made at the router level to improve stability (long and short retransmit configurations, QOS settings, etc), the OS level to tune the network stack, and a number of techniques at the DDS QOS level that can be used to improve performance as well.

Once you are confident that your bandwidth requirements are reasonable, the first action to take is to adjust the socket send/receive buffer sizes on both machines. This varies per OS and DDS implementation. See: https://docs.ros.org/en/rolling/How-T...

This is one of the most common issues that causes data loss with large, fast samples, despite there being adequate bandwidth available over the given network topology. In particular, the out-of-the-box socket receive buffer max/default values in most Linux distros is too small for many DDS applications, and I have seen this cause data loss on numerous occasions.

After having done that, if possible, I'd try taking the router out of the equation first and directly connecting both devices over ethernet (through USB adapters is ... (more)

Comments

For some reason, I can't edit my post anymore, so adding this here.

Another issue to consider is disabling multicast in your environment. Long story short, DDS+Multicast over a wifi network can have devastating performance impacts that will grind your entire network to a halt. I had typed up a detailed explanation on this, but something weird happened and my edit didn't save here, but a very detailed discussion on the many facets, workarounds, options for sane defaults, etc can be found here:

https://discourse.ros.org/t/ros2-defa...

spiderkeys ( 2022-09-09 14:09:54 -0500 )edit

i edited your question to include your comment and gave you a bit of karma, though not sure if its enough to give you the ability to edit your answer or not...

shonigmann ( 2022-09-09 15:35:05 -0500 )edit

add a comment

Unable to receive large messages

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

Unable to receive large messages edit

Comments

1 Answer

Comments

Question Tools

Stats

Related questions

Unable to receive large messages