Messages getting dropped [ros melodic + industrial camera]

asked 2021-02-18 08:03:33 -0500

affi

Hey everyone,

I am trying to perform a ground-truth validation with an industrial camera and an Ouster lidar (OS1-64), both connected to an Nvidia AGX via a PoE switch. The camera's bandwidth is 49 MByte/s and the lidar runs in 1024x20 mode.
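As a rough sanity check of the link load (my own back-of-envelope numbers; the 12608-byte lidar packet and 16 columns per packet are assumptions for the OS1-64, not figures from this setup):

```python
# Rough link-utilization estimate for camera + lidar on one Gbit link.
# Assumptions (not from the question): OS1-64 lidar packet = 12608 bytes,
# 16 measurement columns per packet.
camera_mbit = 49 * 8                               # 49 MByte/s camera stream
cols_per_s = 1024 * 20                             # 1024x20 lidar mode
lidar_pkts_per_s = cols_per_s / 16                 # assumed columns per packet
lidar_mbit = lidar_pkts_per_s * 12608 * 8 / 1e6    # assumed packet size
print(round(camera_mbit), round(lidar_mbit), round(camera_mbit + lidar_mbit))
```

If these assumptions hold, the combined average rate is only about half of a Gbit link, so average bandwidth alone would not explain the drops.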

If both the camera and the Ouster run at the same time, my messages get dropped, whereas if I run only the Ouster client I receive all of them. You can see the comparison below.

I am using the following command to record the messages:

rosbag record -b 1024 --chunksize=1024 --duration=1h  /os_cloud_node/points __name:=recordLidarData &

This is the output of rosbag info for a bag recorded without the camera present on the network. You can see there are 71991 messages and the bag for one hour is 210.9 GB:

path:        _2018-01-28-17-35-57.bag
version:     2.0
duration:    59:59s (3599s)
start:       Jan 28 2018 17:35:57.88 (1517157357.88)
end:         Jan 28 2018 18:35:57.60 (1517160957.60)
size:        210.9 GB
messages:    71991
compression: none [71991/71991 chunks]
types:       sensor_msgs/PointCloud2 [1158d486dd51d683ce2f1be655c3c181]
topics:      /os_cloud_node/points   71991 msgs    : sensor_msgs/PointCloud2

When both the camera and the Ouster client run simultaneously, I get the correct FPS for the camera, but most of the Ouster messages are dropped, and I end up with a very small bag file. You can see the output of rosbag info below: for one hour I get only 6294 messages and the bag is 18.4 GB.

path:        _2018-01-28-19-15-51.bag
version:     2.0
duration:    59:58s (3598s)
start:       Jan 28 2018 19:15:52.19 (1517163352.19)
end:         Jan 28 2018 20:15:50.64 (1517166950.64)
size:        18.4 GB
messages:    6294
compression: none [6294/6294 chunks]
types:       sensor_msgs/PointCloud2 [1158d486dd51d683ce2f1be655c3c181]
topics:      /os_cloud_node/points   6294 msgs    : sensor_msgs/PointCloud2

Can anyone nudge me in the right direction on how to solve this issue? For example, what would be the correct chunk size or buffer size so that my messages are not dropped? Is there any way to check how many messages were dropped, some kind of statistics?
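One rough way to quantify the drops is to compare the message count in rosbag info against what the lidar should publish. A sketch using the two bags above (the 20 Hz cloud rate is my assumption from the 1024x20 mode):

```python
# Estimate dropped point clouds from the rosbag info outputs above.
# Assumption: the driver publishes one PointCloud2 per revolution, i.e. 20 Hz
# in 1024x20 mode.
rate_hz = 20
duration_s = 3599                      # duration from rosbag info
expected = rate_hz * duration_s        # 71980 clouds expected per bag
for name, received in [("lidar only", 71991), ("lidar + camera", 6294)]:
    loss = max(0.0, 1 - received / expected)
    print(f"{name}: {received}/{expected} msgs, ~{loss:.1%} dropped")
```

By this estimate the lidar-only bag is essentially complete, while the combined run loses roughly 90 % of the clouds.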

Or is there a bottleneck because the switch might be under stress (too many packets)? Any suggestions would be appreciated.

Regards, Affi

1 Answer

answered 2021-04-16 20:22:54 -0500

peci1

How glad I am that we're not the only ones seeing this bug! We've just spent 20+ man-hours debugging it. With the help of a network tap device, we pinpointed the issue to the timing of the packets sent by the lidar.

It seems the lidar sends data in bursts (8 UDP fragments, pause, 8 UDP fragments, pause, 1 UDP fragment). These 17 fragments form one large fragmented UDP datagram, and that is what the ROS driver forwards as the lidar packet. When you connect the lidar directly to a computer, you don't observe the packet loss, as there is no one else on the cable and a Gbit link is good enough for the lidar.

However, when the lidar traffic is mixed in a switch with traffic from other devices, it turns out the bursts of data are just too fast, and switches start throwing Ouster packets away because their buffers are half-full of the other traffic and can no longer hold all the packets belonging to an Ouster burst. We haven't found any parameter of the switches we tested (about 5 types) that would tell us the size of their buffers, but we assume the store-and-forward technology they use buffers just a few packets. Ouster sends 10000 UDP packets/s in 1024x10 mode, and one of our switches is specified for 17 Mpackets/s, which is far more. However, the bursting behavior exceeds even this switch's limit on short time intervals.
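The fragment rate above can be sketched as follows (the 17 fragments per datagram come from our tap capture; the 16 columns per lidar packet are my assumption for the OS1-64):

```python
# Back-of-envelope UDP fragment rate in 1024x10 mode.
# Assumption: 16 measurement columns per lidar datagram (OS1-64).
cols_per_s = 1024 * 10
datagrams_per_s = cols_per_s / 16        # large fragmented datagrams per second
fragments_per_s = datagrams_per_s * 17   # each splits into 17 MTU-sized fragments
print(int(datagrams_per_s), int(fragments_per_s))
```

That works out to roughly the 10000 UDP packets/s quoted above, but concentrated in bursts rather than spread evenly, which is exactly what overruns the switch buffers.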

We plan to report this as an issue to Ouster, as it basically renders the sensor unusable when connected to a switch.

Until it is fixed (if!), there are some things you can do to make things a bit better.

  1. Switch to 512x10 mode and connect the Ouster to the switch via a 4-wire 100 Mbit Ethernet cable. This forces it into 100 Mbit mode, in which the bursts are much smaller. The bandwidth of 512x10 mode is about 60 Mbps, which the 100 Mbit link handles easily. This option fully resolved the issue for us, but you are restricted to 512x10 mode.

  2. Connect the lidar via a dedicated network card (e.g. a USB-Ethernet adapter or a PCIe card). This fully resolves the issue (if you manage to find a reliable USB-Ethernet adapter, which is not easy).

  3. If your receiving machine is Linux, set the following in /etc/sysctl.d/60-ouster-fix.conf:

net.ipv4.ipfrag_high_thresh = 26000000
net.ipv4.ipfrag_low_thresh = 13000000
net.ipv4.ipfrag_time = 5

You can play with the values. This doesn't fix the lost-packets problem itself, but it makes it less serious for the receiving computer. As said earlier, the Ouster sends large UDP datagrams, each fragmented into 17 MTU-sized frames (1500 B). As frames get lost in the switch, the receiving computer adds every fragment it does receive to a buffer, where it waits for the other fragments needed to reassemble the large UDP datagram. However, with 20 % of them missing, most datagrams are never completed (reassembled). So ... (more)
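To get a feel for what the enlarged reassembly buffer holds, here is a rough sizing sketch (the 1480 B fragment payload is the usual IPv4 figure for a 1500 B MTU; the sysctl value is the one from the snippet above):

```python
# How many partially reassembled lidar datagrams fit in the enlarged
# ipfrag buffer. Assumption: 1480 B of IP payload per 1500 B MTU fragment.
frag_payload = 1480
frags_per_datagram = 17
datagram_bytes = frag_payload * frags_per_datagram   # ~25 KB per lidar datagram
high_thresh = 26_000_000                             # net.ipv4.ipfrag_high_thresh
print(datagram_bytes, high_thresh // datagram_bytes)
```

With these values, on the order of a thousand incomplete datagrams can sit waiting for their missing fragments, instead of the few hundred the default 4 MB buffer allows.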


Jumbo frames might also help. It doesn't decrease data rate, but it will decrease packet rate. Especially for busy networks, that might help reduce fragmentation and lost packets.

gvdhoorn (2021-04-17 08:54:54 -0500)
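To quantify that suggestion: a hypothetical sketch of the fragment-count reduction jumbo frames would bring (same ~25 KB lidar datagram as above; ~8980 B of IP payload per 9000 B jumbo frame is my assumption):

```python
# Hypothetical fragment count per lidar datagram with standard vs jumbo MTU.
# Assumptions: ~25 KB datagram, 1480 B payload at MTU 1500, 8980 B at MTU 9000.
import math

datagram_payload = 1480 * 17                       # ~25 KB lidar datagram
frags_mtu_1500 = math.ceil(datagram_payload / 1480)  # fragments today
frags_mtu_9000 = math.ceil(datagram_payload / 8980)  # fragments with jumbo frames
print(frags_mtu_1500, frags_mtu_9000)
```

That would cut the fragment count per datagram from 17 to about 3, with a corresponding drop in packet rate, if the lidar supported it.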

Yeah, we'd like to use them. But it seems the lidar is limited to the standard MTU of 1500 B :(

peci1 (2021-04-17 09:29:20 -0500)
