Messages getting dropped [ros melodic + industrial camera]
Hey everyone,
I am trying to perform a ground truth validation, having an Industrial camera and Ouster lidar (os1-64) where both are connected to Nvidia Agx via PoE switch. The bandwidth rate of the camera is 49 MByte/sec
and Lidar Mode is 1024*20
.
If both the camera and ouster run at the same time, my messages are getting dropped. Whereas, if I only run the ouster client I am getting all the messages. You can see the comparisons below:
I am trying to run with following command to record the messages:
rosbag record -b 1024 --chunksize=1024 --duration=1h /os_cloud_node/points __name:=recordLidarData &
This is the output of the rosbag file which was recorded without the camera presence on the network:
You can see the messages are 71991
and size of the recorded bag for 1 hour is almost 210.9 GB
:
path: _2018-01-28-17-35-57.bag
version: 2.0
duration: 59:59s (3599s)
start: Jan 28 2018 17:35:57.88 (1517157357.88)
end: Jan 28 2018 18:35:57.60 (1517160957.60)
size: 210.9 GB
messages: 71991
compression: none [71991/71991 chunks]
types: sensor_msgs/PointCloud2 [1158d486dd51d683ce2f1be655c3c181]
topics: /os_cloud_node/points 71991 msgs : sensor_msgs/PointCloud2
When simultaneously, both the camera and ouster clinet runs, I am getting the correct FPS for the camera but all the messages are getting dropped for ouster and ended up in very small bag file. You can see the output of the rosbag file:
Here you can see the messages I am getting for 1 hour only 6294
and the size is of rosbag is 18.04 GB
.
path: _2018-01-28-19-15-51.bag
version: 2.0
duration: 59:58s (3598s)
start: Jan 28 2018 19:15:52.19 (1517163352.19)
end: Jan 28 2018 20:15:50.64 (1517166950.64)
size: 18.4 GB
messages: 6294
compression: none [6294/6294 chunks]
types: sensor_msgs/PointCloud2 [1158d486dd51d683ce2f1be655c3c181]
topics: /os_cloud_node/points 6294 msgs : sensor_msgs/PointCloud2
Can anyone nudge me in the right direction how to solve this issue. I mean in terms of, for example. what will be correct chunksize or buffsize so my messages should not drop. Is there anyway, where I can check how many messages are dropped kind of statistical information?
Or there is a bottleneck, because the switch might be under stress [too many packets]? Any suggestions would be appreciated.
Regards, Affi
Asked by affi on 2021-02-18 06:06:32 UTC
Answers
How glad I am we're not the only ones seeing this bug! We've just spent about 20+ manhours debugging it. With the help of a network tap device, we pinpointed the issue to the timing of the packets sent by the lidar.
It seems the lidar sends data in bursts (8 UDP fragments, pause 8 UDP fragments, pause, 1 UDP fragment). These 17 fragments form one large fragmented UDP datagram. And that's what the ROS driver sends furter as the lidar packet. When you connect the lidar directly to a computer, you don't observe the packet loss as there is noone else on the cable and a Gbit link is good enough for the lidar. However, when the lidar traffic is mixed in a switch with traffic from other devices, it shows the bursts of data are just too fast and switches start throwing Ouster packets away as their buffers are half-full of the other traffic and can no longer contain all packets belonging to the Ouster bursts. We haven't found any parameter of the switches we tested (about 5 types) that would tell us the size of the buffers, but we assume the store-and-forward technology they use has a buffer for just a few packets. Ouster sends 10000 UDP packets/s in 1024x10 mode, and one of our switches specifies 17 Mpackets/s, which is way more. However, the bursting behavior goes beyond the limit of even this switch on short time intervals.
We plan to report this as an issue to Ouster, as it basically renders the sensor unusable when connected to a switch.
Until it is fixed (if!), there are some things you can do to make things a bit better.
Switch to 512x10 mode and connect Ouster to switch via a 4-wire 100 Mbit Ethernet cable. This will force it into 100 Mbit mode, in which the bursts are much smaller. The bandwidth of 512x10 mode is 60 Mbps which is easy to handle on the 100 Mbit link. This option fully resolves the issue for us, but you have to work only with 512x10 mode.
Connect the lidar via a dedicated network card (e.g. a USB-Ethernet adapter or a PCIe card). This fully resolves the issue (if you manage to find a reliable USB-Ethernet adapter, which is not easy).
If your receiving machine is Linux, set the following in /etc/systctl.d/60-ouster-fix.conf:
net.ipv4.ipfrag_high_thresh = 26000000
net.ipv4.ipfrag_low_thresh = 13000000
net.ipv4.ipfrag_time = 5
You can play with the values. This basically doesn't fix the lost packets problem, but it makes it less serious for the receiving computer. As said earlier, Ouster sends 20k UDP frames fragmented to 17 MTU-sized frames (1500 B). As frames get lost on the switch, the receiving computer adds any received fragment to a buffer where it waits for its other buddies to form the large UDP datagram. However, with 20% of them missing, most datagrams are never completed (reassembled). So their parts just sit in this buffer waiting. The default setting for Ubuntu is to kick them off after 30 seconds. You can see this in action - if you run the lidar and cameras at the same time, every 30 seconds you receive a few damaged scans in quick succession. That's exactly when the reassembly buffer kicks off most of the hopeless datagrams. The other two settings just control size of the reassembly buffer. The default is about 300 kB, we increase it to 26 MB.
With these 3 settings, we got much better performance, but the problem is still not fixed. Without them, traffic of 60 Mbps from other devices is enough for lidar packets to start disappearing, wherease with the settings, most lidar packets make it through even with 500 Mbps other traffic.
I'll update this answer if Ouster tells us some relevant information or if they fix the issue in a new firmware.
Asked by peci1 on 2021-04-16 20:22:54 UTC
Comments
Jumbo frames might also help. It doesn't decrease data rate, but it will decrease packet rate. Especially for busy networks, that might help reduce fragmentation and lost packets.
Asked by gvdhoorn on 2021-04-17 08:54:54 UTC
Yeah, we'd like to use them. But it seems the lidar is limited to the standard MTU of 1500 B :(
Asked by peci1 on 2021-04-17 09:29:20 UTC
Is there any follow up solution for that? We have been facing the same problem here.
By running only the LiDAR everything is fine, but when we start the cameras we notice the packet drops on the LiDAR. We tried setting the LiDAR mode to 512x10 but we still saw the same behavior. Have you received any answer from Ouster?
Asked by lucas.nunes on 2021-11-19 04:30:03 UTC
If 512x10 mode is okay for you, connect the lidar via a 100 Mbps link and it will get much better (or even fixed). This connection can be forced either by putting some 100 Mbps-only switch in the way, or just by connecting only 4 of the 8 ethernet cables from Ouster (you need to figure out which ones).
Asked by peci1 on 2021-11-19 12:17:12 UTC
Otherwise, Ouster told that they will try to address this issue in a new firmware released around Q1 2022.
Asked by peci1 on 2021-11-19 12:17:50 UTC
Last, if you can open an issue with Ouster support, it would be great. Until now, they believed we are the only ones with this problem and thus it is a non-problem.
Asked by peci1 on 2021-11-19 12:18:27 UTC
I don't know if you are aware but ouster released a new firmware with configurable UDP packet here: https://data.ouster.io/downloads/changelog/changelog-v2.2.x.pdf
Hopefully this will help.
Asked by lucas.nunes on 2022-01-18 05:43:47 UTC
Unfortunately, the new firmware does not help (yet). The configurability is only about supporting the new dual returns mode. There is nothing allowing configuration of how many columns are sent at once. But looking at the changes in the ros driver, it seems Ouster is preparing for that. So let's hope it will happen soon.
Asked by peci1 on 2022-01-18 06:25:12 UTC
In the meantime, we have found a switch that supports PTP time sync and has buffers large enough that it can handle muxing Ouster traffic with other devices (we got 120 Mbps Ouster + ca 820 Mbps other traffic without a single dropped Ouster packet). https://www.botblox.io/products/rugged-gigabit-switch
Asked by peci1 on 2022-01-18 06:26:53 UTC
Comments