Detection of subscriber queue overflow in ROS2?

asked 2023-04-20 03:43:18 -0500

Mango

Is there some mechanism with which to detect when a subscriber's queue is full, or has dropped messages due to being full as new messages arrive?

The setup in question is ROS2 Humble on Ubuntu 22.04, running on bare metal, in WSL, and in the official Docker images.

My understanding is that the QoS settings HISTORY=KEEP_LAST and DEPTH='some_number' on a subscriber together allocate a message queue of size 'some_number' for that subscriber. The executor of the subscriber's node then pulls one message at a time from this queue and executes the configured callback with it, making room for another message in the queue. (This assumes a single-threaded executor, a mutually exclusive callback group, etc.)
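The queue behaviour described above can be sketched with a bounded deque. This is a simplified model of KEEP_LAST semantics as I understand them, not the actual rmw/DDS implementation; the function name `simulate_keep_last` is made up for illustration.

```python
from collections import deque


def simulate_keep_last(depth, arrivals):
    """Model a KEEP_LAST history as a bounded FIFO that evicts its oldest entry.

    Returns the messages still queued and the messages that were evicted
    because nothing consumed them in time (no callback runs in this model).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    queue = deque(maxlen=depth)
    dropped = []
    for msg in arrivals:
        if len(queue) == depth:
            # The oldest entry is about to be pushed out by the new arrival.
            dropped.append(queue[0])
        queue.append(msg)
    return list(queue), dropped


# Five messages arrive in a burst while the callback is busy, depth is 3.
remaining, dropped = simulate_keep_last(depth=3, arrivals=range(5))
print(remaining, dropped)  # [2, 3, 4] [0, 1]
```

In this model the two oldest messages disappear without any callback ever seeing them, which is exactly the event I would like to detect.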

In the event that messages arrive in bursts, or at a higher rate than the callback can handle, this queue will start filling up. For me, this constitutes a crucial event, namely that the callback is not fast enough, and we risk losing information as (the oldest) messages are dropped due to the queue being full. On a related note, it also means that the messages being handled can be quite dated if the queue depth is large in relation to the incoming rate.
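As a rough worked example of that last point: assuming the queue stays full and messages arrive at a steady rate, the oldest entry in a KEEP_LAST queue is about (depth - 1) message periods old when the executor takes it. The function name and the numbers below are illustrative assumptions, not measurements.

```python
def oldest_message_age_s(depth, incoming_hz):
    """Approximate age of the oldest entry in a full KEEP_LAST queue.

    Assumes a steady arrival rate and a queue that is permanently full,
    so it always holds the most recent `depth` messages.
    """
    if depth < 1 or incoming_hz <= 0:
        raise ValueError("depth must be >= 1 and incoming_hz must be > 0")
    return (depth - 1) / incoming_hz


# A depth of 100 at 10 Hz means the callback may be working on data
# that is almost ten seconds old.
age = oldest_message_age_s(depth=100, incoming_hz=10.0)
print(f"{age:.1f} s")
```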

In ROS1, as I understand it, a debug message is logged when a message is dropped because the subscriber queue is full, and it is possible to check the number of unhandled messages in a subscriber's queue at runtime. See related ROS1 question. I am hoping there is an equivalent mechanism for ROS2, and that I have simply missed where to configure it.

For the record, I have looked into the MESSAGE_LOST callback which you can pass as part of subscriber_options. My understanding is that this callback is fired when the middleware reports having failed to transmit a message, NOT when messages are correctly transmitted by the middleware, resulting in the oldest message in the queue being ejected in favor of the new incoming one.

Although it should be irrelevant for such a conceptual question, the transmission in question is via SHM on FastDDS, with all nodes on the same host, WITHOUT any loaned messages or zero-copy transport.


Comments

My understanding is that this callback is fired when the middleware reports having failed to transmit a message, NOT when messages are correctly transmitted by the middleware

This seems to contradict the description of the message lost example?

You should see the talker output to the terminal for each message it publishes. The listener should report each message that it receives as well as any lost message events it detects.



For me, this constitutes a crucial event, namely that the callback is not fast enough, and we risk losing information as (the oldest) messages are dropped due to the queue being full.

What would be your preferred solution? To detect a full buffer, stop the publisher for a while until the buffer is emptied, and then continue?

jrtg ( 2023-04-21 04:24:36 -0500 )

Thanks for your response :)

To clarify, the callback in question regarding "My understanding is that this callback is fired ..." was the message_lost_callback, not the message_callback.

As to the statement in the example, I interpret it as follows: "The listener should report each message that it receives (the message callback is called) as well as any lost message events it detects (the message lost callback is called)."

Read that way, there is no contradiction here. However, the message callback is not executed immediately upon receiving a message from the middleware. It runs whenever the executor finds time for it and has pulled a message from the subscriber queue. Should this queue fill up, and indeed overflow, before the executor finds time, the oldest messages in the queue are dropped silently before the executor has a chance to run the message callback with them. This eventuality is thus covered by neither callback.
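One application-level way to notice such silent drops, sketched here under the assumption that the publisher stamps every message with a monotonically increasing counter (a hypothetical `seq` field, not something ROS2 adds for you), is to look for gaps in that counter inside the message callback. The class name is made up for illustration.

```python
class SequenceGapDetector:
    """Count messages that never reached the callback, based on a counter.

    Assumes the publisher increments the counter by exactly one per message.
    A gap only says that messages went missing somewhere between publisher
    and callback; it cannot tell queue eviction from transport loss.
    """

    def __init__(self):
        self.last_seq = None
        self.total_missed = 0

    def observe(self, seq):
        """Record one received sequence number, return how many were skipped."""
        missed = 0
        if self.last_seq is not None and seq > self.last_seq + 1:
            missed = seq - self.last_seq - 1
        self.last_seq = seq
        self.total_missed += missed
        return missed


detector = SequenceGapDetector()
# Sequence numbers as seen by the callback: 3, 4, 7 and 8 never arrived.
gaps = [detector.observe(seq) for seq in [0, 1, 2, 5, 6, 9]]
print(gaps, detector.total_missed)  # [0, 0, 0, 2, 0, 2] 4
```

If gaps show up while the message lost callback stays silent, queue overflow on the subscriber side is a plausible suspect, although this sketch alone does not prove it.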

Mango ( 2023-04-21 09:43:39 -0500 )

Regarding a preferred solution, I think the current behaviour is fine, i.e. that should the queue overflow, the oldest messages are dropped. What I would like is for something to tell me when this has occurred.

This is not about stalling the system (services might be a better fit for this) but rather detecting that the system as a whole is poorly tuned, and that I should make an effort to either:

  1. Optimize the message callback, to hopefully have it handle a higher rate of messages.
  2. Slow down the rate of incoming messages to something which the callback can indeed handle.

Or indeed a little bit of both.
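To decide which of the two to tune, a rough check is to compare the measured callback duration with the incoming rate. The helper below is only a sketch; its name and the sample durations are made-up values for illustration.

```python
def callback_utilization(durations_s, incoming_hz):
    """Fraction of wall time the callback needs at a given incoming rate.

    A value above 1.0 means the callback cannot keep up on average,
    so the subscriber queue will eventually overflow.
    """
    if not durations_s:
        raise ValueError("need at least one measured duration")
    mean_duration = sum(durations_s) / len(durations_s)
    return mean_duration * incoming_hz


# Hypothetical measurements: about 12 ms per callback at 100 Hz incoming.
utilization = callback_utilization([0.012, 0.015, 0.009], incoming_hz=100.0)
print(f"utilization: {utilization:.2f}")
print("keeps up" if utilization < 1.0 else "falls behind")
```

With these numbers the callback would have to get faster than 10 ms on average, or the incoming rate would have to drop below roughly 83 Hz.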

Mango ( 2023-04-21 09:49:53 -0500 )