Tracking a message from publisher to subscriber

asked 2019-04-01 10:36:21 -0500

updated 2020-03-11 10:42:50 -0500

I'm trying to track messages across nodes to eventually build a graph (well, many graphs) using LTTng and the tracetools package.

One of the things I need is to be able to connect the message that a publisher sends and the (same) message that a subscriber receives. For example, looking at this representation of pub/sub queues below, I'd like to link 0x561f5f3d1064 and 0x7f8120003230.

[image: representation of pub/sub queues]

Unfortunately, looking at ros_comm and how a message is serialized, the message isn't unique, since the content does not always include a timestamp (std_msgs/Header). For example, two std_msgs/String messages with the same content will always serialize to the same bytes. Otherwise this would be too easy!

This is the solution I'm considering right now:

  • Right before the publisher sends the message over the network, we use message_start (from the buffer). This identifies the message on the publisher's side.
  • We take the very next net_dev_queue event that matches the pub/sub connection (hosts/ports of the TCP connection), and "link" the TCP sequence number (or skbaddr if we're on the same host) to the message_start above.
  • Similarly, for the subscriber, we use the message_start of the very next message that the subscriber receives after the corresponding netif_receive_skb event (with the same sequence number).
  • Thus the two message_start values are linked.
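The steps above could be sketched roughly as follows. This is only an illustration of the matching logic, not code from tracetools: the `Event` class, its fields, and the `pub_write`/`sub_receive` event names are hypothetical stand-ins for whatever the trace actually provides, and it assumes events are already decoded with a connection key and either a message_start value or a TCP sequence number.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    # All fields are hypothetical; a real trace would need to be decoded into this shape.
    timestamp: int
    name: str               # "pub_write", "net_dev_queue", "netif_receive_skb", "sub_receive"
    conn: Tuple[str, str]   # (source "host:port", destination "host:port")
    value: int              # message_start for ROS events, TCP sequence number for kernel events


def link_messages(events: List[Event]) -> Dict[int, int]:
    """Map each publisher-side message_start to the subscriber-side message_start."""
    links: Dict[int, int] = {}
    pending_pub: Dict[Tuple[str, str], int] = {}            # conn -> message_start waiting for net_dev_queue
    seq_to_pub: Dict[Tuple[Tuple[str, str], int], int] = {}  # (conn, seq) -> publisher message_start
    pending_rx: Dict[Tuple[str, str], int] = {}              # conn -> publisher message_start waiting for the subscriber

    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.name == "pub_write":
            pending_pub[ev.conn] = ev.value
        elif ev.name == "net_dev_queue" and ev.conn in pending_pub:
            # Very next packet queued on this connection: tie its sequence number to the message.
            seq_to_pub[(ev.conn, ev.value)] = pending_pub.pop(ev.conn)
        elif ev.name == "netif_receive_skb" and (ev.conn, ev.value) in seq_to_pub:
            pending_rx[ev.conn] = seq_to_pub.pop((ev.conn, ev.value))
        elif ev.name == "sub_receive" and ev.conn in pending_rx:
            # Very next message received by the subscriber after the matching packet.
            links[pending_rx.pop(ev.conn)] = ev.value
    return links
```

Packets that do not match a pending message (ACKs, unrelated traffic) are simply ignored, which is where the reliability concern comes in: a message split across several TCP segments, or several messages coalesced into one, would need extra handling.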

I'm not sure how reliable this method is.

Therefore I'd like to hear other ideas. Or maybe a confirmation that this might actually work and be reliable!

Update: here is what I ended up doing https://christophebedard.com/ros-trac...


1 Answer

answered 2020-03-11 11:18:14 -0500

updated 2020-03-13 02:29:51 -0500

Submitting an answer to my own question since I didn't get any answers.

I ended up doing pretty much what I described in my question. Full post here: https://christophebedard.com/ros-trac...

Here's a summary/excerpt:

To do what I described in my question, some information is needed on:

  • connections between publishers and subscribers
  • subscriber/publisher queue states
  • network packet exchanges

We first need to know about connections between nodes. The ROS instrumentation includes a tracepoint for new connections (new_connection). It includes the address and port of the host and the destination, with an address:port pair corresponding to a specific publisher or subscription.
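As a rough illustration of how such connection events might be paired up, here is a minimal sketch. It assumes each new_connection event has been decoded into a dict with hypothetical `node`, `local` and `remote` keys ("address:port" strings); the actual tracepoint field names may differ.

```python
from typing import Dict, FrozenSet, List


def connection_key(local: str, remote: str) -> FrozenSet[str]:
    """Order-independent key, so both ends of a TCP connection map to the same entry."""
    return frozenset((local, remote))


def pair_connections(events: List[dict]) -> Dict[FrozenSet[str], List[str]]:
    """Group new_connection events by connection and keep those seen from both ends."""
    by_key: Dict[FrozenSet[str], List[str]] = {}
    for ev in events:
        key = connection_key(ev["local"], ev["remote"])
        by_key.setdefault(key, []).append(ev["node"])
    # A publisher/subscriber pair shows up as two events with mirrored endpoints.
    return {key: nodes for key, nodes in by_key.items() if len(nodes) == 2}
```

The same key can then be used to match kernel network events (by their source and destination address:port) to a specific publisher/subscription pair.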

We also need to build a model of the publisher and subscriber queues, using the relevant tracepoints: one for when a message is added to the queue (publisher_message_queued, subscription_message_queued), one for when it's dropped from the queue (subscriber_link_message_dropped, subscription_message_dropped), and one for when it leaves the queue, either sent over the network to the subscriber (subscriber_link_message_write) or handed over to a callback (subscriber_callback_start). We can therefore visualize the state of a queue over time!
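A queue model along these lines could be sketched as below. The tracepoint names come from the description above, but the event tuple layout (timestamp, event name, queue id, message id) is an assumption made for the example, not the actual trace format.

```python
from typing import Dict, List, Tuple

# Tracepoints that add a message to a queue.
ENQUEUE = {"publisher_message_queued", "subscription_message_queued"}
# Tracepoints that remove a message: dropped, written to the network, or handed to a callback.
DEQUEUE = {
    "subscriber_link_message_dropped",
    "subscription_message_dropped",
    "subscriber_link_message_write",
    "subscriber_callback_start",
}

QueueEvent = Tuple[int, str, str, int]   # (timestamp, event name, queue id, message id), hypothetical layout
QueueState = Tuple[int, str, Tuple[int, ...]]


def queue_states(events: List[QueueEvent]) -> List[QueueState]:
    """Replay events in time order and record each queue's contents after every change."""
    queues: Dict[str, List[int]] = {}
    history: List[QueueState] = []
    for ts, name, queue_id, msg in sorted(events, key=lambda e: e[0]):
        queue = queues.setdefault(queue_id, [])
        if name in ENQUEUE:
            queue.append(msg)
        elif name in DEQUEUE and msg in queue:
            queue.remove(msg)
        else:
            continue  # unrelated event, or a message we never saw being queued
        history.append((ts, queue_id, tuple(queue)))
    return history
```

Each entry in the returned history is a snapshot that can be plotted as one step of the queue's state over time.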

Finally, we need information on network packet exchanges. Although this isn’t really necessary for this kind of analysis, it allows us to reliably link a message that gets published to a message that gets received by the subscriber. This is good when building a robust analysis, and it paves the way for a future critical path analysis based on this message flow analysis.

This requires us to trace both userspace (ROS) and the kernel. Fortunately, we only have to enable two kernel events for this (net_dev_queue for packet queuing and netif_receive_skb for packet reception). This saves us a lot of disk space, since enabling many events can generate multiple gigabytes of trace data, even when tracing for only a few seconds! Also, as the rate of generated events increases, so does the overhead. More resources have to be allocated to the buffers to properly process those events; otherwise they can get discarded or overwritten.

Result:

[image: result_analysis_initial_zoom.png]

Some links for actual code/further information:


Comments

As there is a good chance your website will go off-line in the future (all sites do at some point), it would be great if you could summarise what you did here in your answer. That would make this somewhat less of a link-only answer, and allow it to maintain its value even without your site being operational.

gvdhoorn (2020-03-11 12:17:36 -0500)

Good point! I've added a summary/excerpt.

christophebedard (2020-03-12 15:30:12 -0500)

n1. Thanks.

gvdhoorn (2020-03-13 02:29:33 -0500)
