ROS Resources: Documentation | Support | Discussion Forum | Index | Service Status | ros @ Robotics Stack Exchange
Ask Your Question
1

Why is spinOnce() right after publish() flaky?

asked 2022-02-05 17:47:06 -0600

peci1 gravatar image

I have a simple C++ test code which contains a subscriber, a publisher and nothing else. I do the following:

  1. Set up subscriber sub with queue size 1.
  2. Set up publisher pub with queue size 1.
  3. Wait until sub.getNumPublishers() is 1 to make sure the two guys talked (wait up to 1 second).
  4. pub.publish(msg)
  5. std::thread t(&ros::spinOnce);

What's weird is that the publish call is flaky, i.e. in about 10% of cases, the subscriber callback isn't fired (regardless of how long I wait).

I can't understand why that happens, because:

  • I'm running with sim_time set to true and nobody publishes the time, so time should not play a role.
  • There are no other nodes running on this rosmaster.
  • Both the subscriber and publisher are in the same process, so they should use intraprocess communication. Thus nothing like Nagle should play a role.

My understanding of what publish() does is the following:

  • Serialize the data
  • Find all subscriber links
  • Store the serialized data together with some other goodies in the subscriber link's callback queue.

There's nothing obviously asynchronous here. Am I missing something? Or is it possible that sub.getNumPublishers() shows 1 while the publisher and subscriber are not really yet connected? (but I wait 1 second for this to happen...).

The only other explanation I can think of is that something else gets to the subscriber's queue. But what would that be? It is definitely not "converted" into calling the subscriber callback I registered because that's what I'm checking and what fails.

MWE:

#include <ros/ros.h>
#include <std_msgs/Header.h>
#include <thread>
int main(int argc, char **argv)
{
  ros::init(argc, argv, "test_flaky_publish");
  ros::Time::setNow(ros::Time(0));
  bool called = false;
  ros::NodeHandle nh;
  auto cb = [&](const std_msgs::Header&){called = true;};
  ros::Subscriber sub = nh.subscribe<std_msgs::Header, const std_msgs::Header&>("test", 1, cb);
  ros::Publisher pub = nh.advertise<std_msgs::Header>("test", 1);
  while (sub.getNumPublishers() == 0 && ros::ok())
    ros::WallDuration(0, 1000).sleep();
  pub.publish(std_msgs::Header());
  std::thread t(&ros::spinOnce);
  ros::WallDuration(0.5).sleep();
  if (called)
    ROS_INFO("called");
  else
    ROS_ERROR("not called");
  if (t.joinable())
    t.join();
  return called ? 0 : 1;
}
edit retag flag offensive close merge delete

Comments

Both the subscriber and publisher are in the same process, so they should use intraprocess communication.

is that actually true in ROS 1? IIRC, that's something only nodelets do.

gvdhoorn gravatar image gvdhoorn  ( 2022-02-06 02:10:36 -0600 )edit

I think that's true only for no-serialization comms. But no-tcp inside one process should work everywhere, even outside nodelets.

peci1 gravatar image peci1  ( 2022-02-06 04:27:22 -0600 )edit

1 Answer

Sort by ยป oldest newest most voted
2

answered 2022-02-06 11:32:43 -0600

peci1 gravatar image

After reading http://wiki.ros.org/roscpp/Internals and a few source files, I think I got what is going on.

The behavior depends on whether no-copy publishing (via shared_ptr messages) is used or not. In no-copy intraprocess publishing, the published message is directly put on the subscriber's callback queue. In intraprocess serialized pubsub connection (when you publish just a reference or a copy of the message), the callback queue of the subscriber is not directly updated from the publisher. It instead goes into the publisher output queue and waits for PollManager to move the callback from the publisher output queue to the subscriber callback queue.

As referenced later in the source code analysis, the PollSet gets signaled right after a serialized message is put in the publisher output queue. The signaling is done via a pipe, which can be buffered in kernel. So I think this is what brings asynchronicity in the code.

Relevant source lines:

edit flag offensive delete link more

Comments

And yes, I verified that if you change the MWE to used ConstPtr messages on both the publisher and subscriber, the example is no longer flaky.

peci1 gravatar image peci1  ( 2022-02-06 11:36:20 -0600 )edit

So to deflake the code, either switch to using pointer messages, or you can also put the following code after the publish call: while (dynamic_cast<ros::CallbackQueue*>(nh.getCallbackQueue())->empty()) ros::WallDuration(0, 1000).sleep();. But that's not the best solution, as it will only work in this minimal example where we know nobody else is publishing and we also know the dynamic type of the callback queue.

peci1 gravatar image peci1  ( 2022-02-06 11:45:34 -0600 )edit

Question Tools

1 follower

Stats

Asked: 2022-02-05 17:47:06 -0600

Seen: 302 times

Last updated: Feb 06 '22