Action client not receiving result when server publishes too quickly

asked 2017-04-05 14:38:52 -0600

cpagravel gravatar image

updated 2017-04-06 10:28:43 -0600

Important Info

  • Client hangs randomly in a call to waitForResult() (no guarantee that it will happen)
  • Hanging is influenced by placement of calls to sleep()
  • Recording the topics indicated that the hanging happens when waitForResult() is called before the result is published.
  • I am using Ubuntu 14.04 and ROS Indigo.

I am debugging a problem where an action client will hang while waiting for a result from the action server. After a lot of experimenting, I've narrowed it down to some sort of synchronization issue. One, which is influenced by placing sleep statements in different locations.

In the node which sends the goal, I have the following code:

  action_client_.sendGoal(goal);
  // ros::Duration(0.5).sleep();  // Enabling this line increases the probability to hang. Though, it is random.
  ROS_INFO("Waiting for result");
  action_client_.waitForResult();
  ROS_INFO("Result received");

In the action server node, I have the following code:

  // ros::Duration(0.001).sleep();  // Enabling this line significantly decreases the likelyhood of hanging
  ROS_INFO("Sending result");
  action_server_.setSucceeded(result_);
  ROS_INFO("Result sent");

Possible Sources of Error

  • OS scheduling - The OS could be putting the client thread to sleep after (or inside) the call to sendGoal() and then the action server is able to finish before the client thread can call waitForResult()
  • Bad time stamp - The time stamp for the result message could be incorrectly set. It appears as though the time stamp is evaluated while determining if a result was received.
  • Resource deadlock - The backtrace from GDB shows that the client is waiting on a mutex. Perhaps a race condition has created a deadlock.

GDB Backtrace in Client when hanging

#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007ffff71fc0e9 in bool boost::condition_variable_any::do_wait_until<boost::unique_lock<boost::mutex> >(boost::unique_lock<boost::mutex>&, timespec const&) ()
   from /home/avidbots/Dev/avidbots/ros_indigo/devel/lib/libavidbots_intra_sector_planner.so
#2  0x00007ffff71f4403 in bool boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex> >(boost::unique_lock<boost::mutex>&, boost::posix_time::ptime const&) ()
   from /home/avidbots/Dev/avidbots/ros_indigo/devel/lib/libavidbots_intra_sector_planner.so
#3  0x00007ffff71e9090 in bool boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex>, boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000l> >(boost::unique_lock<boost::mutex>&, boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000l> const&) ()
   from /home/avidbots/Dev/avidbots/ros_indigo/devel/lib/libavidbots_intra_sector_planner.so
#4  0x00007ffff7259ddb in actionlib::SimpleActionClient<avidbots_connection_planner::ConnectionPlannerAction_<std::allocator<void> > >::waitForResult(ros::Duration const&) ()
   from /home/avidbots/Dev/avidbots/ros_indigo/devel/lib/libavidbots_intra_sector_planner.so
---Type <return> to continue, or q <return> to quit---

Is this a bug in ROS, or am I doing something wrong?

edit retag flag offensive close merge delete

Comments

It could be a corner case that is not covered as most actions take some time. As an action is implemented on topics (which take some time be set up) it's not very suprising that there can be timing issues in extreme cases. Why is your action terminating so fast?

NEngelhard gravatar imageNEngelhard ( 2017-04-06 10:01:02 -0600 )edit

I am querying the global planner. The global planner, very quickly determines it cannot plan the goal accepted due to a violation (which it is expected to handle).

cpagravel gravatar imagecpagravel ( 2017-04-06 11:20:47 -0600 )edit