Can ROS2 services still be expected to be 'flakey'?

asked 2023-01-26 07:12:28 -0500

jbp gravatar image

Hello,

I have stumbled upon a problem, when trying to use services in ROS2 humble. I haven't tried other distros.

In short: My service-clients occasionally doesn't receive a service-response, when I have multiple clients calling the same service at the same time. It can also happen with 1 client however it occurs more rarely and also depends on which DDS/RTPS vendor I use.

I have scoured varies ROS2-related packages, issue lists and so forth. I found this issue, which seems to be a similar issue. From the history, then I can see that they have added 'flakey' services to the list of known issues for ROS2 foxy. Yet, the later released distributions' release notes doesn't contain this as a known issue. I can see that some of the related subissues are open.
Are anyone aware whether this behavior is still to be expected?

If yes, are you aware of any good workarounds? Currently, the only way I have found is to wait for a timeout, prune pending requests and try again. However, this is not a particular great pattern for my application.

Thanks in advance! Let me know if you need more details.

Recreate issue To recreate the issue, then compile the provided package cpp_pubsub and execute the following commands in 3 different terminals:

ros2 run cpp_pubsub service --ros-args -r __node:=service --log-level debug

ros2 run cpp_pubsub client --ros-args -r __node:=client1 --log-level debug

ros2 run cpp_pubsub client --ros-args -r __node:=client2 --log-level debug

My platform

Ubuntu 22 kernel 5.15.0-58-generic
I also use rmw_cyclonedds_cpp as the RMW_implementation

Alternatively you can see the package files here

client_member_function.cpp:

#include <chrono>
#include <functional>
#include <memory>
#include <string>
#include "rclcpp/rclcpp.hpp"
#include <cpp_pubsub/srv/node_call.hpp>
int main(int argc, char * argv[])
{
    rclcpp::init(argc, argv);
    std::shared_ptr<rclcpp::Node> node = rclcpp::Node::make_shared("client");
    auto client =node->create_client<cpp_pubsub::srv::NodeCall>("/service");
    client->wait_for_service();
    uint limit = 500000;
    for(size_t i = 0; i < limit ; i ++)
    {
    auto request = std::make_shared<cpp_pubsub::srv::NodeCall::Request>();
    request->node_name = node->get_name();
    client->wait_for_service();
    auto shared_future = client->async_send_request(request);
   if (rclcpp::spin_until_future_complete(node, shared_future, std::chrono::seconds(10)) == rclcpp::FutureReturnCode::SUCCESS)
    {
        std::stringstream str;
        str << "Received response " << i;
        RCLCPP_INFO(rclcpp::get_logger("rclcpp"), str.str().c_str());
    } else {
        RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed to call service. Stopping execution now");
        return 1;
    }
   }
   rclcpp::shutdown();
   return 0;
 }

service_member_function.cpp

#include <functional>
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include <cpp_pubsub/srv/node_call.hpp>
class MinimalService : public rclcpp::Node
{
public:
    MinimalService()
    : Node("Service")
    {
        service_ = create_service<cpp_pubsub::srv::NodeCall>("/service", std::bind(&MinimalService::callback,this, std::placeholders::_1,std::placeholders::_2));
    }

private:
    void callback(cpp_pubsub::srv::NodeCall::Request::ConstSharedPtr req ,  cpp_pubsub::srv::NodeCall::Response::SharedPtr)
    {
        RCLCPP_INFO(get_logger(), "Received request from %s",req->node_name.c_str());
    }
    rclcpp::Service<cpp_pubsub::srv::NodeCall>::SharedPtr service_;

};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<MinimalService>();
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.8)
project(cpp_pubsub)

if(CMAKE_COMPILER_IS_GNUCXX OR CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  add_compile_options(-Wall -Wextra -Wpedantic)
endif()

# find dependencies
find_package(ament_cmake REQUIRED)
# uncomment ...
(more)
edit retag flag offensive close merge delete

Comments

I can confirm that your code is not working reliably on a default ROS2-humble installation with FastDDS or CyclonDDS (using docker). After a few seconds of running it, it starts to "stutter", sometimes one or both clients even crash with the error message: [ERROR] [1674831280.798867979] [rclcpp]: Failed to call service. Stopping execution now after receiving some 1000-messages successfully before.

I am also very interested in a workaround. Your code does not compile on foxy, but I would like to test that as well.

brean gravatar image brean  ( 2023-01-27 09:06:38 -0500 )edit

Thanks for trying it out, Brean.

By examining the logs I found that the client never receives the response as an event. The server reports that it was sent. However, the stuck client would still receive responses from the second node that's still running.

It's an odd behaiviour and I expect it to be a bug in the implementation. With the current interfaces for client/services I find it difficult to detect if it should occur unless I implement a timeout that waits 'long enough'. However this approach is not particular great in a application, where timing is critical.

I tried looking into accessing the rmw_request_id_t that follows the request, but I haven't found a way to access it on the client side(expect the sequence number). If I could get it, then it would be easy to wrap the service/client in a class that would ...(more)

jbp gravatar image jbp  ( 2023-01-27 09:25:18 -0500 )edit