Ask Your Question
0

[ROS2] Node Becomes Zombie, All Pubs/Subs "Dead" But Node Remains Active

asked 2021-05-12 15:34:24 -0500

Spectre gravatar image

updated 2021-05-12 15:38:20 -0500

I have a node which is responsible for communicating with an Arduino via USB Serial. It seems to have problems only when it is receiving subscription data. I have another node acting as a PID speed controller publishing PWM values, and the Arduino node has two subscribers for these values (one per motor).

Everything seems to start fine, but after reaching some unknown condition the Arduino node just simply stops responding. I cannot find any errors, warnings, etc. in the logs. I haven't observed any caught or uncaught exceptions. In fact, the node still catches keyboard interrupt exceptions and prints a corresponding "I was caught" info message to the ROS log.

Running ros2 node list shows that the node is still alive.

Running ros2 topic list -t shows that subscribers of other nodes are still listening to topics for messages that the Arduino node would normally publish.

Running ros2 doctor --report | less shows that all of the publishers and subscribers within the Arduino node are completely missing.

This is further corroborated by running ros2 topic echo <arduino_published_topic> <message_type> and seeing absolutely nothing whereas previously messages were being published quite frequently. Additionally, echoing topics from other concurrently-running nodes does produce a steady stream of messages.

None of this is running in a simulated manner. I have a robot which contains a RaspberryPi 4 and the Arduino, and I interact with them over wifi. Running all of the above commands on the RaspberryPi 4, as opposed to my main desktop computer, on my home network (i.e. congestion or network management can be ruled-out), shows no difference in behavior.

I'm not sure how much I should copy/paste from my node as it has grown quite long, but below are some relevant tidbits. Let me know if I should elaborate further on anything!

Also one final thing I will mention is that the PID Velocity controller producing the subscribed messages has a much lower publish frequency than any of the publishers and timers in this node (0.2 sec per publish vs 0.05 sec per encoder publish from serial_arbiter, for example).

# "Arduino node", called serial_arbiter.py
class SerialArbiter(Node):

    def __init__(self):
        super().__init__('serial_arbiter')
        # Omitting most parts where I declare params, retrieve param values, spit them to the logger

        # Declare Parameters and default values
        self.declare_parameter('serial_port_path', '/dev/ttyACM0')
        self.declare_parameter('serial_baud_rate', 115200)

        self.arduino = SerialComms(self.serial_port_path, self.serial_baud_rate)
        self.arduino.reset_encoders()

        # Publishers and timers are declared here, but these are the only two subscribers.
        self.subscriber_motor_right = self.create_subscription(Int16, 'set_motor_power_right', self.subscribed_motor_right_cb, 10)
        self.subscriber_motors_left = self.create_subscription(Int16, 'set_motor_power_left', self.subscribed_motor_left_cb, 10)

    # Omitting probably-irrelevant callback functions
    def subscribed_motor_right_cb(self, message):

        try:
            self.arduino.set_motor_power(self.arduino.RIGHT, int(message.data))
        except Exception as e:
            self.get_logger().error("Failed to update Motor 1 power: '{}'".format(e))
            pass

    def subscribed_motor_left_cb(self, message):

        try:
            self.arduino.set_motor_power(self.arduino.LEFT, int(message.data))
        except Exception as e:
            self.get_logger().error("Failed to update Motor 2 power: '{}'".format(e))
            pass ...
(more)
edit retag flag offensive close merge delete

1 Answer

Sort by ยป oldest newest most voted
1

answered 2021-05-16 22:18:10 -0500

Spectre gravatar image

It took a lot of trial and error but I traced the problem to my serial_send function.

The "dead" state of publishers and subscribers was intermittent. Occasionally they were missing, occasionally they were present.

The solution to my problem was to simply add two timeouts to the declaration of the Serial() object:

self.serial_port = serial.Serial(serial_port_path, serial_baud_rate, timeout=0.01, write_timeout=0.01)

The clue for the solution came from using the program minicom to communicate with my Arduino directly. Out of ideas, I tried connecting to the Arduino while the node was active. I rigged my Arduino code to produce something every time it received a message with a carriage-return. I noticed that spamming the Enter key, thereby spamming \r, I would occasionally see the other expected responses for sensor states, encoder states, motor efforts states, etc.

This could only occur if the ROS node was still sending messages. My best guess is that there was a deadlock condition created by my use of Serial.read_until(), assuming that a carriage-return would always be received. I did add a "fallback" of a 30-character limit hoping that this would account for unforeseen issues like this.

Based on the code I wrote I had no real reason to suspect otherwise, and I already knew that my Arduino did send a carriage-return for every response. Additionally that code had a maximum input buffer of 20 characters after which it would check its input, and regardless of its response it would always terminate that response with a carriage-return.

The real key was in adding the write_timeout=0.01 parameter to the constructor. The rest of my ROS node's python code was written to log any caught exceptions to the log, and sure enough "write timeout" exceptions were finally being caught.

Only very occasionally am I observing these errors; I don't want to imply that my code is terribly broken. There is some nuance in the way PySerial works which I might be able to fix with a sleep() in the serial_send function. Currently these timeout exceptions only occur when the serial_arbiter node receives an overwhelming amount of messages, and has a difficult time keeping-up with its callbacks. Now at least the node has some added resiliency.

edit flag offensive delete link more

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

Add Answer

Question Tools

1 follower

Stats

Asked: 2021-05-12 15:34:24 -0500

Seen: 40 times

Last updated: May 16