How to debug this? Callbacks stop.
I have a very standard looking mobile base controller written in Python. It subscribes to cmd_vel messages and publishes odometry. Odometry is published from inside a loop that includes rate.sleep and the cmd_vel callback talks directly to the hardware. Both have lots of debug messages inside so I can know what they are doing. It runs about 10 to 20Hzon a Raspberry Pi 3B and uses about 6% of a processor core. It seems to run fine and the robot base moves and turns as expected.
Here is the problem: It only works for a while, perhaps 5 to 10 minutes of random driving around and then the hardware motor controller never gets another update and the robot just keeps driving with the last motor speeds until power cycled.
At first, I suspect the motor controller but no...
I see (using rostopic echo /cmd_vel) thatcmd_vel messages are being published correctly and no odometry is beibg published. It seems like the base controller is terminated. But no...
"rosnode ping" works as does rosnode info. I se that my controller node is running and subscribed to cmd_vel but the callback is not being called and the loop that runs the odometry publisher is not running either
The effect is as if the node were stuck inside spinOnce, not frozen because ping and info work, just stuck.
The odd thing is that is all works fine for a while, then stops. I start it all with one launch file and "control-C" stops everything
QUESTION: How do I debug this? I have many debug log messages and none are logged because I assume the callback and lop are not running.
The effect is exactly as if by base controler node were stuck in
As this is running over wireless: can you run a
rostopic echo
(and/or arosbag record
) of thecmd_vel
topic on the robot to see whether that also stops?Also:
independent of this problem: I would really recommend to implement a command time-out in your mobile base driver. That is something all mobile robots ..
.. should have and prevents communication issues from doing any real damage with robots repeating or continuing the last received command.
Yes, I have a timeout. If no cmd_vel messages are received I send a stop command to the hardware. But as I wrote, the while-rosok, dostuff() rate..sleep() loop does not run so time outs are not detected. (will write a watchdog node that monitors odometry and compares to cmd_vel. kills power)
It was not clear to me from your description that event processing / callback queues stop working. I had understood that it just appeared that no callbacks were being called / messages were not being received.
Do you have any other nodes on that system that exhibit the same problem? If you ..
Yes. I have monitored cmd_vel. the published cmd_vel is good and reacts to sensors. rosnode info shows the base controller is subscribed to cmd_vel. There is no difference between when it works and not that I can see with rostopic echo cmd_vel. rosnode ping works on all nodes.
.. disable parts of the base driver (ie: writing to/reading from hw fi), does this also occur?
How does your driver communnicate with the motor controller? Memory reads/writes? Serial port? Something else? If a serial port read hangs in a callback (for instance), that could explain this.
can you add how you reached this conclusion?
Yes, I agree it ONLY APPEARS as if event processing / callback queues stop . Only this one node has this problem. If this were C++ I'd suspect a pointer killed code inside a ROS library but this is Python.
My question is what to look at? I'm stumped. Nodes answer pings, cmd_val is published