ROS communication segfault issue with Python

asked 2018-05-04 22:30:15 -0500

ljk

updated 2018-05-04 22:49:56 -0500

Hi ROS community,

Recently I encountered a very weird segfault with ROS communication in Python.

Here is a brief description of the problem and the communication setup.

I am sending commands from a local PC to control a Fetch robot over a Wi-Fi connection to a local router.

The local PC mainly runs the motion planner and the deep learning code, and sends high-level commands to the robot to execute.

The main local programs (the ones that segfault) are written in Python 2.7. Both the local PC and the Fetch robot run Ubuntu 14.04 with ROS Indigo.

The weird thing about the segfault is that it does not occur consistently: with good luck it does not appear for hours, and with bad luck it shows up a few times within half an hour. It also does not occur at the same line of code, but it mainly happens in two places: one is a forward pass through a deep neural network, which has nothing to do with ROS, and the other is a call to MoveIt to plan a path for the manipulator.

To debug the segfault, I used Python's faulthandler. Below are two instances of the segfault happening at different lines of code:

Instance 1 when planning a path

Instance 2 when deep learning computation takes place

As you can see, even though they happened in different places, both traces point to something related to tcpros and threading, but that tells me very little about what the underlying issue might be (sorry, I am not very familiar with networking or communication).
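For readers who want to collect the same kind of traces, here is a minimal sketch of how faulthandler can be enabled at the start of a node. The helper name `enable_crash_tracebacks` and the log path are hypothetical and not taken from the original program. On Python 2.7 faulthandler is a separate pip-installable backport, while on Python 3 it ships with the standard library.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks of all threads to log_path on a fatal signal.

    Hypothetical helper for illustration. faulthandler handles signals such
    as SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    """
    log_file = open(log_path, "a")
    # all_threads=True dumps every Python thread, not only the crashing one,
    # which matters in a process with many subscriber threads.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this object alive: faulthandler writes to the
    # underlying file descriptor when the crash happens.
    return log_file


if __name__ == "__main__":
    # Assumed location; any writable path works.
    crash_log = enable_crash_tracebacks("/tmp/planner_crash.log")
    # ... start the node and do the actual work here ...
```

Note that these dumps show only the Python-level frames of each thread. They do not show the native (C/C++) frames where a segfault usually originates.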

I suspect it could be due to one of the following: 1. network connectivity issues, 2. ROS communication issues.

Hence, I would much appreciate it if anyone could help narrow down the issue so that I can debug further. Thanks a lot in advance!

Comments

If Python is SEGFAULTing, wouldn't it make more sense to run the Python binary/binaries in gdb? All the Python stacktraces show things I would expect (when using a networking middleware): lots of subscriptions waiting on new data, callbacks being called, etc. That does not mean ..

gvdhoorn (2018-05-05 05:28:54 -0500)

.. the issue is in tcpros_*.py, just that it's being used a lot (note that it could of course be that there is a problem in those classes/files, but seeing the file mentioned a lot does not imply causation).

gvdhoorn (2018-05-05 05:30:19 -0500)

Thanks for the comments, @gvdhoorn! I did try debugging with gdb by following this tutorial, but I didn't get any useful information from it, so I resorted to faulthandler.

ljk (2018-05-05 21:33:51 -0500)

I agree with you that seeing tcpros mentioned many times does not mean it is the cause of the problem. Now I am more suspicious of my deep neural network code in PyTorch, which could potentially be causing the problem.

ljk (2018-05-05 21:38:13 -0500)
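One way to test a suspicion like this is to run the suspect code in a separate Python process, without any ROS nodes, and check how that process exits. The sketch below is only an illustration: the names `run_isolated` and `describe_exit` are hypothetical, and it relies on the POSIX convention that `subprocess` reports death by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def run_isolated(args):
    """Run a fresh Python interpreter with the given arguments.

    Returns the child's return code. On POSIX a negative value -N means
    the child was terminated by signal N.
    """
    return subprocess.call([sys.executable] + list(args))


def describe_exit(returncode):
    """Turn a return code into a short human-readable description."""
    if returncode < 0:
        signum = -returncode
        if signum == signal.SIGSEGV:
            return "killed by SIGSEGV"
        return "killed by signal %d" % signum
    return "exited with status %d" % returncode


if __name__ == "__main__":
    # "forward_pass_only.py" is a hypothetical script that runs only the
    # network's forward pass in a loop, with no ROS imports.
    print(describe_exit(run_isolated(["forward_pass_only.py"])))
```

If the isolated script still dies with SIGSEGV, the ROS communication layer is unlikely to be the cause. If it never crashes, that alone does not clear it, because the crash may depend on the interaction between threads in the full program.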

SEGFAULTs are (nearly) impossible to debug without backtraces, so you'll have to get one in some way. gdb is the easiest.

Python typically (read: in my experience) doesn't SEGFAULT, unless you're doing some strange things interacting with native code.

gvdhoorn (2018-05-06 05:29:03 -0500)
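As a rough sketch of the gdb suggestion above, the helper below builds a command line that runs a Python script under gdb in batch mode and prints a backtrace of every thread once the process stops. The function name `build_gdb_command` and the script name `planner_node.py` are hypothetical. The gdb options used (`-batch`, `-ex`, `--args`) and the `thread apply all bt` command are standard gdb, but whether the backtrace is readable depends on debug symbols being installed for Python and for the native libraries involved.

```python
import subprocess
import sys


def build_gdb_command(script, script_args=(), python_exe=None):
    """Build an argv list that runs `script` under gdb and dumps backtraces.

    Hypothetical helper for illustration; it only assembles the command.
    """
    python_exe = python_exe or sys.executable
    return [
        "gdb", "-batch",               # run the -ex commands, then exit
        "-ex", "run",                  # start the program immediately
        "-ex", "thread apply all bt",  # after it stops: backtrace of every thread
        "--args", python_exe, script,  # everything after --args is the program
    ] + list(script_args)


if __name__ == "__main__":
    cmd = build_gdb_command("planner_node.py")
    print(" ".join(cmd))
    # Uncomment to actually launch it (requires gdb to be installed):
    # subprocess.call(cmd)
```

Because the crash is intermittent, the process may need to run under gdb for a long time before the backtrace is produced. Redirecting the output to a file keeps the backtrace from being lost.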