Nodelet manager throws ros::serialization::StreamOverrunException
Hi, I haven't been able to find enough online material related to serialization overruns, so I'm asking here for the first time. Feel free to point out any improvements that I can make to the question in order to fit the forum's rules.
Setup
I'm working with ROS Kinetic and PCL 1.9.1, running the ROS master on a server and launching the nodes/nodelets in a Docker container on the same machine.
I'm working with 3 KinectOne cameras which run simultaneously for 3D object recognition. My first attempt was to join the three pointclouds and use a single processing pipeline. To speed up the process, I have ported my nodes to nodelets to exploit the zero-copy transport, as I work with 540x960 pointclouds. Furthermore, I decided to make three processing pipelines, one for each camera, as I need to maintain the organized structure of the pointclouds for recognition purposes. This tripled my number of nodelets, and now when I run the program I get the following error:
terminate called after throwing an instance of ros::serialization::StreamOverrunException
what(): Buffer Overrun
[manager-1] process has died [pid 30560, exit code -6, cmd /opt/ros/kinetic/lib/nodelet/nodelet manager __name:=manager __log:=/root/.ros/log/ebd68120-3f4e-11e9-8235-bc305b9d52e9/manager-1.log].
log file: /root/.ros/log/ebd68120-3f4e-11e9-8235-bc305b9d52e9/manager-1*.log
Problem hints
I've tried to run the program adding one step of the process at a time, and the problem arises when adding the following nodelet to the manager: SurfaceSegmentationNodelet. At this point I have 15 nodelets loaded under the same manager, and my goal is to end up with 21 nodelets under one manager. 2 nodelets are required to communicate with each KinectOne, and I have written the other 15, all of which have their own dynamic_reconfigure server, and all of which publish at an approximate rate of 6 Hz. All publishers/subscribers have a queue size of 5. I think the problem might come from any of the following sources:
- I'm asking too much for a single nodelet manager
- I need to increase the queue_size (I've seen this approach in other answers but as I see it, a queue size of five for a publisher that runs at 6 Hz should be enough)
- Having that many reconfigure servers somewhat overloads the manager's communication
Any help will be appreciated, thank you in advance!
UPDATE
Running only 1 pipeline instead of 3 throws no exception. My guess, then, is that the problem is that I launch too many nodelets for a single manager. Can anyone confirm that?
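For context, this is roughly how the two layouts differ in a launch file. The package and nodelet names below are hypothetical sketches, not my actual launch file:

```xml
<!-- One manager per camera pipeline (the layout that does not crash),
     instead of loading all 21 nodelets into a single shared manager.
     "my_pkg" and the node names are illustrative placeholders. -->
<launch>
  <node pkg="nodelet" type="nodelet" name="manager_cam1" args="manager" output="screen" />
  <node pkg="nodelet" type="nodelet" name="seg_cam1"
        args="load my_pkg/SurfaceSegmentationNodelet manager_cam1" />
  <!-- ... repeat the manager + pipeline nodelets for cam2 and cam3 ... -->
</launch>
```

Note that nodelets loaded into different managers live in different processes, so topics between pipelines fall back to regular (serialised) transport; zero-copy only applies within one manager.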
UPDATE #2
I ran both the manager and the nodelets under `gdb` and got the following output (not showing the memory map). Apparently it does come from one of my SurfaceSegmentationNodelets, but I have little to no clue about what is causing the problem.
[pcl::ExtractIndices::applyFilter] The indices size exceeds the size of the input.
[pcl::ExtractIndices::applyFilter] The indices ...
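For reference, a common way to get the manager under `gdb` directly from a launch file is the `launch-prefix` attribute; the node name below is a hypothetical sketch:

```xml
<!-- Runs the nodelet manager inside gdb so a backtrace ("bt" at the gdb
     prompt) is available when the process aborts. -->
<node pkg="nodelet" type="nodelet" name="manager" args="manager"
      launch-prefix="gdb -ex run --args" output="screen" />
```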
It would probably be good to know where exactly that exception is thrown. Have you tried running the manager process in `gdb` and then looking at the backtrace to see what is going on exactly?

@gvdhoorn It seems to come from the suspicious surface segmentation nodelet. However, if I run three managers (one for each camera pipeline), the error does not happen. Any clue?
Sometimes the error is this one, also related to pointers and memory:

[pcl::ExtractIndices::applyFilter] The indices size exceeds the size of the input.
*** Error in `/root/ws/devel/lib/nodelet/nodelet': munmap_chunk(): invalid pointer: 0x00007ffe18013050 ***
======= Backtrace: =========
Looks like indexing into some array or vector is not done correctly.

As to the stacktrace: did you build things with `Debug` symbols enabled? I'm not seeing any line nrs.

I believe I did enable debugging symbols, as I added `-g` to the `CMAKE_CXX_FLAGS`. If it helps, after the memory map I get the following line: But I've read that this is just an issue of libraries compiled in other directories.
I'm not too worried about "other libraries". It's your own for which it would be convenient to have line nrs. Typically gdb shows those if it can.
Another (unrelated) question btw: why are you running this as `root`?

It's likely that this is actually the real issue here. Accessing memory out-of-bounds is a recipe for `SEGFAULT`s. Are you doing any input scaling, or manually setting up arrays/lists/vectors?

In CMake projects (which Catkin projects essentially are), this is more easily done by specifying a build type. You can set it like so: `-DCMAKE_BUILD_TYPE=RelWithDebInfo`. I would actually recommend doing that and choosing `RelWithDebInfo`, especially with pointcloud processing, as otherwise things will most likely be way too slow.

I'm running it in a Docker container and therefore as `root`. I'm just publishing/subscribing to pointclouds and point indices from PCL, and that's why I use nodelets, due to the zero-copy transport. I think that the error comes from `pcl::EuclideanClusterExtraction`; should I build PCL with `-g`?

Seeing as PCL is being used by thousands of people, I would search for bug reports with similar problems as you have. If you can't find those, I'd suspect your own code first (personally I always suspect my own code first).
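For reference, the build-type suggestion above could be pinned in the project's CMakeLists.txt instead of hardcoding `-g` (a sketch; passing `-DCMAKE_BUILD_TYPE=RelWithDebInfo` on the catkin command line works just as well):

```cmake
# Prefer a build type over hand-editing CMAKE_CXX_FLAGS. RelWithDebInfo keeps
# optimisations (-O2) while emitting debug symbols, which matters for
# pointcloud processing throughput.
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE RelWithDebInfo)
endif()
```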
I see various frames in `SurfaceSegmentationNodelet` in your backtrace, which would seem to be your class. Building everything with debugging symbols enabled will certainly make debugging easier. But again, don't add `-g`, just set the CMake build type; CMake will take care of the rest.

Again off-topic, but: you could actually use the `USER` instruction to change the runtime context. That's, I believe, even a best practice/recommendation.

Yep, thanks for the help! I'm currently debugging the way you said. I'll look for it in the PCL forums for sure, though I suspected it had to do with nodelets, Boost pointers and such (which of course would be an issue in my code). I'll try to fix it and post if I find something relevant.
Are your messages all in sync (i.e.: have proper timestamps sufficiently close to be able to correlate msgs temporally)? If so, instead of using the manual synchronisation that you do now, I'd look at `message_filters`.

Another question: why all the copying, and what is the rationale behind the flags system?
Well, it's my first approach at writing nodelets, that might explain a bit. Each nodelet subscribes to 3 topics and requires that all 3 messages have arrived in order to process the data, hence the flag system. The Nodelet class has an infinite-loop thread that checks the flags and processes the data. I also read somewhere that callbacks should not be computationally heavy as they can block a process, so I decided to copy the input messages for later processing. Anyway, I thought that with nodelets I'm just copying shared pointers, so it shouldn't be a big deal, or is it?

I mean, there is not that much info around on how to properly structure nodelets, so after a lot of browsing I dug into some libraries that use nodelets to copy their structure, but yeah, the "flag" system is mine.
Conceptually you're doing something like message_filters/ApproximateTimeSynchronizer. If your messages all have timestamps, I would perhaps take a look at that. It'd be more elegant.
But first debug your code to make it work.
And to answer your question: each node publishes all its messages at once (sequentially inside a function), so I would guess that yes, they are in sync. Thanks for the `message_filters` suggestion, I'm taking a look at it!

Publishing "at the same time" != timestamps being in sync. Please understand the difference: publishing does not set or update a timestamp.
For an example of using `message_filters` in nodelets, you could take a look at stereo_throttle.cpp in `rtabmap_ros`. It clearly shows the synchronised callback. Or any of the nodelets of `image_pipeline`, of course, such as stereo_image_proc/nodelets/point_cloud2.cpp.

An alternative to the original `message_filters`: fkie/message_filters.

Thanks a lot! Yep, I see my mistake with the timestamps. Thanks again!