
why is my code slower online than offline?

asked 2014-03-03 15:06:19 -0500

brice rebsamen

This is a rather intricate problem. I've spent quite some time on it already. First I'll describe the setup, then some observations, and some possible explanations. There is no precise question here, but I am hoping somebody will be able to shed some light on this issue.

I have some code that processes Velodyne data to detect and track obstacles. I can run it offline, for profiling and debugging purposes, or online.

Offline means that I get the data from a bag. I collect all the required data (the actual points, the necessary transforms, etc.), and when all the data is ready I pass it to my obs_detect function. This does not use any ROS messaging. A single process reads the data from the bag and processes it.

Online, the data comes from the sensors (or from bag files that I am playing back). I'm using multiple processes to preprocess the raw Velodyne data and create the TF data. The obstacle detection node subscribes to the Velodyne points in the main thread, accumulates them to form a whole spin, and transforms the spin to my fixed frame (think /odom). The actual obstacle detection happens in a separate thread, which is notified through a condition variable that a spin is available for processing. I'm using shared pointers to pass the data between the two threads. Here is a short synopsis of what the node does:

tf::TransformListener tf_listener;
ObstacleDetector obs_detector(&tf_listener);
pcl::PointCloud<VelodynePointType>::ConstPtr spinToProcess;

void callback(const pcl::PointCloud<VelodynePointType>::ConstPtr & spin) {
  // accumulate the points to form a spin
  // transform the spin to the desired frame
  // store in spinToProcess
  // notify the processing thread
}

void thread_func() {
  while(ros::ok()) {
    // wait for the condition variable
    // run the obstacle detection on spinToProcess
    // publish the results
  }
}

int main() {
  boost::thread thread(thread_func);

  // subscribe to the velodyne data, etc.
}

NOTE: I also tried using a multithreaded spinner but it did not help.

Here are some experimental results and my interpretation:

1- Offline it takes 40ms. Online, the same section of code takes 55ms (about 40% more)!

I'm measuring wall time. Looking at the timing of all the operations that go into detecting the obstacles, I can see that all of them are a fraction slower online than offline. So the slowness is distributed over the whole computation. And those timings do not consider the deserialization of the message.

I also measured the CPU time of the processing thread with "getrusage(RUSAGE_THREAD, &usage)". Offline there is not much difference between wall time and CPU time. Online, CPU time is about 10ms less than wall time.

That seems to indicate that the processing thread is interrupted.
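
For reference, a comparison of wall time and per-thread CPU time along these lines could look like the sketch below. It assumes Linux (RUSAGE_THREAD is a Linux extension), and `Timing`, `thread_cpu_seconds` and `time_section` are hypothetical helper names, not part of the original node.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RUSAGE_THREAD is a Linux/glibc extension
#endif
#include <sys/resource.h>
#include <sys/time.h>

#include <chrono>
#include <thread>

// Wall-clock and CPU time spent in a section, in seconds.
struct Timing {
  double wall_s;
  double cpu_s;
};

// CPU time (user + system) consumed so far by the calling thread.
double thread_cpu_seconds() {
  struct rusage usage {};
  getrusage(RUSAGE_THREAD, &usage);
  auto to_seconds = [](const timeval& tv) {
    return static_cast<double>(tv.tv_sec) + tv.tv_usec * 1e-6;
  };
  return to_seconds(usage.ru_utime) + to_seconds(usage.ru_stime);
}

// Runs `work` on the calling thread and reports both timings.
template <typename F>
Timing time_section(F&& work) {
  const auto wall_start = std::chrono::steady_clock::now();
  const double cpu_start = thread_cpu_seconds();
  work();
  const double cpu_end = thread_cpu_seconds();
  const auto wall_end = std::chrono::steady_clock::now();
  return {std::chrono::duration<double>(wall_end - wall_start).count(),
          cpu_end - cpu_start};
}
```

A gap between `wall_s` and `cpu_s` means the thread spent that time off the CPU (preempted, sleeping, or blocked on a lock or I/O). A section that only sleeps, for example, shows a large wall time and almost no CPU time.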

2- If I run the offline version and the online version concurrently, the offline code takes about 60ms and the online code 70ms (using a total of 6 cores out of 8).

This could be the result of contention for shared hardware resources (SSE registers, cache, ...), which could also be the reason why the processing thread is interrupted.

3- Last year (around ... (more)


2 Answers


answered 2014-03-03 19:09:38 -0500

ahendrix

It doesn't look like the version of ros_comm (roscpp, rosconsole, message_filters) that was released into Fuerte has changed since May of last year, and even then, the most recent changes were mostly limited to Python libraries and some small bits of the C++ service implementation.

I don't think the message generation has changed in a very long time.

That leaves TF, PCL and Eigen as the primary culprits here.

  • The version of TF in Hydro was reimplemented on top of TF2. It's a pretty major rewrite, so it's possible that TF is slower in Hydro. I don't believe this port made it back to Groovy or Fuerte, so I'm not sure it's causing your slowdown there. In theory, TF2 should be faster than TF.
  • It looks like PCL was last released into Fuerte 6 months ago, and from digging through the release repository history, it looks like SSE was disabled in PCL around May 28th.
  • If you've changed Ubuntu versions, it's possible that you're using a newer version of Eigen that is somehow slower. Given that Eigen is a stable, established library, this seems unlikely.

Given those options, the most obvious change here is the patch disabling SSE in PCL.


answered 2014-03-03 20:28:12 -0500

updated 2014-03-03 20:38:14 -0500

One thing that is easy to check and change is the tf poll time used by the waitForTransform calls that you are probably making (explicitly or implicitly). By default, the poll time is 0.01s (10ms), which means that tf only checks every 10ms whether a transform is available. In some usage scenarios this can significantly reduce the amount of data that can be processed in the thread that calls waitForTransform.
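
To illustrate why the poll interval matters, here is a simplified model of a polling wait. It is not tf's implementation: `polled_wait_ms` is a hypothetical function that only assumes availability is checked at t = 0, poll, 2*poll, and so on, until the data is there or the timeout has passed. In tf itself the interval is, as far as I know, the polling_sleep_duration argument of waitForTransform.

```cpp
// Simplified model of a polling wait (all times in milliseconds).
// The data becomes available `ready_after_ms` after the call, and availability
// is checked at t = 0, poll_ms, 2 * poll_ms, ... Returns the time at which the
// wait returns. Assumes poll_ms > 0.
int polled_wait_ms(int ready_after_ms, int poll_ms, int timeout_ms) {
  int t = 0;
  while (t < timeout_ms && t < ready_after_ms) {
    t += poll_ms;  // not ready yet: sleep for one poll interval
  }
  return t;
}
```

In this model a transform that arrives 1ms after the call costs a full 10ms with a 10ms poll interval, but only 1ms with a 1ms interval. With a processing budget of a few tens of milliseconds per spin, that difference is of the same order as the slowdown being investigated.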



Asked: 2014-03-03 15:06:19 -0500

Seen: 469 times

Last updated: Mar 03 '14