Why is my code slower online than offline?
This is a rather intricate problem. I've spent quite some time on it already. First I'll describe the setup, then some observations, and some possible explanations. There is no precise question here, but I am hoping somebody will be able to shed some light on this issue.
I have some code that processes Velodyne data to detect and track obstacles. I can run it offline, for profiling and debugging purposes, or online.
Offline means that I get the data from a bag. I collect all the required data (the actual points, the necessary transforms, etc.), and when all the data is ready I pass it to my obs_detect function. This does not use any ROS messaging. A single process reads the data from the bag and processes it.
Online, the data comes from the sensors (or from bag files that I am playing back). I'm using multiple processes to preprocess the raw Velodyne data and create the TF data. The obstacle detection node subscribes to the Velodyne points in the main thread, accumulates them to form a whole spin, and transforms the spin to my fixed frame (think /odom). The actual obstacle detection happens in a separate thread, which is notified through a condition variable that a spin is available for processing. I'm using shared pointers to pass the data between the two threads. Here is a short synopsis of what the node does:
    tf::TransformListener tf_listener;
    ObstacleDetector obs_detector(&tf_listener);
    pcl::PointCloud<VelodynePointType>::ConstPtr spinToProcess;

    // Runs in the main thread (called from ros::spin()).
    void callback(const pcl::PointCloud<VelodynePointType>::ConstPtr & spin)
    {
        // accumulate the points to form a spin
        // transform the spin to the desired frame
        // store in spinToProcess
        // notify the processing thread
    }

    // Runs in the processing thread.
    void thread_func()
    {
        while (ros::ok()) {
            // wait for the condition variable
            obs_detector.detect(spinToProcess);
            // publish the results
        }
    }

    int main()
    {
        boost::thread thread(thread_func);
        // subscribe to the velodyne data, etc.
        ros::spin();
    }
NOTE: I also tried using a multithreaded spinner but it did not help.
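To make the handoff between the two threads concrete, here is a minimal sketch of the pattern described above, written with the standard library instead of ROS and Boost so that it compiles on its own. The names Spin, SpinMailbox and run_demo are made up for illustration, and the choice to overwrite an unprocessed spin when a new one arrives is an assumption, not necessarily what my node does.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for pcl::PointCloud<VelodynePointType>.
using Spin = std::vector<float>;

// Single-slot mailbox: the producer overwrites the pending spin (assumption),
// the consumer blocks until a spin is available or the mailbox is closed.
class SpinMailbox {
public:
    // Called from the subscriber callback (main thread).
    void put(std::shared_ptr<const Spin> spin) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_ = std::move(spin);
        }
        cond_.notify_one();  // wake the processing thread
    }

    // Called at shutdown so that take() stops blocking.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cond_.notify_all();
    }

    // Called from the processing thread. Returns nullptr once the mailbox
    // is closed and no spin is pending.
    std::shared_ptr<const Spin> take() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups.
        cond_.wait(lock, [this] { return pending_ || closed_; });
        std::shared_ptr<const Spin> out;
        out.swap(pending_);  // leave the slot empty for the next spin
        return out;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::shared_ptr<const Spin> pending_;
    bool closed_ = false;
};

// Usage example: one spin of three points goes through the mailbox.
std::size_t run_demo() {
    SpinMailbox box;
    std::size_t points_seen = 0;
    std::thread worker([&] {
        while (auto spin = box.take()) {
            points_seen += spin->size();  // stands in for detect() + publish
        }
    });
    box.put(std::make_shared<Spin>(Spin{1.f, 2.f, 3.f}));
    box.close();
    worker.join();
    return points_seen;
}
```

Only the shared pointer is copied under the lock, so the callback never waits for a detection pass to finish. The processing thread works on its own reference to the spin.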
Here are some experimental results and my interpretation:
1- Offline it takes 40ms. Online, the same section of code takes 55ms (about 40% more)!
I'm measuring wall time. Looking at the timing of all the operations that go into detecting the obstacles, I can see that each of them is a fraction slower online than offline, so the slowness is distributed over the whole computation. These timings do not include the deserialization of the message.
I also measured the CPU time of the processing thread with "getrusage(RUSAGE_THREAD, &usage)". Offline there is not much difference between wall time and CPU time. Online, CPU time is about 10ms less than wall time.
That seems to indicate that the processing thread is interrupted (preempted or blocked) for about 10ms per spin.
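For reference, this is roughly how such a measurement can be taken. It is a sketch, not my actual instrumentation: ThreadSample, sample_now, measure and demo_blocked_work are hypothetical names. RUSAGE_THREAD is Linux-specific and needs _GNU_SOURCE, which g++ normally defines by default. The same getrusage call also reports voluntary and involuntary context switches, which may help tell blocking apart from preemption.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // required for RUSAGE_THREAD (g++ usually defines it already)
#endif
#include <cassert>
#include <chrono>
#include <thread>
#include <sys/resource.h>  // getrusage, RUSAGE_THREAD (Linux only)
#include <sys/time.h>      // timeval

// Hypothetical helper type: timings and context switches for one thread.
struct ThreadSample {
    double wall_s;              // wall-clock seconds
    double cpu_s;               // user + system CPU seconds of the calling thread
    long voluntary_switches;    // thread blocked (I/O, mutex, sleep, ...)
    long involuntary_switches;  // thread was preempted by the scheduler
};

inline double to_seconds(const timeval& tv) {
    return static_cast<double>(tv.tv_sec) + static_cast<double>(tv.tv_usec) * 1e-6;
}

// Snapshot of the calling thread's counters.
ThreadSample sample_now() {
    rusage usage{};
    int rc = getrusage(RUSAGE_THREAD, &usage);
    assert(rc == 0);
    (void)rc;
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    return {std::chrono::duration<double>(now).count(),
            to_seconds(usage.ru_utime) + to_seconds(usage.ru_stime),
            usage.ru_nvcsw, usage.ru_nivcsw};
}

// Runs work() on the calling thread and returns the deltas it caused.
template <typename F>
ThreadSample measure(F&& work) {
    const ThreadSample before = sample_now();
    work();
    const ThreadSample after = sample_now();
    return {after.wall_s - before.wall_s,
            after.cpu_s - before.cpu_s,
            after.voluntary_switches - before.voluntary_switches,
            after.involuntary_switches - before.involuntary_switches};
}

// Usage example: a sleep stands in for time spent off the CPU, so wall
// time grows while the thread's CPU time barely moves.
ThreadSample demo_blocked_work() {
    return measure([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    });
}
```

If the wall/CPU gap comes with a rise in involuntary_switches, the thread is being preempted. A rise in voluntary_switches instead points at the thread blocking on something, such as a lock.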
2- If I run the offline version and the online version concurrently, the offline code takes about 60ms and the online code 70ms (using a total of 6 cores out of 8).
This could be the result of contention for shared hardware resources, like SSE registers, cache, etc., which could also be the reason why the processing thread is interrupted.
3- Last year (around ...