6

Speeding up OpenCV on ARM?

asked 2011-06-15 11:02:29 -0600

this post is marked as community wiki

This post is a wiki. Anyone with karma >75 is welcome to improve it.

Hi all,

I'm working with OpenCV on a Gumstix, and it is incredibly slow. For instance, running the "GoodFeaturesToTrack" function in a node runs at 1.5Hz, and requires 85% processor utilization on a Gumstix Overo Tide. I used a standard download and compile of the ROS vision_opencv package to get these results. Any ideas on how to optimize? It seems that even a lowly 700MHz processor should be a bit faster than what I'm getting.

Thanks!


Update

I'm looking into Ash Charles' suggestion to add a flag to use the neon coprocessor. I'm not quite sure how to do this. I'm assuming that I make a change to my opencv2 makefile, but would appreciate some direction on that.


Update #2

I'm working on implementing RahulP's suggestion. So, in the vision_opencv package, I added a flag to CMakeLists.txt, which now looks like this:

cmake_minimum_required(VERSION 2.4.6)
include($ENV{ROS_ROOT}/core/rosbuild/rosbuild.cmake)

set(ROSPACK_MAKEDIST true)
set(EXTRA_C_FLAGS_RELEASE "${EXTRA_C_FLAGS_RELEASE} -O2 -mfpu=neon")

rosbuild_make_distribution(1.4.3)

Then, I ran rosmake vision_opencv which was successful. However, running the code showed no improvement whatsoever in execution speed (as measured by running rostopic hz /output).

Did I skip a step here? Or do I need to be looking into profiling? I tried using oprofile, but didn't really get anywhere productive. If you have any suggestions, they would be much appreciated.


Potential Update/Question

If I were to go to my OpenCV directory, and type in:

./configure -mfpu=neon

followed by running rosmake, would that have the desired effect?


Update

I've uploaded all of my code to GitHub. The main program that I'd like to optimize is in nodes/find_laser_filtered.py. One easy thing to do is to stop filtering lines by angle; that was mostly for debugging purposes on the desktop.

If you get an opportunity to take a look, please suggest any ideas for improvement you may have. I'd really like to get this running at 15+ Hz, and I'm currently at about 2 Hz...
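
In per-frame terms, the gap between the current and target rates can be worked out with plain arithmetic. This is a rough sketch only; the function names are made up for illustration.

```python
def frame_budget_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz


def required_speedup(current_hz, target_hz):
    """Factor by which per-frame processing time has to shrink."""
    return target_hz / float(current_hz)


current_hz, target_hz = 2.0, 15.0
print("current: %.1f ms per frame" % frame_budget_ms(current_hz))
print("target:  %.1f ms per frame" % frame_budget_ms(target_hz))
print("needed speedup: %.1fx" % required_speedup(current_hz, target_hz))
```

So going from about 2 Hz to 15 Hz means cutting per-frame time from roughly 500 ms to under 67 ms, a factor of 7.5.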

Thanks!


Comments

What is meant by "didn't really get anywhere productive"? Do you mean that you couldn't find a bottleneck, or that you had trouble with oprofile itself? If it's the latter, I'd suggest sysprof, which is much easier to use.
Asomerville ( 2011-10-24 09:21:42 -0600 )
I had trouble with oprofile itself. I'll take a look at sysprof tomorrow and post results.
Bradley Powers ( 2011-10-24 09:23:20 -0600 )
I looked at the recommended gcc flags for VFP and NEON support. I believe "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -O3" would be best. '-O3' should do some auto-vectorizing. Unfortunately, this will still be soft float (-mfloat-abi=softfp), as hard float is an ongoing challenge.
Ash Charles ( 2011-10-25 05:07:06 -0600 )
@Bradley, @Ash, the last "update" and Ash's comment should be put in an answer so that others trying to solve the same problem will know where to look without reading the entire comment history. I'd suggest cleaning up the question as well.
Asomerville ( 2011-10-25 10:52:56 -0600 )

6 Answers

2

answered 2011-06-17 04:15:31 -0600

Ash Charles

updated 2011-10-25 13:20:19 -0600

I agree this seems slow, so two thoughts crossed my mind.

  1. Check what is actually using all the processor time with something like oprofile. The processor shouldn't be doing tons of math or swapping of buffers, but knowing what it is doing is a first step.
  2. Hard-float capabilities are provided by the NEON and VFP extensions. IIRC, CMake should pick up flags from the environment, so try adding '-mfpu=neon' to CXXFLAGS prior to recompiling.
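
Before rebuilding, it may be worth confirming that the kernel actually reports NEON on the board. A minimal sketch, assuming a 32-bit ARM Linux system where /proc/cpuinfo has a "Features" line; the helper names are made up for illustration. It only shows what the hardware offers and says nothing about whether OpenCV was built to use it.

```python
def cpu_features(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    return set()


def has_neon(cpuinfo_text):
    """True if the 'neon' flag appears in the Features line."""
    return "neon" in cpu_features(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except IOError:
        text = ""  # file not available (for example, not running on Linux)
    print("NEON reported by kernel: %s" % has_neon(text))
```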

The onboard DSP provides powerful capabilities when used with GStreamer; this may also give some benefit.


Comments

Interesting, I'll look into this. Thank you!
Bradley Powers ( 2011-06-17 05:35:33 -0600 )
See http://elinux.org/BeagleBoard/GSoC/2010_Projects/OpenCV for a GSoC project on using the DSP and OpenCV together.
Eric Perko ( 2011-06-17 05:44:03 -0600 )
1

answered 2011-11-14 16:53:38 -0600

I have done a lot of optimization work on OpenCV, on ARM and on other platforms. The main gains came from reading the OpenCV code and optimizing it for my particular use. For example, there is often some sort of "if (8-bit color) or (16-bit color)" check in a function that I call millions of times. I never mix color depths at random, so I can simply remove the "if". Doing this in as many places as possible usually leads to a big boost past a certain point, because the compiler then finds ways to optimize the code further, mostly by inlining, computing constants at compile time, and so on. All kinds of magic with GCC flags usually gives a gain in the range of 1% to 5%, no more, assuming at least -O2 is already in place. So just cut and paste the parts of OpenCV you use into your own code and then strip them down as much as you can. Do not worry too much; those algorithms are usually just a few hundred lines.
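
The shape of that change can be sketched in Python. The payoff described above comes from the C++ compiler inlining and folding constants once the branch is gone, which Python will not do. Both functions are hypothetical stand-ins, not OpenCV code.

```python
def scale_generic(pixels, depth):
    """General version: checks the bit depth again for every pixel."""
    out = []
    for p in pixels:
        if depth == 8:
            out.append(p / 255.0)
        elif depth == 16:
            out.append(p / 65535.0)
        else:
            raise ValueError("unsupported depth: %r" % depth)
    return out


def scale_8bit(pixels):
    """Specialized version: the caller guarantees 8-bit input, so no branch."""
    inv = 1.0 / 255.0  # constant computed once, outside the loop
    return [p * inv for p in pixels]


pixels = [0, 64, 128, 255]
print(scale_generic(pixels, 8))
print(scale_8bit(pixels))
```

The two versions agree up to floating-point rounding for 8-bit input; the specialized one gives up generality in exchange for a simpler inner loop.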


Comments

Interesting. At the moment, my code uses very little actual OpenCV code: cv_bridge, CvtColor, CreateMemStorage, Smooth, Canny, HoughLines2, and Line. I'm also using the Python bindings, and I'm not sure how bad that is in terms of slowdowns.
Bradley Powers ( 2011-11-14 20:37:43 -0600 )
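
One way to find out where the time goes is to time each stage of the pipeline separately. A minimal sketch, assuming each stage can be wrapped as a plain function; `time_stages` is a made-up helper, and the stages are placeholders working on a list of integers, not the real OpenCV calls.

```python
import time
from collections import OrderedDict


def time_stages(stages, data, repeats=5):
    """Run each (name, func) stage in order, feeding each output to the next.

    Returns an OrderedDict mapping stage name to mean seconds per call.
    """
    totals = OrderedDict((name, 0.0) for name, _ in stages)
    for _ in range(repeats):
        value = data
        for name, func in stages:
            start = time.time()
            value = func(value)
            totals[name] += time.time() - start
    return OrderedDict((name, t / repeats) for name, t in totals.items())


# Placeholder stages standing in for calls such as CvtColor, Smooth and Canny.
stages = [
    ("convert", lambda img: [v // 3 for v in img]),
    ("smooth", lambda img: [(a + b) // 2 for a, b in zip(img, img[1:] + img[-1:])]),
    ("edges", lambda img: [abs(a - b) for a, b in zip(img, img[1:] + img[-1:])]),
]
timings = time_stages(stages, list(range(10000)))
for name, seconds in timings.items():
    print("%-8s %.3f ms" % (name, seconds * 1000.0))
```

If one stage dominates, that is the one to look at first; if the times are spread evenly, overhead outside the OpenCV calls becomes a more likely suspect.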
1

answered 2011-10-05 10:18:51 -0600

RahulP

The easiest way to do this is to trick the build setup by making the modification below in CMakeLists.txt.

# Other optimizations 
if(USE_O2)

    set(EXTRA_C_FLAGS_RELEASE "${EXTRA_C_FLAGS_RELEASE} -O2 -mfpu=neon")

endif()

Don't forget to turn ON the USE_O2 option while configuring your build. This modification only applies to earlier versions of OpenCV. The latest version of OpenCV (version 2.3.1) supports building with NEON enabled directly via CMake.


Comments

So, I added set(EXTRA_C_FLAGS_RELEASE "${EXTRA_C_FLAGS_RELEASE} -O2 -mfpu=neon") to my CMakeLists.txt, and then ran rosmake vision_opencv. I didn't use the if statement, as I'm compiling this specifically for the ARM. I'm not super familiar with this whole system. Is that the right way to do it?
Bradley Powers ( 2011-10-24 07:16:48 -0600 )
0

answered 2013-02-21 21:49:18 -0600

I represent a company called Uncanny Vision. Our first product, UncannyCV, is an OpenCV-like image processing and vision library optimized for Cortex-A series ARM processors. We use all kinds of techniques, such as NEON instructions, algorithmic optimizations, and cache optimization, to get a good speed-up. The performance gains vary from algorithm to algorithm: sometimes as low as 2x, sometimes as high as 20x.

For good features to track, we get nearly an 8x improvement over standard OpenCV on a Cortex-A8. On a BeagleBoard-xM with a single-core Cortex-A8 running at 1 GHz, we can process a VGA frame in 45 ms (of course, the performance is also somewhat image dependent), meaning 22 fps. At 720 MHz (the case discussed above), this would translate to 62.5 ms.
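
The 62.5 ms figure follows from assuming that frame time scales inversely with clock speed, which ignores memory bandwidth and cache effects. A sketch of that arithmetic, with a made-up helper name:

```python
def scale_frame_time(ms_at_ref, ref_mhz, target_mhz):
    """Estimate frame time at another clock, assuming time scales as 1/clock."""
    return ms_at_ref * ref_mhz / float(target_mhz)


ms_720 = scale_frame_time(45.0, 1000.0, 720.0)
print("%.1f ms per frame, about %.0f fps" % (ms_720, 1000.0 / ms_720))
```

Under that assumption, 62.5 ms per frame corresponds to about 16 fps at 720 MHz.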

For non-commercial purposes (university development projects), we can provide a trial version of the library for free. You can visit our website and drop us an email.

0

answered 2011-11-15 15:49:47 -0600

I would try using the DSP core: OpenCV DSP acceleration.

0

answered 2011-06-17 06:44:25 -0600

I would suggest profiling with either Valgrind or Gprof to see what the primary bottleneck might be. Once you see which part is slowing execution the most, you may be able to narrow it down to specific issues, which are easier to research when it comes to compile flags.
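
Since the node in question is written in Python, the standard-library cProfile module is another option. It also reports time spent inside built-in and extension functions called from Python. A minimal sketch, where `process_frame` is a hypothetical stand-in for the node's per-frame processing:

```python
import cProfile
import pstats


def process_frame(n):
    """Placeholder workload standing in for the real per-frame processing."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    process_frame(10000)
profiler.disable()

# Print the five entries with the largest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

For a whole script, running it as `python -m cProfile -s cumulative script.py` gives a similar table without editing the code.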


Stats

Asked: 2011-06-15 11:02:29 -0600

Seen: 9,807 times

Last updated: Feb 21 '13