Autoware lidar_localizer not working with ndt_gpu on V100 GPU
This is a question regarding the Autoware localizer.
It works perfectly with the default PCL NDT library, i.e. method_type 0 for the ndt_matching node. The ndt_gpu library also works fine on one of my computers with a GTX 1070 GPU, but on another machine with NVIDIA Tesla V100 GPUs it fails. Here is a detailed report:
I am using the latest Autoware Docker image from https://gitlab.com/autowarefoundation/autoware.ai/docker at commit 0506e18f66834c8557aee43a64a0a87c1e8635f0.
To reproduce the issue, I follow the instructions here: https://gitlab.com/autowarefoundation/autoware.ai/autoware/wikis/ROSBAG-Demo
In the launch file install/lidar_localizer/share/lidar_localizer/launch/ndt_matching.launch I set <arg name="method_type" default="2" /> to enable pcl_anh_gpu. As soon as I start it, I get the following error:
Error: out of memory /home/autoware/Autoware/src/autoware/core_perception/ndt_gpu/src/VoxelGrid.cu 181
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
  what(): boost: mutex lock failed in pthread_mutex_lock: Invalid argument
[ndt_matching-6] process has died [pid 33002, exit code -6, cmd /home/autoware/Autoware/install/lidar_localizer/lib/lidar_localizer/ndt_matching __name:=ndt_matching __log:=/home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6.log].
log file: /home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6*.log
My output for nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   41C    P0    58W / 300W |   3292MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0    38W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
I have 4 GPUs with 16 GB of memory each, so there is no way it should be running out of memory.
The strange thing is that on my GTX-1070 with 8GB of memory, this runs without any issue.
The contents of the log file /home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6.log are:
Log file:
method_type: 2
use_gnss: 1
queue_size: 1
offset: linear
get_height: 1
use_local_transform: 0
use_odom: 0
use_imu: 0
imu_upside_down: 0
imu_topic: /imu_raw
localizer: velodyne
(tf_x,tf_y,tf_z,tf_roll,tf_pitch,tf_yaw): (1.2, 0, 2, 0, 0, 0)
Update points_map.
--
Upon analyzing the code and debugging different scenarios with different maps, this is what I have found out so far:
The CMakeLists.txt in ndt_gpu sets the CUDA architecture incorrectly:
if ("${CUDACAPABILITYVERSION}" MATCHES "^[1-9][0-9]+$") set(CUDAARCH "sm${CUDACAPABILITYVERSION}") else () set(CUDAARCH "sm52") endif ()
It should be sm_70 for the V100, but CUDA_CAPABILITY_VERSION is empty on my system, so sm_52 is being used. I have set it to sm_70 and compute_52, but the problem persists.
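For reference, here is a minimal sketch of how Volta code generation could be forced in ndt_gpu's CMakeLists.txt. The variable names follow the FindCUDA conventions already used there (CUDA_ARCH, CUDA_NVCC_FLAGS), but exactly where these flags get wired in is my assumption, not the upstream fix:

# Assumption: emit both Volta SASS and PTX so the kernels are not built
# only for sm_52 when running on a V100 (compute capability 7.0).
set(CUDA_ARCH "sm_70")
list(APPEND CUDA_NVCC_FLAGS
  "-gencode=arch=compute_70,code=sm_70"       # native Volta machine code
  "-gencode=arch=compute_70,code=compute_70"  # PTX for forward compatibility
)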
On the code side, the error can occur at different points depending on the map, but it always either fails with out-of-memory or gets stuck in what looks like an infinite loop in the buildParent() kernel.
Not knowing CUDA at all myself, my hunch is that either i) the CUDA compile flags are being set incorrectly for the V100 architecture, or some other initialization is required in the code, or ii) since the V100 has many more cores, some sort of dynamic parallelization is causing problems.
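Regarding hypothesis i), one way to check the flags on the build machine would be to let FindCUDA detect the installed GPUs at configure time instead of relying on CUDA_CAPABILITY_VERSION. This is only a sketch of that idea (cuda_select_nvcc_arch_flags comes from CMake's stock FindCUDA module), not a confirmed fix:

find_package(CUDA REQUIRED)
# "Auto" queries the GPUs present on the build machine and returns the
# matching -gencode flags (compute capability 7.0 on a V100).
cuda_select_nvcc_arch_flags(DETECTED_ARCH_FLAGS Auto)
message(STATUS "Detected nvcc arch flags: ${DETECTED_ARCH_FLAGS}")
list(APPEND CUDA_NVCC_FLAGS ${DETECTED_ARCH_FLAGS})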
If anyone has run into this problem, I would appreciate some help. I can also file a bug report on the Autoware GitLab if that is what is needed.
Thank you
Asked by villie on 2019-09-10 01:36:46 UTC
Answers
@villie - Please see the answer https://answers.ros.org/question/329433/ndt_matching-gpu-version-died/ and the issue https://gitlab.com/autowarefoundation/autoware.ai/core_perception/issues/9. I know these are not very helpful at this time but it is where we are regarding this issue until a CUDA expert comes along and helps us resolve them.
Answered by Josh Whitley on 2019-09-11 16:51:30 UTC
Comments
I posted an answer below but I just want to check - can you please run the following:
And report the output?
Asked by Josh Whitley on 2019-09-11 16:52:36 UTC