Autoware lidar_localizer not working with ndt_gpu on V100 GPU
This is a question regarding the Autoware localizer.
It works perfectly with the default PCL NDT library, i.e. method_type 0 for the ndt_matching node. The ndt_gpu library also works fine on one of my machines with a GTX 1070 GPU, but on another machine with an NVIDIA Tesla V100 it fails. Here is a detailed report:
I am using the latest Autoware Docker image from https://gitlab.com/autowarefoundation... at commit 0506e18f66834c8557aee43a64a0a87c1e8635f0.
To reproduce the issue, I follow the instructions here: https://gitlab.com/autowarefoundation...
In the launch file install/lidar_localizer/share/lidar_localizer/launch/ndt_matching.launch I set <arg name="method_type" default="2"/> to enable pcl_anh_gpu. As soon as I start it, I get the following error:
    Error: out of memory /home/autoware/Autoware/src/autoware/core_perception/ndt_gpu/src/VoxelGrid.cu 181
    terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
      what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument
    [ndt_matching-6] process has died [pid 33002, exit code -6, cmd /home/autoware/Autoware/install/lidar_localizer/lib/lidar_localizer/ndt_matching __name:=ndt_matching __log:=/home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6.log].
    log file: /home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6*.log
The output of nvidia-smi is:
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
    | N/A   41C    P0    58W / 300W |   3292MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
    | N/A   35C    P0    40W / 300W |     11MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
    | N/A   35C    P0    38W / 300W |     11MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
    | N/A   36C    P0    38W / 300W |     11MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
I have 4 GPUs with 16 GB of memory each, so there is no way it should genuinely be running out of memory.
The strange thing is that on my GTX 1070 with 8 GB of memory this runs without any issue.
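To double-check what the process actually sees inside the container, a minimal standalone check like the one below (illustrative only; the file name and build line are mine, not Autoware code) prints the number of visible devices and the free/total memory reported by each one:

    // check_mem.cu -- standalone memory sanity check (illustrative; not part of Autoware).
    // Build inside the container: nvcc check_mem.cu -o check_mem
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed\n");
        return 1;
      }
      std::printf("visible devices: %d\n", count);
      for (int i = 0; i < count; ++i) {
        cudaSetDevice(i);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);  // free/total memory on the currently selected device
        std::printf("device %d: %zu MiB free of %zu MiB\n", i, free_b >> 20, total_b >> 20);
      }
      return 0;
    }

Based on the nvidia-smi output above, device 0 should show roughly 3 GiB already in use and all four devices about 16 GiB total.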
The contents of the log file /home/autoware/.ros/log/d4d41132-d390-11e9-9d98-ac1f6b4112c2/ndt_matching-6.log are:
    Log file:
    method_type: 2
    use_gnss: 1
    queue_size: 1
    offset: linear
    get_height: 1
    use_local_transform: 0
    use_odom: 0
    use_imu: 0
    imu_upside_down: 0
    imu_topic: /imu_raw
    localizer: velodyne
    (tf_x,tf_y,tf_z,tf_roll,tf_pitch,tf_yaw): (1.2, 0, 2, 0, 0, 0)
    Update points_map.
After analyzing the code and debugging different scenarios with different maps, this is what I have found so far:
The CMakeLists.txt of ndt_gpu sets the CUDA architecture incorrectly:
if ("${CUDA_CAPABILITY_VERSION}" MATCHES "^[1-9][0-9]+$") set(CUDA_ARCH "sm_${CUDA_CAPABILITY_VERSION}") else () set(CUDA_ARCH "sm_52") endif ()
It should be sm_70 for the V100, but CUDA_CAPABILITY_VERSION is empty on my system, so sm_52 is being set. I have manually set it to sm_70, and also tried compute_52, but the problem persists.
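Independently of how CUDA_CAPABILITY_VERSION gets populated at configure time, the card itself can be queried directly; a V100 should report compute capability 7.0, i.e. sm_70. A minimal standalone query (illustrative only; the file name and build line are mine):

    // query_cc.cu -- standalone compute-capability query (illustrative; not part of Autoware).
    // Build: nvcc query_cc.cu -o query_cc
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      cudaGetDeviceCount(&count);
      for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // major/minor is the compute capability
        std::printf("device %d (%s): compute capability %d.%d -> sm_%d%d\n",
                    i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
      }
      return 0;
    }

If this prints 7.0 for every device but CUDA_CAPABILITY_VERSION still ends up empty, then whatever populates that variable during the CMake configure step is failing rather than the device itself misreporting.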
On the code side, the error can occur at different points depending on the map, but it always either fails with an out-of-memory error or gets stuck in what looks like an infinite loop in the buildParent() kernel.
Not knowing CUDA at ...
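For what it is worth, "out of memory" is simply the string the CUDA runtime returns for cudaErrorMemoryAllocation, and errors raised inside a kernel only surface at a later API call unless the code synchronizes and checks after each launch. A generic checking pattern like the sketch below (my own illustration, not the ndt_gpu source; the buildParent launch in the trailing comment is only indicative) could be used to pinpoint the first failing operation:

    // cuda_check.cu -- generic error-checking sketch (illustrative; not the ndt_gpu source).
    // Idea: wrap every CUDA call and synchronize after each kernel launch so the
    // first failing operation reports its own file and line.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CHECK_CUDA(call)                                              \
      do {                                                                \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
          std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                       cudaGetErrorString(err_), __FILE__, __LINE__);     \
          std::exit(1);                                                   \
        }                                                                 \
      } while (0)

    int main() {
      // Deliberately over-allocate to show how an out-of-memory error is reported:
      // this prints "CUDA error 'out of memory' at ..." and exits.
      void* p = nullptr;
      CHECK_CUDA(cudaMalloc(&p, (size_t)1 << 40));  // 1 TiB, should fail on any current GPU
      return 0;
    }

    // In ndt_gpu-style code the same pattern would wrap allocations and launches, e.g.
    // (hypothetical, parameters omitted):
    //   buildParent<<<grid, block>>>(...);
    //   CHECK_CUDA(cudaGetLastError());        // catches launch/configuration errors
    //   CHECK_CUDA(cudaDeviceSynchronize());   // catches errors raised inside the kernel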
I posted an answer below, but I just want to check - can you please run the following:
And report the output?