ROS Resources: Documentation | Support | Discussion Forum | Index | Service Status | ros @ Robotics Stack Exchange
Ask Your Question
0

nav2 troubles all over the place!! Please help (if you can!) :-)

asked 2023-06-27 16:56:27 -0600

Farbod gravatar image

Hello,

I am trying to use nav2 on a real robot, which is running Foxy. The robot hardware and sensors and ROS2 controllers start and run with no problems 95% of the time.

However, nav2 starts and runs 5% of the time, so that I can run SLAM and navigate the robot and map the environment. The rest of 95%, some random thing prevents nav2 from running.

I have included my launch file at the end. But, I am describing the problems first. Here are the problems that randomly arise when I am trying to start nav2. Note that at each trial, the problem is different.

I have the following topics that are published correctly:

prompt$ ros2 topic list
/cmd_vel
/diagnostics
/diffbot_base_controller/cmd_vel_unstamped
/dynamic_joint_states
/forward_position_controller/commands
/imu
/joint_states
/joy
/joy/set_feedback
/left/camera_info
/left/image_rect
/motor_control
/motor_status
/odom
/odometry/filtered
/parameter_events
/right/camera_info
/right/image_rect
/robot_description
/rosout
/scan
/set_pose
/stereo/camera_info
/stereo/converted_depth
/stereo/depth
/stereo/points
/tf
/tf_static

Problem type 1: slam_toolbox cannot register the Lidar.

When I bringup nav2 using my launch file, here is what I get:

prompt$ ros2 launch robot_user_robot_nav nav2_bringup_tb3_revised_launch.py slam:=True
[INFO] [launch]: All log files can be found below /home/robot_user/.ros/log/2023-06-27-15-16-54-669455-robot_user-Xavier-19352
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [sync_slam_toolbox_node-1]: process started with pid [19354]
[INFO] [map_saver_server-2]: process started with pid [19356]
[INFO] [lifecycle_manager-3]: process started with pid [19358]
[INFO] [controller_server-4]: process started with pid [19360]
[INFO] [planner_server-5]: process started with pid [19362]
[INFO] [recoveries_server-6]: process started with pid [19364]
[INFO] [bt_navigator-7]: process started with pid [19366]
[INFO] [waypoint_follower-8]: process started with pid [19368]
[INFO] [lifecycle_manager-9]: process started with pid [19377]
[lifecycle_manager-3] [INFO] [1687897016.045531301] [lifecycle_manager_slam]: Creating
[lifecycle_manager-3] [INFO] [1687897016.079812558] [lifecycle_manager_slam]: Creating and initializing lifecycle service clients
[lifecycle_manager-3] [INFO] [1687897016.153660836] [lifecycle_manager_slam]: Starting managed nodes bringup...
[lifecycle_manager-3] [INFO] [1687897016.153758984] [lifecycle_manager_slam]: Configuring map_saver
[bt_navigator-7] [INFO] [1687897016.290348085] [bt_navigator]: 
[bt_navigator-7]    bt_navigator lifecycle node launched. 
[bt_navigator-7]    Waiting on external lifecycle transitions to activate
[bt_navigator-7]    See https://design.ros2.org/articles/node_lifecycle.html for more information.
[bt_navigator-7] [INFO] [1687897016.290661730] [bt_navigator]: Creating
[planner_server-5] [INFO] [1687897016.328064562] [planner_server]: 
[planner_server-5]  planner_server lifecycle node launched. 
[planner_server-5]  Waiting on external lifecycle transitions to activate
[planner_server-5]  See https://design.ros2.org/articles/node_lifecycle.html for more information.
[planner_server-5] [INFO] [1687897016.331484742] [planner_server]: Creating
[recoveries_server-6] [INFO] [1687897016.527387413] [recoveries_server]: 
[recoveries_server-6]   recoveries_server lifecycle node launched. 
[recoveries_server-6]   Waiting on external lifecycle transitions to activate
[recoveries_server-6]   See https://design.ros2.org/articles/node_lifecycle.html for more information.
[map_saver_server-2] [INFO] [1687897016.592736828] [map_saver]: 
[map_saver_server-2]    map_saver lifecycle node launched. 
[map_saver_server-2]    Waiting on external lifecycle transitions to activate
[map_saver_server-2]    See https://design.ros2.org/articles/node_lifecycle.html for more information.
[map_saver_server-2] [INFO] [1687897016.592960133] [map_saver]: Creating
[lifecycle_manager-9] [INFO] [1687897016.618960328] [lifecycle_manager_navigation]: Creating
[sync_slam_toolbox_node-1] [INFO] [1687897016.675506131] [slam_toolbox]: Node using stack size 160000000
[controller_server-4] [INFO] [1687897016.693737191] [controller_server]: 
[controller_server-4]   controller_server lifecycle node launched. 
[controller_server-4]   Waiting on external lifecycle transitions to activate
[controller_server-4]   See https://design.ros2.org/articles/node_lifecycle.html for more information.
[map_saver_server-2] [INFO] [1687897016.718958248] [map_saver]: Configuring
[controller_server-4] [INFO] [1687897016 ...
(more)
edit retag flag offensive close merge delete

Comments

Just a quick comment: Foxy is EOL. Nav2 on Foxy does not see any updates, and is in fact already quite far behind Humble and Rolling. If you can, I would recommend upgrading. Don't do it in place (so don't remove your Foxy setup), but get a second SSD / SD card / something so you can install 22.04 and Humble on it. Or if you can / are comfortable with it: use Docker.

If the same problems occur with newer versions, it would at least be possible to revert to your older system.

gvdhoorn gravatar image gvdhoorn  ( 2023-06-28 02:40:15 -0600 )edit

I will give that a try by using a second SSD card. I haven't used Docker before. Thanks.

Farbod gravatar image Farbod  ( 2023-06-28 11:44:30 -0600 )edit

2 Answers

Sort by ยป oldest newest most voted
0

answered 2023-06-28 11:41:44 -0600

Farbod gravatar image

Hi.

First, I have to say that the nav2 slam works fine on the first run (95% of times), when the robot is just booted. When I stop the nav2 after brining it up using the launch file using ctrl-C and run it again (after waiting some time until nodes disappear from the list "ros2 node list", I encounter these random problems that I described above.

Here is the result of echoing the transforms:

$ ros2 run tf2_ros tf2_echo odom scan_link
[INFO] [1687963541.009601577] [tf2_echo]: Waiting for transform odom ->  scan_link: Invalid frame ID "odom" passed to canTransform argument target_frame - frame does not exist
At time 1687963541.596396082
- Translation: [-0.020, 0.000, 1.239]
- Rotation: in Quaternion [0.004, 0.000, 1.000, -0.005]
At time 1687963542.595545666
- Translation: [-0.020, 0.000, 1.239]
- Rotation: in Quaternion [0.004, 0.000, 1.000, -0.005]
At time 1687963543.597899375
- Translation: [-0.020, 0.000, 1.239]
- Rotation: in Quaternion [0.004, 0.000, 1.000, -0.005]

Another transform:

$ ros2 run tf2_ros tf2_echo odom base_footprint 2.0
[INFO] [1687968194.440594135] [tf2_echo]: Waiting for transform odom ->  base_footprint: Invalid frame ID "odom" passed to canTransform argument target_frame - fram
e does not exist
At time 1687968194.881034063
- Translation: [-0.023, 0.000, 0.000]
- Rotation: in Quaternion [0.000, 0.000, 0.000, 1.000]
At time 1687968195.385934743
- Translation: [-0.023, 0.000, 0.000]
- Rotation: in Quaternion [0.000, 0.000, 0.000, 1.000]
At time 1687968195.881188582
- Translation: [-0.023, 0.000, 0.000]
- Rotation: in Quaternion [0.000, 0.000, 0.000, 1.000]
At time 1687968196.380709922
- Translation: [-0.023, 0.000, 0.000]
- Rotation: in Quaternion [0.000, 0.000, 0.000, 1.000]

Note that the above command publishes at the rate of 1 by default. Here is the actual rate with which the tf is published:

$ ros2 topic hz /tf
average rate: 56.453
        min: 0.003s max: 0.029s std dev: 0.00917s window: 58
average rate: 56.138
        min: 0.001s max: 0.029s std dev: 0.00901s window: 114
average rate: 56.053
        min: 0.001s max: 0.029s std dev: 0.00904s window: 171
average rate: 55.940
        min: 0.001s max: 0.029s std dev: 0.00903s window: 228
average rate: 55.870
        min: 0.001s max: 0.029s std dev: 0.00904s window: 285
average rate: 55.826
        min: 0.001s max: 0.029s std dev: 0.00902s window: 342
average rate: 55.825
        min: 0.001s max: 0.031s std dev: 0.00907s window: 398

And the scan:

$ ros2 topic hz /scan
average rate: 12.826
        min: 0.072s max: 0.081s std dev: 0.00351s window: 14
average rate: 12.830
        min: 0.072s max: 0.081s std dev: 0.00337s window: 27
average rate: 12.833
        min: 0.072s max: 0.082s std dev: 0.00324s window: 40
average rate: 12.849
        min: 0.072s max: 0.082s std dev: 0.00318s ...
(more)
edit flag offensive delete link more

Comments

At least now we know that there is no broken/orphaned frames in the tf tree. But what I don't know is if the tf is published at 0.9Hz (which 1.0s tolerance is not enough), or 1.1Hz. What is the lowest update rate shown in rqt_tf_tree? Next, we also should check the timestamp of the tf. This can also be seen from rqt_tf_tree or via rqt_monitor: what is the most recent transform (in rqt_tf_tree) or average delay (in tf_monitor) like, especially at the point where the random problems appear?

Guess Giant gravatar image Guess Giant  ( 2023-06-29 09:53:45 -0600 )edit

I tried to upload the PNG file of the transform tree, but apparently I don't have enough credit. :-) Here is what I see in the transform tree next to the branch connecting "odom" to "base_footprint":

Broadcaster: defualt_authority Average rate: 43.061 Buffer length: 0.325 Most recent transform: 1688051850.856241 Oldest transform: 1688051850.53112

The branch from "base_footprint" to "the base_link" shows:

Broadcaster: defualt_authority Average rate: 10000.0 Buffer length: 0.0 Most recent transform: 0.0 Oldest transform: 0.0

All branches that go from "base_link" to "scan_link" show:

Broadcaster: defualt_authority Average rate: 17.769 Buffer length: 0.619 Most recent transform: 1688051850.831185 Oldest transform: 1688051850.212133

After shutting down all nodes, and restarting, all branches that go from "base_link" to "scan_link" show:

Broadcaster: defualt_authority Average rate: 16.883 Buffer length: 0.951 Most recent transform: 1688053650.594053 Oldest transform: 1688053649.643528

Farbod gravatar image Farbod  ( 2023-06-29 10:52:23 -0600 )edit

May I know if base_link to scan_link is supposed to change? i.e. is the lidar moving? Nevertheless, what is the CPU usage like (see htop/ rqt_tf_top)? Could it be that there is bottleneck in calculating the tfs?

Guess Giant gravatar image Guess Giant  ( 2023-06-30 04:10:45 -0600 )edit

Thanks so much for the great ideas that help to get to the bottom of this.

In general, the base_link to scan_link transformation can change, i.e., the lidar can move. However, we are not moving the scan_link with respect to the base_link during navigation and/or mapping. I will check into the CPU usage and get back to you with the result in a couple of days, as I am away from the robot at this time.

Thanks again for all the great help.

Farbod gravatar image Farbod  ( 2023-07-03 09:14:55 -0600 )edit

I checked the CPU usage using htop. During the first couple of minutes after all nav2 nodes are run, none of them use more than 20%-25%. Oddly, after the system runs for a couple of minutes, the two nav2_controller and nav2_planner use 110%-120% CPU. The robot is using a Jetson Xavier, which has 8 CPU cores. Two of them are stuck at 100% at all times (again after a couple of minutes running). The node robot_state_publisher only uses 3.3% CPU, which is (if I am not mistaken) in charge of publishing the /tf and /tf_static topics.

Farbod gravatar image Farbod  ( 2023-07-05 12:17:23 -0600 )edit
0

answered 2023-06-28 02:27:37 -0600

Guess Giant gravatar image

All the problems seem to have one thing in common: tf. For disclosure I use ROS extensively but not ROS 2. I base my opinion based on what I've seen on ROS.

  1. slam_toolbox cannot register the Lidar. It is unclear if the tf tree is complete, in particular, at the launch of slam_toolbox. Try running ros2 run tf2_ros tf2_echo odom scan_link or ros2 run tqt_tf_tree rqt_tf_tree to view the entire tf tree and see if there are any breaks (requires install).

  2. Sometimes, problem 1 does not occur, but "Robot is out of bounds of the costmap!" This is quite straight forward. It just says that the robot is currently not in a free region in the costmap, in particular the obstacle_layer for /scan, and hence is unable to use raytracing to clear or mark obstacles. Moving the initial pose to an free area should fix this.

  3. nav2 can't get past controller server! Not sure, maybe controller waiting for initial tf as well?

  4. nav2 just sits there timing out waiting for transforms forever! Seems like the tf are being published at rates of 1Hz or more. If you want to use them at this rate be sure to increase the transform tolerances (timeout) in the respective nodes... if they are supposed to publish at higher update rates do check what could be throttling them. rqt_tf_tree might help with idenfying slow updating tf.

  5. nav2 drops the "scan_link" for unknown reason forever! Similar to 1. Should find out if there is any break in the tf tree, or maybe slow update rates like in 4. that is causing the timeouts to be reported.

edit flag offensive delete link more

Comments

Hello,

Thanks so much for the quick reply! I will try those for sure, and I think they will help.

Thanks again.

Farbod gravatar image Farbod  ( 2023-06-28 08:44:35 -0600 )edit

I was going to comment on your reply with the results of my test, but the text box was not large enough. Could you please see my reply that I posted as an answer to my own question? Thanks a lot.

Farbod gravatar image Farbod  ( 2023-06-28 11:43:28 -0600 )edit

Question Tools

2 followers

Stats

Asked: 2023-06-27 16:56:27 -0600

Seen: 1,367 times

Last updated: Jun 28 '23