
Converting 2D image pixel coordinates (x, y) to a 3D world coordinate (quaternion)

asked 2020-12-16 05:06:40 -0500

Ranjit Kathiriya

updated 2021-02-02 10:22:52 -0500

Hi there,

I want to obtain a quaternion from the x, y, z position given by a depth camera. I am new to ROS, and I am using an Intel RealSense.

Things Done:

  1. I have integrated yolov5 with ROS, and I am obtaining bounding boxes from it:

     header:
       seq: 10
       stamp:
         secs: 1608116230
         nsecs: 742268562
       frame_id: "detection"
     image_header:
       seq: 0
       stamp:
         secs: 1608116230
         nsecs: 742275953
       frame_id: ''
     bounding_boxes:
       - probability: 0.378662109375
         xmin: 626
         ymin: 2
         xmax: 640
         ymax: 64
         id: 0
         Class: "teat"

  2. I am getting a scaled point cloud.

Things I want to do:

  1. I want to find the corresponding location in the point cloud, based on the yolov5 detections.

  2. I want output like in this video:
    https://www.youtube.com/watch?v=iJWHR...

  3. I want to obtain the detected object's quaternion so my robot arm can move to that particular location.

What approach should I follow?


2 Answers


answered 2020-12-16 14:20:40 -0500

RobertWilbrandt

I am not sure I completely understand what your goal looks like, but from what I understand these points could be of interest to you:

  • If you want to get the full 3D position of the object, take a look at the align_depth parameter and the aligned_depth_to_color topics of the realsense driver. This allows you to easily read out the depth associated with pixels in your color image (for more information see this post). Beware that depth images might just contain 0s where no depth information is available (a small sketch of reading out the depth is given after this list).
  • In order to get the transform to the object (or only the orientation quaternion you asked for), take a look at the image_geometry package. It contains classes that can use the information from your camera's camera_info topic to e.g. calculate the ray defined by a pixel inside your bounding box (see the projectPixelTo3dRay() function). Based on this, you should be able to calculate any information you need to plan a robot movement.
  • Always useful when working with streamed images: Use the image_transport package to make handling image messages easier.
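A minimal sketch of reading the depth at a pixel from the aligned depth image, in Python for brevity (it assumes the realsense2_camera driver with align_depth enabled and the usual 16UC1 encoding in millimetres; the topic name is taken from later in this thread and may differ in your setup):

    import numpy as np
    import rospy
    from cv_bridge import CvBridge
    from sensor_msgs.msg import Image

    bridge = CvBridge()

    def depth_at_pixel(depth_msg, u, v):
        """Return the depth in metres at pixel (u, v), or None if no data is available."""
        # "passthrough" keeps the original encoding (typically 16UC1 in millimetres)
        depth = bridge.imgmsg_to_cv2(depth_msg, desired_encoding="passthrough")
        d = float(depth[v, u])   # note: row index is v, column index is u
        if d == 0.0:             # 0 means "no depth information" at this pixel
            return None
        return d / 1000.0 if depth.dtype == np.uint16 else d

    def depth_cb(msg):
        d = depth_at_pixel(msg, msg.width // 2, msg.height // 2)
        if d is not None:
            rospy.loginfo("Depth at image center: %.3f m", d)

    rospy.init_node("depth_lookup_example")
    rospy.Subscriber("/camera/aligned_depth_to_color/image_raw", Image, depth_cb)
    rospy.spin()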

With these, you should be able to write a node that does this:

  • Keeps an image_geometry::PinholeCameraModel, and listens to the camera_info topic to update it.
  • Uses image_transport::Subscribers to listen to both the color and the aligned depth image from your camera and keeps the most recent message of each.
  • Listens to your yolov5 output. For every bounding box, you can sample some pixels inside it, read their depth from the aligned depth image, use these to get some sort of "object depth", and then use the camera model to get the object position. In a simple case, you might just sample some points around the center of the bounding box, use the mean of their depth values (filtering out 0s) as the depth, and multiply the projected ray from your camera model by this to get the 3D vector from your camera to the object (see the node sketch after this list).
  • Does some fancy visualization, e.g. in your video you can see some custom visual markers to show the object estimation (the yellow cube). I usually recommend using rviz_visual_tools for these simple tasks.
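A rough sketch of such a node, written in Python for brevity (the classes named above are the C++ equivalents). The detection message type is assumed to be darknet_ros_msgs/BoundingBoxes, matching the fields shown in the question, and the topic names are assumptions that may need to be adapted:

    import numpy as np
    import rospy
    from cv_bridge import CvBridge
    from darknet_ros_msgs.msg import BoundingBoxes  # assumed detection message type
    from image_geometry import PinholeCameraModel
    from sensor_msgs.msg import CameraInfo, Image

    class ObjectLocator(object):
        def __init__(self):
            self.bridge = CvBridge()
            self.model = PinholeCameraModel()
            self.have_info = False
            self.depth = None
            rospy.Subscriber("/camera/color/camera_info", CameraInfo, self.info_cb)
            rospy.Subscriber("/camera/aligned_depth_to_color/image_raw", Image, self.depth_cb)
            rospy.Subscriber("/yolov5/bounding_boxes", BoundingBoxes, self.boxes_cb)

        def info_cb(self, msg):
            self.model.fromCameraInfo(msg)  # keep the camera model up to date
            self.have_info = True

        def depth_cb(self, msg):
            # keep the most recent aligned depth image as a numpy array in metres
            img = self.bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")
            self.depth = img.astype(np.float32) / 1000.0 if img.dtype == np.uint16 else img

        def boxes_cb(self, msg):
            if self.depth is None or not self.have_info:
                return
            for box in msg.bounding_boxes:
                u = (box.xmin + box.xmax) // 2
                v = (box.ymin + box.ymax) // 2
                # sample a small window around the center and average the valid depths
                window = self.depth[max(v - 2, 0):v + 3, max(u - 2, 0):u + 3]
                valid = window[window > 0.0]
                if valid.size == 0:
                    continue  # no usable depth for this box
                d = float(np.mean(valid))
                ray = self.model.projectPixelTo3dRay((u, v))
                ray = [r / ray[2] for r in ray]  # rescale the ray so that z == 1
                x, y, z = (d * r for r in ray)   # 3D position relative to the camera
                rospy.loginfo("%s at (%.2f, %.2f, %.2f) m", box.Class, x, y, z)

    if __name__ == "__main__":
        rospy.init_node("object_locator")
        ObjectLocator()
        rospy.spin()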

I hope this gives an overview of the basic steps required. You might need some simple things like filtering to keep your estimate from jumping around too much (I have had that happen with realsense depth images before), but this should become apparent once you see your first estimates. If you want to actually move your robot to the detected position, you might also need to use TF2 to get your estimate into a suitable base frame.

If I missed some points from your question or if you have remaining questions, feel free to ask.


Comments

Thanks for giving me the details, it really means a lot to me. My main aim is to identify the detected objects and move my robot arm to them. I think following these steps will help me a lot.

  1. How will I track the object when it moves? Is it tracked automatically, or do I need to apply another technique for it?

  2. Just correct me if I am wrong: I have to take the center point of the bounding box, i.e. X = (X + X_W)/2, Y = (Y + Y_H)/2, and pass this X and Y to the projectPixelTo3dRay() function, so I will be able to calculate the information for the robot movement.

  3. Do I not need to find the pose of the object? Will it be calculated automatically?

  4. Correct me if I am wrong: I do not need the Z location. Is it calculated automatically from the "object depth" ...

Ranjit Kathiriya ( 2020-12-17 09:38:56 -0500 )

Hey,

  1. I am not quite sure what you mean by "tracking". The node (in the form I described it) would constantly update its estimate of the object position. In most cases you would just publish that as a tf (using a tf2_ros::TransformBroadcaster; see the sketch below this list). This way, you could create another node to handle the actual robot movement (getting its target using a tf2_ros::TransformListener from your first node) without having to care about how the position estimate is obtained.
  2. That's correct, for the message you described (the one from yolov5) you would just use [(xmin+xmax)/2, (ymin+ymax)/2]. This might not get the best results for some objects (imagine something like a donut with a hole in the middle: Using the center point to calculate the depth will give you unusable results), but for many simple objects like the one in the video this should be fine.
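To illustrate point 1, a minimal sketch of publishing the estimated position as a tf, assuming the position was computed in the camera's optical frame; the frame names are assumptions, and the orientation is left as the identity quaternion since only a position is estimated here:

    import rospy
    import tf2_ros
    from geometry_msgs.msg import TransformStamped

    rospy.init_node("object_tf_broadcaster")
    br = tf2_ros.TransformBroadcaster()

    def publish_estimate(point, stamp):
        """Publish the estimated object position (x, y, z in the camera frame) as a tf."""
        t = TransformStamped()
        t.header.stamp = stamp
        t.header.frame_id = "camera_color_optical_frame"  # assumed camera frame
        t.child_frame_id = "detected_object"
        t.transform.translation.x = point[0]
        t.transform.translation.y = point[1]
        t.transform.translation.z = point[2]
        t.transform.rotation.w = 1.0  # identity orientation
        br.sendTransform(t)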
RobertWilbrandt ( 2020-12-17 14:37:11 -0500 )

3./4. What would automatically calculate it? The projectPixelTo3dRay() function will give you a vector with z=1 (as described in the documentation). This vector only points in the correct direction, but you still need to scale it to get the actual position of the object. This is why you need to read the depth value from the aligned depth image. Given the direction vector v=(v_x,v_y,1) from the projectPixelTo3dRay() function and a depth of d from the aligned depth image, d * v will give you the position of the object relative to the camera.
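A small worked example of this scaling (the numbers are made up purely for illustration):

    # direction from projectPixelTo3dRay(), rescaled so that z == 1
    v = (0.10, -0.05, 1.0)
    # depth read from the aligned depth image at that pixel, in metres
    d = 0.80
    # position of the object relative to the camera, in metres
    position = tuple(d * c for c in v)  # (0.08, -0.04, 0.80)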

In general: This will only give you a position (as opposed to a pose, which typically also includes an orientation) relative to the camera. You typically use TF2 to convert this into a position relative to the robot base, which can then be used to control a robot.

RobertWilbrandt ( 2020-12-17 14:44:02 -0500 )

Thanks for the reply and for clarifying my points

Actually, the use case for my project is to build a robotic milking system, just like Lely: https://www.youtube.com/watch?v=4ahez...

Until now I have just identified bounding boxes in ROS. If I now follow your steps, will I be able to achieve my use case?

Ranjit Kathiriya ( 2020-12-18 06:22:55 -0500 )

Hey, that actually sounds like a really interesting use case. I think the general procedure should work, but just a few considerations:

  • Make sure you consider the minimum range of the realsense, i.e. test whether you actually get usable depth information out of it in the position you plan to mount it. Under a certain distance (depending on the specific model) you will just get zeroes.
  • If you plan to move the robot in a compliant way, it might make sense to just ignore the depth given by realsense for the control. This would mean: You still calculate the target point with the depth data, but ignore the z-coordinate of the resulting point for your actual arm movement. This might make your approach more robust, but obviously only if you have some way to measure the force on your end effectors.

Apart from that I don't really have ...

RobertWilbrandt ( 2020-12-21 10:15:19 -0500 )

Hey,

Thank you so much for the help, I will try this and give you an update.

thanks

Ranjit Kathiriya ( 2020-12-22 05:29:36 -0500 )

Hey Robert,

Correct me if I am wrong. From point 3 of your answer: "Always useful when working with streamed images: Use the image_transport package to make handling image messages easier."

A. Here I have to take 3 channels:

  1. Depth channel: /camera/aligned_depth_to_color/image_raw
  2. Color channel: /camera/color/image_raw
  3. Detected object channel: /darknet_ros/detection_image (just a color image with bounding boxes drawn on the detected objects)

Then compress these three channels using image_transport. Can I compress them directly by using this command in a launch file (Example 1 in this link)?

rosrun image_transport republish raw in:=camera/image out:=camera/image_repub

Just like this in launch file:

 <node name="depth_republish" type="republish" pkg="image_transport" output="screen" args="raw in:=/camera/aligned_depth_to_color/image_raw out:=/camera/aligned_depth_to_color/comp_image_raw" />

 <node name="color_republish" type="republish" pkg="image_transport" output="screen" args="compressed in:=/camera/color/image_raw raw out:=/camera/color/comp_image_raw" />

Or do I have to ...

Ranjit Kathiriya ( 2021-01-04 04:31:38 -0500 )

@Ranjit the image_transport suggestion was strictly targeting your C++ code; you don't need any republisher node in this setup. You would use the provided image_transport::ImageTransport class in your custom node for receiving the camera images; take a look at these usage instructions or this tutorial. The big advantage of this is that you don't actually have to care about the specific image transport (such as the compression strategy), as this is handled transparently for you. You would use one instance of this class for each image you want to receive.

As for your 3 channels listed: Are you sure you need to listen to your /darknet_ros/detection_image? For all the calculations a simple subscriber to the associated bounding_boxes topic should suffice (ignore this if you want to use it for debugging purposes, that might still make a lot of sense).

RobertWilbrandt ( 2021-01-06 09:45:41 -0500 )

answered 2021-02-03 02:46:50 -0500

Ranjit Kathiriya

updated 2021-03-15 15:59:07 -0500

The final answer for this question is given here, with a comparison of the depth and point cloud techniques.

A) image_transport side

Step 1: Write a node in C++ that subscribes to two channels: 1. the depth channel (/camera/aligned_depth_to_color/image_raw) and 2. the color channel (/camera/color/image_raw).

Step 2: Then use the image_transport::ImageTransport class and publish the compressed version of both channels.

B) image_geometry Side

Step 1: Create a node in Python and subscribe to both compressed channels (color and depth) and the camera_info channel.

Step 2: Use the camera_info message to initialize the image_geometry PinholeCameraModel. Note: this only needs to execute once.
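A minimal sketch of this one-time initialization (the camera_info topic name is the usual one for the realsense colour stream and may differ in your setup):

    import rospy
    from image_geometry import PinholeCameraModel
    from sensor_msgs.msg import CameraInfo

    rospy.init_node("camera_model_init")
    # block once until a single CameraInfo message arrives, then initialize the model
    info = rospy.wait_for_message("/camera/color/camera_info", CameraInfo)
    model = PinholeCameraModel()
    model.fromCameraInfo(info)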

Step 3: Calculate the centroid of the bounding box, [(xmin+xmax)/2, (ymin+ymax)/2]. Take care that the depth value at the center pixel is not 0. If it is, try to use another pixel close to that 0 pixel.

Step 4: Now pass the X, Y of the center of the bounding box to the projectPixelTo3dRay((X, Y)) function. I will get a vector v as output whose Z value is 1.

Step 5: Given a depth d from the aligned depth channel (image), d * v will give you the object's position relative to the camera.

Note: Here, try to use the point cloud instead of the depth channel. I have compared the results of both and can say that the point cloud gives a better Z than the aligned depth. Moreover, the depth channel gives null coordinates more often than the point cloud.
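A sketch of reading the XYZ for a single pixel directly from the point cloud instead of the depth image, assuming the cloud is organized (height > 1) and registered to the colour image; the topic name is the usual one for the realsense driver and is an assumption:

    import math
    import rospy
    from sensor_msgs import point_cloud2
    from sensor_msgs.msg import PointCloud2

    def xyz_at_pixel(cloud_msg, u, v):
        """Return (x, y, z) at pixel (u, v) of an organized cloud, or None if invalid."""
        gen = point_cloud2.read_points(cloud_msg, field_names=("x", "y", "z"),
                                       skip_nans=False, uvs=[(u, v)])
        x, y, z = next(gen)
        if any(math.isnan(c) for c in (x, y, z)):
            return None
        return (x, y, z)

    def cloud_cb(msg):
        p = xyz_at_pixel(msg, msg.width // 2, msg.height // 2)
        if p is not None:
            rospy.loginfo("Center point: (%.3f, %.3f, %.3f) m", *p)

    rospy.init_node("cloud_lookup_example")
    rospy.Subscriber("/camera/depth/color/points", PointCloud2, cloud_cb)
    rospy.spin()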

C) tf2_ros side: I am not fully clear about this flow, but I am trying my best to understand it.

Step 1: Continuously publish the estimated pose for a detected object via a tf2_ros TransformBroadcaster.

Step 2: This way, you could create another node to handle the actual robot movement (getting its target using a tf2_ros::TransformListener from your first node) without having to care about how to get the position estimate.
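A sketch of the listener side for the robot-movement node, assuming the object transform is published as "detected_object" and the robot base frame is called "base_link" (both frame names are assumptions):

    import rospy
    import tf2_ros

    rospy.init_node("move_to_object")
    buf = tf2_ros.Buffer()
    listener = tf2_ros.TransformListener(buf)

    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        try:
            # target position of the detected object expressed in the robot base frame
            t = buf.lookup_transform("base_link", "detected_object",
                                     rospy.Time(0), rospy.Duration(0.5))
            p = t.transform.translation
            rospy.loginfo("Target in base_link: (%.3f, %.3f, %.3f)", p.x, p.y, p.z)
            # ...hand this target to your motion planning here...
        except (tf2_ros.LookupException, tf2_ros.ConnectivityException,
                tf2_ros.ExtrapolationException):
            pass
        rate.sleep()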

Thanks, @RobertWilbrandt, for helping in my use case.

My final output is given in the below image.


Comments

Would you share your code with me? @Ranjit

Fawkess14 ( 2021-03-14 12:04:14 -0500 )

Hi @Ranjit, thanks for your post. I am working on a very similar project where I detect objects and track them using yolo + (some tracking algorithm). I am using the virtual Kinect camera in Gazebo for now for testing purposes (later I will replace it with an Intel RealSense or another camera with more range). My approach is: 1. get the centre of the bounding box; 2. get the point cloud for that pixel by subscribing to the "pointcloud2" topic; 3. get XYZ from the point cloud using the method "point_cloud2.read_points"; 4. use the tf2 package to transform camera coordinates to inertial coordinates. I tested code from the internet doing up to step 3 (https://www.programmersought.com/a...), but I am not sure if this way is correct, since your way is different. Why did you not use the point cloud directly?

imousa ( 2021-03-14 16:50:59 -0500 )

Hello @imousa,

First things first: if your output is good, then this can also be a valid way. No way is wrong or right, it depends on the output.

3. get XYZ from the point cloud using the method "point_cloud2.read_points"; 4. use the tf2 package to transform camera coordinates to inertial coordinates. I tested code from the internet doing up to step 3 (https://www.programmersought.com/a...)

Can you explain these steps in more depth, if possible? And please attach the link properly.

Why did you not use point cloud directly?

Reason: For my object detection, I need to use an image stream, so that was the main reason I was using the color channel instead of the point cloud.

Ranjit Kathiriya ( 2021-03-16 11:22:48 -0500 )

Sorry for the broken link. It is [https://www.programmersought.com/arti... ]. What I tried to do is match the pixel of interest (the centre of the bounding box) with the corresponding point cloud point I acquired from the read_points method. This directly gives me a 3D point, in the camera frame, for the centre of the object. The issue is that I don't know how to associate the pixel with the correct point cloud point, since the datatypes of the two are different. Second, there is the transformation from the image frame to the world frame (I am not sure if the point cloud axes are aligned with the world frame). The repo link you attached is great, thank you, but I tried using some of the code and I got this error: 'PointFromPixel' object has no attribute 'depth'. Sorry, I don't know how to attach my code.

imousa ( 2021-03-16 14:28:00 -0500 )

Hello @imousa,

I don't know how to attach my code

Create a new question, attach your code and a screenshot, and explain in detail what output you want and what you are getting. I will try my best to answer your question, or someone more experienced than me will be able to answer it.

Ranjit Kathiriya ( 2021-03-19 10:45:01 -0500 )

Hi @Ranjit and @imousa, this looks very interesting, and actually I need to do something similar: (1) I want to use yolov3 or the TensorFlow API to find the center of the detected object, and (2) get XYZ from these points. Can you PLEASE help me, just in case you have any code for this?

P.S - I don't need point clouds or depth

Anukriti ( 2021-10-21 09:13:21 -0500 )

Hello @Anukriti,

Have a look at this Yolact ROS 3D lightning talk. I think this will solve your problem. In the case of code, you can also have a look at depth_yolact_ros.

Ranjit Kathiriya ( 2021-10-21 09:21:01 -0500 )

I don't need point clouds or depth

You can do it with a 2D camera, but your accuracy will be very low. For proper accuracy, you need a 3D camera.

(1) I want to use yolov3 or TensorFlow API to find the center of the detected object

You can't use the yolov3 or TensorFlow libraries to find a particular centre point directly. You can get the bounding box of an object, and then you have to find the centre point manually from your bounding box values, e.g. [(xmin+xmax)/2, (ymin+ymax)/2].

(2) get XYZ from these points can you PLEASE help me?

What camera are you using?

Ranjit Kathiriya ( 2021-10-21 09:22:57 -0500 )
