I am not sure I completely understand your goal, but from what I gather these points could be of interest to you:

  • If you want to get the full 3d position of the object, take a look at the align_depth parameter and the aligned_depth_to_color topics of the RealSense driver. These allow you to easily read out the depth associated with pixels in your color image (for more information see this post). Beware that depth images may simply contain 0s wherever no depth information is available.
  • In order to get the transform to the object (or only the orientation quaternion you asked for), take a look at the image_geometry package. It contains classes that use the information from your camera's camera_info topic to e.g. calculate the ray defined by a pixel inside your bounding box (see the projectPixelTo3dRay() function and the short sketch after this list). Based on this you should be able to calculate any information you need to plan a robot movement.
  • Always useful when working with streamed images: Use the image_transport package to make handling image messages easier.
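To give an idea of how the image_geometry part looks in C++, here is a minimal sketch. The topic name is the usual RealSense color camera_info topic and pixelToPoint() is just a hypothetical helper; adapt both to your setup:

```cpp
#include <ros/ros.h>
#include <sensor_msgs/CameraInfo.h>
#include <image_geometry/pinhole_camera_model.h>

image_geometry::PinholeCameraModel cam_model;

// Keep the camera model up to date from the camera_info topic.
void infoCallback(const sensor_msgs::CameraInfoConstPtr& info)
{
  cam_model.fromCameraInfo(info);
}

// Hypothetical helper: convert a pixel (e.g. your bounding box center) plus a
// measured depth into a 3d point in the camera's optical frame. Dividing by
// ray.z makes the scaling independent of whether the ray comes back
// normalized or with z = 1.
cv::Point3d pixelToPoint(double u, double v, double depth_m)
{
  cv::Point3d ray = cam_model.projectPixelTo3dRay(cv::Point2d(u, v));
  return ray * (depth_m / ray.z);
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "pixel_ray_example");
  ros::NodeHandle nh;
  // Topic name assumed for a RealSense color stream; adapt to your camera.
  ros::Subscriber sub = nh.subscribe("/camera/color/camera_info", 1, infoCallback);
  ros::spin();
  return 0;
}
```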

With these, you should be able to write a node that does this:

  • Keeps an image_geometry::PinholeCameraModel, and listens to the camera_info topic to update it.
  • Uses image_transport::Subscribers to listen to both the color image and the aligned depth image from your camera, keeping the most recent message of each.
  • Listens to your yolov5 output. For every bounding box, you can sample some pixels inside it, read their depth from the aligned depth image, combine these into some sort of "object depth", and then use the camera model to get the object position. In a simple case, you might sample a few points around the center of the bounding box, take the mean of their depth values (filtering out 0s) as the depth, and multiply the projected ray from your camera model by it to get the 3d vector from your camera to the object. A sketch of such a node follows this list.
  • Does some fancy visualization, e.g. in your video you can see some custom visual markers showing the estimated object position (the yellow cube). I usually recommend rviz_visual_tools for these simple tasks.
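A rough sketch of such a node, assuming the RealSense default topic names, 16UC1 depth images in millimeters, and that your yolov5 node publishes vision_msgs/Detection2DArray (the topics, the message type and the 5x5 sampling window are assumptions, adapt them to your setup):

```cpp
#include <ros/ros.h>
#include <cv_bridge/cv_bridge.h>
#include <image_geometry/pinhole_camera_model.h>
#include <image_transport/image_transport.h>
#include <sensor_msgs/CameraInfo.h>
#include <vision_msgs/Detection2DArray.h>
#include <geometry_msgs/PointStamped.h>

class ObjectLocalizer
{
public:
  explicit ObjectLocalizer(ros::NodeHandle& nh) : it_(nh)
  {
    info_sub_ = nh.subscribe("/camera/color/camera_info", 1,
                             &ObjectLocalizer::infoCb, this);
    depth_sub_ = it_.subscribe("/camera/aligned_depth_to_color/image_raw", 1,
                               &ObjectLocalizer::depthCb, this);
    det_sub_ = nh.subscribe("/yolov5/detections", 1,
                            &ObjectLocalizer::detectionCb, this);
    point_pub_ = nh.advertise<geometry_msgs::PointStamped>("object_position", 1);
  }

private:
  // Keep the camera model up to date.
  void infoCb(const sensor_msgs::CameraInfoConstPtr& msg)
  {
    model_.fromCameraInfo(msg);
  }

  // Keep the most recent aligned depth image.
  void depthCb(const sensor_msgs::ImageConstPtr& msg)
  {
    depth_ = msg;
  }

  void detectionCb(const vision_msgs::Detection2DArrayConstPtr& msg)
  {
    if (!depth_ || !model_.initialized() || msg->detections.empty())
      return;

    // 16UC1 depth in millimeters is what the RealSense driver publishes by default.
    cv_bridge::CvImageConstPtr depth = cv_bridge::toCvShare(depth_, "16UC1");

    const vision_msgs::Detection2D& det = msg->detections.front();
    const int cx = static_cast<int>(det.bbox.center.x);
    const int cy = static_cast<int>(det.bbox.center.y);

    // Average the valid (non-zero) depth samples in a small window around the center.
    double sum = 0.0;
    int count = 0;
    for (int dy = -2; dy <= 2; ++dy)
      for (int dx = -2; dx <= 2; ++dx)
      {
        const int u = cx + dx, v = cy + dy;
        if (u < 0 || v < 0 || u >= depth->image.cols || v >= depth->image.rows)
          continue;
        const uint16_t d = depth->image.at<uint16_t>(v, u);
        if (d > 0) { sum += d; ++count; }
      }
    if (count == 0)
      return;  // no usable depth in this region

    const double depth_m = (sum / count) / 1000.0;  // mm -> m

    // Ray through the pixel, scaled so that its z component equals the depth.
    const cv::Point3d ray = model_.projectPixelTo3dRay(cv::Point2d(cx, cy));
    const cv::Point3d p = ray * (depth_m / ray.z);

    geometry_msgs::PointStamped out;
    out.header = depth_->header;  // frame of the aligned depth image (color optical frame)
    out.point.x = p.x;
    out.point.y = p.y;
    out.point.z = p.z;
    point_pub_.publish(out);
  }

  image_transport::ImageTransport it_;
  image_transport::Subscriber depth_sub_;
  ros::Subscriber info_sub_, det_sub_;
  ros::Publisher point_pub_;
  image_geometry::PinholeCameraModel model_;
  sensor_msgs::ImageConstPtr depth_;
};

int main(int argc, char** argv)
{
  ros::init(argc, argv, "object_localizer");
  ros::NodeHandle nh;
  ObjectLocalizer localizer(nh);
  ros::spin();
  return 0;
}
```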

I hope this gives an overview of the basic steps required. You might need some simple additions like filtering to keep your estimate from jumping around too much (I have had that happen with RealSense depth images before), but this should become apparent once you see your first estimates. If you want to actually move your robot to the detected position, you might also need TF2 to get your estimate into a suitable base frame.
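For those last two points, a hedged sketch: a simple exponential filter is often enough smoothing, and a TF2 lookup gets the estimate into the base frame. The frame names ("camera_color_optical_frame", "base_link") and the example point are assumptions:

```cpp
#include <ros/ros.h>
#include <geometry_msgs/PointStamped.h>
#include <tf2_ros/transform_listener.h>
#include <tf2_geometry_msgs/tf2_geometry_msgs.h>

// Simple exponential smoothing to keep the estimate from jumping around;
// alpha closer to 1.0 trusts new measurements more.
geometry_msgs::Point smooth(const geometry_msgs::Point& prev,
                            const geometry_msgs::Point& meas, double alpha)
{
  geometry_msgs::Point out;
  out.x = alpha * meas.x + (1.0 - alpha) * prev.x;
  out.y = alpha * meas.y + (1.0 - alpha) * prev.y;
  out.z = alpha * meas.z + (1.0 - alpha) * prev.z;
  return out;
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "transform_example");
  ros::NodeHandle nh;

  tf2_ros::Buffer tf_buffer;
  tf2_ros::TransformListener tf_listener(tf_buffer);

  // Example point in the camera's optical frame; in practice this would be
  // the (smoothed) object position your node estimated.
  geometry_msgs::PointStamped in_camera, in_base;
  in_camera.header.frame_id = "camera_color_optical_frame";  // assumed frame name
  in_camera.header.stamp = ros::Time(0);  // use the latest available transform
  in_camera.point.x = 0.1;
  in_camera.point.y = 0.0;
  in_camera.point.z = 0.8;

  try
  {
    // Transform the estimate into the robot's base frame for motion planning.
    in_base = tf_buffer.transform(in_camera, "base_link", ros::Duration(1.0));
    ROS_INFO("Object in base_link: %.2f %.2f %.2f",
             in_base.point.x, in_base.point.y, in_base.point.z);
  }
  catch (const tf2::TransformException& ex)
  {
    ROS_WARN("Transform failed: %s", ex.what());
  }
  return 0;
}
```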

If I missed some points from your question or if you have remaining questions, feel free to ask.