Are my findings about tod_detecting correct?
Step 1: Feature detection and extraction -- Features are detected and extracted from the ROI in the 2D grayscale test image. A single Features2d instance is produced, containing the keypoints and their corresponding feature descriptors. This stage does not use pose information.
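For illustration, here is a minimal sketch of what this stage amounts to in plain OpenCV. The actual detector/extractor in tod_detecting is configurable, so ORB is only a stand-in here, and the ROI covering the whole image is an assumption:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat test = cv::imread("test.png", cv::IMREAD_GRAYSCALE);
    if (test.empty()) return 1;
    cv::Rect roi(0, 0, test.cols, test.rows);  // hypothetical ROI: whole image

    cv::Ptr<cv::ORB> orb = cv::ORB::create();  // stand-in for the configured detector
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    // Detect keypoints and compute descriptors on the ROI; no pose is involved.
    orb->detectAndCompute(test(roi), cv::noArray(), keypoints, descriptors);
    return 0;
}
```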
Step 2: Keypoint matching -- For each object and view in the training base, this stage finds, for every keypoint descriptor in the test image, the k best matches among the feature descriptors of that training view. This stage does not use pose information.
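In OpenCV terms this is a k-nearest-neighbour descriptor match per training view. A sketch using BFMatcher follows; the norm and the default k are assumptions (the matcher used by tod_detecting is configurable):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<std::vector<cv::DMatch>> matchToView(const cv::Mat& testDescriptors,
                                                 const cv::Mat& trainDescriptors,
                                                 int k = 3) {
    cv::BFMatcher matcher(cv::NORM_HAMMING);  // NORM_HAMMING suits binary descriptors
    std::vector<std::vector<cv::DMatch>> knnMatches;
    // For every test descriptor, find its k best matches in this training view.
    matcher.knnMatch(testDescriptors, trainDescriptors, knnMatches, k);
    return knnMatches;
}
```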
Step 3: Clustering matches -- For each training object, all best matches found in the previous stage are clustered by the spatial distance between the corresponding test-image keypoints. This stage does not use pose information.
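One simple way to express such a grouping is cv::partition with a distance predicate; the clustering algorithm tod_detecting actually uses may differ, and the threshold below is an assumption:

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

std::vector<int> clusterByKeypointDistance(const std::vector<cv::Point2f>& testPoints,
                                           float maxDist = 30.0f) {
    std::vector<int> labels;
    // Two keypoints belong to the same cluster if they are within maxDist pixels.
    cv::partition(testPoints, labels,
                  [maxDist](const cv::Point2f& a, const cv::Point2f& b) {
                      return std::hypot(a.x - b.x, a.y - b.y) < maxDist;
                  });
    return labels;  // labels[i] is the cluster index of testPoints[i]
}
```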
Step 4: Guess generation -- Each match has a corresponding 3D point recorded in the Features3d instances. A guess is generated when a good projection (i.e. one with many inliers) of those 3D points (coming from different views of the same object) onto the camera plane can be found, such that the reprojection error with respect to the 2D test-image keypoints is minimized.
The pose estimate obtained in the training phase is essential for bringing these 3D points into a common coordinate system (see GuessGenerator.cpp, line 267 at the time of writing).
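To illustrate why the training pose matters: each view records its 3D points relative to that view, so a stored rigid transform must be applied before the projection test. The pose representation (R, t) and the transform direction in this sketch are assumptions:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Apply an assumed rigid transform (R, t) from the training pose to move a
// view's 3D points into the common object coordinate system.
std::vector<cv::Point3f> toCommonFrame(const std::vector<cv::Point3f>& pts,
                                       const cv::Matx33f& R, const cv::Vec3f& t) {
    std::vector<cv::Point3f> out;
    out.reserve(pts.size());
    for (const cv::Point3f& p : pts)
        out.push_back(cv::Point3f(R * cv::Vec3f(p.x, p.y, p.z) + t));
    return out;
}
```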
(Using object_detection in SVN revision 50425).