Correspondence-Based Pose Estimation, by Sérgio Agostinho

Sérgio Agostinho will defend his doctoral thesis, titled Correspondence-Based Pose Estimation, on January 27.
Abstract
While technologies like Autonomous Driving and Augmented Reality are yet to reach widespread adoption, their influence is already felt today, and a community of early adopters is rooting for their success. These technologies are driven by complex, interdisciplinary systems that perceive, process and act on their surroundings to provide useful and satisfying experiences for their end-users. The topic addressed in this thesis is one of the many “cogs” driving these complex systems: correspondence-based methods that tackle the long-standing perception problem of pose estimation. In Autonomous Driving, pose estimation provides the information needed to route the vehicle to the user’s final destination. Additionally, the data the vehicle captures of its surroundings is processed to build utility maps that can later be used to navigate it. When LiDARs are used, this usually entails registering consecutive sweeps, which are later used to produce a globally consistent map. In the case of Augmented Reality, precise 3D pose estimation is essential for accurate visual superposition of digital information against the backdrop of the user’s field of view. In both cases, pose estimation plays a critical role in enabling these technologies to provide the ideal user experience.
This thesis focuses on correspondence-based 3D pose estimation problems in Computer Vision applications, particularly those based on keypoints or keylines, i.e., points or lines of special interest, coming from data sources such as 2D images and 3D point clouds. Despite being a heavily researched topic that has attracted growing interest over the past three decades, correspondence-based methods are still one of the most successful and reliable mechanisms for pose estimation. They coexist and integrate gracefully with the latest wave of data-driven techniques, while demonstrating very good generalization capabilities and wide applicability to many real-world situations.
Correspondence-based pose problems follow a usual sequence of steps: identifying keypoints or keylines of interest; establishing matches/correspondences between these; and finding the optimal pose that minimizes a given geometric cost. In this thesis, we present novel research focused on the last two steps, leveraging both points and lines, and using approaches that combine the rigour of the fundamental geometrical principles at play with the huge potential and flexibility provided by data-driven techniques. It is within this context that we present the results for three separate problems: absolute pose estimation between a 3D model and a camera, given 2D-3D correspondences of points and lines; point cloud registration, a problem which seeks the best rigid transformation that aligns two point clouds; and, lastly, keypoint matching for visual localization based entirely on geometry, without relying on visual descriptors.
The literature on point-based correspondence pose estimation methods is abundant. However, there are situations where extracting keypoints is simply not feasible, and the ability to complement these visual cues with natural lines provides an additional degree of robustness. The limited number of methods that address this mixed modality of correspondences do not provide guarantees with respect to the global optimality of the solutions returned. This was the context that led to the development of CvxPnPL, a novel certifiable convex method to estimate 3D pose from mixed combinations of 2D-3D point and line correspondences, solving the Perspective-n-Points-and-Lines (PnPL) problem. We merge the contributions of each point and line into a unified Quadratically Constrained Quadratic Program (QCQP) and then relax it into a Semidefinite Program (SDP) through Shor’s relaxation. In this way, we jointly handle mixed configurations of points and lines in a single computational framework. Furthermore, the proposed relaxation allows recovering a finite number of solutions under ambiguous configurations.
In such cases, the 3D pose candidates are found by further enforcing geometric constraints on the solution space and then retrieving such poses from the intersections of multiple quadrics. The choice of a convex formulation makes the method insensitive to initialization and provides the theoretical framework for a posteriori validation of globally optimal solutions. As such, CvxPnPL is the first certifiable convex method for solving non-minimal PnPL problems. While we are competitive against other methods in the presence of point correspondences, we achieve state-of-the-art performance when only lines are available, reducing the median translation and rotation errors by 7.4% and 4.5%, respectively.
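For readers unfamiliar with Shor’s relaxation, the generic lifting step can be sketched as follows. This is the textbook form for a homogeneous QCQP; the symbols x, Q_0, Q_i and b_i are illustrative placeholders and are not the specific cost and constraint matrices used in CvxPnPL.

```latex
\begin{aligned}
\text{QCQP:}\quad & \min_{x \in \mathbb{R}^n} \; x^\top Q_0 x
  \quad \text{s.t.} \quad x^\top Q_i x = b_i, \quad i = 1, \dots, m, \\
\text{lifting:}\quad & X = x x^\top \;\Rightarrow\; x^\top Q_i x = \operatorname{tr}(Q_i X), \\
\text{SDP:}\quad & \min_{X \in \mathbb{S}^n} \; \operatorname{tr}(Q_0 X)
  \quad \text{s.t.} \quad \operatorname{tr}(Q_i X) = b_i, \quad X \succeq 0 .
\end{aligned}
```

The relaxation drops the non-convex constraint rank(X) = 1. If the SDP optimum X* happens to have rank one, then X* = x* x*^T and x* is a globally optimal solution of the original QCQP, which is what makes an a posteriori optimality check possible.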
In our second contribution, we tackle data-driven 3D point cloud registration with fully end-to-end correspondence-based networks. This type of approach aims to be a more computationally efficient alternative to the traditional pipelines that rely on RANSAC. Until differentiable Singular Value Decomposition (SVD) was added to commonly used deep learning frameworks, most data-driven point cloud registration methods applied supervision at the correspondence level, since there was no convenient mechanism to backpropagate end-to-end from pose error. With a differentiable SVD available, end-to-end supervision became easily accessible. However, supervision based exclusively on pose error is challenging for deeper neural networks. This second project alleviates this challenge by proposing a method to enforce stronger pose-based supervision. Concretely, given point correspondences, the standard Kabsch algorithm provides an optimal rotation estimate. Starting from this initial rotation estimate, we show that we can improve point correspondence learning, and consequently pose estimates, by extending the original optimization problem. In particular, we linearize the governing constraints of the rotation matrix and solve the resulting linear system of equations. We then iteratively produce new solutions by updating the initial estimate. Our experiments show that, by plugging our differentiable layer into existing learning-based registration methods during training, we improve the correspondence matching in ways that are meaningful for producing better poses. This yields up to a 7% decrease in rotation error for correspondence-based data-driven registration methods.
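As an illustration of the starting point mentioned above, the standard Kabsch step can be sketched in a few lines of NumPy. This is only the classical closed-form alignment, not the linearized refinement layer proposed in the thesis; the function name kabsch and the synthetic data are assumptions made for this example.

```python
import numpy as np


def kabsch(src, dst):
    """Return (R, t) minimizing sum_i ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Synthetic check: rotate about the z-axis, translate, then recover the pose.
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.5, -1.0, 2.0])
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 3))
dst = src @ R_true.T + t_true
R_est, t_est = kabsch(src, dst)
```

With noise-free correspondences the estimate matches the ground-truth pose up to numerical precision. In a learning setting the same computation would be written with a differentiable SVD so that pose error can be backpropagated to the correspondences.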
In our third and last contribution, we propose to go beyond the well-established approach to vision-based localization that relies on visual descriptor matching between a query image and a 3D point cloud. While matching keypoints via visual descriptors makes localization highly accurate, it has significant storage demands, raises privacy concerns and requires updates to the descriptors in the long term. To address these practical challenges for large-scale localization, we present GoMatch, an alternative to visual-based matching that relies solely on geometric information for matching image keypoints to maps, represented as sets of bearing vectors. Our bearing vector representation of 3D points significantly relieves the cross-modal challenge in geometric-based matching that prevented prior work from tackling localization in realistic environments. With additional careful architecture design, GoMatch improves over prior geometric-based matching work with a reduction of (10.67 m, 95.7°) and (1.43 m, 34.7°) in average median pose errors on Cambridge Landmarks and 7-Scenes, while requiring as little as 1.5/1.7% of storage capacity in comparison to the best visual-based matching methods. This confirms its potential and feasibility for real-world localization and opens the door to future efforts in advancing city-scale visual localization methods that do not require storing visual descriptors.
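To make the notion of a bearing vector concrete, the sketch below converts both 2D keypoints and 3D points into unit direction vectors in a camera frame, which places the two modalities in the same representation. It is a generic illustration under assumed conventions (a pinhole intrinsic matrix K and the transform x_cam = R x + t); the function names are hypothetical, and the exact reference frame used by GoMatch is not specified in this abstract.

```python
import numpy as np


def keypoints_to_bearings(keypoints, K):
    """Map 2D pixel keypoints (N, 2) to unit bearing vectors (N, 3)."""
    kp = np.asarray(keypoints, dtype=float)
    homog = np.hstack([kp, np.ones((kp.shape[0], 1))])
    # Back-project each pixel through the inverse intrinsics.
    rays = homog @ np.linalg.inv(K).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)


def points_to_bearings(points, R, t):
    """Map 3D points (N, 3) to unit bearing vectors in the frame (R, t)."""
    pts = np.asarray(points, dtype=float) @ R.T + t
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)


# Assumed pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([
    [500.0, 0.0, 320.0],
    [0.0, 500.0, 240.0],
    [0.0, 0.0, 1.0],
])
points_3d = np.array([[0.0, 0.0, 5.0], [1.0, 2.0, 4.0]])
# Pixels at which these points project for a camera at the origin (R = I, t = 0).
keypoints_2d = np.array([[320.0, 240.0], [445.0, 490.0]])

bearings_2d = keypoints_to_bearings(keypoints_2d, K)
bearings_3d = points_to_bearings(points_3d, np.eye(3), np.zeros(3))
```

When the pose is correct, the bearing vector of a keypoint and that of its matching 3D point coincide, so matching can in principle be posed on directions alone, without visual descriptors.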
All these results show that correspondence-based problems are still highly relevant and widely used in modern-day applications, with a formulation that is amenable to both heuristic and data-driven approaches, providing solutions for problems that require geometrically principled solutions and strong generalization capabilities. Therefore, it is our belief that the work presented in this thesis has successfully contributed to the advancement of one of the cornerstones of pose estimation in computer vision.