Bingran Chen, Baorun Li, Jian Yang, Yong Liu, and Guangyao Zhai
Abstract— High-level robotic manipulation tasks demand flexible 6-DoF grasp estimation as a basic capability. Previous approaches either generate grasps directly from point-cloud data, struggling with small objects and sensor noise, or infer 3D information from RGB images, which introduces expensive annotation requirements and discretization issues. Recent methods mitigate some of these challenges by retaining a 2D representation to estimate grasp keypoints and applying Perspective-n-Point (PnP) algorithms to compute 6-DoF poses. However, these methods are limited by their non-differentiable nature and reliance solely on 2D supervision, which hinders the full exploitation of rich 3D information. In this work, we present KGN-Pro, a novel grasping network that preserves the efficiency and fine-grained object grasping of previous KGNs while integrating direct 3D optimization through probabilistic PnP layers. KGN-Pro encodes paired RGB-D images to generate a Keypoint Map and further outputs a 2D confidence map to weight keypoint contributions during re-projection error minimization. By modeling the weighted sum of squared re-projection errors probabilistically, the network effectively transmits 3D supervision to its 2D keypoint predictions, enabling end-to-end learning. Experiments on both simulated and real-world platforms demonstrate that KGN-Pro outperforms existing methods in terms of grasp cover rate and success rate. We will release the code shortly.
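To illustrate the confidence-weighted re-projection error described above, the minimal PyTorch sketch below scores a discrete set of pose hypotheses and turns the weighted errors into a (log-)probability over poses. The function names, the restriction to sampled hypotheses, and the softmax normalization are assumptions made for exposition only; they are not the paper's implementation of the probabilistic PnP layer.

```python
import torch

def reproject(points_3d, R, t, K):
    """Project 3D gripper corner points into the image with rotation R, translation t, intrinsics K."""
    cam_pts = points_3d @ R.T + t          # (N, 3) points in the camera frame
    uv = cam_pts @ K.T                     # apply the pinhole intrinsics
    return uv[:, :2] / uv[:, 2:3]          # perspective division -> (N, 2) pixel coordinates

def weighted_reprojection_error(kpts_2d, conf_2d, points_3d, R, t, K):
    """Confidence-weighted sum of squared re-projection errors for one pose hypothesis."""
    residual = reproject(points_3d, R, t, K) - kpts_2d   # (N, 2) pixel residuals
    return (conf_2d.unsqueeze(-1) * residual.pow(2)).sum()

def pose_log_probs(kpts_2d, conf_2d, points_3d, poses, K):
    """Log-probabilities of p(y|X) restricted to a discrete set of sampled pose hypotheses."""
    errors = torch.stack([
        weighted_reprojection_error(kpts_2d, conf_2d, points_3d, R, t, K)
        for R, t in poses
    ])
    return torch.log_softmax(-errors, dim=0)  # lower weighted error -> higher probability
```

Because every step is differentiable with respect to the predicted 2D keypoints and confidences, a loss on the resulting pose distribution can backpropagate 3D supervision into the 2D predictions, which is the end-to-end property claimed in the abstract.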
Overview of the proposed KGN-Pro. It takes a pair of RGB-D images as input and applies a keypoint extractor to obtain a Keypoint Map, from which the 2D keypoints are computed. Meanwhile, a confidence extractor produces a confidence score for each 2D keypoint, which is incorporated into the 2D-3D correspondences X between the 2D keypoints and the corner points of the gripper model. A grasp pose distribution p(y|X) is then estimated through the re-projection function on X. Finally, grasps are sampled from this distribution and matched to ground-truth labels via nearest-neighbor matching to obtain the corresponding pose supervision. The nearest supervision provides a target distribution t(y) that regularizes p(y|X).
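To make the last step of the overview concrete, the sketch below approximates the nearest-neighbor matching and the regularization of p(y|X) toward t(y) using discrete samples. The names nearest_neighbor_targets and distribution_loss, the use of translation distance only, and the temperature parameter are hypothetical choices for illustration and may differ from the actual KGN-Pro loss.

```python
import torch
import torch.nn.functional as F

def nearest_neighbor_targets(sampled_trans, gt_trans, temperature=0.05):
    """Build a target distribution t(y) over sampled grasps from their nearest ground-truth label.

    sampled_trans: (S, 3) translations of grasps sampled from p(y|X)
    gt_trans:      (G, 3) translations of ground-truth grasp labels
    """
    dists = torch.cdist(sampled_trans, gt_trans)          # (S, G) pairwise distances
    nearest = dists.min(dim=1).values                     # distance to the matched nearest GT grasp
    return torch.softmax(-nearest / temperature, dim=0)   # closer to a label -> more target mass

def distribution_loss(log_p, target):
    """KL(t(y) || p(y|X)) over the sampled hypotheses, pulling the prediction toward the target."""
    return F.kl_div(log_p, target, reduction="sum")
```

In use, log_p would be the output of the pose-distribution sketch above evaluated at the sampled grasps, so minimizing distribution_loss pushes probability mass toward hypotheses that lie near a ground-truth grasp.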