Human pose estimation aims to correctly detect and localize keypoints, i.e., human body joints or parts. It is one of the fundamental computer vision tasks that plays an important role in various downstream applications, such as motion capture, activity recognition, security monitoring, human-computer interactions, and virtual reality.
With the advent of the XR era, technologies such as virtual reality (VR), human-computer interaction (HCI), and augmented reality (AR) have gradually matured. As a core part of XR research, it is becoming more important to accurately estimate human pose. However, due to the freestyle motion of human bodies, for complex scenes with person-person occlusions, large variations of appearance and texture, and cluttered backgrounds, pose estimation remains very challenging.
Kan Zhehan, a graduate student from the Department of Electronic and Electrical Engineering (EEE) at the Southern University of Science and Technology (SUSTech), supervised by Chair Professor Zhihai He, has recently developed a self-constrained prediction-verification network to characterize and learn the structural correlation between keypoints during training.
Their paper, entitled “Self-Constrained Inference Optimization on Structural Groups for Human Pose Estimation,” was accepted by the European Conference on Computer Vision (ECCV 2022), a top AI conference with the proceedings published by Springer.
Kan and his co-authors observed that human poses exhibit strong group-wise structural correlation and spatial coupling between keypoints due to the biological constraints of different body parts. Based on this observation, the researchers proposed to partition the body’s keypoints into structural groups. Within each group, the keypoints are further partitioned into two subsets, proximal keypoints with high estimation accuracy and low-accuracy distal keypoints. This group-wise structural correlation can be explored to improve the accuracy and robustness of human pose estimation.
Figure 1. Illustration of the proposed idea of self-constrained inference optimization of structural groups for human pose estimation
One fundamental challenge in pose estimation, as well as in generic prediction tasks, is that there is no mechanism to verify if the obtained pose estimation or prediction results are accurate or not since the ground truth is not available. Prof. He’s team developed a self-constrained prediction-verification network to perform forward and backward predictions between these keypoint subsets.
In Figure 2, this prediction-verification network with a forward-backward prediction loop learns the internal structural correlation between the proximal and distal keypoints. The learning process is guided by the self-constraint loss. If the internal structural correlation is successfully learned, then the self-constraint loss generated by the forward and backward prediction loop should be small. This step is referred to as self-constrained learning.
Figure 2. The overall framework of the proposed network
Once successfully learned, the verification network serves as an accuracy verification module for the forward pose prediction. During the inference stage, it can be used to guide the local optimization of the pose estimation results of low-accuracy keypoints with the self-constrained loss on high-accuracy keypoints as the objective function. This provides an effective mechanism to iteratively refine the prediction result based on the specific statistics of the test sample. This step is referred to as self-constrained optimization. The such feedback-based adaptive prediction will result in better generalization capability on the test sample.
The comparison and ablation experiments are performed on the MS COCO and CrowdPose datasets, both of which contain very challenging scenes for pose estimation, such as multi-person poses of various body scales and occlusion patterns. The proposed method outperforms the current best by a large margin of up to 2.5% on theMS COCO dataset. On the CrowdPose dataset, it has improved the pose estimation accuracy by up to 1.5%, which is quite significant. Visualization results in Figure 3 also demonstrate that the proposed method can significantly improve the pose estimation results.
Figure 3. Three examples of refinement of predicted keypoints: the top row is the original estimation, and the bottom row is the refined version by the proposed method
The researchers partition the body keypoints into structural groups exploring the structural correlation within each group and develop a prediction-verification network to characterize the structural correlation between them based on a self-constraint loss. A self-constrained optimization method was introduced using the learned verification network as a performance assessment module to optimize the pose estimation of distal keypoints during the inference stage where the ground truth is unavailable. The extensive experimental results on benchmark MS COCO datasets demonstrated that the proposed SCIO method could significantly improve the pose estimation results, providing important reference values for follow-up research.
Zhehan Kan is the first author of this paper. Chair Professor Zhihai He is the corresponding author. Other authors of the paper include Shuoshuo Chen, a master’s student from the Department of EEE, and Zeng Li, Associate Professor of the Department of Statistics and Data Science at SUSTech.
This study was partially supported by the National Natural Science Foundation of China (NSFC).
Paper link: https://arxiv.org/abs/2207.02425
To read all stories about SUSTech science, subscribe to the monthly SUSTech Newsletter.
Proofread ByAdrian Cremin, Yingying XIA
Photo By