RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

CVPR 2024

Ming Yan1,2,3, Yan Zhang1,3, Shuqiang Cai1,3, Shuqi Fan1,3, Xincheng Lin1,3,

Yudi Dai1,3, Siqi Shen1,3*, Chenglu Wen1,3, Lan Xu4, Yuexin Ma4, Cheng Wang1,3

*Corresponding author

1Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China

2National Institute for Data Science in Health and Medicine, Xiamen University, China

3Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

4Shanghai Engineering Research Center of Intelligent Vision and Imaging, ShanghaiTech University, China


RELI11D is a high-quality dataset that provides four different modalities and records motion sequences (first two rows). Our annotation pipeline provides accurate global SMPL joints and poses, as well as global human motion trajectories (last row).

Abstract

Comprehensively capturing human motion requires both accurate capture of complex poses and precise localization of the human within scenes. Most human pose estimation (HPE) datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, using these modalities alone or in combination may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset involving an RGB camera, an Event camera, LiDAR, and an IMU system. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurements, RGB videos, and Event streams. Through extensive experiments, we demonstrate that RELI11D presents considerable challenges and opportunities, as it contains many rapid and complex motions that require precise localization. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes LiDAR point clouds, Event streams, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for both rapid motions and daily motions, and that exploiting the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field.

Introduction

Fast and complex movements are ubiquitous in the real world, yet fully capturing human motion remains difficult: it requires both accurately capturing complex postures and precisely localizing the human body in the scene.

Multimodal datasets can combine the strengths of individual sensors and provide a more comprehensive understanding of human movements. RGB cameras capture appearance information. Event cameras capture motion with high temporal resolution and high dynamic range by measuring intensity changes asynchronously, filling the gaps between RGB camera frames. LiDAR is insensitive to lighting and provides global geometry and trajectory information. The IMU system captures smooth local movements; combining it with LiDAR compensates for the IMU's lack of global positioning.
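
As a rough illustration of how an asynchronous Event stream can be aligned with RGB frames, the sketch below bins the events between two consecutive frame timestamps into a two-channel polarity histogram. The function name and the synthetic data are illustrative assumptions, not the representation used in our pipeline.

# Minimal sketch (assumption, not our pipeline): binning an asynchronous
# event stream into a 2-channel polarity histogram between two RGB frames.
import numpy as np

def accumulate_events(events, t_start, t_end, height, width):
    """Accumulate events with timestamps in [t_start, t_end) into an image.

    events: structured array with fields x, y, t, p (p in {-1, +1}).
    Returns a (2, height, width) count image: channel 0 = positive events,
    channel 1 = negative events.
    """
    mask = (events["t"] >= t_start) & (events["t"] < t_end)
    ev = events[mask]
    frame = np.zeros((2, height, width), dtype=np.float32)
    pos = ev["p"] > 0
    np.add.at(frame[0], (ev["y"][pos], ev["x"][pos]), 1.0)
    np.add.at(frame[1], (ev["y"][~pos], ev["x"][~pos]), 1.0)
    return frame

# Example: synthetic events between two 30 fps RGB frames (t in seconds).
rng = np.random.default_rng(0)
n = 10_000
events = np.zeros(n, dtype=[("x", np.int32), ("y", np.int32),
                            ("t", np.float64), ("p", np.int8)])
events["x"] = rng.integers(0, 640, n)
events["y"] = rng.integers(0, 480, n)
events["t"] = rng.uniform(0.0, 1.0 / 30.0, n)
events["p"] = rng.choice([-1, 1], n)
print(accumulate_events(events, 0.0, 1.0 / 30.0, 480, 640).shape)  # (2, 480, 640)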

The community therefore needs a high-quality dataset and method to fill the gap in multimodal capture of rapid and complex 3D human motions.

Dataset Overview

For holistic human motion understanding, we present RELI11D. It records the motions of 10 actors performing 5 sports in 7 scenes, comprising 3.32 hours of synchronized data across the four modalities, and provides precise annotations.

Hardware System

We built a portable collection system that integrates the hardware devices of the different modalities to collect data in various real-world scenarios. Our MoCap system contains 17 IMUs, which record the motions of key 3D human body joints.
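
For illustration, one synchronized capture could be organized as a simple per-frame record like the sketch below; the field names and array shapes are assumptions for exposition, not the released data format.

# Illustrative only: a possible container for one synchronized multimodal
# frame; field names and shapes are assumptions, not the released format.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalFrame:
    timestamp: float               # shared clock, seconds
    lidar_points: np.ndarray       # (N, 3) points in the scene frame
    rgb_image: np.ndarray          # (H, W, 3) uint8
    event_chunk: np.ndarray        # (M, 4) rows of (x, y, t, polarity)
    imu_orientations: np.ndarray   # (17, 4) per-sensor unit quaternions
    imu_accelerations: np.ndarray  # (17, 3) per-sensor accelerations

frame = MultimodalFrame(
    timestamp=0.0,
    lidar_points=np.zeros((2048, 3), dtype=np.float32),
    rgb_image=np.zeros((480, 640, 3), dtype=np.uint8),
    event_chunk=np.zeros((0, 4), dtype=np.float32),
    imu_orientations=np.tile([1.0, 0.0, 0.0, 0.0], (17, 1)),
    imu_accelerations=np.zeros((17, 3), dtype=np.float32),
)
print(frame.imu_orientations.shape)  # (17, 4)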

Data Annotation Pipeline

The pipeline takes as input the raw data collected by the different hardware devices, together with highly accurate reconstructed point cloud scenes. In the data pre-processing stage, we first separate the human point cloud from the LiDAR scan and register it with the high-precision scene, and then synchronize and calibrate all modalities. We then apply our proposed Consolidated Optimization, which combines a global pose loss L_geo, a human joint smoothness loss L_smooth, a scene-awareness contact loss L_contact, and a global trajectory loss L_trans. Finally, we obtain automatically and accurately annotated global human poses and trajectories.
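
The sketch below illustrates how such a consolidated objective might combine the four terms; the weights and the concrete definitions of each term are simplified assumptions, not the exact formulation used in our optimization.

# Minimal sketch of a consolidated objective combining the four loss terms
# described above. Weights and term definitions are illustrative assumptions.
import torch

def consolidated_loss(pred_joints, lidar_joints,      # (T, J, 3) each
                      pred_trans, lidar_trans,        # (T, 3) each
                      foot_height,                    # (T,) lowest-vertex height
                      w_geo=1.0, w_smooth=0.1, w_contact=0.1, w_trans=1.0):
    # L_geo: keep optimized joints close to the joints observed in the
    # segmented human point cloud.
    l_geo = torch.mean(torch.norm(pred_joints - lidar_joints, dim=-1))
    # L_smooth: penalize frame-to-frame joint jitter for temporally smooth poses.
    l_smooth = torch.mean(torch.norm(pred_joints[1:] - pred_joints[:-1], dim=-1))
    # L_contact: scene-aware term; here, discourage feet from floating above
    # or sinking below the reconstructed ground (assumed to be at height 0).
    l_contact = torch.mean(torch.abs(foot_height))
    # L_trans: keep the global trajectory consistent with the LiDAR trajectory.
    l_trans = torch.mean(torch.norm(pred_trans - lidar_trans, dim=-1))
    return (w_geo * l_geo + w_smooth * l_smooth
            + w_contact * l_contact + w_trans * l_trans)

# Example call with random tensors (T=8 frames, J=24 joints).
loss = consolidated_loss(torch.randn(8, 24, 3), torch.randn(8, 24, 3),
                         torch.randn(8, 3), torch.randn(8, 3), torch.randn(8))
print(loss.item())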

RELI11D Gallery

Baseline LEIR

To effectively integrate the data of each modality for global human pose estimation, we propose a baseline, LEIR. The data from the different modalities are first fed into modality-specific feature extractors. The resulting features then enter the Temporal Unified Multimodal Model, where our proposed MMCA Unit fuses them via cross-attention. Finally, the SMPL-Based Inverse Kinematics Solver applies several losses to constrain the different dimensions of the data and predicts the global human poses.
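
As an illustration of the cross-attention fusion idea, the following is a generic fusion block in which LiDAR features query the Event and RGB features; it is a simplified stand-in for exposition, not the exact MMCA Unit.

# A generic cross-attention fusion block sketching the idea behind fusing
# per-modality features; this is not the exact MMCA Unit of LEIR.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Point-cloud features act as queries over the Event and RGB features.
        self.attn_event = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_lidar, f_event, f_rgb):
        # Each input: (batch, time, dim) feature sequence from its extractor.
        a_event, _ = self.attn_event(f_lidar, f_event, f_event)
        a_rgb, _ = self.attn_rgb(f_lidar, f_rgb, f_rgb)
        # Residual fusion of LiDAR features with cross-attended context.
        return self.norm(f_lidar + a_event + a_rgb)

fused = CrossAttentionFusion()(torch.randn(2, 16, 256),
                               torch.randn(2, 16, 256),
                               torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 16, 256])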

RELI11D Evaluation

In the qualitative experiments, we show the different stages of dataset annotation; our annotations are the closest to the real motions. We also manually annotate RELI11D. Table 3 reports ablation experiments on the different losses in the Consolidated Optimization stage. The results show that both manual annotation and the optimization losses improve the quality of the dataset.

Benchmark

In the benchmark experiments, we compare human pose estimation methods based on 2D video with global human pose estimation methods. The visualizations show that existing methods cannot estimate fast-moving limbs well, and almost none of them can estimate high-leg movements. The last experiment evaluates the baseline LEIR. We first experiment with different input combinations for LEIR; the three-modality input achieves the best metrics, confirming the importance of multimodal methods for human pose estimation. In addition, we conduct cross-dataset validation, and our method also performs well on other datasets.

Global Trajectory Experiment

Furthermore, we visualize the predicted global trajectories. Combined with the preceding quantitative results, this shows that the multimodal approach is of considerable help for global human pose estimation.

Citation


      @inproceedings{yan2024reli11d,
        title={RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method},
        author={Yan, Ming and Zhang, Yan and Cai, Shuqiang and Fan, Shuqi and Lin, Xincheng and Dai, Yudi and Shen, Siqi and Wen, Chenglu and Xu, Lan and Ma, Yuexin and others},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
        pages={2250--2262},
        year={2024}
      }