💃 DanceHMR: Hand-Aware Whole-Body
Human Mesh Recovery from Monocular Videos

arXiv 2026


Wenhao Shen1*  Ming Zhou1*†  Hengyuan Zhang1  Siyuan Bian1  Youjiang Xu1  Xi Lin1

1ByteDance Intelligent Creation
*Equal contribution   †Corresponding author

TL;DR


(1) DanceHMR recovers temporally coherent SMPL-X body and hand motion from monocular videos.
(2) A hand-aware temporal architecture fuses body context with left- and right-hand observations through residual body-hand fusion, improving wrist consistency and detailed finger articulation.
(3) Close-up-aware augmentation makes the model robust to upper-body framing, hand truncation, motion blur, and hands moving in or out of the image.

Abstract


Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present DanceHMR, a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. DanceHMR unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction, competitive body accuracy, and temporally stable, 2D-consistent SMPL-X motion in real-world videos.

Method


Framework overview. Given an input video, DanceHMR separately encodes body and hand observations, including vision features, 2D keypoints, bounding boxes, and camera cues. Left- and right-hand evidence is fused into a body-anchored representation as residual cues before temporal modeling, allowing the model to use local hand details when they are reliable while falling back to body context and temporal priors under occlusion, truncation, or blur. A temporal motion encoder and SMPL-X decoder then reconstruct coherent whole-body motion, including articulated hands. During training, close-up-aware augmentation simulates upper-body crops and hand truncation so the model handles livestream, vlog, speech, and dance-style compositions.
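The residual body-hand fusion described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `fuse_body_hands`, the per-hand confidence gates, and the projection matrices are all hypothetical stand-ins for whatever learned modules DanceHMR actually uses. The key idea it demonstrates is that hand evidence enters as a gated residual on a body-anchored token, so when a hand is occluded or truncated (confidence near zero) the representation falls back to body context alone.

```python
import numpy as np

def fuse_body_hands(body_feat, lhand_feat, rhand_feat,
                    lhand_conf, rhand_conf, W_l, W_r):
    """Residual body-hand fusion (illustrative sketch).

    body_feat           : body-anchored feature vector for one frame
    lhand_feat/rhand_feat: part-specific hand features
    lhand_conf/rhand_conf: scalar reliability in [0, 1]
                           (e.g. near 0 under occlusion/truncation/blur)
    W_l/W_r             : (hypothetical) projections from hand to body
                          feature space
    """
    # Hand evidence is added as a residual, gated by reliability,
    # so unreliable hands leave the body token unchanged.
    residual = lhand_conf * (W_l @ lhand_feat) + rhand_conf * (W_r @ rhand_feat)
    return body_feat + residual

# Toy usage: with both hands unreliable, the fused token
# reduces to the body context alone.
body = np.ones(4)
lh, rh = np.full(4, 2.0), np.full(4, 3.0)
W = np.eye(4)
fused_occluded = fuse_body_hands(body, lh, rh, 0.0, 0.0, W, W)
fused_visible = fuse_body_hands(body, lh, rh, 1.0, 1.0, W, W)
```

In the full model this fusion happens per frame before the temporal motion encoder, so temporal priors can further smooth frames where both gates are low.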

Training & Evaluation


Training: DanceHMR is trained with a two-stage curriculum on motion capture, synthetic, and in-the-wild video data, including BEDLAM, MOYO, ARCTIC, EgoBody, AMASS, WHAC-A-Mole, 3DPW, and Human3.6M. The second stage places greater emphasis on high-quality SMPL-X and hand-centric data to refine fine-grained hand articulation.
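The close-up-aware augmentation used during training can be illustrated with a small sketch. Everything here is an assumption about one plausible form of the augmentation (the function name `closeup_crop`, the `crop_frac` parameter, and the visibility convention are invented for this example): an upper-body crop is simulated by keeping only the top portion of the frame, and 2D joints that fall outside the crop are marked invisible, mimicking hand truncation at the frame boundary.

```python
import numpy as np

def closeup_crop(keypoints_2d, img_h, img_w, crop_frac=0.55):
    """Simulate upper-body framing (illustrative sketch).

    keypoints_2d : (J, 2) array of pixel coordinates (x, y)
    crop_frac    : fraction of image height kept from the top,
                   emulating livestream/vlog-style compositions
    Returns cropped keypoints, a visibility mask, and the crop box.
    """
    crop_h = int(img_h * crop_frac)
    kpts = keypoints_2d.copy()
    # Joints below the crop line are truncated out of frame.
    visible = kpts[:, 1] < crop_h
    kpts[~visible] = 0.0  # zero out truncated joints
    return kpts, visible, (0, 0, img_w, crop_h)
```

Training on such crops forces the model to infer plausible lower-body and truncated-hand motion from body context and temporal priors rather than from direct observation.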

Whole-body results

Body-only results

On ARCTIC and UBody, DanceHMR improves hand reconstruction while remaining competitive on whole-body accuracy. On 3DPW and EMDB, it also maintains strong body-only video HMR performance, showing that the hand-aware design does not sacrifice body motion quality.

More Results




Citation


@article{shen2026dancehmr,
  title={DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos},
  author={Shen, Wenhao and Zhou, Ming and Zhang, Hengyuan and Bian, Siyuan and Xu, Youjiang and Lin, Xi},
  journal={arXiv preprint},
  year={2026}
}