MolmoMotion: Language-Guided 3D Motion Forecasting Hits HuggingFace

HuggingFace

June 18, 2026

Original source: huggingface.co

The Announcement: HuggingFace Drops a Language-Conditioned Motion Forecasting Model

HuggingFace just released MolmoMotion, a model that takes a natural language instruction and predicts a sequence of 3D human or object motions. The team open-sourced the weights, a training script, and a Gradio demo. According to the paper, MolmoMotion achieves a 15% lower mean per-joint position error (MPJPE) on the Human3.6M dataset than the previous state of the art, MDM. It uses a transformer-based architecture with 120 million parameters and a cross-attention mechanism to fuse language embeddings with motion tokens. The researchers also introduced a new evaluation benchmark, MolmoBench, which includes 500 language-motion pairs with diverse actions like "walk cautiously over wet floor" and "grab the upper cabinet handle." The model outputs a 120-frame motion sequence at 25 fps, or 4.8 seconds of playback.
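To make the fusion step concrete, here is a minimal sketch of single-head cross-attention, in which motion tokens act as queries over the text embeddings. It is an illustration under assumptions, not MolmoMotion's actual code: the function names, the toy vectors, and the omission of learned projection matrices are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(motion_tokens, text_embeddings):
    """Single-head cross-attention with no learned projections (a simplification).

    motion_tokens: list of query vectors, one per motion token.
    text_embeddings: list of vectors used as both keys and values, one per text token.
    Returns (fused vectors, attention weights).
    """
    dim = len(text_embeddings[0])
    fused, weights = [], []
    for query in motion_tokens:
        # Scaled dot-product score between this motion token and every text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_embeddings
        ]
        attn = softmax(scores)
        # Weighted sum of the text embeddings: the language context for this token.
        context = [
            sum(a * value[i] for a, value in zip(attn, text_embeddings))
            for i in range(dim)
        ]
        fused.append(context)
        weights.append(attn)
    return fused, weights


# Hypothetical toy inputs: 2 motion tokens, 3 text tokens, dimension 4.
motion_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_embeddings = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
fused, weights = cross_attention(motion_tokens, text_embeddings)
```

Each motion token ends up as a weighted mix of the text embeddings, and the weights show which words it attended to most. A production model would typically add learned query, key, and value projections and use multiple heads.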

The Context: Why Language-Guided Motion Forecasting Is Suddenly Hot

Until recently, motion forecasting models were trained purely on motion capture data, ignoring the text descriptions that humans naturally use to communicate actions. Systems like MDM and MotionDiffuse could generate animations from action labels or short prompts (e.g., "dance"), but they struggled with nuanced instructions like "wave your right hand while walking slowly." The problem is that language and motion exist in very different semantic spaces: a verb like "walk" can mean dozens of different gait patterns depending on context. The rise of large language models and vision-language models has given researchers tools to bridge that gap. But most of those approaches still required expensive fine-tuning or were locked behind API paywalls. HuggingFace's bet is that open-sourcing a dedicated motion-forecasting model will accelerate robotics research, where verbal instructions are the most natural interface for non-experts.

The Implications: Real-World Impact Beyond the Demo

If you're building a domestic robot that needs to understand "take the dirty plate to the kitchen," MolmoMotion's ability to condition on both the verb and the object is a big deal. The model can handle object-referring expressions, something earlier systems failed at. In internal tests, MolmoMotion correctly predicted the grasp trajectory for "pick up the blue mug" over 80% of the time, versus 55% for MDM. That said, the model takes about 2 seconds per generation on an A100 GPU, which is too slow for interactive robotics that must react within a fraction of a second. Still, for offline animation pipelines or skill learning in simulation, that's acceptable. The bigger win is that HuggingFace has released the full training recipe, including the dataset construction pipeline. That means anyone can retrain MolmoMotion on their own motion capture data or even synthetic data from physics simulators.
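The latency point can be checked with quick arithmetic. This sketch uses only the figures quoted in this article (120 frames, 25 fps, roughly 2 seconds on an A100). The function names are illustrative, and actual latency will depend on hardware and batch size.

```python
# Figures quoted in the article; the latency is approximate.
NUM_FRAMES = 120
FPS = 25
GENERATION_SECONDS = 2.0


def clip_duration(num_frames, fps):
    """Playback length of a motion clip in seconds."""
    return num_frames / fps


def real_time_factor(generation_seconds, duration_seconds):
    """Compute time divided by playback time; below 1.0 is faster than playback."""
    return generation_seconds / duration_seconds


duration = clip_duration(NUM_FRAMES, FPS)
rtf = real_time_factor(GENERATION_SECONDS, duration)

print(f"clip duration: {duration:.1f} s")   # clip duration: 4.8 s
print(f"real-time factor: {rtf:.2f}")       # real-time factor: 0.42
```

A real-time factor below 1.0 means the model produces motion faster than it plays back, so throughput is not the bottleneck for offline use. What hurts interactive control is the wait itself: assuming the sequence is returned in one shot rather than streamed, nothing is available until roughly 2 seconds after the command.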

The Unknowns: Where MolmoMotion Falls Short and What to Watch

The biggest open question is generalization to unseen environments. MolmoMotion was trained on Human3.6M and a small proprietary dataset of daily activities. How well does it handle industrial settings, like a worker assembling a part? The paper doesn't say. Then there's the issue of ambiguity. If you say "move the chair," do you mean push it, pull it, or lift it? The model picks one interpretation, but it might not be the one you intended. Also, the benchmark includes only single-person motions, and multi-agent interaction forecasting is still an unsolved problem. Watch for follow-up work that adds spatial grounding (like bounding boxes from object detectors) or temporal reasoning over long horizons. Another red flag: the dataset might contain biases toward Western body language and common household motions. If you're deploying MolmoMotion somewhere with different gestures or motion patterns, such as a warehouse, expect failures.

Frequently Asked Questions

What exactly is MolmoMotion?

MolmoMotion is a transformer-based model that generates a sequence of 3D joint positions from a natural language command. It was released by HuggingFace's research team as an open-source project. The model takes in a text prompt and outputs 120 frames of motion at 25 frames per second, which is 4.8 seconds of playback.

Can I use MolmoMotion for real-time robotics?

Not yet. The current version takes about 2 seconds to generate a motion sequence on an NVIDIA A100 GPU, which is too slow for direct real-time control. However, the team is working on a distilled version that could run on edge devices at interactive rates. For simulation and offline planning, it's already usable.

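What datasets was MolmoMotion trained on?

MolmoMotion was trained on Human3.6M and a small proprietary dataset of daily activities. The paper does not say how well the model handles other domains, such as industrial settings.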

How does MolmoMotion compare to existing models like MDM?

On the Human3.6M benchmark, MolmoMotion achieves 15% lower mean per-joint position error than MDM, the previous best open-source model. More importantly, it can handle complex object-referring language that MDM cannot. However, MDM still generates motion faster (0.5 seconds vs. 2 seconds) and has a larger community of fine-tuned variants.
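For readers unfamiliar with the metric, mean per-joint position error is the average Euclidean distance between predicted and ground-truth joint positions, taken over all joints and frames. The sketch below shows that standard definition and how a relative improvement is computed. The toy poses and the 100 mm and 85 mm error values are hypothetical, since the article reports only the 15% figure, and evaluation protocols often add alignment steps not shown here.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error.

    predicted, target: nested sequences indexed [frame][joint] -> (x, y, z).
    Returns the mean Euclidean distance over every joint in every frame.
    """
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        for pred_joint, true_joint in zip(pred_frame, true_frame):
            total += math.dist(pred_joint, true_joint)
            count += 1
    return total / count


def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy data: one frame, two joints, off by 3 and 4 units.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
predicted = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]
error = mpjpe(predicted, target)

# Hypothetical absolute errors that would correspond to a 15% reduction.
gain = relative_improvement(100.0, 85.0)

print(f"MPJPE: {error}")                    # MPJPE: 3.5
print(f"relative improvement: {gain:.0%}")  # relative improvement: 15%
```

Note that "15% lower" is a relative figure: without the absolute errors, it does not say how many millimeters the prediction improved by.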

Is there a demo or code available?

Yes. HuggingFace has released a Gradio demo on their hub, along with the model weights, training scripts, and a Colab notebook. The full code is MIT licensed, so you can modify and deploy it freely. Check the repository for instructions on setting up your own environment.