
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge from them is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan, by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action improves performance over the baseline by up to 27% on short-horizon tasks and 22.25% on long-horizon tasks, and achieves further gains as the amount of human data used for robot policy learning scales.
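As an illustration of the coarse-plan-then-refine design described above, below is a minimal, hypothetical sketch of a co-denoising setup: a trajectory expert denoises the shared 3D endpoint trajectory, and an action expert denoises robot-specific actions conditioned on that coarse plan. The module names, MLP denoisers, and all dimensions are illustrative assumptions, not the actual Traj2Action architecture.

```python
# Minimal co-denoising sketch (assumed architecture, for illustration only).
import torch
import torch.nn as nn


class TrajectoryExpert(nn.Module):
    """Denoises the coarse 3D endpoint trajectory shared by human and robot data."""

    def __init__(self, horizon: int = 16, obs_dim: int = 64):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(horizon * 3 + obs_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * 3),
        )

    def forward(self, noisy_traj, obs, t):
        # noisy_traj: (B, horizon, 3) noisy endpoint positions; t: (B, 1) noise level
        x = torch.cat([noisy_traj.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, self.horizon, 3)


class ActionExpert(nn.Module):
    """Denoises robot-specific actions (e.g., end-effector orientation and gripper
    state), conditioned on the coarse trajectory plan."""

    def __init__(self, horizon: int = 16, obs_dim: int = 64, act_dim: int = 5):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * (3 + act_dim) + obs_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )

    def forward(self, noisy_act, traj_plan, obs, t):
        x = torch.cat([noisy_act.flatten(1), traj_plan.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)


def co_denoise_step(traj_expert, action_expert, noisy_traj, noisy_act, obs, t):
    """One joint denoising step: the trajectory estimate conditions the action estimate."""
    traj_pred = traj_expert(noisy_traj, obs, t)
    act_pred = action_expert(noisy_act, traj_pred, obs, t)
    return traj_pred, act_pred


if __name__ == "__main__":
    B, H, OBS, ACT = 2, 16, 64, 5  # batch, plan horizon, observation dim, action dim
    traj_pred, act_pred = co_denoise_step(
        TrajectoryExpert(H, OBS),
        ActionExpert(H, OBS, ACT),
        torch.randn(B, H, 3),    # noisy 3D endpoint trajectory
        torch.randn(B, H, ACT),  # noisy robot-specific actions
        torch.randn(B, OBS),     # stand-in for the observation/language embedding
        torch.rand(B, 1),        # diffusion noise level
    )
    print(traj_pred.shape, act_pred.shape)  # torch.Size([2, 16, 3]) torch.Size([2, 16, 5])
```

In the real system both experts would presumably share a vision-language backbone and be trained jointly with a diffusion-style objective over human and robot data; the sketch only shows the direction of conditioning, from the coarse trajectory plan to the fine-grained robot actions.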
Task 1: pick up the water bottle
The robot arm is tasked with locating a water bottle placed on a tabletop, moving toward it, and grasping it successfully. This task evaluates the model's fundamental object localization and grasping capabilities.
Task 2: pick up the tomato and put it in the yellow tray
The workspace contains a tomato and two trays, one yellow and one blue. The robot must pick up the tomato and place it into the tray specified by a language command (e.g., "the yellow tray"). This task tests the policy's ability to ground language instructions to specific objects and goals.
Task 3: stack the rings on the pillar
The scene includes a pillar (composed of a yellow column and a blue base), a yellow ring, and a red ring. The robot needs to pick up both rings, one by one, and place them onto the pillar. This task assesses multi-step object manipulation and precision.
Task 4: stack the paper cups
Three paper cups are placed on the table. The robot is required to stack them sequentially to form a single tower. This task evaluates the policy's ability to handle deformable objects and perform iterative, precise placement.
| Model Variants | PWB (SR %) | PTT (SR %) | SRP (TP %) | SPC (TP %) | Avg. SH Improvement (SR %) | Avg. LH Improvement (TP %) |
|---|---|---|---|---|---|---|
| Baseline (π₀) | 48 | 50 | 23.75 | 37.75 | — | — |
| + Trajectory Expert | 58 | 60 | 33.50 | 54.25 | +10.00 | +13.13 |
| + Traj. Expert + Human Data | 76 | 76 | 44.75 | 61.25 | +27.00 | +22.25 |

Short-horizon (SH) tasks, reported as success rate (SR %): PWB = pick up the water bottle, PTT = pick up the tomato and put it in the tray. Long-horizon (LH) tasks, reported as task progress (TP %): SRP = stack the rings on the pillar, SPC = stack the paper cups.
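For reference, the Avg. Improvement columns are consistent with averaging each pair of task scores and comparing to the baseline: the short-horizon average rises from (48 + 50) / 2 = 49 to (58 + 60) / 2 = 59 with the Trajectory Expert (+10.00) and to (76 + 76) / 2 = 76 with human data (+27.00), while the long-horizon average rises from (23.75 + 37.75) / 2 = 30.75 to (33.50 + 54.25) / 2 ≈ 43.88 (+13.13) and to (44.75 + 61.25) / 2 = 53.00 (+22.25).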
Impact of Human Data Scale on Policy Performance.
Ablation study on the effect of different trajectory sampling frequencies (FPS) on model performance.
| Strategy | Robot Data (count / min) | Human Data (count / min) | Total Data Time (min) | Performance (SR %) |
|---|---|---|---|---|
| Baseline | 408 / 202.29 | — | 202.29 | 50 |
| + Trajectory Expert + Robot Data Only | 408 / 202.29 | 0 / 0 | 202.29 | 60 |
| + Trajectory Expert + Human & Robot Data | 270 / 133.70 | 120 / 14.25 | 147.95 | 58 |
| + Trajectory Expert + Human & Robot Data | 270 / 133.70 | 240 / 28.61 | 162.31 | 60 |
| + Trajectory Expert + Human & Robot Data | 270 / 133.70 | 635 / 75.39 | 209.09 | 62 |