From Human Hands to Robot Arms 
Manipulation Skills Transfer via Trajectory Alignment


Abstract

Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from humans to robots, we introduce Traj2Action, a novel framework that bridges the embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan, by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action improves performance over the baseline by up to 27% on short-horizon and 22.25% on long-horizon tasks, and achieves significant gains as the amount of human data used for policy learning scales.
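
To make the two-stage design concrete, here is a minimal PyTorch-style sketch of the co-denoising idea: a trajectory expert denoises a coarse 3D endpoint plan, and an action expert denoises robot actions conditioned on that plan. Module names, shapes, and the simplified one-step update rule are our own illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: shapes, names, and the single-step
# "denoising" update are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class TrajectoryExpert(nn.Module):
    """Predicts the noise on a coarse 3D endpoint trajectory (T x 3)."""
    def __init__(self, horizon=16, obs_dim=64, hidden=128):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(horizon * 3 + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * 3))

    def forward(self, noisy_traj, obs, t):
        x = torch.cat([noisy_traj.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, self.horizon, 3)

class ActionExpert(nn.Module):
    """Predicts the noise on robot actions (e.g., orientation + gripper),
    conditioned on the coarse trajectory plan."""
    def __init__(self, horizon=16, obs_dim=64, act_dim=7, hidden=128):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * (3 + act_dim) + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim))

    def forward(self, noisy_act, traj_plan, obs, t):
        x = torch.cat([noisy_act.flatten(1), traj_plan.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

def co_denoise_step(traj_expert, act_expert, traj, act, obs, t, step=0.1):
    """One joint update: refine the coarse plan, then refine the
    robot actions conditioned on the refined plan."""
    traj = traj - step * traj_expert(traj, obs, t)
    act = act - step * act_expert(act, traj, obs, t)
    return traj, act

# Inference sketch: start both streams from noise and refine jointly.
obs, t = torch.zeros(1, 64), torch.ones(1, 1)
traj, act = torch.randn(1, 16, 3), torch.randn(1, 16, 7)
traj_expert, act_expert = TrajectoryExpert(), ActionExpert()
for _ in range(10):
    traj, act = co_denoise_step(traj_expert, act_expert, traj, act, obs, t)
```

The point the sketch captures is the conditioning direction: the trajectory stream can be supervised with both human and robot data, while the action expert remains robot-specific.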

Tasks Description

Task 1: pick up the water bottle

The robot arm is tasked with locating a water bottle placed on a tabletop, moving towards it, and grasping it successfully. This task evaluates the model's fundamental pick-and-place capabilities.

Task 2: pick up the tomato and put it in the yellow tray

The workspace contains a tomato and two trays, one yellow and one blue. The robot must pick up the tomato and place it into the tray specified by a language command (e.g., "the yellow tray"). This task tests the policy's ability to ground language instructions to specific objects and goals.

Task 3: stack the rings on the pillar

The scene includes a pillar (composed of a yellow column and a blue base), a yellow ring, and a red ring. The robot needs to pick up both rings, one by one, and place them onto the pillar. This task assesses multi-step object manipulation and precision.

Task 4: stack the paper cups

Three paper cups are placed on the table. The robot is required to stack them sequentially to form a single tower. This task evaluates the policy's ability to handle deformable objects and perform iterative, precise placement.

We designed four distinct tasks to assess the capabilities of our policy in terms of basic manipulation, instruction understanding, multi-step logic, and precise object placement.

Results

Performance comparison. SH = Short-Horizon (Success Rate, %); LH = Long-Horizon (Task Progress, %). Task abbreviations: PWB = pick up the water bottle; PTT = pick up the tomato and put it in the tray; SRP = stack the rings on the pillar; SPC = stack the paper cups.

Model Variants                 SH: PWB   SH: PTT   LH: SRP   LH: SPC   Avg. Impr. SH   Avg. Impr. LH
Baseline (π₀)                  48        50        23.75     37.75     --              --
+ Trajectory Expert            58        60        33.50     54.25     +10.00          +13.13
+ Traj. Expert + Human Data    76        76        44.75     61.25     +27.00          +22.25

Contribution of Trajectory Expert  Adding a trajectory expert boosts performance, especially on Long-Horizon tasks (e.g., SPC score: 37.75% → 54.25%, +16.50%). The expert generates a coarse spatio-temporal plan that simplifies low-level control. Without it, simply adding human data gives only minimal gains (52% vs. 50%), showing that the unified trajectory space is key to bridging the human–robot gap.

Contribution of Human Data  Integrating human demonstrations further improves results across tasks (e.g., +28% on PWB, +26% on PTT, +23.5% on SPC, +21% on SRP). Human data brings diverse motions that help the model handle difficult, long-horizon tasks, leading to more robust, generalizable plans.

Figure 1: Impact of Human Data Scale on Policy Performance.

Figure 2: Ablation study on the effect of different trajectory sampling frequencies (FPS) on model performance.

Impact of Human Data Scale  Performance improves steadily with more human demonstrations. For example, success on "pick up the tomato and put it in the tray" rises from 68% (no human data) to 76% with 460 demos. On the harder "stack the paper cups" task, 264 demos lift performance from 37.75% to 57.50%, and the full 460 demos reach 61.25%. This confirms that scaling human data substantially enhances robot policy learning.

Performance comparison for the "pick up the tomato and put it in the tray" task under different data collection strategies. Each data column reports demo count / collection time (min).

Strategy                                    Robot Data (#/min)   Human Data (#/min)   Total Time (min)   SR (%)
Baseline                                    408 / 202.29         --                   202.29             50
+ Trajectory Expert + Robot Data-Only       408 / 202.29         0 / 0                202.29             60
+ Trajectory Expert + Human & Robot Data    270 / 133.70         120 / 14.25          147.95             58
+ Trajectory Expert + Human & Robot Data    270 / 133.70         240 / 28.61          162.31             60
+ Trajectory Expert + Human & Robot Data    270 / 133.70         635 / 75.39          209.09             62

Human Data as a Substitute  Human data can replace expensive robot data while maintaining or improving results. With 240 human demos plus a reduced set of robot demos, performance matches robot-only training (60%) while cutting total collection time by about 20% (162.31 vs. 202.29 min). With 635 human demos, performance surpasses the robot-only result (62% vs. 60%). Even 120 demos reach 58%, close to robot-only performance. Human data is thus a scalable, cost-effective, and efficient substitute in training.
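
The roughly 20% saving follows directly from the table above; a quick check using the table's numbers:

```python
# Sanity check of the collection-time saving quoted above,
# using the per-strategy times from the table.
robot_only_min = 202.29            # 408 robot demos, robot-data-only
mixed_min = 133.70 + 28.61         # 270 robot demos + 240 human demos
saving = 1 - mixed_min / robot_only_min
print(f"{mixed_min:.2f} min vs {robot_only_min:.2f} min -> {saving:.1%} saved")
# 162.31 min vs 202.29 min -> 19.8% saved, at the same 60% success rate
```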

Impact of Trajectory Sampling Frequency  Aligning human and robot motion speeds is critical. Best results (76% success) come from sampling human trajectories at 30 FPS and robot ones at 10 FPS (a 3:1 ratio). This alignment prevents temporal mismatches and leads to more effective cross-embodiment learning.
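
To make the resampling concrete, here is a minimal sketch that puts trajectories from each embodiment on its target time base before training. The function name, the raw capture rates, and the use of linear interpolation (rather than plain frame skipping) are our own illustrative assumptions.

```python
# Illustrative sketch: resample endpoint trajectories so that human
# (30 FPS) and robot (10 FPS) streams match the reported 3:1 ratio.
import numpy as np

def resample_trajectory(points: np.ndarray, src_fps: float, tgt_fps: float) -> np.ndarray:
    """Resample a (T, 3) endpoint trajectory from src_fps to tgt_fps
    via per-axis linear interpolation over a shared time axis."""
    t_src = np.arange(len(points)) / src_fps
    t_tgt = np.arange(0.0, t_src[-1], 1.0 / tgt_fps)
    return np.stack([np.interp(t_tgt, t_src, points[:, d])
                     for d in range(points.shape[1])], axis=1)

# Hypothetical capture rates: human video at 60 FPS, robot states at 50 Hz.
# Sampling the faster human motion 3x more densely than the robot yields
# comparable per-step endpoint displacement across embodiments.
human_traj = resample_trajectory(np.random.rand(120, 3), src_fps=60, tgt_fps=30)
robot_traj = resample_trajectory(np.random.rand(100, 3), src_fps=50, tgt_fps=10)
print(human_traj.shape, robot_traj.shape)  # (60, 3) (20, 3)
```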

Zero-Shot Generalization  When trained only on placing tomatoes in a yellow tray, the policy still achieved 12% success when instructed to use the blue tray, showing a degree of generalization to unseen instructions rather than mere memorization of the training data.

Evaluation Videos

We evaluate our method on a suite of four manipulation tasks, each designed to test different aspects of robotic skill transfer from human demonstrations.


pick up the water bottle

pick up the tomato and put it in the yellow/blue tray