Diffusion models have emerged as a promising choice for learning robot skills from demonstrations. However, diffusion models are neither robust to visual distribution shifts nor sample-efficient for policy learning. In this work, we present Factorized Diffusion Policies (FDP), a novel theoretical framework for learning action diffusion models without jointly conditioning on all observational modalities, such as proprioception and vision. Our factored approach yields a 10% absolute performance improvement across ten RLBench and four Adroit tasks compared to a standard diffusion policy that jointly conditions on all modalities. Moreover, FDP achieves 25% higher absolute performance across five RLBench tasks with distribution shifts such as visual changes or distractors, where existing diffusion policies fail catastrophically. Our real-world experiments show that FDP is safe to deploy and robust to visual distractors and appearance changes, maintaining strong performance even under significant visual disruptions and outperforming standard diffusion policies by over 40%.
In this work, we propose Factorized Diffusion Policies (FDP), a novel theoretical framework for learning action diffusion models that decouples observational modalities so they can be prioritized. At its core, FDP first trains a base model on a prioritized subset of the inputs, then learns a residual model over the modalities the base model omitted. The base and residual model outputs are composed to obtain samples from the full conditional action distribution. In addition, we present an architecture that enables efficient learning of the residual model in the FDP framework. We demonstrate that prioritizing modalities can yield significant gains in sample efficiency and naturally improves policy robustness to distribution shifts in the residual observations.
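To make the base/residual composition concrete, here is a minimal PyTorch-style sketch under our own assumptions: the module names (`base_net`, `residual_net`), their conditioning signatures, and the additive composition of noise predictions are illustrative stand-ins, not the exact architecture or composition rule from the paper.

```python
import torch
import torch.nn as nn


class FactorizedDenoiser(nn.Module):
    """Illustrative sketch of a factored action denoiser.

    `base_net` is conditioned only on the prioritized modality
    (e.g., proprioception); `residual_net` additionally receives the
    omitted modality (e.g., vision) and predicts a correction. Both
    networks are hypothetical placeholders.
    """

    def __init__(self, base_net: nn.Module, residual_net: nn.Module):
        super().__init__()
        self.base_net = base_net
        self.residual_net = residual_net

    def forward(self, noisy_action: torch.Tensor, t: torch.Tensor,
                proprio: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # Base noise prediction from the prioritized inputs alone.
        eps_base = self.base_net(noisy_action, t, proprio)
        # Residual correction from the modalities the base model omitted.
        eps_res = self.residual_net(noisy_action, t, proprio, vision)
        # One natural composition is additive: the sum approximates the
        # noise prediction of a fully conditioned denoiser, and can be
        # plugged into a standard diffusion sampling loop.
        return eps_base + eps_res
```

Because the residual network only has to model what the base model misses, it can be trained after (or alongside) the base model; at sampling time the composed prediction is used wherever a jointly conditioned denoiser would be.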
Tasks
We evaluate Factorized Diffusion Policy (FDP) and the Diffusion Policy (DP) baseline across four real-world domains and report their task success rates. The domains are: Close Drawer, a simple task in which the robot must push the drawer shut; Put Block in Bowl, which assesses the policy's ability to perform precise pick-and-place actions; Pour in Bowl, which evaluates the policy's dexterity when operating near joint limits; and Fold Towel, which assesses effectiveness in manipulating deformable objects.
Data Collection
We collect 50 demonstrations per domain on a Franka FR3 robot using a 6-DoF space mouse, recording proprioceptive observations together with visual observations from two cameras: one mounted on the gripper and one static camera covering the workspace.
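As a rough illustration of what each recorded timestep contains, the sketch below uses a hypothetical schema of our own; the field names and shapes are assumptions, not the dataset format used in the paper.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class DemoStep:
    """One timestep of a recorded demonstration (illustrative schema)."""
    proprio: np.ndarray     # robot state, e.g., joint positions and gripper state
    wrist_rgb: np.ndarray   # image from the gripper-mounted camera
    static_rgb: np.ndarray  # image from the static workspace camera
    action: np.ndarray      # commanded 6-DoF delta pose plus gripper command
```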
The trained policies are evaluated on four task variations in each domain:
- default: an in-distribution setup matching the conditions used during demonstration collection;
- color: the object's color is altered to test robustness to visual appearance changes;
- distractor: novel, unseen objects such as vegetation props and soft toys are added to the scene to introduce clutter;
- occlusion: visual input is intermittently blocked during policy rollout to simulate partial observability.
Rollout outcomes for each task, by variation (variations in the order introduced above):

Close Drawer
  default:    DP Success | FDP (Ours) Success
  color:      DP Success | FDP (Ours) Success
  distractor: DP Fail    | FDP (Ours) Success
  occlusion:  DP Fail    | FDP (Ours) Success

Put Block in Bowl
  default:    DP Success | FDP (Ours) Success
  color:      DP Fail    | FDP (Ours) Success
  distractor: DP Fail    | FDP (Ours) Success
  occlusion:  DP Fail    | FDP (Ours) Success

Pour in Bowl
  default:    DP Success | FDP (Ours) Success
  color:      DP Fail    | FDP (Ours) Success
  distractor: DP Fail    | FDP (Ours) Success
  occlusion:  DP Fail    | FDP (Ours) Success

Fold Towel
  default:    DP Success | FDP (Ours) Success
  color:      DP Fail    | FDP (Ours) Success
  distractor: DP Fail    | FDP (Ours) Success
  occlusion:  DP Fail    | FDP (Ours) Success
This section presents cases where Factorized Diffusion Policy (FDP) fails during execution. These examples highlight limitations under specific visual or task conditions.
The observed failures fall into two modes:
- Overfitted base model (three rollouts)
- Visual out-of-distribution (three rollouts)