In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies.
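To make the core idea concrete, here is a minimal sketch (our own illustration, not the paper's code) of how an embodiment graph over sensor/actuator nodes can induce a boolean attention mask, so that each node attends only to itself and its graph neighbors. The function name and edge list are assumptions for illustration.

```python
import torch

def body_attention_mask(edges, num_nodes):
    """Build a boolean mask where mask[i, j] = True allows node i to
    attend to node j. `edges` lists undirected links of the embodiment
    graph (e.g., the kinematic tree); names here are illustrative."""
    mask = torch.eye(num_nodes, dtype=torch.bool)  # each node attends to itself
    for i, j in edges:
        mask[i, j] = True  # neighboring body parts may exchange information
        mask[j, i] = True
    return mask

# Toy 4-link chain 0-1-2-3: node 1 may attend to {0, 1, 2} but not 3.
mask = body_attention_mask([(0, 1), (1, 2), (2, 3)], num_nodes=4)
```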
Our experiments show that BoT benefits both imitation and reinforcement learning.
We evaluate BoT's reinforcement learning (RL) performance on four continuous-control tasks simulated in Isaac Gym.
Our results show that BoT-Mix consistently outperforms both the MLP and vanilla transformer baselines in terms of sample efficiency and asymptotic performance, highlighting the efficacy of integrating body-induced biases into the policy network architecture. Meanwhile, BoT-Hard outperforms the vanilla transformer on the simpler tasks (A1-Walk and Humanoid-Mod), but underperforms on the hard-exploration tasks (Humanoid-Board and Humanoid-Hill).
Because masked attention bottlenecks information propagation between distant body parts, BoT-Hard's strong constraints on communication may hinder efficient RL exploration: in Humanoid-Board and Humanoid-Hill, information about sudden changes in ground conditions may need to travel from the toes to the fingertips already in the upstream layers. For such tasks, BoT-Mix strikes a good balance between funneling information through the embodiment graph and enabling global pooling at intermediate layers, which ensures efficient exploration. In contrast, in A1-Walk and Humanoid-Mod the environment's state changes more regularly, so the strong body-induced bias effectively reduces the search space and enables faster learning with BoT-Hard.
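The BoT-Hard/BoT-Mix distinction can be sketched in a few lines of PyTorch (an illustrative assumption about the layer structure, not the authors' implementation): BoT-Hard would apply the body-graph mask at every layer, whereas BoT-Mix alternates masked and unmasked layers so intermediate layers can pool information globally. The class name and even/odd alternation pattern below are assumptions.

```python
import torch
import torch.nn as nn

class BoTMixEncoder(nn.Module):
    """Sketch of a BoT-Mix-style encoder: body-masked attention layers
    alternate with unmasked ones, so distant body parts can still exchange
    information at intermediate depths."""

    def __init__(self, dim, num_heads, num_layers, allowed):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )
        # PyTorch convention: True in src_mask means "may NOT attend".
        self.register_buffer("blocked", ~allowed)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # Even layers use the embodiment mask (local), odd layers are global.
            x = layer(x, src_mask=self.blocked if i % 2 == 0 else None)
        return x

# Toy usage on a 4-node chain: allow self-attention plus immediate neighbors.
allowed = torch.eye(4, dtype=torch.bool)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    allowed[i, j] = allowed[j, i] = True
enc = BoTMixEncoder(dim=64, num_heads=4, num_layers=6, allowed=allowed)
out = enc(torch.randn(1, 4, 64))  # (batch, nodes, features)
```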
We devise a custom implementation of masked attention and compare it with vanilla attention, showing that BoT-Hard is potentially 2x more computationally efficient than the vanilla transformer. The plots below show the time taken to process a sample as a function of the number of nodes in the embodiment graph, as well as the number of FLOPs required to do so.
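The intuition behind the efficiency gain can be sketched as follows (our own illustrative code, not the paper's kernel): if each node may only attend to a small fixed set of K neighbors, the scores can be computed by gathering just those keys and values, costing O(N·K·d) rather than the dense O(N²·d). The gather-based scheme, padding convention, and names below are assumptions.

```python
import torch

def sparse_masked_attention(q, k, v, neighbor_idx):
    """Compute attention only over permitted (query, key) pairs.
    q, k, v: (N, d); neighbor_idx: (N, K) long tensor listing the K keys
    each node may attend to. Cost scales with N*K*d instead of N^2*d."""
    d = q.shape[-1]
    k_sel = k[neighbor_idx]                      # (N, K, d) gathered keys
    v_sel = v[neighbor_idx]                      # (N, K, d) gathered values
    scores = torch.einsum("nd,nkd->nk", q, k_sel) / d ** 0.5
    return torch.einsum("nk,nkd->nd", scores.softmax(dim=-1), v_sel)

# Chain of 4 nodes, K=3 allowed keys per query (self + neighbors, padded with
# the node's own index; a real kernel would mask out the padded duplicates).
idx = torch.tensor([[0, 1, 0], [0, 1, 2], [1, 2, 3], [2, 3, 3]])
q = k = v = torch.randn(4, 8)
out = sparse_masked_attention(q, k, v, idx)      # (4, 8)
```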
@misc{sferrazza2024body,
  title={Body Transformer: Leveraging Robot Embodiment for Policy Learning},
  author={Carmelo Sferrazza and Dun-Ming Huang and Fangchen Liu and Jongmin Lee and Pieter Abbeel},
  year={2024},
  eprint={2408.06316},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}