Body Transformer: Leveraging Robot Embodiment for Policy Learning

Carmelo Sferrazza, Dun-Ming Huang, Fangchen Liu, Jongmin Lee, Pieter Abbeel

University of California, Berkeley

Abstract

In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies.

Body Transformer

Robot learning policies that employ the vanilla transformer architecture as a backbone typically neglect the useful information provided by the embodiment structure. In contrast, we leverage this structure to provide a stronger inductive bias to transformers, while retaining the representation power of the original architecture. Specifically, BoT is based on masked attention, where at each layer in the resulting architecture, a node can only attend to information from itself and its direct neighbors. As a result, information flows according to the graph structure, with the upstream layers reasoning according to local information and the downstream layers pooling more global information from the farther nodes.
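The neighbor-restricted attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the helper names (neighbor_mask, masked_attention), the 4-node chain graph, and the random weights are all assumptions made for the example.

```python
import numpy as np

def neighbor_mask(num_nodes, edges):
    """Boolean mask: True where node i may attend to node j (itself or a direct neighbor)."""
    mask = np.eye(num_nodes, dtype=bool)
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True
    return mask

def masked_attention(x, mask, w_q, w_k, w_v):
    """Single-head attention in which disallowed (i, j) pairs receive zero weight."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block non-neighbors
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical 4-node chain body graph: 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]
mask = neighbor_mask(4, edges)

rng = np.random.default_rng(0)
d = 8  # embedding size, chosen arbitrarily for the example
x = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = masked_attention(x, mask, w_q, w_k, w_v)
```

In this toy graph, node 0 places zero weight on nodes 2 and 3 within a single layer; their information can only reach node 0 through node 1 in later layers.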

We map the observation and action vectors to a graph of local observations and actions through linear tokenizers and detokenizers. Then, we propose two alternatives for the backbone of our architecture:
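One way to picture the linear tokenizers and detokenizers is as one small linear map per node. The sketch below assumes a hypothetical three-node body (torso, hip, knee), a made-up assignment of observation and action indices to nodes, and randomly initialized weights; none of these come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 16  # shared token size, an arbitrary choice for this example

# Hypothetical layout: which entries of the flat observation and action
# vectors belong to each body node.
obs_slices = {"torso": [0, 1, 2], "hip": [3, 4], "knee": [5, 6]}
act_slices = {"hip": [0], "knee": [1]}  # the torso has no actuator in this toy body

# One linear tokenizer per sensing node, one linear detokenizer per actuated node.
tokenizers = {n: 0.1 * rng.normal(size=(len(idx), embed_dim)) for n, idx in obs_slices.items()}
detokenizers = {n: 0.1 * rng.normal(size=(embed_dim, len(idx))) for n, idx in act_slices.items()}
node_names = list(obs_slices)

def tokenize(obs):
    """Map the flat observation to one embedding per node."""
    return np.stack([obs[idx] @ tokenizers[n] for n, idx in obs_slices.items()])

def detokenize(tokens, action_dim):
    """Map each actuated node's output token to its slice of the flat action."""
    action = np.zeros(action_dim)
    for n, idx in act_slices.items():
        action[idx] = tokens[node_names.index(n)] @ detokenizers[n]
    return action

obs = rng.normal(size=7)
tokens = tokenize(obs)               # shape (3, 16): one token per node
action = detokenize(tokens, action_dim=2)
```

In a full policy the tokens would pass through the attention backbone before detokenization; here they are fed straight through only to show the shapes involved.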

BoT-Hard, which masks every layer with a binary mask that reflects the structure of the graph. Concretely, this mask only allows each node to attend to itself and its direct neighbors, introducing considerable sparsity in the problem.
BoT-Mix, which interleaves masked-attention layers (constructed as in BoT-Hard) with unmasked-attention layers.
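The difference between the two variants comes down to which mask each layer receives. The sketch below uses a hypothetical layer_masks helper; the strict alternation and the choice to start with a masked layer are assumptions, since the text only states that masked and unmasked layers are interleaved.

```python
import numpy as np

def layer_masks(body_mask, num_layers, variant):
    """Return one boolean attention mask per layer for the chosen variant."""
    n = body_mask.shape[0]
    full = np.ones((n, n), dtype=bool)  # unmasked attention: every node sees every node
    if variant == "hard":
        return [body_mask] * num_layers
    if variant == "mix":
        # Assumed pattern: masked, unmasked, masked, unmasked, ...
        return [body_mask if i % 2 == 0 else full for i in range(num_layers)]
    raise ValueError(f"unknown variant: {variant}")

# Hypothetical 3-node chain body graph (self plus direct neighbors).
body_mask = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]], dtype=bool)

hard = layer_masks(body_mask, num_layers=4, variant="hard")
mix = layer_masks(body_mask, num_layers=4, variant="mix")
```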

Results

Our experiments show how BoT benefits both imitation and reinforcement learning algorithms.

For imitation learning, we benchmark behavioral cloning (BC) using different models as backbones and train them on the MoCapAct dataset, where BoT-Hard consistently outperforms both the MLP and transformer baselines. Remarkably, the gap with these architectures further increases on the unseen validation clips, demonstrating the generalization capabilities provided by the embodiment-aware inductive bias. The table below indicates "max / mean" returns obtained during training.

Additionally, BoT scales better as the number of parameters increases, as shown in the figure below. The plot shows the performance of the models on the training and validation sets as a function of the number of trainable parameters. BoT-Hard outperforms the transformer baseline, and its performance remains more stable as the number of parameters increases.

In a dexterous manipulation scenario, BoT-Hard also outperforms baselines in a low-data imitation learning setting, being at least as demo-efficient as the MLP baseline, and consistently outperforming the transformer baseline. The figure below shows the performance as a function of the number of demonstrations.

We evaluate BoT's reinforcement learning (RL) performance on four continuous control tasks trained in Isaac Gym.

Our results show that BoT-Mix consistently outperforms both the MLP and vanilla transformer baselines in terms of sample efficiency and asymptotic performance, highlighting the efficacy of integrating body-induced biases into the policy network architecture. Meanwhile, BoT-Hard performs better than the vanilla transformer on simpler tasks (A1-Walk and Humanoid-Mod), but shows relatively inferior results in hard-exploration tasks (Humanoid-Board and Humanoid-Hill).

Because masked attention bottlenecks information propagation between distant body parts, BoT-Hard's strong constraints on communication may hinder efficient RL exploration. In Humanoid-Board and Humanoid-Hill, it may be useful for information about sudden changes in ground conditions to be transmitted from the toes to the fingertips in the upstream layers. For such tasks, BoT-Mix strikes a good balance between funneling information through the embodiment graph and enabling global pooling at intermediate layers to ensure efficient exploration. In contrast, in A1-Walk and Humanoid-Mod, the environment's state changes more regularly, so the strong body-induced bias can effectively reduce the search space, enabling faster learning with BoT-Hard.
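The propagation bottleneck can be made concrete by counting how many layers are needed before one node's token depends on another node's input. The sketch below assumes a hypothetical 9-node chain standing in for a toe-to-fingertip path (a real humanoid graph is a tree, not a chain) and an alternating masked/unmasked schedule for the mixed case; the function names are illustrative only.

```python
import numpy as np

def chain_mask(num_nodes):
    """Mask for a chain graph: each node sees itself and its two adjacent nodes."""
    idx = np.arange(num_nodes)
    return np.abs(idx[:, None] - idx[None, :]) <= 1

def layers_until_reachable(masks, src, dst):
    """Number of layers after which the token at dst depends on the input at src.

    Returns None if the given stack of per-layer masks never connects them.
    """
    if src == dst:
        return 0
    n = masks[0].shape[0]
    reach = np.eye(n, dtype=bool)  # reach[i, j]: token i depends on input j
    for depth, mask in enumerate(masks, start=1):
        # Token i attends to k, and k already depends on j.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
        if reach[dst, src]:
            return depth
    return None

toe, fingertip = 0, 8            # hypothetical endpoints of a 9-node path
body = chain_mask(9)
full = np.ones((9, 9), dtype=bool)

hard_deep = layers_until_reachable([body] * 8, toe, fingertip)
hard_shallow = layers_until_reachable([body] * 4, toe, fingertip)
mixed = layers_until_reachable([body, full] * 2, toe, fingertip)
```

Under these assumptions, a fully masked stack needs as many layers as the graph distance between the two nodes, whereas a single unmasked layer connects them immediately.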

Computational Analysis

We devise a custom implementation of masked attention and compare it with vanilla attention, showing that BoT-Hard is potentially 2x more computationally efficient than the vanilla transformer. The plots below show the time taken to process a sample as a function of the number of nodes in the graph, as well as the number of FLOPs required to do so.
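To see why sparsity can reduce cost, the following back-of-the-envelope count compares the number of query-key pairs that dense and neighbor-masked attention must process. It is not the authors' custom implementation or their measurement: it assumes a hypothetical chain graph, counts only the score and weighted-sum steps of one head, and ignores projections, feed-forward layers, and any overhead of sparse indexing, so it does not reproduce the end-to-end figure reported above.

```python
def attention_flops(num_pairs, head_dim):
    """Rough FLOP count for the score and weighted-sum steps of one attention head.

    Each (query, key) pair costs one dot product (about 2 * head_dim FLOPs) for
    the score and about 2 * head_dim FLOPs for accumulating the value.
    """
    return 2 * 2 * num_pairs * head_dim

def chain_pairs(num_nodes):
    """Allowed pairs in a chain graph: each node itself plus its adjacent nodes."""
    return num_nodes + 2 * (num_nodes - 1)

num_nodes, head_dim = 32, 64  # arbitrary example sizes

dense_flops = attention_flops(num_nodes * num_nodes, head_dim)
sparse_flops = attention_flops(chain_pairs(num_nodes), head_dim)
pair_ratio = chain_pairs(num_nodes) / (num_nodes * num_nodes)
```

The number of allowed pairs grows linearly with the number of nodes in a sparse body graph but quadratically under dense attention, which is the intuition behind skipping masked entries rather than computing and discarding them.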

BibTeX

@misc{sferrazza2024body,
      title={Body Transformer: Leveraging Robot Embodiment for Policy Learning},
      author={Carmelo Sferrazza and Dun-Ming Huang and Fangchen Liu and Jongmin Lee and Pieter Abbeel},
      year={2024},
      eprint={2408.06316},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
    }