Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning

Tsinghua University
ICML 2024

Abstract

The Transformer has shown promise in reinforcement learning for modeling time-varying features, yielding generalized low-level robot policies from diverse robotics datasets in embodied learning. However, it still suffers from low data efficiency and high inference latency.

In this paper, we investigate the task from the perspective of the frequency domain. We first observe that the energy density in the frequency domain of a robot's trajectory is mainly concentrated in the low-frequency part. Then, we present the Fourier Controller Network (FCNet), a new network that uses the Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency domain interpolation. To enable real-time decision-making, we further adopt FFT and Sliding DFT methods in the model architecture to achieve parallel training and efficient recurrent inference.

Extensive results in both simulated (e.g., D4RL) and real-world (e.g., robot locomotion) environments demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as Transformer. For example, FCNet outperforms Transformer on multi-environment robotics datasets of all sizes (from 1.9M to 120M).

Video

Crossing knee-deep snow

In scenes never seen in the dataset (which is collected in a simulator), such as knee-deep snow, the FCNet policy shows good locomotion performance, robustness, and generalization.

Dashing in the snow

The FCNet policy dashes through the snow with ease. We find that the FCNet policy lasts longer and causes less joint heating than the traditional RL policy, possibly because FCNet filters out high-frequency noise, making the output actions smoother.

We find that a limited amount of simulator data (e.g., 60M steps or less) can cover a very wide range of real-world terrain. We therefore expect that large amounts of cheap simulator data can be used in the future to establish scaling laws for embodied AI, including robotic manipulation.

Methods

One common feature of existing work on Transformer for decision making is that it mainly models the time-varying features of robotic trajectories in the time domain, by direct analogy with modeling natural language sentences. We argue that this is insufficient in embodied learning, which has its own unique features. Specifically, we closely examine embodied learning in the frequency domain and observe that the energy density distribution of a robot's state sequence is mainly concentrated in the low-frequency part, as shown in the figure below. This is due to the inherent continuity and smoothness of natural physical phenomena and robot motors.
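
The low-frequency observation can be checked on any recorded state sequence. The following is a minimal sketch, not the authors' analysis code: the synthetic `trace`, the helper names `dft` and `low_freq_energy_fraction`, and the cutoff `m = 8` are all illustrative assumptions, and a naive DFT stands in for an FFT so that the example needs only the standard library.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT; stands in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def low_freq_energy_fraction(x, m):
    """Share of spectral energy carried by the m lowest frequency indices.

    For a real signal, bin k and bin n - k hold the same physical frequency,
    so the frequency index of bin k is min(k, n - k).
    """
    n = len(x)
    energy = [abs(c) ** 2 for c in dft(x)]
    low = sum(e for k, e in enumerate(energy) if min(k, n - k) < m)
    return low / sum(energy)


# Hypothetical "joint angle" trace: two slow oscillations plus weak fast jitter.
n = 128
trace = [
    math.sin(2 * math.pi * 2 * t / n)
    + 0.5 * math.sin(2 * math.pi * 3 * t / n)
    + 0.05 * math.sin(2 * math.pi * 40 * t / n)
    for t in range(n)
]
print(f"energy in the 8 lowest modes: {low_freq_energy_fraction(trace, 8):.4f}")
```

For this toy trace the fraction is close to 1 by construction. On real robot logs the value depends on the platform and sampling rate, so it should be measured rather than assumed.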

However, by directly modeling trajectories in the time domain, existing works on Transformer and its variants do not take this inductive bias for robotic control into account, resulting in low data efficiency and high inference latency. To address these issues, we propose a new architecture, the Fourier Controller Network (FCNet), based on this key observation in the frequency domain. FCNet builds the inductive bias for robotic control into the architecture, drawing on the Fourier transform. We conceptualize low-level continuous control as a sequential decision-making problem. Our neural model predicts subsequent actions by analyzing a historical window of state data, as depicted in the figure below. We concatenate action, reward, and state tokens together and then perform parallel training and real-time decision inference on this sequence.

Guided by the observation in the frequency domain and the inductive reasoning that differential dynamics are simplified in the frequency domain, FCNet introduces a causal spectral convolution (CSC) block. It employs the Short-Time Fourier Transform (STFT) and a linear transform for efficient feature extraction in the frequency domain, distinct from Transformer and other prevalent architectures. As shown in the figure below, we keep only the $m$ lowest modes, with $m$ chosen so that $m \ll n$, where $n$ is the length of the state window. The high-frequency part of the spectrum is thereby filtered out. The CSC block makes efficient training and real-time inference possible, and it also performs well in experiments.
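
The effect of keeping only the $m$ lowest modes can be illustrated on a single window. The sketch below is an assumption-laden simplification and not the CSC block itself: it handles one scalar channel and one window, ignores causality, and replaces the learned linear transform with hypothetical per-mode scalar `weights`. The function name `low_mode_spectral_layer` is likewise illustrative.

```python
import cmath
import math


def low_mode_spectral_layer(window, weights):
    """Transform a real window, scale its m lowest modes, and transform back."""
    n, m = len(window), len(weights)
    assert 2 * (m - 1) < n, "kept modes must stay below the Nyquist index"
    # Forward transform, evaluated only at the m lowest modes.
    modes = [
        sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(m)
    ]
    # Per-mode linear transform (hypothetical scalars here; learned in a real model).
    modes = [w * c for w, c in zip(weights, modes)]
    # Inverse transform; the conjugate-symmetric partners keep the output real.
    out = []
    for t in range(n):
        value = modes[0].real
        for k in range(1, m):
            value += 2 * (modes[k] * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(value / n)
    return out


# Slow motion plus fast jitter; identity weights turn the layer into a low-pass filter.
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [s + 0.1 * math.sin(2 * math.pi * 20 * t / n) for t, s in enumerate(slow)]
filtered = low_mode_spectral_layer(noisy, [1.0] * 4)
print(max(abs(f - s) for f, s in zip(filtered, slow)))
```

With identity weights and $m = 4$, the mode-20 jitter is discarded and the slow component is recovered up to floating-point error. Only $m$ modes are computed and stored per window, which is where the saving over working with all $n$ time steps comes from.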

Furthermore, efficient parallel training and inference require causality in the model's sequential outputs (each output depends only on previous inputs) and rapid generation of each output token for real-time response. To this end, FCNet uses parallel training based on the Fast Fourier Transform (FFT) and recurrent inference based on the sliding discrete Fourier transform (Sliding DFT).
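
Recurrent inference with a sliding DFT rests on a standard identity: when a window of length $n$ advances by one sample, each mode can be updated as $X_k \leftarrow (X_k - x_{\text{old}} + x_{\text{new}})\, e^{2\pi i k / n}$ instead of being recomputed from scratch. The sketch below implements this textbook recurrence for a scalar signal. The class name `SlidingDFT` and the helper `direct_modes` are illustrative assumptions, and FCNet's actual recurrence may differ in details such as channel handling.

```python
import cmath
import math
from collections import deque


class SlidingDFT:
    """Tracks the m lowest DFT modes of the most recent n samples."""

    def __init__(self, n, m):
        self.window = deque([0.0] * n, maxlen=n)  # starts as an all-zero window
        self.twiddle = [cmath.exp(2j * math.pi * k / n) for k in range(m)]
        self.modes = [0j] * m

    def update(self, x_new):
        x_old = self.window[0]      # sample about to leave the window
        self.window.append(x_new)   # maxlen drops x_old automatically
        # O(m) update: remove the old sample, add the new one, rotate the phase.
        self.modes = [
            (c - x_old + x_new) * w for c, w in zip(self.modes, self.twiddle)
        ]
        return self.modes


def direct_modes(window, m):
    """Reference: recompute the m lowest modes of a window from scratch."""
    n = len(window)
    return [
        sum(window[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        for k in range(m)
    ]


sdft = SlidingDFT(n=16, m=4)
for t in range(40):
    modes = sdft.update(math.sin(0.3 * t))
reference = direct_modes(list(sdft.window), 4)
print(max(abs(a - b) for a, b in zip(modes, reference)))
```

Each step costs $O(m)$ operations per channel regardless of the window length, whereas recomputing the transform of the whole window at every step would cost at least $O(n \log n)$.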

This efficiency gives FCNet a significant speed advantage over traditional Transformer models, enabling it to handle the complexities of real-time continuous control in dynamic environments.

Some experimental results

Comprehensive analyses in both simulated (e.g., D4RL) and real-world (e.g., robot locomotion) environments demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as Transformer and RetNet. FCNet outperforms Transformer on multi-environment robotics datasets of all sizes (from 1.9M to 120M). The results show that FCNet significantly outperforms Transformer with limited data.

We test the inference latency of Transformer (with KV cache) and FCNet under different hyperparameter settings related to model structure. The results show that the inference latency of FCNet grows significantly more slowly than that of Transformer as the context length, number of layers, and hidden size increase. This demonstrates the inference efficiency of FCNet.

BibTeX

@inproceedings{tanfourier,
      title={Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning},
      author={Tan, Hengkai and Liu, Songming and Ma, Kai and Ying, Chengyang and Zhang, Xingxing and Su, Hang and Zhu, Jun},
      booktitle={Forty-first International Conference on Machine Learning}
    }