## Methods

A common feature of existing work on Transformers for decision making is a focus on modeling
the time-varying features of robotic trajectories in the **time domain**, by direct analogy
with modeling natural-language sentences. We argue that this is insufficient for
embodied learning, which has its own distinctive features.
Specifically, we closely examine embodied learning from the **frequency domain**
and observe that the energy density of a robot's state sequence is concentrated mainly
in the low-frequency part, as shown in the figure below. This follows from the
inherent continuity and smoothness of natural physical phenomena and robot motors.
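This concentration is easy to check numerically. The sketch below builds a hypothetical smooth "joint angle" trajectory (the sinusoid frequencies, sampling rate, and noise level are illustrative assumptions, not data from the paper) and measures how much spectral energy the lowest modes capture:

```python
import numpy as np

# Hypothetical smooth "joint angle" trajectory: two slow sinusoids plus
# small actuation noise, sampled at 100 Hz for 2 s (illustrative values).
t = np.linspace(0.0, 2.0, 200, endpoint=False)
x = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 2.5 * t)
x += 0.01 * np.random.default_rng(0).standard_normal(t.shape)

# Energy density per frequency bin via the real FFT.
spectrum = np.fft.rfft(x)
energy = np.abs(spectrum) ** 2

# Fraction of total energy captured by the 10 lowest modes.
low_frac = energy[:10].sum() / energy.sum()
print(f"energy in 10 lowest modes: {low_frac:.4f}")
```

For such a smooth signal, nearly all energy sits in the first few bins, mirroring the observation above.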

However, by modeling trajectories directly in the time domain, existing Transformer-based
approaches and their variants fail to exploit this inductive bias for robotic control,
resulting in low data efficiency and high inference latency.
To address these issues, we propose the Fourier Controller Network (FCNet), a new
architecture built on this key frequency-domain observation.
FCNet encodes the inductive bias of robotic control through the Fourier transform.
We formulate low-level continuous control as a sequential decision-making problem:
the model predicts subsequent actions from a historical window
of state data, as depicted in the figure below.
We concatenate action, reward, and state tokens into a single sequence, on which we perform
parallel training and real-time decision inference.
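One plausible way to form such a token sequence is to project each modality to a shared token width and interleave per timestep. The dimensions, projection matrices, and per-step ordering below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

# Hypothetical rollout: n steps of 4-dim states, 2-dim actions, scalar rewards.
n, ds, da, d_tok = 6, 4, 2, 8
rng = np.random.default_rng(0)
states = rng.standard_normal((n, ds))
actions = rng.standard_normal((n, da))
rewards = rng.standard_normal((n, 1))

# Illustrative linear projections to a shared token width d_tok.
Wa = rng.standard_normal((da, d_tok))
Wr = rng.standard_normal((1, d_tok))
Ws = rng.standard_normal((ds, d_tok))

# Interleave (a_t, r_t, s_t) per timestep, giving a sequence of 3n tokens.
tokens = np.stack([actions @ Wa, rewards @ Wr, states @ Ws],
                  axis=1).reshape(3 * n, d_tok)
print(tokens.shape)
```

The resulting `(3n, d_tok)` sequence can then be consumed by the model in parallel during training.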

Guided by the frequency-domain observation and the inductive reasoning that
differential dynamics simplify in the frequency domain, FCNet introduces a causal
spectral convolution (CSC) block. It employs the Short-Time Fourier Transform (STFT) and
a linear transform for efficient feature extraction in the **frequency domain**, in contrast
to the Transformer and other prevalent architectures.
As shown in the figure below, we retain only the $m$ lowest frequency modes, with $m$ chosen so
that $m \ll n$, where $n$ is the length of the state window; the high-frequency part of the
spectrum is filtered out.
The CSC block enables efficient training and real-time inference, and also performs well
in our experiments.
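The core of such a block can be sketched as follows: transform a window to the frequency domain, apply a learned per-mode linear map to the $m$ lowest modes, zero the rest, and transform back. This is a simplified single-channel sketch (the causal chunking, STFT windowing, and multi-channel weights of the actual CSC block are omitted):

```python
import numpy as np

def spectral_conv(x, weights, m):
    """Keep the m lowest Fourier modes of a length-n window, apply a
    per-mode linear map, and return to the time domain. Simplified
    single-channel sketch; causality handling omitted."""
    n = x.shape[-1]
    X = np.fft.rfft(x)             # n//2 + 1 frequency modes
    out = np.zeros_like(X)
    out[:m] = X[:m] * weights      # linear transform on the m lowest modes
    return np.fft.irfft(out, n=n)  # back to the time domain

rng = np.random.default_rng(0)
n, m = 64, 8
x = np.cumsum(rng.standard_normal(n)) / np.sqrt(n)     # smooth-ish signal
w = rng.standard_normal(m) + 1j * rng.standard_normal(m)
y = spectral_conv(x, w, m)
print(y.shape)
```

Because only $m \ll n$ modes carry weights, the per-window compute after the transform scales with $m$ rather than $n$.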

Furthermore, efficient parallel training and real-time inference require causality in the
model's sequential outputs (each output depends only on previous inputs) and rapid
generation of each output token.
FCNet therefore performs parallel training with the Fast Fourier Transform (FFT)
and recurrent inference with the sliding discrete Fourier transform (sliding DFT).
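The sliding DFT makes the recurrent update cheap: when the window slides by one sample, each retained bin obeys the standard recurrence $X'_k = (X_k - x_{\text{old}} + x_{\text{new}})\,e^{2\pi i k / n}$, an $O(m)$ update instead of an $O(n \log n)$ FFT recomputation. A minimal sketch (function name and window sizes are illustrative):

```python
import numpy as np

def sliding_dft_step(X, x_old, x_new, n, m):
    """Update the m lowest DFT bins of a length-n window in O(m) when the
    window drops x_old and appends x_new, using
    X'_k = (X_k - x_old + x_new) * exp(2*pi*i*k/n)."""
    k = np.arange(m)
    return (X + (x_new - x_old)) * np.exp(2j * np.pi * k / n)

rng = np.random.default_rng(0)
n, m = 16, 4
buf = rng.standard_normal(n + 1)
X = np.fft.fft(buf[:n])[:m]           # DFT bins of the initial window
X_slid = sliding_dft_step(X, buf[0], buf[n], n, m)
X_ref = np.fft.fft(buf[1:n + 1])[:m]  # full recomputation for comparison
print(np.allclose(X_slid, X_ref))
```

The recurrent update matches the freshly computed DFT of the slid window, which is what allows each new token to be produced in constant time per retained mode.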

This efficiency gives FCNet a significant speed advantage over traditional Transformer models,
enabling it to handle the complexities of real-time continuous control in dynamic environments.