Kavli Affiliate: Xiang Zhang
| First 5 Authors: Xiang Zhang, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
| Summary:
The Transformer architecture excels in a variety of language modeling tasks,
outperforming traditional neural architectures such as RNNs and LSTMs. This is
partially due to its elimination of recurrent connections, which allows for
parallel training and a smoother flow of gradients. However, this move away
from recurrent structures places the Transformer model at the lower end of
Chomsky’s computational hierarchy, imposing limitations on its computational
abilities. Consequently, even advanced Transformer-based models face
considerable difficulties in tasks like counting, string reversal, and
multiplication. These tasks, though seemingly elementary, require a level of
computational complexity that exceeds the capabilities of the Transformer
architecture. Concurrently, the emergence of “Chain of Thought” (CoT)
prompting has enabled Transformer-based language models to tackle tasks that
they previously could not perform or performed poorly. In this work, we thoroughly
investigate the influence of recurrent structures in neural models on their
reasoning abilities and computability, and contrast this with the role
autoregression plays in the models’ computational power. We then shed light on how the
CoT approach can mimic recurrent computation and act as a bridge between
autoregression and recurrence in the context of language models. It is this
approximated recurrence that notably improves the model’s performance and
computational capacity. Moreover, we revisit recent recurrence-based Transformer
model designs, focusing on their computational abilities through our proposed
concept of “recurrence-completeness,” and identify key theoretical limitations
in models such as the Linear Transformer and RWKV. Through this, we aim to provide
insight into neural model architectures and to motivate better model design.
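
To make the claimed bridge between autoregression and recurrence concrete, below is a minimal, hypothetical Python sketch. It is not code from the paper; the functions recurrent_count and cot_style_count and the letter-counting task are illustrative assumptions only. It contrasts an explicit recurrent state update with a CoT-style loop in which each emitted step is appended back into the textual context, so the running tally is carried forward as generated tokens rather than as a hidden state.

# Hypothetical illustration, not code from the paper: counting occurrences of
# the letter 'a', contrasting an explicit recurrent update with a CoT-style
# loop in which each emitted step is appended back into the context.

def recurrent_count(s: str) -> int:
    """Explicit recurrence: a hidden state updated once per input symbol."""
    state = 0
    for ch in s:
        state = state + 1 if ch == "a" else state
    return state

def cot_style_count(s: str) -> int:
    """Autoregressive analogue: each step re-reads the previously emitted
    partial tally from the context, mimicking a carried recurrent state."""
    context = f"Count 'a' in: {s}\n"
    for i, ch in enumerate(s):
        # Recover the running tally from the last emitted step, standing in
        # for conditioning on earlier chain-of-thought tokens.
        prev = int(context.rsplit("tally=", 1)[1].split()[0]) if "tally=" in context else 0
        tally = prev + 1 if ch == "a" else prev
        context += f"step {i}: tally={tally}\n"  # the emitted CoT step
    return int(context.rsplit("tally=", 1)[1].split()[0])

if __name__ == "__main__":
    text = "banana and avocado"
    assert recurrent_count(text) == cot_style_count(text) == text.count("a")
    print(recurrent_count(text))  # -> 6

Under this reading, the chain-of-thought tokens play the role of the recurrent state: each autoregressive step conditions on the previously emitted partial result, which is what lets step-by-step generation approximate computations that a single forward pass cannot carry out.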
| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3