
Transformer Creators' Warning: AI Trapped in the Original Framework, as Jen-Hsun Huang Gathers the Seven Authors to Break the Deadlock

In 2017, the paper “Attention is All You Need” was published, introducing the Transformer model based on self-attention mechanisms for the first time, breaking free from the constraints of traditional RNNs and CNNs, and effectively overcoming the long-distance dependency problem through parallel processing. At the 2024 GTC conference, Nvidia CEO Jen-Hsun Huang invited the seven authors of the Transformer to appear together.

The origin of Transformers lies in the efficiency dilemma of machine translation

The Seven Transformer Authors Appear Together

(Source: NVIDIA)

Jen-Hsun Huang asked what problems were encountered at the beginning and what inspired the team to create the Transformer. Illia Polosukhin responded: “If you want to release a model that can truly read search results, for example, handling piles of documents, you need some models that can process this information quickly. The recurrent neural networks (RNN) at that time could not meet such needs.”

Jakob Uszkoreit added: “The speed at which we generate training data far exceeds our ability to train the most advanced architectures. In fact, we are using simpler architectures, such as feedforward networks with n-grams as input features. These architectures, at least in the context of Google's large-scale training data, often outperform those more complex and advanced models due to their faster training speeds.”

Noam Shazeer provided key insights: “It seems that this is a pressing issue that needs to be addressed. We started noticing these scaling laws around 2015, and you can see that as the model size increases, its level of intelligence also rises. And there’s a huge feeling of frustration because RNNs are just too cumbersome to work with. Then I happened to hear these guys discussing, hey, let’s replace it with convolution or attention mechanisms. I thought, great, let’s do that. I like to compare the Transformer to the leap from the steam engine to the internal combustion engine. We could have completed the industrial revolution with a steam engine, but it would have been very painful, while the internal combustion engine made everything better.”

Three Core Problems Solved by Transformer

Parallel processing: breaks free of the sequential-processing limitations of RNNs and achieves true parallel computation.

Long-range dependency: effectively captures relationships between distant words through the self-attention mechanism.

Training efficiency: significantly improves model training speed, making large-scale pre-training possible.
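The first two points can be made concrete with a minimal NumPy sketch of scaled dot-product self-attention, the core operation of the Transformer. All positions are processed in a single matrix multiply (no sequential recurrence as in an RNN), and the score matrix lets every position interact directly with every other, regardless of distance. The shapes and random weights are illustrative; this is not the paper's full multi-head implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). One matmul handles all positions at once,
    so there is no step-by-step recurrence as in an RNN."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # (seq_len, seq_len): every pair of positions interacts directly,
    # however far apart they are in the sequence
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over keys, row by row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))                      # 5 tokens, dim 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one output vector per input position
```

Because the score matrix is computed in one shot, training parallelizes across the whole sequence, which is exactly what made large-scale pre-training practical.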

These technological breakthroughs have made Transformers the cornerstone of modern AI. Large language models such as ChatGPT, BERT, and GPT-4 are all based on the Transformer architecture. However, seven years later, the creators believe it is time for a breakthrough.

Trapped in the Efficiency Dilemma of the Original Model

Aidan Gomez candidly stated: “I believe this world needs something better than the Transformer, and I think all of us here hope it can be replaced by something that takes us to a new performance plateau.” Llion Jones added: “We are stuck on the original model, even though technically it may not be the most powerful thing we have right now. Everyone knows what kind of personal tools they want: you want better context windows, you want faster token generation. Current models use too many computational resources; I think everyone has done a lot of wasted computation.”

Jakob Uszkoreit pointed out the core issue: “But I think this is mainly about how to allocate resources, rather than how much resources are consumed in total. For example, we do not want to spend too much money on an easy problem, or spend too little on a problem that is too difficult and ultimately not get a solution.”

Illia Polosukhin provided a vivid example: “It's like 2+2: if you feed it into one of these models, it will use a trillion parameters. So I think adaptive computation is one of the things that needs to emerge next, so that we know how much compute to spend on a given problem.” This criticism exposes a fundamental flaw of current AI models: a lack of adaptability, spending the same computational resources on simple and complex problems alike, which leads to enormous waste.
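Polosukhin's point can be illustrated with a hypothetical early-exit network, one common form of adaptive computation: layers run one at a time, and the model stops as soon as an intermediate prediction is confident, so an easy input like 2+2 consumes fewer layers than a hard one. Every name and the confidence threshold here are illustrative assumptions, not anything described at the panel.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_forward(x, layers, classifier, threshold=0.9):
    """Hypothetical early-exit sketch: apply layers in sequence and
    return as soon as the intermediate prediction is confident.
    Returns (class probabilities, number of layers actually used)."""
    h = x
    for used, W in enumerate(layers, start=1):
        h = np.tanh(h @ W)               # one "layer" of computation
        probs = softmax(h @ classifier)  # intermediate prediction
        if probs.max() >= threshold:     # confident enough -> exit early
            return probs, used
    return probs, len(layers)            # hard input: full depth used

rng = np.random.default_rng(1)
d = 4
layers = [rng.standard_normal((d, d)) for _ in range(6)]
clf = rng.standard_normal((d, 3))
p, used = adaptive_forward(rng.standard_normal(d), layers, clf)
print(used, "of", len(layers), "layers used")
```

A standard Transformer, by contrast, always runs every layer on every token, which is precisely the waste the panelists criticized.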

Noam Shazeer analyzed it from an economic perspective: “I think the current models are too cheap, and still too small in scale. The computational cost of each operation is about 10⁻¹⁸ dollars. If you look at a model with 500 billion parameters, where each token undergoes about a trillion computations, that comes to roughly one dollar per million tokens, which is 100 times cheaper than going out to buy a paperback and reading it.” This perspective is counterintuitive but profound: AI is currently so cheap that people squander computational resources rather than cherish them.
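Reading the per-operation figure as 10⁻¹⁸ dollars (the value that makes the stated arithmetic work), Shazeer's estimate can be checked in a few lines:

```python
# Back-of-the-envelope check of Shazeer's cost estimate.
# Figures are taken from his remarks; the per-operation cost is
# interpreted as 10^-18 dollars.
cost_per_op = 1e-18        # dollars per operation
ops_per_token = 1e12       # ~a trillion operations per token
                           # (forward pass of a ~500B-parameter model)
tokens = 1_000_000

cost = cost_per_op * ops_per_token * tokens
print(f"${cost:.2f} per million tokens")
```

The product comes to about one dollar per million tokens, matching the quoted comparison with the price of a paperback.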

Future Direction: Adaptive Computing and Reasoning Capability

Lukasz Kaiser revealed an important fact: “We did not succeed in our initial goal; our original intention in starting the Transformer was to simulate the evolution of tokens. It is not just a linear generation process, but a gradual evolution of text or code.” This admission shows that while the Transformer has been successful, it has not fully realized its creators' vision.

Jakob Uszkoreit pointed out the next direction: “The next step is reasoning. We all recognize the importance of reasoning, but much of the work is still done manually by engineers. We want the model to generate the content we want, whether it’s video, text, or 3D information, and they should all be integrated together.” This suggests that future AI architectures will need stronger reasoning capabilities and multimodal integration.

Aidan Gomez added: “Can we achieve multi-tasking and multi-threading in parallel? If you really want to build such a model, helping us design one is a very good way forward.” Lukasz Kaiser believes: “Reasoning actually comes from data; we need to make the data richer.” These discussions point to several key directions for post-Transformer AI architectures: adaptive computation, stronger reasoning, multimodal fusion, and more efficient use of data.
