Quite suddenly, we revive our series The History of LLMs – but there is a good reason for that. Prompted by Andrej Karpathy’s recent tweet and with the support of Hugging Face, we’d like to clarify where the famous “attention” comes from. We are also working on turning The History of LLMs into a book, so it’s all well-timed.

There are a bunch of papers and researchers mentioned in this article. If you want to find all the papers in one place and follow their authors, check out the tl;dr that we created in a little collab with Hugging Face 👇🏼

TL;DR: The Story of Attention's Development by @karpathy

We linked to all the researchers that you can follow on HF and provided their papers – it’s all in one place now. @huggingface
Now to the (hi)story: You’ve probably heard of Attention is All You Need – the groundbreaking paper that introduced Transformers and shot to the top of the charts after ChatGPT ignited the Generative AI revolution. In his recent tweet, Andrej Karpathy said: “It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate" (2014).”

This awesome “attention” is sometimes also referred to as Bahdanau attention, named after the first author of the paper Neural Machine Translation by Jointly Learning to Align and Translate. Karpathy asked Dzmitry Bahdanau to share his personal story about writing this paper, and it seems that the person who actually came up with the term “attention” was Yoshua Bengio! (Spoiler: we asked Yoshua Bengio to share his story as well – read it below.)

Eight years ago, Dzmitry joined Yoshua Bengio's lab as an intern after studying under Herbert Jaeger at Jacobs University. Initially skeptical about representing word sequences as vectors, he worked on a machine translation project with Kyunghyun Cho. Eager to secure a PhD, he focused on coding and debugging, eventually earning a PhD offer in 2014.

Once on the PhD track, Dzmitry explored ways to improve the encoder-decoder RNN model by addressing its bottleneck. His initial idea involved using two "cursors" to process the source and target sequences, inspired by Alex Graves' RNN Transducer model. However, finding it impractical for machine translation and too complex to implement within the internship, he pivoted to a simpler synchronous-cursor approach resembling hard-coded diagonal attention, which worked but lacked elegance.

“So one day I had this thought that it would be nice to enable the decoder RNN to learn to search where to put the cursor in the source sequence. This was sort of inspired by translation exercises that learning English in my middle school involved. Your gaze shifts back and forth between source and target sequence as you translate. I expressed the soft search as softmax and then weighted averaging of BiRNN states. It worked great from the very first try to my great excitement. I called the architecture RNNSearch, and we rushed to publish an ArXiV paper as we knew that Ilya and co at Google are somewhat ahead of us with their giant 8 GPU LSTM model (RNN Search still ran on 1 GPU). As it later turned out, the name was not great. The better name (attention) was only added by Yoshua to the conclusion in one of the final passes.”

– Dzmitry Bahdanau
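To make the “softmax and then weighted averaging of BiRNN states” concrete, here is a minimal NumPy sketch of that soft search, with an additive scoring network and parameter names of our own choosing – a toy illustration, not the original RNNSearch code:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def soft_search(prev_decoder_state, encoder_states, W_s, W_h, v):
    """Score every encoder (BiRNN) state against the previous decoder
    state, turn the scores into weights with a softmax, and return the
    weighted average as the context vector. W_s, W_h, and v stand in for
    the learned parameters of the small scoring network."""
    energies = np.array([
        v @ np.tanh(W_s @ prev_decoder_state + W_h @ h_j)
        for h_j in encoder_states          # one score per source position
    ])
    weights = softmax(energies)            # attention weights, sum to 1
    context = (weights[:, None] * encoder_states).sum(axis=0)
    return context, weights
```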
We checked the paper, and “attention” as a term actually appears on the fourth page (not in the conclusion) and is mentioned three times in the same passage. Even the author himself didn’t pay that much attention to “attention” back then! Here is the passage:

“The probability α_ij, or its associated energy e_ij, reflects the importance of the annotation h_j with respect to the previous hidden state s_{i−1} in deciding the next state s_i and generating y_i. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.”

– Neural Machine Translation by Jointly Learning to Align and Translate
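For readers who want the math behind that passage: in the paper’s notation, the energy e_ij comes from a small feed-forward alignment model a(·), the weights α_ij are its softmax over the source positions, and the context vector c_i is the weighted average of the annotations (equations lightly restyled from the paper):

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
```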
To clarify the situation, we asked Yoshua Bengio what happened and how he came up with this term:

“We were aware of previous work on neural networks exploring the idea of attending to different parts of an image when doing computer vision, such as Larochelle & Hinton 2010 (Hugo Larochelle did his PhD with me and then a postdoc with Hinton). There is, of course, lots of even earlier work in cognitive science and neuroscience. My own insight really became strong in the context of the machine translation task.

Prior to our introduction of attention, we were using a recurrent network that read the whole input source language sequence and then generated the translated target language sequence. However, this is not at all how humans translate. Humans pay very particular attention to just one word or a few input words at a time, in their context, to decide on the next word (or few words) to generate to form the sequence of words in the translation. The traditional approach that dominated before our paper failed because it created a bottleneck through which all of the information about the input sequence had to be summarized, in a fixed-size vector, from which all the output words (the translated sequence) had to be generated. At the scale of translating a book, this would be like first reading the whole book in French, then, having the story in mind, generating the English translation of the book. We would for sure forget some details, because our mind can't easily hold all of them in memory at once. The solution was to allow looking back at any piece of detail in the French book while producing every additional word in the English translation. The attention mechanism selects which pieces of detail to consider at each step.

To make the whole thing compatible with traditional training with backpropagation, we had to make the attention "soft", i.e., instead of picking a single location of focus, we have a total budget of attention that can be split across all the locations. After training, the network would learn to put most of the attention weight in one or two places. Because these attention weights are graded, we can still use gradient descent to learn how to compute them, and traditional neural network machinery worked like a charm.”

– Yoshua Bengio
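In symbols, the “soft” trick Bengio describes replaces a hard, non-differentiable choice of a single source position with a graded weighting that backpropagation can handle (our paraphrase, reusing the notation above):

```latex
\text{hard:}\; c_i = h_{j^*},\; j^* = \arg\max_j e_{ij}
\qquad\longrightarrow\qquad
\text{soft:}\; c_i = \sum_j \alpha_{ij} h_j,\quad \alpha_{ij} \ge 0,\; \sum_j \alpha_{ij} = 1
```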
In his reply to Karpathy, Bahdanau adds: “I did not have the foresight to think that attention can be used at a lower level, as the core operation in representation learning. But when I saw the Transformer paper, I immediately declared to labmates that RNNs are dead.”

It’s even more peculiar how the previously unnoticed attention has become such an apple of discord.

Immediately after Karpathy’s post, Jürgen Schmidhuber – a heavyweight in AI with many incredible inventions but with a somewhat hurt ego – lashed out, claiming that he was the one who came up with “attention” first and that it first appeared in his 1992 paper Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Unfortunately, the term “attention” never appears in that paper.

Sebastian Raschka, in turn, digging into what came first, recommended the above-mentioned paper to “those interested in historical tidbits and early approaches fundamentally similar to modern transformers.” He explains that in 1991 – 23 years before Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, and 26 years before the seminal Attention Is All You Need paper – Jürgen Schmidhuber introduced Fast Weight Programmers (FWP) as an alternative to recurrent neural networks. In this approach, a feedforward network slowly learns, by gradient descent, to program the "fast weights" of another neural network, modifying them dynamically during sequence processing.

Modern analogies link FWPs to today's Transformers. The key and value concepts in Transformers correspond to the FROM and TO patterns in FWP, while the query corresponds to the input to which the fast weight matrix is applied. This matrix, built up from outer products of keys and values, gives end-to-end differentiable control over the fast weight updates, and the process closely resembles linearized self-attention mechanisms, as seen in linear Transformers.

This connection was further elaborated in recent research, including the 2020 papers Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention and Rethinking Attention with Performers. The explicit equivalence between FWPs and linearized self-attention was formally established in the 2021 paper Linear Transformers Are Secretly Fast Weight Programmers, highlighting the foundational role of FWPs in modern Transformer architectures.
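To make that parallel concrete, here is a toy NumPy sketch of the fast-weight view of (unnormalized, causal) linear attention: each step writes the outer product of a value and a key into a fast weight matrix, which the query then reads out. This is our illustrative sketch of the idea, not Schmidhuber’s original formulation and not a real Transformer layer (which also normalizes the weights):

```python
import numpy as np

def fast_weight_attention(queries, keys, values):
    """Toy fast-weight view of linearized self-attention.

    queries, keys: arrays of shape (T, d_k); values: shape (T, d_v).
    At step t we "program" the fast weight matrix W with the outer
    product values[t] x keys[t], then "read" it with queries[t].
    Up to the missing softmax normalization, this matches causal
    linear self-attention.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))                 # the fast weights
    outputs = []
    for q_t, k_t, v_t in zip(queries, keys, values):
        W += np.outer(v_t, k_t)              # write: update the fast weights
        outputs.append(W @ q_t)              # read: apply them to the query
    return np.array(outputs)
```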
Jürgen Schmidhuber had very similar ideas long before the famous papers, but he didn’t come up with the term – and he didn’t make it work.

Did he influence the researchers? It’s hard to tell. From the email exchange between Karpathy and Bahdanau, we see that Dzmitry was not influenced by him. According to Dzmitry, the concept of a "differentiable and data-dependent weighted average" emerged independently in Yoshua Bengio's lab, uninfluenced by Neural Turing Machines, Memory Networks, or earlier cognitive science papers. Its development was driven by Yoshua’s ambitious leadership, Kyunghyun Cho’s project management skills, and Dzmitry’s own creativity and programming expertise. Bahdanau adds, though, that he believes the idea was inevitable: attention is a natural fit for flexible spatial connectivity in deep learning, and it was only waiting for adequate GPU capabilities.

So our verdict is:

The "attention" mechanism, popularized by the 2014 paper by Bahdanau, Cho, and Bengio, introduced a method to focus on relevant parts of input sequences for neural machine translation. While Jürgen Schmidhuber's earlier Fast Weight Programmers share conceptual similarities, there is no evidence that they directly influenced the 2014 work, and they never used the term “attention.” Credit for naming and formalizing the mechanism belongs to Bahdanau, Cho, and Bengio, with Yoshua Bengio playing a key role in its naming, and with inspiration from earlier work in cognitive science and neuroscience.

We linked to all the researchers that you can follow on Hugging Face and provided their papers – it’s all in one place now.

Who else should be credited? Leave a comment here ->

Thank you for reading! 🤍🩶 Let us know if you’d like more articles of this sort!