
Session 14: An Overview of GPT & Transformer
Presenters
In our meeting on the "Overview of GPT & Transformer," we began by laying the foundation with the basics of neurons and the ways networks map inputs to outputs, covering one-to-one, one-to-many, many-to-one, and many-to-many configurations. These mapping patterns form the backbone of neural network architectures, letting them handle different combinations of input and output. We moved on to explore feedforward networks, the simplest type of neural network, in which information flows in a single direction from input to output, followed by an introduction to Recurrent Neural Networks (RNNs), which bring temporal dynamics into the modeling of sequences.
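To make that contrast concrete, below is a minimal sketch, assuming PyTorch (which the session did not prescribe); the class names, layer sizes, and data are illustrative only. It places a feedforward network, which maps one input vector to one output with no memory, next to a vanilla RNN, which carries a hidden state across time steps.

```python
# A minimal sketch, assuming PyTorch; names and sizes are illustrative.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feedforward: information flows strictly input -> hidden -> output, with no memory of past inputs."""
    def __init__(self, in_dim=16, hidden_dim=32, out_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):                    # x: (batch, in_dim)
        return self.net(x)

class ManyToOneRNN(nn.Module):
    """Recurrent: a hidden state is carried across time steps, adding temporal dynamics."""
    def __init__(self, in_dim=16, hidden_dim=32, out_dim=4):
        super().__init__()
        self.rnn = nn.RNN(in_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):                    # x: (batch, seq_len, in_dim)
        outputs, h_n = self.rnn(x)           # outputs at every step (many-to-many); h_n is the last state
        return self.head(h_n[-1])            # read out only the final state (many-to-one)

x_seq = torch.randn(8, 10, 16)               # 8 sequences, 10 time steps, 16 features each
print(FeedForward()(x_seq[:, 0, :]).shape)   # torch.Size([8, 4]) -- one vector in, one vector out
print(ManyToOneRNN()(x_seq).shape)           # torch.Size([8, 4]) -- a whole sequence in, one vector out
```

Reading out every step of `outputs` instead of only the final hidden state would turn the same layer into a many-to-many mapping, which is the configuration sequence-labeling tasks use.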
We then discussed key challenges in training RNNs on long sequences, namely the exploding and vanishing gradient problems that arise during backpropagation through time, and how these led to innovations such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which use gating mechanisms to better capture long-term dependencies. The concept of bidirectional networks was introduced as a way to let models learn from both past and future context in a sequence. We also touched on hybrid architectures such as CNN+RNN or CNN+LSTM, which combine spatial feature extraction with sequential modeling and are commonly used in tasks like video analysis or speech recognition.
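The gated and bidirectional variants can be sketched in the same assumed PyTorch setting; the layer sizes below are illustrative, not taken from the session.

```python
# A minimal sketch, assuming PyTorch; layer sizes are illustrative.
import torch
import torch.nn as nn

in_dim, hidden_dim, seq_len, batch = 16, 32, 50, 8
x = torch.randn(batch, seq_len, in_dim)

# LSTM: input, forget, and output gates plus a separate cell state let information
# survive over long ranges, mitigating the vanishing-gradient problem.
lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)

# GRU: a lighter gating scheme (update and reset gates) with no separate cell state.
gru = nn.GRU(in_dim, hidden_dim, batch_first=True)
out_gru, h_gru = gru(x)

# Bidirectional: one pass reads left-to-right and another right-to-left, and their
# outputs are concatenated, so every time step sees both past and future context.
bi_lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
out_bi, _ = bi_lstm(x)

print(out_lstm.shape, out_gru.shape, out_bi.shape)
# torch.Size([8, 50, 32]) torch.Size([8, 50, 32]) torch.Size([8, 50, 64])
```

A CNN+LSTM hybrid of the kind mentioned above would simply place a convolutional feature extractor in front of such a layer, turning each video frame or spectrogram slice into a feature vector before the recurrent stack models the sequence.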
The latter part of the discussion focused on the evolution beyond RNNs with the introduction of the Transformer architecture, as described in the paper "Attention is All You Need". This model relies entirely on attention mechanisms, removing the need for recurrence and allowing for much more efficient parallelization during training. Finally, we examined how this architecture influenced later developments such as Vision Transformers (ViT), which adapt the Transformer model for image processing by treating image patches as tokens, bringing the power of attention-based modeling into the computer vision domain.
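As a rough illustration of the core idea, here is a sketch of scaled dot-product self-attention, the operation the Transformer is built on; PyTorch is again assumed, the function name and dimensions are illustrative, and this is not the paper's reference implementation.

```python
# A minimal sketch, assuming PyTorch; an illustration, not the paper's reference code.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Every position attends to every other position, so the whole sequence is processed in parallel."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise query-key similarities
    weights = F.softmax(scores, dim=-1)                  # attention distribution for each query
    return weights @ v                                   # weighted mixture of the values

batch, seq_len, d_model = 2, 6, 64
x = torch.randn(batch, seq_len, d_model)

# In self-attention, queries, keys, and values are learned projections of the same input;
# identity "projections" are used here purely for brevity.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)   # torch.Size([2, 6, 64])
```

A Vision Transformer applies this same operation to images by slicing each image into fixed-size patches and embedding every patch as a token, so, for example, a 224x224 image cut into 16x16 patches becomes a sequence of 196 tokens before attention is applied.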