13.7. Attention and Transformer Models (RNN)
Problems with Long Sequences in RNNs
Sequence Length and Computation: Because RNNs, including LSTMs and GRUs, process a sequence one step at a time, computation cannot be parallelized across time steps, and information from the earliest tokens must survive many state updates to influence later outputs. The longer the sequence, the harder it becomes to process it efficiently and to retain information from its initial parts.
Limitation in Capturing Long-Term Dependencies: RNNs struggle to capture long-range relationships in a sequence, especially when sequences are extremely long or when relevant pieces of information are separated by large gaps; gradients flowing back across many steps tend to vanish, so distant dependencies are learned poorly.
Introduction to Attention Mechanisms
Core Idea: Instead of compressing the entire input into a single fixed-length hidden state, attention mechanisms allow the model to focus on specific parts of the input sequence when producing each output. It's a lot like how humans pay attention to specific parts of a sentence when comprehending or translating it.
Weights and Context Vectors: Attention scores are computed by comparing the current decoder state with each encoder state, normalized with a softmax into attention weights, and used to form a weighted sum of the encoder states, the context vector, which then feeds into the prediction. A minimal sketch follows.
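The sketch below illustrates this computation with NumPy, assuming a simple dot-product scoring function and random arrays standing in for encoder hidden states and a decoder state; all names and sizes are illustrative, not tied to any particular model.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(decoder_state, encoder_states):
    """Compute attention weights and the resulting context vector.

    decoder_state:  (hidden_dim,)         current decoder hidden state
    encoder_states: (seq_len, hidden_dim) one hidden state per input token
    """
    # Score each encoder state against the decoder state (dot-product scoring).
    scores = encoder_states @ decoder_state          # (seq_len,)
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)                        # (seq_len,)
    # The context vector is the attention-weighted sum of encoder states.
    context = weights @ encoder_states               # (hidden_dim,)
    return weights, context

# Toy example: 5 input tokens, hidden size 8.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
dec = rng.normal(size=(8,))
w, c = attention_context(dec, enc)
print("attention weights:", np.round(w, 3))   # which input positions get focus
print("context vector shape:", c.shape)       # passed on to make the prediction
```

The weights make the model's focus explicit: large entries mark the input positions that most influence the current prediction, which is also what makes attention relatively easy to inspect and visualize.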
Explanation of Transformer Models
Moving Beyond RNNs: Transformers drop recurrent layers altogether and handle sequences entirely with attention mechanisms, which makes computation across positions parallelizable and efficient.
Self-Attention: A mechanism that lets each item in a sequence attend to every other item. For each position it calculates a weighted representation of the entire sequence, which is then used in further layers (see the sketch after this list).
Positional Encoding: Since transformers have no built-in notion of token order (unlike RNNs), positional encodings, for example fixed sinusoidal signals, are added to the input embeddings so the model can take the position of items in a sequence into account.
Multi-Head Attention: Projecting the input into several lower-dimensional subspaces ("heads"), running self-attention in each head in parallel, and concatenating the results, so that different heads can capture different aspects of the data (illustrated in the sketch below).
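The sketch below ties these three ideas together on toy data: sinusoidal positional encodings are added to token embeddings, and scaled dot-product self-attention is run independently in several heads whose outputs are concatenated. The dimensions, the random (untrained) projection matrices, and the absence of masking, residuals, and layer normalization are simplifying assumptions, so this is not a full Transformer layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings: even dimensions use sine, odd use cosine."""
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d_model)

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (seq_len, seq_len)
    weights = softmax(scores)                # each row sums to 1
    return weights @ v                       # (seq_len, d)

def multi_head_self_attention(x, num_heads, rng):
    """Split d_model across heads, attend in each head, concatenate results."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        # Each head gets its own query/key/value projections (random here).
        wq = rng.normal(size=(d_model, d_head))
        wk = rng.normal(size=(d_model, d_head))
        wv = rng.normal(size=(d_model, d_head))
        outputs.append(self_attention(x @ wq, x @ wk, x @ wv))
    return np.concatenate(outputs, axis=-1)   # (seq_len, d_model)

# Toy "sentence": 6 tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
x = tokens + positional_encoding(6, 16)        # inject order information
out = multi_head_self_attention(x, num_heads=4, rng=rng)
print(out.shape)                               # (6, 16): one contextualized vector per token
```

Because every position attends to every other position in a single matrix multiplication, the whole sequence is processed at once rather than step by step, which is exactly the parallelism that recurrence prevents.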
Role of Transformers in Modern NLP
BERT (Bidirectional Encoder Representations from Transformers): An encoder-only model that reads text bidirectionally (considering both left and right context in all layers) and is pre-trained on a large corpus; it is then fine-tuned for specific downstream tasks.
GPT (Generative Pre-trained Transformer): A decoder-only model primarily used for generating text. It is trained in a two-step process: unsupervised pre-training on large amounts of text followed by supervised fine-tuning. GPT models have grown in size and capability with iterations such as GPT-2, GPT-3, and GPT-4, showcasing impressive generative capabilities.
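In practice both families are usually used through pre-trained checkpoints rather than trained from scratch. As one possible illustration (the library choice, model names, and hyperparameters here are assumptions of this note, not prescribed above), the Hugging Face transformers package exposes BERT with a classification head ready for fine-tuning and GPT-2 for text generation.

```python
# Requires: pip install transformers torch
import torch
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          AutoModelForCausalLM)

# --- BERT: pre-trained encoder plus a fresh classification head. ---
# Fine-tuning would update these weights on a labeled, task-specific dataset.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)
enc = bert_tok("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = bert(**enc).logits        # (1, 2); meaningless until fine-tuned
print(logits.shape)

# --- GPT-2: pre-trained decoder generating a continuation of a prompt. ---
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = gpt_tok("Attention mechanisms allow a model to", return_tensors="pt")
with torch.no_grad():
    out_ids = gpt.generate(**prompt, max_new_tokens=20, do_sample=True,
                           top_k=50, pad_token_id=gpt_tok.eos_token_id)
print(gpt_tok.decode(out_ids[0], skip_special_tokens=True))
```

The contrast mirrors the descriptions above: BERT produces a representation of the whole input that a task head consumes, while GPT predicts the next token repeatedly to generate text.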



