Attention is All You Need (2017)

  • Objective: improve performance on machine translation tasks
  • Introduces the Transformer, a model with no recurrence or convolutions that relies entirely on attention mechanisms (Self-Attention)
  • Dimensions: (batch_size, seq_length, d_model)
    • d_model = 512, each token is converted into a vector with this dimension
  • Attention
    • Scaled Dot-Product Attention over (Query, Key, Value); called Self-Attention when Q, K and V all come from the same sequence (a code sketch follows this Attention list)
      • W_q, W_k, W_v, learned weights of Query, Key and Value
      • Queries and Keys have dimension d_k; they must share this dimension so that the dot product Q · K^T is defined
      • Values of dimension d_v
      • Q, K and V are obtained by multiplying the input x with the weight matrices: Q = x · W_q, K = x · W_k, V = x · W_v
      • attention(Q, K, V) = softmax( Q · K^T / sqrt( d_k ) ) · V
    • Multi-Head Attention
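
The attention formula above maps directly onto a few tensor operations. Below is a minimal PyTorch sketch (not code from the paper; the function name, the optional `mask` argument, and the example shapes are illustrative), assuming d_model = 512 and a single head with d_k = d_v = 64:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive -inf and vanish after the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v)          # weighted sum of the values

# Self-attention: Q, K and V are all projections of the same input x
x = torch.randn(2, 10, 512)                 # (batch_size, seq_length, d_model)
w_q = torch.nn.Linear(512, 64, bias=False)  # W_q: d_model -> d_k
w_k = torch.nn.Linear(512, 64, bias=False)  # W_k: d_model -> d_k
w_v = torch.nn.Linear(512, 64, bias=False)  # W_v: d_model -> d_v
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))
print(out.shape)                            # torch.Size([2, 10, 64])
```

The division by sqrt(d_k) keeps the dot products from growing large for big d_k, which would otherwise push the softmax into regions with very small gradients.
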
  • Transformer Architecture
    • Consists of Encoder-Decoder
    • Preprocessing (same for Encoder and Decoder)
      • Tokenization - splits the sentences into tokens
      • Input Embedding (learned) - converts each token into a vector of dimension d_model
      • Positional Encoding - adds a positional encoding vector (also of dimension d_model) to each input embedding (see the sketch after this Preprocessing list)
        • sin() is applied to the even embedding dimensions and cos() to the odd ones: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
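
A sketch of the sinusoidal positional encoding, added (not concatenated) to the learned embeddings. The vocab_size and batch shapes are illustrative; the sqrt(d_model) scaling of the embeddings follows the paper:

```python
import math
import torch

def positional_encoding(seq_length, d_model=512):
    # pos: (seq_length, 1); the arange over even indices gives 2i
    pos = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))  # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even embedding dimensions -> sin
    pe[:, 1::2] = torch.cos(pos * div)  # odd embedding dimensions  -> cos
    return pe

# Token ids -> learned embedding -> add positional encoding
vocab_size, d_model = 10000, 512                 # vocab_size is illustrative
embed = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 10))   # (batch_size, seq_length)
x = embed(tokens) * math.sqrt(d_model)           # embedding scaling from the paper
x = x + positional_encoding(tokens.size(1))      # (2, 10, 512)
```
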
    • Encoder (N = 6 identical layers) - processes the whole input sequence at once, mapping it to a sequence of continuous representations; each layer has two sublayers (all sublayers produce outputs of dimension d_model):
      • Residual Connection: x
      1. (Sublayer) Multi-Head Attention (h = 8 parallel attention heads) - h Self-Attention operations run in parallel
        • Queries, Keys and Values are linearly projected h times
        • Each head operates on dimension d_k = d_v = d_model / h = 64
        • The outputs of all heads are concatenated and projected back to d_model (see the sketch after the Add & Norm step below)
      • Add & Norm, Add Residual Connection followed by Layer Normalization: LayerNorm(x + Sublayer(x))
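
A sketch of this first sublayer under the paper's settings (d_model = 512, h = 8, so d_k = d_v = 64 per head). The class name and the use of one fused projection per Q/K/V (equivalent to h separate per-head projections) are implementation choices, not mandated by the paper:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h     # d_k = d_v = 64 per head
        # One fused projection per Q/K/V stands in for h per-head projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final output projection

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        # Project, then split d_model into h heads: (batch, h, seq, d_k)
        q = self.w_q(query).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(b, -1, self.h, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ v   # (batch, h, seq, d_k)
        # Concatenate the heads back to d_model, then project
        concat = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(concat)

# Sublayer 1 wrapped in the residual connection and layer normalization
x = torch.randn(2, 10, 512)                     # (batch, seq, d_model)
attn, norm = MultiHeadAttention(), nn.LayerNorm(512)
out = norm(x + attn(x, x, x))                   # LayerNorm(x + Sublayer(x))
print(out.shape)                                # torch.Size([2, 10, 512])
```
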
      2. (Sublayer) Position-wise Fully Connected Feed-Forward Network
        • Linear() → ReLU() → Linear()
        • FFN(x) = max(0, xW1 + b1)W2 + b2 (simplified)
          • Inner layer has dimensionality d_ff = 2048; input and output have dimensionality d_model = 512 (see the sketch after this Encoder block)
      • Add & Norm
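
A sketch of the second sublayer, assuming d_ff = 2048 as above; the same two linear transformations are applied to every position independently (the class name is illustrative):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # W1, b1: 512 -> 2048
        self.linear2 = nn.Linear(d_ff, d_model)  # W2, b2: 2048 -> 512
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (batch, seq, d_model)
        return self.linear2(self.relu(self.linear1(x)))

# Followed by the same Add & Norm wrap as the attention sublayer
x = torch.randn(2, 10, 512)
ffn, norm = PositionwiseFeedForward(), nn.LayerNorm(512)
out = norm(x + ffn(x))                           # (2, 10, 512)
```
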
    • Decoder (N = 6 identical layers) - auto-regressive: generates the output sequence one element at a time, feeding the previously generated symbols back in at each step; each layer has three sublayers (all sublayers produce outputs of dimension d_model):
      • Residual Connection: x
      1. (Sublayer) Masked Multi-Head Attention (h = 8 parallel attention heads; the same as in the Encoder except for the mask) - prevents each position from attending to subsequent positions, preserving the auto-regressive property (see the sketch after this Decoder block)
      • Add & Norm
      2. (Sublayer) Encoder-Decoder Attention (or Cross-Attention) - performs Multi-Head Attention over the output of the encoder stack
        • Queries come from the previous decoder sublayer; Keys and Values come from the encoder output (also covered in the sketch after this Decoder block)
      • Add & Norm
      3. (Sublayer) Position-wise Fully Connected Feed-Forward Network (identical to the one in the Encoder)
      • Add & Norm
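
A compact sketch of the two decoder-specific points: the mask that blocks attention to subsequent positions, and cross-attention where Queries come from the decoder while Keys and Values come from the encoder output. Single-head, unprojected attention is used here for brevity (the real sublayers use Multi-Head Attention plus Add & Norm as above); all names and shapes are illustrative:

```python
import math
import torch

def attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def subsequent_mask(seq_length):
    # Lower-triangular matrix: position i may only attend to positions <= i
    return torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))

# Masked self-attention (sublayer 1): Q, K and V all come from the decoder input
dec = torch.randn(2, 7, 64)          # decoder states; d_k = 64 for brevity
masked = attention(dec, dec, dec, mask=subsequent_mask(7))

# Cross-attention (sublayer 2): Q from the decoder, K and V from the encoder output
enc = torch.randn(2, 10, 64)         # output of the encoder stack
cross = attention(masked, enc, enc)  # (2, 7, 64): one output per target position
```
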
    • Linear (learned) - projects the decoder output from d_model to the target-language vocab_size
    • Softmax - converts the output of the decoder to predicted next-token probabilities
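
A sketch of this output head; vocab_size is an illustrative number and `generator` is just a convenient name for the final learned projection:

```python
import torch
import torch.nn as nn

vocab_size = 32000                      # illustrative target-language vocabulary size
generator = nn.Linear(512, vocab_size)  # learned projection: d_model -> vocab_size

dec_out = torch.randn(2, 7, 512)        # decoder stack output (batch, seq, d_model)
logits = generator(dec_out)             # (2, 7, 32000)
probs = torch.softmax(logits, dim=-1)   # predicted next-token probabilities
next_token = probs[:, -1].argmax(dim=-1)  # greedy pick for the latest position
```
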
  • Advantages
    • Outperforms previous recurrence-based models (RNN, LSTM, GRU) on machine translation tasks
    • Can be trained significantly faster than architectures based on recurrence or convolutions, because self-attention over all positions is computed in parallel rather than sequentially