Attention is All You Need (2017)

  • Objective: improve performance on machine translation tasks
  • Introduces the Transformer, a model with no recurrence or convolutions that relies entirely on attention mechanisms (Self-Attention)
  • Dimensions: (batch_size, seq_length, d_model)
    • d_model = 512, each token is converted into a vector with this dimension
  • Attention
    • Scaled Dot-Product Attention over (Query, Key, Value); called Self-Attention when Q, K and V all come from the same sequence (a code sketch follows this Attention list)
      • W_q, W_k, W_v, learned weights of Query, Key and Value
      • Queries and Keys have dimension d_k; they must share this dimension so that the dot product Q · K^T is defined
      • Values of dimension d_v
      • Q, K and V are obtained by multiplying the input x with the weight matrices: Q = x · W_q, K = x · W_k, V = x · W_v
      • attention(Q, K, V) = softmax( Q · K^T / sqrt( d_k ) ) · V
    • Multi-Head Attention
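
The attention formula above maps directly onto a few tensor operations. Below is a minimal PyTorch sketch (not code from the paper; the function name, the optional `mask` argument, and the example shapes are illustrative), assuming d_model = 512 and a single head with d_k = d_v = 64:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive -inf and vanish after the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v)          # weighted sum of the values

# Self-attention: Q, K and V are all projections of the same input x
x = torch.randn(2, 10, 512)                 # (batch_size, seq_length, d_model)
w_q = torch.nn.Linear(512, 64, bias=False)  # W_q: d_model -> d_k
w_k = torch.nn.Linear(512, 64, bias=False)  # W_k: d_model -> d_k
w_v = torch.nn.Linear(512, 64, bias=False)  # W_v: d_model -> d_v
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))
print(out.shape)                            # torch.Size([2, 10, 64])
```

The division by sqrt(d_k) keeps the dot products from growing large for big d_k, which would otherwise push the softmax into regions with very small gradients.
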
  • Transformer Architecture
    • Consists of Encoder-Decoder
    • Preprocessing (same for Encoder and Decoder)
      • Tokenization - splits the sentences into tokens
      • Input Embedding (learned) - converts each token into a vector of dimension d_model
      • Positional Encoding - adds a positional encoding vector (also of dimension d_model) to each input embedding (see the sketch after this Preprocessing list)
        • sin() is applied to the even embedding dimensions and cos() to the odd ones: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
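
A sketch of the sinusoidal positional encoding, added (not concatenated) to the learned embeddings. The vocab_size and batch shapes are illustrative; the sqrt(d_model) scaling of the embeddings follows the paper:

```python
import math
import torch

def positional_encoding(seq_length, d_model=512):
    # pos: (seq_length, 1); the arange over even indices gives 2i
    pos = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))  # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even embedding dimensions -> sin
    pe[:, 1::2] = torch.cos(pos * div)  # odd embedding dimensions  -> cos
    return pe

# Token ids -> learned embedding -> add positional encoding
vocab_size, d_model = 10000, 512                 # vocab_size is illustrative
embed = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 10))   # (batch_size, seq_length)
x = embed(tokens) * math.sqrt(d_model)           # embedding scaling from the paper
x = x + positional_encoding(tokens.size(1))      # (2, 10, 512)
```
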
    • Encoder (N = 6 identical layers) - processes the whole input sequence at once, mapping it to a sequence of continuous representations; each layer has two sublayers (all sublayers produce outputs of dimension d_model):
      • Residual Connection: x
      1. (Sublayer) Multi-Head Attention (h = 8 parallel attention heads) - h Self-Attention operations run in parallel
        • Queries, Keys and Values are linearly projected h times
        • Each head operates on dimension d_k = d_v = d_model / h = 64
        • The outputs of all heads are concatenated and projected back to d_model (see the sketch after the Add & Norm step below)
      • Add & Norm, Add Residual Connection followed by Layer Normalization: LayerNorm(x + Sublayer(x))
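
A sketch of this first sublayer under the paper's settings (d_model = 512, h = 8, so d_k = d_v = 64 per head). The class name and the use of one fused projection per Q/K/V (equivalent to h separate per-head projections) are implementation choices, not mandated by the paper:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h     # d_k = d_v = 64 per head
        # One fused projection per Q/K/V stands in for h per-head projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final output projection

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        # Project, then split d_model into h heads: (batch, h, seq, d_k)
        q = self.w_q(query).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(b, -1, self.h, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ v   # (batch, h, seq, d_k)
        # Concatenate the heads back to d_model, then project
        concat = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(concat)

# Sublayer 1 wrapped in the residual connection and layer normalization
x = torch.randn(2, 10, 512)                     # (batch, seq, d_model)
attn, norm = MultiHeadAttention(), nn.LayerNorm(512)
out = norm(x + attn(x, x, x))                   # LayerNorm(x + Sublayer(x))
print(out.shape)                                # torch.Size([2, 10, 512])
```
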
      2. (Sublayer) Position-wise Fully Connected Feed-Forward Network
        • Linear() → ReLU() → Linear()
        • FFN(x) = max(0, xW1 + b1)W2 + b2 (simplified)
          • Inner layer has dimensionality d_ff = 2048; input and output have dimensionality d_model = 512 (see the sketch after this Encoder block)
      • Add & Norm
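
A sketch of the second sublayer, assuming d_ff = 2048 as above; the same two linear transformations are applied to every position independently (the class name is illustrative):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # W1, b1: 512 -> 2048
        self.linear2 = nn.Linear(d_ff, d_model)  # W2, b2: 2048 -> 512
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (batch, seq, d_model)
        return self.linear2(self.relu(self.linear1(x)))

# Followed by the same Add & Norm wrap as the attention sublayer
x = torch.randn(2, 10, 512)
ffn, norm = PositionwiseFeedForward(), nn.LayerNorm(512)
out = norm(x + ffn(x))                           # (2, 10, 512)
```
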
    • Decoder (N = 6 identical layers) - auto-regressive: generates the output sequence one element at a time, feeding the previously generated symbols back in at each step; each layer has three sublayers (all sublayers produce outputs of dimension d_model):
      • Residual Connection: x
      1. (Sublayer) Masked Multi-Head Attention (h = 8 parallel attention heads; the same as in the Encoder except for the mask) - prevents each position from attending to subsequent positions, preserving the auto-regressive property (see the sketch after this Decoder block)
      • Add & Norm
      2. (Sublayer) Encoder-Decoder Attention (or Cross-Attention) - performs Multi-Head Attention over the output of the encoder stack
        • Queries come from the previous decoder sublayer; Keys and Values come from the encoder output (also covered in the sketch after this Decoder block)
      • Add & Norm
      3. (Sublayer) Position-wise Fully Connected Feed-Forward Network (identical to the one in the Encoder)
      • Add & Norm
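
A compact sketch of the two decoder-specific points: the mask that blocks attention to subsequent positions, and cross-attention where Queries come from the decoder while Keys and Values come from the encoder output. Single-head, unprojected attention is used here for brevity (the real sublayers use Multi-Head Attention plus Add & Norm as above); all names and shapes are illustrative:

```python
import math
import torch

def attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def subsequent_mask(seq_length):
    # Lower-triangular matrix: position i may only attend to positions <= i
    return torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))

# Masked self-attention (sublayer 1): Q, K and V all come from the decoder input
dec = torch.randn(2, 7, 64)          # decoder states; d_k = 64 for brevity
masked = attention(dec, dec, dec, mask=subsequent_mask(7))

# Cross-attention (sublayer 2): Q from the decoder, K and V from the encoder output
enc = torch.randn(2, 10, 64)         # output of the encoder stack
cross = attention(masked, enc, enc)  # (2, 7, 64): one output per target position
```
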
    • Linear (learned) - projects the decoder output from d_model to the target-language vocab_size
    • Softmax - converts the output of the decoder to predicted next-token probabilities
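
A sketch of this output head; vocab_size is an illustrative number and `generator` is just a convenient name for the final learned projection:

```python
import torch
import torch.nn as nn

vocab_size = 32000                      # illustrative target-language vocabulary size
generator = nn.Linear(512, vocab_size)  # learned projection: d_model -> vocab_size

dec_out = torch.randn(2, 7, 512)        # decoder stack output (batch, seq, d_model)
logits = generator(dec_out)             # (2, 7, 32000)
probs = torch.softmax(logits, dim=-1)   # predicted next-token probabilities
next_token = probs[:, -1].argmax(dim=-1)  # greedy pick for the latest position
```
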
  • Advantages
    • Outperforms previous recurrence-based models (RNN, LSTM, GRU) on machine translation tasks
    • Can be trained significantly faster than architectures based on recurrence or convolutions, because self-attention over all positions is computed in parallel rather than sequentially