Attention is All You Need (2017)
Paper: Vaswani et al., "Attention Is All You Need" (2017)
- Objective is to improve language translation tasks
- Introduces the Transformer model, which has no recurrence or convolutions and relies solely on Self-Attention
- Dimensions: (batch_size, seq_length, d_model)
- d_model = 512, each token is converted into a vector with this dimension
- Attention
- Self-Attention or Scaled Dot Product Attention (Query, Key, Value)
- W_q, W_k, W_v, learned weights of Query, Key and Value
- Queries and Keys have dimension d_k; they must share the same dimension so their dot product Q · K^T is defined
- Values of dimension d_v
- Q, K and V are obtained by multiplying the input x with the weight matrices: Q = x · W_q, K = x · W_k, V = x · W_v
- attention(Q, K, V) = softmax( Q · K^T / sqrt( d_k ) ) · V
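A minimal PyTorch sketch of the formula above; the function name and shapes are illustrative, not taken from the paper:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    d_k = Q.size(-1)
    # Dot-product similarity, scaled by sqrt(d_k) to keep the softmax in a stable range
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is 0/False are excluded from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention weights over the keys
    return torch.matmul(weights, V)           # weighted sum of the values
```

The optional mask argument is what the decoder's masked self-attention uses later to block future positions.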
- Multi-Head Attention
- Runs h Scaled Dot-Product Attention heads in parallel on linearly projected Queries, Keys and Values; the head outputs are concatenated and projected back to d_model (see the sketch below)
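A sketch of Multi-Head Attention built on the scaled_dot_product_attention function above; using one d_model-to-d_model projection per Q/K/V and splitting it into heads is an implementation choice, not something the paper prescribes:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # Learned projections W_q, W_k, W_v plus the output projection W_o
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.view(b, s, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        heads = scaled_dot_product_attention(Q, K, V, mask)   # sketch above, one call covers all h heads
        # Concatenate the h heads back into (batch, seq, d_model), then project
        b, _, s, _ = heads.shape
        return self.w_o(heads.transpose(1, 2).contiguous().view(b, s, self.h * self.d_k))
```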
- Transformer Architecture
- Consists of Encoder-Decoder
- Preprocessing (same for Encoder and Decoder)
- Tokenization - splits the sentences into tokens
- Input Embedding (learned) - converts each token into a vector of dimension d_model
- Positional Encoding - adds a positional encoding vector (also of dimension d_model) to each input embedding
- Uses sin() for even embedding dimension indices (2i) and cos() for odd indices (2i+1); the wavelength varies with the dimension index (see the sketch below)
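A sketch of the sinusoidal positional encoding: sin() fills the even dimension indices and cos() the odd ones, with the wavelength depending on the index. The paper also multiplies the embeddings by sqrt(d_model) before adding the encoding. The function name and max_len argument are mine:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # added to the (scaled) token embeddings, shape (max_len, d_model)
```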
- Encoder (N = 6 identical layers), auto-encoding, maps an input sequence to a sequence of continuous representations, has two sublayers (all sublayers produce outputs of dimension d_model):
- Residual Connection: x
- (Sublayer) Multi-Head Attention (h = 8 parallel attention heads), stacked Self-Attention
- Queries, Keys and Values are linearly projected h times
- Each head operates on reduced dimensions d_k = d_v = d_model / h = 64
- The outputs of all heads are concatenated and projected back to d_model (see the encoder-layer sketch below)
- Add & Norm, Add Residual Connection followed by Layer Normalization: LayerNorm(x + Sublayer(x))
- (Sublayer) Position-wise Fully Connected Feed Forward Network
- Linear() → ReLU() → Linear()
- FFN(x) = max(0, xW1 + b1)W2 + b2
- Inner layer has dimensionality, d_ff = 2048
- Add & Norm
- Decoder (N = 6 identical layers), auto-regressive (generates the output sequence one element at a time, feeding previously generated outputs back into the decoder at each step), has three sublayers (all sublayers produce outputs of dimension d_model):
- Residual Connection: x
- (Sublayer) Masked Multi-Head Attention (h = 8 parallel attention heads, nearly identical to the Encoder but includes masking), prevents each position from attending to subsequent positions
- Add & Norm
- (Sublayer) Encoder-Decoder Attention (or Cross Attention), performs Multi-Head Attention over the output of the encoder stack
- Queries come from the previous decoder sublayer; K and V come from the output of the Encoder (see the decoder sketch below)
- Add & Norm
- (Sublayer) Position-wise Fully Connected Feed Forward Network (identical to the Encoder)
- Add & Norm
- Linear (learned) - projects the decoder output from d_model to the target-language vocabulary size (vocab_size)
- Softmax - converts the output of the decoder to predicted next-token probabilities
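A sketch of the output head: a learned linear projection from d_model to the target vocabulary, followed by a softmax over next-token probabilities (the vocab_size value here is illustrative):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 37000                 # vocab_size is illustrative
generator = nn.Linear(d_model, vocab_size)       # learned projection to the target vocabulary

decoder_output = torch.randn(1, 10, d_model)                  # (batch, seq, d_model) from the decoder stack
logits = generator(decoder_output)                            # (batch, seq, vocab_size)
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)    # distribution over the next token
predicted_id = next_token_probs.argmax(dim=-1)                # greedy choice (the paper decodes with beam search)
```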
- Advantages
- Outperforms previous recurrence-based models such as RNNs, LSTMs and GRUs on language translation tasks
- Can be trained significantly faster than architectures based on recurrence or convolutions