Encoder-Decoder Models
The encoder-decoder architecture, also known as the sequence-to-sequence (Seq2Seq) model, is a
foundational deep learning framework used for tasks involving input and output sequences of variable
lengths. It originated in the context of machine translation but has since been widely applied across a
range of natural language processing (NLP) and sequence modeling tasks.
Architecture Overview
The encoder-decoder model consists of two main components:
Encoder
The encoder reads the input sequence and compresses it into a fixed-dimensional context vector (also
called a thought vector or latent representation). RNN-based encoders process the input one element at a
time, while transformer encoders attend to all positions at once; either way, the encoder captures a
semantic representation of the input.
Formally, for an input sequence x = (x₁, x₂, ..., xₙ):
Each input token xᵢ is embedded using an embedding matrix E, resulting in an embedding vector eᵢ.
The embeddings are passed through a recurrent or transformer-based network:
For RNN-based encoders:
hᵢ = RNN(hᵢ₋₁, eᵢ)
where hᵢ is the hidden state at timestep i.
For transformer-based encoders:
H = TransformerEncoder([e₁, e₂, ..., eₙ])
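To make the encoder concrete, here is a minimal PyTorch sketch of a GRU-based encoder; the class, variable, and dimension names are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """GRU encoder: embeds input tokens and returns all hidden states
    plus the final hidden state (usable as a context vector)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # embedding matrix E
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        e = self.embedding(x)             # e_i vectors: (batch, seq_len, embed_dim)
        h_all, h_last = self.rnn(e)       # h_all: (batch, seq_len, hidden_dim)
        return h_all, h_last              # h_last: (1, batch, hidden_dim)
```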
Decoder
The decoder generates the output sequence step-by-step using the context vector from the encoder and
the previously generated outputs.
For an output sequence y = (y₁, y₂, ..., yₘ):
The decoder uses teacher forcing during training, where the ground-truth token yᵢ₋₁ is used to predict yᵢ.
At inference, previously generated output is fed back in.
Each output token is generated as:
For RNN-based decoders:
sᵢ = RNN(sᵢ₋₁, yᵢ₋₁, context)
yᵢ = softmax(W·sᵢ + b)
For transformer-based decoders:
Y = TransformerDecoder([y₁, y₂, ..., yᵢ₋₁], EncoderOutput)
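A matching single-step GRU decoder might look like the sketch below; feeding it the ground-truth previous token implements teacher forcing, while feeding back its own prediction gives the inference-time loop. The names are again illustrative.

```python
class Decoder(nn.Module):
    """GRU decoder: consumes the previous token and a context vector,
    emits a distribution over the output vocabulary for the next token."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)            # W, b in y_i = softmax(W·s_i + b)

    def forward(self, y_prev, s_prev, context):
        # y_prev: (batch, 1) previous token ids; s_prev: (1, batch, hidden_dim); context: (batch, hidden_dim)
        e = self.embedding(y_prev)                              # (batch, 1, embed_dim)
        rnn_in = torch.cat([e, context.unsqueeze(1)], dim=-1)   # concatenate token and context
        s_all, s_new = self.rnn(rnn_in, s_prev)
        logits = self.out(s_all.squeeze(1))                     # (batch, vocab_size), pre-softmax
        return logits, s_new
```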
Attention Mechanism
A major advancement was the introduction of attention (Bahdanau et al., 2014) to address the
bottleneck of the fixed-size context vector. Attention allows the decoder to access all encoder hidden
states dynamically, learning to focus on relevant parts of the input sequence at each step.
Let h = (h₁, ..., hₙ) be encoder states, and sᵢ the decoder state:
Compute attention weights (normalized over the encoder positions j):
αᵢⱼ = softmaxⱼ(score(sᵢ, hⱼ))
Compute context vector:
cᵢ = Σⱼ αᵢⱼ hⱼ
Use cᵢ in decoder step.
Common scoring functions:
Dot product: score(s, h) = sᵀh
General: score(s, h) = sᵀWₐh
Additive: score(s, h) = vᵀ tanh(W₁s + W₂h)
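As an illustration of the additive (Bahdanau-style) variant above, the following sketch computes the weights αᵢⱼ and the context vector cᵢ; the dimension names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Additive attention: score(s, h) = v^T tanh(W1 s + W2 h)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W1 = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_i, h):
        # s_i: (batch, dec_dim) decoder state; h: (batch, n, enc_dim) encoder states
        scores = self.v(torch.tanh(self.W1(s_i).unsqueeze(1) + self.W2(h)))  # (batch, n, 1)
        alpha = F.softmax(scores, dim=1)       # attention weights α_ij, normalized over j
        c_i = (alpha * h).sum(dim=1)           # context vector c_i = Σ_j α_ij h_j
        return c_i, alpha.squeeze(-1)
```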
Variants of Encoder-Decoder Architectures
RNN-based Encoder-Decoder
LSTM and GRU variants are commonly used to capture long-term dependencies.
Example: Google’s 2014 sequence-to-sequence NMT model (Sutskever et al.) used an LSTM-based
encoder-decoder; attention was added in later systems such as GNMT (2016).
CNN-based Encoder-Decoder
Convolutional networks can model sequences without recurrence.
Example: Facebook’s Convolutional Sequence to Sequence model (Gehring et al., 2017).
Transformer-based Encoder-Decoder
Introduced in “Attention is All You Need” (Vaswani et al., 2017).
Replaces recurrence with self-attention.
Encoder and decoder are stacks of self-attention and feedforward layers.
Enables parallelization and captures long-range dependencies efficiently.
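For intuition, PyTorch's built-in nn.Transformer wires up exactly such encoder and decoder stacks; the dimensions below are arbitrary example values, and embeddings, positional encodings, and masks are omitted for brevity.

```python
import torch
import torch.nn as nn

# 6-layer encoder and decoder stacks of self-attention + feedforward blocks
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
src = torch.rand(2, 10, 512)   # (batch, src_len, d_model) source embeddings
tgt = torch.rand(2, 7, 512)    # (batch, tgt_len, d_model) shifted target embeddings
out = model(src, tgt)          # (batch, tgt_len, d_model) decoder outputs
```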
Training Objective
Encoder-decoder models are typically trained using the cross-entropy loss:
Loss = -Σₜ log P(yₜ | y₁, ..., yₜ₋₁, x)
During training, teacher forcing is used to provide the correct previous token.
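Below is a minimal sketch of one teacher-forced training step, reusing the hypothetical Encoder/Decoder modules sketched earlier; pad_id and the choice of the last encoder state as context are assumptions of the example.

```python
import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, src, tgt, pad_id=0):
    """src, tgt: (batch, len) token-id tensors; tgt starts with a BOS token."""
    optimizer.zero_grad()
    h_all, s = encoder(src)
    context = h_all[:, -1, :]                # simplest choice: last encoder hidden state
    loss = 0.0
    for t in range(1, tgt.size(1)):
        y_prev = tgt[:, t - 1].unsqueeze(1)  # teacher forcing: ground-truth previous token
        logits, s = decoder(y_prev, s, context)
        loss = loss + F.cross_entropy(logits, tgt[:, t], ignore_index=pad_id)
    loss.backward()
    optimizer.step()
    return loss.item() / (tgt.size(1) - 1)
```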
Inference Strategies
During inference, models rely on decoding strategies:
Greedy decoding: selects the token with the highest probability at each step (see the sketch after this list).
Beam search: keeps the top-k most probable partial sequences at each step.
Sampling-based methods: introduce randomness, e.g., top-k sampling or nucleus (top-p) sampling.
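The sketch below illustrates greedy decoding with the same hypothetical modules; bos_id, eos_id, and max_len are assumed parameters.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src, bos_id, eos_id, max_len=50):
    h_all, s = encoder(src)
    context = h_all[:, -1, :]
    y_prev = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, s = decoder(y_prev, s, context)
        y_prev = logits.argmax(dim=-1, keepdim=True)   # pick the most probable token
        outputs.append(y_prev)
        if (y_prev == eos_id).all():                   # stop once every sequence emits EOS
            break
    return torch.cat(outputs, dim=1)
```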
Applications
Neural Machine Translation (NMT)
First large-scale success of encoder-decoder with attention.
Translates sentences from source to target language.
Examples: Google Translate, OpenNMT.
Text Summarization
Encodes long documents, decodes concise summaries.
Used in news aggregation and report generation.
Question Answering and Conversational AI
Encodes question and context, decodes answer.
Applied in encoder-decoder models like T5 and, with decoder-only architectures, in conversational systems such as ChatGPT.
Image Captioning
Encoder: CNN extracts image features.
Decoder: RNN or Transformer generates textual description.
Speech Recognition and Synthesis
Encoder-decoder models convert spectrograms to text (ASR) or text to spectrograms (TTS).
Examples: Listen, Attend and Spell (ASR), Tacotron (TTS).
Code Generation
Models like CodeT5 use an encoder-decoder to translate natural language to code (decoder-only models such as Codex tackle the same task with a different architecture).
Real-World Models
Google’s NMT System (2016)
RNN-based with attention.
Production-quality translation for many languages.
Facebook’s FairSeq
Supports RNN, CNN, and Transformer encoder-decoders.
T5 (Text-to-Text Transfer Transformer)
Treats every NLP task as a text-to-text problem.
Unified pretraining and finetuning.
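For example, assuming the Hugging Face transformers package is installed, a pretrained T5 checkpoint can be loaded and queried in the text-to-text style roughly as follows (the "t5-small" checkpoint and prompt prefix follow T5's published conventions):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```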
BART (Bidirectional and Auto-Regressive Transformers)
Bidirectional encoder paired with an autoregressive decoder.
Pretrained as a denoising autoencoder (reconstructing corrupted text).
Used in summarization and dialogue generation.
MarianNMT
Efficient multilingual NMT system using transformer encoder-decoder.
Advancements and Research Trends
Pretraining + Finetuning
Pretrain on massive corpora, then finetune on task-specific data.
Reduces data requirements and improves generalization.
Multilingual and Multimodal Encoders
Encode inputs from different languages or modalities (e.g., text+image).
Improves generalization across tasks.
Retrieval-Augmented Encoder-Decoder
Combines retrieval with generation.
E.g., RAG (Retrieval-Augmented Generation) integrates vector databases with seq2seq.
Scaling Laws
Performance improves predictably as data, compute, and parameter counts grow.
GPT-3 and PaLM use decoder-only transformers, while encoder-decoder models remain strong on many
supervised NLP tasks.
Efficiency Improvements
Sparse attention, quantization, and distillation reduce inference cost.
LightSeq and Hugging Face Transformers provide optimized runtimes.
Limitations
Exposure bias: discrepancy between training and inference due to teacher forcing.
Fixed decoding order: rigid left-to-right decoding can be suboptimal.
Data inefficiency: large datasets required for competitive performance.
Length constraints: difficult to scale to very long sequences without specialized architectures (e.g., LongT5).
Summary of Key Differences: RNN vs CNN vs Transformer Encoder-Decoder

| Feature | RNN | CNN | Transformer |
| --- | --- | --- | --- |
| Parallelization | Poor | Moderate | Excellent |
| Long-range dependencies | Moderate (LSTM/GRU) | Limited | Strong (via self-attention) |
| Position info | Implicit (order) | Requires positional encoding | Requires positional encoding |
| Speed | Slow (sequential) | Fast | Fast |
| Accuracy (SOTA) | Outdated | Moderate | Current state of the art |
Notable Datasets
WMT (for translation)
CNN/DailyMail (summarization)
SQuAD (question answering)
LibriSpeech (speech recognition)
MS-COCO (image captioning)
Libraries and Frameworks
Hugging Face Transformers
OpenNMT
Fairseq
Tensor2Tensor
AllenNLP
Summary
Encoder-decoder models are a cornerstone in modern AI for sequence generation tasks. Their evolution
from RNNs with attention to transformer-based architectures has led to significant advancements in
performance, scalability, and generalization. They remain a fundamental abstraction for many
supervised and generative NLP systems.