
Encoder-Decoder Models

The encoder-decoder architecture, also known as the sequence-to-sequence (Seq2Seq) model, is a foundational deep learning framework used for tasks involving input and output sequences of variable lengths. It originated in the context of machine translation but has since been widely applied across a range of natural language processing (NLP) and sequence modeling tasks.

Architecture Overview
The encoder-decoder model consists of two main components:

Encoder

The encoder reads and compresses the input sequence into a fixed-dimensional context vector (also called the thought vector or latent representation). An RNN encoder processes the input one element at a time, while a transformer encoder processes all positions in parallel; in either case the result captures the semantic content of the input.
Formally, for an input sequence x = (x₁, x₂, ..., xₙ):

Each input token xᵢ is embedded using an embedding matrix E, resulting in an embedding vector
eᵢ.

The embeddings are passed through a recurrent or transformer-based network:


For RNN-based encoders:

hᵢ = RNN(hᵢ₋₁, eᵢ)

where hᵢ is the hidden state at timestep i.


For transformer-based encoders:


H = TransformerEncoder([e₁, e₂, ..., eₙ])
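
As a concrete illustration, a minimal RNN encoder might look as follows in PyTorch (the choice of a GRU, the layer sizes, and the batch-first layout are assumptions of this sketch, not part of the notes above):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)         # embedding matrix E: x_i -> e_i
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)   # h_i = RNN(h_{i-1}, e_i)

    def forward(self, x):                    # x: (batch, n) token ids
        e = self.embedding(x)                # (batch, n, emb_dim)
        outputs, h_n = self.rnn(e)           # outputs: every h_i; h_n: final hidden state
        return outputs, h_n                  # h_n serves as the fixed-size context vector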

Decoder

The decoder generates the output sequence step-by-step using the context vector from the encoder and
the previously generated outputs.

For an output sequence y = (y₁, y₂, ..., yₘ):

The decoder uses teacher forcing during training, where the ground truth yᵢ₋₁ is used to predict
yᵢ.

At inference, previously generated output is fed back in.

Each output token is generated as:


For RNN-based decoders:

sᵢ = RNN(sᵢ₋₁, yᵢ₋₁, context)
yᵢ = softmax(W·sᵢ + b)

For transformer-based decoders:


Y = TransformerDecoder([y₁, y₂, ..., yᵢ₋₁], EncoderOutput)
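
Continuing the sketch above, a matching RNN decoder might look like this (concatenating the context vector with the embedded previous token is one common design, and an assumption of this illustration):

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # s_i = RNN(s_{i-1}, [y_{i-1}; context])
        self.rnn = nn.GRU(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # logits for y_i = softmax(W s_i + b)

    def forward(self, y_prev, s_prev, context):
        # y_prev: (batch, 1) previous token; s_prev: (1, batch, hidden); context: (batch, 1, hidden)
        e = self.embedding(y_prev)                               # (batch, 1, emb_dim)
        s, s_n = self.rnn(torch.cat([e, context], dim=-1), s_prev)
        return self.out(s.squeeze(1)), s_n                       # softmax is applied inside the loss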

Attention Mechanism
A major advancement was the introduction of attention (Bahdanau et al., 2014) to address the
bottleneck of the fixed-size context vector. Attention allows the decoder to access all encoder hidden
states dynamically, learning to focus on relevant parts of the input sequence at each step.

Let h = (h₁, ..., hₙ) be encoder states, and sᵢ the decoder state:

Compute attention weights:

αᵢⱼ = softmaxⱼ(score(sᵢ, hⱼ)) = exp(score(sᵢ, hⱼ)) / Σₖ exp(score(sᵢ, hₖ))

Compute context vector:

cᵢ = Σⱼ αᵢⱼ hⱼ

Use cᵢ in decoder step.

Common scoring functions:

Dot product: score(s, h) = sᵀh

General: score(s, h) = sᵀWₐh

Additive: score(s, h) = vᵀ tanh(W₁s + W₂h)
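
The additive (Bahdanau-style) score above can be sketched as follows; the tensor shapes and the single-decoder-step formulation are assumptions of this illustration:

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W1 s
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W2 h
        self.v = nn.Linear(hidden_dim, 1, bias=False)            # v^T tanh(...)

    def forward(self, s_i, h):
        # s_i: (batch, hidden) decoder state; h: (batch, n, hidden) encoder states
        scores = self.v(torch.tanh(self.W1(s_i).unsqueeze(1) + self.W2(h)))  # (batch, n, 1)
        alpha = torch.softmax(scores, dim=1)        # attention weights alpha_ij over positions j
        c_i = (alpha * h).sum(dim=1)                # context c_i = sum_j alpha_ij h_j
        return c_i, alpha.squeeze(-1)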

Variants of Encoder-Decoder Architectures


RNN-based Encoder-Decoder

LSTM and GRU variants are commonly used to capture long-term dependencies.

Example: the 2014 sequence-to-sequence NMT model of Sutskever et al. (Google) used an LSTM encoder-decoder without attention; Google's production GNMT system (2016) combined LSTMs with attention.

CNN-based Encoder-Decoder

Convolutional networks can model sequences without recurrence.

Example: Facebook’s Convolutional Sequence to Sequence model (Gehring et al., 2017).

Transformer-based Encoder-Decoder

Introduced in “Attention is All You Need” (Vaswani et al., 2017).

Replaces recurrence with self-attention.

Encoder and decoder are stacks of self-attention and feedforward layers.

Enables parallelization and captures long-range dependencies efficiently.
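
PyTorch ships these encoder and decoder stacks as a single module; the hyperparameters below correspond to the base configuration in the paper, while the random inputs are purely illustrative:

import torch
import torch.nn as nn

# Encoder and decoder are each a stack of self-attention + feedforward layers.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(12, 32, 512)   # (source_len, batch, d_model): already-embedded source
tgt = torch.rand(10, 32, 512)   # (target_len, batch, d_model): embedded, right-shifted target
# In training, a causal mask (nn.Transformer.generate_square_subsequent_mask) is also passed
# so the decoder cannot attend to future target positions.
out = model(src, tgt)           # (target_len, batch, d_model): decoder representations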

Training Objective
Encoder-decoder models are typically trained using the cross-entropy loss:


Loss = -Σₜ log P(yₜ | y₁, ..., yₜ₋₁, x)

During training, teacher forcing is used to provide the correct previous token.
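
A hedged sketch of this objective with teacher forcing, reusing the toy Encoder and Decoder classes from earlier (the padding index of 0 and the use of the final encoder state as context are assumptions):

criterion = nn.CrossEntropyLoss(ignore_index=0)         # ignore padded target positions

def train_step(encoder, decoder, x, y):
    # x: (batch, n) source ids; y: (batch, m) gold target ids
    _, s = encoder(x)                                    # final hidden state as context
    context = s.transpose(0, 1)                          # (batch, 1, hidden)
    loss = 0.0
    for t in range(1, y.size(1)):
        logits, s = decoder(y[:, t-1:t], s, context)     # teacher forcing: feed gold y_{t-1}
        loss = loss + criterion(logits, y[:, t])         # -log P(y_t | y_<t, x)
    return loss / (y.size(1) - 1)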

Inference Strategies
During inference, models rely on decoding strategies:

Greedy decoding: selects the token with the highest probability at each step.

Beam search: keeps top-k most probable sequences at each step.

Sampling-based methods: introduce randomness, e.g., top-k sampling or nucleus (top-p) sampling.
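
A greedy-decoding sketch under the same toy setup (the BOS/EOS token ids and maximum length are placeholders):

def greedy_decode(encoder, decoder, x, bos_id=1, eos_id=2, max_len=50):
    _, s = encoder(x)
    context = s.transpose(0, 1)
    y_prev = torch.full((x.size(0), 1), bos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        logits, s = decoder(y_prev, s, context)
        y_prev = logits.argmax(dim=-1, keepdim=True)     # greedy: take the highest-probability token
        generated.append(y_prev)
        if (y_prev == eos_id).all():                     # stop once every sequence has emitted EOS
            break
    return torch.cat(generated, dim=1)

Beam search generalizes this loop by keeping the k most probable partial hypotheses at each step instead of a single one.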

Applications
Neural Machine Translation (NMT)

First large-scale success of encoder-decoder with attention.

Translates sentences from source to target language.

Examples: Google Translate, OpenNMT.

Text Summarization

Encodes long documents, decodes concise summaries.

Used in news aggregation and report generation.

Question Answering and Conversational AI

Encodes question and context, decodes answer.


Applied in encoder-decoder models like T5, and in conversational systems more broadly (ChatGPT itself uses a decoder-only architecture).

Image Captioning

Encoder: CNN extracts image features.

Decoder: RNN or Transformer generates textual description.

Speech Recognition and Synthesis

Encoder-decoder converts spectrograms to text (ASR) or vice versa (TTS).


Examples: Listen, Attend and Spell (ASR) and Tacotron (TTS); DeepSpeech, by contrast, is CTC-based rather than encoder-decoder.

Code Generation

Models like CodeT5 use an encoder-decoder to translate natural language into code (Codex, by contrast, is decoder-only).

Real-World Models
Google’s NMT System (2016)

RNN-based with attention.


Production-quality translation for many languages.

Facebook’s FairSeq

Supports RNN, CNN, and Transformer encoder-decoders.

T5 (Text-to-Text Transfer Transformer)

Treats every NLP task as a text-to-text problem.

Unified pretraining and finetuning.
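
As a usage sketch, a pretrained T5 checkpoint can be run through the Hugging Face Transformers library listed later in these notes; the checkpoint name, prompt, and generation settings here are illustrative:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5's text-to-text interface: the task is specified as a text prefix.
inputs = tokenizer("translate English to German: The house is small.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, num_beams=4)   # beam search decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))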

BART (Bidirectional and Auto-Regressive Transformers)

Encoder: bidirectional, reads text corrupted by a denoising (noising) scheme.

Decoder: autoregressive, reconstructs the original text.


Used in summarization and dialogue generation.

MarianNMT

Efficient multilingual NMT system using transformer encoder-decoder.

Advancements and Research Trends


Pretraining + Finetuning

Pretrain on massive corpora, then finetune on task-specific data.


Reduces data requirements and improves generalization.

Multilingual and Multimodal Encoders

Encode inputs from different languages or modalities (e.g., text+image).

Improves generalization across tasks.

Retrieval-Augmented Encoder-Decoder

Combines retrieval with generation.

E.g., RAG (Retrieval-Augmented Generation) integrates vector databases with seq2seq.

Scaling Laws

Performance improves predictably as data, compute, and parameter count are scaled up.


GPT-3 and PaLM use transformer decoders, but encoder-decoder models still dominate many
supervised NLP tasks.

Efficiency Improvements

Sparse attention, quantization, and distillation reduce inference cost.


Libraries such as LightSeq and Hugging Face Transformers provide optimized runtimes.

Limitations
Exposure bias: discrepancy between training and inference due to teacher forcing.

Fixed decoding order: rigid left-to-right decoding can be suboptimal.


Data inefficiency: large datasets required for competitive performance.

Length constraints: difficult to scale to very long sequences without specialized architectures (e.g.,
LongT5).

Summary of Key Differences: RNN vs CNN vs Transformer Encoder-Decoder

Feature | RNN | CNN | Transformer
Parallelization | Poor | Moderate | Excellent
Long-range dependencies | Moderate (LSTM/GRU) | Limited | Strong (via self-attention)
Position information | Implicit (sequential order) | Requires positional encoding | Requires positional encoding
Speed | Slow (sequential) | Fast | Fast
Accuracy (SOTA) | Outdated | Moderate | Current state of the art

Notable Datasets
WMT (for translation)

CNN/DailyMail (summarization)
SQuAD (question answering)

LibriSpeech (speech recognition)


MS-COCO (image captioning)

Libraries and Frameworks
Hugging Face Transformers
OpenNMT

Fairseq
Tensor2Tensor

AllenNLP

Summary
Encoder-decoder models are a cornerstone in modern AI for sequence generation tasks. Their evolution
from RNNs with attention to transformer-based architectures has led to significant advancements in
performance, scalability, and generalization. They remain a fundamental abstraction for many
supervised and generative NLP systems.
