Demystifying LLMs
Devendra Singh Chaplot
Mistral AI
Feb 13, 2024
Mistral AI
Co-Founders:
• Arthur Mensch, CEO: former AI researcher at DeepMind, Polytechnique alum
• Timothée Lacroix, CTO: former AI researcher at Meta, ENS alum
• Guillaume Lample, Chief Scientist: former AI researcher at Meta, Polytechnique alum
Releases
$500M+ funding, Offices in Paris/London/SF Bay Area
Mistral AI LLMs
Contents
• Stages of LLM Training:
• Pretraining
• Instruction-Tuning
• Learning from Human Preferences: DPO/RLHF
• Evaluation of LLMs
• Retrieval Augmented Generation (RAG)
• Recipe for RAG with code
Stages of LLM Training
1. Pretraining
2. Instruction-Tuning
3. Learning from Human Feedback
Pretraining
Training data example, the Mixtral 8x7B paper abstract:
"We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model…"

Input: "We introduce Mixtral 8x7B , a Sparse" → Large Language Model, O(1-100B) parameters → Output: "introduce Mixtral 8x7B , a Sparse Mixture"
Pretraining
• Task: Next token prediction (see the loss sketch below)
• 1 token ~= 0.75 words
• Vocab size: O(10K) tokens
• Each token is represented by an integer

Input: "We introduce Mixtral" → Large Language Model (LLM) → Output: "introduce Mixtral 8x7B"
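Concretely, next-token prediction reduces to a cross-entropy loss over token ids shifted by one position. A minimal PyTorch sketch, an illustration rather than Mistral's actual training code; it assumes `model` maps a batch of token ids to per-position logits over the vocabulary:

import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (batch, seq_len) integer ids produced by the tokenizer."""
    inputs = token_ids[:, :-1]    # e.g. "We introduce Mixtral"
    targets = token_ids[:, 1:]    # e.g. "introduce Mixtral 8x7B" (shifted by one)
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution at each position
    # and the token that actually comes next in the training text.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))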
Pretraining
• Why is it hard?
  • Time: Datasets are huge, O(1T) tokens
    • Preprocessing, Cleaning, Deduplication
    • More data might not lead to a better model
  • Money: O(1-100B) parameters
    • O(1-10K) GPUs for weeks or months
    • O($10-100M) per model
  • YOLO: decide model architecture, hyper-parameters, and data mixture for the "big run"
    • The best hyper-parameters for a smaller model might not be the best for a larger model

[Figures: Llama pretraining data mixture; Llama model sizes, architectures, and optimization hyper-parameters]
Can we use the Pretrained model?
Prompt:
Write a python function to find whether the input number is prime.
Response:
Example:
Input: 17
Output: True
Input: 15
Output: False
Approach:
1. Let's assume that the input number is n
2. Check if n is divisible by 1. If n is divisible by 1 then it is not a prime number.
3. Check if n is divisible by 2. If n is divisible by 2 then it is not a prime number.
Can we use the Pretrained model?
Prompt:
def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
Response:
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True
Model knows the answer, but it is not aligned with human preferences.
Stages of LLM Training
1. Pretraining
2. Instruction-Tuning
3. Learning from Human Feedback
Instruction Tuning
Prompt: [INST] Write a python function to find whether the input number is prime. [/INST]
Response:
def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True

Input: "[INST] Write … [/INST] def is_prime (x)" → Large Language Model, O(1-100B) parameters → Output: "def is_prime (x) :"
Instruction Fine-tuning
• Dataset:
  • Paired: (Prompt, Response)
  • O(10-100K) instructions
• Task:
  • Next word prediction (prompt tokens masked in the loss; see the sketch below)
• Compute:
  • O(1-100) GPUs
  • Few hrs/days

Instruction-tuning: "[INST] … [/INST]" → Large Language Model (LLM) → "def is_prime"
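A sketch of the instruction-tuning loss under the usual reading of "masked" on the slide: the same next-token objective as pretraining, but positions belonging to the prompt are excluded from the loss, so the model is only trained to produce the response. This is an illustration, not Mistral's training code:

import torch
import torch.nn.functional as F

def instruction_tuning_loss(model, token_ids, prompt_lengths):
    """token_ids: (batch, seq_len); prompt_lengths[i] = number of prompt tokens in row i."""
    inputs = token_ids[:, :-1]
    targets = token_ids[:, 1:].clone()
    for i, n_prompt in enumerate(prompt_lengths):
        # Targets at positions 0..n_prompt-2 predict prompt tokens; ignore them.
        targets[i, : n_prompt - 1] = -100   # -100 is skipped by cross_entropy
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )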
Stages of LLM Training
1. Pretraining
2. Instruction-Tuning
3. Learning from Human Feedback
Human Preferences
Human preferences are cheaper/easier to collect than human annotations.
Prompt: [INST] Write a python function to find whether the input number is prime. [/INST]
Response 1:
def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True
Response 2:
def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True
Response 1 > Response 2 (both are correct, but Response 1 only checks divisors up to √x, so it is more efficient)
Reinforcement Learning from Human Feedback (RLHF)
[Deep Reinforcement Learning from Human Preferences. Christiano et al. 2017]
Direct Preference Optimization (DPO)
[Deep Reinforcement Learning from Human Preferences. Christiano et al. 2017]
[Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov et al. 2023]
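For reference, here is the DPO objective from Rafailov et al. 2023 (not written out on the original slide). It trains the policy directly on preference pairs, with no explicit reward model or RL loop: y_w is the preferred response, y_l the dispreferred one, π_ref a frozen reference model (typically the instruction-tuned checkpoint), and β controls how far the policy may drift from the reference.

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]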
Stages of LLM Training

             Pretraining                 Instruction-Tuning               Learning from Human Feedback
Dataset:     Raw text,                   Paired: (Prompt, Response),      Human preference data,
             few trillions of tokens     O(10-100K) instructions          O(10-100K)
Task:        Next word prediction        Next word prediction (masked)    RLHF/DPO
Compute:     O(1-10K) GPUs,              O(1-100) GPUs,                   O(1-100) GPUs,
             weeks/months of training    few hrs/days                     few hrs/days
Evaluation of LLMs
Evaluation of pretrained models
0-shot:
def is_prime(x: int):
    """
    takes as input an integer x.
    Returns True if x is prime and False otherwise
    """

3-shot:
## How old is Barack Obama in 2014?
Barack Obama is 57 years old in 2014.
## What is Barack Obama's birthday?
Barack Obama was born on August 4, 1961.
## What is the name of Barack Obama's wife?
Barack Obama's wife is Michelle Obama.
## How tall is Barack Obama?
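A sketch of how k-shot evaluation of a completion-only pretrained model typically works: build a prompt from k solved examples in the benchmark's format, append the test question, and score the model's completion. The "##" format mirrors the 3-shot example above; the naive first-line match below is an illustrative assumption, as real benchmarks use task-specific scoring:

def k_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    # examples: list of (question, answer) pairs shown to the model as context.
    shots = "\n".join(f"## {q}\n{a}" for q, a in examples)
    return f"{shots}\n## {question}\n"

def score_completion(completion: str, reference: str) -> bool:
    # Naive scoring: does the model's first completed line match the reference?
    first_line = completion.strip().splitlines()[:1]
    return bool(first_line) and first_line[0].strip() == reference.strip()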
Evaluation of Instruction-tuned models
LMSYS Chatbot Arena Leaderboard
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
Evaluation of Instruction-tuned models
• Proxies for human evaluation:
  • MT-Bench:
    • Ask GPT-4 to score responses
    • 0.90 correlation with human preferences
  • AlpacaEval:
    • Compare win-rate against GPT-4 (v2)
    • 0.84 correlation with human preferences
Practical tips
• Proprietary vs Open-Source
  • For proprietary models:
    • Prompt Engineering: Few-shot prompting, Chain-of-thought (see the sketch below)
    • Retrieval Augmented Generation (RAG)
  • For open-source models:
    • Everything above
    • Task-specific fine-tuning and DPO: need data and a bit of compute
• Balance performance vs cost (training and inference)
  • Proprietary models have higher general-purpose performance
  • Open-source models can beat proprietary models on specific tasks with fine-tuning
  • Proprietary models typically have higher inference cost

[Figure: Open-source vs Proprietary models, price per M tokens: 0.42€ / 1.8€ / 7.5€]
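A toy illustration of the prompt-engineering techniques named above: few-shot examples combined with a chain-of-thought cue. The example question and the "Let's think step by step" phrasing are generic illustrations, not a Mistral-specific API or dataset:

FEW_SHOT_EXAMPLES = [
    (
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
        "than the ball. How much does the ball cost?",
        "Let's think step by step. If the ball costs x, the bat costs "
        "x + 1.00, so 2x + 1.00 = 1.10, which gives x = 0.05. "
        "The ball costs $0.05.",
    ),
]

def build_cot_prompt(question: str) -> str:
    # Worked examples show the model the expected format and reasoning style.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    # End with the same cue so the model continues with step-by-step reasoning.
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."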
Retrieval Augmented Generation (RAG)
When do we need Retrieval Augmented Generation (RAG)?
• LLMs don't know everything; some tasks require task-specific knowledge
• Sometimes you want LLMs to answer queries based on a given data source, to reduce hallucinations
• The knowledge resource doesn't fit in the context window of the LLM
[Figure from https://lemaoliu.github.io/retrieval-generation-tutorial/]
Recipe for RAG
[Figure from https://gradientflow.substack.com/p/best-practices-in-retrieval-augmented]
Basic RAG code
https://docs.mistral.ai/guides/basic-RAG/
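The guide linked above walks through a basic RAG pipeline: chunk the knowledge source, embed the chunks, retrieve the chunks most similar to the query, and prepend them to the prompt. Below is a minimal, library-agnostic sketch of that recipe; `embed_texts` and `llm_generate` are hypothetical placeholders for your embedding and chat endpoints, not real Mistral client calls:

import numpy as np

def embed_texts(texts: list[str]) -> np.ndarray:
    """Hypothetical: returns one embedding vector per input text."""
    raise NotImplementedError("plug in your embedding API here")

def llm_generate(prompt: str) -> str:
    """Hypothetical: returns the LLM's completion for the prompt."""
    raise NotImplementedError("plug in your LLM API here")

def split_into_chunks(document: str, chunk_size: int = 512) -> list[str]:
    # Naive fixed-size character chunking; real pipelines often split on
    # paragraph or sentence boundaries instead.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def rag_answer(document: str, question: str, top_k: int = 2) -> str:
    # 1. Chunk the knowledge source and embed every chunk.
    chunks = split_into_chunks(document)
    chunk_embs = embed_texts(chunks)            # shape: (num_chunks, dim)
    # 2. Embed the question and score chunks by cosine similarity.
    q_emb = embed_texts([question])[0]
    sims = chunk_embs @ q_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(q_emb) + 1e-9
    )
    top_chunks = [chunks[i] for i in np.argsort(-sims)[:top_k]]
    # 3. Prepend the retrieved context to the prompt and generate.
    context = "\n---\n".join(top_chunks)
    prompt = (
        f"Context information is below.\n{context}\n"
        f"Given the context information, answer the query.\n"
        f"Query: {question}\nAnswer:"
    )
    return llm_generate(prompt)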