Open Challenges for LLM Applications
Chip Huyen (@chipro | bit.ly/chip-mlops-discord)
Jun ‘23
Agenda
1. Inconsistency
2. Hallucination
3. Compliance + privacy
4. Context length
5. Model drift
6. Forward & backward compatibility
7. LLM on the edge
8. LLM for non-English languages
9. Efficiency of chat as a universal interface
10. Data bottleneck
Challenge 1: Consistency
1. How to ensure user experience consistency?
2. How to ensure downstream apps can run without breaking?
Same input, different outputs
Small input changes can cause big output changes
● temperature=0 won’t fix it
● won’t pass the perturbation test
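Both symptoms can be measured before shipping. The sketch below is a minimal illustration, not a standard benchmark: `generate` is a hypothetical callable that wraps whatever model or API you use, and exact string equality is an assumed (and strict) definition of "same output".

```python
from collections import Counter
from typing import Callable, List


def consistency_rate(generate: Callable[[str], str], prompt: str, n: int = 5) -> float:
    """Fraction of n runs on the same prompt that return the most common output."""
    outputs = [generate(prompt) for _ in range(n)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / n


def perturbation_failures(
    generate: Callable[[str], str], prompt: str, variants: List[str]
) -> List[str]:
    """Return the perturbed prompts whose output differs from the original prompt's output."""
    baseline = generate(prompt)
    return [variant for variant in variants if generate(variant) != baseline]
```

A rate below 1.0 means the same input produced different outputs; a non-empty failure list means small input changes altered the output. For free-form text, a looser comparison than string equality is probably more useful.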
No output schema guarantee for downstream apps
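One common workaround is to validate every response before a downstream app sees it, and to retry when validation fails. This is a sketch under assumptions: `generate` is a hypothetical callable wrapping your model, the expected output is a flat JSON object, and the schema is a plain dict of key to Python type.

```python
import json
from typing import Callable, Dict, Optional


def parse_with_schema(raw: str, required: Dict[str, type]) -> Optional[dict]:
    """Return the parsed object if raw is a JSON object with the required keys and types, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data


def generate_validated(
    generate: Callable[[str], str],
    prompt: str,
    required: Dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the model until its output passes validation or attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_with_schema(generate(prompt), required)
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Retrying costs extra latency and money, and it does not guarantee a valid answer; it only keeps malformed output from reaching the downstream app.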
Challenge 2: Hallucination
“Half knowledge is worse than ignorance.”
- Thomas Babington Macaulay
Poor performance on tasks that require factuality
BIRD-SQL Leaderboard
Why do LLMs hallucinate?
● [DeepMind] Models "lack the understanding of the cause and effect of their
actions"
● [OpenAI] Mismatch between LLM’s internal knowledge and labeler’s
internal knowledge caused by behavior cloning
RLHF training pipeline (scale as of May ‘23)
1. Language modeling
● Data: low quality text, e.g. Internet data
● Scale: >1 trillion tokens
● Optimized for text completion
● Output: pretrained LLM
● Examples: GPT-x, Gopher, Falcon, LLaMa, Pythia, Bloom, StableLM
2. Supervised finetuning
● Data: high quality demonstration data
● Scale: 10K - 100K (prompt, response)
● Finetuned for dialogue
● Output: SFT model
● Examples: Dolly-v2, Falcon-Instruct
3. RLHF (human feedback)
a. Classification: trained to give a scalar score for (prompt, response)
○ Data: comparison data
○ Scale: 100K - 1M comparisons (prompt, winning_response, losing_response)
○ Output: reward model
b. Reinforcement learning: optimized to generate responses that maximize scores by reward model
○ Data: prompts
○ Scale: 10K - 100K prompts
○ Output: final model
● Examples: InstructGPT, ChatGPT, Claude, StableVicuna
Open-sourced models are bolded in the original figure.
See: RLHF: Reinforcement Learning from Human Feedback
Is hallucination a feature or a bug?
A feature for tasks that rely on creativity
A bug for tasks that rely on factuality
Challenge 3: Privacy
1. [Build] If you build a chatbot to let your users talk to your data, how do you ensure that the chatbot doesn’t accidentally reveal sensitive information?
2. [Buy] If you send your user data to APIs, are these APIs compliant?
Multi-step Jailbreaking Privacy Attacks on ChatGPT (Li et al., 2023)
Challenge 4: Context length
● A significant proportion of information-seeking questions have context-dependent answers (e.g., roughly 16.5% of NQ-Open) (SituatedQA, 2021)
● Use cases:
○ Document processing
○ Summarization
○ Narrative
○ Any task involving genes and proteins
○ etc.
COLT5 (2023)
Challenge 5: Data drift
● “Existing models, which are trained
on data collected in the past, fail to
generalize to answering questions
asked in the present, even when
provided with an updated evidence
corpus (a roughly 15 point drop in
accuracy).”
(SituatedQA, 2021)
Generative AI taught everyone about data drift
Challenge 6: Forward & backward compatibility
● Same model, new data
● New model
How to make sure your prompts still work with newer models?
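One way to approach this question is to treat prompts like code and keep a regression suite that is rerun against every new model version. The sketch below is illustrative only: `PromptCase`, `failing_prompts`, and the `generate` callable are hypothetical names, and each check is whatever property your application needs the output to satisfy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptCase:
    prompt: str                   # the prompt your application sends
    check: Callable[[str], bool]  # property the output must satisfy


def failing_prompts(generate: Callable[[str], str], cases: List[PromptCase]) -> List[str]:
    """Run every case against one model and return the prompts whose check fails."""
    return [case.prompt for case in cases if not case.check(generate(case.prompt))]
```

Running the same cases against the current model and a candidate model shows which prompts would break before you switch.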
Challenge 7: LLM on the edge
● Healthcare devices
● Autonomous vehicles
● Drive-thru voice bots
● Your personal ChatGPT, trained on your own data, running on your MacBook
1. On-device inference
2. Training
a. On-device training: bottlenecked by compute + memory + tech available
b. If trained on a server:
i. How to incorporate device’s data?
ii. How to send model’s updates to device?
Choose a model size
7B param model can run on a MacBook
● bfloat16 = 14GB memory
● int8 = 7GB memory
7B param model costs approx*:
● $100 to finetune
● $25,000 to train from scratch
* Highly dependent on how much data
[Figure: cost and performance against model size, ranging from models finetuned for specific tasks to general models, with a 5 - 13B param model marked]
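The memory numbers for a 7B param model follow from bytes per parameter: bfloat16 stores each weight in 2 bytes and int8 in 1 byte. The sketch below reproduces that arithmetic. It counts the weights only; activations and any inference-time caches are not included, so treat the result as a lower bound. The function name is illustrative.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to hold the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bfloat16", "int8"):
    print(f"7B params in {dtype}: {weight_memory_gb(7e9, dtype):.0f} GB")
# prints:
# 7B params in bfloat16: 14 GB
# 7B params in int8: 7 GB
```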
Challenge 8: LLMs for non-English languages
● Performance (Lai et al., 2023)
● Tokenization (Yennie Jun, 2023)
○ Latency
○ Cost
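Tokenization drives latency and cost because the same sentence can split into more tokens in one language than in another, and APIs typically bill per token. The sketch below shows that arithmetic under assumptions: token counts come from whatever tokenizer your model uses, and the price is a placeholder, not a real rate.

```python
def token_overhead(tokens_english: int, tokens_other: int) -> float:
    """How many times more tokens the same text needs in another language."""
    return tokens_other / tokens_english


def request_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request when billing is linear in the number of tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# Hypothetical counts for one sentence: 10 tokens in English, 30 in another language.
overhead = token_overhead(10, 30)
print(f"{overhead:.1f}x the tokens, so roughly {overhead:.1f}x the cost per request")
```

Generation latency also grows with the number of output tokens, so the same overhead applies, roughly, to response time.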
Challenge 9: Efficiency of chat as a universal interface
Poll: Which do you prefer?
1. Search interface
2. Chat interface
Chat is NOT efficient, but is very robust
How much you like an interface depends on how much you’ve been exposed to
that interface
● Ongoing discussion for the last decade, since the rise of superapps in Asia
Dan Grover (2015)
Challenge 10: Data bottleneck
● Training dataset sizes are growing much faster than new data is being generated (Villalobos et al., 2022)
● The Internet is being rapidly populated with AI-generated text
Data is essential to leverage AI
1. Consolidate existing data across departments and sources
2. Update your data terms of use (see StackOverflow and Reddit)
3. Put guardrails around data quality + governance
Reach out if you want Claypot to help you with your data story!
10 open challenges
1. Inconsistency
2. Hallucination
3. Compliance + privacy
4. Context length
5. Model drift
6. Forward & backward compatibility
7. LLM on the edge
8. LLM for non-English languages
9. Efficiency of chat as a universal interface
10. Data bottleneck
Thank you!
@chipro
linkedin.com/in/chiphuyen
bit.ly/chip-mlops-discord