
Module 2

Training Large Language Models

For years, data has served as the foundation of the digital economy. With
our lives becoming increasingly online, businesses and platforms have
collected massive amounts of personal and behavioral information—
turning data into an asset worth trillions globally. In this context, large
language models (LLMs) have emerged as products of the internet era,
designed to replicate human communication by processing vast quantities
of online text data.

The outcomes of this process have been both expected and surprising. On
one hand, LLMs can produce coherent, context-aware responses; on the
other hand, they sometimes mirror the harmful or inappropriate content
present in their training data. Given the enormous scale of data collection,
it's inevitable that some of the information includes personal details,
irrelevant spam, or offensive material. This raises important questions
about data transparency, ethical usage, and bias mitigation.

Yet, despite these challenges, the scale at which modern LLMs operate
has led to advanced capabilities not observed in smaller models. Their
performance, versatility, and potential across a wide range of applications
have driven continued investment in scaling up both data and model
size. As a result, the development race shows no sign of slowing down.

This chapter will explore how these models are trained and why they
exhibit unique behaviors and vulnerabilities not found in earlier
generations of NLP systems.

How Are LLMs Trained?


As briefly introduced in Chapter 1, the training of large language models
involves transformer-based neural networks and a process called
self-supervised learning, where the model learns to predict the most
likely next token (word or character) in a sequence. Now, we’ll examine
the training process more thoroughly and introduce a fascinating
phenomenon: emergent properties—abilities that LLMs display even
though they weren't directly trained to perform those specific tasks.

The training pipeline typically begins with pretraining, where the model
is exposed to massive datasets and learns to complete a token
prediction task. For generative models, this often takes the form of
causal or autoregressive prediction, where the model predicts the
next token based on all previous ones. This stage is called pretraining
because it forms the core knowledge base of the model, which will be
further refined later.

Once the model has completed pretraining, it can be fine-tuned on narrower tasks using labeled data and clearly defined goals. For instance:

• A chatbot like ChatGPT may be fine-tuned using conversational datasets.
• Instruction-following models are fine-tuned with prompts like “Explain this concept” or “Write a story.”
• Some models are optimized for code generation, trained on programming-specific datasets.

This multi-phase training process—pretraining followed by task-specific fine-tuning—is illustrated in Figure 2.1.

Figure 2.1: The multi-stage training process of large language models, showing pretraining and fine-tuning steps

Exploring Open Web Data Collection

To effectively simulate and generate human-like language, large language models (LLMs) require enormous volumes of text. For example, take the
seemingly simple task of answering a question. The model must first
accurately interpret both the question and its surrounding context. This
means it must understand the meaning of every token—each individual
unit of text—and grasp how they work together. It must also analyze the
sentence structure to determine what is being asked and formulate a
relevant answer, either by referencing external context (open-book style)
or relying on its internal knowledge (closed-book style).

Thanks to their exposure to vast online datasets, most LLMs can correctly
answer common questions like “Who was the first president of the United
States?” without being given any additional context. However, if a
question is vague—such as “Who was the first president?”—the model
might assume the question is about the U.S., demonstrating a bias that
reflects the dominance of U.S.-centric, English-language data on the
internet. While regional context (such as the user's IP address) could shift
this assumption, it's also true that English-language content—especially
from North America and Western Europe—makes up the majority of
what's available on the web.
As mentioned in Chapter 1, Wikipedia is a key source of training data for
LLMs. Yet its representation is not globally balanced. For instance, while English Wikipedia contains over 6.6 million articles, French Wikipedia—one of the largest non-English editions—has around 2.5 million. This imbalance results in LLMs
being more proficient in English and more knowledgeable about Western
topics.

To explore additional datasets, we can look to platforms like Hugging Face, an open-source hub for AI resources. Their dataset repository
includes Reddit posts, news articles, movie reviews (from platforms like
Amazon and Rotten Tomatoes), and Q&A archives from Stack Exchange.
Another major source is Common Crawl, a nonprofit that provides large-
scale, public web archive data. In essence, any site where people write
text publicly can become part of an LLM’s training data.
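As a rough illustration of how accessible this data is, the following sketch uses the Hugging Face datasets library to pull a small public review corpus and to stream a web-scale crawl; the specific dataset names are just examples of publicly hosted corpora.

# A minimal sketch of pulling public text data with the Hugging Face
# `datasets` library (pip install datasets). The dataset names are
# examples of publicly hosted corpora; any similar public dataset works.
from datasets import load_dataset

# Movie reviews, one of the review sources mentioned above
reviews = load_dataset("rotten_tomatoes", split="train")
print(reviews[0]["text"])    # a single review
print(len(reviews))          # number of training examples

# Web-scale corpora are usually streamed rather than downloaded whole
web_text = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in web_text.take(3):   # inspect the first few documents
    print(example["text"][:200])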

Developers often train models using a mix of:

• Publicly available datasets (e.g., Hugging Face)
• Purchased commercial datasets
• Custom-collected data via web scraping
• Manually crafted training examples

While initial training can be automated, crowdsourced feedback and conversational datasets are essential in fine-tuning LLMs for specific applications, such as building dialogue agents or improving contextual understanding.

Demystifying Autoregression and Bidirectional Token Prediction

Earlier language models like Google’s BERT focused primarily on understanding text rather than generating it. These models are
considered bidirectional, meaning they analyze text by looking both to
the left and right of a missing word to infer what it might be. This
approach works well for comprehension-based tasks because it provides
full context for any word.

In contrast, when building models for text generation, such as those in the GPT family or Google’s PaLM, the training process only considers the
words that come before the target word. This is known as
autoregressive modeling, where the prediction of the next token
depends entirely on the sequence that precedes it.

🔍 Autoregressive: A modeling approach where the output (e.g., next word) is predicted using only prior tokens in the sequence.

Consider the sentence:


“For their honeymoon, they flew to _____ and had a romantic
dinner in front of the Eiffel Tower.”
A bidirectional model can easily predict “Paris” by using the complete
sentence context—both before and after the blank.

However, in a sentence like:


“A good location for a romantic honeymoon is _____,”
a generative model only uses the left context to make predictions. That
means it cannot rely on any following words.
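The difference is easy to see in code. The sketch below uses two off-the-shelf Hugging Face pipelines purely for illustration: a BERT-style model fills in a blank using context on both sides, while a GPT-style model continues a prompt using only the left context.

# A small sketch contrasting bidirectional and autoregressive prediction
# with the Hugging Face `transformers` pipelines. Model names are just
# commonly available checkpoints used for illustration.
from transformers import pipeline

# Bidirectional (BERT-style): the model sees text on both sides of the blank
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
masked = ("For their honeymoon, they flew to [MASK] and had a romantic "
          "dinner in front of the Eiffel Tower.")
print(fill_mask(masked)[0]["token_str"])   # most likely completion, e.g. "paris"

# Autoregressive (GPT-style): only the left context is available
generator = pipeline("text-generation", model="gpt2")
prompt = "A good location for a romantic honeymoon is"
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])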

During training, models are exposed to billions of examples where they repeatedly guess the next token. When they guess wrong, the system adjusts the internal weights of the model until it becomes better at predicting words. This process is known as self-supervised learning.
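A minimal sketch of this objective, using the small public GPT-2 checkpoint for illustration, looks like the following: the input tokens are passed in as their own labels, and the returned cross-entropy loss measures how badly the model predicted each next token.

# A minimal sketch of the self-supervised objective: the model predicts
# each next token, and the loss measures how wrong it was.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Hey! What's up?"
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the library shift them internally,
# so each position is scored on predicting the *next* token.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss.item())   # cross-entropy; training repeatedly minimizes this

# During training, this loss is backpropagated to adjust the weights:
# outputs.loss.backward(); optimizer.step()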

When we interact with models like ChatGPT, it might not seem like a
structured task—but it is. The model is simply predicting, one token at a
time, what response should follow the previous message. If you type,
“Hey! What’s up?”, the model will evaluate the most probable and
contextually appropriate replies based on its training.

Fine-Tuning Large Language Models


After an LLM has been pre-trained to predict the next token in a sequence,
it gains the ability to generate meaningful words, phrases, or full
sentences. At this point, these models are often referred to as
foundation models or base models because they form the essential
groundwork for a broad range of natural language processing (NLP)
applications. Their strength lies in the rich internal representations
they’ve developed for thousands of words, phrases, and contextual
patterns.

Although base models may not seem highly capable on their own, they
can be easily tailored for specific tasks through a process known as fine-
tuning. Fine-tuning involves training the model on curated, labeled
datasets that reflect the task it’s expected to perform. These tasks may
be narrowly focused—such as identifying legal terms—or more
generalized, such as following user instructions. For instance, many
commercial LLMs are fine-tuned to understand prompts like “Write a
song” or “Summarize this article.”

From a technical standpoint, fine-tuning is a supervised learning process, but it doesn’t start from scratch. Instead, it builds on top of the
pretrained foundation model by using its pre-learned weights and
representations. Unlike the original training, which can take weeks on
powerful computing infrastructure, fine-tuning is far faster—sometimes
completed in just a few minutes. The resulting model retains the
knowledge of the base model while adjusting its weights to better suit the
target task.
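A condensed sketch of this workflow, using the Hugging Face Trainer API and an illustrative sentiment dataset (the exact model and hyperparameters are placeholders, not a definitive recipe), might look like this:

# A condensed sketch of fine-tuning a pretrained base model on a labeled
# dataset with the Hugging Face Trainer API. Dataset and model names are
# illustrative; a real project would tune hyperparameters and evaluation.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("rotten_tomatoes")            # labeled movie reviews
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Start from pretrained weights; only a small classification head is new.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()   # minutes to hours on one GPU, versus weeks for pretraining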

The Surprising Part: Emergent Properties in LLMs

LLMs represent a natural evolution of previous neural network approaches. It was already well-established that as models grow in size,
their performance typically improves—this principle is often reflected in
what are known as scaling laws, which estimate how accuracy changes
as model size increases. However, researchers observed something
unexpected when LLMs reached massive sizes: they began to
demonstrate new capabilities that weren't evident in smaller
models, even if those models had the same training objectives.
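As a rough illustration, the power-law form reported by Kaplan et al. (2020) predicts pretraining loss from parameter count alone; the constants below are that study's published fit and are shown only to convey the smooth, predictable trend that emergent abilities break away from.

# Illustrative sketch of a scaling law: Kaplan et al. (2020) fit
# pretraining loss to a power law in parameter count N,
# roughly L(N) = (N_c / N) ** alpha. The constants are their published fit.
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):   # 100M to 100B parameters
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")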

These novel behaviors are called emergent properties, a term used in a 2022 study to describe a system where small, quantitative changes result
in entirely new and qualitative changes in behavior. For example, while
it might be expected that a model with 100 billion parameters performs
slightly better than one with 100 million, in reality, the larger model may
perform entirely new tasks the smaller one simply can't handle. These
capabilities are difficult to predict and often arise only at scale.

🔍 Emergent Properties: Capabilities that only appear in very large LLMs, often in ways that smaller models can't replicate.

Quick Learning: Few-Shot and Zero-Shot Abilities

To understand emergent behavior more clearly, it helps to compare it with what’s learned through standard fine-tuning. In the traditional pipeline,
a model is explicitly trained to perform tasks such as translation,
summarization, or analogy completion using labeled data. These
improvements are expected outcomes of the training process.

In contrast, emergent behaviors often show up in zero-shot and few-shot learning scenarios. These terms refer to how many examples the model receives before it's expected to perform a task:

• Zero-shot learning: The model performs a task without seeing any examples of how to do it.
• Few-shot learning: The model is shown just a handful of examples before being asked to generalize.

Let’s consider a real-world example. A restaurant owner might want to
mark all vegetarian dishes on their menu. Using a chatbot like ChatGPT,
they could ask:

"Please rewrite this menu and add an asterisk next to all items that do not
contain meat."

This request involves multiple subtasks: interpreting the instruction, identifying which dishes are vegetarian based on their descriptions, and
generating a revised version of the menu. Although the model may never
have been explicitly trained to perform this specific function, it can still
successfully complete the task without prior examples. This is a
prime example of a zero-shot capability, which sets modern LLMs apart
from their predecessors.

Zero-Shot and Few-Shot Learning in Practice

The terms zero-shot and few-shot describe how many examples a model is given before it's expected to complete a task.

In a few-shot setting, the model is provided with a few instances of how the task should be performed. These examples are embedded directly into
the prompt—the input text given to the model to guide its output.

In contrast, in a zero-shot setup, the model receives no such examples. Instead, the prompt simply consists of the task description or instruction.
(While some models include a base prompt behind the scenes to ensure
they respond in helpful ways, that detail isn’t central to this explanation.)

Let’s look at another use case. Suppose a freelance writer is categorizing articles for three topics: dog breeding, exoplanets, and Pittsburgh.
They might ask the model:

“Each of the following articles is related to one of these three topics. For
each article, identify the most likely associated topic.”
Even though this task could be done without examples (zero-shot), giving
the model a few labeled examples usually improves accuracy. For
instance, the prompt might include:

• Example: “The latest discovery of space telescopes” → Exoplanets
• Example: “Why pugs have breathing problems” → Dog breeding

This small addition transforms the prompt into a few-shot format, offering guidance on how the model should interpret and classify inputs.
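In code, the difference is just a matter of what text goes into the prompt. The sketch below builds both versions for the topic-classification example; ask_llm stands in for whatever chat or completion interface is used and is purely hypothetical.

# A sketch of the same classification task as a zero-shot prompt versus a
# few-shot prompt. `ask_llm` is a hypothetical placeholder for a model call.
TOPICS = ["dog breeding", "exoplanets", "Pittsburgh"]

def zero_shot_prompt(article_title):
    return (f"Each of the following articles is related to one of these "
            f"topics: {', '.join(TOPICS)}. Identify the most likely topic.\n"
            f"Article: {article_title}\nTopic:")

def few_shot_prompt(article_title):
    examples = (
        'Article: "The latest discovery of space telescopes" -> Exoplanets\n'
        'Article: "Why pugs have breathing problems" -> Dog breeding\n'
    )
    return (f"Each of the following articles is related to one of these "
            f"topics: {', '.join(TOPICS)}. Identify the most likely topic.\n"
            f"{examples}"
            f"Article: {article_title}\nTopic:")

print(few_shot_prompt("New bridge opens over the Allegheny River"))
# response = ask_llm(few_shot_prompt(...))   # hypothetical model call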

Figure 2.2 illustrates the distinction between fine-tuning, zero-shot, and few-shot learning using a machine translation task as the example.

In fact, if you’ve ever used a chatbot like ChatGPT to perform a task—like summarizing a paragraph or suggesting names for a blog—you may have
used zero-shot or few-shot learning without even realizing it. One of
the major advantages of LLMs is that they accept input in natural
language, making it easy and intuitive for users to modify or clarify
prompts on the fly. This flexibility enables powerful and adaptive
interactions that don’t require specialized programming skills.

Emergence Through Prompt Engineering

Beyond zero-shot and few-shot examples, subtle adjustments in a model’s prompt can reveal surprising new capabilities. A notable example is
chain-of-thought prompting, where guiding the model to think in steps
—for example, adding “Let’s think step-by-step”—can improve accuracy in
reasoning tasks. Similarly, researchers have experimented with enhancing
performance through detailed instructions and by asking the model to
evaluate its own confidence level. These strategies can yield better
results depending on the scenario.
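A sketch of what this looks like in practice: only the prompt text changes, with ask_llm again standing in as a hypothetical model call.

# Chain-of-thought prompting in sketch form: the model call stays the same,
# only the prompt is adjusted. `ask_llm` is a hypothetical placeholder.
question = ("A cafe sold 23 coffees in the morning and twice as many in "
            "the afternoon. How many in total?")

direct_prompt = question
cot_prompt = question + "\nLet's think step-by-step."

# ask_llm(direct_prompt)  ->  often just a number, sometimes wrong
# ask_llm(cot_prompt)     ->  intermediate reasoning, then the answer (69)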

A widely cited study on emergent abilities in LLMs examined model performance on various few-shot learning tasks. They found that small
models often performed no better than chance, while significantly larger
models demonstrated abrupt improvements. For instance, basic
arithmetic tasks like addition and multiplication were essentially
unsolvable by earlier GPT-3 models—until the 13-billion-parameter
version. Likewise, once the model size crossed 70 billion parameters, its
performance in academic subjects like law, history, and mathematics
improved dramatically. Since these emergent behaviors don’t follow
predictable scaling trends, it’s difficult to know how much further
improvement is possible, at what size abilities plateau, or whether
performance can be consistently generalized.

Are These Sparks of AGI?

According to an internal analysis by Microsoft, GPT-4 demonstrated the ability to solve complex, novel tasks across disciplines like medicine,
programming, law, and psychology, without requiring custom
instructions. In a paper provocatively titled “Sparks of Artificial General
Intelligence (AGI),” the authors suggest that GPT-4’s broad capabilities
may signal the early formation of AGI—a system with intelligence
comparable to human beings in learning and general problem-solving.

AGI has long been the aspirational goal of AI researchers. It's defined by
the ability to transfer learning across domains, adapt to new problems,
and make intelligent decisions beyond narrow tasks—capabilities
historically attributed only to humans.

Is Emergence Overstated?

Despite these promising findings, some experts remain skeptical.
Researchers from Stanford University have argued that what appear to be
emergent properties may, in fact, be artifacts of the evaluation process
itself. They observed that the sharp increases in performance often
attributed to emergence could stem from variables such as:

• The metric used to measure performance.
• The amount of test data (too little can produce noisy results).
• The small sample size of large models compared to smaller ones.

These researchers do not deny the capabilities of LLMs. Rather, they question whether these capabilities represent a fundamental shift in
intelligence. Their conclusion: improvements may be more about
measurement methods than breakthrough abilities.

What’s in the Training Data?

LLMs are trained on enormous datasets harvested from the web—so massive, in fact, that no one knows exactly what they contain. For
example, OpenAI’s GPT-3 was trained on an estimated 45 terabytes of
text, equivalent to around 3.4 billion web pages.

With such vast and uncurated sources, these models can absorb not just
language patterns but also harmful, biased, or even sensitive information.
As a result, LLMs may unintentionally learn and reproduce offensive
stereotypes or reveal private data.

How Bias is Embedded in LLMs

LLMs have been shown to replicate and even amplify existing social biases
along dimensions such as race, gender, age, religion, and sexual
orientation. These harmful patterns arise from three main factors:

1. Reflection of societal norms found in training text.
2. Underrepresentation of marginalized communities in the data.
3. Evolving cultural perspectives, which may not be captured by older or stagnant datasets.

In Chapter 1, we explored how word embeddings reveal bias. For example, research has shown that reviews for movies with African
American names scored more negatively than those with European
American names, despite having similar content. Another common bias
is observed in machine translation, where gender-neutral languages
translated into English often reinforce gender roles—e.g., Turkish
sentences like “O bir doktor. O bir hemşire.” becoming “He is a doctor.
She is a nurse.”

Similarly, LLMs often produce content that links religious identities with
harmful stereotypes. In one study, prompts containing the word
“Muslim” led to responses mentioning “terrorist” 23% of the time, while
“Jewish” was frequently associated with “money”. These stereotypes were
not only present—they were exaggerated by the model.

Even occupational and gender-based stereotypes are magnified. Female characters generated in stories were more often portrayed as emotional or family-focused, while males were described as powerful or professional.

The Problem of Representation

A second major cause of bias is the lack of diverse voices in training data. For instance:

• Reddit and Wikipedia, both popular LLM training sources, are skewed. Reddit users are predominantly male and young (18–29), and only 8–15% of Wikipedia contributors are women.

• Filtering methods—such as removing content from Common Crawl that doesn't resemble Reddit or Wikipedia, or that includes certain flagged terms—can unintentionally exclude valid discussions, especially from LGBTQ or other underrepresented groups.

Further, changing social movements—like Black Lives Matter (BLM)—can rapidly alter how issues are discussed online. Wikipedia articles that
now highlight systemic police violence may not have reflected the same
emphasis prior to the BLM movement, meaning models trained on older
snapshots of data could misrepresent social realities.

Bias is Complex—and Can Be Meaningful

Researchers from the University of Bath and Princeton argue that bias
cannot be cleanly separated from meaning. In fact, they claim:

• All language reflects bias, because human communication is value-laden.

• Defining bias algorithmically is impossible, as social norms evolve over time.

• Historical context matters—removing all bias might also erase important truths.

Attempts to “debias” models—such as adjusting word embeddings to reduce gender associations—have had mixed results. While some techniques can reduce gender bias, they often fail to address racial or religious bias, and may also reduce the model’s ability to produce coherent and informative text. This approach of simply masking sensitive attributes is sometimes called “fairness through blindness.”

A Way Forward: Better Dataset Documentation

Experts like Bender and Gebru argue that the best approach is to
transparently document the training data. Today, most LLMs use
proprietary data sources with little or no user visibility. Without insight
into what the model has seen, bias is harder to detect—and impossible to
hold accountable.

Organizations like Hugging Face are working on this problem. Their dataset cards provide detailed metadata, including:

• Dataset origin and contents
• Known biases
• Intended use cases and limitations
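This documentation can also be read programmatically. The sketch below uses the datasets library to inspect a dataset's card-level metadata before training on it; the dataset name is only an example.

# A small sketch of inspecting dataset documentation with the `datasets`
# library before training on a corpus. The dataset name is illustrative.
from datasets import load_dataset_builder

builder = load_dataset_builder("rotten_tomatoes")
info = builder.info

print(info.description)   # what the dataset contains and where it came from
print(info.citation)      # provenance
print(info.features)      # fields and label names
print(info.license)       # usage terms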

They’ve also created a tool to explore the ROOTS corpus (1.6 TB of multilingual text), used in training the BLOOM language model.

Another effort, the Data Nutrition Project, takes inspiration from food
labeling—highlighting key “ingredients” in datasets, such as demographic
makeup, licensing, and known risks.

Final Note: Humans Still Hold the Key

Unlike machines, humans can reflect on personal experience and override learned biases. While AI systems can be improved through
better data and design, ethical judgment, empathy, and critical
thinking remain firmly human traits—and essential for building
responsible AI systems.

Risks of Exposing Personal Data in LLMs

Due to their reliance on enormous volumes of internet-sourced data, Large Language Models (LLMs) may inadvertently include personally
identifiable information (PII)—such as names, addresses, Social
Security numbers, biometric details, or sexual orientation—even when
trained exclusively on public datasets. One key risk lies in the possibility
that the model may "memorize" certain pieces of data from its training
set and reproduce them in responses, unintentionally disclosing private
information.

This risk is amplified when a model trained on private or proprietary
datasets is made publicly accessible. An especially serious concern is the
potential for training data extraction attacks, in which malicious users
craft specific prompts designed to extract sensitive or identifying data
that the model may have memorized.

Proof of Concept: Real-World Extraction Attacks

In a collaborative study, Google, OpenAI, Apple, Stanford, Northeastern University, and UC Berkeley conducted controlled
attacks on GPT-2 to demonstrate that such vulnerabilities are real. Their
goal was to prove that models could be tricked into revealing verbatim
snippets from their training data—such as personal contact details, chat
transcripts, code fragments, and UUIDs.

The researchers chose GPT-2 because its training data is sourced from
publicly available web content and is well-documented. Still, they were
able to extract hundreds of exact text fragments that appeared rarely
in the training data—some only once—demonstrating that memorization
can occur even from sparsely represented data. Notably, larger models
were found to be more prone to this kind of leakage than smaller ones.
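The core idea of such an attack can be sketched in a few lines: sample many completions from a model and flag any that reappear verbatim in a reference corpus, which suggests memorization. The file name below is a stand-in, and the real study used far larger data and more careful membership scoring.

# A simplified sketch of the extraction-attack idea: sample many generations
# and flag outputs that reappear verbatim in a reference corpus.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
training_corpus = open("corpus_sample.txt").read()   # stand-in reference text

prompt = "My phone number is"
candidates = generator(prompt, max_new_tokens=20,
                       num_return_sequences=20, do_sample=True)

for c in candidates:
    generated = c["generated_text"][len(prompt):].strip()
    if generated and generated in training_corpus:
        print("Possible memorized snippet:", generated)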

In another study titled "The Secret Sharer," researchers demonstrated how easy it was to extract sensitive information—including credit card
and Social Security numbers—from a language model trained on the
Enron Email Dataset, a publicly available archive of internal corporate
communications released during a government investigation.

Addressing the Privacy Challenge

A simple and direct mitigation strategy would be to exclude PII and sensitive content from training data. However, implementing this at
scale is extremely difficult due to the massive and unstructured nature
of web data. This brings us back to the urgent need for carefully curated
and transparently documented datasets in LLM development.

Another class of solutions comes from Privacy-Preserving or Privacy-Enhancing Technologies (PETs). These are strategies and tools designed to protect sensitive information during the model training process. PETs include:

• Pseudonymization
• Data masking
• Obfuscation
• Sanitization
• Blocklisting sensitive terms

For instance, blocklists can be used to scan and exclude high-risk sequences from training corpora. But, as shown in “The Secret Sharer,”
this method is not foolproof—any sensitive information not covered by
the blocklist could still be memorized and leaked.
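A rough sketch of blocklist-style filtering illustrates both its simplicity and its limits: regular expressions catch obviously risky patterns, but anything they miss still reaches the training corpus.

# Blocklist-style filtering in sketch form: scan training text for
# high-risk patterns (here, US Social Security and credit-card-like numbers)
# and drop matching records. Anything the patterns miss can still be
# memorized, so this is a partial defense at best.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # credit-card-like digit runs
]

def is_high_risk(text):
    return any(p.search(text) for p in PII_PATTERNS)

documents = ["Call me at the office tomorrow.",
             "My SSN is 123-45-6789, please keep it private."]

filtered = [d for d in documents if not is_high_risk(d)]
print(filtered)   # the second document is excluded from the training corpus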

Differential Privacy: A Popular but Imperfect PET

One of the best-known PETs is differential privacy, which adds statistical "noise" to the data during training. This technique is designed
to make it impossible to determine whether any individual sample was
part of the training set. However, differential privacy also has drawbacks
—particularly when dealing with rare information. If a sensitive detail only
appears once or twice in the dataset, this technique may fail to prevent its
memorization.
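The core mechanism, often called DP-SGD, can be sketched as follows: each example's gradient is clipped to bound its influence, and Gaussian noise is added before the weight update. This is a deliberately simplified illustration; production systems use libraries such as Opacus or TensorFlow Privacy, which also track the privacy budget.

# A deliberately simplified sketch of the DP-SGD idea: clip each example's
# gradient to bound its influence, then add Gaussian noise before updating.
# Obtaining per-example gradients (one list of tensors per example, in the
# same order as model.parameters()) is assumed to happen elsewhere.
import torch

def dp_sgd_step(model, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    clipped = []
    for grads in per_example_grads:
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (total_norm + 1e-6))
        clipped.append([g * scale for g in grads])     # bound each example's influence

    with torch.no_grad():
        for i, param in enumerate(model.parameters()):
            summed = sum(ex[i] for ex in clipped)
            noise = torch.normal(0.0, noise_mult * clip_norm, size=param.shape)
            param -= lr * (summed + noise) / len(clipped)   # noisy averaged update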

Additionally, PETs in general are often:

• Technically complex
• Costly to implement
• Difficult for regulators to audit or enforce


As noted in “Beyond Data: Reclaiming Human Rights at the Dawn
of the Metaverse,” these limitations make PETs a challenging but
crucial tool in the fight to safeguard privacy in AI systems.

Call to Action

While existing privacy technologies have limitations, raising awareness about unintended memorization is essential. Researchers, developers,
and policy-makers must collaborate to build more robust protections,
devise improved privacy methods, and develop standardized
evaluations to test how well models retain sensitive data—so that future
LLMs can be safer, more responsible, and less prone to data leakage.
