
Introduction to

LLMs in Python
by Inder P Singh

YouTube Channel: Software and Testing Training (341 Tutorials, 82,000 Subscribers)
Blog: Fourth Industrial Revolution
Copyright © 2025 All Rights Reserved.
1 Introduction to LLMs in Python
2 Setup & Installation
3 Basic Inference with Transformers
4 Calling OpenAI’s ChatGPT API in Python
5 Local Deployment with Hugging Face Models
6 Prompt Engineering in Python
7 Fine‑Tuning & Custom Training
8 Advanced Techniques: Streaming, Batching & Callbacks
9 Efficiency & Optimization
10 Integration & Deployment Workflows
11 Best Practices & Troubleshooting
12 Introduction to LLMs in Python Quiz

1 Introduction to LLMs in Python
Q: What does “Introduction to LLMs in Python” mean?
A: Introduction to LLMs in Python means the foundational knowledge and practical steps
necessary to utilize Large Language Models (LLMs) within the Python ecosystem. It covers
concepts such as loading pre‑trained models, tokenization, inference, and integration with
APIs or libraries. This course equips developers and technical users with the skills to
harness LLM capabilities (like text generation, summarization, and translation) directly
through Python code.
Note: If you are new to Python, you can view an introduction to Python in the tutorial here.
The full set of Python for beginners tutorials is available here.

Q: Why is Python the lingua franca (common language) for LLM development?
A: Python has extensive machine‑learning libraries (such as Transformers, PyTorch, and
TensorFlow), a vibrant ecosystem of wrapper utilities (like Hugging Face pipelines), and
API clients (for OpenAI’s GPT services). Its readable syntax and broad community support
enable fast experimentation and production deployment, making it the language of choice for
AI practitioners.

Q: What are typical use cases for LLMs in Python?


A: Developers and technical users can use LLMs in Python for tasks including automated
documentation generation, code completion, chatbots, data extraction from
unstructured text, and sentiment analysis. Python’s data‑processing capabilities allow
these models to integrate with web frameworks, data pipelines, and DevOps tools.
Example: A Python script using an LLM can parse customer reviews, summarize sentiment
trends, and output a JSON report for BI dashboards.
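A minimal sketch of that idea, assuming the Transformers sentiment-analysis pipeline (the sample reviews and report fields are illustrative):

import json
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
reviews = ["Great product, fast delivery!", "The app keeps crashing."]

# Classify each review and build a JSON-serializable report for a BI dashboard
results = classifier(reviews)
report = [{"review": r, "label": s["label"], "score": round(s["score"], 3)}
          for r, s in zip(reviews, results)]
print(json.dumps(report, indent=2))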

If you need video tutorials, please check out the Software and Testing Training playlists
here.

Quiz:

1. Which language is most used for LLM integration due to its rich AI libraries?
A. Java
B. Python (Correct)
C. C++
D. Ruby
2. Loading a pre‑trained model and performing tokenization in Python typically involves
which library?
A. NumPy
B. Transformers (Correct)
C. Requests
D. Matplotlib
3. An LLM use case that transforms raw text into structured key‑value pairs exemplifies:
A. Code completion
B. Data extraction (Correct)
C. Model quantization
D. Image segmentation
4. The phrase “lingua franca” in this context refers to Python’s role as:
A. A spoken language for data scientists
B. A common programming language for LLM tasks (Correct)
C. A legacy scripting language
D. A proprietary AI framework
5. In a chatbot application, Python’s role is to:
A. Train the LLM from scratch
B. Serve as the interface for prompt sending and response handling (Correct)

2 Setup & Installation
Q: How can you confirm that you have the correct Python 3.8+ environment for LLM
development?
A: You can verify your interpreter with the command python --version and, if
necessary, install a compatible distribution (such as Anaconda or the official Python
installer). To isolate dependencies, create a virtual environment using
python -m venv venv, then activate it (source venv/bin/activate on Linux/macOS or
venv\Scripts\activate on Windows). This ensures that the package versions for
LLM libraries remain consistent across projects.

Q: How is the Transformers library installed and initialized in Python?


A: Within the activated environment, run pip install transformers. After installation,
you can load a model and tokenizer with code such as:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")

This setup provides the core classes needed for tokenization and inference.

Q: What are the steps to install and configure PyTorch for GPU acceleration?
A: First, determine your CUDA version (e.g., with nvidia-smi). Then install PyTorch with the
matching CUDA toolkit using the command from the official site, for example:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117

Verify GPU availability in Python with:

import torch

torch.cuda.is_available() # Returns True if GPU is ready

Q: How can you add the OpenAI client library and authenticate API access?
A: Install via pip install openai. Set your API key as an environment variable—export
OPENAI_API_KEY="sk‑..." on Linux/macOS or set OPENAI_API_KEY="sk‑..." on
Windows—so your scripts can call:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

This keeps the secret key out of your source code.


Q: What techniques help manage dependencies and their versions across different
projects?
A: Use a requirements.txt file generated with pip freeze > requirements.txt to
lock exact versions. This prevents version conflicts when multiple LLM projects coexist.

Connect with Inder P Singh (6 years' experience in AI and ML) on LinkedIn. You can
message Inder if you need personalized training or want to collaborate on projects.

Quiz:

1. Which command creates a virtual environment in Python?


A. python -m venv venv (Correct)
B. pip install venv
C. conda activate venv
D. python setup.py venv
2. To install the Transformers library, you use:
A. pip install torch
B. pip install transformers (Correct)
C. pip install openai
D. pip install numpy
3. How do you verify GPU availability for PyTorch?
A. torch.has_gpu()
B. torch.cuda.is_available() (Correct)
C. nvidia.check_gpu()
D. torch.device("cpu")
4. Securely setting your OpenAI API key involves:
A. Hard‑coding it in your script
B. Storing it in config.json
C. Exporting it as an environment variable (Correct)
D. Passing it as a URL parameter
5. A requirements.txt file is generated with:
A. pip list > requirements.txt
B. pip freeze > requirements.txt (Correct)
C. pip install requirements.txt
D. pip dependency-list > requirements.txt

3 Basic Inference with Transformers


Q: How can you initialize a pipeline for text generation using Hugging Face?
A: You import and call the pipeline constructor from the Transformers library,
specifying the task and model name. For example:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

This generator object handles both tokenization and decoding internally, allowing you to
pass prompts and parameters.

Q: How does tokenization work before generating text?


A: The pipeline’s tokenizer splits input strings into discrete tokens (meaning subword
units) from the model’s vocabulary. It then converts those tokens into numeric IDs. For
instance:

tokens = generator.tokenizer("Hello, Inder!", return_tensors="pt")

Here, tokens.input_ids contains a tensor of IDs that the model consumes.

Q: How do you generate a simple completion once the pipeline is set up?
A: Call the generator with your prompt and configuration options like max_length and
num_return_sequences:

result = generator("The future of AI is", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])

This returns the prompt plus its generated continuation, up to 30 tokens in total.

Q: What parameters help control generation behavior in the pipeline?


A: Parameters such as temperature (for randomness), top_k or top_p (for sampling
diversity), and do_sample (to enable sampling) adjust creativity and variation:

generator("Explain quantum computing:", max_length=50,


temperature=0.7, top_p=0.9, do_sample=True)

# © Inder P Singh https://www.linkedin.com/in/inderpsingh

Quiz:

1. Which function creates a text-generation pipeline?
A. AutoModel.from_pretrained
B. pipeline("text-generation", ...) (Correct)
C. generate_text(...)
D. TextGenerator()
2. In tokenization, what does the tokenizer output?
A. Raw strings
B. Numeric token IDs (Correct)
C. Model weights
D. Attention scores
3. The max_length parameter controls:
A. The number of pipelines created
B. The maximum tokens in the generated sequence (Correct)
C. The tokenizer vocabulary size
D. The number of threads used
4. To enable probabilistic sampling diversity, which parameter is used?
A. force_cpu
B. do_sample (Correct)
C. use_cache
D. return_tensors

4 Calling OpenAI’s ChatGPT API in Python
Q: How can you authenticate when using OpenAI’s ChatGPT API in Python?
A: You can install the openai package and set your API key as an environment variable—
export OPENAI_API_KEY="sk-..." on Linux/macOS or set
OPENAI_API_KEY="sk-..." on Windows. In your script, you then invoke:

import os, openai


openai.api_key = os.getenv("OPENAI_API_KEY")

This keeps your secret key out of source code and loaded at runtime.

Q: What structure does the messages parameter use in ChatGPT calls?


A: The messages argument is a list of role‑tagged dictionaries defining the conversation.
Each entry has "role" set to "system", "user", or "assistant", and a "content"
string. For example:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the benefits of LLMs in Python."}
]

Q: How can you make a synchronous ChatGPT call using the SDK?
A: Use openai.ChatCompletion.create, passing the model name and messages. For
example:

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3,
    max_tokens=150
)
print(response.choices[0].message.content)

This blocks execution until the full response is received and available in
response.choices.

Q: How can you handle streaming responses to display partial results in real time?
A: Enable the stream=True flag and iterate over the response generator. Each chunk
contains delta segments that you can print as they arrive. This approach provides a more
responsive experience, rendering output token by token:

stream = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    stream=True
)

for chunk in stream:
    print(chunk.choices[0].delta.get("content", ""), end="", flush=True)

Quiz:

1. Where should you store your OpenAI API key for secure access?
A. In your script as a constant
B. As an environment variable (Correct)
C. In a GitHub gist
D. In plain text logs
2. The messages list item with "role": "system" is used to:
A. Define user queries
B. Set high‑level instructions for the assistant (Correct)
C. Stream responses
D. Format JSON output
3. In a synchronous call, the generated text is retrieved from:
A. response.data
B. response.choices[0].message.content (Correct)
C. response.text
D. response.streaming
4. To receive partial content as it’s generated, you must set:
A. stream=True (Correct)
B. do_sample=True
C. echo=True
D. max_tokens=1

5 Local Deployment with Hugging Face Models
Q: How can you download model weights for local inference using Hugging Face?
A: You can call the from_pretrained method on both the model and tokenizer classes,
specifying the model identifier. For example:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")

This fetches the weights and vocabulary files into your local cache, making them available
for offline use.

Q: How do you prepare the AutoTokenizer and AutoModelForCausalLM for inference?


A: After loading, you set the model to evaluation mode and move it to the appropriate
device. For example:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

model.eval()

Q: What steps enable running inference on CPU versus GPU?


A: The device assignment determines where tensors reside. On GPU:

input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=50)

If device is CPU, the same code runs on the processor, though with higher latency. Always
move both model and tensors to device for consistent execution.
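To see the generated text, you can decode the output IDs (a small addition to the snippet above):

print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # outputs[0] is the first generated sequence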

Follow Inder P Singh (6 years' experience in AI and ML) on LinkedIn to get the new AI and ML
documents for FREE.

Quiz:

1. Which method fetches model weights and tokenizer files?


A. load_pretrained
B. from_pretrained (Correct)
C. download_model
D. init_pretrained
2. To switch the model to evaluation mode before inference, you call:
A. model.train()
B. model.eval() (Correct)
C. model.generate()
D. model.infer()
3. How do you determine whether to use GPU or CPU for inference?
A. By checking torch.device availability (Correct)
B. By reading a config file
C. By calling model.device()
D. By inspecting tokenizer attributes
4. To run inference on GPU, you must:
A. Only move the model to CUDA
B. Move both model and input tensors to CUDA (Correct)
C. Increase max_length
D. Enable do_train mode

6 Prompt Engineering in Python
Q: How can you build a prompt template programmatically in Python?
A: You can define a Python format string with placeholders for dynamic values. This
approach centralizes prompt structure and allows easy substitution for different inputs.
For example:
template = "Translate the following text to French:\n\n\"{text}\""

def render_prompt(text):
    return template.format(text=text)

prompt = render_prompt("Good morning!")

Q: What is prompt chaining and how is it implemented in Python?


A: Prompt chaining sequences multiple calls to the API, using each response to form the
next prompt. In Python:

# client.chat here stands for a thin wrapper around your preferred chat API (illustrative)
first = client.chat(["Summarize this article: ..."])
second = client.chat([f"Based on that summary, list three key takeaways:\n\n{first}"])

By passing first into the second call, you create a pipeline where each step builds on
prior output.

Q: How do you compare zero‑shot and few‑shot strategies in code?


A: For zero‑shot, send only the instruction:

resp0 = client.chat([{"role": "user", "content": "Explain quantum computing in simple terms."}])

For few‑shot, include inline examples:

examples = [
    {"role": "user", "content": "Q: What is 2+2?\nA: 4"},
    {"role": "user", "content": "Q: What is 10-3?\nA: 7"}
]
prompt = examples + [{"role": "user", "content": "Q: What is 5×3?\nA:"}]

resp1 = client.chat(prompt)

Comparing resp0 with resp1 typically reveals the improved accuracy and formatting
consistency that few‑shot prompting provides.

Quiz:

1. Which Python construct allows dynamic insertion of variables into a prompt template?
A. Lambda functions
B. f‑strings or str.format (Correct)
C. List comprehensions
D. Decorators
2. Prompt chaining in Python involves:
A. Encrypting prompts before sending
B. Passing the previous API response as part of the next prompt (Correct)
C. Using only system messages
D. Parallelizing API calls
3. A zero‑shot prompt differs from a few‑shot prompt by:
A. Including multiple examples
B. Relying solely on the instruction without examples (Correct)
C. Using a higher temperature
D. Always returning JSON
4. In a few‑shot strategy, embedding two Q&A pairs before a new question helps to:
A. Decrease model temperature
B. Guide the model’s response format and improve accuracy (Correct)
C. Increase token usage efficiency
D. Force the model into eval mode

7 Fine‑Tuning & Custom Training
Q: How do you prepare a dataset for fine‑tuning an LLM in Python?
A: You collect and clean target‑domain text, then format it into examples. This is typically
done with keys like {"prompt": "...", "completion": "..."}. You convert this list
into a datasets.Dataset object:

from datasets import Dataset

data = [{"prompt": "Hello, how are you?", "completion": "I am fine, thank you."}]
ds = Dataset.from_list(data)
ds = ds.train_test_split(test_size=0.1)

Q: How is the Hugging Face Trainer API used to fine‑tune a model?


A: You instantiate TrainingArguments to set parameters (such as
per_device_train_batch_size, num_train_epochs, and the output directory) and then
create a Trainer with your model, tokenizer, dataset, and a data collator:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(output_dir="out", per_device_train_batch_size=4, num_train_epochs=3)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"], data_collator=collator)
trainer.train()

Q: What is the advantage of LoRA for custom training?


A: LoRA adds low‑rank adapter matrices to each transformer layer, training only these
small modules. This reduces GPU memory usage and training time, enabling fine‑tuning
with limited resources. Integration uses the peft library:

from peft import get_peft_model, LoraConfig

config = LoraConfig(r=8, lora_alpha=16)
peft_model = get_peft_model(model, config)
trainer.model = peft_model
trainer.train()

Q: How do you execute a Python script to run fine‑tuning from the command line?
A: You wrap your code in a script fine_tune.py and use standard Python invocation,
passing hyperparameters via flags or environment variables:
python fine_tune.py --model_name gpt2 --train_file data.json --epochs 3
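A minimal sketch of how fine_tune.py might read those flags (the flag names mirror the command above; the wiring into TrainingArguments is illustrative):

import argparse

parser = argparse.ArgumentParser(description="Fine-tune a causal LM")
parser.add_argument("--model_name", default="gpt2")
parser.add_argument("--train_file", required=True)
parser.add_argument("--epochs", type=int, default=3)
args = parser.parse_args()
# args.model_name, args.train_file, and args.epochs then feed the TrainingArguments and Trainer shown earlier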

Quiz:

1. Which format is commonly used for fine‑tuning examples?


A. Plain text files
B. Prompt‑completion JSON objects (Correct)
C. XML documents
D. CSV without headers
2. The DataCollatorForLanguageModeling in a Trainer is responsible for:
A. Logging metrics
B. Preparing masked or causal batches (Correct)
C. Saving model checkpoints
D. Scheduling learning rate
3. The benefit of using LoRA over full fine‑tuning is:
A. Increased model size
B. Lower resource usage and faster training (Correct)
C. Eliminating need for tokenization
D. Automatic hyperparameter tuning
4. To pass hyperparameters to a Python fine‑tuning script, you typically use:
A. Hard‑coded constants
B. argparse flags or environment variables (Correct)
C. Direct modifications in library source
D. Comments in the script

8 Advanced Techniques: Streaming, Batching &
Callbacks
Q: How can you implement streaming generation with Hugging Face’s Transformers in
Python?
A: The generate method streams tokens through a streamer object (such as
transformers.TextIteratorStreamer), letting you iterate over tokens as they become available.
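A minimal sketch using TextIteratorStreamer, assuming the model, tokenizer, and device prepared in Chapter 5 (prompt and token counts are illustrative):

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer("The future of AI is", return_tensors="pt").to(device)

# generate() blocks until completion, so run it in a thread and consume the streamer as tokens arrive
thread = Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=50, streamer=streamer))
thread.start()

for text_piece in streamer:
    print(text_piece, end="", flush=True)
thread.join()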

Q: How can you batch multiple inputs to improve inference throughput?


A: You pass a list of prompts to the pipeline and specify batch_size. Transformers
groups these into a single forward pass:

prompts = ["Hello world", "Goodbye world"]
results = generator(prompts, max_length=20, batch_size=2)

for res in results:
    # each item is a list of generation dicts (one per returned sequence)
    print(res[0]["generated_text"])

Batching reduces per‑prompt overhead.

Q: What are callback hooks? How do they help with real‑time processing?
A: Callback hooks are functions you register to execute at key moments (such as before
generation, on each new token, or after completion). In Transformers you can use
LogitsProcessor or leverage custom callback frameworks:

def on_token(token_id, score):
    print(tokenizer.decode([token_id]), end="")

# Illustrative pattern only: pipelines do not expose a stream() method with a callback argument;
# in practice you would hook on_token into a streamer loop or a custom LogitsProcessor.
stream = generator.stream("Compute 2+2:", callback=on_token)

This design pattern allows custom logging, filtering, or early stopping based on token
content.

Quiz:

1. Enabling streaming in Transformers generation allows you to:


A. Pre‑load all outputs before printing
B. Receive tokens as they are generated for real‑time display (Correct)
C. Batch multiple prompts silently
D. Remove the need for tokenization
2. Batching inputs increases throughput by:
A. Sending each prompt one by one
B. Grouping multiple prompts into a single model call (Correct)
C. Reducing model size
D. Eliminating GPU memory usage
3. A callback hook in token streaming can be used to:
A. Modify the model architecture
B. Execute custom code upon each generated token, such as filtering or logging
(Correct)
C. Disable sampling
D. Increase max_length automatically
4. When using batching, you should set the batch_size parameter to:
A. The number of GPUs
B. The number of prompts you wish to process simultaneously (Correct)
C. The maximum token length
D. The callback frequency

9 Efficiency & Optimization
Q: How do you apply quantization to an LLM in Python for reduced memory footprint?
A: Using the bitsandbytes library, you load a model in an 8‑bit precision mode:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

This reduces weight precision from 32‑bit to 8‑bit, shrinking model size and speeding
inference.

Q: What is distillation? How can you perform distillation in Python?


A: Distillation trains a smaller “student” model to mimic a larger “teacher.” You extract
soft labels from the teacher’s outputs, then fine‑tune the student using those probabilities
as targets. With the optimum library, a high‑level API might look like:

from optimum.intel import DistillationTrainer

trainer = DistillationTrainer(teacher=teacher_model, student=student_model, ...)
trainer.train()

This should produce a compact student model that retains most of the teacher’s
capabilities.

Q: How does parameter‑efficient tuning via LoRA work within Python?


A: The peft library implements LoRA by inserting trainable low‑rank adapters into each
transformer layer while freezing the base model:

from peft import get_peft_model, LoraConfig

config = LoraConfig(r=4, lora_alpha=16)
peft_model = get_peft_model(model, config)

You then fine‑tune peft_model normally: only the adapter parameters update, requiring
fewer resources than full‑model training.

Quiz:

1. Enabling load_in_8bit=True primarily achieves:
A. Increased generation diversity
B. Reduced model memory and faster inference (Correct)
C. Automatic fine‑tuning
D. Higher precision arithmetic
2. In distillation, the “student” model learns from:
A. Random noise
B. Hard labels only
C. Teacher model’s soft output probabilities (Correct)
D. Unrelated datasets
3. LoRA’s adapters are trained while the original model parameters remain:
A. Frozen (Correct)
B. Randomized
C. Quantized
D. Distilled
4. A key advantage of parameter‑efficient tuning is:
A. Full retraining of all weights
B. Minimal additional parameters and resource usage (Correct)
C. Elimination of tokenization
D. Automatic GPU selection

10 Integration & Deployment Workflows
Q: How can you embed an LLM in a FastAPI web service?
A: You can define an endpoint that receives a prompt, calls the model, and returns the
output. For example:
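A minimal sketch, assuming a local Hugging Face pipeline as the backend (the endpoint path and field names are illustrative):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(request: PromptRequest):
    # Run the model on the submitted prompt and return the generated text as JSON
    result = generator(request.prompt, max_length=50)
    return {"generated_text": result[0]["generated_text"]}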

This creates a REST API where clients send JSON with a prompt field and receive generated
text.

Q: How can you build an interactive Streamlit app for LLM inference?
A: Use Streamlit’s input/output widgets to capture user text and display responses. For
example:
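A minimal sketch, assuming the same gpt2 pipeline (widget labels are illustrative):

import streamlit as st
from transformers import pipeline

st.title("LLM Inference")
generator = pipeline("text-generation", model="gpt2")

prompt = st.text_area("Enter your prompt:")
if st.button("Generate"):
    result = generator(prompt, max_length=50)
    st.write(result[0]["generated_text"])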

This renders a web interface without manual HTML.

Q: How do you create a command‑line tool for LLM calls?


A: Use argparse to parse arguments and invoke the model accordingly:

import argparse
from transformers import pipeline

parser = argparse.ArgumentParser()
parser.add_argument("--prompt", required=True)
args = parser.parse_args()

generator = pipeline("text-generation", model="gpt2")
print(generator(args.prompt, max_length=50)[0]["generated_text"])

Running python tool.py --prompt "Hello from Inder" prints the generated text in
the terminal.

Q: What are the steps to Dockerize an LLM microservice?


A: Write a Dockerfile specifying a base image, install dependencies, copy your
application, and expose the port:

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 80
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

Quiz:

1. In FastAPI, you return model outputs from an endpoint defined with:


A. @app.get
B. @app.post (Correct)
C. @app.delete
D. @app.update
2. Streamlit’s widget to capture multi‑line text input is:
A. st.text_input
B. st.text_area (Correct)

C. st.button
D. st.slider
3. To parse CLI arguments in a Python tool, you typically use:
A. json module
B. argparse (Correct)
C. logging
D. threading
4. In a Dockerized LLM service, the CMD line often invokes:
A. python app.py
B. uvicorn with FastAPI application (Correct)
C. streamlit run
D. bash start.sh

Hope that you are finding this Introduction to LLMs in Python document useful! You can
message Inder P Singh (6 years' experience in AI and ML) on LinkedIn to get the new AI and
ML documents. You can follow Inder on Kaggle to get his public datasets and code.

11 Best Practices & Troubleshooting
Q: How should you implement logging in a Python LLM application?
A: Use the logging module to record prompts, responses, and errors with appropriate log
levels. For example:

import logging

# Include a timestamp in each record so the log forms an audit trail
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logging.info(f"Prompt: {prompt}")
response = generator(prompt)
logging.info(f"Response: {response[0]['generated_text']}")

This creates a timestamped audit trail without clutter.

Q: What strategy handles errors gracefully during inference?


A: Wrap API or model calls in try/except blocks (learn about Python try except here) and
implement retries with exponential backoff. For instance:
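A minimal retry sketch, assuming the generator pipeline from earlier chapters (the retry count and delays are illustrative):

import time
import logging

def generate_with_retries(prompt, max_retries=3):
    # Retry transient failures with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            return generator(prompt, max_length=50)[0]["generated_text"]
        except Exception as exc:
            wait = 2 ** attempt
            logging.warning(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Inference failed after all retries")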

Q: How can you manage token‑limit issues when prompts exceed the model’s context
window?
A: Implement a sliding‑window or truncation strategy: retain the most recent tokens and
drop the earliest when the combined length exceeds max_length. For example:
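A minimal truncation sketch, assuming the tokenizer from earlier chapters (the context size shown is illustrative):

MAX_CONTEXT = 1024  # e.g., gpt2's context window

def truncate_prompt(prompt, max_tokens=MAX_CONTEXT):
    token_ids = tokenizer(prompt).input_ids
    if len(token_ids) > max_tokens:
        # Sliding window: keep only the most recent tokens
        token_ids = token_ids[-max_tokens:]
    return tokenizer.decode(token_ids)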

Q: What are best practices for version control of prompts?
A: Store prompts as template files in your code repository, tag changes with commit
messages, and reference versions in your code. Use modular functions to load specific
prompt versions:

with open(f"prompts/prompt_v{version}.txt") as f:
    template = f.read()

Quiz:

1. To record the prompts and responses that occurred during runtime, you should use:
A. print() statements
B. The logging module with INFO level (Correct)
C. File I/O only
D. Database inserts
2. Implementing retries with exponential backoff helps mitigate:
A. Token limit errors
B. Transient inference errors (Correct)
C. Version control conflicts
D. Docker build issues
3. A sliding‑window token strategy is used to:
A. Increase model batch size
B. Handle prompt lengths exceeding context limits (Correct)
C. Speed up tokenization
D. Compress model weights
4. Storing prompt templates in version control helps:
A. Faster inference
B. Reproducible prompt evolution and auditability (Correct)
C. Lower memory usage
D. Automatic API key rotation

12 Introduction to LLMs in Python Quiz
1. Which command creates an isolated Python environment for LLM projects?
A. python -m venv venv (Correct) – This avoids dependency conflicts across projects.
B. pip install venv
C. conda run venv
D. python setup.py venv
2. To generate text with Hugging Face’s pipeline, you must initialize it with:
A. pipeline("text-classification")
B. pipeline("text-generation", model="gpt2") (Correct) – This loads both the
model and tokenizer for generation.
C. AutoModel.from_pretrained("gpt2")
D. generate_text("gpt2")
3. The messages list passed to OpenAI’s ChatCompletion API requires roles such as:
A. "agent", "user", "system"
B. "system", "user", "assistant" (Correct) – These define instruction context, user
prompts, and model responses.
C. "prompt", "completion"
D. "input", "output"
4. When deploying a model locally with AutoModelForCausalLM, switching to GPU
involves:
A. Calling model.cuda() and moving input tensors with .to(device) (Correct)
B. Reinstalling the model with GPU flag
C. Only setting device="cuda" in the pipeline
D. Using model.enable_gpu()
5. A prompt template rendered via Python’s str.format is an example of:
A. Prompt chaining
B. Prompt templating (Correct) – It allows dynamic insertion of variables like {text}
into a fixed string structure.
C. Few‑shot prompting
D. Model quantization
6. In prompt chaining, the output of the first API call is used to:
A. Train the model
B. Serve as the next prompt’s input (Correct) – This builds complex, multi‑step
workflows.
C. Increase token limits
D. Generate random seeds
7. Fine‑tuning with LoRA via the peft library trains only:
A. The full model parameters
B. Low‑rank adapter modules (Correct) – This reduces training cost and memory usage.
C. Tokenizer vocabulary
D. GPU kernels
8. Streaming generation in Transformers enables you to:
A. Precompute entire responses before viewing
B. Receive tokens one by one in real time (Correct) – Useful for responsive UIs.
C. Batch multiple prompts
D. Skip tokenization
9. Quantizing a model to 8‑bit precision achieves:
A. Higher accuracy than 32‑bit
B. Reduced memory footprint and faster inference (Correct)
C. Full fine‑tuning without GPU
D. Automatic prompt optimization
10. Embedding prompt templates and tracking versions in Git ensures:
A. Rapid inference
B. Reproducible prompt evolution and auditability (Correct)
C. Automatic error handling
D. Unlimited token limits
11. A Dockerized LLM microservice typically uses a CMD instruction to launch:
A. python main.py
B. uvicorn app:app --host 0.0.0.0 --port 80 (Correct) – This starts the FastAPI app for
inference.
C. streamlit run app.py
D. bash start.sh
12. Implementing retries with exponential backoff in inference code helps mitigate:
A. Model quantization errors
B. Transient runtime or API call failures (Correct) – It enables continuity so that
temporary issues don’t crash the application.
C. Tokenization speed issues
D. Version control conflicts
