TEXT-TO-IMAGE-GENERATOR USING STABLE
DIFFUSION MODEL
Minor Project Report
Submitted in Partial Fulfillment of the Requirement for the Award of the Degree
of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
RITIK SHARMA(01-CSE-2022)
AWAIS AHMED(30-CSE-2022)
AMREEN SAYED(34-CSE-2022)
MOHD HABIB(40-CSE-2022)
Under the Guidance of
Mr. Amit Dogra
SCHOOL OF ENGINEERING & TECHNOLOGY
BABA GHULAM SHAH BADSHAH UNIVERSITY
RAJOURI (J & K) - 185234
July 2025
DECLARATION
I hereby declare that the project entitled “Text-To-Image-Generator Using Stable Diffusion
Model” submitted for the B. Tech. (CSE) degree is my original work completed under the
supervision of Mr. Amit Dogra. To the best of my knowledge, the project has not formed the basis for
the award of any other degree, diploma, fellowship or any other similar title.
Place: Baba Ghulam Shah Badshah Signature of the Student
University Rajouri
Date:
CERTIFICATE
This is to certify that the project titled “Text-To-Image-Generator Using Stable Diffusion Model” is the
bonafide work carried out by Ritik Sharma (01-CSE-2022), a student of B. Tech. (CSE) in the School of
Engineering and Technology, Baba Ghulam Shah Badshah University, Rajouri, J&K, during the academic
year 2025, in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology (Computer Science and Engineering), and that, to the best of my knowledge, the project has
not formed the basis for the award of any other degree.
Place: Baba Ghulam Shah Badshah University, Rajouri
Date:
Signature of the Guide Signature of HoD
Signature of External Examiner
ACKNOWLEDGEMENT
I have taken efforts in this project. However, it would not have been possible without the kind support
and help of many individuals and organizations. I would like to extend my sincere thanks to all of them.
I am highly indebted to Mr. Amit Dogra (H.O.D., CSE Department, and Project Head and Guide) for his
guidance and constant supervision, for providing the necessary information regarding the project, and for
his support in completing the project. I would like to express my gratitude towards my parents and the
members of the Department of Computer Science and Engineering for their kind co-operation and
encouragement, which helped me complete this project. I would like to express my special gratitude and
thanks to my team members, who worked hard with me to complete this project on time. My thanks and
appreciation also go to my colleagues who willingly helped me out with their abilities in developing the
project.
ABSTRACT
Text-to-image generation using the Stable Diffusion model represents a significant
advancement in artificial intelligence, enabling the synthesis of high-quality, photorealistic
images from textual descriptions. Stable Diffusion, a latent diffusion model developed by
Stability AI and collaborators, leverages a computationally efficient approach by operating in the
latent space of pretrained autoencoders, rather than directly in pixel space. This innovation
reduces hardware requirements and accelerates both training and inference, making high-
fidelity image generation accessible on consumer GPUs. The proposed approach seeks to
further enhance Stable Diffusion by introducing adaptive prompt engineering, hybrid
conditioning (combining textual and visual inputs), latent space optimization, and lightweight
model variants suitable for edge computing. An ethical content moderation system is also
integrated to address biases and ensure responsible image generation. Extensive
experimentation demonstrates that Stable Diffusion achieves a strong balance between
creativity, realism, and prompt adherence, with quantitative evaluation using metrics like
Fréchet Inception Distance (FID) and CLIP Score, as well as qualitative human assessments.
The system's modular design and efficient workflow, encompassing text processing, latent
transformation, iterative denoising, and ethical filtering, make it a robust framework for diverse
creative and research applications. In summary, Stable Diffusion democratizes AI-driven
creativity by providing scalable, high-fidelity text-to-image generation, with ongoing research
focused on improving efficiency, controllability, and ethical safeguards.
TABLE OF CONTENTS
Chapter Number Title Page Number
1 Introduction 1
1. Overview of the Project
2. Scope and Objective
2 Literature Survey 2-3
1. Introduction
2. Literature Survey
3 System Design 4-6
1. Natural Language Processing
2. Advantages of NLP
3. Disadvantages of NLP
4. Architecture Diagram
5. Hardware Requirement
6. Software Requirement
4 Analysis 7-8
1. Python Library
2. Data
3. Software Description
3.1 Python
3.2 Google Colab Notebook For Python
5 Creation of Project 9-11
1. File Structure
2. Step-by-Step implementation
6 Implementation of Code 12-16
1. Sample Codes
7 Conclusion 17
8 References 18
CHAPTER 1. INTRODUCTION
1.1.OVERVIEW OF THE PROJECT
The Text-to-Image Generator Using Stable Diffusion project focuses on developing an AI-
driven system that can create high-quality, realistic images from natural language descriptions
provided by users. Utilizing the Stable Diffusion model, a cutting-edge latent diffusion
architecture, the project streamlines the image generation process by operating in a
compressed latent space, making it both computationally efficient and accessible on standard
consumer hardware. The workflow involves encoding user text prompts with a powerful
language-image model (CLIP), which guides the iterative denoising of random noise in
the latent space to synthesize images that closely match the given descriptions. The generated
images are then decoded into high-resolution visuals using a variational autoencoder. The
project highlights the model's ability to produce a wide variety of images, from simple
objects to complex scenes.
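In practice, this whole workflow is exposed through a single pipeline in the Hugging Face diffusers library. The following is a minimal sketch (the checkpoint name runwayml/stable-diffusion-v1-5 and the availability of a CUDA GPU are assumptions; Chapter 5 walks through the full setup step by step):

from diffusers import StableDiffusionPipeline
import torch

# Load a pretrained latent diffusion checkpoint (assumed identifier) onto the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded by CLIP, a random latent is iteratively denoised by the U-Net,
# and the VAE decodes the final latent into a high-resolution image
image = pipe("a red bicycle leaning against a brick wall").images[0]
image.save("example.png")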
1.2. SCOPE AND OBJECTIVE
SCOPE
To create an efficient and accessible AI system that generates high-quality images from text
prompts using the Stable Diffusion model.
OBJECTIVE:
Enable users to visualize ideas and concepts through natural language input.
Support diverse image generation tasks, such as text-to-image, inpainting, and image editing.
Provide a user-friendly, scalable, and open-source platform for creative and educational
applications.
CHAPTER 2. LITERATURE SURVEY
1. INTRODUCTION
Text-to-image generation is a rapidly advancing field in artificial intelligence that focuses on creating
images from natural language descriptions. This technology has the potential to revolutionize creative
industries, education, and digital content creation by enabling users to visualize their ideas and
concepts directly from text. The evolution of this field has seen a transition from early generative
models such as GANs and VAEs to more advanced diffusion models, which have significantly
improved the quality, diversity, and semantic accuracy of generated images. Among these, the Stable
Diffusion model stands out for its efficiency, scalability, and ability to generate high-fidelity images on
consumer hardware.
2. Literature Survey
2.1 Early Generative Models.
The initial approaches to text-to-image generation were based on Generative Adversarial Networks
(GANs) and Variational Autoencoders (VAEs). GANs, such as those introduced by Goodfellow et al.
(2014), use a generator and discriminator in a competitive framework to produce realistic images.
Models like StackGAN and AttnGAN extended GANs for text-to-image tasks, but often struggled with
training instability and limited prompt fidelity. VAEs offered a probabilistic approach, generating diverse
samples but typically at the cost of lower image clarity.
2.2 Diffusion Models.
Diffusion models have recently emerged as a powerful alternative for generative tasks. These models,
including Denoising Diffusion Probabilistic Models (DDPMs), generate images by gradually denoising
random noise, resulting in stable training and high-quality outputs. Notable systems like DALL-E 2
(OpenAI), Imagen (Google), and GLIDE (OpenAI) have demonstrated the effectiveness of diffusion
models in producing photorealistic images from text prompts.
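To make the idea of gradual denoising concrete, the following is a deliberately simplified, illustrative sketch of a reverse diffusion sampling loop. Here predict_noise stands in for a trained denoising network, and the update rule omits the exact DDPM coefficients and variance schedule used by real implementations:

import torch

def sample(predict_noise, steps=50, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                      # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)               # network's estimate of the noise at step t
        x = x - eps / steps                     # remove a small fraction of the predicted noise
        if t > 0:
            x = x + 0.01 * torch.randn_like(x)  # re-inject a little noise (stochastic step)
    return x                                    # an approximate sample from the data distribution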
2.3 Stable Diffusion.
Stable Diffusion, introduced by Stability AI in 2022, represents a breakthrough in text-to-image
synthesis. Unlike previous models that operate directly in pixel space, Stable Diffusion utilizes a latent
diffusion approach, working in a compressed latent space to achieve computational efficiency without
sacrificing image quality. The model employs a CLIP-based text encoder to translate prompts into
embeddings that guide the diffusion process, and a variational autoencoder (VAE) to decode the final
latent representation into a high-resolution image. This architecture allows Stable Diffusion to run on
standard GPUs, making advanced generative AI accessible to a wider audience.
2.4 Recent Advancements.
Recent research has focused on improving prompt adherence, image resolution, and the accurate
rendering of complex concepts. Stable Diffusion 3, for example, incorporates a Multimodal Diffusion
Transformer (MMDiT) and Rectified Flow (RF) to enhance efficiency and prompt alignment.
Techniques such as Classifier-Free Guidance (CFG) and large-scale training on datasets like
LAION-5B have further improved the model’s versatility and output quality.
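Classifier-Free Guidance, mentioned above, is conceptually simple: at every denoising step the model produces one noise prediction conditioned on the prompt and one unconditional prediction, and the two are combined with a guidance weight. A hedged sketch of that combination (variable names are illustrative, not a library API):

def apply_cfg(eps_uncond, eps_text, guidance_scale=7.5):
    # Push the prediction away from the unconditional estimate and towards
    # the text-conditioned one; a larger guidance_scale means stricter prompt adherence
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)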
2.5 Applications and Challenges.
Text-to-image diffusion models are now widely used in creative design, advertising, gaming, and
education. However, challenges remain, including generating fine details in complex scenes,
handling ambiguous or abstract prompts, and ensuring ethical use by filtering inappropriate content.
Ongoing research aims to address these issues through improved architectures, better training data,
and integrated content moderation systems.
Current research is focused on:
•Improving model efficiency and reducing computational requirements.
•Enhancing multimodal capabilities (e.g., integrating audio or video).
•Increasing the model’s ability to interpret complex and nuanced text prompts.
•Expanding real-time image generation and interactivity.
References:
•Goodfellow, I. et al. (2014). Generative Adversarial Networks.
•Ramesh, A. et al. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents
(DALL-E 2).
•Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models (Stable
Diffusion).
•Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding (Imagen).
Chapter 3. System Design
3.1. Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and human language. In this project, NLP is used to interpret and encode user-
provided text prompts, which describe the images to be generated. The system utilizes advanced
NLP models, such as CLIP (Contrastive Language-Image Pretraining), to convert textual
descriptions into high-dimensional vector representations. These vectors serve as the conditioning
input for the image generation process, ensuring that the output image aligns closely with the user’s
intent.
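As an illustration of this encoding step, the sketch below uses the transformers library to turn a prompt into CLIP text embeddings. The checkpoint openai/clip-vit-large-patch14 is the text encoder commonly associated with Stable Diffusion v1.x; the shape noted in the comment is for reference:

from transformers import CLIPTokenizer, CLIPTextModel
import torch

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt to a fixed length of 77 tokens, as the diffusion model expects
tokens = tokenizer("a watercolor painting of a lighthouse at dusk",
                   padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    embeddings = text_encoder(**tokens).last_hidden_state  # shape: (1, 77, 768)

print(embeddings.shape)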
3.2. Advantages of Natural Language Processing.
•Semantic Understanding: NLP models can interpret complex and nuanced text, allowing
users to provide detailed and creative prompts.
•Flexibility: Users can describe a wide variety of scenes, objects, and styles in natural
language, making the system highly versatile.
•User-Friendly: No technical expertise is required; users simply type what they want to see.
•Context Awareness: Advanced NLP models can understand context, relationships, and
attributes within the prompt, resulting in more accurate image generation.
3.3. Disadvantages of NLP.
•Ambiguity: Natural language can be ambiguous or vague, sometimes leading to unexpected
or inaccurate image outputs.
•Bias: NLP models may reflect biases present in their training data, which can influence the
generated images.
•Complexity Limitations: Extremely complex or abstract prompts may not always be fully
understood or visualized by the model.
•Language Limitations: Performance may vary across different languages or dialects,
depending on the training data.
3.4. Architecture Diagram.
The architecture of the text-to-image generator using Stable Diffusion consists of the following main
components, illustrated in the code sketch after this list:
1. User Interface: Accepts natural language prompts from the user.
2. Text Encoder (NLP Module): Uses a model like CLIP to convert text prompts into embeddings.
3. Latent Diffusion Model:
   1. Noise Initialization: Starts with random noise in the latent space.
   2. Denoising Network (UNet): Iteratively refines the noise, conditioned on the text embedding.
4. Variational Autoencoder (VAE) Decoder: Converts the final latent representation into a high-resolution image.
5. Output Module: Displays the generated image to the user.
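These components correspond directly to attributes of the diffusers pipeline object. The following sketch (checkpoint name assumed) simply prints them to show how the architecture maps onto code:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

print(type(pipe.text_encoder))  # CLIP text encoder (component 2: NLP module)
print(type(pipe.unet))          # UNet denoising network (component 3.2)
print(type(pipe.vae))           # variational autoencoder (component 4: decoder)
print(type(pipe.scheduler))     # noise schedule driving the iterative denoising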
3.5. Hardware Requirement.
•Processor: Intel i5 or higher / AMD equivalent
•RAM: Minimum 8 GB (16 GB recommended for faster processing)
•GPU: NVIDIA GPU with at least 4 GB VRAM (e.g., GTX 1650 or
better) for efficient image generation; CPU-only operation is possible
but slower
•Storage: Minimum 10 GB free disk space for model weights and
dependencies
3.6. Software Requirements.
•Operating System: Windows 10/11, Linux, or macOS
•Programming Language: Python 3.8 or above
•Libraries/Frameworks:
• PyTorch or TensorFlow (for deep learning)
• Hugging Face Transformers (for NLP models)
• diffusers (for Stable Diffusion implementation)
• OpenCV, PIL (for image processing)
• Flask or Streamlit (for web interface, if applicable)
•Other Tools: CUDA drivers (for GPU acceleration), Git (for version
control), and Jupyter Notebook (for experimentation)
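The stack above can also be captured in a requirements file for reproducible setup. The following requirements.txt is only an illustrative sketch: the package selection mirrors the list above and versions are deliberately left unpinned:

# requirements.txt (illustrative; pin versions as appropriate for your environment)
torch
torchvision
diffusers
transformers
accelerate
opencv-python
Pillow
streamlit   # or flask, if a web interface is used
jupyter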
Conclusion-
In this chapter, we have discussed the introduction to NLP as well as the software and
hardware requirements of the text-to-image generator project; in the next chapter, we discuss the
various Python libraries used for the creation of the project.
CHAPTER 4. ANALYSIS
1. PYTHON LIBRARY
A Python library is a collection of related modules. It contains bundles of code that
can be reused across different programs, which makes Python programming simpler
and more convenient for the programmer. For implementing a text-to-image generator
using the Stable Diffusion model, several Python libraries are essential (a minimal import
set illustrating their roles follows this list):
•diffusers: The primary library for accessing and running Stable Diffusion
models, offering easy-to-use pipelines for text-to-image generation.
•transformers: Provides pre-trained models and tokenizers, especially useful for
handling text encoding with CLIP or similar models.
•torch (PyTorch): The core deep learning framework used for model inference,
tensor operations, and GPU acceleration.
•Pillow (PIL): For image processing tasks such as saving, displaying, or
converting generated images.
•requests: Useful for API calls, especially if interacting with cloud-based or
hosted Stable Diffusion services.
•IPython.display: For displaying images within Jupyter notebooks or interactive
Python environments.
•os, json, time: Standard Python libraries for file handling, configuration, and
time management during batch processing or automation.
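A minimal import set corresponding to these libraries (illustrative only):

import os, json, time                            # file handling, configuration, timing
import torch                                     # deep learning framework and GPU tensors
from diffusers import StableDiffusionPipeline    # Stable Diffusion text-to-image pipeline
from transformers import CLIPTokenizer           # text tokenization for CLIP-style encoders
from PIL import Image                            # saving, displaying, converting images
import requests                                  # optional: calls to hosted inference APIs
from IPython.display import display              # showing images inside notebooks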
2. Data
The data component in this project refers primarily to:
•Text Prompts: User-provided natural language descriptions that guide the
image generation process.
•Model Weights: Pre-trained Stable Diffusion model files, typically several
gigabytes in size, which contain the learned parameters for generating images.
•Generated Images: Output files (usually in PNG or JPEG format) created by
the model from the given text prompts.
•Optional Datasets: For advanced use cases (like fine-tuning or evaluation),
large datasets such as LAION-5B may be used, containing millions of image-text
pairs to enhance model performance or benchmarking.
3. Software Description
3.1 Python
Python is the primary programming language for this project due to its robust
ecosystem for AI development and ease of integration with machine learning
frameworks. Python’s flexibility allows for rapid prototyping, and its extensive
libraries support everything from data processing to deep learning model
deployment.
•Why Python?
• Industry-standard for machine learning and AI research.
• Extensive support for GPU acceleration and distributed computing.
• Active community and frequent updates for AI libraries.
Other Software Components:
•Operating System: Windows 10/11, Linux, or macOS, all compatible with Python
and required libraries
•CUDA Toolkit: For GPU acceleration if using an NVIDIA GPU.
•Jupyter Notebook: For interactive development and visualization.
•Web Frameworks (optional): Flask or Streamlit can be used to build user
interfaces for prompt input and image display
3.2 Google Colab Notebook For Python
Google Colab is an online platform that provides free access to powerful GPU
resources and a collaborative environment for running Python code, making it ideal
for implementing and experimenting with deep learning models such as Stable
Diffusion. For this Text-to-Image Generator project, Google Colab streamlines
development, testing, and demonstration by providing a cloud-based, GPU-enabled
Python environment that supports all the necessary libraries and collaborative
features.
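Before running the model in Colab, it is worth confirming that a GPU runtime is active; a small check of this kind is a common first cell (the device name printed will vary):

import torch

print(torch.cuda.is_available())          # True when a GPU runtime is selected
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. a Tesla T4 on a free Colab instance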
CHAPTER 5. CREATION OF THE PROJECT
5.1. FILE STRUCTURE
This project demonstrates how to generate high-quality images from text prompts
using the Stable Diffusion model and Python. The workflow leverages Hugging
Face’s Diffusers library and runs efficiently in a Google Colab or Jupyter
Notebook environment.
The project structure involves:
• Text_to_image_generator.ipynb # Main Jupyter Notebook (this file)
• Generated_images/ # Folder for saving generated images
• README.md # Project documentation (optional)
5.2. STEP-BY-STEP IMPLEMENTATION
Step 1: Install Required Libraries
!pip install diffusers transformers accelerate
!pip install torch torchvision torchaudio
!pip install pillow
Step 2: Import Libraries
from diffusers import StableDiffusionPipeline
import torch
from PIL import Image
import os
Step 3: Authenticate with Hugging Face (if required)
from huggingface_hub import notebook_login
notebook_login()
Follow the link, login, and paste your User Access Token if prompted .
Step 4: Load the Stable Diffusion Model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")
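If no CUDA GPU is available, the same pipeline can fall back to the CPU; this is only a sketch, and generation will be much slower (float32 is used because half precision is poorly supported on CPU):

# Optional CPU fallback (slow); same model identifier as in Step 4
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float32
).to("cpu")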
Step 5: Generate an Image from a Text Prompt
prompt = "A futuristic cityscape at sunset, highly detailed, digital art"
image = pipe(prompt, num_inference_steps=50).images[0]
# Display the image
image.show()
Step 6: Save the Generated Image
output_dir = "generated_images"
os.makedirs(output_dir, exist_ok=True)
image_path = os.path.join(output_dir, "cityscape.png")
image.save(image_path)
print(f"Image saved at {image_path}")
Step 7: Experiment with More Prompts
from IPython.display import display

prompts = [
    "A serene mountain landscape in the style of a watercolor painting",
    "A cyberpunk robot playing chess, digital art",
    "A cute cat wearing a wizard hat, cartoon style"
]

for i, prompt in enumerate(prompts):
    img = pipe(prompt, num_inference_steps=40).images[0]
    img.save(os.path.join(output_dir, f"sample_{i+1}.png"))
    display(img)
CHAPTER 6. IMPLEMENTATION OF CODE
6.1. SAMPLE CODES.
# TextToImageGenerator.ipynb

# Install necessary libraries
%pip install --upgrade diffusers transformers

# Imports
import os
from pathlib import Path
import tqdm
import torch
import pandas as pd
import numpy as np
from diffusers import StableDiffusionPipeline
from transformers import pipeline, set_seed
from torch.utils.data import Dataset
import matplotlib.pyplot as plt
import cv2

# Configuration for image and prompt generation
class CFG:
    device = "cuda"
    seed = 42
    generator = torch.Generator(device).manual_seed(seed)
    image_gen_steps = 35
    image_gen_model_id = "stabilityai/stable-diffusion-2"
    image_gen_size = (400, 400)
    image_gen_guidance_scale = 9
    prompt_gen_model_id = "gpt2"
    prompt_dataset_size = 6
    prompt_max_length = 12

# === Load Stable Diffusion Pipeline ===
# (Loaded before the generation loop below so that image_gen_model is defined when it is used.)
image_gen_model = StableDiffusionPipeline.from_pretrained(
    CFG.image_gen_model_id,
    torch_dtype=torch.float16,
    revision="fp16",
    use_auth_token="your_hugging_face_auth_token"  # <-- Replace with your token
)
image_gen_model = image_gen_model.to(CFG.device)

# Load prompts from CSV
df = pd.read_csv('prompts.csv')
df['prompt'] = df['prompt'].str.replace(r'^"|"$', '', regex=True)  # Remove surrounding quotes if present

# Sample prompts according to config
sampled_prompts = df['prompt'].sample(
    n=CFG.prompt_dataset_size,
    random_state=CFG.seed
).tolist()

# Optionally truncate prompts to a maximum length (in words)
def truncate_prompt(prompt, max_length):
    words = prompt.split()
    if len(words) > max_length:
        return ' '.join(words[:max_length])
    return prompt

truncated_prompts = [truncate_prompt(p, CFG.prompt_max_length) for p in sampled_prompts]

# Create a PyTorch Dataset
class PromptDataset(Dataset):
    def __init__(self, prompts):
        self.prompts = prompts
    def __len__(self):
        return len(self.prompts)
    def __getitem__(self, idx):
        return self.prompts[idx]

# Instantiate the dataset
prompt_dataset = PromptDataset(truncated_prompts)

# Example usage:
for i, prompt in enumerate(prompt_dataset):
    print(f"Prompt {i+1}: {prompt}")

# === Generate and Save Images ===
os.makedirs("generated_images", exist_ok=True)

for idx, prompt in enumerate(truncated_prompts):
    print(f"Generating image for prompt {idx+1}: {prompt!r}")
    with torch.autocast(CFG.device):
        image = image_gen_model(
            prompt,
            num_inference_steps=CFG.image_gen_steps,
            guidance_scale=CFG.image_gen_guidance_scale,
            height=CFG.image_gen_size[1],
            width=CFG.image_gen_size[0],
            generator=CFG.generator
        ).images[0]
    image.save(f"generated_images/image_{idx+1}.png")

print("All images generated and saved in the 'generated_images/' folder.")

# Record which prompt produced which image
with open("generated_images/prompts.txt", "w") as f:
    for idx, prompt in enumerate(prompt_dataset):
        f.write(f"image_{idx+1}.png: {prompt}\n")

# Helper: generate a single image for a given prompt
def generate_image(prompt, model):
    with torch.autocast(CFG.device):
        image = model(
            prompt,
            num_inference_steps=CFG.image_gen_steps,
            generator=CFG.generator,
            guidance_scale=CFG.image_gen_guidance_scale
        ).images[0]
    image = image.resize(CFG.image_gen_size)
    return image

# Sample Prompts
generate_image("A photorealistic image of a futuristic city with flying cars and towering skyscrapers, with a vivid sunset in the background", image_gen_model)
generate_image("A mystical enchanted forest with glowing plants and magical creatures, bathed in soft dawn light", image_gen_model)
6.1.0 OUTPUT OF TextToImageGenerator.ipynb
Fig 6.1.1: Sample outputs of TextToImageGenerator.ipynb.
CHAPTER 7. CONCLUSION
The Text-to-Image Generator project using the Stable Diffusion model
demonstrates a significant advancement in generative AI, enabling the creation
of high-quality, photorealistic images from natural language prompts with
remarkable efficiency and flexibility. By leveraging a latent diffusion process,
Stable Diffusion operates in a compressed latent space, which dramatically
reduces computational requirements without sacrificing image detail or semantic
alignment. The model's architecture, featuring a variational autoencoder for
encoding and decoding, a U-Net noise predictor for progressive denoising, and
powerful text conditioning via models like CLIP, ensures that generated images
closely match the user's descriptions while preserving intricate visual features.
Overall, the project highlights Stable Diffusion's ability to democratize AI-powered
creativity, offering an accessible, scalable, and versatile tool for transforming
ideas into compelling visual representations.
REFERENCES
1. CompVis/stable-diffusion (GitHub): The official repository for Stable Diffusion, providing the
model code, architecture details, and usage instructions.
2. “Visual Explanation for Text-to-image Stable Diffusion” (arXiv, 2023): This paper introduces
Diffusion Explainer, an interactive tool that helps users understand how Stable Diffusion transforms
text prompts into images, offering insights into the model's structure and operations.
3. “TEXT-IMAGE GENERATION-STABLE DIFFUSION” (IJSREM Journal): A project-focused
article that explores the Stable Diffusion framework for text-to-image synthesis, discussing its
workflow and performance.
4. “What's in a text-to-image prompt? The potential of stable diffusion in art” (ScienceDirect, 2024):
This study analyzes how prompt structure affects image generation quality and explores the creative
potential of Stable Diffusion in artistic applications.
5. Stable Diffusion 3: Research Paper (Stability AI, 2024): The latest research paper from Stability AI
detailing the advancements, architecture, and performance of Stable Diffusion 3, including
comparisons with other state-of-the-art models.
6. Official Stable Diffusion by Stability AI (GitHub): https://github.com/Stability-AI/stablediffusion
7. Generative Models by Stability AI (including Stable Video Diffusion): https://github.com/Stability-AI/generative-models