1. Problem Statement
The objective of this project is to develop a highly accurate and efficient audio and video
transcription system that converts spoken content from various recordings into written text. The
system should support a wide range of audio and video file formats, handling different accents,
dialects, and languages. The transcription should reflect spoken words accurately, including
proper names, technical terms, and context-specific vocabulary, with appropriate punctuation
and capitalization to enhance readability. Key functionalities must also include noise reduction to
manage background sounds and maintain transcription accuracy, as well as real-time
transcription capabilities for live recordings alongside batch processing for pre-recorded
files. The output should be exportable in formats like TXT.
2. Data Gathering
To enhance the dataset for our audio-to-text transcription and summarization project, we
collected a diverse set of audio data from various YouTube videos.
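One common way to collect such audio is the yt-dlp command-line tool; this is a sketch, assuming yt-dlp and ffmpeg are installed, and the URL is a placeholder:

```shell
# Download a video's audio track and convert it to WAV (ffmpeg does the conversion).
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"
```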
3. Understanding APIs
1. I researched multiple APIs used to convert audio data to text.
2. I used AssemblyAI, which provides a powerful API for speech-to-text transcription. It is designed to transcribe audio data into text with high accuracy and speed.
4. Creating an AssemblyAI Key
I have created an AssemblyAI API key for transcribing audio to text.
5. Searching Methods for Transcribing Audio to Text
1. I found the Hugging Face library.
2. The Hugging Face library has three models that are useful for converting audio to text.
3. I implemented the Wav2Vec2 and Whisper models for transcribing audio to text.
6. Displaying Video
Libraries for Displaying Video and Audio
1. I have used the moviepy library for reading video files.
2. I have used the pygame library for handling video and audio playback.
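The moviepy step above can be sketched as follows; the file names are hypothetical, and `extract_audio` assumes `pip install moviepy`:

```python
import os

def audio_output_path(video_path):
    # Pure helper: "lecture.mp4" -> "lecture.wav"
    base, _ = os.path.splitext(video_path)
    return base + ".wav"

def extract_audio(video_path):
    # Write the video's audio track to a WAV file next to it.
    from moviepy.editor import VideoFileClip  # pip install moviepy
    clip = VideoFileClip(video_path)
    out = audio_output_path(video_path)
    clip.audio.write_audiofile(out)
    clip.close()
    return out

# Usage (not run here): extract_audio("lecture.mp4")
```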
7. Transcribing Audio to Text
1. I have used AssemblyAI for transcribing audio to text.
2. I used the librosa library to load audio files.
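A minimal sketch of the AssemblyAI call, assuming `pip install assemblyai`; the file name and key are placeholders, and the helper takes the transcriber as an argument so the flow is easy to follow and test:

```python
def transcribe_file(path, transcriber):
    # transcriber follows the AssemblyAI SDK shape:
    # .transcribe(path) returns an object with a .text attribute.
    transcript = transcriber.transcribe(path)
    return transcript.text

def make_assemblyai_transcriber(api_key):
    # Build a real transcriber (requires a valid AssemblyAI key).
    import assemblyai as aai  # pip install assemblyai
    aai.settings.api_key = api_key
    return aai.Transcriber()

# Usage (not run here):
#   text = transcribe_file("interview.mp3", make_assemblyai_transcriber("YOUR_KEY"))
```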
8. Perform EDA
A. Perform Visualization
1. Visualize the Waveform
2. Visualize the Spectrogram: compute and display the spectrogram
3. Extract and Visualize MFCCs
4. Analyze Silence and Noise
5. Extract and Analyze Other Features
a. Perform Zero-Crossing: The zero-crossing rate is the rate at which the signal changes sign, indicating noise levels.
b. Perform Spectral Centroid: The spectral centroid indicates where the center of mass of the spectrum is located, giving a sense of brightness.
c. Perform Statistical Analysis: Perform statistical analysis to understand the distribution and characteristics of your audio data.
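The zero-crossing rate, spectral centroid, and basic statistics above can be computed directly with NumPy; this is a sketch on a synthetic 440 Hz tone (librosa offers equivalents such as `librosa.feature.zero_crossing_rate`):

```python
import numpy as np

def zero_crossing_rate(signal):
    # Fraction of consecutive sample pairs where the signal changes sign.
    signs = np.sign(signal)
    return np.mean(signs[:-1] != signs[1:])

def spectral_centroid(signal, sr):
    # Magnitude-weighted mean frequency: the "center of mass" of the spectrum.
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * mags) / np.sum(mags)

# Synthetic example: one second of a 440 Hz sine sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

zcr = zero_crossing_rate(tone)          # a 440 Hz sine crosses zero 880 times/sec
centroid = spectral_centroid(tone, sr)  # sits at 440 Hz for a pure tone

# Basic statistics for the "statistical analysis" step:
stats = {"mean": tone.mean(), "std": tone.std(), "peak": np.abs(tone).max()}
```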
B. Post-Processing the Transcribed Text
1. Cleaning the Transcriptions
a. Remove unwanted characters (I used the re library).
2. Normalizing the Text
3. Handling Silence and Noise
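The cleaning and normalization steps can be sketched with the `re` library; the exact patterns below (filler words, bracketed noise tags) are illustrative, not the project's actual rules:

```python
import re

def clean_transcript(text):
    # Drop bracketed noise tags and filler tokens, then collapse whitespace.
    text = re.sub(r"\[(?:noise|silence|music)\]", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\b(?:um+|uh+)\b", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

def normalize_transcript(text):
    # Lowercase and keep only word characters and basic punctuation.
    return re.sub(r"[^\w\s.,?!']", "", text.lower())
```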
9. Transcribing Audio to Text with Hugging Face Models
1. I researched the Hugging Face library and used the Wav2Vec2 and Whisper models for transcribing audio to text.
2. For Wav2Vec2, I used the IPython, SciPy, and NumPy libraries.
3. After that, I used the Whisper model for transcribing audio into text.
Screenshot of Wav2Vec2:
Screenshot of Whisper:
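Both models can be driven through the same Hugging Face `transformers` ASR pipeline; a sketch assuming `pip install transformers` plus an audio backend, with example checkpoint and file names:

```python
def transcribe_clip(path, asr):
    # asr is any pipeline-style callable returning {"text": ...}.
    return asr(path)["text"]

def load_asr_pipeline(model_name):
    # Build a real ASR pipeline; example checkpoints:
    #   "facebook/wav2vec2-base-960h" (Wav2Vec2) or "openai/whisper-small".
    from transformers import pipeline  # pip install transformers
    return pipeline("automatic-speech-recognition", model=model_name)

# Usage (not run here):
#   whisper = load_asr_pipeline("openai/whisper-small")
#   print(transcribe_clip("clip.wav", whisper))
```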
10. Learning Gemini / Gemini Pro
Gemini:
1. I learned about Gemini and received some resources from Ritesh.
2. I created a Gemini API key.
3. Using this API key, I will transcribe audio into text.
Gemini-Pro:
1. I learned about Gemini Pro.
2. I also learned how to convert video into text, and I converted a video into summarized text using Gemini.
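The Gemini summarization step can be sketched with the `google-generativeai` SDK; the key, file name, and prompt wording are placeholders:

```python
def build_summary_prompt(transcript):
    # Pure helper: wrap the transcript in a summarization instruction.
    return "Summarize the following transcript in a few sentences:\n\n" + transcript

def summarize_with_gemini(transcript, api_key):
    # Requires `pip install google-generativeai` and a valid API key.
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(build_summary_prompt(transcript))
    return response.text

# Usage (not run here):
#   summary = summarize_with_gemini(open("transcript.txt").read(), "YOUR_KEY")
```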
11. Creating a Streamlit App Using the AssemblyAI/Wav2Vec2/Whisper Models
1. I have created a Streamlit application for transcribing audio to text using AssemblyAI and saving that text into a PDF.
2. I had a discussion with Ritesh Mahale, who guided me on the Streamlit UI. Finally, I completed the Streamlit UI.
3. I have created a Streamlit application using the Whisper model for transcribing audio to text, and the results were really good. Therefore, I finalized the Whisper model.
AssemblyAI:
Whisper:
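The Streamlit UI can be sketched as below; the transcription backend is stubbed out so the layout is the focus, and the widget labels and file names are illustrative (in a real `app.py`, call `render_app()` at the top level and run `streamlit run app.py`):

```python
def fake_transcribe(audio_bytes):
    # Placeholder backend: swap in the Whisper or AssemblyAI call here.
    return "[transcript of %d bytes of audio]" % len(audio_bytes)

def render_app(transcribe=fake_transcribe):
    import streamlit as st  # pip install streamlit
    st.title("Audio to Text")
    uploaded = st.file_uploader("Upload an audio file", type=["wav", "mp3", "m4a"])
    if uploaded is not None:
        text = transcribe(uploaded.read())
        st.text_area("Transcript", text)
        st.download_button("Download TXT", text, file_name="transcript.txt")
```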
Finalisation
1. I have used the Whisper model for transcribing audio to text.
2. I have used Gemini Pro for text summarization, and I have finally completed the summarization.
Model Deployment
Using the Whisper Model:
1. I have deployed the model on Streamlit.
Link: text-summarization-jkeca3vdz9ewoingfdskaq.streamlit.app
2. I have used the Whisper model for transcribing audio into text and Gemini for text summarization.
Using AssemblyAI:
1. I have deployed the model on Streamlit.
Link: audiototext-gf2ksybgvmuhdzsvbbxmgx.streamlit.app
2. I have used AssemblyAI for transcribing audio into text.