Data Analysis and Visualization Lab (CS-352)
Namal University Mianwali
Faculty of Computer Science
Lab 04 – Example Dataset
In this exercise you are given a .txt file of a famous speech by Theresa May. The aim of the exercise is to plot a
bar chart of the most frequent words in the speech.
The resulting chart will look similar to the one produced in Task 5 below.
Task 1  Read the file speech.txt using Python and extract each word in the file. (Remember: this list will, and should, contain duplicates.)
Solution
# Import libraries
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
%matplotlib inline
# Read the speech.txt file and create a word list
word_list = []
with open('speech.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = text.lower().split()

# Remove punctuation characters and numbers from the word list
for word in words:
    # Remove leading/trailing punctuation and any en-dashes inside the word
    word = word.strip(string.punctuation).replace('–', '')
    # Add the cleaned word to the word list (skip empty strings and numbers)
    if word and not word.isdigit():
        word_list.append(word)

print(word_list[:20])  # printing some values
OUTPUT
['thank', 'you', 'today', 'i', 'want', 'to', 'talk', 'about', 'the', 'united', 'kingdom', 'our', 'place', 'in', 'the',
'world', 'and', 'our', 'membership', 'of']
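Note (optional): the same tokenization can be done with a regular expression. The sketch below is an alternative to the loop above, not part of the required solution; it assumes text still holds the raw speech string and may split hyphenated words slightly differently.
import re

# Optional alternative: pull out alphabetic tokens directly with a regex,
# which drops digits and punctuation (including en-dashes) in one step
word_list_re = re.findall(r"[a-z']+", text.lower())
print(word_list_re[:20])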
Task 2  Read stopwords.csv using pandas and extract all the stopwords in the file that belong to the English language.
Solution
# Read the 'stopwords.csv' file as a pandas dataframe and print its head()
data = pd.read_csv('stopwords.csv')
data.head()
OUTPUT (head of the stopwords dataframe; the columns used below are Language and Words)
# Selecting stopwords that belong to English language
english_stopwords = set(data[data['Language'] == 'English']['Words'].tolist())
print(english_stopwords)
OUTPUT
{'was', 're', 'himself', 'after', 'ourselves', 'whom', 'out', 'against', 'has', 'aren', 't', 'we', 'have', 'own',
'it', 'as', 'do', "should've", "aren't", "hadn't", 'where', 'll', 'its', 'their', 'again', 'below', 's', 've', 'the',
'too', 'wouldn', 'more', 'y', 'can', 'are', 'itself', 'needn', 'she', 'having', 'now', 'or', 'at', 'nor', 'about',
"don't", 'very', 'were', 'mustn', "mustn't", 'his', "you're", "wasn't", 'once', 'which', 'doing', 'don',
"mightn't", 'but', "shouldn't", 'ain', 'most', 'then', "isn't", 'with', 'both', 'them', 'weren', 'be', 'had',
'will', 'hasn', 'isn', 'that', "couldn't", 'there', 'only', 'into', 'here', 'does', 'doesn', 'herself', 'me', 'by',
'is', "shan't", 'if', "it's", 'from', 'while', 'theirs', 'wasn', 'of', 'because', 'just', 'm', 'than', 'in', 'haven',
'yourself', 'him', "she's", 'my', 'our', 'did', 'off', 'each', "won't", 'shouldn', 'above', 'other',
'yourselves', 'how', 'they', 'hers', 'ours', 'hadn', 'am', 'her', 'shan', 'such', 'couldn', "haven't", 'those',
'when', 'been', 'for', 'same', "you've", 'between', "didn't", 'why', 'and', 'ma', 'on', 'should', 'under',
"you'll", 'over', 'some', 'further', "needn't", "doesn't", 'myself', 'until', "weren't", 'before', 'any', 'he',
'o', 'didn', 'being', 'not', 'few', 'to', 'up', 'won', 'an', 'down', 'd', "that'll", 'through', "wouldn't", 'you',
'during', 'your', 'what', 'mightn', 'i', 'all', 'yours', 'no', 'so', 'a', 'these', 'themselves', "you'd", 'who',
"hasn't", 'this'}
Task 3  Remove all the words extracted from the speech file that are present in the English stopwords.
Solution
# Keep only the words that are not English stopwords
word_list = [word for word in word_list if word not in english_stopwords]
print(word_list[:20])  # first 20 words after stopword removal
OUTPUT
['thank', 'today', 'want', 'talk', 'united', 'kingdom', 'place', 'world', 'membership', 'european',
'union', 'start', 'want', 'make', 'clear', 'see', 'rally', 'attack', 'even', 'criticism']
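As an optional sanity check (not required by the task), the intersection of the pruned list with the stopword set should now be empty.
# Optional sanity check: no English stopword should remain in the pruned list
assert not english_stopwords & set(word_list), "stopwords still present"
print(len(word_list), "words remain after stopword removal")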
Task 4 Sort the pruned list of words from the file in order of the highest frequency of occurrence.
Solution
# Count how often each word occurs in the pruned list
frequency_dict = {}
for word in word_list:
    frequency_dict[word] = frequency_dict.get(word, 0) + 1
# Sort the dictionary by frequency of occurrence, highest first
sorted_word_freq = dict(sorted(frequency_dict.items(), key=lambda item: item[1], reverse=True))
OUTPUT
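The counting and sorting can also be done with collections.Counter from the standard library. The sketch below is an equivalent alternative to the dictionary approach above, not the required solution.
from collections import Counter

# Alternative sketch: Counter tallies the pruned words in one call and
# most_common() returns (word, count) pairs already sorted by frequency
word_counts = Counter(word_list)
sorted_word_freq = dict(word_counts.most_common())
print(word_counts.most_common(5))  # the five most frequent words and their counts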
Task 5 Plot a bar chart that represents the frequency of the top 15 words of the speech.
Solution
# Separate the sorted frequency dictionary into two lists for easy plotting
keys = list(sorted_word_freq.keys())
values = list(sorted_word_freq.values())
# Plot the words on the x-axis and their frequencies on the y-axis
plt.figure(figsize=(10, 5))
plt.bar(keys[:15], values[:15])
plt.title("Most Frequent Words in Theresa May's Speech")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=90)
plt.show()
OUTPUT (bar chart of the 15 most frequent words in the speech)
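As an optional variation on the chart above, the same bar chart can be drawn directly from a pandas Series built from the sorted frequencies. The sketch below assumes sorted_word_freq from Task 4 is still in memory.
# Optional variation: plot the top 15 words straight from a pandas Series
top15 = pd.Series(sorted_word_freq).head(15)
ax = top15.plot(kind='bar', figsize=(10, 5),
                title="Most Frequent Words in Theresa May's Speech")
ax.set_xlabel("Words")
ax.set_ylabel("Frequency")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()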
THE END