LABORATORY MANUAL
WEB AND SOCIAL MEDIA ANALYTICS LAB
For
B. Tech IV Year I Semester
(COMPUTER SCIENCE AND ENGINEERING)
(DATA SCIENCE)
(R18 Regulations)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)
Sreyas Institute of Engineering and Technology
Prepared by B. Venkata Varma
An UGC Autonomous Institution
PROGRAM OUTCOMES (POs)
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.
COURSE STRUCTURE
(REGULATION: R18)
For the Fourth Year Undergraduate Programme
Bachelor of Technology (B.Tech)
With effect from the Academic Year 2024-25
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)
R18 B.Tech. CSE (DS) Syllabus
JNTU HYDERABAD
DATA MINING LAB
IV Year B.Tech. CSE (DS) I-Sem    L T P C
                                  0 0 2 1
Course Objectives: Exposure to various web and social media analytic techniques.
Course Outcomes:
1. Knowledge on decision support systems.
2. Apply natural language processing concepts on text analytics.
3. Understand sentiment analysis.
4. Knowledge on search engine optimization and web analytics.
List of Experiments
1. Preprocessing text document using NLTK of Python
a. Stop word elimination
b. Stemming
c. Lemmatization
d. POS tagging
e. Lexical analysis
2. Sentiment analysis on customer review on products
3. Web analytics
a. Web usage data (web server log data, click stream analysis)
b. Hyperlink data
4. Search engine optimization - implement spamdexing
5. Use Google analytics tools to implement the following
a. Conversion Statistics
b. Visitor Profiles
6. Use Google analytics tools to implement the Traffic Sources.
Resources:
1. Stanford CoreNLP package
2. GOOGLE.COM/ANALYTICS
TEXT BOOKS:
1. Ramesh Sharda, Dursun Delen, Efraim Turban, BUSINESS INTELLIGENCE
AND ANALYTICS: SYSTEMS FOR DECISION SUPPORT, Pearson Education.
REFERENCE BOOKS:
1. Rajiv Sabherwal, Irma Becerra-Fernandez, "Business Intelligence –
Practice, Technologies and Management", John Wiley, 2011.
2. Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap", Addison-Wesley IT
Service.
3. Yuli Vasiliev, "Oracle Business Intelligence: The Condensed Guide to Analysis and
Reporting", SPD Shroff, 2012.
CO-PO MAPPING
      PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
CO1    3    3    3    3    3    -    2    3    3    -     -     3
CO2    3    3    3    2    2    2    -    3    3    -     -     3
CO3    3    3    3    3    3    -    -    3    3    -     -     3
CO4    3    3    3    3    3    -    -    3    3    -     -     3
CO5    3    3    3    -    3    -    -    3    3    -     -     3
AVG    3    3    3    3    3    2    2    3    3    2     2     3
CO-PSO MAPPING:
      PSO1  PSO2
CO1    -     2
CO2    -     1
CO3    -     1
CO4    -     2
CO5    -     1
AVG    0     2
1. Preprocessing text document using NLTK of Python
a. Stop word elimination
Stop word elimination is a process used in natural language processing (NLP) to
remove common words, often called stop words, that are not essential to the
meaning of a sentence. These are typically high-frequency words like "and," "the,"
"is," and "of" that do not contribute much to understanding the content or topic of a
text.
Steps in Stop Word Elimination:
1. Tokenization: Split the text into individual words or tokens.
2. Stop Word List: Have a predefined list of stop words (e.g., provided by NLP
libraries or custom lists).
3. Filtering: Remove words from the text that are in the stop word list.
4. Reconstruction: Reassemble the text or tokens without the stop words.
Example:
Original sentence: "The cat is on the mat."
Stop words (from a predefined list): "the," "is," "on."
After stop word elimination: "cat mat."
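The four steps can be sketched without any library, as a minimal illustration only. The three-word stop list is taken from the example sentence and is not NLTK's list; the function name eliminate_stop_words is hypothetical.

```python
import string

# Illustrative stop word list taken from the example sentence (not NLTK's list)
STOP_WORDS = {"the", "is", "on"}

def eliminate_stop_words(sentence, stop_words=STOP_WORDS):
    # Step 1, tokenization: split on whitespace and strip surrounding punctuation
    tokens = [tok.strip(string.punctuation) for tok in sentence.split()]
    tokens = [tok for tok in tokens if tok]
    # Steps 2-3, filtering: drop tokens found in the stop word list (case-insensitive)
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    # Step 4, reconstruction: join the remaining tokens back into text
    return " ".join(kept)

print(eliminate_stop_words("The cat is on the mat."))  # prints: cat mat
```

Stripping punctuation during tokenization is a design choice of this sketch; without it the last token would be "mat." rather than "mat".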
The NLTK library maintains a list of around 179 English stop words (shown below; the
exact count depends on the NLTK data version) that can be used to filter stop words
from text. You may also add words to or remove words from the default list.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english'))
Out:-
‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘our’, ‘ours’, ‘ourselves’, ‘you’, “you’re”, “you’ve”, “you’ll”,
“you’d”, ‘your’, ‘yours’, ‘yourself’, ‘yourselves’, ‘he’, ‘him’, ‘his’, ‘himself’, ‘she’, “she’s”,
‘her’, ‘hers’, ‘herself’, ‘it’, “it’s”, ‘its’, ‘itself’, ‘they’, ‘them’, ‘their’, ‘theirs’, ‘themselves’,
‘what’, ‘which’, ‘who’, ‘whom’, ‘this’, ‘that’, “that’ll”, ‘these’, ‘those’, ‘am’, ‘is’, ‘are’,
‘was’, ‘were’, ‘be’, ‘been’, ‘being’, ‘have’, ‘has’, ‘had’, ‘having’, ‘do’, ‘does’, ‘did’, ‘doing’,
‘a’, ‘an’, ‘the’, ‘and’, ‘but’, ‘if’, ‘or’, ‘because’, ‘as’, ‘until’, ‘while’, ‘of’, ‘at’, ‘by’, ‘for’,
‘with’, ‘about’, ‘against’, ‘between’, ‘into’, ‘through’, ‘during’, ‘before’, ‘after’, ‘above’,
‘below’, ‘to’, ‘from’, ‘up’, ‘down’, ‘in’, ‘out’, ‘on’, ‘off’, ‘over’, ‘under’, ‘again’, ‘further’,
‘then’, ‘once’, ‘here’, ‘there’, ‘when’, ‘where’, ‘why’, ‘how’, ‘all’, ‘any’, ‘both’, ‘each’,
‘few’, ‘more’, ‘most’, ‘other’, ‘some’, ‘such’, ‘no’, ‘nor’, ‘not’, ‘only’, ‘own’, ‘same’, ‘so’,
‘than’, ‘too’, ‘very’, ‘s’, ‘t’, ‘can’, ‘will’, ‘just’, ‘don’, “don’t”, ‘should’, “should’ve”, ‘now’,
‘d’, ‘ll’, ‘m’, ‘o’, ‘re’, ‘ve’, ‘y’, ‘ain’, ‘aren’, “aren’t”, ‘couldn’, “couldn’t”, ‘didn’, “didn’t”,
‘doesn’, “doesn’t”, ‘hadn’, “hadn’t”, ‘hasn’, “hasn’t”, ‘haven’, “haven’t”, ‘isn’, “isn’t”,
‘ma’, ‘mightn’, “mightn’t”, ‘mustn’, “mustn’t”, ‘needn’, “needn’t”, ‘shan’, “shan’t”,
‘shouldn’, “shouldn’t”, ‘wasn’, “wasn’t”, ‘weren’, “weren’t”, ‘won’, “won’t”, ‘wouldn’,
“wouldn’t”
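import nltk

nltk.download('stopwords')

def stopword_elimination(text):
    # Build the stop word set once; set lookup is faster than list lookup
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # Split on whitespace and keep the words that are not stop words
    filtered_words = [word for word in text.split() if word.lower() not in stop_words]
    return filtered_words

if __name__ == '__main__':
    text = "This is a sample text with stopwords."
    filtered_words = stopword_elimination(text)
    print(filtered_words)
Output
['sample', 'text', 'stopwords.']
The last token keeps its period because text.split() splits only on whitespace.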
b. Stemming
Stemming also reduces words to their root forms but, unlike lemmatization, the
stem itself may not be a valid word in the language.
NLTK provides several stemmers based on different algorithms; we use
PorterStemmer here. In practice you would perform either stemming or lemmatization,
not both; both are shown in this manual for illustration.
We define a custom function stemming() that splits the text on whitespace and
returns the stem of each word.
import nltk
from nltk.stem import PorterStemmer

def stemming(text):
    stemmer = PorterStemmer()
    stemmed_words = []
    # Stem each whitespace-separated word
    for word in text.split():
        stemmed_words.append(stemmer.stem(word))
    return stemmed_words

if __name__ == '__main__':
    text = "This is a sample text with stemming."
    stemmed_words = stemming(text)
    print(stemmed_words)
Output
['thi', 'is', 'a', 'sampl', 'text', 'with', 'stemming.']
A second example applies the stemmer to a list of related word forms:
from nltk.stem import PorterStemmer

def stemming(words):
    porter = PorterStemmer()
    result = []
    # Stem each word in the list
    for word in words:
        result.append(porter.stem(word))
    return result

# Test
text = ['Connects', 'Connecting', 'Connections', 'Connected', 'Connection', 'Connectings',
        'Connect']
stemmed_words = stemming(text)
print(stemmed_words)
Output
['connect', 'connect', 'connect', 'connect', 'connect', 'connect', 'connect']
c. Lemmatization
Lemmatization converts a word to its base form, or lemma, by removing
affixes from inflected words. It helps create better features for machine
learning and NLP models, so it is an important preprocessing step.
NLTK provides several lemmatizers that employ different algorithms. In
our example, we use the WordNetLemmatizer class of NLTK.
We define a custom function lemmatization() that splits the text on whitespace
and lemmatizes each word. Because no POS tag is passed, WordNetLemmatizer treats
every word as a noun by default.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

def lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = []
    # Lemmatize each whitespace-separated word (default POS is noun)
    for word in text.split():
        lemmatized_words.append(lemmatizer.lemmatize(word))
    return lemmatized_words

if __name__ == '__main__':
    text = "This is a sample text with lemmatization."
    lemmatized_words = lemmatization(text)
    print(lemmatized_words)
Output
The function prints a list with one entry per whitespace-separated word. It does not
lowercase the text or remove stop words, and the last token keeps its trailing period
('lemmatization.'), because lemmatize() only maps a word to its WordNet base form.
d. POS tagging
Part-of-speech (POS) tagging assigns each token a grammatical category such as
noun, verb, or determiner. NLTK's pos_tag() function returns Penn Treebank tags,
for example DT (determiner), NN (singular noun), VBZ (verb, third person singular
present), and IN (preposition).
We define a custom function pos_tagging() that tokenizes the text with
word_tokenize() and then tags each token.
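import nltk

# Resource names for older NLTK releases; newer releases may instead
# ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    # Split the text into word and punctuation tokens
    tokens = nltk.word_tokenize(text)
    # Attach a Penn Treebank POS tag to each token
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

if __name__ == '__main__':
    text = "This is a sample text with POS tagging."
    tagged_tokens = pos_tagging(text)
    print(tagged_tokens)
Output
[('This','DT'),('is','VBZ'),('a','DT'),('sample','NN'),('text','NN'),('with','IN'), ('POS',
'NN'), ('tagging', 'VBG'), ('.', '.')]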
e. Lexical analysis
Lexical analysis is the process of converting a sequence of characters into a
sequence of tokens. In a compiler or interpreter it is the first phase, applied to
source code and followed by syntax analysis and semantic analysis. In text analytics
the same idea applies to natural language: the text is split into word and
punctuation tokens, which can then be labelled with their lexical category.
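As a dependency-free illustration of turning a character sequence into tokens, the sketch below uses a regular expression. The token pattern and the WORD/PUNCT labels are assumptions made for this example; NLTK's word_tokenize applies more elaborate rules.

```python
import re

# Hypothetical token pattern: a run of letters, digits, or apostrophes,
# or a single non-space punctuation character
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")

def simple_tokenize(text):
    # Scan the character sequence left to right and emit (token, kind) pairs
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        tok = match.group()
        kind = "WORD" if tok[0].isalnum() or tok[0] == "'" else "PUNCT"
        tokens.append((tok, kind))
    return tokens

print(simple_tokenize("This is a sample text with lexical analysis."))
```

The final period comes out as its own PUNCT token, which is the main difference from splitting on whitespace.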
import nltk

# Resource names for older NLTK releases; newer releases may instead
# ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def lexical_analysis(text):
    # Break the character sequence into tokens
    tokens = nltk.word_tokenize(text)
    # Label each token with its lexical category
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

if __name__ == '__main__':
    text = "This is a sample text with lexical analysis."
    tagged_tokens = lexical_analysis(text)
    print(tagged_tokens)
Output
[('This','DT'),('is','VBZ'),('a','DT'),('sample','NN'),('text','NN'),('with','IN'),
('lexical','JJ'),('analysis','NN'),('.','.')]
2. Sentiment analysis on customer review on products
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

def sentiment_analysis(text):
    analyzer = SentimentIntensityAnalyzer()
    # Returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    sentiment = analyzer.polarity_scores(text)
    return sentiment

if __name__ == '__main__':
    text = "This is a sample text with positive sentiment."
    sentiment = sentiment_analysis(text)
    print(sentiment)
Output
{ 'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.5574}
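To turn the score dictionary into a label for a customer review, the compound score can be compared with a threshold. The sketch below assumes the commonly used cut-offs of +0.05 and -0.05; the function name label_sentiment and the threshold parameter are hypothetical and may need tuning for a given set of reviews.

```python
def label_sentiment(scores, threshold=0.05):
    # 'compound' is the normalized overall score in the range [-1, 1]
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
sample = {'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.5574}
print(label_sentiment(sample))  # prints: positive
```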
3. Web analytics
a. Web usage data(web server log data, click stream analysis)
import pandas as pd

def web_usage_analysis(log_file):
    # Read web log data from CSV file
    try:
        log_data = pd.read_csv(log_file)
    except Exception as e:
        print(f"Error reading file: {e}")
        return
    # Check if necessary columns exist
    required_columns = ['user_id', 'session_id', 'timestamp']
    if not all(col in log_data.columns for col in required_columns):
        print("Missing required columns in the log data.")
        return
    # Group by user and count the logged requests (rows) for each user
    user_requests = log_data.groupby('user_id')['session_id'].count()
    # Display the results
    print("Web Requests per User:")
    print(user_requests)

# Example usage with a log file
log_file = '/content/web_log.csv'  # Example file path
web_usage_analysis(log_file)
Steps to Create a CSV File:
Using a text editor or spreadsheet application:
1. Open a text editor (e.g., Notepad, VS Code, Sublime Text) or a spreadsheet application.
2. Enter the following example data (comma-separated when typed in a text editor):
user_id,session_id,timestamp
101,1,9/30/2023 10:15
102,2,9/30/2023 10:20
101,3,9/30/2023 11:00
103,4,9/30/2023 11:30
104,5,9/30/2023 12:15
105,6,9/30/2023 13:00
101,7,10/1/2023 9:00
102,8,10/1/2023 9:30
104,9,10/1/2023 10:00
105,10,10/1/2023 10:45
3. Save the file as web_log.csv (make sure to set the file type to All Files or the
extension to .csv).
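The same file can also be created from Python instead of by hand. This is a minimal sketch using the standard csv module; the function name write_web_log is hypothetical, and the rows are the sample data listed above.

```python
import csv

# Sample rows from the steps above: (user_id, session_id, timestamp)
ROWS = [
    (101, 1, "9/30/2023 10:15"), (102, 2, "9/30/2023 10:20"),
    (101, 3, "9/30/2023 11:00"), (103, 4, "9/30/2023 11:30"),
    (104, 5, "9/30/2023 12:15"), (105, 6, "9/30/2023 13:00"),
    (101, 7, "10/1/2023 9:00"), (102, 8, "10/1/2023 9:30"),
    (104, 9, "10/1/2023 10:00"), (105, 10, "10/1/2023 10:45"),
]

def write_web_log(path="web_log.csv"):
    # newline="" lets the csv module control line endings on every platform
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "session_id", "timestamp"])
        writer.writerows(ROWS)
    return path

if __name__ == "__main__":
    print(f"Wrote {write_web_log()}")
```

Pass a different path (for example '/content/web_log.csv') if the analysis script expects the file elsewhere.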
Output
Web Requests per User:
user_id
101    3
102    2
103    1
104    2
105    2
Name: session_id, dtype: int64
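For the click stream side of the experiment, a simple next step is to order each user's requests in time and measure the gap between consecutive requests. The sketch below uses only the standard library on the sample log data; the function name click_gaps and the choice of minutes as the unit are assumptions of this example.

```python
from collections import defaultdict
from datetime import datetime

# (user_id, timestamp) pairs copied from the sample log
LOG = [
    (101, "9/30/2023 10:15"), (102, "9/30/2023 10:20"), (101, "9/30/2023 11:00"),
    (103, "9/30/2023 11:30"), (104, "9/30/2023 12:15"), (105, "9/30/2023 13:00"),
    (101, "10/1/2023 9:00"), (102, "10/1/2023 9:30"), (104, "10/1/2023 10:00"),
    (105, "10/1/2023 10:45"),
]

def click_gaps(log):
    # Collect each user's request times
    visits = defaultdict(list)
    for user_id, stamp in log:
        visits[user_id].append(datetime.strptime(stamp, "%m/%d/%Y %H:%M"))
    gaps = {}
    for user_id, times in visits.items():
        times.sort()
        # Minutes between consecutive requests by the same user
        gaps[user_id] = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return gaps

for user_id, minutes in sorted(click_gaps(LOG).items()):
    # Request count is one more than the number of gaps
    print(user_id, len(minutes) + 1, minutes)
```

A long gap (such as the overnight one for user 101) is often treated as the start of a new session, although the cut-off is a modelling choice.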
b. Hyperlink data
import requests
import bs4

def hyperlink_analysis(url):
    # Send a request to the URL and parse the HTML
    response = requests.get(url, timeout=10)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    # Find all hyperlinks in the page
    links = soup.find_all('a')
    # Count how many times each link target (href) appears
    link_counts = {}
    for link in links:
        href = link.get('href', '')  # Empty string if the tag has no href
        link_counts[href] = link_counts.get(href, 0) + 1
    # Print the results
    for href, count in link_counts.items():
        print(f'{href}: {count}')

# Entry point of the script
if __name__ == '__main__':
    url = 'https://www.google.com/'
    hyperlink_analysis(url)
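To run the script, install the dependencies and execute it:
pip install requests
pip install bs4
python hyperlink_analysis.py
Output (the exact links and counts depend on the live page at the time of the request)
/search: 5
/maps: 1
/shopping: 1
/about: 1
https://policies.google.com/privacy: 1
/intl/en/policies/terms/: 1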
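Hyperlink data is often split into internal links (same host) and external links. The sketch below does this with the standard library only, on a small hand-written HTML string so that it runs offline. The names LinkCollector, classify_links, and SAMPLE_HTML are hypothetical, and the sample markup is not the real Google page.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def classify_links(html, base_url):
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    counts = Counter()
    for href in parser.hrefs:
        # Resolve relative links such as /maps against the page URL
        absolute = urljoin(base_url, href)
        kind = "internal" if urlparse(absolute).netloc == base_host else "external"
        counts[kind] += 1
    return counts

# Hand-written sample markup (an assumption for illustration)
SAMPLE_HTML = (
    '<a href="/maps">Maps</a>'
    '<a href="/about">About</a>'
    '<a href="https://policies.google.com/privacy">Privacy</a>'
)
print(classify_links(SAMPLE_HTML, "https://www.google.com/"))
```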
4. Search engine optimization-implement spamdexing
import nltk

nltk.download('stopwords')

def spamdexing(text):
    # Load English stopwords from NLTK
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # Define the keywords to be added
    keywords = ['keyword1', 'keyword2', 'keyword3']
    # Filter the text by removing stopwords
    filtered_text = [word for word in text.split() if word.lower() not in stop_words]
    # Keyword stuffing: append each keyword 10 times as separate tokens
    for keyword in keywords:
        filtered_text.extend([keyword] * 10)
    return filtered_text

# Entry point of the script
if __name__ == '__main__':
    text = "This is a sample text with stopwords."
    filtered_text = spamdexing(text)
    print(filtered_text)
Output
['sample', 'text', 'stopwords.', 'keyword1', 'keyword1', 'keyword1', 'keyword1',
'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword2',
'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2',
'keyword2', 'keyword2', 'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3',
'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3']
5. Use Google analytics tools to implement the following
a. Conversion Statistics
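import requests

# Note: this targets the Universal Analytics Core Reporting API v3.
# GA4 properties are queried through the separate Google Analytics Data API.
def get_conversion_data(view_id, access_token):
    url = 'https://www.googleapis.com/analytics/v3/data/ga'
    params = {
        'ids': f'ga:{view_id}',
        'start-date': '2023-01-01',
        'end-date': '2023-08-01',
        # Goal completions and goal conversion rate across all goals
        'metrics': 'ga:goalCompletionsAll,ga:goalConversionRateAll',
        'dimensions': 'ga:date',
        'samplingLevel': 'DEFAULT'
    }
    # The API requires an OAuth 2.0 access token
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(url, params=params, headers=headers)
    return response.json()

if __name__ == '__main__':
    view_id = '1234567890'              # Replace with your actual view (profile) ID
    access_token = 'YOUR_ACCESS_TOKEN'  # Replace with a valid access token
    conversion_data = get_conversion_data(view_id, access_token)
    print(conversion_data)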
Output
The output of the program depends on the data recorded in the Analytics view and on
a valid access token. The output might include the following information:
• The conversion rate
• The number of conversions
• The number of visitors.
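The three quantities listed above are related by a simple formula: conversion rate (%) = 100 × conversions / visitors. A minimal sketch follows; the function name conversion_rate and the figures in the usage line are made up for illustration and do not come from any Analytics account.

```python
def conversion_rate(conversions, visitors):
    # Guard against division by zero when there were no visitors
    if visitors == 0:
        return 0.0
    return 100 * conversions / visitors

# Hypothetical figures: 25 conversions from 1000 visitors
print(conversion_rate(25, 1000))  # prints: 2.5
```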
b. Visitor Profiles
To create Visitor Profiles in Python, you need to analyze visitor data from sources like an
API, database, or CSV files. Visitor profiles typically include attributes such as
demographics, preferences, behavior patterns, and interaction history. Here's how you can
structure your approach:
import pandas as pd
import matplotlib.pyplot as plt
# Sample visitor data
visitor_data = [
    {"visitor_id": 1, "age": 25, "gender": "Female", "location": "New York", "visits": 10,
     "purchases": 2},
    {"visitor_id": 2, "age": 30, "gender": "Male", "location": "Los Angeles", "visits": 5,
     "purchases": 1},
    {"visitor_id": 3, "age": 22, "gender": "Female", "location": "Chicago", "visits": 15,
     "purchases": 5},
]
# Convert to DataFrame
df = pd.DataFrame(visitor_data)
# Add derived metrics
df['conversion_rate'] = (df['purchases'] / df['visits']) * 100 # Conversion rate in %
# Display basic statistics
print("Summary Statistics:")
print(df.describe())
# Group by gender
gender_summary = df.groupby('gender').agg({'visits': 'mean', 'purchases': 'mean',
'conversion_rate': 'mean'})
print("\nGender-Based Summary:")
print(gender_summary)
# Plot profiles
plt.figure(figsize=(10, 6))
df.groupby('location')['visits'].sum().plot(kind='bar', color='skyblue')
plt.title('Visits by Location')
plt.xlabel('Location')
plt.ylabel('Total Visits')
plt.show()
Output
Summary Statistics:
       visitor_id        age  visits  purchases  conversion_rate
count         3.0   3.000000     3.0   3.000000         3.000000
mean          2.0  25.666667    10.0   2.666667        24.444444
std           1.0   4.041452     5.0   2.081666         7.698004
min           1.0  22.000000     5.0   1.000000        20.000000
25%           1.5  23.500000     7.5   1.500000        20.000000
50%           2.0  25.000000    10.0   2.000000        20.000000
75%           2.5  27.500000    12.5   3.500000        26.666667
max           3.0  30.000000    15.0   5.000000        33.333333
Gender-Based Summary:
        visits  purchases  conversion_rate
gender
Female    12.5        3.5        26.666667
Male       5.0        1.0        20.000000
A bar chart titled "Visits by Location" is also displayed.
6. Use Google analytics tools to implement the Traffic Sources.
import requests
import json

def get_traffic_sources(profile_id, access_token):
    # Core Reporting API v3 endpoint (Universal Analytics)
    url = 'https://www.googleapis.com/analytics/v3/data/ga'
    params = {
        'ids': f'ga:{profile_id}',  # Use f-strings to interpolate the profile_id
        'start-date': '2023-01-01',
        'end-date': '2023-08-01',
        'metrics': 'ga:sessions',
        'dimensions': 'ga:source,ga:medium',
        'samplingLevel': 'HIGHER_PRECISION'
    }
    headers = {
        'Authorization': f'Bearer {access_token}',  # Access token for authentication
        'Accept': 'application/json'
    }
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return None

def save_data_to_json(data, filename='traffic_sources.json'):
    # Write the response data to a JSON file
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data saved to {filename}")

if __name__ == '__main__':
    profile_id = '1234567890'  # Replace with your actual profile ID
    access_token = 'YOUR_ACCESS_TOKEN'  # Replace with a valid access token
    # Fetch the traffic sources
    traffic_sources = get_traffic_sources(profile_id, access_token)
    if traffic_sources:
        save_data_to_json(traffic_sources)
Output
Invalid Access Token:
Error: 401 - {
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential.",
    "errors": [
      {
        "message": "Request is missing required authentication credential.",
        "domain": "global",
        "reason": "required"
      }
    ]
  }
}
Invalid Profile ID:
Error: 400 - {
  "error": {
    "code": 400,
    "message": "Invalid value 'ga:123456'. Values must match the pattern 'ga:[0-9]+'.",
    "errors": [
      {
        "message": "Invalid value 'ga:123456'. Values must match the pattern 'ga:[0-9]+'.",
        "domain": "global",
        "reason": "invalid"
      }
    ]
  }
}
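When a request succeeds, the saved JSON can be summarized by traffic medium. The sketch below assumes the response carries a "rows" list in which each row holds the requested dimensions followed by the metric as strings, that is [source, medium, sessions]. The sample rows are invented for illustration, and sessions_by_medium is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical response fragment: each row is [source, medium, sessions]
SAMPLE_RESPONSE = {
    "rows": [
        ["google", "organic", "120"],
        ["(direct)", "(none)", "80"],
        ["newsletter", "email", "30"],
        ["bing", "organic", "15"],
    ]
}

def sessions_by_medium(response):
    # Sum the sessions metric for each medium; a missing "rows" key means no data
    totals = defaultdict(int)
    for source, medium, sessions in response.get("rows", []):
        totals[medium] += int(sessions)
    return dict(totals)

print(sessions_by_medium(SAMPLE_RESPONSE))
```

Grouping by source instead of medium only requires changing the dictionary key inside the loop.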