SOCIAL MEDIA MINING LAB
EXP 1
To implement the task of collecting and visualizing complex social networks from Twitter and
Wikipedia using NodeXL and Python, you can follow these steps:
### 1. Collect Data:
#### From Twitter:
- Use the Tweepy library to access the Twitter API and fetch relevant data such as tweets, user
information, and relationships.
- Collect tweets based on keywords, hashtags, or user handles to build your social network.
#### From Wikipedia:
- Utilize web scraping libraries like BeautifulSoup or Scrapy to extract information from
Wikipedia pages.
- Gather data such as page links, categories, and content related to your topic of interest.
### 2. Preprocess Data:
- Clean and preprocess the collected data to remove noise, handle missing values, and format it
appropriately for analysis.
- Extract relevant features such as user mentions, hashtags, URLs, and user interactions from
Twitter data.
- Extract relevant information such as page titles, links, and categories from Wikipedia data.
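The feature extraction step for tweets can be sketched with regular expressions. This is a minimal illustration, not a complete tweet parser: the patterns and the `extract_entities` helper are assumptions made for this example, and real tweets have more edge cases than they cover.

```python
import re

# Simplified patterns (assumptions for this sketch)
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")

def extract_entities(text):
    """Return mentions, hashtags, and URLs found in one tweet's text."""
    urls = URL_RE.findall(text)
    # Remove URLs first so fragments such as "#top" are not read as hashtags
    without_urls = URL_RE.sub(" ", text)
    return {
        "mentions": [m.lower() for m in MENTION_RE.findall(without_urls)],
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(without_urls)],
        "urls": urls,
    }

sample = "RT @Alice: Great #DataScience intro by @bob https://example.com/post#top"
entities = extract_entities(sample)
print(entities)
```

When the data comes from the Twitter API, the `entities` field of each tweet already carries this information, so a parser like this is mainly useful for raw text exports.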
### 3. Create Social Network Graphs:
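- NodeXL is an Excel add-in rather than a Python library, so in Python build the graphs with NetworkX; the graph can then be exported (for example as GraphML) for use in NodeXL.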
- For Twitter data, nodes can represent users, and edges can represent interactions such as
retweets, mentions, or follows.
- For Wikipedia data, nodes can represent Wikipedia pages, and edges can represent links between
pages or shared categories.
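The node and edge mapping above can be sketched as a weighted edge list. The `records` data, the helper names, and the `Source`/`Target`/`Weight` column names are assumptions chosen for this example; adapt them to whatever tool will read the file.

```python
import csv
import io
from collections import Counter

# Hypothetical preprocessed records: (author, list of mentioned users)
records = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
    ("dave", []),
]

def build_edge_weights(records):
    """Count how often each author mentions each user (directed edges)."""
    weights = Counter()
    for author, mentioned in records:
        for target in mentioned:
            if target != author:  # ignore self-mentions
                weights[(author, target)] += 1
    return weights

def to_edge_list_csv(weights):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

edge_weights = build_edge_weights(records)
print(to_edge_list_csv(edge_weights))
```

Repeated interactions become edge weights instead of duplicate edges, which keeps the graph simple while preserving how strong each tie is.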
### 4. Visualize Networks:
- Once you have created the social network graphs, use NodeXL's built-in visualization features to
visualize the networks.
- Customize the visualization settings to highlight important nodes, edges, or clusters within the
networks.
- Experiment with different layout algorithms to find the most suitable layout for your data.
### Example Code:
```python
import tweepy
import networkx as nx
import matplotlib.pyplot as plt
import wikipedia
# Twitter API credentials
consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'
# Authenticate with Twitter API
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token,
access_token_secret)
api = tweepy.API(auth)
# Fetch tweets based on a keyword (Tweepy 4.x renamed api.search to api.search_tweets)
tweets = api.search_tweets(q='data science', count=100)
# Create an empty graph
twitter_graph = nx.Graph()
# Add nodes for users
for tweet in tweets:
    user_id = tweet.user.id
    user_name = tweet.user.screen_name
    twitter_graph.add_node(user_id, label=user_name)
# Add edges for retweets and mentions
for tweet in tweets:
    user_id = tweet.user.id
    for mentioned_user in tweet.entities['user_mentions']:
        mentioned_user_id = mentioned_user['id']
        twitter_graph.add_edge(user_id, mentioned_user_id)
# Visualize the Twitter network
nx.draw(twitter_graph, with_labels=True)
plt.show()
# Fetch Wikipedia page links
page = wikipedia.page("Data science")
links = page.links
# Create an empty graph
wiki_graph = nx.Graph()
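# Add nodes for Wikipedia pages
for link in links:
    wiki_graph.add_node(link)
# Add edges for page links (one request per linked page, so this is slow for long link lists)
for link in links:
    try:
        linked_pages = wikipedia.page(link).links
    except wikipedia.exceptions.WikipediaException:
        # Skip disambiguation pages and titles that do not resolve
        continue
    for linked_page in linked_pages:
        if linked_page in links:
            wiki_graph.add_edge(link, linked_page)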
# Visualize the Wikipedia network
nx.draw(wiki_graph, with_labels=True)
plt.show()
```
Make sure to replace `'your_consumer_key'`, `'your_consumer_secret'`, `'your_access_token'`, and
`'your_access_token_secret'` with your actual Twitter API credentials. Also, ensure you have
installed the required libraries (`tweepy`, `networkx`, `matplotlib`, `wikipedia`) using pip.
EXP 2
To compute various vertex and network metrics for social graphs using NodeXL and Python, you
can utilize the NetworkX library, which provides implementations for these metrics. The example
below shows how to compute each of the specified metrics:
```python
import networkx as nx
import matplotlib.pyplot as plt
# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# Assuming you have loaded your graph, let's compute the metrics:
# (i) Degree Centrality
degree_centrality = nx.degree_centrality(G)
# (ii) Eigenvector Centrality
eigenvector_centrality = nx.eigenvector_centrality(G)
# (iii) Betweenness Centrality
betweenness_centrality = nx.betweenness_centrality(G)
# (iv) PageRank
pagerank = nx.pagerank(G)
# (v) Closeness Centrality
closeness_centrality = nx.closeness_centrality(G)
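# (vi) Group Centrality (group degree centrality of a chosen set of nodes)
# The group here is the three highest-degree nodes; this choice is an assumption
group = sorted(G.nodes(), key=G.degree, reverse=True)[:3]
group_centrality = nx.group_degree_centrality(G, group)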
# (vii) Clustering Coefficient
clustering_coefficient = nx.clustering(G)
# Print or visualize the computed metrics
print("Degree Centrality:", degree_centrality)
print("Eigenvector Centrality:", eigenvector_centrality)
print("Betweenness Centrality:", betweenness_centrality)
print("PageRank:", pagerank)
print("Closeness Centrality:", closeness_centrality)
print("Group Centrality:", group_centrality)
print("Clustering Coefficient:", clustering_coefficient)
# Example: Visualize the network with node sizes representing degree centrality
# You can customize the visualization as per your preference
node_sizes = [degree_centrality[node] * 1000 for node in G.nodes()]
nx.draw(G, with_labels=True, node_size=node_sizes)
plt.show()
```
Make sure to replace `'your_graph_file.txt'` with the path to your social graph file if you're loading
it from a file. Also, ensure you have installed the required libraries (`networkx`, `matplotlib`) using
pip.
This code snippet computes the specified metrics for the given social graph and prints them out.
You can also visualize the network with customized node sizes based on degree centrality as shown
in the example. Adjustments and customizations can be made according to your specific
requirements.
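To see what two of these metrics measure, they can be computed by hand on a tiny graph. The sketch below uses a hypothetical adjacency dictionary and follows the standard definitions of degree centrality and the local clustering coefficient; it is meant for checking intuition, not for replacing the library functions.

```python
from itertools import combinations

# Small undirected example graph as an adjacency dict (hypothetical data)
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def degree_centrality(graph):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    possible = len(nbrs) * (len(nbrs) - 1) / 2
    return links / possible

dc = degree_centrality(graph)
cc = {node: clustering_coefficient(graph, node) for node in graph}
print(dc)
print(cc)
```

Node `a` has the highest degree centrality but a low clustering coefficient, because most of its neighbors are not connected to each other.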
EXP 3
To visualize social graphs reflecting various metrics using NodeXL and Python, you can use
NetworkX for graph manipulation and Matplotlib for visualization. Below is a sample code
demonstrating how to visualize a social graph while reflecting the computed metrics:
```python
import networkx as nx
import matplotlib.pyplot as plt
# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# Assuming you have loaded your graph, let's compute some metrics
# For demonstration purposes, let's compute Degree Centrality and PageRank
# (i) Degree Centrality
degree_centrality = nx.degree_centrality(G)
# (iv) PageRank
pagerank = nx.pagerank(G)
# Create a new figure for plotting
plt.figure(figsize=(10, 8))
# Define a layout for the graph
pos = nx.spring_layout(G)
# Draw nodes colored by degree centrality; keep the returned collection for the color bar
nodes = nx.draw_networkx_nodes(G, pos, node_color=[degree_centrality[n] for n in G.nodes()],
cmap=plt.cm.Blues, node_size=300)
# Draw edges once, in semi-transparent gray
nx.draw_networkx_edges(G, pos, edge_color='gray', alpha=0.5)
# Add color bar for degree centrality (the mappable must be passed explicitly)
plt.colorbar(nodes, label='Degree Centrality')
# Add labels for nodes with PageRank
nx.draw_networkx_labels(G, pos, labels={node: f"{node}\nPR: {pagerank[node]:.2f}" for node in
G.nodes()}, font_size=8)
# Add title
plt.title("Social Network Graph with Degree Centrality and PageRank")
# Show plot
plt.axis('off') # Turn off axis
plt.show()
```
In this code:
- We compute Degree Centrality and PageRank for the given social graph.
- The graph is visualized using a spring layout.
- Nodes are colored based on their degree centrality, and node size remains constant.
- Edges are drawn with a semi-transparent gray color.
- Node labels are added, displaying both the node ID and its corresponding PageRank value.
- A color bar is added to indicate the degree centrality of nodes.
You can customize the visualization further according to your specific requirements or include
additional metrics to reflect in the visualization.
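One way to reflect a further metric is through node size. Raw centrality values are often too small or too close together to use directly, so they can be rescaled first. The helper below is a sketch; `scale_to_sizes`, its default size range, and the example values are assumptions for illustration.

```python
def scale_to_sizes(metric, min_size=100, max_size=1500):
    """Linearly rescale metric values to the range [min_size, max_size]."""
    low, high = min(metric.values()), max(metric.values())
    if high == low:
        # All values are equal, so give every node the midpoint size
        mid = (min_size + max_size) / 2
        return {node: mid for node in metric}
    span = high - low
    return {node: min_size + (value - low) / span * (max_size - min_size)
            for node, value in metric.items()}

# Hypothetical PageRank values for four nodes
pagerank_example = {"a": 0.40, "b": 0.25, "c": 0.25, "d": 0.10}
sizes = scale_to_sizes(pagerank_example)
print(sizes)
```

The result can be passed to a drawing call as `node_size=[sizes[n] for n in G.nodes()]`, so the list order matches the node order of the graph.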
EXP 4
Detecting bridges in a social graph helps identify edges whose removal would disconnect the graph.
You can accomplish this using NetworkX in Python. Below is how you can implement it:
```python
import networkx as nx
# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# Assuming you have loaded your graph, let's detect bridges
# Detect bridges in the graph
bridges = list(nx.bridges(G))
# Print the bridges
if bridges:
    print("Bridges found in the graph:")
    for bridge in bridges:
        print(bridge)
else:
    print("No bridges found in the graph.")
```
In this code:
- We utilize NetworkX's `nx.bridges(G)` function to detect bridges in the graph `G`.
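- Bridges are edges whose removal increases the number of connected components of the graph.
- The function returns a generator of `(u, v)` tuples, one per bridge edge; wrapping it in `list()` collects the results. It is defined for undirected graphs only.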
You can adapt this code to your specific graph data and further analyze or visualize the detected
bridges as required.
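The definition of a bridge can be checked directly: remove each edge in turn and see whether the number of connected components grows. The sketch below does exactly that on a small hypothetical graph. It is far slower than a dedicated algorithm and is only meant to make the definition concrete.

```python
def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return components

def find_bridges(nodes, edges):
    """Return edges whose removal increases the number of components."""
    baseline = count_components(nodes, edges)
    bridges = []
    for i, edge in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        if count_components(nodes, remaining) > baseline:
            bridges.append(edge)
    return bridges

# A triangle (1, 2, 3) with a tail 3 - 4 - 5
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
print(find_bridges(nodes, edges))
```

Edges inside the triangle are not bridges because each lies on a cycle; the two tail edges are.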
EXP 5
Detecting communities and influencers in a social graph can provide insights into the structure and
key players within the network. One approach to identifying communities is through clique
identification, where cliques represent densely connected subgraphs. Below is a basic
implementation of brute-force clique identification on Enron email data using NetworkX in Python:
```python
import networkx as nx
# Load Enron email data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('enron_emails.txt')
# Assuming you have loaded your graph, let's identify cliques (communities)
# Enumerate all maximal cliques in the graph
cliques = list(nx.find_cliques(G))
# Print the identified cliques
if cliques:
    print("Cliques found in the graph:")
    for clique in cliques:
        print(clique)
else:
    print("No cliques found in the graph.")
```
In this code:
- We utilize NetworkX's `nx.find_cliques(G)` function to identify all maximal cliques in the Enron
email graph `G`.
- A clique is a complete subgraph, one where every node is connected to every other node. It is
maximal if no further node can be added without breaking that property.
- The function returns an iterator over maximal cliques, each given as a list of nodes; `list()` collects them.
Enumerating all maximal cliques takes exponential time in the worst case, so this approach may not
be efficient for large graphs. Note that `nx.find_cliques` uses the Bron-Kerbosch algorithm rather
than a naive enumeration of node subsets. You can explore community detection algorithms such as
Louvain or Girvan-Newman for more scalable solutions. Additionally, you can analyze the identified
cliques further to identify influencers within each community based on their centrality metrics or
other criteria.
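For comparison, a literal brute-force search can be sketched as follows: enumerate every subset of nodes, keep those that form a clique, and discard cliques contained in a larger one. The graph and function names are hypothetical, and the approach is only workable for very small graphs because the number of subsets doubles with each added node.

```python
from itertools import combinations

def is_clique(graph, nodes):
    """True if every pair of the given nodes is connected."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))

def maximal_cliques_bruteforce(graph):
    """Enumerate every node subset, keep cliques, then drop non-maximal ones."""
    nodes = sorted(graph)
    cliques = [set(subset)
               for size in range(1, len(nodes) + 1)
               for subset in combinations(nodes, size)
               if is_clique(graph, subset)]
    # A clique is maximal if no other clique strictly contains it
    return [c for c in cliques if not any(c < other for other in cliques)]

# Hypothetical undirected graph as an adjacency dict
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
cliques = maximal_cliques_bruteforce(graph)
print(sorted(sorted(c) for c in cliques))
```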
EXP 6
Implementing the Girvan-Newman algorithm for community detection involves iteratively
removing edges from the graph based on edge betweenness centrality until the graph is divided into
separate communities. Below is how you can implement the Girvan-Newman algorithm using
NetworkX in Python:
```python
import networkx as nx
import matplotlib.pyplot as plt
# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# Assuming you have loaded your graph, let's apply Girvan-Newman algorithm
# Function to find communities using Girvan-Newman algorithm
def girvan_newman(G):
    # The generator yields one partition per split level; take the first split
    return next(nx.community.girvan_newman(G))
# Detect communities using Girvan-Newman algorithm
communities = girvan_newman(G)
# Visualize the communities
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G)
nx.draw(G, pos, node_color='lightblue', with_labels=True)
# Draw communities with different colors
for i, community in enumerate(communities):
    nx.draw_networkx_nodes(G, pos, nodelist=list(community),
                           node_color=[plt.cm.jet(i / len(communities))])
plt.title("Graph with Communities Identified by Girvan-Newman Algorithm")
plt.show()
```
In this code:
- We define a function `girvan_newman()` to apply the Girvan-Newman algorithm to the graph.
- We use NetworkX's built-in implementation of the Girvan-Newman algorithm.
- The algorithm returns a generator of tuples, where each tuple represents a partition of the graph
into communities.
- We visualize the graph with nodes colored based on their communities using matplotlib.
You can customize this code according to your specific graph data and further analyze or visualize
the detected communities as required.
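Because Girvan-Newman produces a sequence of ever finer partitions, you need some criterion for choosing one. A common choice is modularity, which compares the share of edges inside communities with what would be expected at random. The sketch below computes it by hand for an undirected, unweighted graph; the example edges and partitions are hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity Q for an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: i
                    for i, members in enumerate(communities)
                    for node in members}
    internal = [0] * len(communities)    # edges inside each community
    degree_sum = [0] * len(communities)  # total degree of each community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
               for i in range(len(communities)))

# Two triangles joined by the single edge (3, 4)
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = [{1, 2, 3}, {4, 5, 6}]
one_group = [{1, 2, 3, 4, 5, 6}]
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the two triangles apart scores higher than keeping everything in one community, so that partition would be preferred. NetworkX offers `nx.community.modularity(G, communities)` for the same purpose.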
EXP 7
Performing classification with network information involves leveraging features derived from the
network structure to classify nodes. One approach is the Weighted Vote Relational Neighbor (WV-RN)
classifier, which predicts a node's class from the weighted votes of its neighbors' classes.
The code below is a simpler feature-based stand-in for Twitter data: it trains a k-nearest-neighbors
classifier on structural features and does not implement WV-RN itself:
```python
import networkx as nx
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load your Twitter data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('twitter_data.txt')
# Assuming you have loaded your graph and labeled nodes for classification
# Function to extract features from the graph for classification
def extract_features(G, node):
    # Degree of the node
    degree = G.degree(node)
    # Average neighbor degree of the node (0 for isolated nodes)
    neighbor_degrees = sum(G.degree(neighbor) for neighbor in G.neighbors(node))
    avg_neighbor_degree = neighbor_degrees / degree if degree > 0 else 0.0
    return [degree, avg_neighbor_degree]
# Create feature matrix and target labels
X = []
y = []
for node in G.nodes():
    features = extract_features(G, node)
    X.append(features)
    # Assuming each node has a label; positive_nodes is a set you must define
    label = 1 if node in positive_nodes else 0
    y.append(label)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the classifier
classifier = KNeighborsClassifier(n_neighbors=5, weights='distance')
classifier.fit(X_train, y_train)
# Predict the labels for test data
y_pred = classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
In this code:
- We define a function `extract_features()` to extract features from the graph for classification. In
this example, we extract the node's degree and average neighbor degree.
- We iterate over all nodes in the graph and extract features along with their labels for classification.
- We split the data into training and test sets.
- We train a K-nearest neighbors classifier using the extracted features.
- We evaluate the classifier's performance by predicting labels for the test data and calculating
accuracy.
You can customize this code according to your specific Twitter data and classification requirements.
Additionally, you can explore more sophisticated classifiers and feature extraction techniques for
better classification performance.
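A relational neighbor classifier works differently from the feature-based code above: each unlabeled node takes the weighted average of its neighbors' class probabilities, and the estimates are refined over several rounds. The following is a minimal sketch of that idea for two classes, with a simple iterative update standing in for the collective inference step. The graph, edge weights, and function name are assumptions for illustration.

```python
def wvrn_scores(graph, known, iterations=10, prior=0.5):
    """Estimate P(label = 1) for unlabeled nodes by weighted neighbor voting.

    graph: {node: {neighbor: edge_weight}}; known: {node: 0 or 1}.
    """
    scores = {node: float(known.get(node, prior)) for node in graph}
    for _ in range(iterations):
        updated = dict(scores)
        for node, nbrs in graph.items():
            if node in known or not nbrs:
                continue  # labeled nodes stay fixed; isolated nodes keep the prior
            total = sum(nbrs.values())
            updated[node] = sum(w * scores[u] for u, w in nbrs.items()) / total
        scores = updated
    return scores

# Hypothetical weighted graph; only nodes "a" and "e" have known labels
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0, "d": 1.0},
    "d": {"c": 1.0, "e": 3.0},
    "e": {"d": 3.0},
}
known = {"a": 1, "e": 0}
scores = wvrn_scores(graph, known)
predicted = {n: int(s >= 0.5) for n, s in scores.items() if n not in known}
print(predicted)
```

Node `d` is pulled toward class 0 by its heavy edge to `e`, while `b` and `c` follow `a`. On Twitter data, the edge weights could be mention or retweet counts.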
EXP 8
Performing sentiment analysis on an IMDb dataset involves analyzing the sentiment (positive or
negative) associated with movie reviews. Below is a basic implementation of sentiment analysis on
an IMDb dataset using Python with the NLTK library:
```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Load your IMDb dataset (CSV file) into a pandas DataFrame
# Example:
# imdb_data = pd.read_csv('imdb_dataset.csv')
# Assuming you have loaded your IMDb dataset, let's perform sentiment analysis
# Initialize the SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
# Function to assign sentiment polarity scores to text
def get_sentiment(text):
    # Calculate sentiment polarity scores using VADER
    scores = sid.polarity_scores(text)
    # Classify sentiment based on compound score
    if scores['compound'] >= 0.05:
        return 'Positive'
    elif scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'
# Apply sentiment analysis to the IMDb dataset
imdb_data['Sentiment'] = imdb_data['Review'].apply(get_sentiment)
# Print the sentiment analysis results
print("Sentiment Analysis Results:")
print(imdb_data['Sentiment'].value_counts())
```
In this code:
- We use the NLTK library's VADER (Valence Aware Dictionary and sEntiment Reasoner)
sentiment analysis tool for sentiment analysis.
- We load the IMDb dataset into a pandas DataFrame.
- We define a function `get_sentiment()` to calculate sentiment polarity scores using VADER and
classify the sentiment as positive, negative, or neutral based on the compound score.
- We apply sentiment analysis to the 'Review' column of the IMDb dataset and add a new column
'Sentiment' to store the sentiment labels.
- Finally, we print the sentiment analysis results, showing the counts of positive, negative, and
neutral sentiments in the dataset.
You need to ensure you have the NLTK library installed (`pip install nltk`) and have downloaded
the VADER lexicon (`nltk.download('vader_lexicon')`). Additionally, replace `'imdb_dataset.csv'`
with the path to your IMDb dataset CSV file.
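If your dataset also contains ground-truth sentiment labels, you can check how well the predicted labels agree with them. The sketch below assumes the ground truth uses only 'Positive' and 'Negative', so reviews predicted as 'Neutral' are left out of the accuracy and reported through a separate coverage figure. The label lists are made-up examples.

```python
def evaluate(predicted, actual):
    """Compare predicted labels with ground-truth labels.

    Reviews predicted as 'Neutral' are skipped, since the assumed
    ground truth only contains 'Positive' and 'Negative'.
    """
    decided = [(p, a) for p, a in zip(predicted, actual) if p != "Neutral"]
    correct = sum(1 for p, a in decided if p == a)
    return {
        "coverage": len(decided) / len(predicted),
        "accuracy": correct / len(decided) if decided else 0.0,
    }

# Hypothetical predictions and ground-truth labels for five reviews
predicted = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
actual = ["Positive", "Negative", "Positive", "Negative", "Negative"]
result = evaluate(predicted, actual)
print(result)
```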
EXP 9
To apply the k-means clustering algorithm on an IMDb dataset using Python, you can use libraries
such as scikit-learn. Below is a basic implementation:
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# Load your IMDb dataset (CSV file) into a pandas DataFrame
# Example:
# imdb_data = pd.read_csv('imdb_dataset.csv')
# Assuming you have loaded your IMDb dataset, let's apply k-means clustering
# Extract text data for clustering (e.g., movie reviews)
text_data = imdb_data['Review']
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
# Fit and transform the text data to TF-IDF features
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
# Apply k-means clustering
k = 3 # Number of clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible clusters
kmeans.fit(tfidf_features)
# Get cluster labels for each data point
cluster_labels = kmeans.labels_
# Add cluster labels to the DataFrame
imdb_data['Cluster'] = cluster_labels
# Print the count of reviews in each cluster
print("Number of reviews in each cluster:")
print(imdb_data['Cluster'].value_counts())
```
In this code:
- We use the scikit-learn library to perform k-means clustering.
- We extract text data (e.g., movie reviews) from the IMDb dataset.
- We initialize a TF-IDF vectorizer to convert text data into numerical features.
- We fit and transform the text data into TF-IDF features.
- We apply k-means clustering with a specified number of clusters (k).
- We add cluster labels to the IMDb dataset.
- Finally, we print the count of reviews in each cluster.
You need to ensure you have scikit-learn and pandas installed (`pip install scikit-learn pandas`).
Additionally, replace `'imdb_dataset.csv'` with the path to your IMDb dataset CSV file. Adjust the
parameters of the TF-IDF vectorizer and the number of clusters (k) according to your specific
dataset and requirements.
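To see what the clustering step does internally, the two alternating steps of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be written out for 2-D points. This is a teaching sketch with hypothetical data and caller-supplied starting centroids, not a substitute for the scikit-learn implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(len(centroids)),
                      key=lambda j: (x - centroids[j][0]) ** 2
                                    + (y - centroids[j][1]) ** 2)
                  for x, y in points]
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

With TF-IDF features the points have thousands of dimensions instead of two, but the same two steps apply.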
EXP 10
To apply user-based collaborative filtering on Amazon review data using Python, you can use
libraries such as Surprise. Below is a basic implementation:
```python
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse
# Load your Amazon review data into a Surprise dataset object
# Example:
# reader = Reader(line_format='user item rating', sep=',')
# amazon_data = Dataset.load_from_file('amazon_reviews.csv', reader)
# Assuming you have loaded your Amazon review data, let's apply user-based collaborative filtering
# Split the data into train and test sets
trainset, testset = train_test_split(amazon_data, test_size=0.2)
# Define the user-based collaborative filtering model (KNN)
sim_options = {'name': 'cosine', 'user_based': True} # Use cosine similarity
knn_model = KNNBasic(sim_options=sim_options)
# Train the model on the training set
knn_model.fit(trainset)
# Make predictions on the test set
predictions = knn_model.test(testset)
# Compute RMSE (Root Mean Squared Error); lower values mean better predictions
error = rmse(predictions, verbose=False)
print("RMSE:", error)
```
In this code:
- We use the Surprise library, which provides collaborative filtering algorithms and evaluation
metrics.
- We load the Amazon review data into a Surprise dataset object.
- We split the data into training and test sets.
- We define the user-based collaborative filtering model using the KNNBasic algorithm with cosine
similarity.
- We train the model on the training set.
- We make predictions on the test set.
- We compute RMSE to evaluate the model's performance.
You need to ensure you have Surprise installed (`pip install scikit-surprise`). Additionally, replace
`'amazon_reviews.csv'` with the path to your Amazon review data CSV file. Adjust the parameters
of the model and evaluation metrics according to your specific dataset and requirements.
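The prediction rule behind user-based collaborative filtering can be sketched by hand: measure how similar the target user is to every other user who rated the item, then take the similarity-weighted average of their ratings. The ratings below are hypothetical, and unlike `KNNBasic` this sketch uses all available neighbors instead of only the k nearest.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "u3": {"i1": 1, "i2": 5, "i4": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += sim
    return numerator / denominator if denominator else None

prediction = predict(ratings, "u1", "i4")
print(prediction)
```

The prediction for `u1` lands closer to the rating given by `u2` than to that of `u3`, because `u1` and `u2` have rated their common items more alike.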
EXP 11
To apply item-based collaborative filtering on Amazon review data using Python, you can still use
the Surprise library. However, you'll need to set the `user_based` option in `sim_options` to
`False` to perform item-based collaborative filtering. Below is how you can implement it:
```python
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse
# Load your Amazon review data into a Surprise dataset object
# Example:
# reader = Reader(line_format='user item rating', sep=',')
# amazon_data = Dataset.load_from_file('amazon_reviews.csv', reader)
# Assuming you have loaded your Amazon review data, let's apply item-based collaborative filtering
# Split the data into train and test sets
trainset, testset = train_test_split(amazon_data, test_size=0.2)
# Define the item-based collaborative filtering model (KNN)
sim_options = {'name': 'cosine', 'user_based': False} # Use cosine similarity
knn_model = KNNBasic(sim_options=sim_options)
# Train the model on the training set
knn_model.fit(trainset)
# Make predictions on the test set
predictions = knn_model.test(testset)
# Compute RMSE (Root Mean Squared Error); lower values mean better predictions
error = rmse(predictions, verbose=False)
print("RMSE:", error)
```
In this code:
- We still use the Surprise library for collaborative filtering.
- We load the Amazon review data into a Surprise dataset object.
- We split the data into training and test sets.
- We define the item-based collaborative filtering model using the KNNBasic algorithm with cosine
similarity and `user_based` parameter set to `False`.
- We train the model on the training set.
- We make predictions on the test set.
- We compute RMSE to evaluate the model's performance.
You need to ensure you have Surprise installed (`pip install scikit-surprise`). Additionally, replace
`'amazon_reviews.csv'` with the path to your Amazon review data CSV file. Adjust the parameters
of the model and evaluation metrics according to your specific dataset and requirements.
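In the item-based variant the similarity is computed between items instead of users, and a user's predicted rating for an item is a weighted average of that same user's ratings on similar items. The sketch below illustrates this on hypothetical ratings. It uses every item the user has rated, not only the k most similar ones.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "u3": {"i1": 1, "i2": 5, "i4": 2},
}

def invert(ratings):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    by_item = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item

def cosine_similarity(a, b):
    """Cosine similarity over the users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_item_based(ratings, user, item):
    """Weight the user's own ratings by each rated item's similarity to the target."""
    by_item = invert(ratings)
    numerator = denominator = 0.0
    for other_item, rating in ratings[user].items():
        sim = cosine_similarity(by_item[item], by_item[other_item])
        numerator += sim * rating
        denominator += sim
    return numerator / denominator if denominator else None

prediction = predict_item_based(ratings, "u1", "i4")
print(prediction)
```

Item-based similarities tend to be preferred when there are many more users than items, since the item-item similarity matrix is then smaller.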
EXP 12
Predicting individual behavior of users in social media can involve various techniques depending on
the specific behavior you're interested in. One common approach is to use machine learning
algorithms to predict user actions or preferences based on historical data and user features. Below is
a basic example of how you can predict user behavior in social media using Python with scikit-learn:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load your social media data into a pandas DataFrame
# Example:
# social_media_data = pd.read_csv('social_media_data.csv')
# Assuming you have loaded your social media data, let's predict user behavior
# Define features and target variable
# Features must be numeric; encode categorical columns first (e.g. with pd.get_dummies)
X = social_media_data.drop(columns=['target_column']) # Features
y = social_media_data['target_column'] # Target variable
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the machine learning model (e.g., RandomForestClassifier)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model on the training set
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
In this code:
- We load social media data into a pandas DataFrame. This data should contain features related to
user behavior and a target column representing the behavior you want to predict.
- We define features (X) and the target variable (y).
- We split the data into training and test sets.
- We define a machine learning model (e.g., RandomForestClassifier) to predict user behavior.
- We train the model on the training set.
- We make predictions on the test set.
- We evaluate the model's performance using accuracy score.
You need to ensure you have pandas and scikit-learn installed (`pip install pandas scikit-learn`).
Additionally, replace `'social_media_data.csv'` with the path to your social media data CSV file and
`'target_column'` with the name of the target column representing the behavior you want to predict.
Adjust the machine learning model and parameters according to your specific dataset and prediction
task.
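Accuracy alone can be misleading when the behavior you want to predict is rare, because a model that almost always predicts "no action" still scores well. Precision and recall for the positive class give a clearer picture. The sketch below computes them by hand on hypothetical labels.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 1 = user performed the action, 0 = did not
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Here the accuracy looks good, yet only half of the actual positive cases are found. scikit-learn provides `precision_score` and `recall_score` in `sklearn.metrics` for the same calculation.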