KEMBAR78
Facebook Graph Link Prediction | PDF | Receiver Operating Characteristic | Statistical Data Types
0% found this document useful (0 votes)
269 views14 pages

Facebook Graph Link Prediction

This document discusses using machine learning models to predict links between nodes in a social network graph using Facebook data. It loads and explores the dataset, trains and tunes a random forest classifier on the data, and evaluates its performance on test data. Key steps include feature engineering on the graph data, parameter tuning of the random forest model, and obtaining an F1 score of 0.924 on held-out test data to evaluate the model's predictive performance.

Uploaded by

Praneeth Aluru
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
269 views14 pages

Facebook Graph Link Prediction

This document discusses using machine learning models to predict links between nodes in a social network graph using Facebook data. It loads and explores the dataset, trains and tunes a random forest classifier on the data, and evaluates its performance on test data. Key steps include feature engineering on the graph data, parameter tuning of the random forest model, and obtaining an F1 score of 0.924 on held-out test data to evaluate the model's predictive performance.

Uploaded by

Praneeth Aluru
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

In [0]:

In [8]:

from google.colab import drive


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount


("/content/drive", force_remount=True).

Social network Graph Link Prediction - Facebook


Challenge
In [0]:

#Importing Libraries
# please do go through this python notebook:
import warnings
warnings.filterwarnings("ignore")

import csv
import pandas as pd #pandas to create small dataframes
import datetime #Convert to unix time
import time #Convert to unix time
# if numpy is not installed already : pip3 install numpy
import numpy as np #Do aritmetic operations on arrays
# matplotlib: used to plot graphs
import matplotlib
import matplotlib.pylab as plt
import seaborn as sns#Plots
from matplotlib import rcParams#Size of plots
from sklearn.cluster import MiniBatchKMeans, KMeans#Clustering
import math
import pickle
import os
# to install xgboost: pip3 install xgboost
import xgboost as xgb

import warnings
import networkx as nx
import pdb
import pickle
from pandas import HDFStore,DataFrame
from pandas import read_hdf
from scipy.sparse.linalg import svds, eigs
import gc
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

In [0]:
#reading
from pandas import read_hdf
df_final_train = read_hdf('/content/drive/My Drive/FBR/storage_sample_stage4.h5', 'train_
df',mode='r')
df_final_test = read_hdf('/content/drive/My Drive/FBR/storage_sample_stage4.h5', 'test_df
',mode='r')

In [11]:
df_final_train.columns

Out[11]:
Index(['source_node', 'destination_node', 'indicator_link',
'jaccard_followers', 'jaccard_followees', 'cosine_followers',
'cosine_followees', 'num_followers_s', 'num_followers_d',
'num_followees_s', 'num_followees_d', 'inter_followers',
'inter_followees', 'adar_index', 'follows_back', 'same_comp',
'shortest_path', 'weight_in', 'weight_out', 'weight_f1', 'weight_f2',
'weight_f3', 'weight_f4', 'page_rank_s', 'page_rank_d', 'katz_s',
'katz_d', 'hubs_s', 'hubs_d', 'authorities_s', 'authorities_d',
'svd_u_s_1', 'svd_u_s_2', 'svd_u_s_3', 'svd_u_s_4', 'svd_u_s_5',
'svd_u_s_6', 'svd_u_d_1', 'svd_u_d_2', 'svd_u_d_3', 'svd_u_d_4',
'svd_u_d_5', 'svd_u_d_6', 'svd_v_s_1', 'svd_v_s_2', 'svd_v_s_3',
'svd_v_s_4', 'svd_v_s_5', 'svd_v_s_6', 'svd_v_d_1', 'svd_v_d_2',
'svd_v_d_3', 'svd_v_d_4', 'svd_v_d_5', 'svd_v_d_6'],
dtype='object')

In [0]:
y_train = df_final_train.indicator_link
y_test = df_final_test.indicator_link

In [0]:

df_final_train.drop(['source_node', 'destination_node','indicator_link'],axis=1,inplace=T
rue)
df_final_test.drop(['source_node', 'destination_node','indicator_link'],axis=1,inplace=Tr
ue )

In [14]:
df_final_train.head()
Out[14]:

jaccard_followers jaccard_followees cosine_followers cosine_followees num_followers_s num_followers_d num_followees_s nu

0 0 0.0 0.000000 0.000000 11 6 15

1 0 0.0 0.000000 0.000000 28 9 31

2 0 0.0 0.064282 0.000000 2 11 1

3 0 0.0 0.000000 0.000000 6 2 10

4 0 0.2 0.250000 0.353553 4 2 4

In [15]:
estimators = [5, 10, 50, 100, 200, 500, 1000]
train_scores = []
test_scores = []
for i in estimators:
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=52, min_samples_split=120,
min_weight_fraction_leaf=0.0, n_estimators=i, n_jobs=-1,random_state=25,verb
ose=0,warm_start=False)
clf.fit(df_final_train,y_train)
train_sc = f1_score(y_train,clf.predict(df_final_train))
test_sc = f1_score(y_test,clf.predict(df_final_test))
test_scores.append(test_sc)
train_scores.append(train_sc)
print('Estimators = ',i,'Train Score',train_sc,'test Score',test_sc)
plt.plot(estimators,train_scores,label='Train Score')
plt.plot(estimators,test_scores,label='Test Score')
plt.xlabel('Estimators')
plt.ylabel('Score')
plt.title('Estimators vs score at depth of 5')

Estimators = 5 Train Score 0.9291707927242403 test Score 0.9119792103277726


Estimators = 10 Train Score 0.9299304333249517 test Score 0.882815376285869
Estimators = 50 Train Score 0.9276742813790778 test Score 0.9031955087839135
Estimators = 100 Train Score 0.9278897806989825 test Score 0.9185525124728965
Estimators = 200 Train Score 0.9246630827012297 test Score 0.9153597107177395
Estimators = 500 Train Score 0.9261065068817725 test Score 0.9165105756077029
Estimators = 1000 Train Score 0.9257950530035336 test Score 0.9166526563550774

Out[15]:
Text(0.5, 1.0, 'Estimators vs score at depth of 5')

In [16]:
depths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
train_scores = []
test_scores = []
for i in depths:
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=i, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=52, min_samples_split=120,
min_weight_fraction_leaf=0.0, n_estimators=115, n_jobs=-1,random_state=25,ve
rbose=0,warm_start=False)
clf.fit(df_final_train,y_train)
train_sc = f1_score(y_train,clf.predict(df_final_train))
test_sc = f1_score(y_test,clf.predict(df_final_test))
test_scores.append(test_sc)
train_scores.append(train_sc)
print('depth = ',i,'Train Score',train_sc,'test Score',test_sc)
plt.plot(depths,train_scores,label='Train Score')
plt.plot(depths,test_scores,label='Test Score')
plt.xlabel('Depth')
plt.ylabel('Score')
plt.title('Depth vs score at depth of 5 at estimators = 115')
plt.show()

depth = 2 Train Score 0.8531705370336414 test Score 0.8324998912901682


depth = 3 Train Score 0.8769022440030951 test Score 0.8561332003642198
depth = 4 Train Score 0.9243469767148754 test Score 0.9154128632882174
depth = 5 Train Score 0.9274724348915825 test Score 0.9178823281759816
depth = 6 Train Score 0.928798185941043 test Score 0.9191936232128548
depth = 7 Train Score 0.9349451148551599 test Score 0.9196031762748278
depth = 8 Train Score 0.9520937054558225 test Score 0.920708561788275
depth = 9 Train Score 0.957140210387632 test Score 0.9202702702702702
depth = 10 Train Score 0.9600188628954546 test Score 0.9211429534016972
In [0]:

In [18]:
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
from scipy.stats import uniform

param_dist = {"n_estimators":sp_randint(105,125),
"max_depth": sp_randint(10,15),
"min_samples_split": sp_randint(110,190),
"min_samples_leaf": sp_randint(25,65)}

clf = RandomForestClassifier(random_state=25,n_jobs=-1)

rf_random = RandomizedSearchCV(clf, param_distributions=param_dist,


n_iter=5,cv=10,scoring='f1',random_state=25,return_t
rain_score=True)

rf_random.fit(df_final_train,y_train)
print('mean test scores',rf_random.cv_results_['mean_test_score'])
print('mean train scores',rf_random.cv_results_['mean_train_score'])

mean test scores [0.96361736 0.96290864 0.9608348 0.9631529 0.96471432]


mean train scores [0.96458678 0.96355943 0.96149067 0.96404851 0.96614618]

In [19]:
print(rf_random.best_estimator_)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,


criterion='gini', max_depth=14, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=28, min_samples_split=111,
min_weight_fraction_leaf=0.0, n_estimators=121,
n_jobs=-1, oob_score=False, random_state=25, verbose=0,
warm_start=False)

In [0]:
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=14, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=28, min_samples_split=111,
min_weight_fraction_leaf=0.0, n_estimators=121, n_jobs=-1,
oob_score=False, random_state=25, verbose=0, warm_start=False)

In [0]:
clf.fit(df_final_train,y_train)
y_train_pred = clf.predict(df_final_train)
y_test_pred = clf.predict(df_final_test)

In [22]:
from sklearn.metrics import f1_score
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

Train f1 score 0.9660346501763172


Test f1 score 0.9240969941904522

In [0]:
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(test_y, predict_y):
C = confusion_matrix(test_y, predict_y)

A =(((C.T)/(C.sum(axis=1))).T)

B =(C/C.sum(axis=0))
plt.figure(figsize=(20,4))

labels = [0,1]
# representing A in heatmap format
cmap=sns.light_palette("blue")
plt.subplot(1, 3, 1)
sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Confusion matrix")

plt.subplot(1, 3, 2)
sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Precision matrix")

plt.subplot(1, 3, 3)
# representing B in heatmap format
sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Recall matrix")

plt.show()

In [24]:
print('Train confusion_matrix')
plot_confusion_matrix(y_train,y_train_pred)
print('Test confusion_matrix')
plot_confusion_matrix(y_test,y_test_pred)

Train confusion_matrix
Test confusion_matrix

In [25]:
from sklearn.metrics import roc_curve, auc
fpr,tpr,ths = roc_curve(y_test,y_test_pred)
auc_sc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='navy',label='ROC curve (area = %0.2f)' % auc_sc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic with test data')
plt.legend()
plt.show()

In [26]:
features = df_final_train.columns
importances = clf.feature_importances_
indices = (np.argsort(importances))[-25:]
plt.figure(figsize=(10,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='r', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Assignments:
1. Add another feature called Preferential Attachment with followers and followees data of vertex. you can
check about Preferential Attachment in below link http://be.amazd.com/link-prediction/
2. Add feature called svd_dot. you can calculate svd_dot as Dot product between sourse node svd and
destination node svd features. you can read about this in below pdf https://storage.googleapis.com/kaggle-
forum-message-attachments/2594/supervised_link_prediction.pdf
3. Tune hyperparameters for XG boost with all these features and check the error metric.

In [0]:

In [0]:

FEATURE ENGINEERING

Preferential Attachment
In [0]:

In [0]:
prefer_attach1=[]
trf=np.array(df_final_train['num_followers_s'])
tef=np.array(df_final_train['num_followers_d'])
for i in range(len(trf)):
prefer_attach1.append(trf[i]*tef[i])
df_final_train['prefer_att']=prefer_attach1

In [0]:
prefer_attach2=[]
trf=np.array(df_final_test['num_followers_s'])
tef=np.array(df_final_test['num_followers_d'])
for i in range(len(trf)):
prefer_attach2.append(trf[i]*tef[i])
df_final_test['prefer_att']=prefer_attach2

In [29]:
num_fs=np.array(df_final_train['num_followees_s'])
num_fd=np.array(df_final_train['num_followees_d'])
preferential_followees=[]
for i in range(len(num_fs)):
preferential_followees.append(num_fd[i]*num_fs[i])
df_final_train['prefer_Attach_followees']= preferential_followees
df_final_train.head()
Out[29]:

jaccard_followers jaccard_followees cosine_followers cosine_followees num_followers_s num_followers_d num_followees_s nu

0 0 0.0 0.000000 0.000000 11 6 15

1 0 0.0 0.000000 0.000000 28 9 31

2 0 0.0 0.064282 0.000000 2 11 1

3 0 0.0 0.000000 0.000000 6 2 10

4 0 0.2 0.250000 0.353553 4 2 4

In [30]:
num_fs=np.array(df_final_test['num_followees_s'])
num_fd=np.array(df_final_test['num_followees_d'])
preferential_followees=[]
for i in range(len(num_fs)):
preferential_followees.append(num_fd[i]*num_fs[i])
df_final_test['prefer_Attach_followees']= preferential_followees
df_final_test.head()
Out[30]:

jaccard_followers jaccard_followees cosine_followers cosine_followees num_followers_s num_followers_d num_followees_s nu

0 0 0.000000 0.029161 0.000000 6 14 6

1 0 0.000000 0.000000 0.000000 5 1 5

2 0 0.133333 0.235702 0.308607 18 1 14

3 0 0.000000 0.066667 0.000000 9 5 1

4 0 0.000000 0.005263 0.000000 4 95 123


In [0]:

In [0]:

In [32]:
s1 = df_final_train[['svd_u_s_1','svd_u_s_2','svd_u_s_3','svd_u_s_4','svd_u_s_5']].value
s
d1 = df_final_train[['svd_u_d_1','svd_u_d_2','svd_u_d_3','svd_u_d_4','svd_u_d_5']].value
s
svd_u_dot_train = []
for i in range(df_final_train.shape[0]):
res = np.dot(s1[i],d1[i])
svd_u_dot_train.append(res)

s1 = df_final_test[['svd_u_s_1','svd_u_s_2','svd_u_s_3','svd_u_s_4','svd_u_s_5']].values
d1 = df_final_test[['svd_u_d_1','svd_u_d_2','svd_u_d_3','svd_u_d_4','svd_u_d_5']].values
svd_u_dot_test = []
for i in range(df_final_test.shape[0]):
res = np.dot(s1[i],d1[i])
svd_u_dot_test.append(res)

print("svd_dot_train ",len(svd_u_dot_train))
print("svd_dot_test ",len(svd_u_dot_test))

svd_dot_train 100002
svd_dot_test 50002

In [0]:
print(svd_u_dot_train)

In [36]:
s2 = df_final_train[['svd_v_s_1','svd_v_s_2','svd_v_s_3','svd_v_s_4','svd_v_s_5']].value
s
d2 = df_final_train[['svd_v_d_1','svd_v_d_2','svd_v_d_3','svd_v_d_4','svd_v_d_5']].value
s
svd_v_dot_train = []
for i in range(df_final_train.shape[0]):
res = np.dot(s2[i],d2[i])
svd_v_dot_train.append(res)

ss = df_final_test[['svd_v_s_1','svd_v_s_2','svd_v_s_3','svd_v_s_4','svd_v_s_5']].values
dd = df_final_test[['svd_v_s_1','svd_v_s_2','svd_v_s_3','svd_v_s_4','svd_v_s_5']].values
svd_v_dot_test = []
for i in range(df_final_test.shape[0]):
res = np.dot(s2[i],d2[i])
svd_v_dot_test.append(res)

print("svd_dot_train ",len(svd_v_dot_train))
print("svd_dot_test ",len(svd_v_dot_test))

svd_dot_train 100002
svd_dot_test 50002

In [0]:
train = pd.DataFrame({'svd_u_dot_train':svd_u_dot_train,'svd_v_dot_train':svd_v_dot_trai
n})
test = pd.DataFrame({'svd_u_dot_test':svd_u_dot_test,'svd_v_dot_test':svd_v_dot_test})

In [38]:
from scipy.sparse import hstack
X_train = hstack((df_final_train,train))
X_test = hstack((df_final_test,test))

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(100002, 56) (100002,)


(50002, 56) (50002,)

In [0]:

In [0]:

In [0]:

USING LGBM INSTEAD OF XGBOOST


In [0]:
from lightgbm import LGBMClassifier

In [53]:
base_learners = [5, 10, 50, 100, 200, 500, 1000]
depths= [2, 3, 4, 5, 6, 7, 8, 9, 10]
param_grid={'n_estimators': base_learners, 'max_depth':depths}
clf = LGBMClassifier(booster='gbtree',n_jobs=2)
model=RandomizedSearchCV(clf,param_grid,scoring='roc_auc',n_jobs=-1,cv=3)
model.fit(X_train,y_train)
print("optimal n_estimators",model.best_estimator_.n_estimators)
print("optimal max_depth",model.best_estimator_.max_depth)

optimal_depth=model.best_estimator_.max_depth
optimal_nest=model.best_estimator_.n_estimators

optimal n_estimators 500


optimal max_depth 8

In [0]:

In [54]:
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
from scipy.stats import uniform

param_dist = {"n_estimators":sp_randint(105,125),
"max_depth": sp_randint(10,15),
"min_samples_split": sp_randint(110,190),
"min_samples_leaf": sp_randint(25,65)}

clf = LGBMClassifier(random_state=25,n_jobs=-1)

rf_random = RandomizedSearchCV(clf, param_distributions=param_dist,


n_iter=5,cv=10,scoring='f1',random_state=25,return_t
rain_score=True)

rf_random.fit(X_train,y_train)
print('mean test scores',rf_random.cv_results_['mean_test_score'])
print('mean train scores',rf_random.cv_results_['mean_train_score'])

mean test scores [0.98172216 0.98130676 0.98087508 0.98147359 0.98174375]


mean train scores [0.98829868 0.98724842 0.98687499 0.98743107 0.98867191]

In [55]:

print(rf_random.best_estimator_)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,


importance_type='split', learning_rate=0.1, max_depth=14,
min_child_samples=20, min_child_weight=0.001,
min_samples_leaf=28, min_samples_split=111, min_split_gain=0.0,
n_estimators=121, n_jobs=-1, num_leaves=31, objective=None,
random_state=25, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [0]:

clf =LGBMClassifier(bootstrap=True, class_weight=None, criterion='gini',


max_depth=optimal_depth, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=28, min_samples_split=111,
min_weight_fraction_leaf=0.0, n_estimators=optimal_nest, n_jobs=-1,
oob_score=False, random_state=25, verbose=0, warm_start=False)

In [0]:

clf.fit(X_train,y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [58]:
from sklearn.metrics import f1_score
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

Train f1 score 0.9999195332931


Test f1 score 0.9227895920879494

In [0]:

from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(test_y, predict_y):
C = confusion_matrix(test_y, predict_y)

A =(((C.T)/(C.sum(axis=1))).T)

B =(C/C.sum(axis=0))
plt.figure(figsize=(20,4))

labels = [0,1]
# representing A in heatmap format
cmap=sns.light_palette("blue")
plt.subplot(1, 3, 1)
sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Confusion matrix")

plt.subplot(1, 3, 2)
sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Precision matrix")

plt.subplot(1, 3, 3)
# representing B in heatmap format
sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=lab
els)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Recall matrix")

plt.show()

In [60]:

print('Train confusion_matrix')
plot_confusion_matrix(y_train,y_train_pred)
print('Test confusion_matrix')
plot_confusion_matrix(y_test,y_test_pred)

Train confusion_matrix

Test confusion_matrix

In [61]:
from sklearn.metrics import roc_curve, auc
fpr,tpr,ths = roc_curve(y_test,y_test_pred)
auc_sc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='navy',label='ROC curve (area = %0.2f)' % auc_sc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic with test data')
plt.legend()
plt.show()
In [0]:

features=df_final_train.columns
feat=features.append(train.columns)

In [111]:
print(feat[55])

svd_v_dot_train

In [106]:
print(feat[54])

svd_u_dot_train

In [109]:
importances = clf.feature_importances_
indices = (np.argsort(importances))[-25:]
print(indices)
plt.figure(figsize=(10,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='r', align='center')
plt.yticks(range(len(indices)), [feat[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

[54 34 41 38 36 44 55 6 20 27 28 22 23 24 46 50 32 47 49 31 29 30 48 21
13]
In [0]:

CONCLUSION
In [112]:

from prettytable import PrettyTable


x=PrettyTable()

x.field_names=(['Model','Train f1 score', 'Test f1 score','ROC','Most Important Feature'


])

x.add_row(['Random Forest',0.966,0.924,0.93,'follows back' ])

print(x)

+---------------+----------------+---------------+------+------------------------+
| Model | Train f1 score | Test f1 score | ROC | Most Important Feature |
+---------------+----------------+---------------+------+------------------------+
| Random Forest | 0.966 | 0.924 | 0.93 | follows back |
+---------------+----------------+---------------+------+------------------------+

After Feature Engineering


In [114]:
y=PrettyTable()

y.field_names=(['Model','Train f1 score', 'Test f1 score','ROC','Important Feature'])

y.add_row(['XGB (used LGBM)',0.99,0.922,0.93,'shortest path' ])

print(y)

+-----------------+----------------+---------------+------+-------------------+
| Model | Train f1 score | Test f1 score | ROC | Important Feature |
+-----------------+----------------+---------------+------+-------------------+
| XGB (used LGBM) | 0.99 | 0.922 | 0.93 | shortest path |
+-----------------+----------------+---------------+------+-------------------+

1)After EDA of the data, where we have generated edges for missing edges and balanced them in prior and later
we have split the data to train and test.
2) We have added many features like no of followers, no of followees, jacard followers, cosine followers, hits
ratio, etc., after EDA and applied to Random Forest Model and found the metrics as above. We have found
follows back is the highest.
3) After Feature Engineering by adding SVD Dot and Preferential Attachment, we have found the metrics like f1
scores and AUC. We can see shortest path has the highest feature importance. Page_rank_d is also a very
important feature having very near score to shortest path score for friend recommendation.
4) Finally, XGB with feature engineering gave good classification of features for freind recommendation
compared to Random Forest.

In [0]:

You might also like