100% found this document useful (1 vote)

144 views11 pages

Random Forest: Implementaciones de Scikit-Learn Sobre QSAR

The document discusses implementations of random forest and ID3 decision tree algorithms from Scikit-Learn on a QSAR oral toxicity dataset. It loads and prepares the data, trains random forest and ID3 models on train data and tests them on test data. It reports the accuracy, classification report and ROC AUC scores. It also performs cross validation on the two models and hyperparameter tuning of random forest using randomized search.

Uploaded by

Richard Jimenez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

144 views11 pages

Random Forest: Implementaciones de Scikit-Learn Sobre QSAR

Uploaded by

Richard Jimenez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

13/4/2020 Scikit-Learn

Implementaciones de Scikit-learn sobre QSAR

Random Forest
In [2]:

import pandas as pd
import numpy as np
dataset= pd.read_csv("qsar_oral_toxicity.csv", sep=';', prefix='x', header=None)
dataset.head()

Out[2]:

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x1015 x1016 x1017 x1018 x1019 x1020

0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

1 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

2 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0

3 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0

4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

5 rows × 1025 columns

In [3]:

from sklearn import preprocessing, model_selection

enc = preprocessing.OrdinalEncoder()
enc.fit(dataset[['x1024']])
for i, cat in enumerate(enc.categories_[0]): print("{} -> {}".format(cat, i))
dataset['output'] = enc.transform(dataset[['x1024']])
dataset.head()

negative -> 0
positive -> 1

Out[3]:

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x1016 x1017 x1018 x1019 x1020 x1021

0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

1 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

2 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0

3 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0

4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0

5 rows × 1026 columns

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 1/11

13/4/2020 Scikit-Learn

In [4]:

train, test = model_selection.train_test_split(dataset, test_size=0.2, random_state=42)

train.x1024.value_counts()

Out[4]:

negative 6609
positive 584
Name: x1024, dtype: int64

In [5]:

from sklearn.ensemble import RandomForestClassifier

X_train = train.iloc[:, 0:1024].values
Y_train = train.output
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", max_depth=None, min
_samples_split=2)
clf = clf.fit(X_train, Y_train)

In [6]:

X_test = test.iloc[:, 0:1024].values

Y_test = test.output
test_pred=clf.predict(X_test)

In [7]:

from sklearn import metrics

print("\nAcierto:", metrics.accuracy_score(test.output, test_pred))
print(metrics.classification_report(test.output, test_pred))

Acierto: 0.9394107837687604
precision recall f1-score support

0.0 0.95 0.99 0.97 1642

1.0 0.79 0.42 0.55 157

accuracy 0.94 1799

macro avg 0.87 0.70 0.76 1799
weighted avg 0.93 0.94 0.93 1799

In [8]:

from sklearn.metrics import roc_auc_score

roc_value = roc_auc_score(test.output, test_pred)
print(roc_value)

0.7047099622178948

ID3

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 2/11

13/4/2020 Scikit-Learn

In [9]:

from sklearn import tree

clf1 = tree.DecisionTreeClassifier()
clf1 = clf1.fit(X_train, Y_train)
test_pred1=clf1.predict(X_test)

In [10]:

print("\nAcierto:", metrics.accuracy_score(test.output, test_pred1))

print(metrics.classification_report(test.output, test_pred1))

Acierto: 0.9049471928849361
precision recall f1-score support

0.0 0.95 0.94 0.95 1642

1.0 0.46 0.52 0.49 157

accuracy 0.90 1799

macro avg 0.71 0.73 0.72 1799
weighted avg 0.91 0.90 0.91 1799

In [11]:

roc_value1 = roc_auc_score(test.output, test_pred1)

print(roc_value1)

0.731913853697138

Cross Validation
In [12]:

seed = 1
scoring = 'accuracy'

In [13]:

models = []
models.append(('CART', tree.DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=None)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold,
scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)

CART: 0.910466 (0.010126)

RF: 0.939664 (0.010440)

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 3/11

13/4/2020 Scikit-Learn

In [18]:

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,

'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print (random_grid)

{'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], 'max
_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 8
0, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_lea
f': [1, 2, 4], 'bootstrap': [True, False]}

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 4/11

13/4/2020 Scikit-Learn

In [15]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_ite
r = 10, cv = 5, verbose=2, random_state=42)
rf_random.fit(X_train, Y_train)
rf_random.best_params_

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 5/11

13/4/2020 Scikit-Learn

Fitting 5 folds for each of 10 candidates, totalling 50 fits

[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_featu
res=sqrt, max_depth=50, bootstrap=True

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent wo

rkers.

[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat

ures=sqrt, max_depth=50, bootstrap=True, total= 12.9s
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_featu
res=sqrt, max_depth=50, bootstrap=True

[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 12.8s remaining:

0.0s

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 6/11

13/4/2020 Scikit-Learn

[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat

ures=sqrt, max_depth=50, bootstrap=True, total= 13.1s
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_featu
res=sqrt, max_depth=50, bootstrap=True
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat
ures=sqrt, max_depth=50, bootstrap=True, total= 13.6s
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_featu
res=sqrt, max_depth=50, bootstrap=True
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat
ures=sqrt, max_depth=50, bootstrap=True, total= 13.8s
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_featu
res=sqrt, max_depth=50, bootstrap=True
[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat
ures=sqrt, max_depth=50, bootstrap=True, total= 13.9s
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_featu
res=sqrt, max_depth=90, bootstrap=False
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_feat
ures=sqrt, max_depth=90, bootstrap=False, total= 1.0min
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_featu
res=sqrt, max_depth=90, bootstrap=False
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_feat
ures=sqrt, max_depth=90, bootstrap=False, total= 1.1min
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_featu
res=sqrt, max_depth=90, bootstrap=False
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_feat
ures=sqrt, max_depth=90, bootstrap=False, total= 1.1min
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_featu
res=sqrt, max_depth=90, bootstrap=False
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_feat
ures=sqrt, max_depth=90, bootstrap=False, total= 1.1min
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_featu
res=sqrt, max_depth=90, bootstrap=False
[CV] n_estimators=300, min_samples_split=10, min_samples_leaf=4, max_feat
ures=sqrt, max_depth=90, bootstrap=False, total= 1.1min
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featur
es=auto, max_depth=60, bootstrap=False
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=60, bootstrap=False, total=49.4min
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featur
es=auto, max_depth=60, bootstrap=False
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=60, bootstrap=False, total=30.2min
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featur
es=auto, max_depth=60, bootstrap=False
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=60, bootstrap=False, total=18.2min
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featur
es=auto, max_depth=60, bootstrap=False
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=60, bootstrap=False, total= 8.3min
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featur
es=auto, max_depth=60, bootstrap=False
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=60, bootstrap=False, total= 9.5min
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featur
es=sqrt, max_depth=30, bootstrap=True
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featu
res=sqrt, max_depth=30, bootstrap=True, total= 27.8s
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featur
es=sqrt, max_depth=30, bootstrap=True
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featu
localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 7/11
13/4/2020 Scikit-Learn

res=sqrt, max_depth=30, bootstrap=True, total= 28.6s

[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featur
es=sqrt, max_depth=30, bootstrap=True
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featu
res=sqrt, max_depth=30, bootstrap=True, total= 21.6s
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featur
es=sqrt, max_depth=30, bootstrap=True
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featu
res=sqrt, max_depth=30, bootstrap=True, total= 21.4s
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featur
es=sqrt, max_depth=30, bootstrap=True
[CV] n_estimators=700, min_samples_split=5, min_samples_leaf=1, max_featu
res=sqrt, max_depth=30, bootstrap=True, total= 21.0s
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_featu
res=auto, max_depth=80, bootstrap=False
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_feat
ures=auto, max_depth=80, bootstrap=False, total=12.0min
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_featu
res=auto, max_depth=80, bootstrap=False
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_feat
ures=auto, max_depth=80, bootstrap=False, total=15.1min
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_featu
res=auto, max_depth=80, bootstrap=False
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_feat
ures=auto, max_depth=80, bootstrap=False, total=15.4min
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_featu
res=auto, max_depth=80, bootstrap=False
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_feat
ures=auto, max_depth=80, bootstrap=False, total=13.1min
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_featu
res=auto, max_depth=80, bootstrap=False
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_feat
ures=auto, max_depth=80, bootstrap=False, total=16.0min
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_featu
res=sqrt, max_depth=60, bootstrap=False
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_feat
ures=sqrt, max_depth=60, bootstrap=False, total= 9.8s
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_featu
res=sqrt, max_depth=60, bootstrap=False
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_feat
ures=sqrt, max_depth=60, bootstrap=False, total= 10.1s
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_featu
res=sqrt, max_depth=60, bootstrap=False
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_feat
ures=sqrt, max_depth=60, bootstrap=False, total= 9.8s
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_featu
res=sqrt, max_depth=60, bootstrap=False
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_feat
ures=sqrt, max_depth=60, bootstrap=False, total= 10.2s
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_featu
res=sqrt, max_depth=60, bootstrap=False
[CV] n_estimators=200, min_samples_split=10, min_samples_leaf=1, max_feat
ures=sqrt, max_depth=60, bootstrap=False, total= 10.1s
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=False
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=False, total=25.0min
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=False
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=False, total=28.5min
localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 8/11
13/4/2020 Scikit-Learn

[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu

res=auto, max_depth=50, bootstrap=False
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=False, total=34.1min
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=False
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=False, total=30.8min
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=False
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=False, total=31.1min
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featur
es=sqrt, max_depth=10, bootstrap=True
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featu
res=sqrt, max_depth=10, bootstrap=True, total= 2.2s
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featur
es=sqrt, max_depth=10, bootstrap=True
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featu
res=sqrt, max_depth=10, bootstrap=True, total= 2.2s
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featur
es=sqrt, max_depth=10, bootstrap=True
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featu
res=sqrt, max_depth=10, bootstrap=True, total= 2.1s
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featur
es=sqrt, max_depth=10, bootstrap=True
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featu
res=sqrt, max_depth=10, bootstrap=True, total= 2.1s
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featur
es=sqrt, max_depth=10, bootstrap=True
[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=2, max_featu
res=sqrt, max_depth=10, bootstrap=True, total= 2.1s
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featur
es=auto, max_depth=100, bootstrap=True
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featu
res=auto, max_depth=100, bootstrap=True, total=11.2min
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featur
es=auto, max_depth=100, bootstrap=True
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featu
res=auto, max_depth=100, bootstrap=True, total= 9.5min
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featur
es=auto, max_depth=100, bootstrap=True
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featu
res=auto, max_depth=100, bootstrap=True, total= 9.6min
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featur
es=auto, max_depth=100, bootstrap=True
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featu
res=auto, max_depth=100, bootstrap=True, total= 9.8min
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featur
es=auto, max_depth=100, bootstrap=True
[CV] n_estimators=600, min_samples_split=2, min_samples_leaf=4, max_featu
res=auto, max_depth=100, bootstrap=True, total= 9.0min
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=True
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=True, total=15.8min
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=True
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=True, total=16.6min
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_featu
localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 9/11
13/4/2020 Scikit-Learn

res=auto, max_depth=50, bootstrap=True

[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=True, total=16.4min
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=True
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=True, total=16.7min
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_featu
res=auto, max_depth=50, bootstrap=True
[CV] n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_feat
ures=auto, max_depth=50, bootstrap=True, total=15.5min

[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 476.0min finished

Out[15]:

{'n_estimators': 200,
'min_samples_split': 10,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 60,
'bootstrap': False}

In [16]:

rf_random.best_params_

Out[16]:

{'n_estimators': 200,
'min_samples_split': 10,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 60,
'bootstrap': False}

comparamos
In [22]:

clf2 = RandomForestClassifier(n_estimators=200, max_features="sqrt", max_depth=60, min_

samples_split=10, min_samples_leaf=1, bootstrap= False)
clf2 = clf.fit(X_train, Y_train)

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 10/11

13/4/2020 Scikit-Learn

In [23]:

test_pred2=clf2.predict(X_test)
print("\nAcierto:", metrics.accuracy_score(test.output, test_pred2))
print(metrics.classification_report(test.output, test_pred2))

Acierto: 0.9382990550305725
precision recall f1-score support

0.0 0.95 0.99 0.97 1642

1.0 0.78 0.41 0.54 157

accuracy 0.94 1799

macro avg 0.86 0.70 0.75 1799
weighted avg 0.93 0.94 0.93 1799

In [ ]:

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 11/11

Heart Disease Prediction Guide
100% (1)
Heart Disease Prediction Guide
73 pages
A) What Is Motivation Behind Ensemble Methods? Give Your Answer in Probabilistic Terms
100% (1)
A) What Is Motivation Behind Ensemble Methods? Give Your Answer in Probabilistic Terms
6 pages
Intro to Machine Learning Basics
100% (1)
Intro to Machine Learning Basics
52 pages
Charmi Shah 20bcp299 Lab2
100% (1)
Charmi Shah 20bcp299 Lab2
7 pages
Outliers, Hypothesis and Natural Language Processing
100% (1)
Outliers, Hypothesis and Natural Language Processing
7 pages
SAT and GPA Regression Analysis
100% (1)
SAT and GPA Regression Analysis
1 page
Yolo v8
No ratings yet
Yolo v8
8 pages
Credit Card Fraud Detection Using Machine Learning
100% (1)
Credit Card Fraud Detection Using Machine Learning
82 pages
Linear - Regression
100% (1)
Linear - Regression
39 pages
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
100% (1)
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
11 pages
ML Guide: Boston House Price Prediction
100% (1)
ML Guide: Boston House Price Prediction
15 pages
Outlines: Statements of Problems Objectives Bagging Random Forest Boosting Adaboost
100% (1)
Outlines: Statements of Problems Objectives Bagging Random Forest Boosting Adaboost
14 pages
Machine Learning Methods To Weather Forecasting To Predict Apparent Temperature A Review
100% (1)
Machine Learning Methods To Weather Forecasting To Predict Apparent Temperature A Review
6 pages
Variosalgoritmos - Jupyter Notebook
100% (1)
Variosalgoritmos - Jupyter Notebook
9 pages
Csi 5155 ML Project Report
100% (1)
Csi 5155 ML Project Report
24 pages
Assignment10 4
100% (1)
Assignment10 4
3 pages
Lab 3. Linear Regression 230223
100% (1)
Lab 3. Linear Regression 230223
7 pages
Figure Style and Scale: Darkgrid Whitegrid Dark White Ticks Darkgrid
No ratings yet
Figure Style and Scale: Darkgrid Whitegrid Dark White Ticks Darkgrid
15 pages
0.1 Stock Data
100% (1)
0.1 Stock Data
4 pages
Regression Anallysis Hands0n 1
100% (1)
Regression Anallysis Hands0n 1
3 pages
K Means Clustering
100% (1)
K Means Clustering
10 pages
Classification Problems
100% (1)
Classification Problems
25 pages
Assignment No - 6-1
100% (1)
Assignment No - 6-1
3 pages
Patient Data Management System
100% (1)
Patient Data Management System
27 pages
Classification and Regression Trees
100% (1)
Classification and Regression Trees
60 pages
Thinkcspy 3
100% (1)
Thinkcspy 3
415 pages
Linear Regression: What Is Regression Analysis?
100% (1)
Linear Regression: What Is Regression Analysis?
21 pages
Loading The Dataset: 'Churn - Modelling - CSV'
No ratings yet
Loading The Dataset: 'Churn - Modelling - CSV'
6 pages
Lab7.ipynb - Colaboratory
100% (1)
Lab7.ipynb - Colaboratory
5 pages
Lecture Notes - Random Forests PDF
100% (1)
Lecture Notes - Random Forests PDF
4 pages
HW1
100% (1)
HW1
8 pages
Oil Export Indonesia
100% (1)
Oil Export Indonesia
12 pages
Top Python Cheat Sheets for Learners
100% (1)
Top Python Cheat Sheets for Learners
17 pages
Student Booklet For Sep 2015 v6
100% (1)
Student Booklet For Sep 2015 v6
50 pages
PR01
100% (1)
PR01
41 pages
Logistic Regression
100% (1)
Logistic Regression
29 pages
Decision Trees: at Some Point of Time You Have To Take A Decision Sitting On A Tree
100% (1)
Decision Trees: at Some Point of Time You Have To Take A Decision Sitting On A Tree
19 pages
IRIS BPNN - Ipynb - Colaboratory
100% (1)
IRIS BPNN - Ipynb - Colaboratory
4 pages
Book
100% (1)
Book
480 pages
CS550 Regression Aug12
100% (1)
CS550 Regression Aug12
63 pages
Boosting Algorithms in Machine Learning
100% (1)
Boosting Algorithms in Machine Learning
41 pages
Vinee
100% (1)
Vinee
28 pages
ML Lect1
100% (1)
ML Lect1
51 pages
Linear Regression Models Guide
100% (1)
Linear Regression Models Guide
61 pages
NYC Taxi Fare Data Cleaning
100% (1)
NYC Taxi Fare Data Cleaning
8 pages
01-Introduction Machine Learning
100% (1)
01-Introduction Machine Learning
48 pages
ECG Image Classification with ML
100% (1)
ECG Image Classification with ML
16 pages
05 Logistic - Regression
No ratings yet
05 Logistic - Regression
7 pages
02 ML Supervised Learning
No ratings yet
02 ML Supervised Learning
32 pages
Unit-5 Decision Trees and Ensemble Learning
100% (1)
Unit-5 Decision Trees and Ensemble Learning
162 pages
Weka Tutorial
No ratings yet
Weka Tutorial
2 pages
K-NN (Nearest Neighbor)
100% (1)
K-NN (Nearest Neighbor)
17 pages
To Pattern Recognition: CSE555, Fall 2021 Chapter 1, DHS
100% (1)
To Pattern Recognition: CSE555, Fall 2021 Chapter 1, DHS
39 pages
SVM Guide for Data Science Enthusiasts
100% (1)
SVM Guide for Data Science Enthusiasts
28 pages
Program Overview: #Datascience - Data Science in Iot
100% (1)
Program Overview: #Datascience - Data Science in Iot
9 pages
ML Lab6.Ipynb - Colaboratory
100% (1)
ML Lab6.Ipynb - Colaboratory
5 pages
Feature Selection in Python ML
No ratings yet
Feature Selection in Python ML
7 pages
Stats & ML Model Comparisons
100% (1)
Stats & ML Model Comparisons
72 pages
Bayes Classification for Fish Sorting
No ratings yet
Bayes Classification for Fish Sorting
86 pages
Random Forest Classifier on Banking Dataset
No ratings yet
Random Forest Classifier on Banking Dataset
7 pages
Sustainability 16 09753 v2
No ratings yet
Sustainability 16 09753 v2
18 pages
RFID Prox Credentials
No ratings yet
RFID Prox Credentials
1 page
ZKBio Security How To Find Password 2023
50% (2)
ZKBio Security How To Find Password 2023
3 pages
Tcp/Ip Application Note: LPWA Module Series
No ratings yet
Tcp/Ip Application Note: LPWA Module Series
53 pages
Cramer-Rao Bound via Char. Func.
No ratings yet
Cramer-Rao Bound via Char. Func.
15 pages
Visualizing Network Data
No ratings yet
Visualizing Network Data
13 pages
Re P Ub L I C o F I R e L And: EDI CT OF Government
No ratings yet
Re P Ub L I C o F I R e L And: EDI CT OF Government
402 pages
Designing A Wi-Fi Deployment Using Ekahau Site Survey Pro PDF
100% (1)
Designing A Wi-Fi Deployment Using Ekahau Site Survey Pro PDF
10 pages
Very High-Density 802.11ac NETWORKS: Engineering and Configuration Guide
No ratings yet
Very High-Density 802.11ac NETWORKS: Engineering and Configuration Guide
110 pages
Ep Eia Seacom
100% (1)
Ep Eia Seacom
278 pages
Delhi Public School Panvel Syllabus For Half Yearly Assessment Academic Year 2022-23
No ratings yet
Delhi Public School Panvel Syllabus For Half Yearly Assessment Academic Year 2022-23
1 page
Usp 38 Alt Akk
No ratings yet
Usp 38 Alt Akk
3 pages
Evangelion Classified Information
No ratings yet
Evangelion Classified Information
22 pages
The Secret To Solving Inferential Questions - GP Ka Funda
0% (1)
The Secret To Solving Inferential Questions - GP Ka Funda
42 pages
Results Focus
100% (1)
Results Focus
10 pages
Soni Jaivik M 17012071049 Concrete Technology (2ci406) 4: NAME:-Enrolment No: - Subject: - Semester: - Batch: - Guide By
No ratings yet
Soni Jaivik M 17012071049 Concrete Technology (2ci406) 4: NAME:-Enrolment No: - Subject: - Semester: - Batch: - Guide By
13 pages
Senior High School Students' Level of Intelligence and Social Media Use On Academic Performance
No ratings yet
Senior High School Students' Level of Intelligence and Social Media Use On Academic Performance
4 pages
Evolution of OB
No ratings yet
Evolution of OB
16 pages
Resume Saikrishna
No ratings yet
Resume Saikrishna
3 pages
Assignment 1
No ratings yet
Assignment 1
6 pages
WhatisSociology PDF
No ratings yet
WhatisSociology PDF
5 pages
Linear Regression and Corelation (1236)
No ratings yet
Linear Regression and Corelation (1236)
50 pages
Knowing More Than One Language: The Psycholinguistics of Bilingualism
No ratings yet
Knowing More Than One Language: The Psycholinguistics of Bilingualism
47 pages
Minutes of Meeting Different Projects
No ratings yet
Minutes of Meeting Different Projects
14 pages
Group Dynamics and Cohesion
No ratings yet
Group Dynamics and Cohesion
5 pages
IB Economics Teacher Resource 3ed Paper 2 Markscheme
No ratings yet
IB Economics Teacher Resource 3ed Paper 2 Markscheme
6 pages
Military Fiber-Optic Navigation Systems
No ratings yet
Military Fiber-Optic Navigation Systems
11 pages
Krishi Prayoga Pariwara
No ratings yet
Krishi Prayoga Pariwara
13 pages
ECON1005 Notes Unit 6
No ratings yet
ECON1005 Notes Unit 6
42 pages
RSW 2
No ratings yet
RSW 2
10 pages
Semantic Framework for Linguists
100% (1)
Semantic Framework for Linguists
257 pages
Transformational Leadership Behavioe Inventory (TLI) by Podsakoff Et Al. (PG 117)
No ratings yet
Transformational Leadership Behavioe Inventory (TLI) by Podsakoff Et Al. (PG 117)
131 pages
The Concept of Public Space and Its Democratic Manifestations PDF
No ratings yet
The Concept of Public Space and Its Democratic Manifestations PDF
24 pages
Report Part 1
No ratings yet
Report Part 1
7 pages
Asme Viii d1 Ma Appendix 3
No ratings yet
Asme Viii d1 Ma Appendix 3
3 pages
ECE 408 HOMEWORK #3 (CH. 13) Solutions ©dr. James S. Kang: F Rad F
No ratings yet
ECE 408 HOMEWORK #3 (CH. 13) Solutions ©dr. James S. Kang: F Rad F
49 pages
Impossibility of Distributed Consensus With One Faulty Process
No ratings yet
Impossibility of Distributed Consensus With One Faulty Process
7 pages
Understanding Rubrics: A Guide
No ratings yet
Understanding Rubrics: A Guide
10 pages
A. Sharath Chandra Reddy: Objective
No ratings yet
A. Sharath Chandra Reddy: Objective
2 pages

Random Forest: Implementaciones de Scikit-Learn Sobre QSAR

Uploaded by

Random Forest: Implementaciones de Scikit-Learn Sobre QSAR

Uploaded by

13/4/2020 Scikit-Learn

Implementaciones de Scikit-learn sobre QSAR

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x1015 x1016 x1017 x1018 x1019 x1020

5 rows × 1025 columns

from sklearn import preprocessing, model_selection

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x1016 x1017 x1018 x1019 x1020 x1021

5 rows × 1026 columns

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 1/11

train, test = model_selection.train_test_split(dataset, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier

X_test = test.iloc[:, 0:1024].values

from sklearn import metrics

0.0 0.95 0.99 0.97 1642

accuracy 0.94 1799

from sklearn.metrics import roc_auc_score

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 2/11

from sklearn import tree

print("\nAcierto:", metrics.accuracy_score(test.output, test_pred1))

0.0 0.95 0.94 0.95 1642

accuracy 0.90 1799

roc_value1 = roc_auc_score(test.output, test_pred1)

CART: 0.910466 (0.010126)

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 3/11

from sklearn.model_selection import RandomizedSearchCV

random_grid = {'n_estimators': n_estimators,

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 4/11

from sklearn.ensemble import RandomForestRegressor

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 5/11

Fitting 5 folds for each of 10 candidates, totalling 50 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent wo

[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat

[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 12.8s remaining:

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 6/11

[CV] n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_feat

res=sqrt, max_depth=30, bootstrap=True, total= 28.6s

[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=2, max_featu

res=auto, max_depth=50, bootstrap=True

[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 476.0min finished

clf2 = RandomForestClassifier(n_estimators=200, max_features="sqrt", max_depth=60, min_

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 10/11

0.0 0.95 0.99 0.97 1642

accuracy 0.94 1799

localhost:8888/nbconvert/html/Entrega2/ Scikit-Learn.ipynb?download=false 11/11

You might also like