H2O AutoML - Automatic Machine Learning

H2O AutoML is an automated machine learning tool designed to simplify the model building process for both beginners and experts by training, tuning, and ranking multiple models with minimal coding. It supports various algorithms and provides a leaderboard for model performance evaluation, making machine learning accessible to non-experts. The platform is built on H2O.ai, an open-source machine learning framework that is fast, scalable, and supports multiple programming languages.
🚀 H2O AutoML: Automatic Machine Learning

🔍 What is H2O AutoML?


H2O AutoML is a tool that automates model training, hyperparameter tuning, and model ranking, making it easy for
beginners and experts alike to build high-performing models with minimal coding.

●​ Goal: Democratize machine learning — make it accessible to non-experts.​

●​ Problem Solved: Tuning complex models (like deep learning) is difficult for non-experts.​

●​ Solution: H2O AutoML automatically trains, tunes, and ranks multiple models.​

🛠️ How H2O AutoML Works


✅ You Provide:
●​ A dataset​

●​ A response column (target variable)​

●​ (Optional) Time or model count limit​

✅ H2O AutoML Will:


●​ Train multiple models​

●​ Automatically tune them​

●​ Use ensemble techniques (e.g., model stacking)​

●​ Present a leaderboard ranking models by performance​

🔹 Step 1: What is H2O.ai?


●​ H2O.ai is an open-source machine learning platform.​

●​ Supports multiple languages: Python, R, Java.​

●​ Built for big data, making it ideal for enterprise applications.​

●​ Known for being fast, scalable, and easy-to-use, especially with tools like H2O AutoML.​
🔹 Step 2: What is H2O AutoML?
H2O AutoML automates the process of training and tuning machine learning models.

💡 What It Does:
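● Handles basic preprocessing (missing values, categorical encoding, scaling where needed); feature engineering is limited to optional, experimental target encoding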

●​ Tests multiple algorithms:​

○​ GLM (Generalized Linear Model)​

○ GBM (Gradient Boosting Machine)

○ DRF (Distributed Random Forest)

○​ XGBoost​

○​ Deep Learning​

○​ Stacked Ensembles​

●​ Returns the best models via a leaderboard​

🔹 Step 3: H2O Setup (Python)


Install H2O (a Java runtime must also be available on the machine):

pip install h2o

Initialize H2O in Python:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

🔹 Step 4: Load Dataset


Import data from URL or local system:

data = h2o.import_file("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
🔹 Step 5: Split the Dataset
train, test = data.split_frame(ratios=[0.8], seed=1)  # 80% train, 20% test; seed makes the split repeatable

🔹 Step 6: Run AutoML


Define the target and feature columns:

x = data.columns[:-1] # features
y = "species" # target column

aml = H2OAutoML(max_models=10, seed=1)


aml.train(x=x, y=y, training_frame=train)

🔹 Step 7: View Leaderboard


lb = aml.leaderboard
lb.head()

🔹 Step 8: Predict on Test Data


preds = aml.leader.predict(test)
print(preds)

🔹 Step 9: Shut Down H2O Cluster


h2o.cluster().shutdown(prompt=False)  # h2o.shutdown() is deprecated
🔹 Training the Model with H2O AutoML (Detailed)
✅ Python Code:
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=10, seed=1, max_runtime_secs=300)


aml.train(x=x, y=y, training_frame=train)

✅ Parameter Breakdown:
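| Parameter | Description |
| --- | --- |
| x | List of feature columns |
| y | Target column name |
| training_frame | H2OFrame containing training data |
| max_models=10 | Train up to 10 models (Stacked Ensembles are not counted) |
| max_runtime_secs=300 | Limit total run time to 5 minutes |
| seed=1 | Fixes the random seed; runs are only repeatable when limited by max_models (not by time) and when Deep Learning is excluded |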

⚙️ Internal Workflow:
1.​ Data Preprocessing:​

○​ Handles missing values​

○​ Encodes categoricals​

○​ Scales data (if needed)​

2.​ Model Training:​

○​ GLM, DRF, GBM, XGBoost, Deep Learning, Stacked Ensembles​

3.​ Hyperparameter Tuning:​

○​ Uses random grid search + cross-validation​

4.​ Leaderboard Creation:​

○​ Models ranked by default metric (AUC, RMSE, etc.)​
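The tuning step above (random grid search plus cross-validation) can be illustrated with a small, H2O-independent sketch. It only shows the idea: sample a few hyperparameter combinations at random from a grid, and assign each row to a fold. The grid values, the function names, and the modulo fold assignment are assumptions made for illustration, not what H2O does internally.

```python
import itertools
import random


def sample_hyperparameters(grid, n_candidates, seed=1):
    """Randomly pick n_candidates distinct combinations from a hyperparameter grid."""
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(grid[name] for name in names))]
    rng = random.Random(seed)  # seeded so the same candidates come back every run
    return rng.sample(combos, min(n_candidates, len(combos)))


def assign_folds(n_rows, nfolds=5):
    """Give every row a fold id in 0..nfolds-1 (simple modulo assignment)."""
    return [row % nfolds for row in range(n_rows)]


# Hypothetical GBM-style grid; a real search space is much larger.
grid = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.1]}
candidates = sample_hyperparameters(grid, n_candidates=3)
folds = assign_folds(n_rows=10, nfolds=5)

for params in candidates:
    print(params)
print(folds)
```

Each sampled candidate would then be trained once per fold and scored on the held-out fold; the averaged score is what ends up on the leaderboard.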


🧩 Parameters Overview
🔹 Required Parameters:
| Parameter | Description | Required |
| --- | --- | --- |
| y | Target column name | ✅ |
| training_frame | Training dataset (H2OFrame) | ✅ |

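🔹 Stopping Parameters (specify at least one):

| Parameter | Description | Type |
| --- | --- | --- |
| max_runtime_secs | Max time allowed for AutoML (seconds) | Integer |
| max_models | Max number of models to train | Integer |

If neither is set, recent H2O versions fall back to a one-hour runtime limit.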

🔹 Optional Parameters:
🔸 Data Parameters
| Parameter | Description |
| --- | --- |
| x | List of predictors |
| validation_frame | For early stopping (if nfolds=0) |
| leaderboard_frame | For scoring and ranking models |
| blending_frame | Enables Holdout Stacking |
| fold_column | Custom cross-validation fold assignments |
| weights_column | Row-level observation weights |

🔸 Miscellaneous Parameters
| Parameter | What It Does | Why It Matters |
| --- | --- | --- |
| nfolds | Cross-validation (default = 5) | Better model evaluation |
| seed | Fixes the random seed | Same output every run (when limited by max_models and Deep Learning is excluded) |
| sort_metric | Metric used for ranking leaderboard models | Optimize for what matters (AUC, RMSE, etc.) |
| include_algos / exclude_algos | Specify or skip certain algorithms | Customize or speed up AutoML |
| project_name | Name for AutoML run | Organize experiments |
| max_runtime_secs_per_model | Limit time per model | Speeds up training |
| keep_cross_validation_predictions | Keeps cross-validation predictions (off by default) | Needed to build further Stacked Ensembles when re-running the same AutoML object |
| export_checkpoints_dir | Save all trained models to a directory | Use models later |
| verbosity | Logging level ("info", "debug", etc.) | Debugging or clean output |
| stopping_rounds, stopping_tolerance, stopping_metric | Early stopping settings | Prevent overfitting, save time |
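As a rough illustration of how several of these optional parameters combine, the sketch below collects them in a dictionary and passes them to H2OAutoML. The specific values and the project name are assumptions chosen for the example. H2O is imported inside the function so the option set can be inspected without a running cluster.

```python
def automl_options():
    """Illustrative keyword arguments for H2OAutoML (values are assumptions)."""
    return dict(
        max_models=10,                    # stop after 10 base models
        seed=1,                           # repeatable because we limit by max_models
        nfolds=5,                         # 5-fold cross-validation
        sort_metric="logloss",            # rank the leaderboard by logloss
        exclude_algos=["DeepLearning"],   # Deep Learning is not reproducible by default
        project_name="iris_automl_demo",  # hypothetical project name
        max_runtime_secs_per_model=60,    # cap each individual model at one minute
        verbosity="info",
    )


def run_automl(train, x, y):
    """Train AutoML on an H2OFrame with the options above (needs h2o.init() first)."""
    from h2o.automl import H2OAutoML  # imported lazily; requires the h2o package

    aml = H2OAutoML(**automl_options())
    aml.train(x=x, y=y, training_frame=train)
    return aml


print(automl_options())
```

Calling run_automl(train, x, "species") after h2o.init() would start a run with these settings.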

✅ Validation Options (nfolds = 0)


🔹 What Happens When You Set nfolds = 0?
● No cross-validation scores are available, so AutoML needs two holdout frames:

○ validation_frame for early stopping

○ leaderboard_frame for ranking

🔹 If Not Provided:
● H2O carves each missing frame out of the training data:

○​ 10% → validation​

○​ 10% → leaderboard​

Example: With 100,000 rows, H2O will use:

●​ 80,000 for training​

●​ 10,000 for validation​

●​ 10,000 for leaderboard​
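The arithmetic above can be checked with a small helper, followed by a sketch of passing explicit frames so that no training data is carved off. The helper name and the exact integer rounding are assumptions for illustration; H2O's own split is random, so real row counts are approximate.

```python
def auto_partition_sizes(n_rows, has_validation=False, has_leaderboard=False, percent=10):
    """Approximate row counts (train, validation, leaderboard) when nfolds=0.

    Each frame the user did not supply takes `percent` of the training rows.
    """
    n_valid = 0 if has_validation else n_rows * percent // 100
    n_leaderboard = 0 if has_leaderboard else n_rows * percent // 100
    return n_rows - n_valid - n_leaderboard, n_valid, n_leaderboard


def train_without_cv(train, valid, leaderboard, x, y):
    """Disable cross-validation and supply both holdout frames explicitly."""
    from h2o.automl import H2OAutoML  # imported lazily; requires the h2o package

    aml = H2OAutoML(max_models=10, seed=1, nfolds=0)
    aml.train(x=x, y=y, training_frame=train,
              validation_frame=valid, leaderboard_frame=leaderboard)
    return aml


print(auto_partition_sizes(100_000))
print(auto_partition_sizes(100_000, has_validation=True))
```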


✅ XGBoost Memory Requirements
🔹 Why It Matters
●​ XGBoost runs outside Java (in C++)​

●​ Needs its own memory — separate from H2O​

🔸 Recommendation
Use only 2/3 of total RAM for H2O:

| Total RAM | Allocate to H2O |
| --- | --- |
| 60 GB | 40 GB |
| 30 GB | 20 GB |
| 12 GB | 8 GB |

h2o.init(max_mem_size = "40G")
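The two-thirds rule from the table can be turned into a tiny helper that produces the string expected by max_mem_size. The helper name is an assumption, and the rule itself is the guideline stated above rather than a hard limit.

```python
def h2o_heap_size(total_ram_gb):
    """Return roughly 2/3 of total RAM as an H2O max_mem_size string, e.g. '40G'."""
    heap_gb = total_ram_gb * 2 // 3  # leave about 1/3 of RAM for native XGBoost
    return f"{heap_gb}G"


for total in (60, 30, 12):
    print(total, "GB total ->", h2o_heap_size(total))
```

It would be used as h2o.init(max_mem_size=h2o_heap_size(60)).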

✅ scikit-learn Compatibility (h2o.sklearn)


🔹 What Is It?
●​ scikit-learn wrappers for H2O AutoML:​

○​ H2OAutoMLClassifier​

○​ H2OAutoMLRegressor​

🔸 Benefits
●​ Use H2O models in sklearn pipelines​

●​ Accepts pandas, numpy, or H2OFrames

✅ Example:
from h2o.sklearn import H2OAutoMLClassifier

model = H2OAutoMLClassifier(max_runtime_secs=60)
model.fit(X_train, y_train)
preds = model.predict(X_test)
●​ Also supports score(), get_params(), set_params()​

from h2o.sklearn import H2OAutoMLClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
("automl", H2OAutoMLClassifier(max_models=10))
])
pipeline.fit(X_train, y_train)

✅ Summary
| Feature | Purpose |
| --- | --- |
| h2o.init() | Starts H2O engine |
| H2OAutoML() | Trains multiple models automatically |
| aml.train() | Starts model training |
| aml.leaderboard | Shows model rankings and metrics |
| aml.leader | Best-performing model |
| aml.predict() | Predicts using leader model |
| h2o.explain() | Explains predictions and model behaviour |

🎯 Final Summary Table


| Feature | What It Does | Why It Matters |
| --- | --- | --- |
| Validation Options | Enables leaderboard when nfolds=0 | Leaderboard and early stopping still work |
| XGBoost Memory | Keeps RAM free for XGBoost | Avoids crashes, improves stability |
| sklearn Compatibility | Use H2O in scikit-learn pipelines | Integrate easily into Python workflows |
🔍 1. Explainability in H2O AutoML
✅ What is it?
Explainability refers to the ability to understand why and how a machine learning model makes predictions.

🎯 Why is it important?
●​ Helps in building trust in the model.​

●​ Useful for debugging and model validation.​

●​ Crucial in regulated industries (like finance or healthcare).​

💡 In H2O:
H2O provides a function called h2o.explain() that automatically:

●​ Shows feature importance​

●​ Creates Partial Dependence Plots (PDPs)​

●​ Generates SHAP value plots​

●​ Displays model performance charts​

📌 Use Case:
import h2o
h2o.explain(aml.leader, test) # Explains the best model on test data

⚙️ 2. Training a Model Using H2O AutoML


📥 Step 1: Import Libraries and Start H2O
import h2o
from h2o.automl import H2OAutoML
h2o.init() # Starts H2O server locally

This sets up the H2O cluster to run AutoML tasks.

📁 Step 2: Import Dataset


train = h2o.import_file("https://...higgs_train_10k.csv")
test = h2o.import_file("https://...higgs_test_5k.csv")

Downloads a binary classification dataset for training and testing.

🧠 Step 3: Define Features and Target Variable


x = train.columns
y = "response"
x.remove(y)

●​ x: All input features (independent variables)​

●​ y: Target variable (dependent/output variable)​

If x is not passed, H2O uses every column except y as a predictor.

🔄 Step 4: Convert Target to Factor (for Classification)


train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

This tells H2O that you're doing classification, not regression.

🚀 Step 5: Run AutoML


aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

●​ max_models=20: Builds up to 20 different models​

●​ seed=1: Makes results reproducible​

●​ AutoML tries various algorithms like GBM, XGBoost, DRF, Deep Learning, etc., and also builds Stacked
Ensembles.​
🏆 3. Leaderboard: Compare Model Performance
After training, use this:

lb = aml.leaderboard
lb.head(rows=lb.nrows) # View all models

| model_id | auc | logloss | rmse | mse |
| --- | --- | --- | --- | --- |
| StackedEnsemble_AllModels_... | 0.789801 | 0.5511 | 0.4321 | 0.1867 |
| XGBoost_1_... | 0.784651 | 0.5575 | 0.4349 | 0.1891 |

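● AUC: Area under the ROC curve; measures how well the model ranks positives above negatives (higher is better).

● Logloss: Penalizes confident incorrect predictions (lower is better).

● RMSE/MSE: Root mean squared error and mean squared error of the predictions (lower is better); RMSE is the square root of MSE.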
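To make these metrics concrete, here is a minimal pure-Python sketch that computes them for a binary classifier's predicted probabilities. The labels and probabilities are made-up toy values, and the AUC uses the simple pairwise definition rather than whatever optimized routine H2O uses.

```python
import math


def mse(actual, predicted):
    """Mean squared error between 0/1 labels and predicted probabilities."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(actual, predicted))


def logloss(actual, predicted, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)


def auc(actual, predicted):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]          # toy ground truth
probs = [0.9, 0.1, 0.4, 0.6]   # toy predicted probabilities of class 1

print("auc    ", auc(labels, probs))
print("logloss", logloss(labels, probs))
print("rmse   ", rmse(labels, probs))
print("mse    ", mse(labels, probs))
```

The same relationship holds in the leaderboard above: squaring the rmse column (0.4321) gives approximately the mse column (0.1867).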

The top model is stored in:

aml.leader

📈 4. Prediction Using the Leader Model


Once trained, you can make predictions:

# Option 1: Use the AutoML object (uses the leader model)


preds = aml.predict(test)

# Option 2: Use the leader model directly


preds = aml.leader.predict(test)

●​ Predictions are returned in the same order as your test dataset.​

● For classification, the output includes the predicted class and the per-class probabilities.


🧠 5. How AutoML Works Under the Hood
●​ It automatically trains multiple models using different algorithms.​

●​ Uses cross-validation or validation/leaderboard frames to evaluate models.​

●​ Selects the best model (leader) based on default metric (e.g., AUC for classification).​

●​ Builds stacked ensembles to combine top-performing models.​

🛡️ 6. Memory Best Practices for XGBoost in H2O


●​ H2O uses Java (JVM) and runs inside a Java process.​

●​ XGBoost runs outside the JVM (in native C++).​

●​ You must reserve memory for XGBoost:​

h2o.init(max_mem_size = "40G") # Reserve remaining RAM for XGBoost

●​ If total system RAM = 60 GB, allow only 40 GB to H2O, so that 20 GB is free for XGBoost.​

✅ AutoML Output
📊 Leaderboard
H2O AutoML automatically generates a leaderboard, which is a table ranking all the models trained during the
AutoML run. It includes the cross-validated performance of each model (default: 5-fold CV).

🔄 Customizing Evaluation with nfolds and leaderboard_frame


●​ nfolds parameter: Defines the number of folds for cross-validation (default is 5).​

●​ leaderboard_frame parameter: You can specify a custom dataset to evaluate all models instead of
using cross-validation. The leaderboard will then show metrics based on that dataset.​
📌 Ranking Logic
Models in the leaderboard are sorted using a default metric, depending on the problem type:

| Problem Type | Default Metric |
| --- | --- |
| Binary Classification | AUC (Area Under Curve) |
| Multiclass Classification | Mean Per-Class Error |
| Regression | RMSE (Root Mean Squared Error) |

Additional metrics like logloss, MSE, etc., are also provided.
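The ranking logic can be sketched in a few lines of plain Python: pick the default metric for the problem type, then sort descending for AUC and ascending for error metrics. The dictionary keys and function name are assumptions for illustration; the two rows reuse the (truncated) leaderboard values shown earlier.

```python
# Default sort metric per problem type, as listed in the table above.
DEFAULT_SORT_METRIC = {
    "binomial": "auc",
    "multinomial": "mean_per_class_error",
    "regression": "rmse",
}
HIGHER_IS_BETTER = {"auc"}  # every other metric here is an error (lower is better)


def rank_models(rows, problem_type):
    """Sort leaderboard rows by the default metric for the given problem type."""
    metric = DEFAULT_SORT_METRIC[problem_type]
    return sorted(rows, key=lambda row: row[metric], reverse=metric in HIGHER_IS_BETTER)


rows = [
    {"model_id": "XGBoost_1_...", "auc": 0.784651, "logloss": 0.5575},
    {"model_id": "StackedEnsemble_AllModels_...", "auc": 0.789801, "logloss": 0.5511},
]

ranked = rank_models(rows, "binomial")
for row in ranked:
    print(row["model_id"], row["auc"])
```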

⚙️ Adding More Details to the Leaderboard


To better assess model performance and complexity, you can use the extra_columns argument in the
h2o.automl.get_leaderboard() function:

Allowed options for extra_columns:

| Option | Description |
| --- | --- |
| training_time_ms | Training time (milliseconds) excluding CV models |
| predict_time_per_row_ms | Avg. time to make predictions for one row |
| ALL | Adds both training time and prediction time per row |

✅ Example: Get leaderboard with full details


lb = h2o.automl.get_leaderboard(aml, extra_columns="ALL")
lb

This displays a detailed leaderboard for all models trained.

📈 Creating Leaderboard on a New Dataset


You can also create a leaderboard that scores all trained models on a different dataset using:

import h2o
# make_leaderboard is exposed at the top level of the h2o module
new_lb = h2o.make_leaderboard(aml, test, extra_columns="ALL")
new_lb

This helps compare models on real-world data or unseen test sets.


🔍 Examine Models
You can retrieve and analyze specific models from AutoML by their model ID, or get the best model of a
specific type or metric using convenient built-in methods.

✅ Get the Leader Model (Best Overall)


# Get the best overall model
m = aml.leader

# or use the helper function


m = aml.get_best_model()

🎯 Get Best Model by Custom Metric


# Get best model ranked by logloss instead of default
m = aml.get_best_model(criterion="logloss")

🧠 Get Best Model of a Specific Algorithm


# Get best XGBoost model based on default metric
xgb = aml.get_best_model(algorithm="xgboost")

# Or by a specific metric like logloss


xgb = aml.get_best_model(algorithm="xgboost", criterion="logloss")

🔑 Get a Model by Its ID


If you know the model ID (from the leaderboard), you can retrieve it directly:

m = h2o.get_model("StackedEnsemble_BestOfFamily_AutoML_20191213_174603")
🔍 Inspect Model Parameters
You can analyze model configuration (like number of trees, learning rate, etc.):

# View all parameter names
xgb.params.keys()

# View specific parameter value


xgb.params['ntrees']

✅ AutoML Log
During the execution of H2O AutoML, various backend events and metadata are recorded. These logs are
accessible through specific properties of the AutoML object when using Python or R clients.

These logs are especially useful for debugging, performance analysis, and understanding training workflow
details.

📁 1. event_log: Training Events Summary


●​ The event_log is an H2OFrame that captures key backend events during model training.​

●​ It includes records like when a model starts training, finishes, gets scored, and other internal AutoML
operations.​

●​ This is helpful to trace the AutoML process, check if there were any errors or warnings, and monitor
progress.​

✅ Example: Retrieve the event log


log = aml.event_log
log.head() # View first few event entries

🕒 2. training_info: AutoML Timing Information


● The training_info property is a dictionary of metadata about the AutoML run, intended for post-analysis.

● It includes high-level information such as:

○ Start and stop times of the run

○ Total training duration

●​ This helps in performance analysis and optimization.​

✅ Example: Retrieve training info


info = aml.training_info
print(info)

Note: If you're specifically interested in the training and prediction times for each model, it’s often easier to
use the extended leaderboard with extra_columns="ALL" like this:

lb = h2o.automl.get_leaderboard(aml, extra_columns="ALL")
lb

These logging tools provide transparency and traceability, which are key for:

●​ Performance debugging​

●​ Model reproducibility​

●​ Compliance and auditing in ML workflows​

🌐 Web UI via H2O Wave


In addition to Python and R APIs, H2O AutoML provides an interactive web interface through H2O.ai’s Wave
platform. H2O Wave is an open-source, Python-based platform that allows you to create custom web
applications for data science tasks.

🖥️ Features of the H2O AutoML Wave UI


●​ Run AutoML: You can easily upload your data, start the AutoML process, and get results directly from the
web interface.​

●​ Interactive Visualizations: The app provides several visualizations powered by the H2O Model
Explainability suite to help you analyze the results and better understand the models’ predictions.​

●​ The app also allows you to explore model performance and outcomes interactively.​

🏠 Running the Wave UI Locally


●​ You can run Wave locally on your machine or in the cloud.​

●​ To set it up locally:​

1. Install Wave: Follow the official H2O Wave installation instructions.

2.​ Start the App: Follow the H2O AutoML Wave README instructions to run the AutoML app.​

Once running, you will see a simple UI to upload data, run AutoML, and view your model’s performance with
interactive graphs and visualizations.

🧪 Experimental Features: Preprocessing


As of H2O 3.32.0.1, AutoML includes a minimal preprocessing option: Target Encoding of high-cardinality
categorical variables. This feature is experimental and currently supports:

●​ Target Encoding: Applied to categorical variables with high cardinality (many unique values).​

●​ Supported Algorithms: Tree-based algorithms such as XGBoost, H2O GBM, and Random Forest can
benefit from this preprocessing method.​

● The preprocessing option is activated by setting the parameter preprocessing = ["target_encoding"], and it
automatically tunes a Target Encoder model to improve the performance of tree-based models.
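A minimal sketch of switching this option on is shown below. The max_models and seed values and both function names are assumptions for the example; only the preprocessing setting comes from the text above. H2O is imported inside the function so the option set can be checked without a running cluster.

```python
def target_encoding_options():
    """AutoML options with the experimental target-encoding preprocessing enabled."""
    return dict(
        max_models=10,                      # illustrative limit
        seed=1,
        preprocessing=["target_encoding"],  # experimental, per the notes above
    )


def run_with_target_encoding(train, x, y):
    """Train AutoML with target encoding (needs h2o.init() and an H2OFrame)."""
    from h2o.automl import H2OAutoML  # imported lazily; requires the h2o package

    aml = H2OAutoML(**target_encoding_options())
    aml.train(x=x, y=y, training_frame=train)
    return aml


print(target_encoding_options())
```

Because the feature is experimental, it is worth comparing the leaderboard with and without this option on the same data before relying on it.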
