🚀 H2O AutoML: Automatic Machine Learning
🔍 What is H2O AutoML?
H2O AutoML is a tool that automates the entire machine learning workflow, making it easy for beginners and
experts alike to build high-performing models with minimal coding.
● Goal: Democratize machine learning — make it accessible to non-experts.
● Problem Solved: Tuning complex models (like deep learning) is difficult for non-experts.
● Solution: H2O AutoML automatically trains, tunes, and ranks multiple models.
🛠️ How H2O AutoML Works
✅ You Provide:
● A dataset
● A response column (target variable)
● (Optional) Time or model count limit
✅ H2O AutoML Will:
● Train multiple models
● Automatically tune them
● Use ensemble techniques (e.g., model stacking)
● Present a leaderboard ranking models by performance
🔹 Step 1: What is H2O.ai?
● H2O.ai is an open-source machine learning platform.
● Supports multiple languages: Python, R, Java.
● Built for big data, making it ideal for enterprise applications.
● Known for being fast, scalable, and easy to use, especially with tools like H2O AutoML.
🔹 Step 2: What is H2O AutoML?
H2O AutoML automates the process of training and tuning machine learning models.
💡 What It Does:
● Handles preprocessing and feature engineering
● Tests multiple algorithms:
○ GLM (Generalized Linear Model)
○ GBM (Gradient Boosting Machine)
○ XGBoost
○ Deep Learning
○ Stacked Ensembles
● Returns the best models via a leaderboard
🔹 Step 3: H2O Setup (Python)
Install H2O:
pip install h2o
Initialize H2O in Python:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
🔹 Step 4: Load Dataset
Import data from a URL or the local file system:
data = h2o.import_file("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
🔹 Step 5: Split the Dataset
train, test = data.split_frame(ratios=[.8])
🔹 Step 6: Run AutoML
Define the target and feature columns:
x = data.columns[:-1] # features
y = "species" # target column
aml = H2OAutoML(max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=train)
🔹 Step 7: View Leaderboard
lb = aml.leaderboard
lb.head()
🔹 Step 8: Predict on Test Data
preds = aml.leader.predict(test)
print(preds)
🔹 Step 9: Shut Down H2O Cluster
h2o.shutdown(prompt=False)
🔹 Training the Model with H2O AutoML (Detailed)
✅ Python Code:
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=10, seed=1, max_runtime_secs=300)
aml.train(x=x, y=y, training_frame=train)
✅ Parameter Breakdown:
● x: List of feature columns
● y: Target column name
● training_frame: H2OFrame containing the training data
● max_models=10: Train up to 10 models
● max_runtime_secs=300: Limit the total run time to 5 minutes (300 seconds)
● seed=1: Ensures reproducibility
⚙️ Internal Workflow:
1. Data Preprocessing:
○ Handles missing values
○ Encodes categoricals
○ Scales data (if needed)
2. Model Training:
○ GLM, DRF, GBM, XGBoost, Deep Learning, Stacked Ensembles
3. Hyperparameter Tuning:
○ Uses random grid search + cross-validation
4. Leaderboard Creation:
○ Models ranked by default metric (AUC, RMSE, etc.)
🧩 Parameters Overview
🔹 Required Parameters:
● y: Target column name (required)
● training_frame: Training dataset as an H2OFrame (required)
🔹 Required Stopping Parameters (Specify at least one):
● max_runtime_secs (integer): Maximum time, in seconds, allowed for the AutoML run
● max_models (integer): Maximum number of models to train
🔹 Optional Parameters:
🔸 Data Parameters
● x: List of predictor columns
● validation_frame: Used for early stopping (when nfolds=0)
● leaderboard_frame: Used for scoring and ranking models
● blending_frame: Enables Holdout Stacking
● fold_column: Custom cross-validation fold assignments
● weights_column: Row-level observation weights
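The frame-level parameters above are passed to train() rather than to the H2OAutoML constructor. A minimal sketch, assuming the training frame already contains hypothetical "fold" and "weight" columns:
# "fold" assigns each row to a CV fold; "weight" gives each row an observation weight.
aml = H2OAutoML(max_models=5, seed=1)
aml.train(x=x, y=y, training_frame=train,
          fold_column="fold",       # custom fold assignments override nfolds
          weights_column="weight")  # per-row weights used during training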
🔸 Miscellaneous Parameters
● nfolds: Number of cross-validation folds (default = 5); gives a more reliable model evaluation
● seed: Reproducible results; the same output every run
● sort_metric: Metric used to rank leaderboard models; optimize for what matters (AUC, RMSE, etc.)
● include_algos / exclude_algos: Specify or skip certain algorithms; customize or speed up AutoML
● project_name: Name for the AutoML run; helps organize experiments
● max_runtime_secs_per_model: Limits the time spent on each model; speeds up training
● keep_cross_validation_predictions: Keeps cross-validation predictions, which are needed for Stacked Ensembles
● export_checkpoints_dir: Saves all trained models to a directory so they can be used later
● verbosity: Logging level ("info", "debug", etc.); useful for debugging or clean output
● stopping_rounds, stopping_tolerance, stopping_metric: Early stopping settings; prevent overfitting and save time
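A sketch combining several of these optional parameters in one run (the project name is just an illustrative value):
aml = H2OAutoML(
    max_models=10,
    nfolds=5,                        # 5-fold cross-validation (the default)
    exclude_algos=["DeepLearning"],  # skip deep learning to save time
    max_runtime_secs_per_model=60,   # cap each individual model at 1 minute
    project_name="automl_demo",      # hypothetical experiment name
    seed=1,
)
aml.train(x=x, y=y, training_frame=train)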
✅ Validation Options (nfolds = 0)
🔹 What Happens When You Set nfolds = 0?
● No cross-validation scores available
● You must supply:
○ validation_frame for early stopping
○ leaderboard_frame for ranking
🔹 If Not Provided:
● H2O will split training data:
○ 10% → validation
○ 10% → leaderboard
Example: With 100,000 rows, H2O will use:
● 80,000 for training
● 10,000 for validation
● 10,000 for leaderboard
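A minimal sketch of this setup, assuming the data is split three ways into train, valid, and lb_frame:
# 80/10/10 split; the remainder after the listed ratios becomes the third frame.
train, valid, lb_frame = data.split_frame(ratios=[.8, .1], seed=1)

aml = H2OAutoML(max_models=10, nfolds=0, seed=1)  # cross-validation disabled
aml.train(x=x, y=y, training_frame=train,
          validation_frame=valid,      # used for early stopping
          leaderboard_frame=lb_frame)  # used to score and rank models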
✅ XGBoost Memory Requirements
🔹 Why It Matters
● XGBoost runs outside Java (in C++)
● Needs its own memory — separate from H2O
🔸 Recommendation
Allocate only about two-thirds of total RAM to H2O, leaving the rest for XGBoost:
● 12 GB total RAM → allocate 8 GB to H2O
● 30 GB total RAM → allocate 20 GB to H2O
● 60 GB total RAM → allocate 40 GB to H2O
h2o.init(max_mem_size = "40G")
✅ scikit-learn Compatibility (h2o.sklearn)
🔹 What Is It?
● scikit-learn wrappers for H2O AutoML:
○ H2OAutoMLClassifier
○ H2OAutoMLRegressor
🔸 Benefits
● Use H2O models in sklearn pipelines
● Accepts pandas, numpy, or H2OFrames
✅ Example:
from h2o.sklearn import H2OAutoMLClassifier
model = H2OAutoMLClassifier(max_runtime_secs=60)
model.fit(X_train, y_train)
preds = model.predict(X_test)
● Also supports score(), get_params(), set_params()
✅ Pipeline Example:
from h2o.sklearn import H2OAutoMLClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("automl", H2OAutoMLClassifier(max_models=10))
])
pipeline.fit(X_train, y_train)
✅ Summary
● h2o.init(): Starts the H2O engine
● H2OAutoML(): Configures automatic training of multiple models
● aml.train(): Starts model training
● aml.leaderboard: Shows model rankings and metrics
● aml.leader: The best-performing model
● aml.predict(): Predicts using the leader model
● h2o.explain(): Explains predictions and model behaviour
🎯 Final Summary Table
● Validation Options: Enable the leaderboard and early stopping even when nfolds=0
● XGBoost Memory: Keeping RAM free for XGBoost avoids crashes and improves stability
● sklearn Compatibility: Using H2O in scikit-learn pipelines integrates it easily into Python workflows
🔍 1. Explainability in H2O AutoML
✅ What is it?
Explainability refers to the ability to understand why and how a machine learning model makes predictions.
🎯 Why is it important?
● Helps in building trust in the model.
● Useful for debugging and model validation.
● Crucial in regulated industries (like finance or healthcare).
💡 In H2O:
H2O provides a function called h2o.explain() that automatically:
● Shows feature importance
● Creates Partial Dependence Plots (PDPs)
● Generates SHAP value plots
● Displays model performance charts
📌 Use Case:
import h2o
h2o.explain(aml.leader, test) # Explains the best model on test data
⚙️ 2. Training a Model Using H2O AutoML
📥 Step 1: Import Libraries and Start H2O
import h2o
from h2o.automl import H2OAutoML
h2o.init() # Starts H2O server locally
This sets up the H2O cluster to run AutoML tasks.
📁 Step 2: Import Dataset
train = h2o.import_file("https://...higgs_train_10k.csv")
test = h2o.import_file("https://...higgs_test_5k.csv")
Downloads a binary classification dataset for training and testing.
🧠 Step 3: Define Features and Target Variable
x = train.columns
y = "response"
x.remove(y)
● x: All input features (independent variables)
● y: Target variable (dependent/output variable)
If x is not passed, H2O uses all columns except y as predictors.
🔄 Step 4: Convert Target to Factor (for Classification)
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
This tells H2O that you're doing classification, not regression.
🚀 Step 5: Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)
● max_models=20: Builds up to 20 different models
● seed=1: Makes results reproducible
● AutoML tries various algorithms like GBM, XGBoost, DRF, Deep Learning, etc., and also builds Stacked
Ensembles.
🏆 3. Leaderboard: Compare Model Performance
After training, use this:
lb = aml.leaderboard
lb.head(rows=lb.nrows) # View all models
model_id                        auc       logloss  rmse    mse
StackedEnsemble_AllModels_...   0.789801  0.5511   0.4321  0.1867
XGBoost_1_...                   0.784651  0.5575   0.4349  0.1891
● AUC: Measures classification performance (higher is better).
● Logloss: Penalizes incorrect predictions (lower is better).
● RMSE/MSE: Errors in prediction.
The top model is stored in:
aml.leader
📈 4. Prediction Using the Leader Model
Once trained, you can make predictions:
# Option 1: Use the AutoML object (uses the leader model)
preds = aml.predict(test)
# Option 2: Use the leader model directly
preds = aml.leader.predict(test)
● Predictions are returned in the same order as your test dataset.
● Output includes predicted class and probabilities.
🧠 5. How AutoML Works Under the Hood
● It automatically trains multiple models using different algorithms.
● Uses cross-validation or validation/leaderboard frames to evaluate models.
● Selects the best model (leader) based on default metric (e.g., AUC for classification).
● Builds stacked ensembles to combine top-performing models.
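For example, the best stacked ensemble from a run can be retrieved like any other model; a short sketch using the built-in helper:
# Best stacked ensemble found during the run, by the default ranking metric.
se = aml.get_best_model(algorithm="stackedensemble")
print(se.model_id)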
🛡️ 6. Memory Best Practices for XGBoost in H2O
● H2O uses Java (JVM) and runs inside a Java process.
● XGBoost runs outside the JVM (in native C++).
● You must reserve memory for XGBoost:
h2o.init(max_mem_size = "40G") # Reserve remaining RAM for XGBoost
● If total system RAM = 60 GB, allow only 40 GB to H2O, so that 20 GB is free for XGBoost.
✅ AutoML Output
📊 Leaderboard
H2O AutoML automatically generates a leaderboard, which is a table ranking all the models trained during the
AutoML run. It includes the cross-validated performance of each model (default: 5-fold CV).
🔄 Customizing Evaluation with nfolds and leaderboard_frame
● nfolds parameter: Defines the number of folds for cross-validation (default is 5).
● leaderboard_frame parameter: You can specify a custom dataset to evaluate all models instead of
using cross-validation. The leaderboard will then show metrics based on that dataset.
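As a sketch, passing a held-out frame as leaderboard_frame ranks all models on that data instead of on cross-validation metrics:
aml = H2OAutoML(max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=train,
          leaderboard_frame=test)  # models are scored and ranked on this frame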
📌 Ranking Logic
Models in the leaderboard are sorted using a default metric, depending on the problem type:
● Binary classification: AUC (Area Under the ROC Curve)
● Multiclass classification: Mean per-class error
● Regression: RMSE (Root Mean Squared Error)
Additional metrics like logloss, MSE, etc., are also provided.
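The ranking metric can be overridden via the sort_metric parameter; a short sketch:
# Rank the leaderboard by logloss instead of the default metric.
aml = H2OAutoML(max_models=10, sort_metric="logloss", seed=1)
aml.train(x=x, y=y, training_frame=train)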
⚙️ Adding More Details to the Leaderboard
To better assess model performance and complexity, you can use the extra_columns argument in the
h2o.automl.get_leaderboard() function:
Allowed options for extra_columns:
● training_time_ms: Training time in milliseconds (excluding cross-validation models)
● predict_time_per_row_ms: Average time to make a prediction for one row
● ALL: Adds both training time and prediction time per row
✅ Example: Get leaderboard with full details
lb = h2o.automl.get_leaderboard(aml, extra_columns="ALL")
lb
This displays a detailed leaderboard for all models trained.
📈 Creating Leaderboard on a New Dataset
You can also create a leaderboard that scores all trained models on a different dataset using:
from h2o.automl import make_leaderboard
new_lb = make_leaderboard(aml, test, extra_columns="ALL")
new_lb
This helps compare models on real-world data or unseen test sets.
🔍 Examine Models
You can retrieve and analyze specific models from AutoML by their model ID, or get the best model of a
specific type or metric using convenient built-in methods.
✅ Get the Leader Model (Best Overall)
# Get the best overall model
m = aml.leader
# or use the helper function
m = aml.get_best_model()
🎯 Get Best Model by Custom Metric
# Get best model ranked by logloss instead of default
m = aml.get_best_model(criterion="logloss")
🧠 Get Best Model of a Specific Algorithm
# Get best XGBoost model based on default metric
xgb = aml.get_best_model(algorithm="xgboost")
# Or by a specific metric like logloss
xgb = aml.get_best_model(algorithm="xgboost", criterion="logloss")
🔑 Get a Model by Its ID
If you know the model ID (from the leaderboard), you can retrieve it directly:
m = h2o.get_model("StackedEnsemble_BestOfFamily_AutoML_20191213_174603")
🔍 Inspect Model Parameters
You can analyze model configuration (like number of trees, learning rate, etc.):
# View all parameter names
xgb.params.keys()
# View specific parameter value
xgb.params['ntrees']
✅ AutoML Log
During the execution of H2O AutoML, various backend events and metadata are recorded. These logs are
accessible through specific properties of the AutoML object when using Python or R clients.
These logs are especially useful for debugging, performance analysis, and understanding training workflow
details.
📁 1. event_log: Training Events Summary
● The event_log is an H2OFrame that captures key backend events during model training.
● It includes records like when a model starts training, finishes, gets scored, and other internal AutoML
operations.
● This is helpful to trace the AutoML process, check if there were any errors or warnings, and monitor
progress.
✅ Example: Retrieve the event log
log = aml.event_log
log.head() # View first few event entries
🕒 2. training_info: AutoML Timing Information
● The training_info property is a dictionary that provides detailed statistics about the AutoML run.
● It includes high-level information such as:
○ Total training time
○ Time spent on individual steps (e.g., data preprocessing, model training, leaderboard creation)
● This helps in performance analysis and optimization.
✅ Example: Retrieve training info
info = aml.training_info
print(info)
Note: If you're specifically interested in the training and prediction times for each model, it’s often easier to
use the extended leaderboard with extra_columns="ALL" like this:
lb = h2o.automl.get_leaderboard(aml, extra_columns="ALL")
lb
These logging tools provide transparency and traceability, which are key for:
● Performance debugging
● Model reproducibility
● Compliance and auditing in ML workflows
🌐 Web UI via H2O Wave
In addition to Python and R APIs, H2O AutoML provides an interactive web interface through H2O.ai’s Wave
platform. H2O Wave is an open-source, Python-based platform that allows you to create custom web
applications for data science tasks.
🖥️ Features of the H2O AutoML Wave UI
● Run AutoML: You can easily upload your data, start the AutoML process, and get results directly from the
web interface.
● Interactive Visualizations: The app provides several visualizations powered by the H2O Model
Explainability suite to help you analyze the results and better understand the models’ predictions.
● The app also allows you to explore model performance and outcomes interactively.
🏠 Running the Wave UI Locally
● You can run Wave locally on your machine or in the cloud.
● To set it up locally:
1. Install Wave: Follow the installation instructions in the H2O Wave documentation.
2. Start the App: Follow the H2O AutoML Wave README instructions to run the AutoML app.
Once running, you will see a simple UI to upload data, run AutoML, and view your model’s performance with
interactive graphs and visualizations.
🧪 Experimental Features: Preprocessing
Starting with H2O 3.32.0.1, AutoML includes a minimal preprocessing option for Target Encoding of high-cardinality categorical variables. This feature is experimental and currently supports:
● Target Encoding: Applied to categorical variables with high cardinality (many unique values).
● Supported Algorithms: Tree-based algorithms such as XGBoost, H2O GBM, and Random Forest can
benefit from this preprocessing method.
● The preprocessing option is activated by setting the parameter preprocessing =
["target_encoding"], and it automatically tunes a Target Encoder model to improve the
performance of tree-based models.
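A minimal sketch enabling this experimental option:
# Experimental: automatic target encoding of high-cardinality categoricals.
aml = H2OAutoML(max_models=10,
                preprocessing=["target_encoding"],
                seed=1)
aml.train(x=x, y=y, training_frame=train)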