ML.NET Guide
Overview
What is ML.NET & how does it work?
Tutorials
Overview
Analyze sentiment (binary classification)
Categorize support issues (multiclass classification)
Predict prices (regression)
Categorize iris flowers (k-means clustering)
Recommend movies (matrix factorization)
Image Classification (transfer learning)
Product Sales Analysis (anomaly detection)
Detect objects in images (object detection)
How-to guides
Load data
Prepare data
Train, evaluate, and explain the model
Train and evaluate a model
Train a model using cross-validation
Inspect intermediate pipeline data values
Determine model feature importance with PFI
Use the trained model
Save and load a model
Use a model to make predictions
Re-train a model
Deploy a model to Azure Functions
Deploy a model to a web API
Infer.NET
Probabilistic programming with Infer.NET
Reference
API Reference
Preview API Reference
Resources
Glossary
Tasks
Algorithms
Data transforms
Metrics
Improve model accuracy
Automated ML preview
Overview
Model Builder
What is Model Builder?
Install Model Builder
Predict prices with Model Builder
ML.NET CLI
Install the CLI
Automated machine learning
Auto-generate a binary classifier
CLI reference
CLI telemetry
Automated ML API
Use the automated ML API
API reference
What is ML.NET and how does it work?
ML.NET gives you the ability to add machine learning to .NET applications, in either online or offline scenarios.
With this capability, you can make automatic predictions using the data available to your application without
having to be connected to a network. This article explains the basics of machine learning in ML.NET.
Examples of the type of predictions that you can make with ML.NET include:
Regression/Predict continuous values: predict the price of houses based on size and location.
class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }
    static void Main(string[] args)
    {
        MLContext mlContext = new MLContext();
        // 1. Import or create training data
        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);
        // 2. Specify data preparation and model training pipeline
        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        // 3. Train model
        var model = pipeline.Fit(trainingData);
        // 4. Make a prediction (the Prediction output class is shown later in this article)
        var size = new HouseData() { Size = 2.5F };
        var price = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(model).Predict(size);
    }
}
Code workflow
The following diagram represents the application code structure, as well as the iterative process of model
development:
Collect and load training data into an IDataView object
Specify a pipeline of operations to extract features and apply a machine learning algorithm
Train a model by calling Fit() on the pipeline
Evaluate the model and iterate to improve
Save the model into binary format, for use in an application
Load the model back into an ITransformer object
Make predictions by calling CreatePredictionEngine.Predict()
Let's dig a little deeper into those concepts.
Both the house price model and the text classification model are linear models. Depending on the nature of your
data and the problem you are solving, you can also use decision tree models, generalized additive models, and
others. You can find out more about the models in Tasks.
Data preparation
In most cases, the data that you have available isn't suitable to be used directly to train a machine learning model.
The raw data needs to be prepared, or pre-processed before it can be used to find the parameters of your model.
Your data may need to be converted from string values to a numerical representation. You might have redundant
information in your input data. You may need to reduce or expand the dimensions of your input data. Your data
might need to be normalized or scaled.
The ML.NET tutorials teach you about different data processing pipelines for text, image, numerical, and time-
series data used for specific machine learning tasks.
How to prepare your data shows you how to apply data preparation more generally.
You can find an appendix of all of the available transformations in the resources section.
Model evaluation
Once you have trained your model, how do you know how well it will make future predictions? With ML.NET, you
can evaluate your model against some new test data.
Each type of machine learning task has metrics used to evaluate the accuracy and precision of the model against
the test data set.
For our house price example, we used the Regression task. To evaluate the model, add the following code to the
original sample.
HouseData[] testHouseData =
{
    new HouseData() { Size = 1.1F, Price = 0.98F },
    new HouseData() { Size = 1.9F, Price = 2.1F },
    new HouseData() { Size = 2.8F, Price = 2.9F },
    new HouseData() { Size = 3.4F, Price = 3.6F }
};

var testHouseDataView = mlContext.Data.LoadFromEnumerable(testHouseData);
var testPriceDataView = model.Transform(testHouseDataView);

var metrics = mlContext.Regression.Evaluate(testPriceDataView, labelColumnName: "Price");

Console.WriteLine($"R^2: {metrics.RSquared:0.##}");
Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.##}");

// R^2: 0.96
// RMS error: 0.19
The evaluation metrics tell you that the error is low-ish, and that correlation between the predicted output and the
test output is high. That was easy! In real examples, it takes more tuning to achieve good model metrics.
ML.NET architecture
In this section, we go through the architectural patterns of ML.NET. If you are an experienced .NET developer, some
of these patterns will be familiar to you, and some will be less familiar. Hold tight, while we dive in!
An ML.NET application starts with an MLContext object. This singleton object contains catalogs. A catalog is a
factory for data loading and saving, transforms, trainers, and model operation components. Each catalog object has
methods to create the different types of components:
Clustering ClusteringCatalog
Forecasting ForecastingCatalog
Ranking RankingCatalog
Regression RegressionCatalog
You can navigate to the creation methods in each of the above categories. Using Visual Studio, the catalogs show
up via IntelliSense.
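For example, the pipeline for the house price model is built entirely from catalog methods; a minimal sketch matching the snippet at the start of this article:

var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));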
In the snippet, Concatenate and Sdca are both methods in the catalog. They each create an IEstimator object that
is appended to the pipeline.
At this point, the objects are created only. No execution has happened.
Train the model
Once the objects in the pipeline have been created, data can be used to train the model.
Calling Fit() uses the input training data to estimate the parameters of the model. This is known as training the
model. Remember, the linear regression model above had two model parameters: bias and weight. After the
Fit() call, the values of the parameters are known. Most models will have many more parameters than this.
You can learn more about model training in How to train your model.
The resulting model object implements the ITransformer interface. That is, the model transforms input data into
predictions.
The CreatePredictionEngine() method takes an input class and an output class. The field names and/or code
attributes determine the names of the data columns used during model training and prediction. You can read about
How to make a single prediction in the How-to section.
Data models and schema
At the core of an ML.NET machine learning pipeline are DataView objects.
Each transformation in the pipeline has an input schema (data names, types, and sizes that the transform expects to
see on its input); and an output schema (data names, types, and sizes that the transform produces after the
transformation).
If the output schema from one transform in the pipeline doesn't match the input schema of the next transform,
ML.NET will throw an exception.
A data view object has columns and rows. Each column has a name, a type, and a length. For example: the input
columns in the house price example are Size and Price. They are both of type float, and they are scalar quantities
rather than vector ones.
All ML.NET algorithms look for an input column that is a vector. By default this vector column is called Features.
This is why we concatenated the Size column into a new column called Features in our house price example.
All algorithms also create new columns after they have performed a prediction. The fixed names of these new
columns depend on the type of machine learning algorithm. For the regression task, one of the new columns is
called Score. This is why we attributed our price data with this name.
public class Prediction
{
[ColumnName("Score")]
public float Price { get; set; }
}
You can find out more about output columns of different machine learning tasks in the Machine Learning Tasks
guide.
An important property of DataView objects is that they are evaluated lazily. Data views are only loaded and
operated on during model training and evaluation, and data prediction. While you are writing and testing your
ML.NET application, you can use the Visual Studio debugger to take a peek at any data view object by calling the
Preview method.
You can watch the debug variable in the debugger and examine its contents. Do not use the Preview method in
production code, as it significantly degrades performance.
Model Deployment
In real-life applications, your model training and evaluation code will be separate from your prediction. In fact,
these two activities are often performed by separate teams. Your model development team can save the model for
use in the prediction application.
mlContext.Model.Save(model, trainingData.Schema, "model.zip");
Where to now?
You can learn how to build applications using different machine learning tasks with more realistic data sets in the
tutorials.
Or you can learn about specific topics in more depth in the How To Guides.
And if you're super keen, you can dive straight into the API Reference documentation!
ML.NET tutorials
The following tutorials enable you to understand how to use ML.NET to build custom machine learning solutions
and integrate them into your .NET applications:
Sentiment analysis: demonstrates how to apply a binary classification task using ML.NET.
GitHub issue classification: demonstrates how to apply a multiclass classification task using ML.NET.
Price predictor: demonstrates how to apply a regression task using ML.NET.
Iris clustering: demonstrates how to apply a clustering task using ML.NET.
Recommendation: demonstrates how to generate movie recommendations based on previous user ratings using ML.NET.
Image classification: demonstrates how to retrain an existing TensorFlow model to create a custom image
classifier using ML.NET.
Anomaly detection: demonstrates how to build an anomaly detection application for product sales data
analysis.
Detect objects in images: demonstrates how to detect objects in images using a pre-trained ONNX model.
Next Steps
For more examples that use ML.NET, check out the dotnet/machinelearning-samples GitHub repository.
Tutorial: Analyze sentiment of website comments with
binary classification in ML.NET
This tutorial shows you how to create a .NET Core console application that classifies sentiment from website
comments and takes the appropriate action. The binary sentiment classifier uses C# in Visual Studio 2017.
In this tutorial, you learn how to:
Create a console application
Prepare data
Load the data
Build and train the model
Evaluate the model
Use the model to make a prediction
See the results
You can find the source code for this tutorial at the dotnet/samples repository.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed
UCI Sentiment Labeled Sentences dataset (ZIP file)
1. Download UCI Sentiment Labeled Sentences dataset ZIP file, and unzip.
2. Copy the yelp_labelled.txt file into the Data directory you created.
3. In Solution Explorer, right-click the yelp_labelled.txt file and select Properties. Under Advanced, change
the value of Copy to Output Directory to Copy if newer.
Create classes and define paths
1. Add the following additional using statements to the top of the Program.cs file:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms.Text;
2. Create two global fields to hold the recently downloaded dataset file path and the saved model file path:
_dataPath has the path to the dataset used to train the model.
_modelPath has the path where the trained model is saved.
3. Add the following code to the line right above the Main method to specify the paths:
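A sketch of those fields, assuming the Data folder layout created earlier in this tutorial:

private static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "yelp_labelled.txt");
private static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");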
4. Next, create classes for your input data and predictions. Add a new class to your project:
In Solution Explorer, right-click the project, and then select Add > New Item.
In the Add New Item dialog box, select Class and change the Name field to SentimentData.cs.
Then, select the Add button.
5. The SentimentData.cs file opens in the code editor. Add the following using statement to the top of
SentimentData.cs:
using Microsoft.ML.Data;
6. Remove the existing class definition and add the following code, which has two classes SentimentData and
SentimentPrediction , to the SentimentData.cs file:
public class SentimentData
{
    [LoadColumn(0)] public string SentimentText;
    [LoadColumn(1), ColumnName("Label")] public bool Sentiment;
}

public class SentimentPrediction : SentimentData
{
    [ColumnName("PredictedLabel")] public bool Prediction { get; set; }
    public float Probability { get; set; }
    public float Score { get; set; }
}
SentimentPrediction is the prediction class used after model training. It inherits from SentimentData so that the
input SentimentText can be displayed along with the output prediction. The Prediction boolean is the value that
the model predicts when supplied with new input SentimentText .
The output class SentimentPrediction contains two other properties calculated by the model: Score - the raw
score calculated by the model, and Probability - the score calibrated to the likelihood of the text having positive
sentiment.
For this tutorial, the most important property is Prediction .
2. Add the following as the next line of code in the Main() method:
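TrainTestData splitDataView = LoadData(mlContext);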
3. Create the LoadData() method, just after the Main() method, using the following code:
public static TrainTestData LoadData(MLContext mlContext)
{
    IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(_dataPath, hasHeader: false);
The LoadFromTextFile() defines the data schema and reads in the file. It takes in the data path variables and
returns an IDataView .
Split the dataset for model training and testing
When preparing a model, you use part of the dataset to train it and part of the dataset to test the model's accuracy.
1. To split the loaded data into the needed datasets, add the following code as the next line in the LoadData()
method:
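TrainTestData splitDataView = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);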
The previous code uses the TrainTestSplit() method to split the loaded dataset into train and test datasets
and return them in the TrainTestData class. Specify the test set percentage of data with the testFraction
parameter. The default is 10%; in this case you use 20% to evaluate more data.
2. Return the splitDataView at the end of the LoadData() method:
return splitDataView;
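Build and train the model
Create the BuildAndTrainModel() method, just after the LoadData() method, and define the pipeline inside it. The featurization step referenced below looks like this (a sketch consistent with the original sample):

var estimator = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentData.SentimentText))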
The FeaturizeText() method in the previous code converts the text column ( SentimentText ) into a
numeric key type Features column used by the machine learning algorithm and adds it as a new dataset
column:
.Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label",
featureColumnName: "Features"));
The Fit() method trains your model by transforming the dataset and applying the training.
Return the model trained to use for evaluation
Return the model at the end of the BuildAndTrainModel() method:
return model;
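Evaluate the model
Create the Evaluate() method, just after BuildAndTrainModel(), and transform the splitTestSet data with the trained model (a sketch consistent with the surrounding text):

IDataView predictions = model.Transform(splitTestSet);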
The previous code uses the Transform() method to make predictions for multiple provided input rows of a
test dataset.
4. Evaluate the model by adding the following as the next line of code in the Evaluate() method:
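CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");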
Once you have the prediction set ( predictions ), the Evaluate() method assesses the model, which compares the
predicted values with the actual Labels in the test dataset and returns a CalibratedBinaryClassificationMetrics
object on how the model is performing.
Displaying the metrics for model validation
Use the following code to display the metrics:
Console.WriteLine();
Console.WriteLine("Model quality metrics evaluation");
Console.WriteLine("--------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
Console.WriteLine("=============== End of model evaluation ===============");
The Accuracy metric gets the accuracy of a model, which is the proportion of correct predictions in the test
set.
The AreaUnderRocCurve metric indicates how confident the model is correctly classifying the positive and
negative classes. You want the AreaUnderRocCurve to be as close to one as possible.
The F1Score metric gets the model's F1 score, which is a measure of balance between precision and recall.
You want the F1Score to be as close to one as possible.
Predict the test data outcome
1. Create the UseModelWithSingleItem() method, just after the Evaluate() method, using the following code:
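private static void UseModelWithSingleItem(MLContext mlContext, ITransformer model)
{

}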
2. Call the UseModelWithSingleItem() method as the next line of code in the Main() method:
UseModelWithSingleItem(mlContext, model);
3. Add the following code to create the PredictionEngine as the first line in the UseModelWithSingleItem() method:
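PredictionEngine<SentimentData, SentimentPrediction> predictionFunction = mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);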
The PredictionEngine is a convenience API, which allows you to pass in and then perform a prediction on a
single instance of data.
4. Add a comment to test the trained model's prediction in the UseModelWithSingleItem() method by creating
an instance of SentimentData :
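SentimentData sampleStatement = new SentimentData
{
    SentimentText = "This was a very bad steak"
};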
5. Pass the test comment data to the Prediction Engine by adding the following as the next lines of code in
the UseModelWithSingleItem() method:
var resultPrediction = predictionFunction.Predict(sampleStatement);

Console.WriteLine();
Console.WriteLine($"Sentiment: {resultPrediction.SentimentText} | Prediction: {(Convert.ToBoolean(resultPrediction.Prediction) ? "Positive" : "Negative")} | Probability: {resultPrediction.Probability} ");
2. Call the UseModelWithBatchItems() method as the next line of code in the Main() method:
UseModelWithBatchItems(mlContext, model);
3. Add some comments to test the trained model's predictions in the UseModelWithBatchItems() method:
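IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData { SentimentText = "This was a horrible meal" },
    new SentimentData { SentimentText = "I love this spaghetti." }
};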
// Use model to predict whether comment data is Positive (1) or Negative (0).
IDataView batchComments = mlContext.Data.LoadFromEnumerable(sentiments);
IDataView predictions = model.Transform(batchComments);
IEnumerable<SentimentPrediction> predictedResults = mlContext.Data.CreateEnumerable<SentimentPrediction>(predictions, reuseRowObject: false);

Console.WriteLine();
foreach (SentimentPrediction prediction in predictedResults)
{
    Console.WriteLine($"Sentiment: {prediction.SentimentText} | Prediction: {(Convert.ToBoolean(prediction.Prediction) ? "Positive" : "Negative")} | Probability: {prediction.Probability} ");
}
Console.WriteLine("=============== End of predictions ===============");
Results
Your results should be similar to the following. During processing, messages are displayed. You may see warnings,
or processing messages. These have been removed from the following results for clarity.
=============== Prediction Test of model with a single sample and test dataset ===============
Sentiment: This was a very bad steak | Prediction: Negative | Probability: 0.1027377
=============== End of Predictions ===============
Congratulations! You've now successfully built a machine learning model for classifying and predicting message
sentiment.
Building successful models is an iterative process. This model has initial lower quality as the tutorial uses small
datasets to provide quick model training. If you aren't satisfied with the model quality, you can try to improve it by
providing larger training datasets or by choosing different training algorithms with different hyper-parameters for
each algorithm.
You can find the source code for this tutorial at the dotnet/samples repository.
Next steps
In this tutorial, you learned how to:
Create a console application
Prepare data
Load the data
Build and train the model
Evaluate the model
Use the model to make a prediction
See the results
Advance to the next tutorial to learn more
Issue Classification
Tutorial: Categorize support issues using multiclass
classification with ML.NET
This sample tutorial illustrates using ML.NET to create a GitHub issue classifier to train a model that classifies and
predicts the Area label for a GitHub issue via a .NET Core console application using C# in Visual Studio.
In this tutorial, you learn how to:
Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and Predict with a loaded model
You can find the source code for this tutorial at the dotnet/samples repository.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
The GitHub issues tab-separated file (issues_train.tsv).
The GitHub issues test tab-separated file (issues_test.tsv).
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
Create three global fields to hold the paths to the recently downloaded files, and global variables for the
MLContext , DataView , and PredictionEngine :
_trainDataPath has the path to the dataset used to train the model.
_testDataPath has the path to the dataset used to evaluate the model.
_modelPath has the path where the trained model is saved.
_mlContext is the MLContext that provides processing context.
_trainingDataView is the IDataView used to process the training dataset.
_predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.
Add the following code to the line right above the Main method to specify those paths and the other variables:
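A sketch of those fields, assuming the data files live in a Data folder and the saved model in a Models folder:

private static string _appPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string _trainDataPath => Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
private static string _testDataPath => Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
private static string _modelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");

private static MLContext _mlContext;
private static PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
private static ITransformer _trainedModel;
private static IDataView _trainingDataView;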
Create some classes for your input data and predictions. Add a new class to your project:
1. In Solution Explorer, right-click the project, and then select Add > New Item.
2. In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then,
select the Add button.
The GitHubIssueData.cs file opens in the code editor. Add the following using statement to the top of
GitHubIssueData.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code, which has two classes GitHubIssue and
IssuePrediction , to the GitHubIssueData.cs file:
public class GitHubIssue
{
    [LoadColumn(0)]
    public string ID { get; set; }
    [LoadColumn(1)]
    public string Area { get; set; }
    [LoadColumn(2)]
    public string Title { get; set; }
    [LoadColumn(3)]
    public string Description { get; set; }
}

public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}
The label is the column you want to predict. The identified Features are the inputs you give the model to predict
the Label.
Use the LoadColumnAttribute to specify the indices of the source columns in the data set.
GitHubIssue is the input dataset class and has the following String fields:
the first column ID (GitHub Issue ID )
the second column Area (the prediction for training)
the third column Title (GitHub issue title) is the first feature used for predicting the Area
the fourth column Description is the second feature used for predicting the Area
IssuePrediction is the class used for prediction after the model has been trained. It has a single string ( Area )
and a PredictedLabel ColumnName attribute. The PredictedLabel is used during prediction and evaluation. For
evaluation, an input with training data, the predicted values, and the model are used.
All ML.NET operations start in the MLContext class. Initializing mlContext creates a new ML.NET environment
that can be shared across the model creation workflow objects. It's similar, conceptually, to DBContext in
Entity Framework .
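A sketch of the corresponding code in the Main method, consistent with the fields declared earlier:

_mlContext = new MLContext(seed: 0);
_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);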
The LoadFromTextFile() defines the data schema and reads in the file. It takes in the data path variables and
returns an IDataView .
Add the following as the next line of code in the Main method:
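var pipeline = ProcessData();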
Next, call mlContext.Transforms.Text.FeaturizeText which transforms the text ( Title and Description ) columns
into a numeric vector for each called TitleFeaturized and DescriptionFeaturized . Append the featurization for
both columns to the pipeline with the following code:
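.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))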
The last step in data preparation combines all of the feature columns into the Features column using the
Concatenate() method. By default, a learning algorithm processes only features from the Features column.
Append this transformation to the pipeline with the following code:
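.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))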
Next, append an AppendCacheCheckpoint to cache the DataView, so that when you iterate over the data multiple times
the cache can improve performance, as with the following code:
.AppendCacheCheckpoint(_mlContext);
WARNING
Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove
.AppendCacheCheckpoint()) when handling very large datasets.
Return the pipeline at the end of the ProcessData method.
return pipeline;
This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better
results with your model.
var trainingPipeline =
pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
.Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
The SdcaMaximumEntropy is your multiclass classification training algorithm. This is appended to the pipeline
and accepts the featurized Title and Description ( Features ) and the Label input parameters to learn from the
historic data.
Train the model
Fit the model to the splitTrainSet data and return the trained model by adding the following as the next line of
code in the BuildAndTrainModel() method:
_trainedModel = trainingPipeline.Fit(trainingDataView);
The Fit() method trains your model by transforming the dataset and applying the training.
The PredictionEngine is a convenience API, which allows you to pass in and then perform a prediction on a single
instance of data. Add this as the next line in the BuildAndTrainModel() method:
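_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);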
return trainingPipeline;
Evaluate(_trainingDataView.Schema);
As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate
method:
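var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);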
The Evaluate() method computes the quality metrics for the model using the specified dataset. It returns a
MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification
evaluators. To display the metrics to determine the quality of the model, you need to get them first. Notice the use
of the Transform() method of the machine learning _trainedModel global variable (an ITransformer) to input the
features and return predictions. Add the following code to the Evaluate method as the next line:
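var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));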
Console.WriteLine($"******************************************************************************************
*******************");
Console.WriteLine($"* Metrics for Multi-class Classification model - Test Data ");
Console.WriteLine($"*-----------------------------------------------------------------------------------------
-------------------");
Console.WriteLine($"* MicroAccuracy: {testMetrics.MicroAccuracy:0.###}");
Console.WriteLine($"* MacroAccuracy: {testMetrics.MacroAccuracy:0.###}");
Console.WriteLine($"* LogLoss: {testMetrics.LogLoss:#.###}");
Console.WriteLine($"* LogLossReduction: {testMetrics.LogLossReduction:#.###}");
Console.WriteLine($"******************************************************************************************
*******************");
Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store
the trained model as a zip file.
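// Assumes SaveModelAsFile receives the MLContext, the training schema, and the trained model.
mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);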
Call the PredictIssue() method as the next line in the Main method:
PredictIssue();
Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method),
using the following code:
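private static void PredictIssue()
{
    ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
}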
Add a GitHub issue to test the trained model's prediction in the PredictIssue method by creating an instance of
GitHubIssue :
GitHubIssue singleIssue = new GitHubIssue() { Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };
As you did previously, create a PredictionEngine instance with the following code:
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);
Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue
method for the prediction:
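var prediction = _predEngine.Predict(singleIssue);
Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");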
Results
Your results should be similar to the following. As the pipeline processes, it displays messages. You may see
warnings, or processing messages. These messages have been removed from the following results for clarity.
Congratulations! You've now successfully built a machine learning model for classifying and predicting an Area
label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.
Next steps
In this tutorial, you learned how to:
Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and Predict with a loaded model
Advance to the next tutorial to learn more
Taxi Fare Predictor
Tutorial: Predict prices using regression with ML.NET
This tutorial illustrates how to build a regression model using ML.NET to predict prices, specifically, New York City
taxi fares.
In this tutorial, you learn how to:
Prepare and understand the data
Load and transform the data
Choose a learning algorithm
Train the model
Evaluate the model
Use the model for predictions
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
using Microsoft.ML.Data;
Remove the existing class definition and add the following code, which has two classes TaxiTrip and
TaxiTripFarePrediction , to the TaxiTrip.cs file:
public class TaxiTrip
{
    [LoadColumn(0)] public string VendorId;
    [LoadColumn(1)] public string RateCode;
    [LoadColumn(2)] public float PassengerCount;
    [LoadColumn(3)] public float TripTime;
    [LoadColumn(4)] public float TripDistance;
    [LoadColumn(5)] public string PaymentType;
    [LoadColumn(6)] public float FareAmount;
}

public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}
TaxiTrip is the input data class and has definitions for each of the data set columns. Use the LoadColumnAttribute
attribute to specify the indices of the source columns in the data set.
The TaxiTripFarePrediction class represents predicted results. It has a single float field, FareAmount , with a Score
ColumnNameAttribute attribute applied. In case of the regression task the Score column contains predicted label
values.
NOTE
Use the float type to represent floating-point values in the input and prediction data classes.
using System;
using System.IO;
using Microsoft.ML;
You need to create three fields to hold the paths to the files with data sets and the file to save the model:
_trainDataPath contains the path to the file with the data set used to train the model.
_testDataPath contains the path to the file with the data set used to evaluate the model.
_modelPath contains the path to the file where the trained model is stored.
Add the following code right above the Main method to specify those paths:
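A sketch of those fields, assuming the data files live in a Data folder:

static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "taxi-fare-train.csv");
static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "taxi-fare-test.csv");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");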
All ML.NET operations start in the MLContext class. Initializing mlContext creates a new ML.NET environment
that can be shared across the model creation workflow objects. It's similar, conceptually, to DBContext in Entity
Framework.
Initialize variables in Main
Replace the Console.WriteLine("Hello World!") line in the Main method with the following code to declare and
initialize the mlContext variable:
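MLContext mlContext = new MLContext(seed: 0);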
Add the following as the next line of code in the Main method to call the Train method:
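var model = Train(mlContext, _trainDataPath);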
As you want to predict the taxi trip fare, the FareAmount column is the Label that you will predict (the output of
the model). Use the CopyColumnsEstimator transformation class to copy FareAmount , and add the following code:
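var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "FareAmount")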
The algorithm that trains the model requires numeric features, so you have to transform the categorical data (
VendorId , RateCode , and PaymentType ) values into numbers ( VendorIdEncoded , RateCodeEncoded , and
PaymentTypeEncoded ). To do that, use the OneHotEncodingTransformer transformation class, which assigns different
numeric key values to the different values in each of the columns, and add the following code:
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "VendorIdEncoded",
inputColumnName:"VendorId"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "RateCodeEncoded", inputColumnName:
"RateCode"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "PaymentTypeEncoded",
inputColumnName: "PaymentType"))
The last step in data preparation combines all of the feature columns into the Features column using the
mlContext.Transforms.Concatenate transformation class. By default, a learning algorithm processes only features
from the Features column. Add the following code:
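// Feature list follows the sample; add "TripTime" here if you also want to use it as a feature.
.Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "RateCodeEncoded", "PassengerCount", "TripDistance", "PaymentTypeEncoded"))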
.Append(mlContext.Regression.Trainers.FastTree());
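Fit the model to the training data by adding the following (a sketch; dataView is the training IDataView loaded at the start of the Train() method):

var model = pipeline.Fit(dataView);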
The Fit() method trains your model by transforming the dataset and applying the training.
Return the trained model with the following line of code in the Train() method:
return model;
Evaluate(mlContext, model);
Load the test dataset using the LoadFromTextFile() method. Evaluate the model using this dataset as a quality
check by adding the following code in the Evaluate method:
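IDataView dataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(_testDataPath, hasHeader: true, separatorChar: ',');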
Next, transform the Test data by adding the following code to the Evaluate() method:
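var predictions = model.Transform(dataView);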
The Transform() method makes predictions for the test dataset input rows.
The RegressionContext.Evaluate method computes the quality metrics for the PredictionModel using the specified
dataset. It returns a RegressionMetrics object that contains the overall metrics computed by regression evaluators.
To display these to determine the quality of the model, you need to get the metrics first. Add the following code as
the next line in the Evaluate method:
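var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");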
Console.WriteLine();
Console.WriteLine($"*************************************************");
Console.WriteLine($"* Model quality metrics evaluation ");
Console.WriteLine($"*------------------------------------------------");
RSquared is another evaluation metric of the regression models. RSquared takes values between 0 and 1. The
closer its value is to 1, the better the model is. Add the following code into the Evaluate method to display the
RSquared value:
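Console.WriteLine($"*       RSquared Score:      {metrics.RSquared:0.##}");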
RMS is one of the evaluation metrics of the regression model. The lower it is, the better the model is. Add the
following code into the Evaluate method to display the RMS value:
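Console.WriteLine($"*       Root Mean Squared Error:      {metrics.RootMeanSquaredError:#.##}");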
TestSinglePrediction(mlContext, model);
Use the PredictionEngine to predict the fare by adding the following code to TestSinglePrediction() :
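var predictionFunction = mlContext.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(model);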
The PredictionEngine class is a convenience API, which allows you to pass a single instance of data and then
perform a prediction on it.
This tutorial uses one test trip within this class. Later you can add other scenarios to experiment with the model.
Add a trip to test the trained model's prediction of cost in the TestSinglePrediction() method by creating an
instance of TaxiTrip :
var taxiTripSample = new TaxiTrip()
{
VendorId = "VTS",
RateCode = "1",
PassengerCount = 1,
TripTime = 1140,
TripDistance = 3.75f,
PaymentType = "CRD",
FareAmount = 0 // To predict. Actual/Observed = 15.5
};
Next, predict the fare based on a single instance of the taxi trip data and pass it to the PredictionEngine by adding
the following as the next lines of code in the TestSinglePrediction() method:
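var prediction = predictionFunction.Predict(taxiTripSample);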
Console.WriteLine($"**********************************************************************");
Console.WriteLine($"Predicted fare: {prediction.FareAmount:0.####}, actual fare: 15.5");
Console.WriteLine($"**********************************************************************");
Run the program to see the predicted taxi fare for your test case.
Congratulations! You've now successfully built a machine learning model for predicting taxi trip fares, evaluated its
accuracy, and used it to make predictions. You can find the source code for this tutorial at the dotnet/samples
GitHub repository.
Next steps
In this tutorial, you learned how to:
Prepare and understand the data
Create a learning pipeline
Load and transform the data
Choose a learning algorithm
Train the model
Evaluate the model
Use the model for predictions
Advance to the next tutorial to learn more.
Iris clustering
Tutorial: Categorize iris flowers using k-means
clustering with ML.NET
This tutorial illustrates how to use ML.NET to build a clustering model for the iris flower data set.
In this tutorial, you learn how to:
Understand the problem
Select the appropriate machine learning task
Prepare the data
Load and transform the data
Choose a learning algorithm
Train the model
Use the model for predictions
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
using Microsoft.ML.Data;
Remove the existing class definition and add the following code, which defines the classes IrisData and
ClusterPrediction , to the IrisData.cs file:
public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")] public uint PredictedClusterId;
    [ColumnName("Score")] public float[] Distances;
}
IrisData is the input data class and has definitions for each feature from the data set. Use the LoadColumn
attribute to specify the indices of the source columns in the data set file.
The ClusterPrediction class represents the output of the clustering model applied to an IrisData instance. Use
the ColumnName attribute to bind the PredictedClusterId and Distances fields to the PredictedLabel and
Score columns respectively. In case of the clustering task those columns have the following meaning:
PredictedLabel column contains the ID of the predicted cluster.
Score column contains an array with squared Euclidean distances to the cluster centroids. The array length is
equal to the number of clusters.
NOTE
Use the float type to represent floating-point values in the input and prediction data classes.
Add the following code right above the Main method to specify those paths:
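A sketch of those fields, assuming the iris data file and saved model live in a Data folder:

static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");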
To make the preceding code compile, add the following using directives at the top of the Program.cs file:
using System;
using System.IO;
Create ML context
Add the following additional using directives to the top of the Program.cs file:
using Microsoft.ML;
using Microsoft.ML.Data;
In the Main method, replace the Console.WriteLine("Hello World!"); line with the following code:
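var mlContext = new MLContext(seed: 0);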
The Microsoft.ML.MLContext class represents the machine learning environment and provides mechanisms for
logging and entry points for data loading, model training, prediction, and other tasks. This is comparable
conceptually to using DbContext in Entity Framework.
The generic MLContext.Data.LoadFromTextFile extension method infers the data set schema from the provided
IrisData type and returns IDataView which can be used as input for transformers.
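A sketch of the corresponding loading, pipeline, and training code, consistent with the class definitions above:

IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');
string featuresColumnName = "Features";
var pipeline = mlContext.Transforms.Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
    .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
var model = pipeline.Fit(dataView);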
The code specifies that the data set should be split into three clusters.
This tutorial introduces one iris data instance within the TestIrisData class. You can add other scenarios to
experiment with the model. Add the following code into the TestIrisData class:
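internal static readonly IrisData Setosa = new IrisData
{
    SepalLength = 5.1f,
    SepalWidth = 3.5f,
    PetalLength = 1.4f,
    PetalWidth = 0.2f
};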
To find out the cluster to which the specified item belongs, go back to the Program.cs file and add the following
code into the Main method:
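var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);
var prediction = predictor.Predict(TestIrisData.Setosa);
Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");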
Run the program to see which cluster contains the specified data instance and squared distances from that
instance to the cluster centroids. Your results should be similar to the following:
Cluster: 2
Distances: 11.69127 0.02159119 25.59896
Congratulations! You've now successfully built a machine learning model for iris clustering and used it to make
predictions. You can find the source code for this tutorial at the dotnet/samples GitHub repository.
Next steps
In this tutorial, you learned how to:
Understand the problem
Select the appropriate machine learning task
Prepare the data
Load and transform the data
Choose a learning algorithm
Train the model
Use the model for predictions
Check out our GitHub repository to continue learning and find more samples.
dotnet/machinelearning GitHub repository
Tutorial: Build a movie recommender using matrix
factorization with ML.NET
This tutorial shows you how to build a movie recommender with ML.NET in a .NET Core console application. The
steps use C# and Visual Studio 2019.
In this tutorial, you learn how to:
Select a machine learning algorithm
Prepare and load your data
Build and train a model
Evaluate a model
Deploy and consume a model
You can find the source code for this tutorial at the dotnet/samples repository.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Trainers;
In machine learning, the columns that are used to make a prediction are called Features, and the column with the
returned prediction is called the Label.
You want to predict movie ratings, so the rating column is the Label . The other three columns, userId , movieId ,
and timestamp are all Features used to predict the Label .
FEATURES LABEL
userId rating
movieId
timestamp
It's up to you to decide which Features are used to predict the Label . You can also use methods like Feature
Permutation Importance to help with selecting the best Features .
In this case, you should eliminate the timestamp column as a Feature because the timestamp does not really
affect how a user rates a given movie and thus would not contribute to making a more accurate prediction:
FEATURES LABEL
userId rating
movieId
Next you must define your data structure for the input class.
Add a new class to your project:
1. In Solution Explorer, right-click the project, and then select Add > New Item.
2. In the Add New Item dialog box, select Class and change the Name field to MovieRatingData.cs. Then,
select the Add button.
The MovieRatingData.cs file opens in the code editor. Add the following using statement to the top of
MovieRatingData.cs:
using Microsoft.ML.Data;
Create a class called MovieRating by removing the existing class definition and adding the following code in
MovieRatingData.cs:
MovieRating specifies an input data class. The LoadColumn attribute specifies which columns (by column index) in
the dataset should be loaded. The userId and movieId columns are your Features (the inputs you will give the
model to predict the Label ), and the rating column is the Label that you will predict (the output of the model).
Create another class, MovieRatingPrediction , to represent predicted results by adding the following code after the
MovieRating class in MovieRatingData.cs:
In Program.cs, replace the Console.WriteLine("Hello World!") with the following code inside Main() :
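MLContext mlContext = new MLContext();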
The MLContext class is a starting point for all ML.NET operations, and initializing mlContext creates a new
ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to
DBContext in Entity Framework.
NOTE
This method will give you an error until you add a return statement in the following steps.
Initialize your data path variables, load the data from the *.csv files, and return the Train and Test data as
IDataView objects by adding the following as the next line of code in LoadData() :
var trainingDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "recommendation-ratings-train.csv");
var testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "recommendation-ratings-test.csv");
Data in ML.NET is represented as an IDataView class. IDataView is a flexible, efficient way of describing tabular
data (numeric and text). Data can be loaded from a text file or in real time (for example, SQL database or log files)
to an IDataView object.
The LoadFromTextFile() defines the data schema and reads in the file. It takes in the data path variables and returns
an IDataView . In this case, you provide the path for your Test and Train files and indicate both the text file
header (so it can use the column names properly) and the comma character data separator (the default separator is
a tab).
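A sketch of the corresponding loading and return code in LoadData():

IDataView trainingDataView = mlContext.Data.LoadFromTextFile<MovieRating>(trainingDataPath, hasHeader: true, separatorChar: ',');
IDataView testDataView = mlContext.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader: true, separatorChar: ',');

return (trainingDataView, testDataView);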
Add the following as the next two lines of code in the Main() method to call your LoadData() method and return
the Train and Test data:
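(IDataView trainingDataView, IDataView testDataView) = LoadData(mlContext);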
You create Transformers in ML.NET by creating Estimators . Estimators take in data and return Transformers .
The recommendation training algorithm you will use for training your model is an example of an Estimator .
Build an Estimator with the following steps:
Create the BuildAndTrainModel() method, just after the LoadData() method, using the following code:
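public static ITransformer BuildAndTrainModel(MLContext mlContext, IDataView trainingDataView)
{

}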
NOTE
This method will give you an error until you add a return statement in the following steps.
Since userId and movieId represent users and movie titles, not real values, you use the MapValueToKey() method
to transform each userId and each movieId into a numeric key type Feature column (a format accepted by
recommendation algorithms) and add them as new dataset columns:
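IEstimator<ITransformer> estimator = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "userIdEncoded", inputColumnName: "userId")
    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "movieIdEncoded", inputColumnName: "movieId"));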
userId  movieId  Label  userIdEncoded  movieIdEncoded
1       1        4      userKey1       movieKey1
1       3        4      userKey1       movieKey2
1       6        4      userKey1       movieKey3
Choose the machine learning algorithm and append it to the data transformation definitions by adding the
following as the next line of code in BuildAndTrainModel() :
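A sketch of the trainer setup; the option values below follow the original sample:

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "userIdEncoded",
    MatrixRowIndexColumnName = "movieIdEncoded",
    LabelColumnName = "Label",
    NumberOfIterations = 20,
    ApproximationRank = 100
};

var trainerEstimator = estimator.Append(mlContext.Recommendation().Trainers.MatrixFactorization(options));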
         Movie 1             Movie 2             Movie 3
User 1   Watched and liked   Watched and liked   Watched and liked
User 2   Watched and liked   Watched and liked   Has not watched -- RECOMMEND movie
The Matrix Factorization trainer has several Options, which you can read more about in the Algorithm
hyperparameters section below.
Fit the model to the Train data and return the trained model by adding the following as the next line of code in
the BuildAndTrainModel() method:
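ITransformer model = trainerEstimator.Fit(trainingDataView);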
return model;
The Fit() method trains your model with the provided training dataset. Technically, it executes the Estimator
definitions by transforming the data and applying the training, and it returns back the trained model, which is a
Transformer .
Add the following as the next line of code in the Main() method to call your BuildAndTrainModel() method and
return the trained model:
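ITransformer model = BuildAndTrainModel(mlContext, trainingDataView);

Evaluate your model
Create the EvaluateModel() method, just after BuildAndTrainModel(). Inside it, transform the Test data with the trained model (a sketch consistent with the surrounding text):

var prediction = model.Transform(testDataView);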
The Transform() method makes predictions for multiple provided input rows of a test dataset.
Evaluate the model by adding the following as the next line of code in the EvaluateModel() method:
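var metrics = mlContext.Regression.Evaluate(prediction, labelColumnName: "Label", scoreColumnName: "Score");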
Once you have the prediction set, the Evaluate() method assesses the model, which compares the predicted values
with the actual Labels in the test dataset and returns metrics on how the model is performing.
Print your evaluation metrics to the console by adding the following as the next line of code in the
EvaluateModel() method:
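Console.WriteLine("Root Mean Squared Error : " + metrics.RootMeanSquaredError.ToString());
Console.WriteLine("RSquared: " + metrics.RSquared.ToString());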
Add the following as the next line of code in the Main() method to call your EvaluateModel() method:
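EvaluateModel(mlContext, testDataView, model);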
In this output, there are 20 iterations. In each iteration, the measure of error decreases and converges closer and
closer to 0.
The root of mean squared error (RMS or RMSE ) is used to measure the differences between the model predicted
values and the test dataset observed values. Technically it's the square root of the average of the squares of the
errors. The lower it is, the better the model is.
R Squared indicates how well data fits a model. Ranges from 0 to 1. A value of 0 means that the data is random or
otherwise can't be fit to the model. A value of 1 means that the model exactly matches the data. You want your
R Squared score to be as close to 1 as possible.
Building successful models is an iterative process. This model has initial lower quality as the tutorial uses small
datasets to provide quick model training. If you aren't satisfied with the model quality, you can try to improve it by
providing larger training datasets or by choosing different training algorithms with different hyper-parameters for
each algorithm. For more information, check out the Improve your model section below.
Use the PredictionEngine to predict the rating by adding the following code to UseModelForSinglePrediction() :
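var predictionEngine = mlContext.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(model);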
The PredictionEngine class is a convenience API, which allows you to pass a single instance of data and then
perform a prediction on this single instance of data.
Create an instance of MovieRating called testInput and pass it to the Prediction Engine by adding the following
as the next lines of code in the UseModelForSinglePrediction() method:
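// Test input follows the original sample (user 6, movie 25); the 3.5 cutoff decides whether to recommend.
var testInput = new MovieRating { userId = 6, movieId = 25 };

var movieRatingPrediction = predictionEngine.Predict(testInput);

if (Math.Round(movieRatingPrediction.Score, 1) > 3.5)
{
    Console.WriteLine("Movie " + testInput.movieId + " is recommended for user " + testInput.userId);
}
else
{
    Console.WriteLine("Movie " + testInput.movieId + " is not recommended for user " + testInput.userId);
}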
Add the following as the next line of code in the Main() method to call your UseModelForSinglePrediction()
method:
UseModelForSinglePrediction(mlContext, model);
The output of this method should look similar to the following text:
Save your trained model by adding the following code in the SaveModel() method:
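// The file name below is an assumption matching the "Data" folder described next.
var modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "MovieRecommenderModel.zip");
mlContext.Model.Save(model, trainingDataView.Schema, modelPath);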
This method saves your trained model to a .zip file (in the "Data" folder), which can then be used in other .NET
applications to make predictions.
Add the following as the next line of code in the Main() method to call your SaveModel() method:
Results
After following the steps above, run your console app (Ctrl + F5). Your results from the single prediction above
should be similar to the following. You may see warnings or processing messages, but these messages have been
removed from the following results for clarity.
One Class Matrix Factorization: use this when you only have userId and movieId. This style of recommendation is
based upon the co-purchase scenario, or products frequently bought together, which means it will recommend to
customers a set of products based upon their own purchase order history.

Field Aware Factorization Machines: use this to make recommendations when you have more Features beyond userId,
productId, and rating (such as product description or product price). This method also uses a collaborative
filtering approach.
Resources
The data used in this tutorial is derived from MovieLens Dataset.
Next steps
In this tutorial, you learned how to:
Select a machine learning algorithm
Prepare and load your data
Build and train a model
Evaluate a model
Deploy and consume a model
Advance to the next tutorial to learn more
Sentiment Analysis
Tutorial: Retrain a TensorFlow image classifier with
transfer learning and ML.NET
Learn how to retrain an image classification TensorFlow model with transfer learning and ML.NET. The original
model was trained to classify individual images. After retraining, the new model organizes the images into broad
categories.
Training an Image Classification model from scratch requires setting millions of parameters, a ton of labeled
training data and a vast amount of compute resources (hundreds of GPU hours). While not as effective as training
a custom model from scratch, transfer learning allows you to shortcut this process by working with thousands of
images vs. millions of labeled images and build a customized model fairly quickly (within an hour on a machine
without a GPU).
In this tutorial, you learn how to:
Understand the problem
Reuse and tune the pre-trained model
Classify Images
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
Microsoft.ML 1.0.0 NuGet package
Microsoft.ML.ImageAnalytics 1.0.0 NuGet package
Microsoft.ML.TensorFlow 0.12.0 NuGet package
The tutorial assets directory .ZIP file
The InceptionV1 machine learning model
NOTE
The preceding images belong to Wikimedia Commons and are attributed as follows:
"220px-Pepperoni_pizza.jpg" Public Domain, https://commons.wikimedia.org/w/index.php?curid=79505,
"119px-Nalle_-_a_small_brown_teddy_bear.jpg" By Jonik - Self-photographed, CC BY-SA 2.0,
https://commons.wikimedia.org/w/index.php?curid=48166.
"193px-Broodrooster.jpg" By M.Minderhoud - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?
curid=27403
Transfer learning includes a few strategies, such as retrain all layers and penultimate layer. This tutorial will explain
and show how to use the penultimate layer strategy. The penultimate layer strategy reuses a model that's already
been pre-trained to solve a specific problem. The strategy then retrains the final layer of that model to make it
solve a new problem. Reusing the pre-trained model as part of your new model will save significant time and
resources.
Your image classification model reuses the Inception model, a popular image recognition model trained on the
ImageNet dataset where the TensorFlow model tries to classify entire images into a thousand classes, like
“Umbrella”, “Jersey”, and “Dishwasher”.
The Inception v1 model can be classified as a deep convolutional neural network and can achieve reasonable
performance on hard visual recognition tasks, matching or exceeding human performance in some domains. The
model/algorithm was developed by multiple researchers and based on the original paper: "Rethinking the
Inception Architecture for Computer Vision” by Szegedy, et al.
Because the Inception model has already been pre-trained on thousands of different images, it contains the image
features needed for image identification. The lower image feature layers recognize simple features (such as edges)
and the higher layers recognize more complex features (such as shapes). The final layer is trained against a much
smaller set of data because you're starting with a pre-trained model that already understands how to classify
images. As your model allows you to classify more than two categories, this is an example of a multi-class
classifier.
TensorFlow is a popular deep learning and machine learning toolkit that enables training deep neural networks
(and general numeric computations), and is implemented as a transformer in ML.NET. For this tutorial, it's used to
reuse the Inception model .
As shown in the following diagram, you add a reference to the ML.NET NuGet packages in your .NET Core or
.NET Framework applications. Under the covers, ML.NET includes and references the native TensorFlow library
that allows you to write code that loads an existing trained TensorFlow model file for scoring.
The Inception model is trained to classify images into a thousand categories, but you need to classify images in a
smaller category set, and only those categories. Enter the transfer part of transfer learning . You can transfer
the Inception model 's ability to recognize and classify images to the new limited categories of your custom image
classifier.
You're going to retrain the final layer of that model using a set of three categories:
Food
Toy
Appliance
Your layer uses a multinomial logistic regression algorithm to find the correct category as quickly as possible. This
algorithm classifies using probabilities to determine the answer, giving a one value to the correct category and a
zero value to the others.
DataSet
There are two data sources: the .tsv file, and the image files. The tags.tsv file contains two columns: the first
one is defined as ImagePath and the second one is the Label corresponding to the image. The following example
file doesn't have a header row, and looks like this:
broccoli.jpg food
pizza.jpg food
pizza2.jpg food
teddy2.jpg toy
teddy3.jpg toy
teddy4.jpg toy
toaster.jpg appliance
toaster2.png appliance
The training and testing images are located in the assets folders that you'll download in a zip file. These images
belong to Wikimedia Commons.
Wikimedia Commons, the free media repository. Retrieved 10:48, October 17, 2018 from:
https://commons.wikimedia.org/wiki/Pizza
https://commons.wikimedia.org/wiki/Toaster
https://commons.wikimedia.org/wiki/Teddy_bear
Create a console application
Create a project
1. Create a .NET Core Console Application called "TransferLearningTF".
2. Install the Microsoft.ML NuGet Package:
In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org"
as the Package source, select the Browse tab, search for Microsoft.ML. Click on the Version drop-down,
select the 1.0.0 package in the list, and select the Install button. Select the OK button on the Preview
Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with
the license terms for the packages listed. Repeat these steps for Microsoft.ML.ImageAnalytics v1.0.0 and
Microsoft.ML.TensorFlow v0.12.0.
Prepare your data
1. Download The project assets directory zip file, and unzip.
2. Copy the assets directory into your TransferLearningTF project directory. This directory and its
subdirectories contain the data and support files (except for the Inception model, which you'll download and
add in the next step) needed for this tutorial.
3. Download the Inception model, and unzip.
4. Copy the contents of the inception5h directory just unzipped into your TransferLearningTF project
assets\inputs-train\inception directory. This directory contains the model and additional support files
needed for this tutorial, as shown in the following image:
5. In Solution Explorer, right-click each of the files in the asset directory and subdirectories and select
Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.
Create classes and define paths
Add the following additional using statements to the top of the Program.cs file:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Data.IO;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms.Image;
Create global fields to hold the paths to the various assets, and global variables for the LabelTokey , ImageReal , and
PredictedLabelValue :
Add the following code to the line right above the Main method to specify those paths and the other variables:
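A sketch of those fields; the paths assume the assets layout described in the "Prepare your data" section:

static readonly string _assetsPath = Path.Combine(Environment.CurrentDirectory, "assets");
static readonly string _trainTagsTsv = Path.Combine(_assetsPath, "inputs-train", "data", "tags.tsv");
static readonly string _trainImagesFolder = Path.Combine(_assetsPath, "inputs-train", "data");
static readonly string _inceptionPb = Path.Combine(_assetsPath, "inputs-train", "inception", "tensorflow_inception_graph.pb");

private static string LabelTokey = nameof(LabelTokey);
private static string ImageReal = nameof(ImageReal);
private static string PredictedLabelValue = nameof(PredictedLabelValue);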
Create some classes for your input data, and predictions. Add a new class to your project:
1. In Solution Explorer, right-click the project, and then select Add > New Item.
2. In the Add New Item dialog box, select Class and change the Name field to ImageData.cs. Then, select the
Add button.
The ImageData.cs file opens in the code editor. Add the following using statement to the top of
ImageData.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code for the ImageData class to the ImageData.cs file:
public class ImageData
{
    [LoadColumn(0)]
    public string ImagePath;

    [LoadColumn(1)]
    public string Label;
}
ImageData is the input image data class and has the following String fields:
ImagePath contains the image file name.
Label contains a value for the image label.
Remove the existing class definition and add the following code, which has the ImagePrediction class, to the
ImagePrediction.cs file:
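The class body was not preserved in this copy. A minimal sketch consistent with the field descriptions below, assuming it inherits the input fields from ImageData:
public class ImagePrediction : ImageData
{
    // Confidence values for each category
    public float[] Score;

    // Predicted label, mapped back from the key type to its original string value
    public string PredictedLabelValue;
}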
ImagePrediction is the image prediction class and has the following fields:
Score contains the confidence percentage for a given image classification.
PredictedLabelValue contains a value for the predicted image classification label.
ImagePrediction is the class used for prediction after the model has been trained. It has a string ( ImagePath ) for
the image path. The Label is used to reuse and retrain the model. The PredictedLabelValue is used during
prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
The MLContext class is a starting point for all ML.NET operations, and initializing mlContext creates a new
ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to
DBContext in Entity Framework.
The Transform() method populated ImagePath in ImagePrediction along with the predicted fields. As the ML.NET
process progresses, each component adds columns, and this makes it easy to display the results:
You'll call the DisplayResults() method in the two image classification methods.
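The method body was not preserved here; a sketch of what DisplayResults() might look like, formatted to match the console output shown in the Results section:
private static void DisplayResults(IEnumerable<ImagePrediction> imagePredictionData)
{
    foreach (ImagePrediction prediction in imagePredictionData)
    {
        Console.WriteLine($"Image: {Path.GetFileName(prediction.ImagePath)} predicted as: {prediction.PredictedLabelValue} with score: {prediction.Score.Max()}");
    }
}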
Create a .tsv file utility method
The ReadFromTsv() method executes the following tasks:
Reads the image data tags.tsv file.
Adds the file path to the image file name.
Loads the file data into an IEnumerable ImageData object.
Create the ReadFromTsv() method, just after the DisplayResults() method, using the following code:
The following code parses through the tags.tsv file to add the file path to the image file name for the ImagePath
property and load it and the Label into an ImageData object. Add it as the first line of the ReadFromTsv() method.
You need the fully qualified file path to display the prediction results.
return File.ReadAllLines(file)
    .Select(line => line.Split('\t'))
    .Select(line => new ImageData()
    {
        ImagePath = Path.Combine(folder, line[0]),
        Label = line[1]
    });
There are three major concepts in ML.NET: Data, Transformers, and Estimators.
Your image processing estimator uses pre-trained Deep Neural Network (DNN) featurizers for feature extraction.
When dealing with deep neural networks, you adapt the images to the expected network format. This is the reason
you use several image transforms to get the image data into the model's expected form:
1. The LoadImages transform loads the images into memory as a Bitmap type.
2. The ResizeImages transform resizes the images as the pre-trained model has a defined input image width and
height.
3. The ExtractPixels transform extracts the pixels from the input images and converts them into a numeric
vector.
Add these image transforms as the next lines of code:
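The pipeline code was not preserved here. A sketch of what it might look like; the 224 x 224 input size and the 117 pixel offset are assumptions based on the Inception model's published input format, and the chain is continued below with the TensorFlow scoring and training appends:
var pipeline = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: LabelTokey, inputColumnName: "Label")
    // Load the images from disk into memory as Bitmaps
    .Append(mlContext.Transforms.LoadImages(outputColumnName: "input", imageFolder: _trainImagesFolder, inputColumnName: nameof(ImageData.ImagePath)))
    // Resize to the input dimensions the pre-trained model expects
    .Append(mlContext.Transforms.ResizeImages(outputColumnName: "input", imageWidth: 224, imageHeight: 224, inputColumnName: "input"))
    // Convert pixels to a numeric vector, offsetting by the model's expected mean
    .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "input", interleavePixelColors: true, offsetImage: 117))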
The LoadTensorFlowModel is a convenience method that allows the TensorFlow model to be loaded once and then
creates the TensorFlowEstimator using ScoreTensorFlowModel. The ScoreTensorFlowModel extracts specified outputs
(the Inception model's image features softmax2_pre_activation), and scores a dataset using the pre-trained
TensorFlow model.
softmax2_pre_activation assists the model with determining which class the image belongs to.
softmax2_pre_activation returns a probability for each of the categories for an image, and all of those probabilities
must add up to 1. It assumes that an image will belong to only one category, as shown in the following example:
CLASS PROBABILITY
Food 0.001
Toy 0.950
Appliance 0.049
Append the TensorFlowTransform to the estimator with the following line of code:
.Append(mlContext.Model.LoadTensorFlowModel(inputModelLocation)
    .ScoreTensorFlowModel(outputColumnNames: new[] { "softmax2_pre_activation" },
        inputColumnNames: new[] { "input" }, addBatchDimensionInput: true))
.Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
    labelColumnName: LabelTokey, featureColumnName: "softmax2_pre_activation"))
.Append(mlContext.Transforms.Conversion.MapKeyToValue(PredictedLabelValue, "PredictedLabel"))
.AppendCacheCheckpoint(mlContext);
The Fit() method trains your model by transforming the dataset and applying the training. Fit the model to the
training dataset and return the trained model by adding the following as the next line of code in the
ReuseAndTuneInceptionModel() method:
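That line was not preserved here; assuming data holds the IDataView loaded from tags.tsv, it presumably resembles:
ITransformer model = pipeline.Fit(data);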
The Transform() method makes predictions for multiple provided input rows of a test dataset. Transform the
Training data by adding the following code to ReuseAndTuneInceptionModel() :
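A sketch of that step, reusing the data and model names assumed above:
IDataView predictionData = model.Transform(data);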
Convert your image data and prediction DataViews into strongly-typed IEnumerables to pair for easier display.
Use the MLContext.CreateEnumerable() method to do that, using the following code:
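A sketch of that conversion, assuming predictionData from the previous step:
IEnumerable<ImagePrediction> imagePredictionData =
    mlContext.Data.CreateEnumerable<ImagePrediction>(predictionData, reuseRowObject: false);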
Call the DisplayResults() method to display your data and predictions, as the next line in the
ReuseAndTuneInceptionModel() method:
DisplayResults(imagePredictionData);
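The metrics object used below comes from evaluating the predictions against the labels. A sketch, assuming predictionData holds the transformed training data from the earlier step:
MulticlassClassificationMetrics metrics =
    mlContext.MulticlassClassification.Evaluate(predictionData,
        labelColumnName: LabelTokey, predictedLabelColumnName: "PredictedLabel");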
Use the following code to display the metrics, share the results, and then act on them:
Console.WriteLine($"LogLoss is: {metrics.LogLoss}");
Console.WriteLine($"PerClassLogLoss is: {String.Join(" , ", metrics.PerClassLogLoss.Select(c =>
c.ToString()))}");
Add the following code to return the trained model as the next line:
return model;
Next, create the ClassifyImages() method, just after the ReuseAndTuneInceptionModel() method, to classify the
test images:
public static void ClassifyImages(MLContext mlContext, string dataLocation, string imagesFolder, string outputModelLocation, ITransformer model)
{
First, call the ReadFromTsv() method to create an IEnumerable<ImageData> collection that contains the fully qualified
path for each ImagePath. You need that file path to pair your data and prediction results. You also need to convert
the IEnumerable<ImageData> collection to an IDataView that you will use to predict. Add the following code as the next
two lines in the ClassifyImages() method:
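A sketch of those two lines (the local variable names are illustrative):
var imageData = ReadFromTsv(dataLocation, imagesFolder);
var imageDataView = mlContext.Data.LoadFromEnumerable<ImageData>(imageData);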
As you did previously with the training image data, predict the category of the test image data using the
Transform() method of the model passed in. Add the following code to the ClassifyImages() method for the
predictions and to convert the predictions IDataView into an IEnumerable for pairing and display:
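A sketch of those lines, reusing the imageDataView name assumed above:
var predictions = model.Transform(imageDataView);
var imagePredictionData = mlContext.Data.CreateEnumerable<ImagePrediction>(predictions, reuseRowObject: false);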
To pair and display your test image data and predictions, add the following code to call the DisplayResults()
method previously created as the next line in the ClassifyImages() method:
DisplayResults(imagePredictionData);
First, create an ImageData instance that contains the fully qualified path and image file name for the single ImagePath.
Add the following code as the next lines in the ClassifySingleImage() method:
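A sketch of that step; toaster3.jpg is taken from the Results section below, and the imagesFolder name is an assumption:
var imageData = new ImageData()
{
    ImagePath = Path.Combine(imagesFolder, "toaster3.jpg")
};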
The PredictionEngine class is a convenience API that performs a prediction on a single instance of data. The
Predict() function makes a prediction on a single row of data. Pass imageData to the PredictionEngine to
predict the image category by adding the following code to ClassifySingleImage() :
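A sketch of creating the engine and predicting:
var predictor = mlContext.Model.CreatePredictionEngine<ImageData, ImagePrediction>(model);
var prediction = predictor.Predict(imageData);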
Display the prediction result as the next line of code in the ClassifySingleImage() method:
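A sketch of the display line, matching the console format used elsewhere in this tutorial:
Console.WriteLine($"Image: {Path.GetFileName(imageData.ImagePath)} predicted as: {prediction.PredictedLabelValue} with score: {prediction.Score.Max()}");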
Results
After following the previous steps, run your console app (Ctrl + F5). Your results should be similar to the following
output. You may see warnings or processing messages, but these messages have been removed from the following
results for clarity.
=============== Training classification model ===============
Image: broccoli.jpg predicted as: food with score: 0.976743
Image: pizza.jpg predicted as: food with score: 0.9751652
Image: pizza2.jpg predicted as: food with score: 0.9660203
Image: teddy2.jpg predicted as: toy with score: 0.9748783
Image: teddy3.jpg predicted as: toy with score: 0.9829691
Image: teddy4.jpg predicted as: toy with score: 0.9868168
Image: toaster.jpg predicted as: appliance with score: 0.9769174
Image: toaster2.png predicted as: appliance with score: 0.9800823
=============== Classification metrics ===============
LogLoss is: 0.0228266745633507
PerClassLogLoss is: 0.0277501705149937 , 0.0186303530571291 , 0.0217359128952187
=============== Making classifications ===============
Image: broccoli.png predicted as: food with score: 0.905548
Image: pizza3.jpg predicted as: food with score: 0.9709008
Image: teddy6.jpg predicted as: toy with score: 0.9750155
=============== Making single image classification ===============
Image: toaster3.jpg predicted as: appliance with score: 0.9625379
Congratulations! You've now successfully built a machine learning model for image classification by reusing a pre-
trained TensorFlow model in ML.NET.
You can find the source code for this tutorial at the dotnet/samples repository.
In this tutorial, you learned how to:
Understand the problem
Reuse and tune the pre-trained model
Classify images with a loaded model
Check out the Machine Learning samples GitHub repository to explore an expanded image classification sample.
dotnet/machinelearning-samples GitHub repository
Tutorial: Detect anomalies in product sales with
ML.NET
8/2/2019 • 11 minutes to read • Edit Online
Learn how to build an anomaly detection application for product sales data. This tutorial creates a .NET Core
console application using C# in Visual Studio.
In this tutorial, you learn how to:
Load the data
Create a transform for spike anomaly detection
Detect spike anomalies with the transform
Create a transform for change point anomaly detection
Detect change point anomalies with the transform
You can find the source code for this tutorial at the dotnet/samples repository.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
The product-sales.csv dataset
NOTE
The data format in product-sales.csv is based on the dataset “Shampoo Sales Over a Three Year Period” originally
sourced from DataMarket and provided by Time Series Data Library (TSDL), created by Rob Hyndman. “Shampoo Sales Over
a Three Year Period” Dataset Licensed Under the DataMarket Default Open License.
using System;
using System.IO;
using Microsoft.ML;
using System.Collections.Generic;
MONTH PRODUCTSALES
1-Jan 271
2-Jan 150.9
..... .....
1-Feb 199.3
..... .....
using Microsoft.ML.Data;
4. Remove the existing class definition and add the following code, which has two classes ProductSalesData
and ProductSalesPrediction , to the ProductSalesData.cs file:
public class ProductSalesData
{
    [LoadColumn(0)]
    public string Month;

    [LoadColumn(1)]
    public float numSales;
}
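The companion prediction class was not preserved in this copy. A minimal sketch consistent with the results printed later:
public class ProductSalesPrediction
{
    // Vector holding alert, score, and p-value; change point detection emits a
    // fourth martingale value, so adjust the VectorType size accordingly.
    [VectorType(3)]
    public double[] Prediction { get; set; }
}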
The MLContext class is a starting point for all ML.NET operations, and initializing mlContext creates a new
ML.NET environment that can be shared across the model creation workflow objects. It's similar,
conceptually, to DBContext in Entity Framework.
Load the data
Data in ML.NET is represented as an IDataView class. IDataView is a flexible, efficient way of describing tabular
data (numeric and text). Data can be loaded from a text file or from other sources (for example, SQL database or
log files) to an IDataView object.
1. Add the following code as the next line of the Main() method:
The LoadFromTextFile() defines the data schema and reads in the file. It takes in the data path variables and
returns an IDataView .
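A sketch of that line; the _dataPath variable holding the path to product-sales.csv is an assumption:
IDataView dataView = mlContext.Data.LoadFromTextFile<ProductSalesData>(path: _dataPath, hasHeader: true, separatorChar: ',');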
Spike detection
The goal of spike detection is to identify sudden yet temporary bursts that significantly differ from the majority of
the time series data values. It's important to detect these suspicious rare items, events, or observations in a timely
manner so that their impact can be minimized. Spike detection can be used to detect a variety of anomalies such
as outages, cyber-attacks, or viral web content. The following image is an example of spikes in a time series dataset:
Add the CreateEmptyDataView() method
Add the following method to Program.cs :
The CreateEmptyDataView() produces an empty data view object with the correct schema to be used as input to the
IEstimator.Fit() method.
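A sketch of what that method might look like; an empty enumerable still carries the ProductSalesData schema:
private static IDataView CreateEmptyDataView(MLContext mlContext)
{
    // Empty list with the correct schema; no rows are needed just to Fit the time series transforms
    IEnumerable<ProductSalesData> enumerableData = new List<ProductSalesData>();
    return mlContext.Data.LoadFromEnumerable(enumerableData);
}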
Create the DetectSpike() method
The DetectSpike() method:
Creates the transform from the estimator.
Detects spikes based on historical sales data.
Displays the results.
1. Create the DetectSpike() method, just after the Main() method, using the following code:
2. Use the IidSpikeEstimator to train the model for spike detection. Add it to the DetectSpike() method with
the following code:
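A sketch of the estimator definition; the confidence and history-length values are assumptions, with docSize standing for the number of rows in the dataset:
var iidSpikeEstimator = mlContext.Transforms.DetectIidSpike(
    outputColumnName: nameof(ProductSalesPrediction.Prediction),
    inputColumnName: nameof(ProductSalesData.numSales),
    confidence: 95,
    pvalueHistoryLength: docSize / 4);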
3. Create the spike detection transform by adding the following as the next line of code in the DetectSpike()
method:
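A sketch of that line, fitting against the empty data view created earlier:
ITransformer iidSpikeTransform = iidSpikeEstimator.Fit(CreateEmptyDataView(mlContext));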
4. Add the following line of code to transform the productSales data as the next line in the DetectSpike()
method:
IDataView transformedData = iidSpikeTransform.Transform(productSales);
The previous code uses the Transform() method to make predictions for multiple input rows of a dataset.
5. Convert your transformedData into a strongly-typed IEnumerable for easier display using the
CreateEnumerable() method with the following code:
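A sketch of that conversion:
IEnumerable<ProductSalesPrediction> predictions =
    mlContext.Data.CreateEnumerable<ProductSalesPrediction>(transformedData, reuseRowObject: false);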
Console.WriteLine("Alert\tScore\tP-Value");
foreach (var p in predictions)
{
    var results = $"{p.Prediction[0]}\t{p.Prediction[1]:f2}\t{p.Prediction[2]:F2}";
    if (p.Prediction[0] == 1)
    {
        results += " <-- Spike detected";
    }
    Console.WriteLine(results);
}
Console.WriteLine("");
2. Create the iidChangePointEstimator in the DetectChangepoint() method with the following code:
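A sketch of the estimator definition; as with spike detection, the confidence and history-length values are assumptions:
var iidChangePointEstimator = mlContext.Transforms.DetectIidChangePoint(
    outputColumnName: nameof(ProductSalesPrediction.Prediction),
    inputColumnName: nameof(ProductSalesData.numSales),
    confidence: 95,
    changeHistoryLength: docSize / 4);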
3. As you did previously, create the transform from the estimator by adding the following line of code in the
DetectChangePoint() method:
4. Use the Transform() method to transform the data by adding the following code to DetectChangePoint() :
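Sketches of steps 3 and 4, following the same pattern as the spike detection method:
ITransformer iidChangePointTransform = iidChangePointEstimator.Fit(CreateEmptyDataView(mlContext));
IDataView transformedData = iidChangePointTransform.Transform(productSales);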
5. As you did previously, convert your transformedData into a strongly-typed IEnumerable for easier display
using the CreateEnumerable() method with the following code:
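A sketch of that conversion:
IEnumerable<ProductSalesPrediction> predictions =
    mlContext.Data.CreateEnumerable<ProductSalesPrediction>(transformedData, reuseRowObject: false);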
6. Create a display header with the following code as the next line in the DetectChangePoint() method:
Console.WriteLine("Alert\tScore\tP-Value\tMartingale value");
You'll display the following information in your change point detection results:
Alert indicates a change point alert for a given data point.
Score is the ProductSales value for a given data point in the dataset.
P-Value The "P" stands for probability. The closer the P-value is to 0, the more likely the data point is an
anomaly.
Martingale value is used to identify how "weird" a data point is, based on the sequence of P-values.
7. Iterate through the predictions IEnumerable and display the results with the following code:
foreach (var p in predictions)
{
var results = $"
{p.Prediction[0]}\t{p.Prediction[1]:f2}\t{p.Prediction[2]:F2}\t{p.Prediction[3]:F2}";
if (p.Prediction[0] == 1)
{
results += " <-- alert is on, predicted changepoint";
}
Console.WriteLine(results);
}
Console.WriteLine("");
8. Add the following call to the DetectChangepoint() method in the Main() method:
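A sketch of that call; the _docSize and dataView names are assumptions carried over from the loading step:
DetectChangepoint(mlContext, _docSize, dataView);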
Congratulations! You've now successfully built machine learning models for detecting spikes and change point
anomalies in sales data.
You can find the source code for this tutorial at the dotnet/samples repository.
In this tutorial, you learned how to:
Load the data
Train the model for spike anomaly detection
Detect spike anomalies with the trained model
Train the model for change point anomaly detection
Detect change point anomalies with the trained model
Next steps
Check out the Machine Learning samples GitHub repository to explore a Power Consumption Anomaly Detection
sample.
dotnet/machinelearning-samples GitHub repository
Tutorial: Detect objects using ONNX in ML.NET
8/7/2019 • 30 minutes to read • Edit Online
Learn how to use a pre-trained ONNX model in ML.NET to detect objects in images.
Training an object detection model from scratch requires setting millions of parameters, a large amount of labeled
training data and a vast amount of compute resources (hundreds of GPU hours). Using a pre-trained model allows
you to shortcut the training process.
In this tutorial, you learn how to:
Understand the problem
Learn what ONNX is and how it works with ML.NET
Understand the model
Reuse the pre-trained model
Detect objects with a loaded model
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
Microsoft.ML NuGet Package
Microsoft.ML.ImageAnalytics NuGet Package
Microsoft.ML.OnnxTransformer NuGet Package
Tiny YOLOv2 pre-trained model
Netron (optional)
The YOLO model takes an image of 3 (RGB) x 416px x 416px. The model takes this input and passes it through the
different layers to produce an output. The output divides the input image into a 13 x 13 grid, with each cell in the
grid consisting of 125 values.
What is an ONNX model?
The Open Neural Network Exchange (ONNX) is an open source format for AI models. ONNX supports
interoperability between frameworks. This means you can train a model in one of the many popular machine
learning frameworks like PyTorch, convert it into ONNX format and consume the ONNX model in a different
framework like ML.NET. To learn more, visit the ONNX website.
The pre-trained Tiny YOLOv2 model is stored in ONNX format, a serialized representation of the layers and
learned patterns of those layers. In ML.NET, interoperability with ONNX is achieved with the ImageAnalytics and
OnnxTransformer NuGet packages. The ImageAnalytics package contains a series of transforms that take an image
and encode it into numerical values that can be used as input into a prediction or training pipeline. The
OnnxTransformer package leverages the ONNX Runtime to load an ONNX model and use it to make predictions
based on input provided.
4. Copy the extracted model.onnx file from the directory just unzipped into your ObjectDetection project
assets\Model directory and rename it to TinyYolo2_model.onnx . This directory contains the model needed
for this tutorial.
5. In Solution Explorer, right-click each of the files in the asset directory and subdirectories and select
Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.
Create classes and define paths
Open the Program.cs file and add the following additional using statements to the top of the file:
using System;
using System.IO;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Linq;
using Microsoft.ML;
using ObjectDetection.YoloParser;
using ObjectDetection.DataStructures;
1. Create a GetAbsolutePath utility method below the Main method to resolve paths relative to the assembly's
location:
public static string GetAbsolutePath(string relativePath)
{
    FileInfo _dataRoot = new FileInfo(typeof(Program).Assembly.Location);
    string assemblyFolderPath = _dataRoot.Directory.FullName;
    string fullPath = Path.Combine(assemblyFolderPath, relativePath);
    return fullPath;
}
2. Then, inside the Main method, create fields to store the location of your assets:
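A sketch of those fields; the relative assets path is an assumption, while the Model and images/output folder names come from the steps above and the Results section:
var assetsRelativePath = @"../../../assets";
string assetsPath = GetAbsolutePath(assetsRelativePath);
var modelFilePath = Path.Combine(assetsPath, "Model", "TinyYolo2_model.onnx");
var imagesFolder = Path.Combine(assetsPath, "images");
var outputFolder = Path.Combine(assetsPath, "images", "output");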
Add a new directory to your project to store your input data and prediction classes.
In Solution Explorer, right-click the project, and then select Add > New Folder. When the new folder appears in
the Solution Explorer, name it "DataStructures".
Create your input data class in the newly created DataStructures directory.
1. In Solution Explorer, right-click the DataStructures directory, and then select Add > New Item.
2. In the Add New Item dialog box, select Class and change the Name field to ImageNetData.cs. Then, select
the Add button.
The ImageNetData.cs file opens in the code editor. Add the following using statement to the top of
ImageNetData.cs:
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML.Data;
Remove the existing class definition and add the following code for the ImageNetData class to the
ImageNetData.cs file:
public class ImageNetData
{
    [LoadColumn(0)]
    public string ImagePath;

    [LoadColumn(1)]
    public string Label;
}
ImageNetData is the input image data class and has the following String fields:
ImagePath contains the path where the image is stored.
Label contains the name of the file.
Additionally, ImageNetData contains a method ReadFromFile which loads multiple image files stored in the
imageFolder path specified and returns them as a collection of ImageNetData objects.
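The method body was not preserved here. A sketch of what ReadFromFile might look like, to be placed inside the ImageNetData class; the .md file filter is an assumption:
public static IEnumerable<ImageNetData> ReadFromFile(string imageFolder)
{
    return Directory
        .GetFiles(imageFolder)
        .Where(filePath => Path.GetExtension(filePath) != ".md") // skip any non-image docs
        .Select(filePath => new ImageNetData { ImagePath = filePath, Label = Path.GetFileName(filePath) });
}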
Next, create your prediction class, ImageNetPrediction.cs, in the DataStructures directory. Add the following using
statement to the top of ImageNetPrediction.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code for the ImageNetPrediction class to the
ImageNetPrediction.cs file:
public class ImageNetPrediction
{
[ColumnName("grid")]
public float[] PredictedLabels;
}
ImageNetPrediction is the prediction data class and has the following float[] field:
PredictedLabels contains the dimensions, objectness score, and class probabilities for each of the
bounding boxes detected in an image.
Initialize variables in Main
The MLContext class is a starting point for all ML.NET operations, and initializing mlContext creates a new
ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to
DBContext in Entity Framework.
Initialize the mlContext variable with a new instance of MLContext by adding the following line to the Main
method of Program.cs below the outputFolder field.
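MLContext mlContext = new MLContext();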
First, load the image and get the height and width dimensions in the DrawBoundingBox method.
Then, create a for-each loop to iterate over each of the bounding boxes detected by the model.
Inside of the for-each loop, get the dimensions of the bounding box.
Because the dimensions of the bounding box correspond to the model input of 416 x 416 , scale the bounding box
dimensions to match the actual size of the image.
x = (uint)originalImageWidth * x / OnnxModelScorer.ImageNetSettings.imageWidth;
y = (uint)originalImageHeight * y / OnnxModelScorer.ImageNetSettings.imageHeight;
width = (uint)originalImageWidth * width / OnnxModelScorer.ImageNetSettings.imageWidth;
height = (uint)originalImageHeight * height / OnnxModelScorer.ImageNetSettings.imageHeight;
Then, define a template for text that will appear above each bounding box. The text will contain the class of the
object inside of the respective bounding box as well as the confidence.
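A sketch of that template; the box loop variable is an assumption from the for-each described above:
string text = $"{box.Label} ({(box.Confidence * 100).ToString("0")}%)";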
Inside the using code block, tune the graphic's Graphics object settings.
thumbnailGraphic.CompositingQuality = CompositingQuality.HighQuality;
thumbnailGraphic.SmoothingMode = SmoothingMode.HighQuality;
thumbnailGraphic.InterpolationMode = InterpolationMode.HighQualityBicubic;
Below that, set the font and color options for the text and bounding box.
Create and fill a rectangle above the bounding box to contain the text using the FillRectangle method. This will
help contrast the text and improve readability.
Then, draw the text and bounding box on the image using the DrawString and DrawRectangle methods.
Outside of the for-each loop, add code to save the images in the outputDirectory .
if (!Directory.Exists(outputImageLocation))
{
Directory.CreateDirectory(outputImageLocation);
}
image.Save(Path.Combine(outputImageLocation, imageName));
To get additional feedback that the application is making predictions as expected at runtime, add a method called
LogDetectedObjects below the DrawBoundingBox method in the Program.cs file to output the detected objects to the
console.
Console.WriteLine("");
}
Both of these methods will be useful when the model has produced outputs and those have been processed. First
though, create the functionality to process the model outputs.
x the x position of the bounding box center relative to the grid cell it's associated with.
y the y position of the bounding box center relative to the grid cell it's associated with.
w the width of the bounding box.
h the height of the bounding box.
o the confidence value that an object exists within the bounding box, also known as objectness score.
p1-p20 class probabilities for each of the 20 classes predicted by the model.
In total, the 25 elements describing each of the 5 bounding boxes make up the 125 elements contained in each
grid cell.
The output generated by the pre-trained ONNX model is a float array of length 21125, representing the elements
of a tensor with dimensions 125 x 13 x 13. In order to transform the predictions generated by the model into a
tensor, some post-processing work is required. To do so, create a set of classes to help parse the output.
Add a new directory to your project to organize the set of parser classes.
1. In Solution Explorer, right-click the project, and then select Add > New Folder. When the new folder
appears in the Solution Explorer, name it "YoloParser".
Create bounding boxes and dimensions
The data output by the model contains coordinates and dimensions of the bounding boxes of objects within the
image. Create a base class for dimensions.
1. In Solution Explorer, right-click the YoloParser directory, and then select Add > New Item.
2. In the Add New Item dialog box, select Class and change the Name field to DimensionsBase.cs. Then,
select the Add button.
The DimensionsBase.cs file opens in the code editor. Remove all using statements and existing class
definition.
Add the following code for the DimensionsBase class to the DimensionsBase.cs file:
using System.Drawing;
Just above the existing class definition, add a new class definition called BoundingBoxDimensions which
inherits from the DimensionsBase class to contain the dimensions of the respective bounding box.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
Inside the existing YoloOutputParser class definition, add a nested class that contains the dimensions of
each of the cells in the image. Add the following code for the CellDimensions class which inherits from the
DimensionsBase class at the top of the YoloOutputParser class definition.
3. Inside the YoloOutputParser class definition, add the following constant and fields.
public const int ROW_COUNT = 13;
public const int COL_COUNT = 13;
public const int CHANNEL_COUNT = 125;
public const int BOXES_PER_CELL = 5;
public const int BOX_INFO_FEATURE_COUNT = 5;
public const int CLASS_COUNT = 20;
public const float CELL_WIDTH = 32;
public const float CELL_HEIGHT = 32;
ROW_COUNT is the number of rows in the grid the image is divided into.
COL_COUNT is the number of columns in the grid the image is divided into.
CHANNEL_COUNT is the total number of values contained in one cell of the grid.
BOXES_PER_CELL is the number of bounding boxes in a cell.
BOX_INFO_FEATURE_COUNT is the number of features contained within a box (x,y,height,width,confidence).
CLASS_COUNT is the number of class predictions contained in each bounding box.
CELL_WIDTH is the width of one cell in the image grid.
CELL_HEIGHT is the height of one cell in the image grid.
channelStride is the starting position of the current cell in the grid.
When the model scores an image, it divides the 416px x 416px input into a grid of cells the size of 13 x 13.
Each cell is 32px x 32px. Within each cell, there are 5 bounding boxes each containing 5 features
(x, y, width, height, confidence). In addition, each bounding box contains the probability of each of the classes,
which in this case is 20. Therefore, each cell contains 125 pieces of information (5 features + 20 class
probabilities).
Create a list of anchors below channelStride for all 5 bounding boxes:
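The list itself was not preserved in this copy; a sketch using the published Tiny YOLOv2 anchor ratios:
private float[] anchors = new float[]
{
    1.08F, 1.19F, 3.42F, 4.41F, 6.63F, 11.38F, 9.42F, 5.11F, 16.62F, 10.52F
};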
Anchors are pre-defined height and width ratios of bounding boxes. Most objects or classes detected by a model
have similar ratios. This is valuable when it comes to creating bounding boxes. Instead of predicting the bounding
boxes, the offset from the pre-defined dimensions is calculated, therefore reducing the computation required to
predict the bounding box. Typically these anchor ratios are calculated based on the dataset used. In this case,
because the dataset is known and the values have been pre-computed, the anchors can be hard-coded.
Next, define the labels or classes that the model will predict. This model predicts 20 classes which is a subset of the
total number of classes predicted by the original YOLOv2 model.
Add your list of labels below the anchors .
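The list itself was not preserved here; assuming the 20 classes are the standard PASCAL VOC categories that Tiny YOLOv2 was trained on, it presumably resembles:
private string[] labels = new string[]
{
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor"
};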
There are colors associated with each of the classes. Assign your class colors below your labels :
private static Color[] classColors = new Color[]
{
Color.Khaki,
Color.Fuchsia,
Color.Silver,
Color.RoyalBlue,
Color.Green,
Color.DarkOrange,
Color.Purple,
Color.Gold,
Color.Red,
Color.Aquamarine,
Color.Lime,
Color.AliceBlue,
Color.Sienna,
Color.Orchid,
Color.Tan,
Color.LightPink,
Color.Yellow,
Color.HotPink,
Color.OliveDrab,
Color.SandyBrown,
Color.DarkTurquoise
};
Add the code for all the helper methods below your list of classColors .
if (areaA <= 0)
return 0;
Once you have defined all of the helper methods, it's time to use them to process the model output.
Below the IntersectionOverUnion method, create the ParseOutputs method to process the output generated by the
model.
Create a list to store your bounding boxes and define variables inside the ParseOutputs method.
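A sketch of that line:
var boxes = new List<YoloBoundingBox>();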
Each image is divided into a grid of 13 x 13 cells. Each cell contains five bounding boxes. Below the boxes
variable, add code to process all of the boxes in each of the cells.
for (int row = 0; row < ROW_COUNT; row++)
{
    for (int column = 0; column < COL_COUNT; column++)
    {
        for (int box = 0; box < BOXES_PER_CELL; box++)
        {

        }
    }
}
Inside the inner-most loop, calculate the starting position of the current box within the one-dimensional model
output.
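A sketch of that calculation, using the constants defined earlier:
var channel = (box * (CLASS_COUNT + BOX_INFO_FEATURE_COUNT));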
Directly below that, use the ExtractBoundingBoxDimensions method to get the dimensions of the current bounding
box.
Then, use the GetConfidence method to get the confidence for the current bounding box.
Before doing any further processing, check whether your confidence value is greater than the threshold provided.
If not, process the next bounding box.
Otherwise, continue processing the output. The next step is to get the probability distribution of the predicted
classes for the current bounding box using the ExtractClasses method.
Then, use the GetTopResult method to get the value and index of the class with the highest probability for the
current box and compute its score.
Use the topScore to once again keep only those bounding boxes that are above the specified threshold.
Finally, if the current bounding box exceeds the threshold, create a new BoundingBox object and add it to the
boxes list.
boxes.Add(new YoloBoundingBox()
{
Dimensions = new BoundingBoxDimensions
{
X = (mappedBoundingBox.X - mappedBoundingBox.Width / 2),
Y = (mappedBoundingBox.Y - mappedBoundingBox.Height / 2),
Width = mappedBoundingBox.Width,
Height = mappedBoundingBox.Height,
},
Confidence = topScore,
Label = labels[topResultIndex],
BoxColor = classColors[topResultIndex]
});
Once all cells in the image have been processed, return the boxes list. Add the following return statement below
the outer-most for-loop in the ParseOutputs method.
return boxes;
Inside the FilterBoundingBoxes method, start off by creating an array equal to the size of detected boxes and
marking all slots as active or ready for processing.
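A sketch of that setup:
var activeCount = boxes.Count;
var isActiveBoxes = new bool[boxes.Count];

// Every box starts out active (eligible for processing)
for (int i = 0; i < isActiveBoxes.Length; i++)
    isActiveBoxes[i] = true;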
Then, sort the list containing your bounding boxes in descending order based on confidence.
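A sketch of the sort, pairing each box with its original index:
var sortedBoxes = boxes.Select((b, i) => new { Box = b, Index = i })
    .OrderByDescending(b => b.Box.Confidence)
    .ToList();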
Begin processing each bounding box by iterating over each of the bounding boxes.
Inside of this for-loop, check whether the current bounding box can be processed.
if (isActiveBoxes[i])
{
If so, add the bounding box to the list of results. If the results exceed the specified limit of boxes to be extracted,
break out of the loop. Add the following code inside the if-statement.
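A sketch of that code, assuming limit is the method's box-count parameter:
var boxA = sortedBoxes[i].Box;
results.Add(boxA);

if (results.Count >= limit)
    break;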
Otherwise, look at the adjacent bounding boxes. Add the following code below the box limit check.
for (var j = i + 1; j < boxes.Count; j++)
{
Like the first box, if the adjacent box is active or ready to be processed, use the IntersectionOverUnion method to
check whether the first box and the second box exceed the specified threshold. Add the following code to your
inner-most for-loop.
if (isActiveBoxes[j])
{
    var boxB = sortedBoxes[j].Box;

    if (IntersectionOverUnion(boxA.Rect, boxB.Rect) > threshold)
    {
        // Overlapping box loses to the higher-confidence boxA
        isActiveBoxes[j] = false;
        activeCount--;

        if (activeCount <= 0)
            break;
    }
}
}
Outside of the inner-most for-loop that checks adjacent bounding boxes, see whether there are any remaining
bounding boxes to be processed. If not, break out of the outer for-loop.
if (activeCount <= 0)
break;
Finally, outside of the initial for-loop of the FilterBoundingBoxes method, return the results:
return results;
Great! Now it's time to use this code along with the model for scoring.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using ObjectDetection.DataStructures;
using ObjectDetection.YoloParser;
Inside the OnnxModelScorer class definition, add the following variables.
Directly below that, create a constructor for the OnnxModelScorer class that will initialize the previously
defined variables.
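A sketch of those members:
private readonly string imagesFolder;
private readonly string modelLocation;
private readonly MLContext mlContext;

public OnnxModelScorer(string imagesFolder, string modelLocation, MLContext mlContext)
{
    this.imagesFolder = imagesFolder;
    this.modelLocation = modelLocation;
    this.mlContext = mlContext;
}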
Once you have created the constructor, define a couple of structs that contain variables related to the image
and model settings. Create a struct called ImageNetSettings to contain the height and width expected as
input for the model.
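A sketch of that struct, using the 416px dimensions described earlier:
public struct ImageNetSettings
{
    public const int imageHeight = 416;
    public const int imageWidth = 416;
}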
After that, create another struct called TinyYoloModelSettings which contains the names of the input and
output layers of the model. To visualize the name of the input and output layers of the model, you can use a
tool like Netron.
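A sketch of that struct; "grid" matches the ColumnName used in ImageNetPrediction, and "image" is the input layer name reported for the Tiny YOLOv2 model:
public struct TinyYoloModelSettings
{
    // Input tensor name
    public const string ModelInput = "image";

    // Output tensor name
    public const string ModelOutput = "grid";
}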
Next, create the first set of methods used for scoring. Create the LoadModel method inside of your
OnnxModelScorer class.
Inside the LoadModel method, add the following code for logging.
Console.WriteLine("Read model");
Console.WriteLine($"Model location: {modelLocation}");
Console.WriteLine($"Default parameters: image size=({ImageNetSettings.imageWidth},
{ImageNetSettings.imageHeight})");
ML.NET pipelines typically expect data to operate on when the Fit method is called. In this case, a process
similar to training will be used. However, because no actual training is happening, it is acceptable to use an
empty IDataView . Create a new IDataView for the pipeline from an empty list.
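A sketch of that line:
var data = mlContext.Data.LoadFromEnumerable(new List<ImageNetData>());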
Below that, define the pipeline. The pipeline will consist of four transforms.
LoadImages loads the image as a Bitmap.
ResizeImages rescales the image to the size specified (in this case, 416 x 416 ).
ExtractPixels changes the pixel representation of the image from a Bitmap to a numerical vector.
ApplyOnnxModel loads the ONNX model and uses it to score on the data provided.
Define your pipeline in the LoadModel method below the data variable.
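A sketch of the four-transform pipeline described above; the in-place "image" column name is an assumption:
var pipeline = mlContext.Transforms.LoadImages(outputColumnName: "image", imageFolder: "", inputColumnName: nameof(ImageNetData.ImagePath))
    .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image", imageWidth: ImageNetSettings.imageWidth, imageHeight: ImageNetSettings.imageHeight, inputColumnName: "image"))
    .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "image"))
    .Append(mlContext.Transforms.ApplyOnnxModel(modelFile: modelLocation,
        outputColumnNames: new[] { TinyYoloModelSettings.ModelOutput },
        inputColumnNames: new[] { TinyYoloModelSettings.ModelInput }));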
Now it's time to instantiate the model for scoring. Call the Fit method on the pipeline and return it for
further processing.
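A sketch of that line:
var model = pipeline.Fit(data);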
return model;
Once the model is loaded, it can then be used to make predictions. To facilitate that process, create a method called
PredictDataUsingModel below the LoadModel method.
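A sketch of what PredictDataUsingModel might look like; the parameter names are assumptions:
private IEnumerable<float[]> PredictDataUsingModel(IDataView testData, ITransformer model)
{
    // Score the test data and pull the raw output tensor column
    IDataView scoredData = model.Transform(testData);
    IEnumerable<float[]> probabilities = scoredData.GetColumn<float[]>(TinyYoloModelSettings.ModelOutput);
    return probabilities;
}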
Extract the predicted probabilities and return them for additional processing.
return probabilities;
Now that both steps are set up, combine them into a single method. Below the PredictDataUsingModel method,
add a new method called Score .
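A sketch of the combined Score method:
public IEnumerable<float[]> Score(IDataView data)
{
    var model = LoadModel(modelLocation);
    return PredictDataUsingModel(data, model);
}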
Detect objects
Now that all of the setup is complete, it's time to detect some objects. Inside the Main method of your Program.cs
class, add a try-catch statement.
try
{
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
Inside of the try block, start implementing the object detection logic. First, load the data into an IDataView .
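A sketch of that step, using the ReadFromFile helper and the path fields assumed earlier:
IEnumerable<ImageNetData> images = ImageNetData.ReadFromFile(imagesFolder);
IDataView imageDataView = mlContext.Data.LoadFromEnumerable(images);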
Then, create an instance of OnnxModelScorer and use it to score the loaded data.
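A sketch of those lines:
var modelScorer = new OnnxModelScorer(imagesFolder, modelFilePath, mlContext);
IEnumerable<float[]> probabilities = modelScorer.Score(imageDataView);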
Now it's time for the post-processing step. Create an instance of YoloOutputParser and use it to process the model
output.
YoloOutputParser parser = new YoloOutputParser();
var boundingBoxes =
probabilities
.Select(probability => parser.ParseOutputs(probability))
.Select(boxes => parser.FilterBoundingBoxes(boxes, 5, .5F));
Once the model output has been processed, it's time to draw the bounding boxes on the images. Create a for-loop
to iterate over each of the scored images.
Inside of the for-loop, get the name of the image file and the bounding boxes associated with it.
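A sketch of the loop skeleton; the Label field holds the file name, as defined in ImageNetData:
for (var i = 0; i < images.Count(); i++)
{
    string imageFileName = images.ElementAt(i).Label;
    IList<YoloBoundingBox> detectedObjects = boundingBoxes.ElementAt(i);

    // Drawing and logging calls (shown below) go here
}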
Below that, use the DrawBoundingBox method to draw the bounding boxes on the image.
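A sketch of that call; the parameter order is an assumption:
DrawBoundingBox(imagesFolder, outputFolder, imageFileName, detectedObjects);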
LogDetectedObjects(imageFileName, detectedObjects);
After the try-catch statement, add additional logic to indicate the process is done running.
That's it!
Results
After following the previous steps, run your console app (Ctrl + F5). Your results should be similar to the following
output. You may see warnings or processing messages, but these messages have been removed from the following
results for clarity.
=====Identify the objects in the images=====
To see the images with bounding boxes, navigate to the assets/images/output/ directory. Below is a sample from
one of the processed images.
Congratulations! You've now successfully built a machine learning model for object detection by reusing a pre-
trained ONNX model in ML.NET.
You can find the source code for this tutorial at the dotnet/samples repository.
In this tutorial, you learned how to:
Understand the problem
Learn what ONNX is and how it works with ML.NET
Understand the model
Reuse the pre-trained model
Detect objects with a loaded model
Check out the Machine Learning samples GitHub repository to explore an expanded object detection sample.
dotnet/machinelearning-samples GitHub repository
Load data from files and other sources
8/2/2019 • 3 minutes to read • Edit Online
This how-to shows you how to load data for processing and training into ML.NET. The data is originally stored in
files or other data sources such as databases, JSON, XML, or in-memory collections.
Size (Sq. ft.), HistoricalPrice1 ($), HistoricalPrice2 ($), HistoricalPrice3 ($), Current Price ($)
700, 100000, 3000000, 250000, 500000
1000, 600000, 400000, 650000, 700000
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}
IMPORTANT
LoadColumn is only required when loading data from a file.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("my-data-file.csv", separatorChar: ',', hasHeader: true);
//Create MLContext
MLContext mlContext = new MLContext();
// Create TextLoader
TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(separatorChar: ',', hasHeader: true);
// Load Data
IDataView data = textLoader.Load("DataFolder/SubFolder1/1.txt", "DataFolder/SubFolder2/1.txt");
Load the in-memory collection into an IDataView with the LoadFromEnumerable method:
IMPORTANT
LoadFromEnumerable assumes that the IEnumerable it loads from is thread-safe.
// Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<HousingData>(inMemoryCollection);
Prepare data for building a model
6/26/2019 • 6 minutes to read • Edit Online
Learn how to use ML.NET to prepare data for additional processing or building a model.
Data is often unclean and sparse. Additionally, ML.NET machine learning algorithms expect input or features to be
in a single numerical vector. Therefore one of the goals of data preparation is to get the data into the format
expected by ML.NET algorithms.
Filter data
Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The
DataOperationsCatalog contains a set of filter operations that take in an IDataView containing all of the data and
return an IDataView containing only the data points of interest. It's important to note that because filter operations
are not an IEstimator or ITransformer like those in the TransformsCatalog , they cannot be included as part of an
EstimatorChain or TransformerChain data preparation pipeline.
To filter data based on the value of a column, use the FilterRowsByColumn method.
// Apply filter
IDataView filteredData = mlContext.Data.FilterRowsByColumn(data, "Price", lowerBound: 200000, upperBound: 1000000);
The sample above takes rows in the dataset with a price between 200000 and 1000000. The result of applying this
filter would return only the last two rows in the data and exclude the first row, because its price is 100000 and not
within the specified range.
Notice that the last element in our list has a missing value for Price . To replace the missing values in the Price
column, use the ReplaceMissingValues method to fill in that missing value.
IMPORTANT
ReplaceMissingValue only works with numerical data.
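The estimator definition and fit were not preserved here. A sketch of those steps, assuming a using directive for Microsoft.ML.Transforms:
// Define replacement estimator using the Mean of the Price column
var replacementEstimator = mlContext.Transforms.ReplaceMissingValues("Price",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

// Fit the estimator to the data to create the transformer
ITransformer replacementTransformer = replacementEstimator.Fit(data);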
// Transform data
IDataView transformedData = replacementTransformer.Transform(data);
ML.NET supports various replacement modes. The sample above uses the Mean replacement mode, which fills
in the missing value with that column's average value. The replacement's result fills in the Price property for the
last element in our data with 200,000, since it's the average of 100,000 and 300,000.
Use normalizers
Normalization is a data pre-processing technique used to standardize features that are not on the same scale
which helps algorithms converge faster. For example, the ranges for values like age and income vary significantly
with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit
the transforms page for a more detailed list and description of normalization transforms.
Min-Max normalization
Using the following input data which is loaded into an IDataView :
HomeData[] homeDataList = new HomeData[]
{
new HomeData
{
NumberOfBedrooms = 2f,
Price = 200000f
},
new HomeData
{
NumberOfBedrooms = 1f,
Price = 100000f
}
};
Normalization can be applied to columns with single numerical values as well as vectors. Normalize the data in the
Price column using min-max normalization with the NormalizeMinMax method.
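A sketch of the estimator definition and fit that produce the minMaxTransformer used below:
// Define min-max normalization estimator
var minMaxEstimator = mlContext.Transforms.NormalizeMinMax("Price");

// Fit the estimator to the data to create the transformer
ITransformer minMaxTransformer = minMaxEstimator.Fit(data);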
// Transform data
IDataView transformedData = minMaxTransformer.Transform(data);
The original price values [200000,100000] are converted to [1, 0.5] using the MinMax normalization formula,
which generates output values in the range of 0-1.
Binning
Binning converts continuous values into a discrete representation of the input. For example, suppose one of your
features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin,
another could be 19-35 and so on.
Using the following input data which is loaded into an IDataView :
Normalize the data into bins using the NormalizeBinning method. The maximumBinCount parameter enables you to
specify the number of bins needed to classify your data. In this example, data will be put into two bins.
// Define binning estimator
var binningEstimator = mlContext.Transforms.NormalizeBinning("Price", maximumBinCount: 2);
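The binningTransformer used below comes from fitting the estimator to the data, presumably as follows:
// Fit the estimator to the data to create the binning transformer
var binningTransformer = binningEstimator.Fit(data);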
// Transform Data
IDataView transformedData = binningTransformer.Transform(data);
The result of binning creates bin bounds of [0,200000,Infinity] . Therefore the resulting bins are [0,1,1]
because the first observation is between 0-200000 and the others are greater than 200000 but less than infinity.
The categorical VehicleType property can be converted into a number using the OneHotEncoding method.
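A sketch of the estimator definition and fit that produce the categoricalTransformer used below:
// Define one-hot encoding estimator
var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("VehicleType");

// Fit the estimator to the data to create the transformer
ITransformer categoricalTransformer = categoricalEstimator.Fit(data);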
// Transform Data
IDataView transformedData = categoricalTransformer.Transform(data);
The resulting transform converts the text value of VehicleType to a number. The entries in the VehicleType
column become the following when the transform is applied:
[
1, // SUV
2, // Sedan
1 // SUV
]
Work with text data
Text data needs to be transformed into numbers before using it to build a machine learning model. Visit the
transforms page for a more detailed list and description of text transforms.
Using data like the data below that has been loaded into an IDataView :
The minimum step to convert text to a numerical vector representation is to use the FeaturizeText method. By
using the FeaturizeText transform, a series of transformations is applied to the input text column resulting in a
numerical vector representing the lp-normalized word and character ngrams.
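A sketch of the estimator definition and fit that produce the textTransformer used below, featurizing the Description column in place:
// Define text featurization estimator
var textEstimator = mlContext.Transforms.Text.FeaturizeText("Description");

// Fit the estimator to the data to create the transformer
ITransformer textTransformer = textEstimator.Fit(data);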
// Transform data
IDataView transformedData = textTransformer.Transform(data);
The resulting transform would convert the text values in the Description column to a numerical vector that looks
similar to the output below:
Combine complex text processing steps into an EstimatorChain to remove noise and potentially reduce the
amount of required processing resources as needed.
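A sketch of what such a chain might look like, applying finer-grained text transforms to the Description column in place:
// Normalize, tokenize, remove stop words, map to keys, produce ngrams, then lp-normalize
var textEstimator = mlContext.Transforms.Text.NormalizeText("Description")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Description"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Description"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Description"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Description"))
    .Append(mlContext.Transforms.NormalizeLpNorm("Description"));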
textEstimator contains a subset of operations performed by the FeaturizeText method. The benefit of a more
complex pipeline is control and visibility over the transformations applied to the data.
Using the first entry as an example, the following is a detailed description of the results produced by the
transformation steps defined by textEstimator :
Original Text: This is a good product
Learn how to build machine learning models, collect metrics, and measure performance with ML.NET. Although
this sample trains a regression model, the concepts are applicable throughout a majority of the other algorithms.
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}
Use the TrainTestSplit method to split the data into train and test sets. The result will be a TrainTestData object
which contains two IDataView members, one for the train set and the other for the test set. The data split
percentage is determined by the testFraction parameter. The snippet below is holding out 20 percent of the
original data for the test set.
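A sketch of that split:
// Split data: 80% train, 20% test
DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;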
NOTE
Other models have parameters that are specific to their tasks. For example, the K-Means algorithm puts data into clusters
based on centroids, and the KMeansModelParameters class contains a property that stores these learned centroids. To learn more,
visit the Microsoft.ML.Trainers API Documentation and look for classes that contain ModelParameters in their name.
NOTE
The Evaluate method produces different metrics depending on which machine learning task was performed. For more
details, visit the Microsoft.ML.Data API Documentation and look for classes that contain Metrics in their name.
NOTE
In this small example, the R-Squared is a number not in the range of 0-1 because of the limited size of the data. In a real-
world scenario, you should expect to see a value between 0 and 1.
Train a machine learning model using cross validation
6/26/2019 • 3 minutes to read • Edit Online
Learn how to use cross validation to train more robust machine learning models in ML.NET.
Cross-validation is a training and model evaluation technique that splits the data into several partitions and trains
multiple algorithms on these partitions. This technique improves the robustness of the model by holding out data
from the training process. In addition to improving performance on unseen observations, in data-constrained
environments it can be an effective tool for training models with a smaller dataset.
Size (Sq. ft.), HistoricalPrice1 ($), HistoricalPrice2 ($), HistoricalPrice3 ($), Current Price ($)
620.00, 148330.32, 140913.81, 136686.39, 146105.37
550.00, 557033.46, 529181.78, 513306.33, 548677.95
1127.00, 479320.99, 455354.94, 441694.30, 472131.18
1120.00, 47504.98, 45129.73, 43775.84, 46792.41
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}
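The data preparation pipeline that produces the dataPrepTransformer used below was not preserved here. A sketch under the assumption that features are concatenated and normalized:
// Define data prep estimator and fit it to the data
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"));
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);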
// Transform data
IDataView transformedData = dataPrepTransformer.Transform(data);
NOTE
Although this sample uses a linear regression model, CrossValidate is applicable to all other machine learning tasks in ML.NET
except Anomaly Detection.
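A sketch of the cross-validation call that produces the cvResults used below; the SDCA trainer stands in for whichever trainer is used:
// Define the trainer and train with 5-fold cross-validation
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
var cvResults = mlContext.Regression.CrossValidate(transformedData, sdcaEstimator, numberOfFolds: 5);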
IEnumerable<double> rSquared =
cvResults
.Select(fold => fold.Metrics.RSquared);
If you inspect the contents of the rSquared variable, the output should be five values ranging from 0-1, where
closer to 1 means better. Using metrics like R-Squared, select the models from best to worst performing. Then, select
the top model to make predictions or perform additional operations with.
Learn how to inspect intermediate data during loading, processing, and model training steps in ML.NET.
Intermediate data is the output of each stage in the machine learning pipeline.
Intermediate data like the one represented below which is loaded into an IDataView can be inspected in various
ways in ML.NET.
To optimize performance, set reuseRowObject to true . Doing so will lazily populate the same object with the data
of the current row as it's being evaluated as opposed to creating a new object for each row in the dataset.
// Create an IEnumerable of HousingData objects from IDataView
IEnumerable<HousingData> housingDataEnumerable =
mlContext.Data.CreateEnumerable<HousingData>(data, reuseRowObject: true);
WARNING
Converting the result of CreateEnumerable to an array or list will load all the requested IDataView rows into memory
which may affect performance.
Once the collection has been created, you can perform operations on the data. The code snippet below takes the
first three rows in the dataset and calculates the average current price.
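A sketch of that calculation, assuming a using directive for System.Linq:
// Average the current price of the first three rows
float averageCurrentPrice = housingDataEnumerable
    .Take(3)
    .Average(house => house.CurrentPrice);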
// Create DataViewCursor
using (DataViewRowCursor cursor = data.GetRowCursor(columns))
{
// Define variables where extracted values will be stored to
float size = default;
VBuffer<float> historicalPrices = default;
float currentPrice = default;
The model building process is experimental and iterative. To preview what data would look like after pre-
processing or training a machine learning model on a subset of the data, use the Preview method which returns a
DataDebuggerPreview . The result is an object with ColumnView and RowView properties which are both an
IEnumerable and contain the values in a particular column or row. Specify the number of rows to apply the
transformation to with the maxRows parameter.
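A sketch of such a call; dataPrepEstimator stands in for whatever pipeline is being inspected:
// Preview the effect of the pipeline on up to 10 rows
var preview = dataPrepEstimator.Preview(data, maxRows: 10);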
Learn how to explain ML.NET machine learning model predictions by understanding the contribution features
have to predictions using Permutation Feature Importance (PFI).
Machine learning models are often thought of as black boxes that take inputs and generate an output. The
intermediate steps or interactions among the features that influence the output are rarely understood. As machine
learning is introduced into more aspects of everyday life, such as healthcare, it's of utmost importance to
understand why a machine learning model makes the decisions it does. For example, if diagnoses are made by a
machine learning model, healthcare professionals need a way to look into the factors that went into making that
diagnosis. Providing the right diagnosis could make a great difference on whether a patient has a speedy recovery
or not. Therefore, the higher the level of explainability in a model, the greater confidence healthcare professionals
have to accept or reject the decisions made by the model.
Various techniques are used to explain models, one of which is PFI. PFI is a technique used to explain classification
and regression models that is inspired by Breiman's Random Forests paper (see section 10). At a high level, the way
it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the
performance metric of interest decreases. The larger the change, the more important that feature is.
Additionally, by highlighting the most important features, model builders can focus on using a subset of more
meaningful features which can potentially reduce noise and training time.
1,24,13,1,0.59,3,96,11,23,608,14,13,32
4,80,18,1,0.37,5,14,7,4,346,19,13,41
2,98,16,1,0.25,10,5,1,8,689,13,36,12
class HousingPriceData
{
[LoadColumn(0)]
public float CrimeRate { get; set; }
[LoadColumn(1)]
public float ResidentialZones { get; set; }
[LoadColumn(2)]
public float CommercialZones { get; set; }
[LoadColumn(3)]
public float NearWater { get; set; }
[LoadColumn(4)]
public float ToxicWasteLevels { get; set; }
[LoadColumn(5)]
public float AverageRoomNumber { get; set; }
[LoadColumn(6)]
public float HomeAge { get; set; }
[LoadColumn(7)]
public float BusinessCenterDistance { get; set; }
[LoadColumn(8)]
public float HighwayAccess { get; set; }
[LoadColumn(9)]
public float TaxRate { get; set; }
[LoadColumn(10)]
public float StudentTeacherRatio { get; set; }
[LoadColumn(11)]
public float PercentPopulationBelowPoverty { get; set; }
[LoadColumn(12)]
[ColumnName("Label")]
public float Price { get; set; }
}
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
mlContext
.Regression
.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount:3);
Console.WriteLine("Feature\tPFI");
Printing the values for each of the features in featureImportanceMetrics would generate output similar to that
below. Keep in mind that you should expect to see different results because these values vary based on the data
that they are given.
HighwayAccess -0.042731
StudentTeacherRatio -0.012730
BusinessCenterDistance -0.010491
TaxRate -0.008545
AverageRoomNumber -0.003949
CrimeRate -0.003665
CommercialZones 0.002749
HomeAge -0.002426
ResidentialZones -0.002319
NearWater 0.000203
PercentPopulationBelowPoverty 0.000031
ToxicWasteLevels -0.000019
Taking a look at the five most important features for this dataset, the price of a house predicted by this model is
influenced by its proximity to highways, the student-teacher ratio of schools in the area, proximity to major
employment centers, the property tax rate, and the average number of rooms in the home.
Save and load trained models
5/6/2019 • 3 minutes to read • Edit Online
// Create MLContext
MLContext mlContext = new MLContext();
// Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<HousingData>(housingData);
// Train model
ITransformer trainedModel = pipelineEstimator.Fit(data);
// Save model
mlContext.Model.Save(trainedModel, data.Schema, "model.zip");
Because most models and data preparation pipelines inherit from the same set of classes, the save and load
method signatures for these components are the same. Depending on your use case, you can either combine the
data preparation pipeline and model into a single EstimatorChain, which outputs a single ITransformer, or
separate them, thus creating a separate ITransformer for each.
Save a model locally
When saving a model you need two things:
1. The ITransformer of the model.
2. The DataViewSchema of the ITransformer 's expected input.
After training the model, use the Save method to save the trained model to a file called model.zip using the
DataViewSchema of the input data.
// Create MLContext
MLContext mlContext = new MLContext();
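A sketch of loading the saved model back from disk:
// Load trained model and its input schema
DataViewSchema modelSchema;
ITransformer trainedModel = mlContext.Model.Load("model.zip", out modelSchema);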
When working with separate data preparation pipelines and models, the same process as single pipelines applies,
except now both pipelines need to be saved and loaded simultaneously.
Given separate data preparation and model training pipelines:
// Create MLContext
MLContext mlContext = new MLContext();
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}
Output data
Like the Features and Label input column names, ML.NET has default names for the predicted value columns
produced by a model. Depending on the task the name may differ.
Because the algorithm used in this sample is a linear regression algorithm, the default name of the output column
is Score which is defined by the ColumnName attribute on the PredictedPrice property.
The HousingPrediction data model inherits from HousingData to make it easy to visualize the original input data
along with the output generated by the model.
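A sketch of the prediction class described above:
public class HousingPrediction : HousingData
{
    [ColumnName("Score")]
    public float PredictedPrice { get; set; }
}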
//Create MLContext
MLContext mlContext = new MLContext();
// Create PredictionEngines
PredictionEngine<HousingData, HousingPrediction> predictionEngine =
mlContext.Model.CreatePredictionEngine<HousingData, HousingPrediction>(predictionPipeline);
Then, use the Predict method and pass in your input data as a parameter. Notice that using the Predict method
does not require the input to be an IDataView. This is because it conveniently internalizes the input data type
manipulation, so you can pass in an object of the input data type. Additionally, since CurrentPrice is the target or
label you're trying to predict using new data, it's assumed there is no value for it at the moment.
// Input Data
HousingData inputData = new HousingData
{
Size = 900f,
HistoricalPrices = new float[] { 155000f, 190000f, 220000f }
};
// Get Prediction
HousingPrediction prediction = predictionEngine.Predict(inputData);
If you access the Score property of the prediction object, you should get a value similar to 150079 .
Batch prediction
Given the following data, load it into an IDataView . In this case, the name of the IDataView is inputData . Because
CurrentPrice is the target or label you're trying to predict using new data, it's assumed there is no value for it at
the moment.
// Actual data
HousingData[] housingData = new HousingData[]
{
new HousingData
{
Size = 850f,
HistoricalPrices = new float[] { 150000f,175000f,210000f }
},
new HousingData
{
Size = 900f,
HistoricalPrices = new float[] { 155000f, 190000f, 220000f }
},
new HousingData
{
Size = 550f,
HistoricalPrices = new float[] { 99000f, 98000f, 130000f }
}
};
Then, use the Transform method to apply the data transformations and generate predictions.
// Predicted Data
IDataView predictions = predictionPipeline.Transform(inputData);
The predicted values in the score column should look like the following:
OBSERVATION PREDICTION
1 144638.2
2 150079.4
3 107789.8
Re-train a model
6/26/2019 • 2 minutes to read • Edit Online
// Create MLContext
MLContext mlContext = new MLContext();
Re-train model
The process for retraining a model is no different from that of training a model. The only difference is that the Fit
method, in addition to the data, also takes as input the original learned model parameters and uses them as a
starting point in the re-training process.
// New Data
HousingData[] housingData = new HousingData[]
{
new HousingData
{
Size = 850f,
HistoricalPrices = new float[] { 150000f,175000f,210000f },
CurrentPrice = 205000f
},
new HousingData
{
Size = 900f,
HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
CurrentPrice = 210000f
},
new HousingData
{
Size = 550f,
HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
CurrentPrice = 180000f
}
};
// Preprocess Data
IDataView transformedNewData = dataPrepPipeline.Transform(newData);
// Retrain model
RegressionPredictionTransformer<LinearRegressionModelParameters> retrainedModel =
mlContext.Regression.Trainers.OnlineGradientDescent()
.Fit(transformedNewData, originalModelParameters);
The table below shows what the output might look like.
Learn how to deploy a pre-trained ML.NET machine learning model for predictions over HTTP through an Azure
Functions serverless environment.
NOTE
PredictionEnginePool service extension is currently in preview.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload and "Azure
development" installed.
Microsoft.NET.Sdk.Functions NuGet Package version 1.0.28+.
Azure Functions Tools
Powershell
Pre-trained model. Use the ML.NET Sentiment Analysis tutorial to build your own model or download this pre-
trained sentiment analysis machine learning model
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using Microsoft.Extensions.ML;
using SentimentAnalysisFunctionsApp.DataModels;
By default, the AnalyzeSentiment class is static . Make sure to remove the static keyword from the class
definition.
using Microsoft.ML.Data;
Remove the existing class definition and add the following code to the SentimentData.cs file:
[LoadColumn(1)]
[ColumnName("Label")]
public bool Sentiment;
}
4. In Solution Explorer, right-click the DataModels directory, and then select Add > New Item.
5. In the Add New Item dialog box, select Class and change the Name field to SentimentPrediction.cs. Then,
select the Add button. The SentimentPrediction.cs file opens in the code editor. Add the following using
statement to the top of SentimentPrediction.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code to the SentimentPrediction.cs file:
[ColumnName("PredictedLabel")]
public bool Prediction { get; set; }
SentimentPrediction inherits from SentimentData which provides access to the original data in the
SentimentText property as well as the output generated by the model.
using Microsoft.Azure.Functions.Extensions.DependencyInjection;
using Microsoft.Extensions.ML;
using SentimentAnalysisFunctionsApp;
using SentimentAnalysisFunctionsApp.DataModels;
Remove the existing code below the using statements and add the following code to the Startup.cs file:
[assembly: FunctionsStartup(typeof(Startup))]
namespace SentimentAnalysisFunctionsApp
{
public class Startup : FunctionsStartup
{
public override void Configure(IFunctionsHostBuilder builder)
{
builder.Services.AddPredictionEnginePool<SentimentData, SentimentPrediction>()
.FromFile("MLModels/sentiment_model.zip");
}
}
}
At a high level, this code initializes the objects and services automatically when requested by the application
instead of having to manually do it.
WARNING
PredictionEngine is not thread-safe. For improved performance and thread safety, use the PredictionEnginePool
service, which creates an ObjectPool of PredictionEngine objects for application use.
This code assigns the PredictionEnginePool by passing it to the function's constructor which you get via
dependency injection.
//Make Prediction
SentimentPrediction prediction = _predictionEnginePool.Predict(data);
//Return Prediction
return (ActionResult)new OkObjectResult(sentiment);
}
When the Run method executes, the incoming data from the HTTP request is deserialized and used as input for
the PredictionEnginePool . The Predict method is then called to generate a prediction and return the result to the
user.
Test locally
Now that everything is set up, it's time to test the application:
1. Run the application
2. Open PowerShell and enter the code into the prompt where PORT is the port your application is running
on. Typically the port is 7071.
Negative
Congratulations! You have successfully served your model to make predictions over the internet using an Azure
Function.
Next Steps
Deploy to Azure
Deploy a model in an ASP.NET Core Web API
8/21/2019 • 5 minutes to read • Edit Online
Learn how to serve a pre-trained ML.NET machine learning model on the web using an ASP.NET Core Web API.
Serving a model over a web API enables predictions via standard HTTP methods.
NOTE
PredictionEnginePool service extension is currently in preview.
Prerequisites
Visual Studio 2017 15.6 or later with the ".NET Core cross-platform development" workload installed.
Powershell.
Pre-trained model. Use the ML.NET Sentiment Analysis tutorial to build your own model or download this pre-
trained sentiment analysis machine learning model
using Microsoft.ML.Data;
Remove the existing class definition and add the following code to the SentimentData.cs file:
[LoadColumn(1)]
[ColumnName("Label")]
public bool Sentiment;
}
4. In Solution Explorer, right-click the DataModels directory, and then select Add > New Item.
5. In the Add New Item dialog box, select Class and change the Name field to SentimentPrediction.cs. Then,
select the Add button. The SentimentPrediction.cs file opens in the code editor. Add the following using
statement to the top of SentimentPrediction.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code to the SentimentPrediction.cs file:
[ColumnName("PredictedLabel")]
public bool Prediction { get; set; }
SentimentPrediction inherits from SentimentData . This makes it easier to see the original data in the
SentimentText property along with the output generated by the model.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.ML;
using SentimentAnalysisWebAPI.DataModels;
At a high level, this code initializes the objects and services automatically when requested by the application
instead of having to manually do it.
WARNING
PredictionEngine is not thread-safe. For improved performance and thread safety, use the PredictionEnginePool
service, which creates an ObjectPool of PredictionEngine objects for application use. Read the following blog post to
learn more about creating and using PredictionEngine object pools in ASP.NET Core.
using System;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.ML;
using SentimentAnalysisWebAPI.DataModels;
Remove the existing class definition and add the following code to the PredictController.cs file:
public class PredictController : ControllerBase
{
private readonly PredictionEnginePool<SentimentData, SentimentPrediction> _predictionEnginePool;
public PredictController(PredictionEnginePool<SentimentData,SentimentPrediction>
predictionEnginePool)
{
_predictionEnginePool = predictionEnginePool;
}
[HttpPost]
public ActionResult<string> Post([FromBody] SentimentData input)
{
if(!ModelState.IsValid)
{
return BadRequest();
}
return Ok(sentiment);
}
}
This code assigns the PredictionEnginePool by passing it to the controller's constructor which you get via
dependency injection. Then, the Predict controller's Post method uses the PredictionEnginePool to make
predictions and return the results back to the user if successful.
Negative
Congratulations! You have successfully served your model to make predictions over the internet using an
ASP.NET Core Web API.
Next Steps
Deploy to Azure
Create a game match up list app with Infer.NET and
probabilistic programming
5/7/2019 • 4 minutes to read • Edit Online
This how -to guide teaches you about probabilistic programming using Infer.NET. Probabilistic programming is a
machine learning approach where custom models are expressed as computer programs. It allows for incorporating
domain knowledge in the models and makes the machine learning system more interpretable. It also supports
online inference – the process of learning as new data arrives. Infer.NET is used in various products at Microsoft in
Azure, Xbox, and Bing.
Prerequisites
Local development environment setup
This how -to guide expects you to have a machine you can use for development. The .NET Get Started in 10
minutes tutorial has instructions for setting up your local development environment on Mac, PC, or Linux.
The dotnetcommand creates a new application of type console . The -o parameter creates a directory named
myApp where your app is stored and populates it with the required files. The cd myApp command puts you into the
newly created app directory.
1 Player 0 Player 1
2 Player 0 Player 3
3 Player 0 Player 4
4 Player 1 Player 2
5 Player 3 Player 1
6 Player 4 Player 2
With a closer look at the sample data, you’ll notice that players 3 and 4 both have one win and one loss. Let's see
what the rankings look like using probabilistic programming. Notice also there is a player zero because even office
match up lists are zero based to us developers.
{
using System;
using System.Linq;
using Microsoft.ML.Probabilistic;
using Microsoft.ML.Probabilistic.Distributions;
using Microsoft.ML.Probabilistic.Models;
class Program
{
using (Variable.ForEach(game))
{
// The player performance is a noisy version of their skill
var winnerPerformance = Variable.GaussianFromMeanAndVariance(playerSkills[winners[game]], 1.0);
var loserPerformance = Variable.GaussianFromMeanAndVariance(playerSkills[losers[game]], 1.0);
// Run inference
var inferenceEngine = new InferenceEngine();
var inferredSkills = inferenceEngine.Infer<Gaussian[]>(playerSkills);
dotnet run
Results
Your results should be similar to the following:
Compiling model...done.
Iterating:
.........|.........|.........|.........|.........| 50
Player 0 skill: Gaussian(9.517, 3.926)
Player 3 skill: Gaussian(6.834, 3.892)
Player 4 skill: Gaussian(6.054, 4.731)
Player 1 skill: Gaussian(4.955, 3.503)
Player 2 skill: Gaussian(2.639, 4.288)
In the results, notice that player 3 ranks slightly higher than player 4 according to our model. That’s because the
victory of player 3 over player 1 is more significant than the victory of player 4 over player 2 – note that player 1
beats player 2. Player 0 is the overall champ!
Keep learning
Designing statistical models is a skill on its own. The Microsoft Research Cambridge team has written a free online
book, which gives a gentle introduction to the article. Chapter 3 of this book covers the TrueSkill model in more
detail. Once you have a model in mind, you can transform it into code using the extensive documentation on the
Infer.NET website.
Next steps
Check out the Infer.NET GitHub repository to continue learning and find more samples.
dotnet/infer GitHub repository
Machine learning resources
5/15/2019 • 2 minutes to read • Edit Online
The following ML.NET resources may be helpful to build custom AI solutions and integrate them into your .NET
applications:
Machine learning glossary: contains important machine learning term definitions.
Machine learning basics: provides links to learning resources to get started with machine learning.
Machine learning tasks: describes various machine learning usage scenarios supported by ML.NET.
Data transforms: provides the overview of data transforms supported by ML.NET.
Machine learning glossary of important terms
8/1/2019 • 7 minutes to read • Edit Online
The following list is a compilation of important machine learning terms that are useful as you build your custom
models in ML.NET.
Accuracy
In classification, accuracy is the number of correctly classified items divided by the total number of items in the test
set. Ranges from 0 (least accurate) to 1 (most accurate). Accuracy is one of evaluation metrics of the model
performance. Consider it in conjunction with precision, recall, and F -score.
Binary classification
A classification case where the label is only one out of two classes. For more information, see the Binary
classification section of the Machine learning tasks topic.
Calibration
Calibration is the process of mapping a raw score onto a class membership, for binary and multiclass classification.
Some ML.NET trainers have a NonCalibrated suffix. These algorithms produce a raw score that then must be
mapped to a class probability.
Catalog
In ML.NET, a catalog is a collection of extension functions, grouped by a common purpose.
For example, each machine learning task (binary classification, regression, ranking etc) has a catalog of available
machine learning algorithms (trainers). The catalog for the binary classification trainers is:
BinaryClassificationCatalog.BinaryClassificationTrainers.
Classification
When the data is used to predict a category, supervised machine learning task is called classification. Binary
classification refers to predicting only two categories (for example, classifying an image as a picture of either a 'cat'
or a 'dog'). Multiclass classification refers to predicting multiple categories (for example, when classifying an image
as a picture of a specific breed of dog).
Coefficient of determination
In regression, an evaluation metric that indicates how well data fits a model. Ranges from 0 to 1. A value of 0
means that the data is random or otherwise cannot be fit to the model. A value of 1 means that the model exactly
matches the data. This is often referred to as r2, R 2, or r-squared.
Data
Data is central to any machine learning application. In ML.NET data is represented by IDataView objects. Data
view objects:
are made up of columns and rows
are lazily evaluated, that is they only load data when an operation calls for it
contain a schema that defines the type, format and length of each column
Estimator
A class in ML.NET that implements the IEstimator<TTransformer> interface.
An estimator is a specification of a transformation (both data preparation transformation and machine learning
model training transformation). Estimators can be chained together into a pipeline of transformations. The
parameters of an estimator or pipeline of estimators are learned when Fit is called. The result of Fit is a
Transformer.
Extension method
A .NET method that is part of a class but is defined outside of the class. The first parameter of an extension method
is a static this reference to the class to which the extension method belongs.
Extension methods are used extensively in ML.NET to construct instances of estimators.
Feature
A measurable property of the phenomenon being measured, typically a numeric (double) value. Multiple features
are referred to as a Feature vector and typically stored as double[] . Features define the important characteristics
of the phenomenon being measured. For more information, see the Feature article on Wikipedia.
Feature engineering
Feature engineering is the process that involves defining a set of features and developing software that produces
feature vectors from available phenomenon data, i.e., feature extraction. For more information, see the Feature
engineering article on Wikipedia.
F-score
In classification, an evaluation metric that balances precision and recall.
Hyperparameter
A parameter of a machine learning algorithm. Examples include the number of trees to learn in a decision forest or
the step size in a gradient descent algorithm. Values of Hyperparameters are set before training the model and
govern the process of finding the parameters of the prediction function, for example, the comparison points in a
decision tree or the weights in a linear regression model. For more information, see the Hyperparameter article on
Wikipedia.
Label
The element to be predicted with the machine learning model. For example, the breed of dog or a future stock
price.
Log loss
In classification, an evaluation metric that characterizes the accuracy of a classifier. The smaller log loss is, the more
accurate a classifier is.
Loss function
A loss function is the difference between the training label values and the prediction made by the model. The
parameters of the model are estimated by minimizing the loss function.
Different trainers can be configured with different loss functions.
Model
Traditionally, the parameters for the prediction function. For example, the weights in a linear regression model or
the split points in a decision tree. In ML.NET, a model contains all the information necessary to predict the label of
a domain object (for example, image or text). This means that ML.NET models include the featurization steps
necessary as well as the parameters for the prediction function.
Multiclass classification
A classification case where the label is one out of three or more classes. For more information, see the Multiclass
classification section of the Machine learning tasks topic.
N-gram
A feature extraction scheme for text data: any sequence of N words turns into a feature value.
Normalization
Normalization is the process of scaling floating point data to values between 0 and 1. Many of the training
algorithms used in ML.NET require input feature data to be normalized. ML.NET provides a series of transforms
for normalization
Pipeline
All of the operations needed to fit a model to a data set. A pipeline consists of data import, transformation,
featurization, and learning steps. Once a pipeline is trained, it turns into a model.
Precision
In classification, the precision for a class is the number of items correctly predicted as belonging to that class
divided by the total number of items predicted as belonging to the class.
Recall
In classification, the recall for a class is the number of items correctly predicted as belonging to that class divided
by the total number of items that actually belong to the class.
Regularization
Regularization penalizes a linear model for being too complicated. There are two types of regularization:
$L_1$ regularization zeros weights for insignificant features. The size of the saved model may become smaller
after this type of regularization.
$L_2$ regularization minimizes weight range for insignificant features, This is a more general process and is
less sensitive to outliers.
Regression
A supervised machine learning task where the output is a real value, for example, double. Examples include
predicting stock prices. For more information, see the Regression section of the Machine learning tasks topic.
Scoring
Scoring is the process of applying new data to a trained machine learning model, and generating predictions.
Scoring is also known as inferencing. Depending on the type of model, the score may be a raw value, a probability,
or a category.
Training
The process of identifying a model for a given training data set. For a linear model, this means finding the weights.
For a tree, it involves identifying the split points.
Transformer
An ML.NET class that implements the ITransformer interface.
A transformer transforms one IDataView into another. A transformer is created by training an estimator, or an
estimator pipeline.
Unsupervised machine learning
A subclass of machine learning in which a desired model finds hidden (or latent) structure in data. Examples
include clustering, topic modeling, and dimensionality reduction. For more information, see the Unsupervised
learning article on Wikipedia.
Machine learning tasks in ML.NET
7/30/2019 • 7 minutes to read • Edit Online
When building a machine learning model, you first need to define what you are hoping to achieve with your data.
This allows you to choose the right machine learning task for your situation. The following list describes the
different machine learning tasks that you can choose from and some common use cases.
Once you have decided which task works for your scenario, then you need to choose the best algorithm to train
your model. The available algorithms are listed in the section for each task.
Binary classification
A supervised machine learning task that is used to predict which of two classes (categories) an instance of data
belongs to. The input of a classification algorithm is a set of labeled examples, where each label is an integer of
either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class
of new unlabeled instances. Examples of binary classification scenarios include:
Understanding sentiment of Twitter comments as either "positive" or "negative".
Diagnosing whether a patient has a certain disease or not.
Making a decision to mark an email as "spam" or not.
Determining if a photo contains a dog or fruit.
For more information, see the Binary classification article on Wikipedia.
Binary classification trainers
You can train a binary classification model using the following algorithms:
AveragedPerceptronTrainer
SdcaLogisticRegressionBinaryTrainer
SdcaNonCalibratedBinaryTrainer
SymbolicSgdLogisticRegressionBinaryTrainer
LbfgsLogisticRegressionBinaryTrainer
LightGbmBinaryTrainer
FastTreeBinaryTrainer
FastForestBinaryTrainer
GamBinaryTrainer
FieldAwareFactorizationMachineTrainer
PriorTrainer
LinearSvmTrainer
Binary classification inputs and outputs
For best results with binary classification, the training data should be balanced (that is, equal numbers of positive
and negative training data). Missing values should be handled before training.
The input label column data must be Boolean. The input features column data must be a fixed-size vector of
Single.
These trainers outputs the following columns:
OUTPUT COLUMN NAME COLUMN TYPE DESCRIPTION
Multiclass classification
A supervised machine learning task that is used to predict the class (category) of an instance of data. The input of
a classification algorithm is a set of labeled examples. Each label normally starts as text. It is then run through the
TermTransform, which converts it to the Key (numeric) type. The output of a classification algorithm is a classifier,
which you can use to predict the class of new unlabeled instances. Examples of multi-class classification scenarios
include:
Determining the breed of a dog as a "Siberian Husky", "Golden Retriever", "Poodle", etc.
Understanding movie reviews as "positive", "neutral", or "negative".
Categorizing hotel reviews as "location", "price", "cleanliness", etc.
For more information, see the Multiclass classification article on Wikipedia.
NOTE
One vs all upgrades any binary classification learner to act on multiclass datasets. More information on [Wikipedia]
(https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest).
Regression
A supervised machine learning task that is used to predict the value of the label from a set of related features.
The label can be of any real value and is not from a finite set of values as in classification tasks. Regression
algorithms model the dependency of the label on its related features to determine how the label will change as
the values of the features are varied. The input of a regression algorithm is a set of examples with labels of known
values. The output of a regression algorithm is a function, which you can use to predict the label value for any
new set of input features. Examples of regression scenarios include:
Predicting house prices based on house attributes such as number of bedrooms, location, or size.
Predicting future stock prices based on historical data and current market trends.
Predicting sales of a product based on advertising budgets.
Regression trainers
You can train a regression model using the following algorithms:
LbfgsPoissonRegressionTrainer
LightGbmRegressionTrainer
SdcaRegressionTrainer
OlsTrainer
OnlineGradientDescentTrainer
FastTreeRegressionTrainer
FastTreeTweedieTrainer
FastForestRegressionTrainer
GamRegressionTrainer
Regression inputs and outputs
The input label column data must be Single.
The trainers for this task output the following:
Clustering
An unsupervised machine learning task that is used to group instances of data into clusters that contain similar
characteristics. Clustering can also be used to identify relationships in a dataset that you might not logically
derive by browsing or simple observation. The inputs and outputs of a clustering algorithm depends on the
methodology chosen. You can take a distribution, centroid, connectivity, or density-based approach. ML.NET
currently supports a centroid-based approach using K-Means clustering. Examples of clustering scenarios
include:
Understanding segments of hotel guests based on habits and characteristics of hotel choices.
Identifying customer segments and demographics to help build targeted advertising campaigns.
Categorizing inventory based on manufacturing metrics.
Clustering trainer
You can train a clustering model using the following algorithm:
KMeansTrainer
Clustering inputs and outputs
The input features data must be Single. No labels are needed.
This trainer outputs the following:
Anomaly detection
This task creates an anomaly detection model by using Principal Component Analysis (PCA). PCA-Based
Anomaly Detection helps you build a model in scenarios where it is easy to obtain training data from one class,
such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.
An established technique in machine learning, PCA is frequently used in exploratory data analysis because it
reveals the inner structure of the data and explains the variance in the data. PCA works by analyzing data that
contains multiple variables. It looks for correlations among the variables and determines the combination of
values that best captures differences in outcomes. These combined feature values are used to create a more
compact feature space called the principal components.
Anomaly detection encompasses many important tasks in machine learning:
Identifying transactions that are potentially fraudulent.
Learning patterns that indicate that a network intrusion has occurred.
Finding abnormal clusters of patients.
Checking values entered into a system.
Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use
for modeling. The algorithms included in this category have been especially designed to address the core
challenges of building and training models by using imbalanced data sets.
Anomaly detection trainer
You can train an anomaly detection model using the following algorithm:
RandomizedPcaTrainer
Anomaly detection inputs and outputs
The input features must be a fixed-sized vector of Single.
This trainer outputs the following:
OUTPUT NAME TYPE DESCRIPTION
Ranking
A ranking task constructs a ranker from a set of labeled examples. This example set consists of instance groups
that can be scored with a given criteria. The ranking labels are { 0, 1, 2, 3, 4 } for each instance. The ranker is
trained to rank new instance groups with unknown scores for each instance. ML.NET ranking learners are
machine learned ranking based.
Ranking training algorithms
You can train a ranking model with the following algorithms:
LightGbmRankingTrainer
FastTreeRankingTrainer
Ranking input and outputs
The input label data type must be key type or Single. The value of the label determines relevance, where higher
values indicate higher relevance. If the label is a key type, then the key index is the relevance value, where the
smallest index is the least relevant. If the label is a Single, larger values indicate higher relevance.
The feature data must be a fixed size vector of Single and input row group column must be key type.
This trainer outputs the following:
Recommendation
A recommendation task enables producing a list of recommended products or services. ML.NET uses Matrix
factorization (MF ), a collaborative filtering algorithm for recommendations when you have historical product
rating data in your catalog. For example, you have historical movie rating data for your users and want to
recommend other movies they are likely to watch next.
Recommendation training algorithms
You can train a recommendation model with the following algorithm:
MatrixFactorizationTrainer
How to choose an ML.NET algorithm
8/13/2019 • 4 minutes to read • Edit Online
For each ML.NET task, there are multiple training algorithms to choose from. Which one to choose depends on the
problem you are trying to solve, the characteristics of your data, and the compute and storage resources you have
available. It is important to note that training a machine learning model is an iterative process. You might need to
try multiple algorithms to find the one that works best.
Algorithms operate on features. Features are numerical values computed from your input data. They are optimal
inputs for machine learning algorithms. You transform your raw input data into features using one or more data
transforms. For example, text data is transformed into a set of word counts and word combination counts. Once the
features have been extracted from a raw data type using data transforms, they are referred to as featurized. For
example, featurized text, or featurized image data.
Linear algorithms
Linear algorithms produce a model that calculates scores from a linear combination of the input data and a set of
weights. The weights are parameters of the model estimated during training.
Linear algorithms work well for features that are linearly separable.
Before training with a linear algorithm, the features should be normalized. This prevents one feature having more
influence over the result than others.
In general linear algorithms are scalable and fast, cheap to train, cheap to predict. They scale by the number of
features and approximately by the size of the training data set.
Linear algorithms make multiple passes over the training data. If your dataset fits into memory, then adding a
cache checkpoint to your ML.NET pipeline before appending the trainer, will make the training run faster.
Linear Trainers
Stochastic dual coordinated ascent Tuning not needed for good default SdcaLogisticRegressionBinaryTrainer
performance SdcaNonCalibratedBinaryTrainer
SdcaMaximumEntropyMulticlassTrainer
SdcaNonCalibratedMulticlassTrainer
SdcaRegressionTrainer
Symbolic stochastic gradient descent Fastest and most accurate linear binary SymbolicSgdLogisticRegressionBinaryTra
classification trainer. Scales well with iner
number of processors
Light gradient boosted machine Fastest and most accurate of the binary LightGbmBinaryTrainer
classification tree trainers. Highly LightGbmMulticlassTrainer
tunable LightGbmRegressionTrainer
LightGbmRankingTrainer
Generalized additive model (GAM) Best for problems that perform well GamBinaryTrainer
with tree algorithms but where GamRegressionTrainer
explainability is a priority
Matrix factorization
PROPERTIES TRAINERS
Meta algorithms
These trainers create a multi-class trainer from a binary trainer. Use with AveragedPerceptronTrainer,
LbfgsLogisticRegressionBinaryTrainer, SymbolicSgdLogisticRegressionBinaryTrainer, LightGbmBinaryTrainer,
FastTreeBinaryTrainer, FastForestBinaryTrainer, GamBinaryTrainer.
K-Means
PROPERTIES TRAINERS
Naive Bayes
PROPERTIES TRAINERS
Use this multi-class classification trainer when the features are NaiveBayesMulticlassTrainer
independent, and the training dataset is small.
Prior Trainer
PROPERTIES TRAINERS
SelectColumns Select one or more columns to keep from the input data
NormalizeMeanVariance Subtract the mean (of the training data) and divide by the
variance (of the training data)
NormalizeGlobalContrast Scale each value in a row by subtracting the mean of the row
data and divide by either the standard deviation or l2-norm
(of the row data), and multiply by a configurable scale factor
(default 2)
TRANSFORM DEFINITION
NormalizeBinning Assign the input value to a bin index and divide by the
number of bins to produce a float value between 0 and 1. The
bin boundaries are calculated to evenly distribute the training
data across bins
NormalizeSupervisedBinning Assign the input value to a bin based on its correlation with
label column
NormalizeMinMax Scale the input by the difference between the minimum and
maximum values in the training data
Text transformations
TRANSFORM DEFINITION
RemoveDefaultStopWords Remove default stop words for the specified language from
input columns
Image transformations
TRANSFORM DEFINITION
DetectAnomalyBySrCnn Detect anomalies in the input time series data using the
Spectral Residual (SR) algorithm
TRANSFORM DEFINITION
Missing values
TRANSFORM DEFINITION
Feature selection
TRANSFORM DEFINITION
SelectFeaturesBasedOnMutualInformation Select the features on which the data in the label column is
most dependent
Feature transformations
TRANSFORM DEFINITION
Explainability transformations
TRANSFORM DEFINITION
Calibration transformations
TRANSFORM DEFINITION
Platt(String, String, String) Transforms a binary classifier raw score into a class probability
using logistic regression with parameters estimated using the
training data
Platt(Double, Double, String) Transforms a binary classifier raw score into a class probability
using logistic regression with fixed parameters
Custom transformations
TRANSFORM DEFINITION
Accuracy Accuracy is the proportion of correct The closer to 1.00, the better. But
predictions with a test data set. It is the exactly 1.00 indicates an issue
ratio of number of correct predictions (commonly: label/target leakage, over-
to the total number of input samples. It fitting, or testing with training data).
works well only if there are similar When the test data is unbalanced
number of samples belonging to each (where most of the instances belong to
class. one of the classes), the dataset is very
small, or scores approach 0.00 or 1.00,
then accuracy doesn’t really capture the
effectiveness of a classifier and you need
to check additional metrics.
AUC aucROC or Area under the curve: This is The closer to 1.00, the better. It
measuring the area under the curve should be greater than 0.50 for a model
created by sweeping the true positive to be acceptable; a model with AUC of
rate vs. the false positive rate. 0.50 or less is worthless.
AUCPR aucPR or Area under the curve of a The closer to 1.00, the better. High
Precision-Recall curve: Useful measure scores close to 1.00 show that the
of success of prediction when the classifier is returning accurate results
classes are very imbalanced (highly (high precision), as well as returning a
skewed datasets). majority of all positive results (high
recall).
F1-score F1 score also known as balanced F- The closer to 1.00, the better. An F1
score or F-measure. It's the harmonic score reaches its best value at 1.00 and
mean of the precision and recall. F1 worst score at 0.00. It tells you how
Score is helpful when you want to seek precise your classifier is.
a balance between Precision and Recall.
For further details on binary classification metrics read the following articles:
Accuracy, Precision, Recall or F1?
Binary Classification Metrics class
The Relationship Between Precision-Recall and ROC Curves
Micro-Accuracy Micro-average Accuracy aggregates the The closer to 1.00, the better. In a
contributions of all classes to compute multi-class classification task, micro-
the average metric. It is the fraction of accuracy is preferable over macro-
instances predicted correctly. The micro- accuracy if you suspect there might be
average does not take class class imbalance (i.e you may have many
membership into account. Basically, more examples of one class than of
every sample-class pair contributes other classes).
equally to the accuracy metric.
Macro-Accuracy Macro-average Accuracy is the average The closer to 1.00, the better. It
accuracy at the class level. The accuracy computes the metric independently for
for each class is computed and the each class and then takes the average
macro-accuracy is the average of these (hence treating all classes equally)
accuracies. Basically, every class
contributes equally to the accuracy
metric. Minority classes are given equal
weight as the larger classes. The macro-
average metric gives the same weight
to each class, no matter how many
instances from that class the dataset
contains.
Log-loss Logarithmic loss measures the The closer to 0.00, the better. A
performance of a classification model perfect model would have a log-loss of
where the prediction input is a 0.00. The goal of our machine learning
probability value between 0.00 and models is to minimize this value.
1.00. Log-loss increases as the
predicted probability diverges from the
actual label.
Log-Loss Reduction Logarithmic loss reduction can be Ranges from -inf and 1.00, where
interpreted as the advantage of the 1.00 is perfect predictions and 0.00
classifier over a random prediction. indicates mean predictions. For
example, if the value equals 0.20, it can
be interpreted as "the probability of a
correct prediction is 20% better than
random guessing"
Micro-accuracy is generally better aligned with the business needs of ML predictions. If you want to select a single
metric for choosing the quality of a multiclass classification task, it should usually be micro-accuracy.
Example, for a support ticket classification task: (maps incoming tickets to support teams)
Micro-accuracy -- how often does an incoming ticket get classified to the right team?
Macro-accuracy -- for an average team, how often is an incoming ticket correct for their team?
Macro-accuracy overweights small teams in this example; a small team which gets only 10 tickets per year counts
as much as a large team with 10k tickets per year. Micro-accuracy in this case correlates better with the business
need of, "how much time/money can the company save by automating my ticket routing process".
For further details on multi-class classification metrics read the following articles:
Micro- and Macro-average of Precision, Recall and F -Score
Multiclass Classification with Imbalanced Dataset
R-Squared R-squared (R2), or Coefficient of The closer to 1.00, the better quality.
determination represents the predictive However, sometimes low R-squared
power of the model as a value between values (such as 0.50) can be entirely
-inf and 1.00. 1.00 means there is a normal or good enough for your
perfect fit, and the fit can be arbitrarily scenario and high R-squared values are
poor so the scores can be negative. A not always good and be suspicious.
score of 0.00 means the model is
guessing the expected value for the
label. R2 measures how close the actual
test data values are to the predicted
values.
Absolute-loss Absolute-loss or Mean absolute error The closer to 0.00, the better quality.
(MAE) measures how close the Note that the mean absolute error uses
predictions are to the actual outcomes. the same scale as the data being
It is the average of all the model errors, measured (is not normalized to specific
where model error is the absolute range). Absolute-loss, Squared-loss, and
distance between the predicted label RMS-loss can only be used to make
value and the correct label value. This comparisons between models for the
prediction error is calculated for each same dataset or dataset with a similar
record of the test data set. Finally, the label value distribution.
mean value is calculated for all recorded
absolute errors.
RMS-loss RMS-loss or Root Mean Squared Error It is always non-negative, and values
(RMSE) (also called Root Mean Square closer to 0.00 are better. RMSD is a
Deviation, RMSD), measures the measure of accuracy, to compare
difference between values predicted by forecasting errors of different models
a model and the values actually for a particular dataset and not
observed from the environment that is between datasets, as it is scale-
being modeled. RMS-loss is the square dependent.
root of Squared-loss and has the same
units as the label, similar to the
absolute-loss though giving more
weight to larger differences. Root mean
square error is commonly used in
climatology, forecasting, and regression
analysis to verify experimental results.
Cross-validation
Cross-validation is a training and model evaluation technique that splits the data into several partitions and trains
multiple algorithms on these partitions. This technique improves the robustness of the model by holding out data
from the training process. In addition to improving performance on unseen observations, in data-constrained
environments it can be an effective tool for training models with a smaller dataset.
Visit the following link to learn how to use cross validation in ML.NET
Hyperparameter tuning
Training machine learning models is an iterative and exploratory process. For example, what is the optimal number
of clusters when training a model using the K-Means algorithm? The answer depends on many factors such as the
structure of the data. Finding that number would require experimenting with different values for k and then
evaluating performance to determine which value is best. The practice of tuning these parameters to find an
optimal model is known as hyper-parameter tuning.
Automated machine learning is a feature of ML.NET that performs automatic model selection and training. You
specify the machine learning task and supply a dataset, and automated ML chooses the model with the best
metrics. It outputs:
a model file that can be loaded into your prediction application
application code to make predictions
the source code used for feature selection and model training (to understand the model)
NOTE
This feature is currently in Preview, and material may be subject to change.
Automated ML is currently limited to the machine learning tasks of binary classification, multiclass classification,
and regression. The other machine learning tasks will be supported in future releases.
There are three ways to use automated ML:
1. With a graphical user interface, with the ML.NET Model Builder
2. On the command line, with the ML.NET CLI
3. Via an application, with the automated ML API
What is Model Builder and how does it work?
8/19/2019 • 5 minutes to read • Edit Online
ML.NET Model Builder is an intuitive graphical Visual Studio extension to build, train, and deploy custom machine
learning models.
Model Builder uses automated machine learning (AutoML ) to explore different machine learning algorithms and
settings to help you find the one that best suits your scenario.
You don't need machine learning expertise to use Model Builder. All you need is some data, and a problem to
solve. Model Builder generates the code to add the model to your .NET application.
NOTE
Model Builder is currently in Preview.
Scenarios
You can bring many different scenarios to Model Builder, to generate a machine learning model for your
application.
A scenario is a description of the type of prediction you want to make using your data. For example:
predict future product sales volume based on historical sales data
classify sentiments as positive or negative based on customer reviews
detect whether a banking transaction is fraudulent
route customer feedback issues to the correct team in your company
Sentiment analysis can be used to predict positive or negative sentiment of customer feedback. It is an example of
a binary classification model type.
If your scenario requires classification into two categories, you can use this template with your own dataset.
Predict a category (when there are three or more categories )
Multiclass classification can be used to categorize data into three or more classes.
Issue classification can be used to categorize customer feedback (for example, on GitHub) issues using the issue
title and description. It is an example of the multi-class classification model type.
You can use the issue classification template for your scenario if you want to categorize data into three or more
categories.
Predict a number
Regression is used to predict numbers.
Price prediction can be used to predict house prices using location, size, and other characteristics of the house. It is
an example of a regression model type.
You can use the price prediction template for your scenario if you want to predict a numerical value with your own
dataset.
Custom scenario (choose your model type)
The custom scenario allows you to manually choose your model type.
Data
Once you have chosen your model type, Model Builder asks you to provide a dataset. The data is used to train,
evaluate, and choose the best model for your scenario.
Example datasets
If you don't have your own data yet, try out one of these datasets:
Price prediction regression taxi fare data Fare Trip time, distance
Anomaly detection binary classification product sales data Product Sales Month
Sentiment analysis binary classification website comment Label (0 when Comment, Year
data negative sentiment, 1
when positive)
Fraud detection binary classification credit card data Class (1 when Amount, V1-V28
fraudulent, 0 (anonymized features)
otherwise)
Train
Once you select your scenario, data, and label, Model Builder trains the model.
What is training?
Training is an automatic process by which Model Builder teaches your model how to answer questions for your
scenario. Once trained, your model can make predictions with input data that it has not seen before. For example, if
you are predicting house prices and a new house comes on the market, you can predict its sale price.
Because Model Builder uses automated machine learning (AutoML ), it does not require any input or tuning from
you during training.
Evaluate
Evaluation is the process of using the trained model to make predictions with new test data, and then measuring
how good the predictions are.
Model Builder splits the training data into a training set and a test set. The training data (80%) is used to train your
model and the test data (20%) is held back to evaluate your model. Model Builder uses metrics to measure how
good the model is. The specific metrics used are dependent on the type of model. For more information, see model
evaluation metrics.
Improve
If your model performance score is not as good as you want it to be, you can:
Train for a longer period of time. With more time, the automated machine learning engine to try more
algorithms and settings.
Add more data. Sometimes the amount of data is not sufficient to train a high-quality machine learning
model.
Balance your data. For classification tasks, make sure that the training set is balanced across the categories.
For example, if you have four classes for 100 training examples, and the two first classes (tag1 and tag2) are
used for 90 records, but the other two (tag3 and tag4) are only used on the remaining 10 records, the lack of
balanced data may cause your model to struggle to correctly predict tag3 or tag4.
Code
After the evaluation phase, Model Builder outputs a model file, and code that you can use to add the model to your
application. ML.NET models are saved as a zip file. The code to load and use your model is added as a new project
in your solution. Model Builder also adds a sample console app that you can run to see your model in action.
In addition, Model Builder outputs the code that generated the model, so that you can understand the steps used
to generate the model. You can also use the model training code to retrain your model with new data.
What's next?
Install the Model Builder Visual Studio extension
Try price prediction or any regression scenario
How to install ML.NET Model Builder
6/27/2019 • 2 minutes to read • Edit Online
Learn how to install ML.NET Model Builder to add machine learning to your .NET applications.
NOTE
Model Builder is currently in Preview.
Pre-requisites
Visual Studio 2017 15.9.12 or later / Visual Studio 2019
.NET Core 2.1 or later SDK
Limitations
ML.NET Model Builder Extension currently only works on Visual Studio on Windows.
Training dataset limit of 1GB
SQL Server has a limit of 100 thousand rows for training
Microsoft SQL Server Data Tools for Visual Studio 2017 is not supported
Install
ML.NET Model builder can be installed either through the Visual Studio Marketplace or from within Visual Studio.
Visual Studio Marketplace
1. Download from Visual Studio Marketplace
2. Follow prompts to install onto respective Visual Studio version
Visual Studio 2017
1. In the menu bar, select Tools > Extensions and Updates
2. Inside the Extension and Updates prompt, select the Online node.
3. In the search bar, search for ML.NET Model Builder and from the results, select ML.NET Model Builder
(Preview )
4. Follow the prompts to complete the installation
Visual Studio 2019
1. On the menu bar, select Extensions > Manage Extensions
2. Inside the Extension and Updates prompt, select the Online node.
3. Type ML.NET Model Builder into the search bar select ML.NET Model Builder (Preview )
4. Follow the prompts to complete the installation
Uninstall
Visual Studio 2017
1. On the menu bar, select Tools > Extensions and Updates
2. Inside the Extension and Updates prompt, expand the Installed node and select Tools
3. Select ML.NET Model Builder (Preview ) from the list of tools and then, select Uninstall
4. Follow the prompts to complete the uninstallation.
Visual Studio 2019
1. On the menu bar, select Extensions > Manage Extensions
2. Inside the Extension and Updates prompt, expand the Installed node and select Tools
3. Select ML.NET Model Builder (Preview ) from the list of tools and then, select Uninstall
4. Follow the prompts to complete the uninstallation.
Upgrade
The upgrade process is similar to the installation process. Either download the latest version from Visual Studio
Marketplace or use the Extensions Manager in Visual Studio.
Predict prices using regression with Model Builder
8/19/2019 • 6 minutes to read • Edit Online
Learn how to use ML.NET Model Builder to build a regression model() to predict prices. The .NET console app that
you develop in this tutorial predicts taxi fares based on historical New York taxi fare data.
The Model Builder price prediction template can be used for any scenario requiring a numerical prediction value.
Example scenarios include: house price prediction, demand prediction, and sales forecasting.
In this tutorial, you learn how to:
Prepare and understand the data
Choose a scenario
Load the data
Train the model
Evaluate the model
Use the model for predictions
NOTE
Model Builder is currently in Preview.
Pre-requisites
For a list of pre-requisites and installation instructions, visit the Model Builder installation guide.
Choose a scenario
To train your model, you need to select from the list of available machine learning scenarios provided by Model
Builder. In this case, the scenario is Price Prediction .
1. In Solution Explorer, right-click the TaxiFarePrediction project, and select Add > Machine Learning.
2. In the scenario step of the Model Builder tool, select Price Prediction scenario.
1. Because the training data file is more than 10MB, use 600 seconds (10 minutes) as the value for Time to train
(seconds).
2. Select Start Training.
Throughout the training process, progress data is displayed in the Progress section of the train step.
Status displays the completion status of the training process.
Best accuracydisplays the accuracy of the best performing model found by Model Builder so far. Higher
accuracy means the model predicted more correctly on test data.
Best algorithmdisplays the name of the best performing algorithm performed found by Model Builder so far.
Last algorithmdisplays the name of the algorithm most recently used by Model Builder to train the model.
Once training is complete, navigate to the evaluate step.
using System;
using Microsoft.ML;
using TaxiFarePredictionML.Model.DataModels;
// 2. Create PredictionEngine
var predictionEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
The ConsumeModel will load the trained model, create a PredictionEngine for the model and use it to make
predictions on new data.
6. To make a prediction on new data using the model, create a new instance of the ModelInput class and use
the ConsumeModel method. Notice that the fare amount is not part of the input. This is because the model
will generate the prediction for it. Add the following code to the Main method and run the application
// Make prediction
ModelOutput prediction = ConsumeModel(input);
// Print Prediction
Console.WriteLine($"Predicted Fare: {prediction.Score}");
Console.ReadKey();
The output generated by the program should look similar to the snippet below:
If you need to reference the generated projects at a later time inside of another solution, you can find them inside
the C:\Users\%USERNAME%\AppData\Local\Temp\MLVSTools directory.
Next Steps
In this tutorial, you learned how to:
Prepare and understand the data
Choose a scenario
Load the data
Train the model
Evaluate the model
Use the model for predictions
Additional Resources
To learn more about topics mentioned in this tutorial, visit the following resources:
Model Builder Scenarios
Regression
Regression Model Metrics
NYC TLC Taxi Trip data set
How to install the ML.NET Command-Line Interface
(CLI) tool
6/11/2019 • 3 minutes to read • Edit Online
The ML.NET CLI (command-line interface) is a tool you can run on any command-prompt (Windows, Mac, or
Linux) for generating good quality ML.NET models and source code based on training datasets you provide.
NOTE
This topic refers to ML.NET CLI and ML.NET AutoML, which are currently in Preview, and material may be subject to change.
Pre-requisites
.NET Core 2.2 SDK
(Optional) Visual Studio 2017 or 2019
You can either run the generated C# code projects with Visual Studio F5 or with dotnet run (.NET Core CLI).
Note: If after installing .NET Core 2.2 SDK the dotnet tool command is not working, sign out from Windows and
sign in again.
Install
The ML.NET CLI is installed like any other dotnet Global Tool. You use the dotnet tool install .NET Core CLI
command.
The following example shows how to install the ML.NET CLI in the default NuGet feed location:
If the tool can't be installed (that is, if it is not available at the default NuGet feed), error messages are displayed.
Check that the feeds you expected are being checked.
If installation is successful, a message is displayed showing the command used to call the tool and the version
installed, similar to the following example:
You can invoke the tool using the following command: mlnet
Tool 'mlnet' (version 'X.X.X') was successfully installed.
You can confirm the installation was successful by typing the following command:
mlnet
You should see the help for available commands for the mlnet tool such as the 'auto-train' command.
You can also check if the package is properly installed by typing the following command:
'Tab-based auto-completion' (parameter suggestions) works on Windows PowerShell and macOS/Linux bash but
it won't work on Windows CMD.
To enable it, in the current preview version, the end user has to take a few steps once per shell, outlined below.
Once this is done, completions will work for all apps written using System.CommandLine such as the ML.NET CLI.
On the machine where you'd like to enable completion, you'll need to do two things.
1. Install the dotnet-suggest global tool by running the following command:
2. Add the appropriate shim script to your shell profile. You may have to create a shell profile file. The shim
script will forward completion requests from your shell to the dotnet-suggest tool, which delegates to the
appropriate System.CommandLine -based app.
For bash, add the contents of dotnet-suggest-shim.bash to ~/.bash_profile .
For PowerShell, add the contents of dotnet-suggest-shim.ps1 to your PowerShell profile. You can
find the expected path to your PowerShell profile by running the following command in your
console:
echo $profile
Installation directory
The ML.NET CLI can be installed in the default directory or in a specific location. The default directories are:
OS PATH
Linux/macOS $HOME/.dotnet/tools
Windows %USERPROFILE%\.dotnet\tools
These locations are added to the user's path when the SDK is first run, so Global Tools installed there can be called
directly.
Note: the Global Tools are user-specific, not machine global. Being user-specific means you cannot install a Global
Tool that is available to all users of the machine. The tool is only available for each user profile where the tool was
installed.
Global Tools can also be installed in a specific directory. In that case, the .NET Core CLI doesn't add the
location to the PATH environment variable automatically, so the user must make the command available by
including that directory in the path, by calling the command with the directory specified, or by calling the
tool from within the specified directory.
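For example, the following command installs the tool into a custom directory via the standard --tool-path
option of dotnet tool install (the path shown is illustrative):
> dotnet tool install --tool-path /custom/tools mlnet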
See also
Tutorial on 'Getting Started with ML.NET CLI tool'
How to automatically train models with the ML.NET CLI tool
ML.NET CLI auto-train command reference guide
Telemetry in ML.NET CLI
Automate model training with the ML.NET CLI
7/9/2019 • 3 minutes to read • Edit Online
The ML.NET CLI "democratizes" ML.NET for .NET developers when learning ML.NET.
To use the ML.NET API by itself (without the ML.NET AutoML CLI), you need to choose a trainer (an
implementation of a machine learning algorithm for a particular task) and the set of data transformations
(feature engineering) to apply to your data. The optimal pipeline varies for each dataset, and selecting the
optimal algorithm from all the choices adds to the complexity. Even further, each algorithm has a set of
hyperparameters to be tuned. Hence, you can spend weeks and sometimes months on machine learning model
optimization trying to find the best combinations of feature engineering, learning algorithms, and
hyperparameters.
This process can be automated with the ML.NET CLI, which implements the ML.NET AutoML intelligent engine.
NOTE
This topic refers to ML.NET CLI and ML.NET AutoML, which are currently in Preview, and material may be subject to change.
You can generate those assets (the serialized model and the C# code to run it) from your own datasets without
coding by yourself, so it also improves your productivity even if you already know ML.NET.
Currently, the ML Tasks supported by the ML.NET CLI are:
binary-classification
multiclass-classification
regression
Future: other machine learning tasks such as recommendation , ranking , anomaly-detection , clustering
Example of usage:
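An invocation might look like the following (the dataset file name and label column name are illustrative; the
options are described in the auto-train command reference):
> mlnet auto-train --task binary-classification --dataset "customer-feedback.tsv" --label-column-name Sentiment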
You can run it the same way on Windows PowerShell, macOS/Linux bash, or Windows CMD. However, tab-based
auto-completion (parameter suggestions) won't work on Windows CMD.
Metrics for Binary Classification models
Accuracy is a popular metric for classification problems; however, accuracy is not always the best metric to
select the best model from, as explained in the references below. There are cases where you need to evaluate
the quality of your model with additional metrics.
To explore and understand the metrics that are output by the CLI, see Metrics for binary classification.
Metrics for Multi-class Classification models
The following displays the multi-class classification ML task metrics list for the top five models found by the CLI:
To explore and understand the metrics that are output by the CLI, see Metrics for multiclass classification.
Metrics for Regression models
A regression model fits the data well if the differences between the observed values and the model's predicted
values are small and unbiased. Regression models can be evaluated with metrics such as R-squared and root
mean squared error (RMSE).
You will see a similar list of metrics for the best top five quality models found by the CLI. In this particular case
related to a regression ML task:
To explore and understand the metrics that are output by the CLI, see Metrics for regression.
See also
How to install the ML.NET CLI tool
Tutorial: Auto generate a binary classifier using the ML.NET CLI
ML.NET CLI command reference
Telemetry in ML.NET CLI
Auto generate a binary classifier using the CLI
8/22/2019 • 12 minutes to read • Edit Online
Learn how to use the ML.NET CLI to automatically generate an ML.NET model and the underlying C# code. You
provide your dataset and the machine learning task you want to implement, and the CLI uses the AutoML engine
to create the source code for model generation and consumption, as well as the binary model.
In this tutorial, you will:
Prepare your data for the selected machine learning task
Run the 'mlnet auto-train' command from the CLI
Review the quality metric results
Understand the generated C# code to use the model in your application
Explore the generated C# code that was used to train the model
NOTE
This topic refers to the ML.NET CLI tool, which is currently in Preview, and material may be subject to change. For more
information, visit the ML.NET introduction.
The ML.NET CLI is part of ML.NET. Its main goal is to "democratize" ML.NET for .NET developers who are
learning ML.NET, so you don't need to code from scratch to get started.
You can run the ML.NET CLI on any command prompt (Windows, macOS, or Linux) to generate good-quality
ML.NET models and source code based on training datasets you provide.
Prerequisites
.NET Core 2.2 SDK or later
(Optional) Visual Studio 2017 or 2019
ML.NET CLI
You can either run the generated C# code projects from Visual Studio or with dotnet run (.NET Core CLI).
NOTE
This tutorial uses a dataset from 'From Group to Individual Labels using Deep Features', Kotzias et al.,
KDD 2015, hosted at the UCI Machine Learning Repository - Dua, D. and Karra Taniskidou, E. (2017). UCI
Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information
and Computer Science.
1. Download the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository and unzip it.
2. Copy the yelp_labelled.txt file into any folder you previously created (such as /cli-test ).
3. Open your preferred command prompt and move to the folder where you copied the dataset file. For
example:
> cd /cli-test
Using any text editor such as Visual Studio Code, you can open and explore the yelp_labelled.txt dataset
file. You can see that the structure is:
The file has no header. You will use the column's index.
There are just two columns: the first column is the comment text, and the second column is the sentiment
label (0 for negative, 1 for positive).
Make sure you close the dataset file from the editor.
Now, you are ready to start using the CLI for this 'Sentiment Analysis' scenario.
NOTE
After finishing this tutorial, you can also try it with your own datasets, as long as they are ready to be used
for any of the ML tasks currently supported by the ML.NET CLI Preview, which are 'Binary Classification',
'Multi-class Classification', and 'Regression'.
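To train the model, you run a single mlnet auto-train command against the dataset. As a sketch (the option
values mirror the dataset layout described above: --label-column-index 2 points at the second, label column
since index values start at 1, and --has-header false matches the headerless file; adjust them to your data):
> mlnet auto-train --task binary-classification --dataset "yelp_labelled.txt" --label-column-index 2 --has-header false --max-exploration-time 60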
Explore the generated C# code to use for running the model to make
predictions
1. In Visual Studio (2017 or 2019), open the solution generated in the folder named
SampleBinaryClassification within your original destination folder (in this tutorial named /cli-test ).
You should see a solution containing two generated projects: a class library and a console app.
NOTE
In this tutorial we suggest using Visual Studio, but you can also explore the generated C# code (two projects) with
any text editor and run the generated console app with the dotnet CLI on a macOS, Linux, or Windows machine.
The generated class library contains the serialized ML model (.zip file) and the data classes (data
models). You can use it directly in your end-user application, even by referencing that class library
(or moving the code, as you prefer).
The generated console app contains execution code that you must review. You then usually reuse
the 'scoring code' (the code that runs the ML model to make predictions) by moving that simple code (just a
few lines) to your end-user application where you want to make the predictions.
2. Open the ModelInput.cs and ModelOutput.cs class files within the class library project. You will see that
these classes are 'data classes' or POCO classes used to hold data. It is 'boilerplate code' but useful to have
it generated if your dataset has tens or even hundreds of columns.
The ModelInput class is used when reading data from the dataset.
The ModelOutput class is used to get the prediction result (prediction data).
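As a sketch, for this two-column, headerless dataset the generated classes typically look like the following
(the exact generated property and column names may differ; LoadColumn and ColumnName are standard ML.NET
attributes):
using Microsoft.ML.Data;
public class ModelInput
{
    // Column 0: the comment text
    [ColumnName("col0"), LoadColumn(0)]
    public string Col0 { get; set; }
    // Column 1: the sentiment label (0 or 1)
    [ColumnName("col1"), LoadColumn(1)]
    public bool Col1 { get; set; }
}
public class ModelOutput
{
    // The predicted sentiment label
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }
    // The raw score backing the prediction
    public float Score { get; set; }
}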
3. Open the Program.cs file and explore the code. In just a few lines, you are able to run the model and make a
sample prediction.
// Training code used by ML.NET CLI and AutoML to generate the model
//ModelBuilder.CreateModel();
The first line of code simply creates an MLContext object needed whenever you run ML.NET code.
The second line of code is commented out because you don't need to train the model; it was already trained
for you by the CLI tool and saved into the model's serialized .zip file. But if you want to see "how the model
was trained" by the CLI, you can uncomment that line and run/debug the training code used for that particular
ML model.
In the third line of code, you load the model from the serialized model .ZIP file with the
mlContext.Model.Load() API by providing the path to that model .ZIP file.
In the fourth line of code, you create the PredictionEngine object with the
mlContext.Model.CreatePredictionEngine<TSrc,TDst>(ITransformer mlModel) API. You need the
PredictionEngine object whenever you want to make a prediction targeting a single sample of data (in this
case, a single piece of text to predict its sentiment).
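Put together, those first four lines look roughly like the following (a sketch; MODEL_FILEPATH and the exact
generated member names are assumptions based on the CLI's code-generation conventions):
// Create the MLContext, the entry point for all ML.NET operations
MLContext mlContext = new MLContext();
// Training code used by ML.NET CLI and AutoML to generate the model (already run for you)
//ModelBuilder.CreateModel();
// Load the trained model from the serialized .zip file
ITransformer mlModel = mlContext.Model.Load(MODEL_FILEPATH, out DataViewSchema inputSchema);
// Create a PredictionEngine to predict a single sample of data
var predEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);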
The fifth line of code is where you create the single sample data to be used for the prediction, by calling the
CreateSingleDataSample() function. Since the CLI tool doesn't know what kind of sample data to use, that
function loads the first row of the dataset. However, you can also provide your own 'hard-coded' data instead,
by replacing the current implementation of the CreateSingleDataSample() function with simpler code like the
following:
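(A sketch; the Col0 property name and the sample text are illustrative and should match your generated
ModelInput class.)
private static ModelInput CreateSingleDataSample()
{
    // Hard-coded sample instead of loading the first row of the dataset
    return new ModelInput() { Col0 = "The mojitos were great and the staff was very friendly!" };
}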
1. Run the project, either using the original sample data loaded from the first row of the dataset or by
providing your own custom hard-coded sample data. On Windows, run the console app from Visual Studio by
pressing F5 (the Play button); on macOS or Linux, run it with dotnet run from the console app's project
folder. You should get a prediction that prints the sample text and the predicted sentiment.
2. Try changing the hard-coded sample data to other sentences with different sentiment and see how the
model predicts positive or negative sentiment.
Explore the generated C# code that was used to train the "best
quality" model
For more advanced learning purposes, you can also explore the generated C# code that was used by the CLI tool
to train the generated model.
That model training code is currently generated in a custom class named ModelBuilder, so you can investigate
the training code there.
More importantly, for this particular scenario (a sentiment analysis model), you can also compare that
generated training code with the code explained in the following tutorial:
Compare: Tutorial: Use ML.NET in a sentiment analysis binary classification scenario.
It is interesting to compare the chosen algorithm and pipeline configuration in the tutorial with the code
generated by the CLI tool. Depending on how much time you spend iterating and searching for better models, the
chosen algorithm might be different, along with its particular hyperparameters and pipeline configuration.
See also
Automate model training with the ML.NET CLI
Tutorial: Running ML.NET models on scalable ASP.NET Core web apps and WebAPIs
Sample: Scalable ML.NET model on ASP.NET Core WebAPI
ML.NET CLI auto-train command reference guide
How to install the ML.NET Command-Line Interface (CLI) tool
Telemetry in ML.NET CLI
Next steps
In this tutorial, you learned how to:
Prepare your data for the selected ML task (problem to solve)
Run the 'mlnet auto-train' command in the CLI tool
Review the quality metric results
Understand the generated C# code to run the model (Code to use in your end-user app)
Explore the generated C# code that was used to train the "best quality" model (Learning purposes)
Automate model training with the ML.NET CLI
The 'auto-train' command in ML.NET CLI
6/26/2019 • 8 minutes to read • Edit Online
NOTE
This topic refers to ML.NET CLI and ML.NET AutoML, which are currently in Preview, and material may be subject to change.
The auto-train command is the main command provided by the ML.NET CLI tool. It allows you to generate a
good quality ML.NET model (a serialized model .zip file) plus the example C# code to run/score that model. In
addition, the C# code to create/train that model is also generated, so you can research what algorithm and
settings the generated "best model" uses.
You can generate those assets from your own datasets without coding by yourself, so it also improves your
productivity even if you already know ML.NET.
Currently, the ML Tasks supported by the ML.NET CLI are:
binary-classification
multiclass-classification
regression
The following example creates and trains a binary-classification model with a train dataset, a test dataset,
and further explicit customization arguments:
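(The dataset paths, label column name, and exploration time below are illustrative; each option is described
later in this reference.)
> mlnet auto-train --task binary-classification --dataset "customers-train.csv" --test-dataset "customers-test.csv" --label-column-name "Churned" --max-exploration-time 600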
Name
mlnet auto-train - Trains multiple models ('n' iterations) based on the provided dataset and finally selects the
best model, saves it as a serialized .zip file plus generates related C# code for scoring and training.
Synopsis
> mlnet auto-train
--task | --mltask | -T <value>
--dataset | -d <value>
[
[--validation-dataset | -v <value>]
--test-dataset | -t <value>
]
--label-column-name | -n <value>
|
--label-column-index | -i <value>
[--ignore-columns | -I <value>]
[--has-header | -h <value>]
[--max-exploration-time | -x <value>]
[--verbosity | -V <value>]
[--cache | -c <value>]
[--name | -N <value>]
[--output-path | -o <value>]
[--help | -h]
Invalid input options cause the CLI tool to emit a list of valid inputs and an error message explaining which
argument is missing, if that is the case.
Options
--task | --mltask | -T (string)
A single string providing the ML problem to solve. For instance, any of the following tasks (the CLI will
eventually support all tasks supported in AutoML):
regression - Choose if the ML model will be used to predict a numeric value.
binary-classification - Choose if the ML model result has two possible categorical boolean values (0 or 1).
multiclass-classification - Choose if the ML model result has multiple possible categorical values.
In future releases, additional ML tasks and scenarios such as recommendations, clustering, and ranking will be
supported.
Only one ML task should be provided in this argument.
--dataset | -d (string)
This argument provides the filepath to either one of the following options:
A: The whole dataset file: If using this option and the user is not providing --test-dataset and
--validation-dataset , then cross-validation (k-fold, etc.) or automated data-split approaches will be used
internally for validating the model. In that case, the user just needs to provide the dataset filepath.
B: The training dataset file: If the user is also providing datasets for model validation (using
--test-dataset and optionally --validation-dataset ), then the --dataset argument refers only to the
training dataset. For example, when using an 80% - 20% approach to validate the quality of the model
and to obtain accuracy metrics, the training dataset will have 80% of the data and the test dataset will
have 20% of the data.
--test-dataset | -t (string)
File path pointing to the test dataset file, for example when using an 80% - 20% approach for regular
validations to obtain accuracy metrics.
If using --test-dataset , then --dataset is also required.
The --test-dataset argument is optional unless --validation-dataset is used. In that case, the user must use
all three arguments.
--validation-dataset | -v (string)
File path pointing to the validation dataset file. The validation dataset is optional, in any case.
If using a validation dataset, the behavior should be:
The --test-dataset and --dataset arguments are also required.
The validation-dataset is used to estimate prediction error for model selection.
The test-dataset is used to assess the generalization error of the final chosen model. Ideally, the
test set should be kept in a "vault" and brought out only at the end of the data analysis.
Basically, when using a validation dataset plus a test dataset, the validation phase is split into two parts:
1. In the first part, you just look at your models and select the best performing approach using the validation data
(=validation)
2. Then you estimate the accuracy of the selected approach (=test).
Hence, the separation of data could be 80/10/10 or 75/15/10. For example:
training-dataset file should have 75% of the data.
validation-dataset file should have 15% of the data.
test-dataset file should have 10% of the data.
In any case, those percentages will be decided by the user using the CLI who will provide the files already split.
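An illustrative invocation with all three datasets (the file names and label column name are hypothetical):
> mlnet auto-train --task regression --dataset "prices-train.csv" --validation-dataset "prices-validation.csv" --test-dataset "prices-test.csv" --label-column-name "Price"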
--label-column-name | -n (string)
With this argument, a specific objective/target column (the variable that you want to predict) can be specified by
using the column's name set in the dataset's header.
This argument is used only for supervised ML tasks such as a classification problem. It cannot be used for
unsupervised ML Tasks such as clustering.
--label-column-index | -i (int)
With this argument, a specific objective/target column (the variable that you want to predict) can be specified by
using the column's numeric index in the dataset's file (the column index values start at 1).
Note: If the user also uses --label-column-name , the --label-column-name is the one being used.
This argument is used only for supervised ML tasks such as a classification problem. It cannot be used for
unsupervised ML tasks such as clustering.
--ignore-columns | -I (string)
With this argument, you can ignore existing columns in the dataset file so they are not loaded and used by the
training processes.
Specify the column names that you want to ignore. Use ', ' (comma with space) or ' ' (space) to separate multiple
column names. You can use quotes for column names containing whitespace (e.g. "logged in").
Example:
--ignore-columns email, address, id, logged_in
--has-header | -h (bool)
Specify if the dataset file(s) have a header row. Possible values are:
true
false
The default value is true if this argument is not specified by the user.
In order to use the --label-column-name argument, you need to have a header in the dataset file and
--has-header set to true (which it is by default).
--max-exploration-time | -x (string)
By default, the maximum exploration time is 30 minutes.
This argument sets the maximum time (in seconds) for the process to explore multiple trainers and configurations.
The configured time may be exceeded if the provided time is too short (say, 2 seconds) for a single iteration. In
this case, the actual time is the time required to produce one model configuration in a single iteration.
The needed time for iterations can vary depending on the size of the dataset.
--cache | -c (string)
If you use caching, the whole training dataset will be loaded in memory.
For small and medium datasets, using the cache can drastically improve training performance, meaning the
training time can be shorter than when you don't use the cache.
However, for large datasets, loading all the data in memory can have a negative impact, since you might run
out of memory. When training with large dataset files and not using the cache, ML.NET streams chunks of data
from the drive as it needs to load more data while training.
You can specify the following values:
on: Forces the cache to be used when training.
off: Forces the cache not to be used when training.
auto: Depending on AutoML heuristics, the cache will or won't be used. Usually, with the auto choice,
small/medium datasets use the cache and large datasets don't.
If you don't specify the --cache parameter, then the cache auto configuration will be used by default.
--name | -N (string)
The name for the created output project or solution. If no name is specified, the name sample-{mltask} is used.
The ML.NET model file (.zip file) gets the same name as well.
--output-path | -o (string)
Root location/folder to place the generated output. The default is the current directory.
--verbosity | -V (string)
Sets the verbosity level of the standard output.
Allowed values are:
q[uiet]
m[inimal] (the default)
diag[nostic] (logging information level)
By default, the CLI tool shows some minimal feedback while working, such as mentioning that it is working
and, when possible, how much time is left or what percentage of the time is completed.
-h|--help
Prints out help for the command with a description for each command's parameter.
See also
How to install the ML.NET CLI tool
Automate model training with the ML.NET CLI
Tutorial: Auto generate a binary classifier using the ML.NET CLI
Telemetry in ML.NET CLI
Telemetry collection by the ML.NET CLI
8/17/2019 • 2 minutes to read • Edit Online
The ML.NET CLI includes a telemetry feature that collects anonymous usage data that is aggregated for use by
Microsoft.
Scope
The mlnet command launches the ML.NET CLI, but the command itself doesn't collect telemetry.
Telemetry isn't enabled when you run the mlnet command with no other command attached. For example:
mlnet
mlnet --help
Telemetry is enabled when you run an ML.NET CLI command, such as mlnet auto-train .
License
The Microsoft distribution of ML.NET CLI is licensed with the Microsoft Software License Terms: Microsoft .NET
Library. For details on data collection and processing, see the section entitled "Data."
Disclosure
When you first run an ML.NET CLI command such as mlnet auto-train , the ML.NET CLI tool displays disclosure
text that tells you how to opt out of telemetry. The text may vary slightly depending on the version of the CLI
you're running.
See also
ML.NET CLI reference
Microsoft Software License Terms: Microsoft .NET Library
Privacy at Microsoft
Microsoft Privacy Statement
How to use the ML.NET automated machine learning API
5/21/2019 • 4 minutes to read • Edit Online
Automated machine learning (AutoML) automates the process of applying machine learning to data. Given a
dataset, you can run an AutoML experiment to iterate over different data featurizations, machine learning
algorithms, and hyperparameters to select the best model.
NOTE
This topic refers to the automated machine learning API for ML.NET, which is currently in preview. Material may be subject to
change.
Load data
Automated machine learning supports loading a dataset into an IDataView. Data can be in the form of tab-
separated value (TSV) files and comma-separated value (CSV) files.
Example:
using Microsoft.ML;
using Microsoft.ML.AutoML;
...
MLContext mlContext = new MLContext();
IDataView trainDataView = mlContext.Data.LoadFromTextFile<SentimentIssue>("my-data-file.csv", hasHeader: true);
Multiclass Classification
// Create the experiment settings for a multiclass classification task
// (a sketch completing this fragment; cts is the CancellationTokenSource shown later)
var experimentSettings = new MulticlassExperimentSettings();
experimentSettings.MaxExperimentTimeInSeconds = 3600;
experimentSettings.CancellationToken = cts.Token;
4. The CacheDirectory setting is a pointer to a directory where all models trained during the AutoML task will
be saved. If CacheDirectory is set to null, models will be kept in memory instead of written to disk.
experimentSettings.CacheDirectory = null;
The list of supported trainers per ML task can be found at the corresponding link below:
Supported Binary Classification Algorithms
Supported Multiclass Classification Algorithms
Supported Regression Algorithms
Optimizing metric
The optimizing metric determines the metric to be optimized during model training. The optimizing metric you
can select is determined by the task type you choose. Below is a list of available metrics.
BINARY CLASSIFICATION            MULTICLASS CLASSIFICATION    REGRESSION
Accuracy                         MicroAccuracy                RSquared
AreaUnderPrecisionRecallCurve    MacroAccuracy                MeanAbsoluteError
AreaUnderRocCurve                LogLoss                      MeanSquaredError
F1Score                          LogLossReduction             RootMeanSquaredError
NegativePrecision                TopKAccuracy
NegativeRecall
PositivePrecision
PositiveRecall
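For example, to optimize a regression experiment for R-squared (a sketch; OptimizingMetric is a property of
the experiment settings types in the preview API):
experimentSettings.OptimizingMetric = RegressionMetric.RSquared;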
Exit criteria
Define the criteria to complete your task:
1. Exit after a length of time - Using MaxExperimentTimeInSeconds in your experiment settings, you can define
how long, in seconds, an experiment should continue to run.
2. Exit on a cancellation token - You can use a cancellation token that lets you cancel the experiment before it
is scheduled to finish.
var cts = new CancellationTokenSource();
var experimentSettings = new RegressionExperimentSettings();
experimentSettings.MaxExperimentTimeInSeconds = 3600;
experimentSettings.CancellationToken = cts.Token;
Create an experiment
Once you have configured the experiment settings, you are ready to create the experiment.
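For example, a regression experiment matching the settings shown earlier can be created like this (a sketch
based on the preview AutoML API, where the Auto() catalog extension comes from Microsoft.ML.AutoML):
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);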
Explore other overloads for Execute() if you want to pass in validation data, column information indicating the
column purpose, or prefeaturizers.
Training modes
Training dataset
AutoML provides an overloaded experiment Execute() method that allows you to provide training data. Internally,
automated ML divides the data into train-validate splits.
experiment.Execute(trainDataView);
experiment.Execute(trainDataView, validationDataView);
The second overload lets you pass an explicit validation dataset instead of relying on the automatic split.
See also
For full code samples and more visit the dotnet/machinelearning-samples GitHub repository.