Data Science For Beginners
Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and
deploy predictive analytics solutions on your data. Machine Learning Studio publishes models as web services that
can easily be consumed by custom apps or BI tools such as Excel.
Machine Learning Studio is where data science, predictive analytics, cloud resources, and your data meet.
Components of an experiment
An experiment consists of datasets that provide data to analytical modules, which you connect together to
construct a predictive analysis model. Specifically, a valid experiment has these characteristics:
The experiment has at least one dataset and one module
Datasets may be connected only to modules
Modules may be connected to either datasets or other modules
All input ports for modules must have some connection to the data flow
All required parameters for each module must be set
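The characteristics above can be checked mechanically. The following is a minimal Python sketch of such a check, not Studio's own validation logic. The `Node` class, its fields, and the `validate_experiment` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dataset or module on the canvas (hypothetical structure)."""
    name: str
    kind: str                      # "dataset" or "module"
    input_ports: int = 0           # datasets have no input ports
    required_params: dict = field(default_factory=dict)  # None means "not set"


def validate_experiment(nodes, connections):
    """Return a list of rule violations; an empty list means the experiment is valid.

    connections is a list of (source_name, target_name, target_port) tuples.
    """
    by_name = {n.name: n for n in nodes}
    errors = []

    kinds = [n.kind for n in nodes]
    if "dataset" not in kinds or "module" not in kinds:
        errors.append("need at least one dataset and one module")

    connected_ports = set()
    for source, target, port in connections:
        # Every connection must end at a module, so a dataset can
        # feed a module but never another dataset.
        if by_name[target].kind != "module":
            errors.append(f"{source} -> {target}: target is not a module")
        connected_ports.add((target, port))

    for node in nodes:
        if node.kind != "module":
            continue
        for port in range(node.input_ports):
            if (node.name, port) not in connected_ports:
                errors.append(f"{node.name}: input port {port} is unconnected")
        for param, value in node.required_params.items():
            if value is None:
                errors.append(f"{node.name}: parameter '{param}' is not set")
    return errors


nodes = [
    Node("Automobile price data (Raw)", "dataset"),
    Node("Select Columns in Dataset", "module", input_ports=1,
         required_params={"columns": None}),
]
connections = [("Automobile price data (Raw)", "Select Columns in Dataset", 0)]
print(validate_experiment(nodes, connections))
```

In this example the only reported problem is the unset `columns` parameter, which loosely mirrors the red exclamation mark Studio shows on a module whose properties haven't been set.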
You can create an experiment from scratch, or you can use an existing sample experiment as a template. For more
information, see Copy example experiments to create new machine learning experiments.
For an example of creating a simple experiment, see Create a simple experiment in Azure Machine Learning
Studio.
For a more complete walkthrough of creating a predictive analytics solution, see Develop a predictive solution with
Azure Machine Learning Studio.
Datasets
A dataset is data that has been uploaded to Machine Learning Studio so that it can be used in the modeling
process. A number of sample datasets are included with Machine Learning Studio for you to experiment with, and
you can upload more datasets as you need them. Here are some examples of included datasets:
MPG data for various automobiles - Miles per gallon (MPG) values for automobiles identified by number of
cylinders, horsepower, etc.
Breast cancer data - Breast cancer diagnosis data.
Forest fires data - Forest fire sizes in northeast Portugal.
As you build an experiment, you can choose from the list of datasets available to the left of the canvas.
For a list of sample datasets included in Machine Learning Studio, see Use the sample data sets in Azure Machine
Learning Studio.
Modules
A module is an algorithm that you can perform on your data. Machine Learning Studio has a number of modules
ranging from data ingress functions to training, scoring, and validation processes. Here are some examples of
included modules:
Convert to ARFF - Converts a .NET serialized dataset to Attribute-Relation File Format (ARFF).
Compute Elementary Statistics - Calculates elementary statistics such as mean, standard deviation, etc.
Linear Regression - Creates an online gradient descent-based linear regression model.
Score Model - Scores a trained classification or regression model.
As you build an experiment you can choose from the list of modules available to the left of the canvas.
A module may have a set of parameters that you can use to configure the module's internal algorithms. When you
select a module on the canvas, the module's parameters are displayed in the Properties pane to the right of the
canvas. You can modify the parameters in that pane to tune your model.
For some help navigating through the large library of machine learning algorithms available, see How to choose
algorithms for Microsoft Azure Machine Learning Studio.
Feature | Machine Learning Studio | Azure Machine Learning service visual interface
Training compute targets | Proprietary compute target, CPU support only | Supports Azure Machine Learning compute, GPU or CPU. (Other computes supported in SDK)
Deployment compute targets | Proprietary web service format, not customizable | Enterprise security options & Azure Kubernetes Service. (Other computes supported in SDK)
Try out the visual interface (preview) with Quickstart: Prepare and visualize data without writing code.
NOTE
Models created in Studio can't be deployed or managed by Azure Machine Learning service. However, models created and
deployed in the service visual interface can be managed through the Azure Machine Learning service workspace.
Free trial
Try Azure Machine Learning Studio, available in paid or free options.
Next steps
You can learn the basics of predictive analytics and machine learning using a step-by-step quickstart and by
building on samples.
Quickstart: Create your first data science experiment
in Azure Machine Learning Studio
3/15/2019 • 11 minutes to read
In this quickstart, you create a machine learning experiment in Azure Machine Learning Studio that predicts the
price of a car based on different variables such as make and technical specifications.
If you're brand new to machine learning, the video series Data Science for Beginners is a great introduction to
machine learning using everyday language and concepts.
This quickstart follows the default workflow for an experiment:
1. Create a model
Get the data
Prepare the data
Define features
2. Train the model
Choose and apply an algorithm
3. Score and test the model
Predict new automobile prices
If you don't have a Studio account, go to the Studio homepage and select Sign up here to create a free account.
The free workspace will have all the features you need for this quickstart.
TIP
You can find a working copy of the following experiment in the Azure AI Gallery. Go to Your first data science
experiment - Automobile price prediction and click Open in Studio to download a copy of the experiment into your
Machine Learning Studio workspace.
To see what this data looks like, click the output port at the bottom of the automobile dataset, then select
Visualize.
TIP
Datasets and modules have input and output ports represented by small circles - input ports at the top, output ports at
the bottom. To create a flow of data through your experiment, you'll connect an output port of one module to an input
port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that
point in the data flow.
In this dataset, each row represents an automobile, and the variables associated with each automobile appear as
columns. We'll predict the price in the far-right column (column 26, titled "price") using the variables for a specific
automobile.
Close the visualization window by clicking the "x" in the upper-right corner.
TIP
Cleaning the missing values from input data is a prerequisite for using most of the modules.
First, we add a module that removes the normalized-losses column completely. Then we add another module
that removes any row that has missing data.
1. Type select columns in the search box at the top of the module palette to find the Select Columns in
Dataset module. Then drag it to the experiment canvas. This module allows us to select which columns of
data we want to include or exclude in the model.
2. Connect the output port of the Automobile price data (Raw) dataset to the input port of the Select
Columns in Dataset.
3. Click the Select Columns in Dataset module and click Launch column selector in the Properties pane.
On the left, click With rules
Under Begin With, click All columns. These rules direct Select Columns in Dataset to pass
through all the columns (except those columns we're about to exclude).
From the drop-downs, select Exclude and column names, and then click inside the text box. A list
of columns is displayed. Select normalized-losses, and it's added to the text box.
Click the check mark (OK) button to close the column selector (on the lower right).
Now the properties pane for Select Columns in Dataset indicates that it will pass through all
columns from the dataset except normalized-losses.
TIP
You can add a comment to a module by double-clicking the module and entering text. This can help you
see at a glance what the module is doing in your experiment. In this case, double-click the Select Columns in
Dataset module and type the comment "Exclude normalized losses."
4. Drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in
Dataset module. In the Properties pane, select Remove entire row under Cleaning mode. These
options direct Clean Missing Data to clean the data by removing rows that have any missing values.
Double-click the module and type the comment "Remove missing value rows."
TIP
Why did we run the experiment now? By running the experiment, the column definitions for our data pass from the
dataset, through the Select Columns in Dataset module, and through the Clean Missing Data module. This means that any
modules we connect to Clean Missing Data will also have this same information.
Now we have clean data. If you want to view the cleaned dataset, click the left output port of the Clean Missing
Data module and select Visualize. Notice that the normalized-losses column is no longer included, and there
are no missing values.
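To make the two cleaning steps concrete, here is a minimal plain-Python sketch of the same idea outside Studio: first exclude one column, then remove every row that still has a missing value. The sample rows are invented, and the use of "?" as the missing-value marker is an assumption made for illustration.

```python
MISSING = "?"  # assumed marker for a missing value

# Invented sample rows; the real dataset has 26 columns.
rows = [
    {"make": "audi", "normalized-losses": "164", "horsepower": "102", "price": "13950"},
    {"make": "bmw", "normalized-losses": "?", "horsepower": "101", "price": "16430"},
    {"make": "isuzu", "normalized-losses": "?", "horsepower": "?", "price": "6785"},
]


def exclude_columns(rows, excluded):
    """Pass through all columns except the excluded ones."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def remove_rows_with_missing(rows, missing=MISSING):
    """Drop every row that has at least one missing value."""
    return [row for row in rows if missing not in row.values()]


selected = exclude_columns(rows, {"normalized-losses"})
cleaned = remove_rows_with_missing(selected)
print([row["make"] for row in cleaned])
```

The order of the two steps matters. In this sample the "bmw" row survives only because the column holding its missing value was excluded before the row filter ran. Filtering rows first would have discarded it.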
Now that the data is clean, we're ready to specify what features we're going to use in the predictive model.
Define features
In machine learning, features are individual measurable properties of something you’re interested in. In our
dataset, each row represents one automobile, and each column is a feature of that automobile.
Finding a good set of features for creating a predictive model requires experimentation and knowledge about the
problem you want to solve. Some features are better for predicting the target than others. Some features have a
strong correlation with other features and can be removed. For example, city-mpg and highway-mpg are closely
related so we can keep one and remove the other without significantly affecting the prediction.
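One way to check whether two features are closely related is to compute their Pearson correlation coefficient. The sketch below does this in plain Python. The mpg values are invented for illustration and are not taken from the automobile dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values for five hypothetical automobiles.
city_mpg = [21, 19, 24, 18, 31]
highway_mpg = [27, 26, 30, 22, 38]
print(round(pearson(city_mpg, highway_mpg), 3))
```

A coefficient close to 1.0 (or to -1.0) suggests that one of the two features carries little information the other doesn't already provide.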
Let's build a model that uses a subset of the features in our dataset. You can come back later and select different
features, run the experiment again, and see if you get better results. But to start, let's try the following features:
1. Drag another Select Columns in Dataset module to the experiment canvas. Connect the left output port of
the Clean Missing Data module to the input of the Select Columns in Dataset module.
3. Run the experiment. When the experiment is run, the Select Columns in Dataset and Split Data modules
pass column definitions to the modules we'll be adding next.
4. To select the learning algorithm, expand the Machine Learning category in the module palette to the left
of the canvas, and then expand Initialize Model. This displays several categories of modules that can be
used to initialize machine learning algorithms. For this experiment, select the Linear Regression module
under the Regression category, and drag it to the experiment canvas. (You can also find the module by
typing "linear regression" in the palette Search box.)
5. Find and drag the Train Model module to the experiment canvas. Connect the output of the Linear
Regression module to the left input of the Train Model module, and connect the training data output (left
port) of the Split Data module to the right input of the Train Model module.
6. Click the Train Model module, click Launch column selector in the Properties pane, and then select the
price column. Price is the value that our model is going to predict.
You select the price column in the column selector by moving it from the Available columns list to the
Selected columns list.
7. Run the experiment.
We now have a trained regression model that can be used to score new automobile data to make price
predictions.
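To give a feel for what training and scoring mean here, the following sketch fits a straight line to a single feature using ordinary least squares and then predicts prices for new values. This is a deliberately simplified stand-in: it is not the algorithm inside Studio's Linear Regression module, it uses one feature instead of several, and the horsepower and price numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Score new feature values with a fitted (slope, intercept) model."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Invented training data: horsepower and price for four automobiles.
train_horsepower = [70, 90, 110, 150]
train_price = [7000, 9000, 11000, 15000]

model = fit_line(train_horsepower, train_price)   # the "Train Model" step
print(predict(model, [100, 130]))                 # the "Score Model" step
```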
2. Run the experiment and view the output from the Score Model module by clicking the output port of
Score Model and selecting Visualize. The output shows the predicted values for price and the known values
from the test data.
3. Finally, we test the quality of the results. Select and drag the Evaluate Model module to the experiment
canvas, and connect the output of the Score Model module to the left input of Evaluate Model. The final
experiment should look something like this:
4. Run the experiment.
To view the output from the Evaluate Model module, click the output port, and then select Visualize.
The following statistics are shown for our model:
Mean Absolute Error (MAE): The average of absolute errors (an error is the difference between the
predicted value and the actual value).
Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on
the test dataset.
Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual
values and the average of all actual values.
Relative Squared Error: The average of squared errors relative to the squared difference between the actual
values and the average of all actual values.
Coefficient of Determination: Also known as the R squared value, this is a statistical metric indicating
how well a model fits the data.
For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match
the actual values. For Coefficient of Determination, the closer its value is to one (1.0), the better the
predictions.
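The five statistics can be computed directly from the actual and predicted values. The sketch below uses the usual textbook definitions, which are assumed (not confirmed by this article) to match what Evaluate Model reports. In particular, it takes the Coefficient of Determination to be one minus the Relative Squared Error. The sample prices are invented.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, RAE, RSE, and R squared for paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Deviations of the actual values from their own average.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / sq_dev
    return {
        "MAE": sum(abs_errors) / n,
        "RMSE": sqrt(sum(sq_errors) / n),
        "RAE": sum(abs_errors) / abs_dev,
        "RSE": rse,
        "R2": 1 - rse,
    }


# Invented actual and predicted prices for four automobiles.
actual = [10000, 12000, 14000, 16000]
predicted = [11000, 12000, 13000, 17000]
for name, value in regression_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
```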
Clean up resources
If you no longer need the resources you created using this article, delete them to avoid incurring any charges.
Learn how in the article, Export and delete in-product user data.
Next steps
In this quickstart, you created a simple experiment using a sample dataset. To explore the process of creating and
deploying a model in more depth, continue to the predictive solution tutorial.
Tutorial: Develop a predictive solution in Studio
Tutorial 1: Predict credit risk - Azure Machine
Learning Studio
5/20/2019 • 12 minutes to read
In this tutorial, you take an extended look at the process of developing a predictive analytics solution. You develop
a simple model in Machine Learning Studio. You then deploy the model as an Azure Machine Learning web
service. This deployed model can make predictions using new data. This tutorial is part one of a three-part
tutorial series.
Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.
Credit risk assessment is a complex problem, but this tutorial will simplify it a bit. You'll use it as an example of
how you can create a predictive analytics solution using Microsoft Azure Machine Learning Studio. You'll use
Azure Machine Learning Studio and a Machine Learning web service for this solution.
In this three-part tutorial, you start with publicly available credit risk data. You then develop and train a predictive
model. Finally, you deploy the model as a web service.
In this part of the tutorial you:
Create a Machine Learning Studio workspace
Upload existing data
Create an experiment
You can then use this experiment to train models in part 2 and then deploy them in part 3.
Try Azure Machine Learning Studio, available in paid or free options.
Prerequisites
This tutorial assumes that you've used Machine Learning Studio at least once before, and that you have some
understanding of machine learning concepts. But it doesn't assume you're an expert in either.
If you've never used Azure Machine Learning Studio before, you might want to start with the quickstart,
Create your first data science experiment in Azure Machine Learning Studio. The quickstart takes you through
Machine Learning Studio for the first time. It shows you the basics of how to drag-and-drop modules onto your
experiment, connect them together, run the experiment, and look at the results.
TIP
You can find a working copy of the experiment that you develop in this tutorial in the Azure AI Gallery. Go to Tutorial -
Predict credit risk and click Open in Studio to download a copy of the experiment into your Machine Learning Studio
workspace.
TIP
If you are the owner of the workspace, you can share the experiments you're working on by inviting others to the workspace.
You can do this in Machine Learning Studio on the SETTINGS page. You just need the Microsoft account or organizational
account for each user.
On the SETTINGS page, click USERS, then click INVITE MORE USERS at the bottom of the window.
In either case, you have created a comma-separated version of the data in a file named german.csv that you can
use in your experiment.
Upload the dataset to Machine Learning Studio
Once the data has been converted to CSV format, you need to upload it into Machine Learning Studio.
1. Open the Machine Learning Studio home page (https://studio.azureml.net).
2. Click the menu in the upper-left corner of the window, click Azure Machine Learning, select
Studio, and sign in.
3. Click +NEW at the bottom of the window.
4. Select DATASET.
5. Select FROM LOCAL FILE.
6. In the Upload a new dataset dialog, click Browse, and find the german.csv file you created.
7. Enter a name for the dataset. For this tutorial, call it "UCI German Credit Card Data".
8. For data type, select Generic CSV File With no header (.nh.csv).
9. Add a description if you’d like.
10. Click the OK check mark.
This uploads the data into a dataset module that you can use in an experiment.
You can manage datasets that you've uploaded to Studio by clicking the DATASETS tab to the left of the Studio
window.
For more information about importing other types of data into an experiment, see Import your training data into
Azure Machine Learning Studio.
Create an experiment
The next step in this tutorial is to create an experiment in Machine Learning Studio that uses the dataset you
uploaded.
1. In Studio, click +NEW at the bottom of the window.
2. Select EXPERIMENT, and then select "Blank Experiment".
3. Select the default experiment name at the top of the canvas and rename it to something meaningful.
TIP
It's a good practice to fill in Summary and Description for the experiment in the Properties pane. These
properties give you the chance to document the experiment so that anyone who looks at it later will understand
your goals and methodology.
4. In the module palette to the left of the experiment canvas, expand Saved Datasets.
5. Find the dataset you created under My Datasets and drag it onto the canvas. You can also find the dataset
by entering the name in the Search box above the palette.
The red exclamation mark indicates that you haven't set the properties for this module yet. You'll do that
next.
TIP
You can add a comment to a module by double-clicking the module and entering text. This can help you see at a
glance what the module is doing in your experiment. In this case, double-click the Edit Metadata module and type
the comment "Add column headings". Click anywhere else on the canvas to close the text box. To display the
comment, click the down-arrow on the module.
4. Select Edit Metadata, and in the Properties pane to the right of the canvas, click Launch column
selector.
5. In the Select columns dialog, select all the rows in Available Columns and click > to move them to
Selected Columns. The dialog should look like this:
6. Click the OK check mark.
7. Back in the Properties pane, look for the New column names parameter. In this field, enter a list of
names for the 21 columns in the dataset, separated by commas and in column order. You can obtain the
column names from the dataset documentation on the UCI website, or for convenience you can copy and
paste the following list:
Status of checking account, Duration in months, Credit history, Purpose, Credit amount, Savings
account/bond, Present employment since, Installment rate in percentage of disposable income, Personal
status and sex, Other debtors, Present residence since, Property, Age in years, Other installment
plans, Housing, Number of existing credits, Job, Number of people providing maintenance for,
Telephone, Foreign worker, Credit risk
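As a quick sanity check outside Studio, you can confirm that the list contains exactly 21 names and see how the names line up with a headerless row. The sketch below uses Python's standard csv module. The single data row is illustrative, written in the dataset's coded style, and is not claimed to be an actual record from german.csv.

```python
import csv
import io

COLUMN_NAMES = [name.strip() for name in (
    "Status of checking account, Duration in months, Credit history, Purpose, "
    "Credit amount, Savings account/bond, Present employment since, "
    "Installment rate in percentage of disposable income, Personal status and sex, "
    "Other debtors, Present residence since, Property, Age in years, "
    "Other installment plans, Housing, Number of existing credits, Job, "
    "Number of people providing maintenance for, Telephone, Foreign worker, "
    "Credit risk"
).split(",")]

# One illustrative headerless row (21 comma-separated values).
headerless_csv = io.StringIO(
    "A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1\n"
)

# Passing fieldnames tells DictReader the file has no header row of its own.
records = list(csv.DictReader(headerless_csv, fieldnames=COLUMN_NAMES))
print(len(COLUMN_NAMES))
print(records[0]["Credit amount"], records[0]["Credit risk"])
```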
TIP
The property Fraction of rows in the first output dataset determines how much of the data is output through
the left output port. For instance, if you set the ratio to 0.7, then 70% of the data is output through the left port
and 30% through the right port.
3. Double-click the Split Data module and enter the comment, "Training/testing data split 50%".
You can use the outputs of the Split Data module however you like, but let's choose to use the left output as
training data and the right output as testing data.
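The effect of the fraction setting can be sketched in a few lines of Python. This is an illustration of a randomized split, not Studio's implementation. The `split_rows` function and its `seed` argument are hypothetical names.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Shuffle rows, then return (first_output, second_output).

    fraction is the share of rows sent to the first (left) output.
    """
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1000))                       # stand-in for 1,000 data rows
train, test = split_rows(rows, 0.5)
print(len(train), len(test))
```

Using a fixed seed means that rerunning the split gives the same training and testing rows, which makes results comparable between runs.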
As mentioned in the previous step, the cost of misclassifying a high credit risk as low is five times higher than the
cost of misclassifying a low credit risk as high. To account for this, you generate a new dataset that reflects this
cost function. In the new dataset, each high risk example is replicated five times, while each low risk example is
not replicated.
You can do this replication using R code:
1. Find and drag the Execute R Script module onto the experiment canvas.
2. Connect the left output port of the Split Data module to the first input port ("Dataset1") of the Execute R
Script module.
3. Double-click the Execute R Script module and enter the comment, "Set cost adjustment".
4. In the Properties pane, delete the default text in the R Script parameter and enter this script:
TIP
The copy of the Execute R Script module contains the same script as the original module. When you copy and paste a
module on the canvas, the copy retains all the properties of the original.
For more information on using R scripts in your experiments, see Extend your experiment with R.
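The replication described above (each high-risk example appears five times, each low-risk example once) can also be sketched in Python to show the logic. This sketch is not the R script used in the tutorial. It assumes that the credit risk label is in the last of the 21 columns and that the value 2 marks high risk and 1 marks low risk; both the encoding and the function name are assumptions for illustration.

```python
def replicate_high_risk(rows, label_index=20, high_risk_label=2, copies=5):
    """Return a new list in which each high-risk row appears `copies` times."""
    adjusted = []
    for row in rows:
        repeat = copies if row[label_index] == high_risk_label else 1
        adjusted.extend([row] * repeat)
    return adjusted


# Tiny two-column example: (applicant id, credit risk label).
sample = [("a", 1), ("b", 2), ("c", 1)]
weighted = replicate_high_risk(sample, label_index=1)
print(len(weighted))
```

Training on the replicated data makes an error on a high-risk example count five times as much as an error on a low-risk example, which is how the dataset reflects the cost function.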
Clean up resources
If you no longer need the resources you created using this article, delete them to avoid incurring any charges.
Learn how in the article, Export and delete in-product user data.
Next steps
In this tutorial you completed these steps:
Create a Machine Learning Studio workspace
Upload existing data into the workspace
Create an experiment
You are now ready to train and evaluate models for this data.
Tutorial 2 - Train and evaluate models
Tutorial 2: Train credit risk models - Azure Machine
Learning Studio
2/20/2019 • 9 minutes to read
In this tutorial, you take an extended look at the process of developing a predictive analytics solution. You develop
a simple model in Machine Learning Studio. You then deploy the model as an Azure Machine Learning web
service. This deployed model can make predictions using new data. This tutorial is part two of a three-part
tutorial series.
Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.
Credit risk assessment is a complex problem, but this tutorial will simplify it a bit. You'll use it as an example of
how you can create a predictive analytics solution using Microsoft Azure Machine Learning Studio. You'll use
Azure Machine Learning Studio and a Machine Learning web service for this solution.
In this three-part tutorial, you start with publicly available credit risk data. You then develop and train a predictive
model. Finally, you deploy the model as a web service.
In part one of the tutorial, you created a Machine Learning Studio workspace, uploaded data, and created an
experiment.
In this part of the tutorial you:
Train multiple models
Score and evaluate the models
In part three of the tutorial, you'll deploy the model as a web service.
Try Azure Machine Learning Studio, available in paid or free options.
Prerequisites
Complete part one of the tutorial.
TIP
To get help deciding which Machine Learning algorithm best suits the particular problem you're trying to solve, see How to
choose algorithms for Microsoft Azure Machine Learning Studio.
You'll add both the Two-Class Boosted Decision Tree module and Two-Class Support Vector Machine module in
this experiment.
Two-Class Boosted Decision Tree
First, set up the boosted decision tree model.
1. Find the Two-Class Boosted Decision Tree module in the module palette and drag it onto the canvas.
2. Find the Train Model module, drag it onto the canvas, and then connect the output of the Two-Class
Boosted Decision Tree module to the left input port of the Train Model module.
The Two-Class Boosted Decision Tree module initializes the generic model, and Train Model uses training
data to train the model.
3. Connect the left output of the left Execute R Script module to the right input port of the Train Model
module (in this tutorial you used the data coming from the left side of the Split Data module for training).
TIP
You don't need two of the inputs and one of the outputs of the Execute R Script module for this experiment, so you
can leave them unattached.
Now you need to tell the Train Model module that you want the model to predict the Credit Risk value.
1. Select the Train Model module. In the Properties pane, click Launch column selector.
2. In the Select a single column dialog, type "credit risk" in the search field under Available Columns,
select "Credit risk" below, and click the right arrow button (>) to move "Credit risk" to Selected Columns.
3. Click the OK check mark.
Two-Class Support Vector Machine
Next, you set up the SVM model.
First, a little explanation about SVM. Boosted decision trees work well with features of any type. However, since
the SVM module generates a linear classifier, the model that it generates has the best test error when all numeric
features have the same scale. To convert all numeric features to the same scale, you use a "Tanh" transformation
(with the Normalize Data module). This transforms our numbers into the [0,1] range. The SVM module converts
string features to categorical features and then to binary 0/1 features, so you don't need to manually transform
string features. Also, you don't want to transform the Credit Risk column (column 21) - it's numeric, but it's the
value we're training the model to predict, so you need to leave it alone.
To set up the SVM model, do the following:
1. Find the Two-Class Support Vector Machine module in the module palette and drag it onto the canvas.
2. Right-click the Train Model module, select Copy, and then right-click the canvas and select Paste. The copy
of the Train Model module has the same column selection as the original.
3. Connect the output of the Two-Class Support Vector Machine module to the left input port of the second
Train Model module.
4. Find the Normalize Data module and drag it onto the canvas.
5. Connect the left output of the left Execute R Script module to the input of this module (notice that the
output port of a module may be connected to more than one other module).
6. Connect the left output port of the Normalize Data module to the right input port of the second Train
Model module.
This portion of our experiment should now look something like this:
Now configure the Normalize Data module:
1. Click to select the Normalize Data module. In the Properties pane, select Tanh for the Transformation
method parameter.
2. Click Launch column selector, select "No columns" for Begin With, select Include in the first dropdown,
select column type in the second dropdown, and select Numeric in the third dropdown. This specifies that
all the numeric columns (and only numeric) are transformed.
3. Click the plus sign (+) to the right of this row - this creates a row of dropdowns. Select Exclude in the first
dropdown, select column names in the second dropdown, and enter "Credit risk" in the text field. This
specifies that the Credit Risk column should be ignored (you need to do this because this column is
numeric and so would be transformed if you didn't exclude it).
4. Click the OK check mark.
The Normalize Data module is now set to perform a Tanh transformation on all numeric columns except for the
Credit Risk column.
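The column selection and the scaling can be illustrated with a short Python sketch. The exact formula Studio uses for its Tanh method isn't given in this tutorial, so the sketch assumes one common tanh-based scaling, 0.5 * (tanh(0.01 * (x - mean) / std) + 1), which keeps every value between 0 and 1. The function names and sample rows are hypothetical.

```python
from math import tanh
from statistics import mean, pstdev


def tanh_normalize(values):
    """Scale values into (0, 1) with an assumed tanh-based formula."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.5 for _ in values]           # constant column: no spread to scale
    return [0.5 * (tanh(0.01 * (v - mu) / sigma) + 1) for v in values]


def normalize_numeric_columns(rows, exclude=("Credit risk",)):
    """Transform all numeric columns except the excluded ones; copy the rest."""
    columns = [name for name, value in rows[0].items()
               if isinstance(value, (int, float)) and name not in exclude]
    result = [dict(row) for row in rows]       # do not modify the input rows
    for column in columns:
        scaled = tanh_normalize([row[column] for row in rows])
        for row, value in zip(result, scaled):
            row[column] = value
    return result


# Invented sample rows with a subset of the 21 columns.
rows = [
    {"Duration in months": 6, "Credit amount": 1169, "Purpose": "A43", "Credit risk": 1},
    {"Duration in months": 48, "Credit amount": 5951, "Purpose": "A43", "Credit risk": 2},
    {"Duration in months": 12, "Credit amount": 2096, "Purpose": "A46", "Credit risk": 1},
]
for row in normalize_numeric_columns(rows):
    print(row)
```

String columns such as Purpose pass through unchanged, and Credit risk keeps its original values because it is explicitly excluded, matching the include and exclude rules set in the column selector.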
The Score Model module can now take the credit information from the testing data, run it through the
model, and compare the predictions the model generates with the actual credit risk column in the testing
data.
4. Copy and paste the Score Model module to create a second copy.
5. Connect the output of the SVM model (that is, the output port of the Train Model module that's connected
to the Two-Class Support Vector Machine module) to the input port of the second Score Model module.
6. For the SVM model, you have to do the same transformation to the test data as you did to the training data.
So copy and paste the Normalize Data module to create a second copy and connect it to the right Execute R
Script module.
7. Connect the left output of the second Normalize Data module to the right input port of the second Score
Model module.
To check the results, click the output port of the Evaluate Model module and select Visualize.
The Evaluate Model module produces a pair of curves and metrics that allow you to compare the results of the
two scored models. You can view the results as Receiver Operating Characteristic (ROC) curves, Precision/Recall
curves, or Lift curves. Additional data displayed includes a confusion matrix, cumulative values for the area under
the curve (AUC), and other metrics. You can change the threshold value by moving the slider left or right and see
how it affects the set of metrics.
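To see why moving the threshold changes the metrics, consider this small Python sketch. It builds a confusion matrix at a given threshold, derives precision and recall from it, and computes AUC as the fraction of positive/negative pairs that the scores rank correctly. The labels (1 for positive, 0 for negative) and the scores are invented, and the function names are hypothetical.

```python
def confusion_at_threshold(labels, scores, threshold=0.5):
    """Count outcomes when scores >= threshold are predicted positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, score in zip(labels, scores):
        if score >= threshold:
            counts["TP" if label == 1 else "FP"] += 1
        else:
            counts["FN" if label == 1 else "TN"] += 1
    return counts


def precision_recall(counts):
    """Derive precision and recall from confusion-matrix counts."""
    predicted_pos = counts["TP"] + counts["FP"]
    actual_pos = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted_pos if predicted_pos else 0.0
    recall = counts["TP"] / actual_pos if actual_pos else 0.0
    return precision, recall


def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
for threshold in (0.5, 0.35):
    counts = confusion_at_threshold(labels, scores, threshold)
    print(threshold, counts, precision_recall(counts))
print(auc(labels, scores))
```

Lowering the threshold in this example turns one false negative into a true positive, so recall rises. AUC doesn't depend on the threshold at all, which is why it is useful for comparing two models overall.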
To the right of the graph, click Scored dataset or Scored dataset to compare to highlight the associated curve
and to display the associated metrics below. In the legend for the curves, "Scored dataset" corresponds to the left
input port of the Evaluate Model module - in our case, this is the boosted decision tree model. "Scored dataset to
compare" corresponds to the right input port - the SVM model in our case. When you click one of these labels, the
curve for that model is highlighted and the corresponding metrics are displayed, as shown in the following
graphic.
By examining these values, you can decide which model is closest to giving you the results you're looking for. You
can go back and iterate on your experiment by changing parameter values in the different models.
The science and art of interpreting these results and tuning the model performance is outside the scope of this
tutorial. For additional help, you might read the following articles:
How to evaluate model performance in Azure Machine Learning Studio
Choose parameters to optimize your algorithms in Azure Machine Learning Studio
Interpret model results in Azure Machine Learning Studio
TIP
Each time you run the experiment a record of that iteration is kept in the Run History. You can view these iterations, and
return to any of them, by clicking VIEW RUN HISTORY below the canvas. You can also click Prior Run in the Properties
pane to return to the iteration immediately preceding the one you have open.
You can make a copy of any iteration of your experiment by clicking SAVE AS below the canvas. Use the experiment's
Summary and Description properties to keep a record of what you've tried in your experiment iterations.
For more information, see Manage experiment iterations in Azure Machine Learning Studio.
Clean up resources
If you no longer need the resources you created using this article, delete them to avoid incurring any charges.
Learn how in the article, Export and delete in-product user data.
Next steps
In this tutorial, you completed these steps:
Create an experiment
Train multiple models
Score and evaluate the models
You're now ready to deploy models for this data.
Tutorial 3 - Deploy models
Tutorial 3: Deploy credit risk model - Azure Machine
Learning Studio
3/28/2019 • 12 minutes to read
In this tutorial, you take an extended look at the process of developing a predictive analytics solution. You develop
a simple model in Machine Learning Studio. You then deploy the model as an Azure Machine Learning web
service. This deployed model can make predictions using new data. This tutorial is part three of a three-part
tutorial series.
Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.
Credit risk assessment is a complex problem, but this tutorial will simplify it a bit. You'll use it as an example of
how you can create a predictive analytics solution using Microsoft Azure Machine Learning Studio. You'll use
Azure Machine Learning Studio and a Machine Learning web service for this solution.
In this three-part tutorial, you start with publicly available credit risk data. You then develop and train a predictive
model. Finally you deploy the model as a web service.
In part one of the tutorial, you created a Machine Learning Studio workspace, uploaded data, and created an
experiment.
In part two of the tutorial, you trained and evaluated models.
In this part of the tutorial you:
Prepare for deployment
Deploy the web service
Test the web service
Manage the web service
Access the web service
Try Azure Machine Learning Studio, available in paid or free options.
Prerequisites
Complete part two of the tutorial.
TIP
If you want more details on what happens when you convert a training experiment to a predictive experiment, see How to
prepare your model for deployment in Azure Machine Learning Studio.
NOTE
You can see that the experiment is saved in two parts under tabs that have been added at the top of the experiment
canvas. The original training experiment is under the tab Training experiment, and the newly created predictive
experiment is under Predictive experiment. The predictive experiment is the one you'll deploy as a web service.
You need to take one additional step with this particular experiment. You added two Execute R Script modules to
provide a weighting function to the data. That was just a trick you needed for training and testing, so you can take
out those modules in the final model. Machine Learning Studio removed one Execute R Script module when it
removed the Split module. Now you can remove the other and connect Metadata Editor directly to Score Model.
Our experiment should now look like this:
NOTE
You may be wondering why you left the UCI German Credit Card Data dataset in the predictive experiment. The service is
going to score the user's data, not the original dataset, so why leave the original dataset in the model?
It's true that the service doesn't need the original credit card data. But it does need the schema for that data, which
includes information such as how many columns there are and which columns are numeric. This schema information is
necessary to interpret the user's data. You leave these components connected so that the scoring module has the dataset
schema when the service is running. The data isn't used, just the schema.
One important thing to note is that if your original dataset contained the label, then the schema for the web
input will also expect a column with the label. A way around this is to remove the label, and any other data that was in the
training dataset but will not be in the web inputs, before connecting the web input and training dataset into a common
module.
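The label workaround can be pictured outside Studio as follows. In Studio itself you would do this with a column-selection module; the plain-Python sketch below only illustrates the idea. The column names, the sample values, and the `drop_columns` helper are hypothetical.

```python
def drop_columns(rows, columns_to_drop):
    """Return copies of the rows without the named columns."""
    drop = set(columns_to_drop)
    return [{name: value for name, value in row.items() if name not in drop}
            for row in rows]


# Illustrative training rows: two features plus the label column.
training_rows = [
    {"duration_months": 6, "credit_amount": 1169, "credit_risk": 1},
    {"duration_months": 48, "credit_amount": 5951, "credit_risk": 2},
]

# The web input will not carry the label, so remove it before the
# training data and the web input share a common schema.
schema_rows = drop_columns(training_rows, ["credit_risk"])
print(sorted(schema_rows[0]))
```

The original rows are left untouched, so the training experiment can still use the label.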
Run the experiment one last time (click Run). If you want to verify that the model is still working, click the output
of the Score Model module and select View Results. You can see that the original data is displayed, along with
the credit risk value ("Scored Labels") and the scoring probability value ("Scored Probabilities").
You can configure the service by clicking the CONFIGURATION tab. Here you can modify the service name (it's
given the experiment name by default) and give it a description. You can also give more friendly labels for the
input and output data.
Deploy as a New web service
NOTE
To deploy a New web service you must have sufficient permissions in the subscription to which you are deploying the web
service. For more information, see Manage a web service using the Azure Machine Learning Web Services portal.
TIP
The way you have the predictive experiment configured, the entire results from the Score Model module are returned. This
includes all the input data plus the credit risk value and the scoring probability. But you can return something different if
you want - for example, you could return just the credit risk value. To do this, insert a Select Columns module between
Score Model and the Web service output to eliminate columns you don't want the web service to return.
You can test a Classic web service either in Machine Learning Studio or in the Azure Machine Learning Web
Services portal. You can test a New web service only in the Machine Learning Web Services portal.
TIP
When testing in the Azure Machine Learning Web Services portal, you can have the portal create sample data that you can
use to test the Request-Response service. On the Configure page, select "Yes" for Sample Data Enabled?. When you
open the Request-Response tab on the Test page, the portal fills in sample data taken from the original credit risk dataset.
Clean up resources
If you no longer need the resources you created using this article, delete them to avoid incurring any charges.
Learn how in the article, Export and delete in-product user data.
Next steps
In this tutorial, you completed these steps:
Prepare for deployment
Deploy the web service
Test the web service
Manage the web service
Access the web service
You can also develop a custom application to access the web service using starter code provided for you in R, C#,
and Python programming languages.
Consume an Azure Machine Learning Web service
Use the sample datasets in Azure Machine Learning
Studio
3/15/2019 • 14 minutes to read
When you create a new workspace in Azure Machine Learning Studio, a number of sample datasets and
experiments are included by default. Many of these sample datasets are used by the sample models in the Azure
AI Gallery. Others are included as examples of various types of data typically used in machine learning.
Some of these datasets are available in Azure Blob storage. For these datasets, the following table provides a direct
link. You can use these datasets in your experiments by using the Import Data module.
The rest of these sample datasets are available in your workspace under Saved Datasets. You can find this in the
module palette to the left of the experiment canvas in Machine Learning Studio. You can use any of these datasets
in your own experiment by dragging it to your experiment canvas.
Datasets
DATASET NAME DATASET DESCRIPTION
Adult Census Income Binary Classification dataset A subset of the 1994 Census database, using working adults
over the age of 16 with an adjusted income index of > 100.
Usage: Classify people using demographics to predict
whether a person earns over 50K a year.
Related Research: Kohavi, R., Becker, B., (1996). UCI
Machine Learning Repository https://archive.ics.uci.edu/ml.
Irvine, CA: University of California, School of Information
and Computer Science
Automobile price data (Raw) Information about automobiles by make and model, including
the price, features such as the number of cylinders and MPG,
as well as an insurance risk score.
The risk score is initially associated with auto price. It is
then adjusted for actual risk in a process known to
actuaries as symboling. A value of +3 indicates that the
auto is risky, and a value of -3 that it is probably safe.
Usage: Predict the risk score by features, using regression
or multivariate classification.
Related Research: Schlimmer, J.C. (1987). UCI Machine
Learning Repository https://archive.ics.uci.edu/ml. Irvine,
CA: University of California, School of Information and
Computer Science
Bike Rental UCI dataset UCI Bike Rental dataset that is based on real data from the Capital
Bikeshare company, which maintains a bike rental network in
Washington DC.
The dataset has one row for each hour of each day in
2011 and 2012, for a total of 17,379 rows. The range of
hourly bike rentals is from 1 to 977.
Bill Gates RGB Image Publicly available image file converted to CSV data.
The code for converting the image is provided in the
Color quantization using K-Means clustering model
detail page.
Blood donation data A subset of data from the blood donor database of the Blood
Transfusion Service Center of Hsin-Chu City, Taiwan.
Donor data includes recency (the months since last donation),
frequency (the total number of donations), time since
first donation, and amount of blood donated.
Usage: The goal is to predict via classification whether the
donor donated blood in March 2007, where 1 indicates a
donor during the target period, and 0 a non-donor.
Related Research: Yeh, I.C., (2008). UCI Machine
Learning Repository https://archive.ics.uci.edu/ml. Irvine,
CA: University of California, School of Information and
Computer Science
Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming,
"Knowledge discovery on RFM model using Bernoulli
sequence," Expert Systems with Applications, 2008,
https://dx.doi.org/10.1016/j.eswa.2008.07.018
Breast Cancer Features The dataset contains information for 102K suspicious regions
(candidates) of X-ray images, each described by 117 features.
The features are proprietary and their meaning is not revealed
by the dataset creators (Siemens Healthcare).
Breast Cancer Info The dataset contains additional information for each
suspicious region of X-ray image. Each example provides
information (for example, label, patient ID, coordinates of
patch relative to the whole image) about the corresponding
row number in the Breast Cancer Features dataset. Each
patient has a number of examples. For patients who have a
cancer, some examples are positive and some are negative. For
patients who don't have a cancer, all examples are negative.
The dataset has 102K examples. The dataset is biased: 0.6% of
the points are positive; the rest are negative. The dataset was
made available by Siemens Healthcare.
CRM Appetency Labels Shared Labels from the KDD Cup 2009 customer relationship
prediction challenge (orange_small_train_appetency.labels).
CRM Churn Labels Shared Labels from the KDD Cup 2009 customer relationship
prediction challenge (orange_small_train_churn.labels).
CRM Dataset Shared This data comes from the KDD Cup 2009 customer
relationship prediction challenge (orange_small_train.data.zip).
The dataset contains 50K customers from the French
Telecom company Orange. Each customer has 230
anonymized features, 190 of which are numeric and 40
are categorical. The features are very sparse.
CRM Upselling Labels Shared Labels from the KDD Cup 2009 customer relationship
prediction challenge (orange_large_train_upselling.labels).
Flight Delays Data Passenger flight on-time performance data taken from the
TranStats data collection of the U.S. Department of
Transportation (On-Time).
The dataset covers the time period April-October 2013.
Before uploading to Azure Machine Learning Studio, the
dataset was processed as follows:
The dataset was filtered to cover only the 70 busiest
airports in the continental US
Canceled flights were labeled as delayed by more than
15 minutes
Diverted flights were filtered out
The following columns were selected: Year, Month,
DayofMonth, DayOfWeek, Carrier, OriginAirportID,
DestAirportID, CRSDepTime, DepDelay, DepDel15,
CRSArrTime, ArrDelay, ArrDel15, Canceled
Flight on-time performance (Raw) Records of airplane flight arrivals and departures within the United
States from October 2011.
Usage: Predict flight delays.
Related Research: From US Dept. of Transportation
https://www.transtats.bts.gov/DL_SelectFields.asp?
Table_ID=236&DB_Short_Name=On-Time.
Forest fires data Contains weather data, such as temperature and humidity
indices and wind speed. The data is taken from an area of
northeast Portugal, combined with records of forest fires.
Usage: This is a difficult regression task, where the aim is
to predict the burned area of forest fires.
Related Research: Cortez, P., & Morais, A. (2008). UCI
Machine Learning Repository https://archive.ics.uci.edu/ml.
Irvine, CA: University of California, School of Information
and Computer Science
[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data
Mining Approach to Predict Forest Fires using
Meteorological Data. In J. Neves, M. F. Santos and J.
Machado Eds., New Trends in Artificial Intelligence,
Proceedings of the 13th EPIA 2007 - Portuguese
Conference on Artificial Intelligence, December,
Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13
978-989-95618-0-9. Available at:
http://www.dsi.uminho.pt/~pcortez/fires.pdf.
German Credit Card UCI dataset The UCI Statlog (German Credit Card) dataset
(Statlog+German+Credit+Data), using the german.data file.
The dataset classifies people, described by a set of
attributes, as low or high credit risks. Each example
represents a person. There are 20 features, both numerical
and categorical, and a binary label (the credit risk value).
High credit risk entries have label = 2, low credit risk
entries have label = 1. The cost of misclassifying a low risk
example as high is 1, whereas the cost of misclassifying a
high risk example as low is 5.
IMDB Movie Titles The dataset contains information about movies that were
rated in Twitter tweets: IMDB movie ID, movie name, genre,
and production year. There are 17K movies in the dataset. The
dataset was introduced in the paper "S. Dooms, T. De
Pessemier and L. Martens. MovieTweetings: a Movie Rating
Dataset Collected From Twitter. Workshop on Crowdsourcing
and Human Computation for Recommender Systems,
CrowdRec at RecSys 2013."
Iris two class data This is perhaps the best known database to be found in the
pattern recognition literature. The dataset is relatively small,
containing 50 examples each of petal measurements from
three iris varieties.
Usage: Predict the iris type from the measurements.
Related Research: Fisher, R.A. (1988). UCI Machine
Learning Repository https://archive.ics.uci.edu/ml. Irvine,
CA: University of California, School of Information and
Computer Science
Movie Tweets The dataset is an extended version of the Movie Tweetings
dataset. The dataset has 170K ratings for movies, extracted
from well-structured tweets on Twitter. Each instance
represents a tweet and is a tuple: user ID, IMDB movie ID,
rating, timestamp, number of favorites for this tweet, and
number of retweets of this tweet. The dataset was made
available by A. Said, S. Dooms, B. Loni and D. Tikk for
Recommender Systems Challenge 2014.
MPG data for various automobiles This dataset is a slightly modified version of the dataset
provided by the StatLib library of Carnegie Mellon University.
The dataset was used in the 1983 American Statistical
Association Exposition.
The data lists fuel consumption for various automobiles in
miles per gallon. It also includes information such as the
number of cylinders, engine displacement, horsepower,
total weight, and acceleration.
Usage: Predict fuel economy based on three multivalued
discrete attributes and five continuous attributes.
Related Research: StatLib, Carnegie Mellon University,
(1993). UCI Machine Learning Repository
https://archive.ics.uci.edu/ml. Irvine, CA: University of
California, School of Information and Computer Science
Pima Indians Diabetes Binary Classification dataset A subset of data from the National Institute of Diabetes and
Digestive and Kidney Diseases database. The dataset was
filtered to focus on female patients of Pima Indian heritage.
The data includes medical data such as glucose and insulin
levels, as well as lifestyle factors.
Usage: Predict whether the subject has diabetes (binary
classification).
Related Research: Sigillito, V. (1990). UCI Machine
Learning Repository https://archive.ics.uci.edu/ml. Irvine,
CA: University of California, School of Information and
Computer Science
Restaurant feature data A set of metadata about restaurants and their features, such
as food type, dining style, and location.
Usage: Use this dataset, in combination with the other
two restaurant datasets, to train and test a recommender
system.
Related Research: Bache, K. and Lichman, M. (2013). UCI
Machine Learning Repository https://archive.ics.uci.edu/ml.
Irvine, CA: University of California, School of Information
and Computer Science.
Restaurant ratings Contains ratings given by users to restaurants on a scale from
0 to 2.
Usage: Use this dataset, in combination with the other
two restaurant datasets, to train and test a recommender
system.
Related Research: Bache, K. and Lichman, M. (2013). UCI
Machine Learning Repository https://archive.ics.uci.edu/ml.
Irvine, CA: University of California, School of Information
and Computer Science.
Steel Annealing multi-class dataset This dataset contains a series of records from steel annealing
trials. It contains the physical attributes (width, thickness, and type,
such as coil or sheet) of the resulting steel types.
Usage: Predict either of two numeric class attributes:
hardness or strength. You might also analyze correlations
among attributes.
Steel grades follow a set standard, defined by SAE and
other organizations. You are looking for a specific 'grade'
(the class variable) and want to understand the values
needed.
Related Research: Sterling, D. & Buntine, W. (NA). UCI
Machine Learning Repository https://archive.ics.uci.edu/ml.
Irvine, CA: University of California, School of Information
and Computer Science
A useful guide to steel grades can be found here:
https://otk-sitecore-prod-v2-cdn.azureedge.net/-
/media/from-sharepoint/documents/product/outokumpu-
steel-grades-properties-global-standards.pdf
Telescope data Record of high energy gamma particle bursts along with
background noise, both simulated using a Monte Carlo
process.
The intent of the simulation was to improve the accuracy
of ground-based atmospheric Cherenkov gamma
telescopes. This is done by using statistical methods to
differentiate between the desired signal (Cherenkov
radiation showers) and background noise (hadronic
showers initiated by cosmic rays in the upper
atmosphere).
The data has been pre-processed to create an elongated
cluster whose long axis is oriented towards the camera
center. The characteristics of this ellipse (often called Hillas
parameters) are among the image parameters that can be
used for discrimination.
Usage: Predict whether an image of a shower represents
signal or background noise.
Notes: Simple classification accuracy is not meaningful for
this data, since classifying a background event as signal is
worse than classifying a signal event as background. For
comparison of different classifiers, the ROC graph should
be used. The probability of accepting a background event
as signal must be below one of the following thresholds:
0.01, 0.02, 0.05, 0.1, or 0.2.
Also, note that the number of background events (h, for
hadronic showers) is underestimated. In real
measurements, the h or noise class represents the
majority of events.
Related Research: Bock, R.K. (1995). UCI Machine
Learning Repository https://archive.ics.uci.edu/ml. Irvine,
CA: University of California, School of Information and
Computer Science
network_intrusion_detection.csv Dataset from the KDD Cup 1999 Knowledge Discovery and
Data Mining Tools Competition (kddcup99.html).
The dataset was downloaded and stored in Azure Blob
storage (network_intrusion_detection.csv) and includes
both training and testing datasets. The training dataset
has approximately 126K rows and 43 columns, including
the labels. Three columns are part of the label information,
and 40 columns, consisting of numeric and
string/categorical features, are available for training the
model. The test data has approximately 22.5K test
examples with the same 43 columns as in the training
data.
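The asymmetric costs listed for the German Credit Card UCI dataset in the table above (1 for classifying a low risk example as high, 5 for classifying a high risk example as low) can be turned into a single total-cost score for a set of predictions. The sketch below is plain Python rather than a Studio module, and the two label lists are invented for illustration.

```python
LOW, HIGH = 1, 2  # label values used by the dataset

# Cost of each kind of mistake, keyed by (actual, predicted).
COST = {
    (LOW, HIGH): 1,   # low risk example classified as high
    (HIGH, LOW): 5,   # high risk example classified as low
}


def misclassification_cost(actual, predicted):
    """Sum the cost of every wrong prediction; correct ones cost 0."""
    return sum(COST.get((a, p), 0) for a, p in zip(actual, predicted))


actual = [1, 1, 2, 2, 1]      # made-up true labels
predicted = [1, 2, 2, 1, 1]   # made-up model output
total = misclassification_cost(actual, predicted)
print(total)  # 6: one low-as-high error (1) plus one high-as-low error (5)
```

Comparing models by this total, rather than by plain accuracy, penalizes the more expensive kind of error more heavily.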
Next steps
Kickstart your experiments with examples
Share and discover resources in the Azure AI Gallery
3/15/2019 • 10 minutes to read
Azure AI Gallery is a community-driven site for discovering and sharing solutions built with Azure AI. The
Gallery has a variety of resources that you can use to develop your own analytics solutions.
Once the resource is in your workspace, you can customize and use it as you would anything that you create in
Studio.
To use an imported custom module:
1. Create an experiment or open an existing experiment.
2. To expand the list of custom modules in your workspace, in the module palette select Custom. The module
palette is to the left of the experiment canvas.
3. Select the module that you imported and drag it to your experiment.
Contribute experiments
To demonstrate analytics techniques, or to give others a jump-start on their solutions, you can contribute
experiments you've developed in Studio. As others come across your contribution in the Gallery, you can follow
the number of views and downloads of your contribution. Users can also add comments and share your
contributions with other members of the data science community. And you can log in with a discussion tool such
as Disqus to receive notifications for comments on your contributions.
1. Open your experiment in Studio.
2. In the list of actions below the experiment canvas, select Publish to Gallery.
3. In the Gallery, enter a Name and Tags that are descriptive. Highlight the techniques you used or the real-
world problem you're solving. An example of a descriptive experiment title is “Binary Classification: Twitter
Sentiment Analysis.”
4. In the SUMMARY box, enter a summary of your experiment. Briefly describe the problem the experiment
solves, and how you approached it.
5. In the DETAILED DESCRIPTION box, describe the steps you took in each part of your experiment. Some
useful topics to include are:
Experiment graph screenshot
Data sources and explanation
Data processing
Feature engineering
Model description
Results and evaluation of model performance
You can use markdown to format your description. To see how your entries on the experiment description
page will look when the experiment is published, select Preview.
TIP
The text boxes provided for markdown editing and preview are small. We recommend that you write your experiment
documentation in a markdown editor (such as Visual Studio Code), then copy and paste the completed
documentation into the text box in the Gallery.
6. On the Image Selection page, choose a thumbnail image for your experiment. The thumbnail image
appears at the top of the experiment details page and in the experiment tile. Other users will see the
thumbnail image when they browse the Gallery. You can upload an image from your computer, or select a
stock image from the Gallery.
7. On the Settings page, under Visibility, choose whether to publish your content publicly (Public) or to
have it accessible only to people who have a link to the page (Unlisted).
TIP
If you want to make sure your documentation looks correct before you release it publicly, you can first publish the
experiment as Unlisted. Later, you can change the visibility setting to Public on the experiment details page. Note
that after you set an experiment to Public you cannot later change it to Unlisted.
8. Select Create.
Your contribution is now in Azure AI Gallery. Your contributions are listed on your account page on the Items tab.
Add to and edit your collection
You can add items to your collection in two ways:
Open the collection, select Edit, and then select Add Item. You can add items that you've contributed to the
Gallery or you can search the Gallery for items to add. After you've selected the items you want to add, click
Add.
If you find an item that you want to add while you're browsing the Gallery, open the item and select Add to
collection. Select the collection that you want to add the item to.
You can edit the items in your collection by selecting Edit.
You can change the summary, description, or tags for your collection.
You can change the order of the items in the collection by using the arrows next to an item.
To add notes to the items in your collection, select the upper-right corner of an item, and then select Add/Edit
note.
To remove an item from your collection, select the upper-right corner of an item, and then select Remove.
Get a quick introduction to data science from Data Science for Beginners in five short videos from a top data
scientist. These videos are basic but useful, whether you're interested in doing data science or you work with data
scientists.
This first video is about the kinds of questions that data science can answer. To get the most out of the series,
watch them all. Go to the list of videos
If you have a credit card, you’ve already benefited from anomaly detection. Your credit card company analyzes
your purchase patterns, so that they can alert you to possible fraud. Charges that are "weird" might be a purchase
at a store where you don't normally shop or buying an unusually pricey item.
This question can be useful in lots of ways. For instance:
If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
If you're monitoring the internet, you’d want to know: Is this message from the internet typical?
Anomaly detection flags unexpected or unusual events or behaviors. It gives clues where to look for problems.
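As a rough illustration of flagging an unusual reading, the sketch below marks a pressure value as anomalous when it lies more than three standard deviations from the mean of earlier readings. The readings, the threshold of 3, and the `is_anomaly` helper are assumptions made for this example; real anomaly detectors are usually more elaborate.

```python
from statistics import mean, stdev


def is_anomaly(history, reading, threshold=3.0):
    """Flag a reading that is far from the mean of past readings."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past readings were identical: anything different is unusual.
        return reading != mu
    return abs(reading - mu) / sigma > threshold


# Hypothetical pressure gauge readings.
history = [32.1, 31.8, 32.0, 32.3, 31.9, 32.2]

print(is_anomaly(history, 32.0))  # False: close to the usual values
print(is_anomaly(history, 45.0))  # True: far outside the usual range
```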
Questions it answers are always about what action should be taken - usually by a machine or a robot. Examples
are:
If I'm a temperature control system for a house: Adjust the temperature or leave it where it is?
If I'm a self-driving car: At a yellow light, brake or accelerate?
For a robot vacuum: Keep vacuuming, or go back to the charging station?
Reinforcement learning algorithms gather data as they go, learning from trial and error.
So that's it - The 5 questions data science can answer.
Next steps
Try a first data science experiment with Machine Learning Studio
Get an introduction to Machine Learning on Microsoft Azure
Is your data ready for data science?
3/22/2019 • 4 minutes to read
Here is some relevant data on the quality of hamburgers: grill temperature, patty weight, and rating in the local
food magazine. But notice the gaps in the table on the left.
Most data sets are missing some values. It's common to have holes like this and there are ways to work around
them. But if there's too much missing, your data begins to look like Swiss cheese.
If you look at the table on the left, there's so much missing data, it's hard to come up with any kind of relationship
between grill temperature and patty weight. This example shows disconnected data.
The table on the right, though, is full and complete - an example of connected data.
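A quick way to check how connected a table is: count the rows that have every value filled in. The hamburger rows below are invented for illustration, with `None` standing in for a gap, and `complete_fraction` is a hypothetical helper.

```python
def complete_fraction(rows):
    """Return the share of rows that have no missing (None) values."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(value is not None for value in row.values())
    )
    return complete / len(rows)


burgers = [
    {"grill_temp_f": 400, "patty_weight_oz": 6.0, "rating": 4},
    {"grill_temp_f": None, "patty_weight_oz": 5.5, "rating": 3},
    {"grill_temp_f": 450, "patty_weight_oz": None, "rating": None},
    {"grill_temp_f": 425, "patty_weight_oz": 6.5, "rating": 5},
]

print(complete_fraction(burgers))  # 0.5: two of the four rows are complete
```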
Look at the target in the upper right. There is a tight grouping right around the bulls eye. That, of course, is
accurate. Oddly, in the language of data science, performance on the target right below it is also considered
accurate.
If you mapped out the center of these arrows, you'd see that it's very close to the bulls eye. The arrows are spread
out all around the target, so they're considered imprecise, but they're centered around the bulls eye, so they're
considered accurate.
Now look at the upper-left target. Here the arrows hit very close together, a tight grouping. They're precise, but
they're inaccurate because the center is way off the bulls eye. The arrows in the bottom-left target are both
inaccurate and imprecise. This archer needs more practice.
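The two ideas can be put into numbers. In the sketch below, accuracy is measured by how far the center of the arrows lies from the bulls eye at (0, 0), and precision by how far the arrows spread around their own center. The coordinates are made up, and these two measures are only one simple way to express the idea.

```python
from math import hypot
from statistics import mean


def center_offset(hits):
    """Distance from the bulls eye (0, 0) to the center of the hits.
    A small value means the archer is accurate."""
    cx = mean(x for x, _ in hits)
    cy = mean(y for _, y in hits)
    return hypot(cx, cy)


def spread(hits):
    """Average distance of the hits from their own center.
    A small value means the archer is precise."""
    cx = mean(x for x, _ in hits)
    cy = mean(y for _, y in hits)
    return mean(hypot(x - cx, y - cy) for x, y in hits)


# Spread out, but centered on the bulls eye: accurate and imprecise.
scattered_but_centered = [(3, 0), (-3, 0), (0, 3), (0, -3)]
# Tight grouping far from the bulls eye: precise and inaccurate.
tight_but_off = [(5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9)]

for name, hits in [("scattered", scattered_but_centered),
                   ("tight", tight_but_off)]:
    print(name, round(center_offset(hits), 2), round(spread(hits), 2))
```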
Next steps
Try a first data science experiment with Machine Learning Studio
Get an introduction to Machine Learning on Microsoft Azure
Ask a question you can answer with data
5/24/2019 • 4 minutes to read
These examples of answers are called a target. A target is what we are trying to predict about future data points,
whether it's a category or a number.
If you don't have any target data, you'll need to get some. You won't be able to answer your question without it.
Next steps
Try a first data science experiment with Machine Learning Studio
Get an introduction to Machine Learning on Microsoft Azure
Predict an answer with a simple model
3/22/2019 • 5 minutes to read
We're going to take this data now and turn it into a scatter plot. This is a great way to visualize numerical data sets.
For the first data point, we eyeball a vertical line at 1.01 carats. Then, we eyeball a horizontal line at $7,366. Where
they meet, we draw a dot. This represents our first diamond.
Now we go through each diamond on this list and do the same thing. When we're through, this is what we get: a
bunch of dots, one for each diamond.
Draw the model through the data points
Now if you look at the dots and squint, the collection looks like a fat, fuzzy line. We can take our marker and draw
a straight line through it.
By drawing a line, we created a model. Think of this as taking the real world and making a simplistic cartoon
version of it. Now the cartoon is wrong - the line doesn't go through all the data points. But, it's a useful
simplification.
The fact that the line doesn't go exactly through all the dots is OK. Data scientists explain this by saying that there's
the model - that's the line - and then each dot has some noise or variance associated with it. There's the
underlying perfect relationship, and then there's the gritty, real world that adds noise and uncertainty.
Because we're trying to answer the question How much? this is called a regression. And because we're using a
straight line, it's a linear regression.
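Instead of drawing the line by eye, a computer can place it with ordinary least squares, which picks the slope and intercept that minimize the squared vertical distances to the dots. That is a swapped-in technique, not the marker-and-squint method described above. In the sketch below only the first diamond (1.01 carats, $7,366) comes from the text; the other points and the `fit_line` helper are made up for illustration.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (carats, price in dollars); all but the first point are hypothetical.
diamonds = [(1.01, 7366), (0.50, 2100), (0.75, 4200),
            (1.25, 9800), (1.50, 12500)]

slope, intercept = fit_line(diamonds)

# Answer a "How much?" question for a diamond that is not in the list.
predicted_price = slope * 1.35 + intercept
print(round(predicted_price))
```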
Next steps
Try a first data science experiment with Machine Learning Studio
Get an introduction to Machine Learning on Microsoft Azure
Copy other people's work to do data science
3/22/2019 • 3 minutes to read
IMPORTANT
Cortana Intelligence Gallery was renamed Azure AI Gallery. As a result, text and images in this transcript vary slightly
from the video, which uses the former name.
To get the most out of the series, watch them all. Go to the list of videos
And now I have a starting point. I can swap out their data for my own and do my own tweaking of the model. This
gives me a running start, and it lets me build on the work of people who really know what they’re doing.
Next steps
Try your first data science experiment with Azure Machine Learning Studio
Get an introduction to Machine Learning on Microsoft Azure
Azure Machine Learning Studio Web Services:
Deployment and consumption
4/9/2019 • 3 minutes to read
You can use Azure Machine Learning Studio to deploy machine learning workflows and models as web services.
These web services can then be used to call the machine learning models from applications over the Internet to do
predictions in real time or in batch mode. Because the web services are RESTful, you can call them from various
programming languages and platforms, such as .NET and Java, and from applications, such as Excel.
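As a minimal sketch of what such a REST call can look like from Python, the code below builds (but does not send) a Request-Response scoring request. Everything specific in it is an assumption: the endpoint URL, the API key, the input name `input1`, the column names, and the body layout with `Inputs`, `ColumnNames`, `Values`, and `GlobalParameters`. Check the API help page of your own web service for the exact format it expects.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build a POST request carrying rows of input data as JSON."""
    body = {
        # Assumed body layout; confirm it against your service's help page.
        "Inputs": {
            "input1": {"ColumnNames": column_names, "Values": rows},
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",        # placeholder endpoint
    "YOUR_API_KEY",                         # placeholder key
    ["duration_months", "credit_amount"],   # hypothetical columns
    [["6", "1169"]],
)
print(request.get_method(), request.full_url)

# To send it against a real service:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```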
The next sections provide links to walkthroughs, code, and documentation to help get you started.
Running the application creates a web service JSON template. To use the template to deploy a web service, you
must add the following information:
Storage account name and key
You can get the storage account name and key from the Azure portal.
Commitment plan ID
You can get the plan ID from the Azure Machine Learning Web Services portal by signing in and clicking a
plan name.
Add them to the JSON template as children of the Properties node at the same level as the
MachineLearningWorkspace node.
Here's an example:
"StorageAccount": {
    "name": "YourStorageAccountName",
    "key": "YourStorageAccountKey"
},
"CommitmentPlan": {
    "id": "subscriptions/YourSubscriptionID/resourceGroups/YourResourceGroupID/providers/Microsoft.MachineLearning/commitmentPlans/YourPlanName"
}
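If you prefer to patch the template with a script rather than by hand, the edit can be sketched in Python as follows. Only the `Properties`, `MachineLearningWorkspace`, `StorageAccount`, and `CommitmentPlan` names come from the description above; the rest of the template shown here, the placeholder values, and the `add_deployment_settings` helper are hypothetical.

```python
import json


def add_deployment_settings(template, storage_name, storage_key, plan_id):
    """Add the storage account and commitment plan as children of the
    Properties node, next to MachineLearningWorkspace."""
    properties = template["Properties"]
    properties["StorageAccount"] = {"name": storage_name, "key": storage_key}
    properties["CommitmentPlan"] = {"id": plan_id}
    return template


# Stand-in for the generated template; a real one has more content.
template = {
    "Properties": {
        "MachineLearningWorkspace": {"Id": "YourWorkspaceId"},
    }
}

add_deployment_settings(
    template,
    "YourStorageAccountName",
    "YourStorageAccountKey",
    "YourCommitmentPlanId",
)
print(json.dumps(template, indent=2))
```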
See the following articles and sample code for additional details:
Azure Machine Learning Studio Cmdlets reference on MSDN
Sample walkthrough on GitHub
To use Azure Machine Learning Studio, you need to have a Machine Learning Studio workspace. This workspace
contains the tools you need to create, manage, and publish experiments.
NOTE
To sign in and create a Studio workspace, you need to be an Azure subscription administrator.
2. Click +New
3. In the search box, type Machine Learning Studio Workspace and select the matching item. Then,
click Create at the bottom of the page.
4. Enter your workspace information:
The workspace name may be up to 260 characters, not ending in a space. The name can't include
these characters: < > * % & : \ ? + /
The web service plan you choose (or create), along with the associated pricing tier you select, is used
if you deploy web services from this workspace.
5. Click Create.
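The naming rules in step 4 can be checked before you fill in the form. The sketch below encodes the three stated rules (at most 260 characters, no trailing space, none of the listed characters); treating an empty name as invalid is an extra assumption, and `is_valid_workspace_name` is a hypothetical helper, not part of any Azure tool.

```python
# Characters the workspace name can't include: < > * % & : \ ? + /
FORBIDDEN_CHARACTERS = set('<>*%&:\\?+/')


def is_valid_workspace_name(name):
    """Check a candidate name against the rules listed in step 4."""
    if not name or len(name) > 260:   # non-empty is an added assumption
        return False
    if name.endswith(" "):
        return False
    return not (set(name) & FORBIDDEN_CHARACTERS)


for candidate in ["credit-risk-ws", "bad/name", "trailing space "]:
    print(repr(candidate), is_valid_workspace_name(candidate))
```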
NOTE
Machine Learning Studio relies on an Azure storage account that you provide to save intermediary data when it executes the
workflow. After the workspace is created, if the storage account is deleted, or if the access keys are changed, the workspace
will stop functioning and all experiments in that workspace will fail. If you accidentally delete the storage account, recreate the
storage account with the same name in the same region as the deleted storage account and resync the access key. If you
changed storage account access keys, resync the access keys in the workspace by using the Azure portal.
Once the workspace is deployed, you can open it in Machine Learning Studio.
1. Browse to Machine Learning Studio at https://studio.azureml.net/.
2. Select your workspace in the upper-right-hand corner.
3. Click my experiments.
For information about managing your Studio workspace, see Manage an Azure Machine Learning Studio
workspace. If you encounter a problem creating your workspace, see Troubleshooting guide: Create and connect to
a Machine Learning Studio workspace.
NOTE
The administrator account that creates the workspace is automatically added to the workspace as workspace Owner.
However, other administrators or users in that subscription are not automatically granted access to the workspace - you
need to invite them explicitly.
NOTE
For users to be able to deploy or manage web services in this workspace, they must be a contributor or administrator in the
Azure subscription.
Manage an Azure Machine Learning Studio workspace
3/15/2019 • 2 minutes to read
NOTE
For information on managing Web services in the Machine Learning Web Services portal, see Manage a Web service using
the Azure Machine Learning Web Services portal.
You can manage Machine Learning Studio workspaces in the Azure portal.
NOTE
To deploy or manage New web services, you must be assigned a contributor or administrator role on the subscription to
which the web service is deployed. If you invite another user to a Machine Learning Studio workspace, you must assign them
to a contributor or administrator role on the subscription before they can deploy or manage web services.
For more information on setting access permissions, see Manage access using RBAC and the Azure portal.
Next steps
Learn more about deploying Machine Learning with Azure Resource Manager templates.
Troubleshooting guide: Create and connect to an Azure Machine Learning Studio workspace
3/15/2019 • 2 minutes to read
This guide provides solutions for some frequently encountered challenges when you are setting up Azure Machine
Learning Studio workspaces.
Workspace owner
To open a workspace in Machine Learning Studio, you must be signed in to the Microsoft Account you used to
create the workspace, or you need to receive an invitation from the owner to join the workspace. From the Azure
portal you can manage the workspace, which includes the ability to configure access.
For more information on managing a workspace, see Manage an Azure Machine Learning Studio workspace.
Allowed regions
Machine Learning is currently available in a limited number of regions. If your subscription does not include one of
these regions, you may see the error message, “You have no subscriptions in the allowed regions.”
To request that a region be added to your subscription, create a new Microsoft support request from the Azure
portal, choose Billing as the problem type, and follow the prompts to submit your request.
Storage account
The Machine Learning service needs a storage account to store data. You can use an existing storage account, or
you can create a new storage account when you create the new Machine Learning Studio workspace (if you have
quota to create a new storage account).
After the new Machine Learning Studio workspace is created, you can sign in to Machine Learning Studio by using
the Microsoft account you used to create the workspace. If you encounter the error message “Workspace Not
Found,” use the following steps to delete your browser cookies.
3. In the Delete Browsing History dialog box, make sure Cookies and website data is selected, and click
Delete.
After the cookies are deleted, restart the browser and then go to the Microsoft Azure Machine Learning Studio
page. When you are prompted for a user name and password, enter the same Microsoft account you used to
create the workspace.
Comments
Our goal is to make the Machine Learning experience as seamless as possible. Please post any comments and
issues at the Azure Machine Learning forum to help us serve you better.
Deploy Azure Machine Learning Studio Workspace Using Azure Resource Manager
4/4/2019 • 3 minutes to read
Using an Azure Resource Manager deployment template saves you time by giving you a scalable way to deploy
interconnected components with a validation and retry mechanism. To set up Azure Machine Learning Studio
Workspaces, for example, you need to first configure an Azure storage account and then deploy your workspace.
Imagine doing this manually for hundreds of workspaces. An easier alternative is to use an Azure Resource
Manager template to deploy a Studio Workspace and all its dependencies. This article takes you through this
process step-by-step. For a great overview of Azure Resource Manager, see Azure Resource Manager overview.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
# Install the Azure Resource Manager modules from the PowerShell Gallery (press “A”)
Install-Module Az -Scope CurrentUser
# Install the Azure Service Management modules from the PowerShell Gallery (press “A”)
Install-Module Azure -Scope CurrentUser
These steps download and install the modules necessary to complete the remaining steps. This only needs to be
done once in the environment where you are executing the PowerShell commands.
Authenticate to Azure
This step needs to be repeated for each session. Once authenticated, your subscription information should be
displayed.
Now that we have access to Azure, we can create the resource group.
Create a resource group
Verify that the resource group is correctly provisioned. ProvisioningState should be “Succeeded.” The resource
group name is used by the template to generate the storage account name. The storage account name must be
between 3 and 24 characters in length and use numbers and lower-case letters only.
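Because the storage account name is derived from the resource group name, it can be worth checking a candidate name against the constraint above before you start a deployment. The following Python sketch tests only the stated rule (3 to 24 characters, digits and lower-case letters). The function name is hypothetical, and the sketch does not reproduce how the template derives the storage account name.

```python
import re

# 3 to 24 characters, digits and lower-case letters only.
STORAGE_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Check a candidate storage account name against the stated rule."""
    return STORAGE_NAME_PATTERN.fullmatch(name) is not None


for candidate in ("mlstorage523", "ML-Storage", "ab"):
    print(candidate, is_valid_storage_account_name(candidate))
```

fullmatch is used so that the whole name must satisfy the pattern, not just a prefix of it.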
Using the resource group deployment, deploy a new Machine Learning Workspace.
Once the deployment is completed, it is straightforward to access properties of the workspace you deployed. For
example, you can access the Primary Key Token.
# Access Azure Machine Learning studio Workspace Token after its deployment.
$rgd.Outputs.mlWorkspaceToken.Value
Another way to retrieve the tokens of an existing workspace is to use the Invoke-AzResourceAction command. For
example, you can list the primary and secondary tokens of all workspaces.
After the workspace is provisioned, you can also automate many Azure Machine Learning Studio tasks using the
PowerShell Module for Azure Machine Learning Studio.
Next steps
Learn more about authoring Azure Resource Manager Templates.
Have a look at the Azure Quickstart Templates Repository.
Watch this video about Azure Resource Manager.
See the Resource Manager template reference help
Import your training data into Azure Machine Learning Studio from various data sources
3/15/2019 • 13 minutes to read
To use your own data in Machine Learning Studio to develop and train a predictive analytics solution, you can use
data from:
Local file - Load local data ahead of time from your hard drive to create a dataset module in your workspace
Online data sources - Use the Import Data module to access data from one of several online sources while
your experiment is running
Machine Learning Studio experiment - Use data that was saved as a dataset in Machine Learning Studio
On-premises SQL Server database - Use data from an on-premises SQL Server database without having to
copy data manually
NOTE
There are a number of sample datasets available in Machine Learning Studio that you can use for training data. For
information on these, see Use the sample datasets in Azure Machine Learning Studio.
Prepare data
Machine Learning Studio is designed to work with rectangular or tabular data, such as text data that's delimited or
structured data from a database, though in some circumstances non-rectangular data may be used.
It's best if your data is relatively clean before you import it into Studio. For example, you'll want to take care of
issues such as unquoted strings.
However, there are modules available in Studio that enable some manipulation of data within your experiment
after you import your data. Depending on the machine learning algorithms you'll be using, you may need to
decide how you'll handle data structural issues such as missing values and sparse data, and there are modules that
can help with that. Look in the Data Transformation section of the module palette for modules that perform
these functions.
At any point in your experiment, you can view or download the data that's produced by a module by clicking the
output port. Depending on the module, there may be different download options available, or you may be able to
visualize the data within your web browser in Studio.
Data capacities
Modules in Machine Learning Studio support datasets of up to 10 GB of dense numerical data for common use
cases. If a module takes more than one input, the 10 GB value is the total of all input sizes. You can sample larger
datasets by using queries from Hive or Azure SQL Database, or you can use Learning by Counts preprocessing
before you import the data.
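As a rough illustration of the rule that the 10 GB value applies to the total of all inputs, the check below adds up the sizes of the datasets connected to a module. It is a Python sketch with a hypothetical function name, and it assumes that "10 GB" means 10 x 1024^3 bytes, which the text above does not specify.

```python
GIB = 1024 ** 3
# Assumption: "10 GB" is interpreted here as 10 * 1024**3 bytes.
LIMIT_BYTES = 10 * GIB


def fits_module_limit(input_sizes_bytes, limit_bytes=LIMIT_BYTES):
    """Return True if the combined size of all module inputs is within the limit."""
    return sum(input_sizes_bytes) <= limit_bytes


print(fits_module_limit([6 * GIB, 4 * GIB]))  # two inputs, exactly at the limit
print(fits_module_limit([6 * GIB, 5 * GIB]))  # two inputs, over the limit
```

A check like this only covers the raw input size; data that expands during processing can still exceed the limit.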
The following types of data can expand to larger datasets during feature normalization and are limited to less than
10 GB:
Sparse
Categorical
Strings
Binary data
The following modules are limited to datasets less than 10 GB:
Recommender modules
Synthetic Minority Oversampling Technique (SMOTE) module
Scripting modules: R, Python, SQL
Modules where the output data size can be larger than input data size, such as Join or Feature Hashing
Cross-validation, Tune Model Hyperparameters, Ordinal Regression, and One-vs-All Multiclass, when the
number of iterations is very large
For datasets that are larger than a couple of gigabytes, upload the data to Azure Storage or Azure SQL Database, or use
Azure HDInsight, rather than uploading directly from a local file.
You can find information about image data in the Import Images module reference.
Upload time depends on the size of your data and the speed of your connection to the service. If you know the file
will take a long time, you can do other things inside Studio while you wait. However, closing the browser before
the data upload is complete causes the upload to fail.
Once your data is uploaded, it's stored in a dataset module and is available to any experiment in your workspace.
When you're editing an experiment, you can find the datasets you've uploaded in the My Datasets list under the
Saved Datasets list in the module palette. You can drag and drop the dataset onto the experiment canvas when
you want to use the dataset for further analytics and machine learning.
NOTE
This article provides general information about the Import Data module. For more detailed information about the types of
data you can access, formats, parameters, and answers to common questions, see the module reference topic for the Import
Data module.
By using the Import Data module, you can access data from one of several online data sources while your
experiment is running:
A Web URL using HTTP
Hadoop using HiveQL
Azure blob storage
Azure table
Azure SQL database or SQL Server on Azure VM
On-premises SQL Server database
A data feed provider (currently OData only)
Azure Cosmos DB
Because this training data is accessed while your experiment is running, it's only available in that experiment. By
comparison, data that has been stored in a dataset module is available to any experiment in your workspace.
To access online data sources in your Studio experiment, add the Import Data module to your experiment. Then
select Launch Import Data Wizard under Properties for step-by-step guided instructions to select and
configure the data source. Alternatively, you can manually select Data source under Properties and supply the
parameters needed to access the data.
The online data sources that are supported are itemized in the table below. This table also summarizes the file
formats that are supported and parameters that are used to access the data.
IMPORTANT
Currently, the Import Data and Export Data modules can read and write data only from Azure storage created using the
Classic deployment model. In other words, the new Azure Blob Storage account type that offers a hot storage access tier or
cool storage access tier is not yet supported.
Generally, any Azure storage accounts that you might have created before this service option became available should not
be affected. If you need to create a new account, select Classic for the Deployment model, or use Resource manager and
select General purpose rather than Blob storage for Account kind.
For more information, see Azure Blob Storage: Hot and Cool Storage Tiers.
Data source: Web URL via HTTP
Description: Reads data in comma-separated values (CSV), tab-separated values (TSV), attribute-relation file format (ARFF), and Support Vector Machines (SVM-light) formats, from any web URL that uses HTTP.
Parameters:
URL: Specifies the full name of the file, including the site URL and the file name, with any extension.
Data format: Specifies one of the supported data formats: CSV, TSV, ARFF, or SVM-light. If the data has a header row, it is used to assign column names.

Data source: Hadoop/HDFS
Description: Reads data from distributed storage in Hadoop. You specify the data you want by using HiveQL, a SQL-like query language. HiveQL can also be used to aggregate data and perform data filtering before you add the data to Studio.
Parameters:
Hive database query: Specifies the Hive query used to generate the data.
HCatalog server URI: Specifies the name of your cluster using the format <your cluster name>.azurehdinsight.net.

Data source: SQL database
Description: Reads data that is stored in an Azure SQL database or in a SQL Server database running on an Azure virtual machine.
Parameters:
Database server name: Specifies the name of the server on which the database is running. In the case of Azure SQL Database, enter the server name that is generated. Typically it has the form <generated_identifier>.database.windows.net.

Data source: On-premises SQL database
Description: Reads data that is stored in an on-premises SQL database.
Parameters:
Data gateway: Specifies the name of the Data Management Gateway installed on a computer where it can access your SQL Server database. For information about setting up the gateway, see Perform advanced analytics with Azure Machine Learning Studio using data from an on-premises SQL server.

Data source: Azure Table
Description: Reads data from the Table service in Azure Storage. If you read large amounts of data infrequently, use the Azure Table Service. It provides a flexible, non-relational (NoSQL), massively scalable, inexpensive, and highly available storage solution.
Parameters:
The options in the Import Data module change depending on whether you are accessing public information or a private storage account that requires login credentials. This is determined by the Authentication Type, which can have a value of "PublicOrSAS" or "Account", each of which has its own set of parameters.

Data source: Azure Blob Storage
Description: Reads data stored in the Blob service in Azure Storage, including images, unstructured text, or binary data. You can use the Blob service to publicly expose data, or to privately store application data. You can access your data from anywhere by using HTTP or HTTPS connections.
Parameters:
The options in the Import Data module change depending on whether you are accessing public information or a private storage account that requires login credentials. This is determined by the Authentication Type, which can have a value of either "PublicOrSAS" or "Account".
Public or Shared Access Signature (SAS) URI: The parameters are:

Data source: Data Feed Provider
Description: Reads data from a supported feed provider. Currently only the Open Data Protocol (OData) format is supported.
Parameters:
Data content type: Specifies the OData format.
Source URL: Specifies the full URL for the data feed. For example, the following URL reads from the Northwind sample database: https://services.odata.org/northwind/northwind.svc/
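To make the header-row behavior described for the Web URL via HTTP source more concrete, the following sketch shows how delimited text can be split into column names and data rows. It is a local illustration that uses Python's standard csv module, not the Import Data module's own implementation. The function name and the fallback column names (Col1, Col2, and so on) are hypothetical.

```python
import csv
import io


def read_delimited(text, data_format="CSV", has_header=True):
    """Split CSV or TSV text into (column_names, rows)."""
    delimiter = {"CSV": ",", "TSV": "\t"}[data_format]
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        # The first row supplies the column names; the rest is data.
        return rows[0], rows[1:]
    # Hypothetical fallback names; Studio's own default names may differ.
    columns = ["Col" + str(i + 1) for i in range(len(rows[0]))]
    return columns, rows


columns, rows = read_delimited("mpg,cylinders\n18,8\n15,8\n")
print(columns)
print(rows)
```

All values come back as strings here; converting them to numeric types is a separate step that this sketch leaves out.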
Next steps
Deploying Azure Machine Learning Studio web services that use Data Import and Data Export modules
Perform analytics with Azure Machine Learning Studio using an on-premises SQL Server database
3/18/2019 • 9 minutes to read
Often enterprises that work with on-premises data would like to take advantage of the scale and agility of the
cloud for their machine learning workloads. But they don't want to disrupt their current business processes and
workflows by moving their on-premises data to the cloud. Azure Machine Learning Studio now supports reading
your data from an on-premises SQL Server database and then training and scoring a model with this data. You no
longer have to manually copy and sync the data between the cloud and your on-premises server. Instead, the
Import Data module in Azure Machine Learning Studio can now read directly from your on-premises SQL
Server database for your training and scoring jobs.
This article provides an overview of how to ingress on-premises SQL Server data into Azure Machine Learning
Studio. It assumes that you're familiar with Studio concepts such as workspaces, modules, datasets, and experiments.
NOTE
This feature is not available for free workspaces. For more information about Machine Learning pricing and tiers, see Azure
Machine Learning Pricing.
NOTE
You can't run Data Factory Self-hosted Integration Runtime and Power BI Gateway on the same computer.
You need to use the Data Factory Self-hosted Integration Runtime for Azure Machine Learning Studio even
if you are using Azure ExpressRoute for other data. You should treat your data source as an on-premises
data source (that's behind a firewall) even when you use ExpressRoute. Use the Data Factory Self-hosted
Integration Runtime to establish connectivity between Machine Learning and the data source.
You can find detailed information on installation prerequisites, installation steps, and troubleshooting tips in the
article Integration Runtime in Data Factory.
Ingress data from your on-premises SQL Server database into Azure Machine Learning
In this walkthrough, you will set up an Azure Data Factory Integration Runtime in an Azure Machine Learning
workspace, configure it, and then read data from an on-premises SQL Server database.
TIP
Before you start, disable your browser’s pop-up blocker for studio.azureml.net. If you're using the Google Chrome
browser, download and install one of the several ClickOnce app extensions available from the Chrome Web Store.
NOTE
Azure Data Factory Self-hosted Integration Runtime was formerly known as Data Management Gateway. The step-by-step
tutorial will continue to refer to it as a gateway.
4. In the New data gateway dialog, enter the Gateway Name and optionally add a Description. Click the
arrow on the bottom right-hand corner to go to the next step of the configuration.
5. In the Download and register data gateway dialog, copy the GATEWAY REGISTRATION KEY to the
clipboard.
6. If you have not yet downloaded and installed the Microsoft Data Management Gateway, then click
Download data management gateway. This takes you to the Microsoft Download Center where you can
select the gateway version you need, download it, and install it. You can find detailed information on
installation prerequisites, installation steps, and troubleshooting tips in the beginning sections of the article
Move data between on-premises sources and cloud with Data Management Gateway.
7. After the gateway is installed, the Data Management Gateway Configuration Manager will open and the
Register gateway dialog is displayed. Paste the Gateway Registration Key that you copied to the
clipboard and click Register.
8. If you already have a gateway installed, run the Data Management Gateway Configuration Manager. Click
Change key, paste the Gateway Registration Key that you copied to the clipboard in the previous step,
and click OK.
9. When the installation is complete, the Register gateway dialog for Microsoft Data Management Gateway
Configuration Manager is displayed. Paste the GATEWAY REGISTRATION KEY that you copied to the
clipboard in a previous step and click Register.
10. The gateway configuration is complete when the following values are set on the Home tab in Microsoft
Data Management Gateway Configuration Manager:
Gateway name and Instance name are set to the name of the gateway.
Registration is set to Registered.
Status is set to Started.
The status bar at the bottom displays Connected to Data Management Gateway Cloud Service
along with a green check mark.
Azure Machine Learning Studio also gets updated when the registration is successful.
11. In the Download and register data gateway dialog, click the check mark to complete the setup. The
Settings page displays the gateway status as "Online". In the right-hand pane, you'll find status and other
useful information.
12. In the Microsoft Data Management Gateway Configuration Manager switch to the Certificate tab. The
certificate specified on this tab is used to encrypt/decrypt credentials for the on-premises data store that
you specify in the portal. This certificate is the default certificate. Microsoft recommends changing this to
your own certificate that you back up in your certificate management system. Click Change to use your
own certificate instead.
13. (optional) If you want to enable verbose logging in order to troubleshoot issues with the gateway, in the
Microsoft Data Management Gateway Configuration Manager switch to the Diagnostics tab and check the
Enable verbose logging for troubleshooting purposes option. The logging information can be found in
the Windows Event Viewer under the Applications and Services Logs -> Data Management Gateway
node. You can also use the Diagnostics tab to test the connection to an on-premises data source using the
gateway.
This completes the gateway setup process in Azure Machine Learning Studio. You're now ready to use your on-
premises data.
You can create and set up multiple gateways in Studio for each workspace. For example, you may have a gateway
that you want to connect to your test data sources during development, and a different gateway for your
production data sources. Azure Machine Learning Studio gives you the flexibility to set up multiple gateways
depending upon your corporate environment. Currently you can’t share a gateway between workspaces and only
one gateway can be installed on a single computer. For more information, see Move data between on-premises
sources and cloud with Data Management Gateway.
Step 2: Use the gateway to read data from an on-premises data source
After you set up the gateway, you can add an Import Data module to an experiment that inputs the data from the
on-premises SQL Server database.
1. In Machine Learning Studio, select the EXPERIMENTS tab, click +NEW in the lower-left corner, and select
Blank Experiment (or select one of several sample experiments available).
2. Find and drag the Import Data module to the experiment canvas.
3. Click Save as below the canvas. Enter "Azure Machine Learning Studio On-Premises SQL Server Tutorial"
for the experiment name, select the workspace, and click the OK check mark.
4. Click the Import Data module to select it, then in the Properties pane to the right of the canvas, select
"On-Premises SQL Database" in the Data source dropdown list.
5. Select the Data gateway you installed and registered. You can set up another gateway by selecting "(add
new Data Gateway…)".
6. Enter the SQL Database server name and Database name, along with the SQL Database query you
want to execute.
7. Click Enter values under User name and password and enter your database credentials. You can use
Windows Integrated Authentication or SQL Server Authentication depending upon how your on-premises
SQL Server is configured.
The message "values required" changes to "values set" with a green check mark. You only need to enter the
credentials once unless the database information or password changes. Azure Machine Learning Studio
uses the certificate you provided when you installed the gateway to encrypt the credentials in the cloud.
Azure never stores on-premises credentials without encryption.
Azure Machine Learning Studio is a tool for developing machine learning experiments that are operationalized in
the Azure cloud platform. It is like the Visual Studio IDE and a scalable cloud service merged into a single platform.
You can incorporate standard Application Lifecycle Management (ALM) practices, from versioning various assets to
automated execution and deployment, into Azure Machine Learning Studio. This article discusses some of the
options and approaches.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Versioning experiment
There are two recommended ways to version your experiments. You can either rely on the built-in run history or
export the experiment in a JSON format so as to manage it externally. Each approach comes with its pros and cons.
Experiment snapshots using Run History
In the execution model of an Azure Machine Learning Studio experiment, an immutable snapshot of the
experiment is submitted to the job scheduler whenever you click Run in the experiment editor. To view this list of
snapshots, click Run History on the command bar in the experiment editor view.
You can then open the snapshot in Locked mode by clicking the name of the experiment at the time the experiment
was submitted to run and the snapshot was taken. Notice that only the first item in the list, which represents the
current experiment, is in an Editable state. Also notice that each snapshot can be in various Status states as well,
including Finished (Partial run), Failed, Failed (Partial run), or Draft.
After it's opened, you can save the snapshot experiment as a new experiment and then modify it. If your experiment
snapshot contains assets such as trained models, transforms, or datasets that have updated versions, the snapshot
retains the references to the original version when the snapshot was taken. If you save the locked snapshot as a
new experiment, Azure Machine Learning Studio detects the existence of a newer version of these assets, and
automatically updates them in the new experiment.
If you delete the experiment, all snapshots of that experiment are deleted.
Export/import experiment in JSON format
The run history snapshots keep an immutable version of the experiment in Azure Machine Learning Studio every
time it is submitted to run. You can also save a local copy of the experiment and check it in to your favorite source
control system, such as Team Foundation Server, and later on re-create an experiment from that local file. You can
use the Azure Machine Learning PowerShell cmdlets Export-AmlExperimentGraph and Import-AmlExperimentGraph
to accomplish that.
The JSON file is a textual representation of the experiment graph, which might include a reference to assets in the
workspace such as a dataset or a trained model. It doesn't contain a serialized version of the asset. If you attempt to
import the JSON document back into the workspace, the referenced assets must already exist with the same asset
IDs that are referenced in the experiment. Otherwise you cannot access the imported experiment.
Next steps
Download the Azure Machine Learning Studio PowerShell module and start to automate your ALM tasks.
Learn how to create and manage a large number of ML models by using just a single experiment through
PowerShell and the retraining API.
Learn more about deploying Azure Machine Learning web services.
Manage experiment iterations in Azure Machine Learning Studio
3/15/2019 • 4 minutes to read
Developing a predictive analysis model is an iterative process - as you modify the various functions and
parameters of your experiment, your results converge until you are satisfied that you have a trained, effective
model. Key to this process is tracking the various iterations of your experiment parameters and configurations.
You can review previous runs of your experiments at any time in order to challenge, revisit, and ultimately either
confirm or refine previous assumptions. When you run an experiment, Machine Learning Studio keeps a history of
the run, including dataset, module, and port connections and parameters. This history also captures results,
runtime information such as start and stop times, log messages, and execution status. You can look back at any of
these runs at any time to review the chronology of your experiment and intermediate results. You can even use a
previous run of your experiment to launch into a new phase of inquiry and discovery on your path to creating
simple, complex, or even ensemble modeling solutions.
NOTE
When you view a previous run of an experiment, that version of the experiment is locked and can't be edited. You can,
however, save a copy of it by clicking SAVE AS and providing a new name for the copy. Machine Learning Studio opens the
new copy, which you can then edit and run. This copy of your experiment is available in the EXPERIMENTS list along with all
your other experiments.
Click any of these runs to view a snapshot of the experiment at the time you ran it. The configuration, parameter
values, comments, and results are all preserved to give you a complete record of that run of your experiment.
TIP
To document your iterations of the experiment, you can modify the title each time you run it, you can update the Summary
of the experiment in the properties pane, and you can add or update comments on individual modules to record your
changes. The title, summary, and module comments are saved with each run of the experiment.
The list of experiments in the EXPERIMENTS tab in Machine Learning Studio always displays the latest version
of an experiment. If you open a previous run of the experiment (using Prior Run or VIEW RUN HISTORY), you
can return to the draft version by clicking VIEW RUN HISTORY and selecting the iteration that has a STATE of
Editable.
Here's a common machine learning problem: You want to create many models that have the same training
workflow and use the same algorithm. But you want them to have different training datasets as input. This article
shows you how to do this at scale in Azure Machine Learning Studio using just a single experiment.
For example, let's say you own a global bike rental franchise business. You want to build a regression model to
predict the rental demand based on historic data. You have 1,000 rental locations across the world and you've
collected a dataset for each location. They include important features such as date, time, weather, and traffic that
are specific to each location.
You could train your model once using a merged version of all the datasets across all locations. But, each of your
locations has a unique environment. So a better approach would be to train your regression model separately
using the dataset for each location. That way, each trained model could take into account the different store sizes,
volume, geography, population, bike-friendly traffic environment, and more.
That may be the best approach, but you don't want to create 1,000 training experiments in Azure Machine
Learning Studio with each one representing a unique location. Besides being an overwhelming task, it also seems
inefficient since each experiment would have all the same components except for the training dataset.
Fortunately, you can accomplish this by using the Azure Machine Learning Studio retraining API and automating
the task with Azure Machine Learning Studio PowerShell.
NOTE
To make your sample run faster, reduce the number of locations from 1,000 to 10. But the same principles and procedures
apply to 1,000 locations. However, if you do want to train from 1,000 datasets you might want to run the following
PowerShell scripts in parallel. How to do that is beyond the scope of this article, but you can find examples of PowerShell
multi-threading on the Internet.
NOTE
In order to follow along with this example, you may want to use a standard workspace rather than a free workspace. You
create one endpoint for each rental location - for a total of 10 endpoints - and that requires a standard workspace since a free
workspace is limited to 3 endpoints. If you only have a free workspace, just change the scripts to allow for only three locations.
The experiment uses an Import Data module to import the training dataset customer001.csv from an Azure
storage account. Let's assume you have collected training datasets from all bike rental locations and stored them
in the same blob storage location with file names ranging from rentalloc001.csv to rentalloc010.csv.
Note that a Web Service Output module has been added to the Train Model module. When this experiment is
deployed as a web service, the endpoint associated with that output returns the trained model in the format of an
.ilearner file.
Also note that you set up a web service parameter that defines the URL that the Import Data module uses. This
allows you to use the parameter to specify individual training datasets to train the model for each location. There
are other ways you could have done this. You can use a SQL query with a web service parameter to get data from
a SQL Azure database. Or you can use a Web Service Input module to pass in a dataset to the web service.
Now, let's run this training experiment using the default value rental001.csv as the training dataset. If you view the
output of the Evaluate module (click the output and select Visualize), you can see you get a decent performance
of AUC = 0.91. At this point, you're ready to deploy a web service out of this training experiment.
Import-Module .\AzureMLPS.dll
# Assume the default configuration file exists and is properly set to point to the valid Workspace.
$scoringSvc = Get-AmlWebService | where Name -eq 'Bike Rental Scoring'
$trainingSvc = Get-AmlWebService | where Name -eq 'Bike Rental Training'
You have now created 10 endpoints, and they all contain the same model trained on customer001.csv. You can
view them in the Azure portal.
NOTE
The BES endpoint is the only supported mode for this operation. RRS cannot be used for producing trained models.
Instead of constructing 10 different BES job configuration JSON files, you create the config string dynamically
and pass it to the JobConfigString parameter of the Invoke-AmlWebServiceBESEndpoint cmdlet. There's no need
to keep a copy on disk.
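The same idea of building each job configuration in memory can be sketched in Python. This is only an illustration and not part of the AzureMLPS tooling: the function name build_job_config, the base URL, and the placeholder connection string are assumptions. Only the JSON layout (GlobalParameters with a URI, and Outputs with output1 holding ConnectionString and RelativeLocation) follows the configuration string used in the PowerShell script in this article.

```python
import json

# Hypothetical placeholders; substitute your own storage details.
BASE_URL = "https://<myaccount>.blob.core.windows.net/<container>/BikeRental"
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=<myaccount>;AccountKey=<mykey>"


def build_job_config(location_number: int) -> str:
    """Return the BES job configuration for one rental location as a JSON string."""
    seq = str(location_number).zfill(3)  # 1 -> "001", 10 -> "010"
    config = {
        "GlobalParameters": {"URI": f"{BASE_URL}{seq}.csv"},
        "Outputs": {
            "output1": {
                "ConnectionString": CONNECTION_STRING,
                "RelativeLocation": f"<container>/model{seq}.ilearner",
            }
        },
    }
    return json.dumps(config)


# One configuration string per location, held in memory only.
job_configs = [build_job_config(i) for i in range(1, 11)]
print(len(job_configs))
```

Using json.dumps instead of string concatenation avoids quoting mistakes when a file name or key contains special characters.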
If everything goes well, after a while you should see 10 .ilearner files, from model001.ilearner to
model010.ilearner, in your Azure storage account. Now you're ready to update the 10 scoring web service
endpoints with these models using the Patch-AmlWebServiceEndpoint PowerShell cmdlet. Remember again
that you can only patch the non-default endpoints you programmatically created earlier.
This should run fairly quickly. When the execution finishes, you'll have successfully created 10 predictive web
service endpoints. Each one will contain a trained model uniquely trained on the dataset specific to a rental
location, all from a single training experiment. To verify this, you can try calling these endpoints using the
Invoke-AmlWebServiceRRSEndpoint cmdlet, providing them with the same input data. You should expect to
see different prediction results since the models are trained with different training sets.
# Invoke the retraining API 10 times to produce 10 regression models in .ilearner format
$trainingSvcEp = (Get-AmlWebServiceEndpoint -WebServiceId $trainingSvc.Id)[0];
$submitJobRequestUrl = $trainingSvcEp.ApiLocation + '/jobs?api-version=2.0';
$apiKey = $trainingSvcEp.PrimaryKey;
For ($i = 1; $i -le 10; $i++){
    $seq = $i.ToString().PadLeft(3, '0');
    $inputFileName = 'https://bostonmtc.blob.core.windows.net/hai/retrain/bike_rental/BikeRental' + $seq + '.csv';
    $configContent = '{ "GlobalParameters": { "URI": "' + $inputFileName + '" }, "Outputs": { "output1": { "ConnectionString": "DefaultEndpointsProtocol=https;AccountName=<myaccount>;AccountKey=<mykey>", "RelativeLocation": "hai/retrain/bike_rental/model' + $seq + '.ilearner" } } }';
    Write-Host ('training regression model on ' + $inputFileName + ' for rental location ' + $seq + '...');
    Invoke-AmlWebServiceBESEndpoint -JobConfigString $configContent -SubmitJobRequestUrl $submitJobRequestUrl -ApiKey $apiKey
}
Learn how to start with example experiments from Azure AI Gallery instead of creating machine learning
experiments from scratch. You can use the examples to build your own machine learning solution.
The gallery has example experiments by the Microsoft Azure Machine Learning Studio team as well as examples
shared by the Machine Learning community. You also can ask questions or post comments about experiments.
To see how to use the gallery, watch the 3-minute video Copy other people's work to do data science from the
series Data Science for Beginners.
Next steps
Import data from various sources
Quickstart tutorial for the R language in Machine Learning
Deploy a Machine Learning web service
How to evaluate model performance in Azure
Machine Learning Studio
3/15/2019 • 12 minutes to read
This article demonstrates how to evaluate the performance of a model in Azure Machine Learning Studio and
provides a brief explanation of the metrics available for this task. Three common supervised learning scenarios are
presented:
regression
binary classification
multiclass classification
Evaluating the performance of a model is one of the core stages in the data science process. It indicates how
successful the scoring (predictions) of a dataset has been by a trained model.
Azure Machine Learning Studio supports model evaluation through two of its main machine learning modules:
Evaluate Model and Cross-Validate Model. These modules allow you to see how your model performs in terms of
a number of metrics that are commonly used in machine learning and statistics.
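To make these metrics concrete, here is a minimal Python sketch of a few of them, using their textbook definitions. It is not the implementation used by the Evaluate Model or Cross-Validate Model modules; the function names rmse and binary_metrics and the sample values are assumptions chosen for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for regression predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def binary_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F1 score for two-class predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
print(rmse([3.0, 5.0], [1.0, 5.0]))
print(binary_metrics([1, 0, 1, 1], [1, 1, 0, 1]))
```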
This topic describes how to choose the right hyperparameter set for an algorithm in Azure Machine Learning
Studio. Most machine learning algorithms have parameters to set. When you train a model, you need to provide
values for those parameters. The efficacy of the trained model depends on the model parameters that you choose.
The process of finding the optimal set of parameters is known as model selection.
There are various ways to do model selection. In machine learning, cross-validation is one of the most widely used
methods for model selection, and it is the default model selection mechanism in Azure Machine Learning Studio.
Because Azure Machine Learning Studio supports both R and Python, you can always implement your own model
selection mechanisms by using either R or Python.
There are four steps in the process of finding the best parameter set:
1. Define the parameter space: For the algorithm, first decide the exact parameter values you want to consider.
2. Define the cross-validation settings: Decide how to choose cross-validation folds for the dataset.
3. Define the metric: Decide what metric to use for determining the best set of parameters, such as accuracy,
root mean squared error, precision, recall, or f-score.
4. Train, evaluate, and compare: For each unique combination of the parameter values, cross-validation is
carried out and scored with the error metric you define. After evaluation and comparison, you can choose the
best-performing model.
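The four steps can be sketched outside Studio in a few lines of Python. This is a hedged illustration, not what the Studio modules do internally: the k-nearest-neighbor regressor, the toy data, and the names knn_predict, cross_validated_rmse, and parameter_grid are all assumptions made for the example.

```python
import math


def knn_predict(train, x, k):
    """Predict y at x as the mean of the k nearest training targets."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validated_rmse(data, k, n_folds):
    """Average RMSE of a k-NN regressor over n_folds folds."""
    fold_errors = []
    for fold in range(n_folds):
        # Step 2: every n_folds-th row forms the held-out fold.
        held_out = [p for i, p in enumerate(data) if i % n_folds == fold]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in held_out]
        # Step 3: the metric is root mean squared error.
        fold_errors.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_errors) / n_folds


# Toy data: y = 2x with a small alternating offset (illustrative only).
data = [(float(x), 2.0 * x + (0.5 if x % 2 else -0.5)) for x in range(20)]

# Step 1: the parameter space is an explicit list of candidate values.
parameter_grid = [1, 2, 4, 8]

# Step 4: evaluate every candidate and keep the one with the lowest error.
scores = {k: cross_validated_rmse(data, k, n_folds=5) for k in parameter_grid}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

For a metric where higher is better, such as accuracy, the last step would select the maximum instead of the minimum.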
The following image illustrates how this can be achieved in Azure Machine Learning Studio.
The model is then evaluated on the validation dataset. The left output port of the module shows different metrics
as functions of parameter values. The right output port gives the trained model that corresponds to the best-
performing model according to the chosen metric (Accuracy in this case).
You can see the exact parameters chosen by visualizing the right output port. This model can be used in scoring a
test set or in an operationalized web service after saving as a trained model.
Interpret model results in Azure Machine Learning
Studio
3/15/2019 • 13 minutes to read
This topic explains how to visualize and interpret prediction results in Azure Machine Learning Studio. After you
have trained a model and done predictions on top of it ("scored the model"), you need to understand and interpret
the prediction result.
There are four major kinds of machine learning models in Azure Machine Learning Studio:
Classification
Clustering
Regression
Recommender systems
The modules used for prediction on top of these models are:
Score Model module for classification and regression
Assign to Clusters module for clustering
Score Matchbox Recommender for recommendation systems
This document explains how to interpret prediction results for each of these modules. For an overview of these
modules, see How to choose parameters to optimize your algorithms in Azure Machine Learning Studio.
This topic addresses prediction interpretation but not model evaluation. For more information about how to
evaluate your model, see How to evaluate model performance in Azure Machine Learning Studio.
If you are new to Azure Machine Learning Studio and need help creating a simple experiment to get started, see
Create a simple experiment in Azure Machine Learning Studio.
Classification
There are two subcategories of classification problems:
Problems with only two classes (two-class or binary classification)
Problems with more than two classes (multi-class classification)
Azure Machine Learning Studio has different modules to deal with each of these types of classification, but the
methods for interpreting their prediction results are similar.
Two-class classification
Example experiment
An example of a two-class classification problem is the classification of iris flowers. The task is to classify iris
flowers based on their features. The Iris data set provided in Azure Machine Learning Studio is a subset of the
popular Iris data set containing instances of only two flower species (classes 0 and 1). There are four features for
each flower (sepal length, sepal width, petal length, and petal width).
Figure 1. Iris two-class classification problem experiment
An experiment has been performed to solve this problem, as shown in Figure 1. A two-class boosted decision tree
model has been trained and scored. Now you can visualize the prediction results from the Score Model module by
clicking the output port of the Score Model module and then clicking Visualize.
In the training data, there are 16 features extracted from hand-written letter images. The 26 letters form our 26
classes. Figure 6 shows an experiment that will train a multiclass classification model for letter recognition and
predict on the same feature set on a test data set.
Figure 6. Letter recognition multiclass classification problem experiment
If you visualize the results from the Score Model module (click its output port and then click Visualize),
you should see content as shown in Figure 7.
Regression
Regression problems are different from classification problems. In a classification problem, you're trying to predict
discrete classes, such as which class an iris flower belongs to. But as you can see in the following example of a
regression problem, you're trying to predict a continuous variable, such as the price of a car.
Example experiment
Use automobile price prediction as your example for regression. You are trying to predict the price of a car based
on its features, including make, fuel type, body type, and drive wheel. The experiment is shown in Figure 11.
Figure 12. Scoring result for the automobile price prediction problem
Result interpretation
Scored Labels is the result column in this scoring result. The numbers are the predicted price for each car.
Web service publication
You can publish the regression experiment into a web service and call it for automobile price prediction in the
same way as in the two-class classification use case.
Figure 13. Scoring experiment of an automobile price regression problem
When you run the web service, the returned result looks like Figure 14. The predicted price for this car is $15,085.52.
Clustering
Example experiment
Let’s use the Iris data set again to build a clustering experiment. Here you can filter out the class labels in the data
set so that it only has features and can be used for clustering. In this iris use case, specify the number of clusters to
be two during the training process, which means you would cluster the flowers into two classes. The experiment is
shown in Figure 15.
Figure 15. Iris clustering problem experiment
Clustering differs from classification in that the training data set doesn’t have ground-truth labels by itself.
Clustering groups the training data set instances into distinct clusters. During the training process, the model labels
the entries by learning the differences between their features. After that, the trained model can be used to further
classify future entries. There are two parts of the result we are interested in within a clustering problem. The first
part is labeling the training data set, and the second is classifying a new data set with the trained model.
The first part of the result can be visualized by clicking the left output port of Train Clustering Model and then
clicking Visualize. The visualization is shown in Figure 16.
Figure 16. Visualize clustering result for the training data set
The result of the second part, clustering new entries with the trained clustering model, is shown in Figure 17.
Figure 17. Visualize clustering result on a new data set
Result interpretation
Although the results of the two parts stem from different experiment stages, they look the same and are
interpreted in the same way. The first four columns are features. The last column, Assignments, is the prediction
result. The entries assigned the same number are predicted to be in the same cluster, that is, they share similarities
in some way (this experiment uses the default Euclidean distance metric). Because you specified the number of
clusters to be 2, the entries in Assignments are labeled either 0 or 1.
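To illustrate how an Assignments value can be read, the following Python sketch assigns each row to the nearest of two cluster centers by Euclidean distance. It is a simplified picture, not the algorithm of the Train Clustering Model or Assign to Clusters modules; the centroid and flower values are made up, and assign_to_clusters is a hypothetical helper.

```python
import math


def assign_to_clusters(rows, centroids):
    """Return, for each row, the index of the closest centroid (Euclidean distance)."""
    assignments = []
    for row in rows:
        distances = [math.dist(row, centroid) for centroid in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments


# Hypothetical centroids and flowers: sepal length, sepal width, petal length, petal width.
centroids = [(5.0, 3.4, 1.5, 0.2), (5.9, 2.8, 4.3, 1.3)]
flowers = [(5.1, 3.5, 1.4, 0.2), (6.0, 2.7, 4.5, 1.5), (4.8, 3.0, 1.4, 0.1)]

print(assign_to_clusters(flowers, centroids))  # prints [0, 1, 0]
```

Rows that receive the same index are closer to the same center, which is what "predicted to be in the same cluster" means here.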
Web service publication
You can publish the clustering experiment into a web service and call it for clustering predictions the same way as
in the two-class classification use case.
Figure 18. Scoring experiment of an iris clustering problem
After you run the web service, the returned result looks like Figure 19. This flower is predicted to be in cluster 0.
Recommender system
Example experiment
For recommender systems, you can use the restaurant recommendation problem as an example: you can
recommend restaurants for customers based on their rating history. The input data consists of three parts:
Restaurant ratings from customers
Customer feature data
Restaurant feature data
There are several things we can do with the Train Matchbox Recommender module in Azure Machine Learning
Studio:
Predict ratings for a given user and item
Recommend items to a given user
Find users related to a given user
Find items related to a given item
You can choose what you want to do by selecting from the four options in the Recommender prediction kind
menu. Here you can walk through all four scenarios.
A typical Azure Machine Learning Studio experiment for a recommender system looks like Figure 20. For
information about how to use those recommender system modules, see Train matchbox recommender and Score
matchbox recommender.
When running a model, you may run into the following errors:
the Train Model module produces an error
the Score Model module produces incorrect results
This article explains potential causes for these errors.
Retraining is one way to ensure machine learning models stay accurate and based on the most relevant data
available. This article shows how to retrain and deploy a machine learning model as a new web service in Studio. If
you're looking to retrain a classic web service, view this how-to article.
This article assumes you already have a predictive web service deployed. If you don't already have a predictive web
service, learn how to deploy a Studio web service here.
You'll follow these steps to retrain and deploy a new machine learning web service:
1. Deploy a retraining web service
2. Train a new model using your retraining web service
3. Update your existing predictive experiment to use the new model
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
const string apiKey = "abc123"; // Replace this with the API key for the web service
In the Basic consumption info section of the Consume page, locate the primary key, and copy it to the apiKey
declaration.
Update the Azure Storage information
The BES sample code uploads a file from a local drive (for example, "C:\temp\CensusInput.csv") to Azure Storage,
processes it, and writes the results back to Azure Storage.
1. Sign in to the Azure portal.
2. In the left navigation column, click More services, search for Storage accounts, and select it.
3. From the list of storage accounts, select one to store the retrained model.
4. In the left navigation column, click Access keys.
5. Copy and save the Primary Access Key.
6. In the left navigation column, click Blobs.
7. Select an existing container, or create a new one and save the name.
Locate the StorageAccountName, StorageAccountKey, and StorageContainerName declarations, and update the
values that you saved from the portal.
const string StorageAccountName = "mystorageacct"; // Replace this with your Azure storage account name
const string StorageAccountKey = "a_storage_account_key"; // Replace this with your Azure Storage key
const string StorageContainerName = "mycontainer"; // Replace this with your Azure Storage container name
You also must ensure that the input file is available at the location that you specify in the code.
Specify the output location
When you specify the output location in the Request Payload, the extension of the file that is specified in
RelativeLocation must be ilearner.
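A small guard can catch a wrong extension before the request is sent. The Python sketch below is an assumption-laden illustration: check_relative_location is a hypothetical helper, the container and file names are made up, and the check is deliberately strict about the lowercase extension.

```python
from pathlib import PurePosixPath


def check_relative_location(relative_location: str) -> str:
    """Return the location unchanged if it ends in .ilearner, otherwise raise ValueError."""
    suffix = PurePosixPath(relative_location).suffix
    if suffix != ".ilearner":
        raise ValueError(f"expected an .ilearner file, got {relative_location!r}")
    return relative_location


# Hypothetical container and file name.
print(check_relative_location("mycontainer/retrained_model.ilearner"))
```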
To determine the resource group name of an existing web service, run the Get-AzMlWebService cmdlet without
any parameters to display the web services in your subscription. Locate the web service, and then look at its web
service ID. The name of the resource group is the fourth element in the ID, just after the resourceGroups element.
In the following example, the resource group name is Default-MachineLearning-SouthCentralUS.
Properties : Microsoft.Azure.Management.MachineLearning.WebServices.Models.WebServicePropertiesForGraph
Id : /subscriptions/<subscription ID>/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/RetrainSamplePre.2016.8.17.0.3.51.237
Name : RetrainSamplePre.2016.8.17.0.3.51.237
Location : South Central US
Type : Microsoft.MachineLearning/webServices
Tags : {}
Alternatively, to determine the resource group name of an existing web service, sign in to the Azure Machine
Learning Web Services portal. Select the web service. The resource group name is the fifth element of the URL of
the web service, just after the resourceGroups element. In the following example, the resource group name is
Default-MachineLearning-SouthCentralUS.
https://services.azureml.net/subscriptions/<subscription ID>/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/RetrainSamplePre.2016.8.17.0.3.51.237
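If you need the resource group name in a script, it can be extracted from either the web service ID or the portal URL by taking the path segment that follows resourceGroups. The Python sketch below shows one way to do that; resource_group_from_id is a hypothetical helper, and the all-zero subscription ID is a placeholder.

```python
from urllib.parse import urlparse


def resource_group_from_id(id_or_url: str) -> str:
    """Return the path segment that follows 'resourceGroups' in a web service ID or URL."""
    path = urlparse(id_or_url).path  # For a bare ID, the whole string is the path.
    segments = [segment for segment in path.split("/") if segment]
    return segments[segments.index("resourceGroups") + 1]


# Placeholder subscription ID; the rest follows the example in this article.
web_service_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/Default-MachineLearning-SouthCentralUS"
    "/providers/Microsoft.MachineLearning/webServices/RetrainSamplePre.2016.8.17.0.3.51.237"
)
portal_url = "https://services.azureml.net" + web_service_id

print(resource_group_from_id(web_service_id))
print(resource_group_from_id(portal_url))
```

Looking up the segment by name rather than by position keeps the helper working for both forms, since the ID and the URL count their elements differently.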
"asset3": {
"name": "Retrain Sample [trained model]",
"type": "Resource",
"locationInfo": {
"uri":
"https://mltestaccount.blob.core.windows.net/azuremlassetscontainer/baca7bca650f46218633552c0bcbba0e.ilearner"
},
"outputPorts": {
"Results dataset": {
"type": "Dataset"
}
}
},
Next steps
To learn more about how to manage web services or keep track of multiple experiment runs, see the following
articles:
Explore the Web Services portal
Manage experiment iterations
Retrain and deploy a classic Studio web service
5/6/2019 • 4 minutes to read
Retraining machine learning models is one way to ensure they stay accurate and based on the most relevant data
available. This article will show you how to retrain a classic Studio web service. For a guide on how to retrain a new
Studio web service, view this how-to article.
Prerequisites
This article assumes you already have both a retraining experiment and a predictive experiment. These steps are
explained in Retrain and deploy a machine learning model. However, instead of deploying your machine learning
model as a new web service, you will deploy your predictive experiment as a classic web service.
NOTE
Be sure you are adding the endpoint to the Predictive Web Service, not the Training Web Service. If you have correctly
deployed both a Training and a Predictive Web Service, you should see two separate web services listed. The Predictive Web
Service should end with "[predictive exp.]".
NOTE
If you added the endpoint to the Training Web Service instead of the Predictive Web Service, you will receive the following
error when you click the Update Resource link: "Sorry, but this feature is not supported or available in this context. This
Web Service has no updatable resources. We apologize for the inconvenience and are working on improving this workflow."
The PATCH help page contains the PATCH URL you must use and provides sample code you can use to call it.
if (!response.IsSuccessStatusCode)
{
await WriteFailedResponse(response);
}
The apiKey and the endpointUrl for the call can be obtained from the endpoint dashboard.
The value of the Name parameter in Resources should match the Resource Name of the saved Trained Model in
the predictive experiment. To get the Resource Name:
1. Sign in to the Azure portal.
2. In the left menu, click Machine Learning.
3. Under Name, click your workspace and then click Web Services.
4. Under Name, click Census Model [predictive exp.].
5. Click the new endpoint you added.
6. On the endpoint dashboard, click Update Resource.
7. On the Update Resource API Documentation page for the web service, you can find the Resource Name
under Updatable Resources.
If your SAS token expires before you finish updating the endpoint, you must perform a GET with the Job ID to
obtain a fresh token.
When the code has successfully run, the new endpoint should start using the retrained model in approximately 30
seconds.
Next steps
To learn more about how to manage web services or keep track of multiple experiment runs, see the following
articles:
Explore the Web Services portal
Manage experiment iterations
Getting started with the R programming language in
Azure Machine Learning Studio
4/24/2019 • 53 minutes to read
Introduction
This tutorial helps you start extending Azure Machine Learning Studio by using the R programming language.
Follow this R programming tutorial to create, test and execute R code within Studio. As you work through the tutorial,
you will create a complete forecasting solution by using the R language in Studio.
Microsoft Azure Machine Learning Studio contains many powerful machine learning and data manipulation
modules. The powerful R language has been described as the lingua franca of analytics. Happily, analytics and data
manipulation in Studio can be extended by using R. This combination provides the scalability and ease of
deployment of Studio with the flexibility and deep analytics of R.
Forecasting and the dataset
Forecasting is a widely employed and quite useful analytical method. Common uses range from predicting sales of
seasonal items and determining optimal inventory levels to predicting macroeconomic variables. Forecasting is
typically done with time series models.
Time series data is data in which the values have a time index. The time index can be regular, e.g. every month or
every minute, or irregular. A time series model is based on time series data. The R programming language contains
a flexible framework and extensive analytics for time series data.
In this guide we will be working with California dairy production and pricing data. This data includes monthly
information on the production of several dairy products and the price of milk fat, a benchmark commodity.
The data used in this article, along with R scripts, can be downloaded from MachineLearningSamples-
Notebooks/studio-samples. Data in the file cadairydata.csv was originally synthesized from information available
from the University of Wisconsin at https://dairymarkets.com.
Organization
We will progress through several steps as you learn how to create, test and execute analytics and data manipulation
R code in the Azure Machine Learning Studio environment.
First we will explore the basics of using the R language in the Azure Machine Learning Studio environment.
Then we progress to discussing various aspects of I/O for data, R code and graphics in the Azure Machine
Learning Studio environment.
We will then construct the first part of our forecasting solution by creating code for data cleaning and
transformation.
With our data prepared we will perform an analysis of the correlations between several of the variables in our
dataset.
Finally, we will create a seasonal time series forecasting model for milk production.
Figure 1. The Machine Learning Studio environment showing the Execute R Script module selected.
Referring to Figure 1, let's look at some of the key parts of the Machine Learning Studio environment for working
with the Execute R Script module.
The modules in the experiment are shown in the center pane.
The upper part of the right pane contains a window to view and edit your R scripts.
The lower part of the right pane shows some properties of the Execute R Script. You can view the error and output
logs by selecting the appropriate spots of this pane.
We will, of course, be discussing the Execute R Script in greater detail in the rest of this article.
When working with complex R functions, I recommend that you edit, test and debug in RStudio. As with any
software development, extend your code incrementally and test it on small simple test cases. Then cut and paste
your functions into the R script window of the Execute R Script module. This approach allows you to harness both
the RStudio integrated development environment (IDE) and the power of Azure Machine Learning Studio.
Execute R code
Any R code in the Execute R Script module will execute when you run the experiment by selecting the Run button.
When execution has completed, a check mark will appear on the Execute R Script icon.
Defensive R coding for Azure Machine Learning
If you are developing R code for, say, a web service by using Azure Machine Learning Studio, you should definitely
plan how your code will deal with an unexpected data input and exceptions. To maintain clarity, I have not included
much in the way of checking or exception handling in most of the code examples shown. However, as we proceed I
will give you several examples of functions by using R's exception handling capability.
If you need a more complete treatment of R exception handling, I recommend you read the applicable sections of
the book by Wickham listed below in Further reading.
Debug and test R in Machine Learning Studio
To reiterate, I recommend you test and debug your R code on a small scale in RStudio. However, there are cases
where you will need to track down R code problems in the Execute R Script itself. In addition, it is good practice to
check your results in Machine Learning Studio.
Output from the execution of your R code on the Azure Machine Learning Studio platform is found primarily
in output.log. Some additional information will be seen in error.log.
If an error occurs in Machine Learning Studio while running your R code, your first course of action should be to
look at error.log. This file can contain useful error messages to help you understand and correct your error. To view
error.log, select View error log on the properties pane for the Execute R Script containing the error.
For example, I ran the following R code, with an undefined variable y, in an Execute R Script module:
x <- 1.0
z <- x + y
This code fails to execute, resulting in an error condition. Selecting View error log on the properties pane
produces the display shown in Figure 2.
[Critical] Error: Error 0063: The following error occurred during evaluation of R script:
---------- Start of error message from R ----------
object 'y' not found
This error message contains no surprises and clearly identifies the problem.
To inspect the value of any object in R, you can print these values to the output.log file. The rules for examining
object values are essentially the same as in an interactive R session. For example, if you type a variable name on a
line, the value of the object will be printed to the output.log file.
Packages in Machine Learning Studio
Studio comes with over 350 preinstalled R language packages. You can use the following code in the Execute R
Script module to retrieve a list of the preinstalled packages.
Figure 3. The CA Dairy Analysis experiment with dataset and Execute R Script module.
Check on the data
Let's have a look at the data we have loaded into our experiment. In the experiment, select the output of the
cadairydata.csv dataset and select Visualize. You should see something like Figure 4.
Now I need to transfer this script to Azure Machine Learning Studio. I could simply cut and paste. However, in this
case, I will transfer my R script via a zip file.
Data input to the Execute R Script module
Let's have a look at the inputs to the Execute R Script module. In this example we will read the California dairy data
into the Execute R Script module.
There are three possible inputs for the Execute R Script module. You may use any one or all of these inputs,
depending on your application. It is also perfectly reasonable to use an R script that takes no input at all.
Let's look at each of these inputs, going from left to right. You can see the names of each of the inputs by placing
your cursor over the input and reading the tooltip.
Script Bundle
The Script Bundle input allows you to pass the contents of a zip file into Execute R Script module. You can use one
of the following commands to read the contents of the zip file into your R code.
NOTE
Azure Machine Learning Studio treats files in the zip as if they are in the src/ directory, so you need to prefix your file names
with this directory name. For example, if the zip contains the files yourfile.R and yourData.rdata in the root of the zip,
you would address these as src/yourfile.R and src/yourData.rdata when using source and load .
We already discussed loading datasets in Load the dataset. Once you have created and tested the R script shown in
the previous section, do the following:
1. Save the R script into a .R file. I call my script file "simpleplot.R". Here are the contents.
2. Create a zip file and copy your script into this zip file. On Windows, you can right-click the file and select
Send to, and then Compressed folder. This will create a new zip file containing the "simpleplot.R" file.
3. Add your file to the datasets in Machine Learning Studio, specifying the type as zip. You should now see
the zip file in your datasets.
4. Drag and drop the zip file from datasets onto the ML Studio canvas.
5. Connect the output of the zip data icon to the Script Bundle input of the Execute R Script module.
6. Type the source() function with the name of your script file into the code window for the Execute R Script module.
In my case I typed source("src/simpleplot.R") .
7. Make sure you select Save.
Once these steps are complete, the Execute R Script module will execute the R script in the zip file when the
experiment is run. At this point your experiment should look something like Figure 5.
Figure 5. Experiment using zipped R script.
Dataset1
You can pass a rectangular table of data to your R code by using the Dataset1 input. In our simple script the
maml.mapInputPort(1) function reads the data from port 1. This data is then assigned to a dataframe variable name
in your code. In our simple script the first line of code performs the assignment.
Execute your experiment by selecting the Run button. When the execution finishes, select the Execute R Script
module and then select View output log on the properties pane. A new page should appear in your browser
showing the contents of the output.log file. When you scroll down you should see something like the following.
[ModuleOutput] InputDataStructure
[ModuleOutput]
[ModuleOutput] {
[ModuleOutput] "InputName":Dataset1
[ModuleOutput] "Rows":228
[ModuleOutput] "Cols":9
[ModuleOutput] "ColumnTypes":System.Int32,3,System.Double,5,System.String,1
[ModuleOutput] }
Farther down the page is more detailed information on the columns, which will look something like the following.
These results are mostly as expected, with 228 observations and 9 columns in the dataframe. We can see the
column names, the R data type and a sample of each column.
NOTE
This same printed output is conveniently available from the R Device output of the Execute R Script module. We will discuss
the outputs of the Execute R Script module in the next section.
Dataset2
The behavior of the Dataset2 input is identical to that of Dataset1. Using this input you can pass a second
rectangular table of data into your R code. The function call maml.mapInputPort(2) reads this data from
port 2.
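Here is a small sketch of reading both dataset ports. The maml.mapInputPort() function exists only inside Machine Learning Studio, so the sketch defines a stand-in with made-up sample tables when the real function is not found. The stand-in, the sample columns, and the variable names dataset1 and dataset2 are assumptions for illustration only.

```r
## Stand-in so the sketch runs outside Studio. Inside Studio the real
## maml.mapInputPort() is already defined and this stub is skipped.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort <- function(port) {
    ## Hypothetical sample tables, one per input port
    switch(port,
           data.frame(id = 1:3, value = c(10.5, 11.2, 9.8)),
           data.frame(id = 1:3, label = c("a", "b", "c")))
  }
}

dataset1 <- maml.mapInputPort(1)  # table connected to the Dataset1 input
dataset2 <- maml.mapInputPort(2)  # table connected to the Dataset2 input

## Print the structure of each table, as it would appear in output.log
str(dataset1)
str(dataset2)
```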
Execute R Script outputs
Output a dataframe
You can output the contents of an R dataframe as a rectangular table through the Result Dataset1 port by using the
maml.mapOutputPort() function. In our simple R script this is performed by the following line.
maml.mapOutputPort('cadairydata')
After running the experiment, select the Result Dataset1 output port and then select Visualize. You should see
something like Figure 6.
Looking at the data types of the columns we input in the previous section: all columns are of type numeric, except
for the column labeled 'Month', which is of type character. Let's convert this to a factor and test the results.
I have deleted the line that created the scatterplot matrix and added a line converting the 'Month' column to a
factor. In my experiment I will just cut and paste the R code into the code window of the Execute R Script Module.
You could also update the zip file and upload it to Azure Machine Learning Studio, but this takes several steps.
Let's execute this code and look at the output log for the R script. The relevant data from the log is shown in Figure
9.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 9 variables:
[ModuleOutput]
[ModuleOutput] $ Column 0 : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year.Month : num 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 14 levels "Apr","April",..: 6 5 9 1 11 8 7 3 14 13 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"
Rerun the experiment and view the output log. The expected results are shown in Figure 10.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 9 variables:
[ModuleOutput]
[ModuleOutput] $ Column 0 : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year.Month : num 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"
Figure 10. Summary of the dataframe with correct number of factor levels.
Our factor variable now has the desired 12 levels.
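The code that cleaned up the factor is not reproduced here. As a rough sketch of one way to get from 14 levels to 12, assume the extra levels come from months that are sometimes spelled out in full (the log above shows both "Apr" and "April"). Truncating every label to three characters before converting then merges the duplicates. The vector months.raw is a made-up sample, not the real column.

```r
## Hypothetical sample of month labels with inconsistent spellings
months.raw <- c("Jan", "Feb", "Mar", "March", "Apr", "April", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

## Converting directly gives one level per distinct spelling
month.factor.raw <- as.factor(months.raw)
nlevels(month.factor.raw)    # 14

## Keep only the first three characters before converting
month.factor <- as.factor(substr(months.raw, 1, 3))
nlevels(month.factor)        # 12
levels(month.factor)
```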
Basic data frame filtering
R dataframes support powerful filtering capabilities. Datasets can be subset by using logical filters on either
rows or columns. In many cases, complex filter criteria will be required. The references in Further reading below
contain extensive examples of filtering dataframes.
There is one bit of filtering we should do on our dataset. If you look at the columns in the cadairydata dataframe,
you will see two unnecessary columns. The first column just holds a row number, which is not very useful. The
second column, Year.Month, contains redundant information. We can easily exclude these columns by using the
following R code.
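One way to drop columns by position is to use negative indices, sketched here on a small stand-in dataframe. The stand-in values are invented; only the column order (row number first, Year.Month second) follows the description above.

```r
## Hypothetical stand-in with the same leading columns as cadairydata
cadairydata <- data.frame(
  Column.0     = 1:3,
  Year.Month   = c(1995.01, 1995.02, 1995.03),
  Month.Number = 1:3,
  Year         = rep(1995L, 3),
  Milk.Prod    = c(2.11, 1.93, 2.16)
)

## Negative indices drop columns by position: remove the first two
cadairydata <- cadairydata[, c(-1, -2)]
str(cadairydata)
```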
NOTE
From now on in this section, I will just show you the additional code I am adding in the Execute R Script module. I will add
each new line before the str() function. I use this function to verify my results in Azure Machine Learning Studio.
Run this code in your experiment and check the result from the output log. These results are shown in Figure 11.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 7 variables:
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"
## Compute the number of months from the start of the time series
12 * (Year - min.year) + Month - 1
}
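The lines above are the tail of a helper that computes the month count. A minimal self-contained sketch of such a helper follows. The function name month.count and the sample vectors are assumptions, since the start of the original function is not shown here.

```r
## Hypothetical complete version of the helper whose last lines appear above
month.count <- function(Year, Month) {
  ## Find the first year in the series
  min.year <- min(Year)

  ## Compute the number of months from the start of the time series
  12 * (Year - min.year) + Month - 1
}

## Example: January 1995 through February 1996
Year  <- c(rep(1995, 12), 1996, 1996)
Month <- c(1:12, 1, 2)
month.count(Year, Month)   # 0 1 2 ... 13
```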
Now run the updated experiment and use the output log to view the results. These results are shown in Figure 12.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 8 variables:
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...
[ModuleOutput]
[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"
There is quite a bit happening in the log.transform() function. Most of this code is checking for potential problems
with the arguments or dealing with exceptions, which can still arise during the computations. Only a few lines of
this code actually do the computations.
The goal of defensive programming is to keep the failure of a single function from halting all further
processing. An abrupt failure of a long-running analysis can be quite frustrating for users. To avoid this situation,
default return values must be chosen that will limit damage to downstream processing. A message is also
produced to alert users that something has gone wrong.
If you are not used to defensive programming in R, all this code may seem a bit overwhelming. I will walk you
through the major steps:
1. A vector of four messages is defined. These messages are used to communicate information about some of the
possible errors and exceptions that can occur with this code.
2. I return a value of NA for each case. There are many other possibilities that might have fewer side effects. I
could return a vector of zeroes, or the original input vector, for example.
3. Checks are run on the arguments to the function. In each case, if an error is detected, a default value is returned
and a message is produced by the warning() function. I am using warning() rather than stop() as the latter
will terminate execution, exactly what I am trying to avoid. Note that I have written this code in a procedural
style, as in this case a functional approach seemed complex and obscure.
4. The log computations are wrapped in tryCatch() so that exceptions will not cause an abrupt halt to processing.
Without tryCatch() most errors raised by R functions result in a stop signal, which does just that.
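The steps above can be sketched in a much smaller function. This is not the log.transform() code itself; the name safe.log, the scale argument, and the message texts are invented for illustration. It follows the same pattern: a vector of four messages, argument checks that call warning() and return NA, and a computation wrapped in tryCatch().

```r
## Hypothetical defensive log transform; not the original log.transform()
safe.log <- function(x, scale = 1) {
  messages <- c("ERROR: input is not numeric",
                "ERROR: scale must be a single positive number",
                "ERROR: input contains non-positive values",
                "ERROR: unexpected failure while computing the logarithm")

  ## Argument checks: warn and return a default instead of stopping
  if (!is.numeric(x)) { warning(messages[1]); return(NA) }
  if (!is.numeric(scale) || length(scale) != 1 || scale <= 0) {
    warning(messages[2]); return(NA)
  }
  if (any(x <= 0, na.rm = TRUE)) { warning(messages[3]); return(NA) }

  ## The computation itself, wrapped so an error cannot halt processing
  tryCatch(log(scale * x),
           error = function(e) { warning(messages[4]); NA })
}

safe.log(c(1, 10, 100))                       # natural logs of the inputs
suppressWarnings(safe.log("not a number"))    # NA instead of an error
```

Because every failure path returns NA and raises only a warning, a call such as lapply() over several columns keeps going even if one column is bad.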
Execute this R code in your experiment and have a look at the printed output in the output.log file. You will now see
the transformed values of the four columns in the log, as shown in Figure 13.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 8 variables:
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...
[ModuleOutput]
[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"
Now, run the experiment. The log of the new Execute R Script shape should look like Figure 14.
Run this code and see what happens. The plot produced at the R Device port should look like Figure 16.
# The input arguments are not of the same length, return ts and quit
if(length(Time) != length(ts)) {warning(messages[1]); return(ts)}
ts
}
## Apply the ts.detrend function to the variables of interest
df.detrend <- data.frame(lapply(cadairydata[, 4:7], ts.detrend, cadairydata$Time))
There is quite a bit happening in the ts.detrend() function. Most of this code is checking for potential problems
with the arguments or dealing with exceptions, which can still arise during the computations. Only a few lines of
this code actually do the computations.
We have already discussed an example of defensive programming in Value transformations. Both computation
blocks are wrapped in tryCatch() . For some errors it makes sense to return the original input vector, and in other
cases, I return a vector of zeros.
Note that the linear regression used for de-trending is a time series regression. The predictor variable is a time
series object.
Once ts.detrend() is defined we apply it to the variables of interest in our dataframe. We must coerce the
resulting list created by lapply() to a dataframe by using data.frame(). Because of the defensive aspects of
ts.detrend(), failure to process one of the variables will not prevent correct processing of the others.
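To make the pattern concrete, here is a stripped-down sketch of a detrending helper applied with lapply(). It is not the ts.detrend() code; the name detrend.sketch, the messages, and the synthetic data are assumptions. It keeps the same defensive ideas: return the input unchanged if the lengths differ, and return zeros if the regression fails.

```r
## Hypothetical detrending helper in the spirit of ts.detrend()
detrend.sketch <- function(ts, Time) {
  messages <- c("ERROR: arguments are not of the same length",
                "ERROR: detrending regression failed")

  ## The input arguments are not of the same length: return ts unchanged
  if (length(Time) != length(ts)) { warning(messages[1]); return(ts) }

  ## Fit a straight line against time and return the residuals
  tryCatch({
    fit <- lm(ts ~ as.numeric(Time))
    as.numeric(residuals(fit))
  }, error = function(e) { warning(messages[2]); rep(0, length(ts)) })
}

## Hypothetical data: two series with linear trends plus noise
set.seed(1)
Time <- seq(as.POSIXct("1995-01-01", tz = "UTC"), by = "month", length.out = 24)
df <- data.frame(a = 0.5 * (1:24) + rnorm(24), b = -0.2 * (1:24) + rnorm(24))

## Apply the helper to every column and coerce the list back to a dataframe
df.detrend <- data.frame(lapply(df, detrend.sketch, Time))
colMeans(df.detrend)   # residual means, close to zero
```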
The final line of code creates a pairwise scatterplot. After running the R code, the results of the scatterplot are
shown in Figure 17.
cadairycorrelations
Figure 18. List of ccf objects from the pairwise correlation analysis.
There is a correlation value for each lag. None of these correlation values are large enough to be significant. We
can therefore conclude that we can model each variable independently.
Output a dataframe
We have computed the pairwise correlations as a list of R ccf objects. This presents a bit of a problem as the Result
Dataset output port really requires a dataframe. Further, the ccf object is itself a list and we want only the values in
the first element of this list, the correlations at the various lags.
The following code extracts the lag values from the list of ccf objects, which are themselves lists.
df.correlations <- data.frame(do.call(rbind, lapply(cadairycorrelations, '[[', 1)))
c.names <- c("correlation pair", "-1 lag", "0 lag", "+1 lag")
r.names <- c("Corr Cot Cheese - Ice Cream",
             "Corr Cot Cheese - Milk Prod",
             "Corr Cot Cheese - Fat Price",
             "Corr Ice Cream - Milk Prod",
             "Corr Ice Cream - Fat Price",
             "Corr Milk Prod - Fat Price")
## WARNING!
## The following line works only in Azure Machine Learning Studio
## When running in RStudio, this code will result in an error
#maml.mapOutputPort('outframe')
The first line of code is a bit tricky, and some explanation may help you understand it. Working from the inside out
we have the following:
1. The '[[' operator with the argument '1' selects the vector of correlations at the lags from the first element of the
ccf object list.
2. The do.call() function applies the rbind() function over the elements of the list returned by lapply() .
3. The data.frame() function coerces the result produced by do.call() to a dataframe.
Note that the row names are in a column of the dataframe. Doing so preserves the row names when they are
output from the Execute R Script.
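The same extraction can be tried on a small example outside Studio. The sketch below uses random series in place of the dairy data, and it selects the first element of each ccf object with an explicit helper that also flattens it with as.vector(), which is a slight variation on passing '[[' to lapply(). The names ccf.list, corr.rows, and df.corr are made up for this example.

```r
## Hypothetical series standing in for the detrended dairy columns
set.seed(42)
x <- rnorm(50); y <- rnorm(50); z <- rnorm(50)

## A list of ccf objects, restricted to lags -1, 0, and +1
ccf.list <- list(ccf(x, y, lag.max = 1, plot = FALSE),
                 ccf(x, z, lag.max = 1, plot = FALSE))

## Pull the correlations (first element of each ccf object) out as plain
## vectors, stack them row-wise, and coerce the result to a dataframe
corr.rows <- lapply(ccf.list, function(obj) as.vector(obj[[1]]))
df.corr <- data.frame(do.call(rbind, corr.rows))

## Keep the row labels in an ordinary column so they survive the output port
df.corr <- data.frame(pair = c("x - y", "x - z"), df.corr)
names(df.corr) <- c("correlation pair", "-1 lag", "0 lag", "+1 lag")
print(df.corr)
```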
Running the code produces the output shown in Figure 19 when I Visualize the output at the Result Dataset port.
The row names are in the first column, as intended.
Figure 20. The experiment with the new Execute R Script module added.
As with the correlation analysis we just completed, we need to add a column with a POSIXct time series object. The
following code will do just this.
# If running in Machine Learning Studio, uncomment the first line with maml.mapInputPort()
cadairydata <- maml.mapInputPort(1)
str(cadairydata)
Run this code and look at the log. The result should look like Figure 21.
[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 9 variables:
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...
[ModuleOutput]
[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...
[ModuleOutput]
[ModuleOutput] $ Time : POSIXct, format: "1995-01-01" "1995-02-01" ...
Running the code produces the series of time series plots from the R Device output shown in Figure 22. Note that
the time axis is in units of dates, a nice benefit of the time series plot method.
Figure 22. Time series plots of California dairy production and price data.
A trend model
Having created a time series object and having had a look at the data, let's start to construct a trend model for the
California milk production data. We can do this with a time series regression. However, it is clear from the plot that
we will need more than a slope and intercept to accurately model the observed trend in the training data.
Given the small scale of the data, I will build the model for trend in RStudio and then cut and paste the resulting
model into Azure Machine Learning Studio. RStudio provides an interactive environment that suits this type of
exploratory analysis.
As a first attempt, I will try a polynomial regression with powers up to 3. There is a real danger of over-fitting these
kinds of models. Therefore, it is best to avoid high order terms. The I() function inhibits interpretation of the
contents (interprets the contents 'as is') and allows you to write a literally interpreted function in a regression
equation.
##
## Call:
## lm(formula = Milk.Prod ~ Time + I(Month.Count^2) + I(Month.Count^3),
## data = cadairytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12667 -0.02730 0.00236 0.02943 0.10586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.33e+00 1.45e-01 43.60 <2e-16 ***
## Time 1.63e-09 1.72e-10 9.47 <2e-16 ***
## I(Month.Count^2) -1.71e-06 4.89e-06 -0.35 0.726
## I(Month.Count^3) -3.24e-08 1.49e-08 -2.17 0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0418 on 212 degrees of freedom
## Multiple R-squared: 0.941, Adjusted R-squared: 0.94
## F-statistic: 1.12e+03 on 3 and 212 DF, p-value: <2e-16
From P values ( Pr(>|t|) ) in this output, we can see that the squared term may not be significant. I will use the
update() function to modify this model by dropping the squared term.
##
## Call:
## lm(formula = Milk.Prod ~ Time + I(Month.Count^3), data = cadairytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12597 -0.02659 0.00185 0.02963 0.10696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.38e+00 4.07e-02 156.6 <2e-16 ***
## Time 1.57e-09 4.32e-11 36.3 <2e-16 ***
## I(Month.Count^3) -3.76e-08 2.50e-09 -15.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0417 on 213 degrees of freedom
## Multiple R-squared: 0.941, Adjusted R-squared: 0.94
## F-statistic: 1.69e+03 on 2 and 213 DF, p-value: <2e-16
This looks better. All of the terms are significant. However, the <2e-16 value is simply the smallest p-value that R
prints by default, so it should not be taken too literally.
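The workflow just described (fit a polynomial with I(), inspect the terms, drop one with update()) can be sketched on synthetic data. The dataframe train, its column y, and the coefficients used to generate it are invented; the sketch also uses Month.Count as the linear term rather than the POSIXct Time column, to keep it self-contained.

```r
## Hypothetical training data: a gently curving trend plus noise
set.seed(123)
train <- data.frame(Month.Count = 0:99)
train$y <- 6 + 0.01 * train$Month.Count -
  4e-7 * train$Month.Count^3 + rnorm(100, sd = 0.04)

## I() makes ^ mean arithmetic power rather than a formula operator
fit.full <- lm(y ~ Month.Count + I(Month.Count^2) + I(Month.Count^3), data = train)

## update() refits the model with the squared term removed
fit.reduced <- update(fit.full, . ~ . - I(Month.Count^2))

## Coefficient table for the reduced model
summary(fit.reduced)$coefficients
```

In the update formula, each dot stands for the corresponding side of the original formula, so only the change needs to be written out.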
As a sanity test, let's make a time series plot of the California dairy production data with the trend curve shown. I
have added the following code in the Azure Machine Learning Studio Execute R Script module (not RStudio) to
create the model and make a plot. The result is shown in Figure 23.
plot(cadairytrain$Time, cadairytrain$Milk.Prod, xlab = "Time", ylab = "Log CA Milk Production 1000s lb", type = "l")
lines(cadairytrain$Time, predict(milk.lm, cadairytrain), lty = 2, col = 2)
Figure 23. California milk production data with trend model shown.
It looks like the trend model fits the data fairly well. Further, there does not seem to be evidence of over-fitting,
such as odd wiggles in the model curve.
Seasonal model
With a trend model in hand, we need to push on and include the seasonal effects. We will use the month of the year
as a dummy variable in the linear model to capture the month-by-month effect. Note that when you introduce
factor variables into a model, the intercept must not be computed. If you do not do this, the formula is over-
specified and R will drop one of the desired factors but keep the intercept term.
Since we have a satisfactory trend model we can use the update() function to add the new terms to the existing
model. The -1 in the update formula drops the intercept term. Continuing in RStudio for the moment:
We see that the model no longer has an intercept term and has 12 significant month factors. This is exactly what we
wanted to see.
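Here is a small sketch of that update step on synthetic monthly data. The dataframe train, the column y, and the model names trend.lm and seasonal.lm are assumptions made for this example. Adding the Month factor and writing -1 in the same update formula removes the intercept, so each of the 12 months gets its own coefficient.

```r
## Hypothetical monthly data: linear trend plus a month-specific offset
set.seed(7)
n <- 48
train <- data.frame(Month.Count = 0:(n - 1),
                    Month = factor(rep(month.abb, length.out = n), levels = month.abb))
train$y <- 6 + 0.01 * train$Month.Count +
  0.05 * sin(2 * pi * as.integer(train$Month) / 12) + rnorm(n, sd = 0.01)

## Trend-only model
trend.lm <- lm(y ~ Month.Count, data = train)

## Add the month factor and drop the intercept in one step
seasonal.lm <- update(trend.lm, . ~ . + Month - 1)

## No "(Intercept)" entry, and one coefficient per month
names(coef(seasonal.lm))
```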
Let's make another time series plot of the California dairy production data to see how well the seasonal model is
working. I have added the following code in the Azure Machine Learning Studio Execute R Script to create the
model and make a plot.
plot(cadairytrain$Time, cadairytrain$Milk.Prod, xlab = "Time", ylab = "Log CA Milk Production 1000s lb", type = "l")
lines(cadairytrain$Time, predict(milk.lm2, cadairytrain), lty = 2, col = 2)
Running this code in Azure Machine Learning Studio produces the plot shown in Figure 24.
Figure 24. California milk production with model including seasonal effects.
The fit to the data shown in Figure 24 is rather encouraging. Both the trend and the seasonal effect (monthly
variation) look reasonable.
As another check on our model, let's have a look at the residuals. The following code computes the predicted values
from our two models, computes the residuals for the seasonal model, and then plots these residuals for the training
data.
messages <- c("ERROR: Input arguments to function RMS.error of wrong type encountered",
              "ERROR: Input vector to function RMS.error is too short",
              "ERROR: Input vectors to function RMS.error must be of same length",
              "WARNING: Function RMS.error has received invalid input time series.")
if((length(series1) != length(series2))) {
warning(messages[3])
return(NA)}
As with the log.transform() function we discussed in the "Value transformations" section, there is quite a lot of
error checking and exception recovery code in this function. The principles employed are the same. The work is
done in two places wrapped in tryCatch() . First, the time series are exponentiated, since we have been working
with the logs of the values. Second, the actual RMS error is computed.
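A compact sketch of such a function is shown below. It is not the RMS.error() code; the name rms.sketch, the is.log argument, and the messages are invented, and the two computation steps are placed in a single tryCatch() for brevity.

```r
## Hypothetical RMS error helper for two series stored as logs
rms.sketch <- function(series1, series2, is.log = TRUE) {
  messages <- c("ERROR: arguments must be numeric",
                "ERROR: input vectors must be of the same length",
                "WARNING: computation failed on the input series")

  ## Argument checks: warn and return NA instead of stopping
  if (!is.numeric(series1) || !is.numeric(series2)) {
    warning(messages[1]); return(NA)
  }
  if (length(series1) != length(series2)) {
    warning(messages[2]); return(NA)
  }

  tryCatch({
    ## Undo the log transform so the error is in the original units
    if (is.log) { series1 <- exp(series1); series2 <- exp(series2) }
    ## Root of the mean squared difference
    sqrt(mean((series1 - series2)^2))
  }, error = function(e) { warning(messages[3]); NA })
}

rms.sketch(log(c(1, 2, 3)), log(c(1, 2, 5)))   # sqrt(4/3), about 1.155
```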
Equipped with a function to measure the RMS error, let's build and output a dataframe containing the RMS errors.
We will include terms for the trend model alone and the complete model with seasonal factors. The following code
does the job by using the two linear models we have constructed.
## Compute the RMS error in a dataframe
## Include the row names in the first column so they will
## appear in the output of the Execute R Script
RMS.df <- data.frame(
rowNames = c("Trend Model", "Seasonal Model"),
Training = c(
RMS.error(predict1[1:216], cadairydata$Milk.Prod[1:216]),
RMS.error(predict2[1:216], cadairydata$Milk.Prod[1:216])),
Forecast = c(
RMS.error(predict1[217:228], cadairydata$Milk.Prod[217:228]),
RMS.error(predict2[217:228], cadairydata$Milk.Prod[217:228]))
)
RMS.df
Running this code produces the output shown in Figure 27 at the Result Dataset output port.
Further reading
This R programming tutorial covers the basics of what you need to use the R language with Azure Machine
Learning Studio. If you are not familiar with R, two introductions are available on CRAN:
R for Beginners by Emmanuel Paradis is a good place to start.
An Introduction to R by W. N. Venables et al. goes into a bit more depth.
There are many books on R that can help you get started. Here are a few I find useful:
The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff is an excellent
introduction to programming in R.
R Cookbook by Paul Teetor provides a problem and solution approach to using R.
R in Action by Robert Kabacoff is another useful introductory book. The companion Quick R website is a
useful resource.
R Inferno by Patrick Burns is a surprisingly humorous book that deals with a number of tricky and difficult
topics that can be encountered when programming in R. The book is available for free at The R Inferno.
If you want a deep dive into advanced topics in R, have a look at the book Advanced R by Hadley Wickham.
The online version of this book is available for free at http://adv-r.had.co.nz/.
A catalog of R time series packages can be found in CRAN Task View: Time Series Analysis. For information on
specific time series object packages, you should refer to the documentation for that package.
The book Introductory Time Series with R by Paul Cowpertwait and Andrew Metcalfe provides an introduction
to using R for time series analysis. Many more theoretical texts provide R examples.
Here are some great internet resources:
DataCamp teaches R in the comfort of your browser with video lessons and coding exercises. There are
interactive tutorials on the latest R techniques and packages. Take the free interactive R tutorial.
Learn R Programming, The Definitive Guide from Programiz.
A quick R Tutorial by Kelly Black from Clarkson University.
There are over 60 R resources listed at Top R language resources to improve your data skills.
Azure Machine Learning Studio: Extend your
experiment with R
3/15/2019
You can extend the functionality of Azure Machine Learning Studio through the R language by using the Execute R
Script module.
This module accepts multiple input datasets and yields a single dataset as output. You can type an R script into the
R Script parameter of the Execute R Script module.
You access each input port of the module by using code similar to the following:
This sends the list of packages to the output port of the Execute R Script module. To view the package list, connect
a conversion module such as Convert to CSV to the left output of the Execute R Script module, run the experiment,
then click the output of the conversion module and select Download.
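As a rough sketch, a package list can be built as a dataframe with the standard installed.packages() function. The variable name data.set is an assumption. The call to maml.mapOutputPort() is left as a comment because that function is available only inside the Execute R Script module.

```r
## Build a dataframe describing the packages installed in this R session
data.set <- data.frame(installed.packages())

## Look at a few of the available columns
head(data.set[, c("Package", "Version")])

## In Studio, send the table to the output port (not available elsewhere):
# maml.mapOutputPort("data.set")
```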
Importing packages
You can import packages that are not already installed by using the following commands in the Execute R Script
module:
install.packages("src/my_favorite_package.zip", lib = ".", repos = NULL, verbose = TRUE)
success <- library("my_favorite_package", lib.loc = ".", logical.return = TRUE, verbose = TRUE)
This topic describes how to author and deploy a custom R module in Azure Machine Learning Studio. It explains what custom R modules are and what
files are used to define them. It illustrates how to construct the files that define a module and how to register the
module for deployment in a Machine Learning workspace. The elements and attributes used in the definition of the
custom module are then described in more detail. How to use auxiliary functionality and files and multiple outputs
is also discussed.
<!-- Specify the base language, script file and R function to use for this module. -->
<Language name="R"
sourceFile="CustomAddRows.R"
entryPoint="CustomAddRows" />
It is critical to note that the value of the id attributes of the Input and Arg elements in the XML file must match the
function parameter names of the R code in the CustomAddRows.R file EXACTLY: (dataset1, dataset2, and swap in
the example). Similarly, the value of the entryPoint attribute of the Language element must match the name of
the function in the R script EXACTLY: (CustomAddRows in the example).
In contrast, the id attribute for the Output element does not correspond to any variables in the R script. When
more than one output is required, simply return a list from the R function with results placed in the same order as
the Output elements are declared in the XML file.
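The naming rules can be illustrated with a short sketch of an entry-point function. The function name CustomAddRows and the parameter names dataset1, dataset2, and swap are the ones named in the text. The function body, the default value of swap, and the two-output variant CustomAddRowsTwoOutputs are assumptions added for illustration.

```r
## Sketch of an entry-point function; parameter names match the ids in the text
CustomAddRows <- function(dataset1, dataset2, swap = FALSE) {
  ## Append the rows of one table to the other, optionally reversing the order
  if (swap) rbind(dataset2, dataset1) else rbind(dataset1, dataset2)
}

## Hypothetical variant with two outputs: return a list in the same order
## as the Output elements are declared in the XML file
CustomAddRowsTwoOutputs <- function(dataset1, dataset2, swap = FALSE) {
  combined <- CustomAddRows(dataset1, dataset2, swap)
  counts <- data.frame(rows = nrow(combined))
  list(combined, counts)
}

## Quick local check with made-up tables
a <- data.frame(x = 1:2)
b <- data.frame(x = 3:4)
CustomAddRows(a, b, swap = TRUE)$x   # 3 4 1 2
```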
Package and register the module
Save these two files as CustomAddRows.R and CustomAddRows.xml and then zip the two files together into a
CustomAddRows.zip file.
To register them in your Machine Learning workspace, go to your workspace in Machine Learning Studio, click
the +NEW button on the bottom and choose MODULE -> FROM ZIP PACKAGE to upload the new Custom
Add Rows module.
The Custom Add Rows module is now ready to be accessed by your Machine Learning experiments.
Within the Module element, you can specify two additional optional elements:
an Owner element that is embedded into the module
a Description element that contains text that is displayed in quick help for the module and when you hover
over the module in the Machine Learning UI.
Rules for character limits in the Module elements:
The value of the name attribute in the Module element must not exceed 64 characters in length.
The content of the Description element must not exceed 128 characters in length.
The content of the Owner element must not exceed 32 characters in length.
A module's results can be deterministic or nondeterministic. By default, all modules are considered to be
deterministic. That is, given an unchanging set of input parameters and data, the module should return the same
results each time it is run. Given this behavior, Azure Machine Learning Studio only reruns
modules marked as deterministic if a parameter or the input data has changed. Returning the cached results also
provides much faster execution of experiments.
There are functions that are nondeterministic, such as RAND or a function that returns the current date or time. If
your module uses a nondeterministic function, you can specify that the module is non-deterministic by setting the
optional isDeterministic attribute to FALSE. This ensures that the module is rerun whenever the experiment is
run, even if the module input and parameters have not changed.
Language Definition
The Language element in your XML definition file is used to specify the custom module language. Currently, R is
the only supported language. The value of the sourceFile attribute must be the name of the R file that contains the
function to call when the module is run. This file must be part of the zip package. The value of the entryPoint
attribute is the name of the function being called and must match a valid function defined within the source file.
Ports
The input and output ports for a custom module are specified in child elements of the Ports section of the XML
definition file. The order of these elements determines the layout that users see (UX). The first child input or
output listed in the Ports element of the XML file becomes the left-most input or output port in the Machine Learning UX.
Each input and output port may have an optional Description child element that specifies the text shown when
you hover the mouse cursor over the port in the Machine Learning UI.
Ports Rules:
Maximum number of input and output ports is 8 for each.
Input elements
Input ports allow you to pass data to your R function and workspace. The data types that are supported for input
ports are as follows:
DataTable: This type is passed to your R function as a data.frame. In fact, any types (for example, CSV files or
ARFF files) that are supported by Machine Learning and that are compatible with DataTable are converted to a
data.frame automatically.
The id attribute associated with each DataTable input port must have a unique value and this value must match its
corresponding named parameter in your R function. Optional DataTable ports that are not passed as input in an
experiment have the value NULL passed to the R function and optional zip ports are ignored if the input is not
connected. The isOptional attribute is optional for both the DataTable and Zip types and is false by default.
Zip: Custom modules can accept a zip file as input. This input is unpacked into the R working directory of your
function.
For custom R modules, the ID for a Zip port does not have to match any parameters of the R function. This is
because the zip file is automatically extracted to the R working directory.
Input Rules:
The value of the id attribute of the Input element must be a valid R variable name.
The value of the id attribute of the Input element must not be longer than 64 characters.
The value of the name attribute of the Input element must not be longer than 64 characters.
The content of the Description element must not be longer than 128 characters.
The value of the type attribute of the Input element must be Zip or DataTable.
The value of the isOptional attribute of the Input element is not required (and is false by default when not
specified); but if it is specified, it must be true or false.
Output elements
Standard output ports: Output ports are mapped to the return values from your R function, which can then be
used by subsequent modules. DataTable is the only standard output port type supported currently. (Support for
Learners and Transforms is forthcoming.) A DataTable output is defined as:
For outputs in custom R modules, the value of the id attribute does not have to correspond with anything in the R
script, but it must be unique. For a single module output, the return value from the R function must be a
data.frame. In order to output more than one object of a supported data type, the appropriate output ports need to
be specified in the XML definition file and the objects need to be returned as a list. The output objects are assigned
to output ports from left to right, reflecting the order in which the objects are placed in the returned list.
For example, if you want to modify the Custom Add Rows module to output the original two datasets, dataset1
and dataset2, in addition to the new joined dataset, dataset, (in an order, from left to right, as: dataset, dataset1,
dataset2), then define the output ports in the CustomAddRows.xml file as follows:
<Ports>
    <Output id="dataset" name="Dataset Out" type="DataTable">
        <Description>New Dataset</Description>
    </Output>
    <Output id="dataset1_out" name="Dataset 1 Out" type="DataTable">
        <Description>First Dataset</Description>
    </Output>
    <Output id="dataset2_out" name="Dataset 2 Out" type="DataTable">
        <Description>Second Dataset</Description>
    </Output>
    <Input id="dataset1" name="Dataset 1" type="DataTable">
        <Description>First Input Table</Description>
    </Input>
    <Input id="dataset2" name="Dataset 2" type="DataTable">
        <Description>Second Input Table</Description>
    </Input>
</Ports>
Then return the objects as a list, in the correct order, in ‘CustomAddRows.R’:
Visualization output: You can also specify an output port of type Visualization, which displays the output from
the R graphics device and console output. This port is not part of the R function output and does not interfere with
the order of the other output port types. To add a visualization port to the custom modules, add an Output
element with a value of Visualization for its type attribute:
Output Rules:
The value of the id attribute of the Output element must be a valid R variable name.
The value of the id attribute of the Output element must not be longer than 32 characters.
The value of the name attribute of the Output element must not be longer than 64 characters.
The value of the type attribute of the Output element must be Visualization.
Arguments
Additional data can be passed to the R function via module parameters which are defined in the Arguments
element. These parameters appear in the rightmost properties pane of the Machine Learning UI when the module
is selected. Arguments can be any of the supported types or you can create a custom enumeration when needed.
Similar to the Ports elements, Arguments elements can have an optional Description element that specifies the
text that appears when you hover the mouse over the parameter name. Optional properties for a module, such as
defaultValue, minValue, and maxValue can be added to any argument as attributes to a Properties element. Valid
properties for the Properties element depend on the argument type and are described with the supported
argument types in the next section. Arguments with the isOptional property set to "true" do not require the user
to enter a value. If a value is not provided to the argument, then the argument is not passed to the entry point
function. Arguments of the entry point function that are optional need to be explicitly handled by the function, e.g.
assigned a default value of NULL in the entry point function definition. An optional argument will only enforce the
other argument constraints, i.e. min or max, if a value is provided by the user. As with inputs and outputs, it is
critical that each of the parameters have unique ID values associated with them. In our quickstart example the
associated id/parameter was swap.
Arg element
A module parameter is defined using the Arg child element of the Arguments section of the XML definition file.
As with the child elements in the Ports section, the ordering of parameters in the Arguments section defines the
layout encountered in the UX. The parameters appear from top down in the UI in the same order in which they are
defined in the XML file. The types supported by Machine Learning for parameters are listed here.
int – an Integer (32-bit) type parameter.
ColumnPicker: a column selection parameter.
Required Properties: portId - matches the ID of an Input element with type DataTable.
Optional Properties:
allowedTypes - Filters the column types from which you can pick. Valid values include:
Numeric
Boolean
Categorical
String
Label
Feature
Score
All
default - Valid default selections for the column picker include:
None
NumericFeature
NumericLabel
NumericScore
NumericAll
BooleanFeature
BooleanLabel
BooleanScore
BooleanAll
CategoricalFeature
CategoricalLabel
CategoricalScore
CategoricalAll
StringFeature
StringLabel
StringScore
StringAll
AllLabel
AllFeature
AllScore
All
DropDown: a user-specified enumerated (dropdown) list. The dropdown items are specified within the Properties
element using an Item element. The id for each Item must be unique and a valid R variable name. The value of the
name of an Item serves as both the text that you see and the value that is passed to the R function.
Optional Properties:
default - The value for the default property must correspond with an ID value from one of the Item
elements.
Auxiliary Files
Any file that is placed in your custom module ZIP file is available for use at execution time. Any
directory structures present are preserved. This means that file sourcing works the same locally and in Azure
Machine Learning Studio execution.
NOTE
Notice that all files are extracted to the ‘src’ directory, so all paths should have the ‘src/’ prefix.
For example, say you want to remove any rows with NAs from the dataset, and also remove any duplicate rows,
before outputting it into CustomAddRows, and you’ve already written an R function that does that in a file
RemoveDupNARows.R:
You can source the auxiliary file RemoveDupNARows.R in the CustomAddRows function:
Execution Environment
The execution environment for the R script uses the same version of R as the Execute R Script module and can
use the same default packages. You can also add additional R packages to your custom module by including them
in the custom module zip package. Just load them in your R script as you would in your own R environment.
Limitations of the execution environment include:
Non-persistent file system: Files written when the custom module is run are not persisted across multiple runs
of the same module.
No network access
Execute Python machine learning scripts in Azure
Machine Learning Studio
3/15/2019
Python is a valuable tool in the tool chest of many data scientists. It's used in every stage of typical machine
learning workflows including data exploration, feature extraction, model training and validation, and deployment.
This article describes how you can use the Execute Python Script module to use Python code in your Azure
Machine Learning Studio experiments and web services.
Input parameters
Inputs to the Python module are exposed as Pandas DataFrames. The azureml_main function accepts up to two
optional Pandas DataFrames as parameters.
The mapping between input ports and function parameters is positional:
The first connected input port is mapped to the first parameter of the function.
The second input (if connected) is mapped to the second parameter of the function.
The third input is used to import additional Python modules.
More detailed semantics of how the input ports get mapped to parameters of the azureml_main function are
shown below.
Output return values
The azureml_main function must return a single Pandas DataFrame packaged in a Python sequence such as a tuple,
list, or NumPy array. The first element of this sequence is returned to the first output port of the module. The
second output port of the module is used for visualizations and does not require a return value. This scheme is
shown below.
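The input and output conventions described above can be sketched as follows. This is a minimal illustration, not the module's own template: the parameter names dataframe1 and dataframe2 and the row-concatenation logic are assumptions chosen for the example, and the sample frames stand in for whatever is connected to the first two input ports.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point sketch: each parameter receives one input port, by position."""
    # dataframe1 comes from the first input port, dataframe2 from the second.
    # A port that is not connected leaves its parameter as None.
    frames = [df for df in (dataframe1, dataframe2) if df is not None]
    if frames:
        result = pd.concat(frames, ignore_index=True)
    else:
        result = pd.DataFrame()
    # Return a sequence; its first element is sent to the first output port.
    return (result,)


# Local stand-ins for the data connected to the two input ports.
first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"x": [3]})
outputs = azureml_main(first, second)
```

Because both parameters default to None, the same function also runs when only the first port is connected, for example azureml_main(first).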
Duplicate column names - Add a numeric suffix: (1), (2), (3), and so on.
*All input data frames in the Python function always have a 64-bit numerical index from 0 to the number of rows
minus 1.
Importing existing Python script modules
The backend used to execute Python is based on Anaconda, a widely used scientific Python distribution. It comes
with close to 200 of the most common Python packages used in data-centric workloads. Studio does not currently
support the use of package management systems like Pip or Conda to install and manage external libraries. If you
find the need to incorporate additional libraries, use the following scenario as a guide.
A common use-case is to incorporate existing Python scripts into Studio experiments. The Execute Python Script
module accepts a zip file containing Python modules at the third input port. The file is unzipped by the execution
framework at runtime and the contents are added to the library path of the Python interpreter. The azureml_main
entry point function can then import these modules directly.
As an example, consider the file Hello.py containing a simple “Hello, World” function.
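A minimal local sketch of this workflow is shown here. The function name print_hello comes from this article, but its signature, the helper names build_hello_zip and import_from_zip, and the temporary folders are assumptions made for illustration. The second helper only imitates what the text says the execution framework does (unzip, then extend the library path); it is not Studio's actual implementation.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Assumed contents of Hello.py: one function, named print_hello as in the text.
HELLO_SOURCE = '''
def print_hello(name):
    message = "Hello, " + name + "!"
    print(message)
    return message
'''


def build_hello_zip(folder):
    """Write Hello.py into `folder` and package it as Hello.zip."""
    script_path = os.path.join(folder, "Hello.py")
    with open(script_path, "w") as handle:
        handle.write(HELLO_SOURCE)
    zip_path = os.path.join(folder, "Hello.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.write(script_path, arcname="Hello.py")
    return zip_path


def import_from_zip(zip_path, module_name):
    """Local stand-in for the runtime: unzip, add to the library path, import."""
    target = tempfile.mkdtemp()
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    sys.path.insert(0, target)
    return importlib.import_module(module_name)


work_dir = tempfile.mkdtemp()
hello_zip = build_hello_zip(work_dir)
Hello = import_from_zip(hello_zip, "Hello")
greeting = Hello.print_hello("World")
```

Inside Studio, only the import and the call would appear in azureml_main; the zip file itself is what gets uploaded as a dataset.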
Upload the zip file as a dataset into Studio. Then create and run an experiment that uses the Python code in the
Hello.zip file by attaching it to the third input port of the Execute Python Script module as shown in the
following image.
The module output shows that the zip file has been unpackaged and that the function print_hello has been run.
Accessing Azure Storage Blobs
You can access data stored in an Azure Blob Storage account using these steps:
1. Download the Azure Blob Storage package for Python locally.
2. Upload the zip file to your Studio workspace as a dataset.
3. Create your BlobService object with protocol='http'
from azure.storage.blob import BlockBlobService
# Create the BlockBlobService that is used to call the Blob service for the storage account
block_blob_service = BlockBlobService(account_name='account_name', account_key='account_key', protocol='http')
The following experiment then computes and returns the importance scores of features in the “Pima Indian
Diabetes” dataset in Azure Machine Learning Studio:
Limitations
The Execute Python Script module currently has the following limitations:
Sandboxed execution
The Python runtime is currently sandboxed and doesn't allow access to the network or the local file system in a
persistent manner. All files saved locally are isolated and deleted once the module finishes. The Python code cannot
access most directories on the machine it runs on, the exception being the current directory and its subdirectories.
Lack of sophisticated development and debugging support
The Python module currently does not support IDE features such as IntelliSense and debugging. Also, if the
module fails at runtime, the full Python stack trace is available, but it must be viewed in the output log for the
module. We currently recommend that you develop and debug Python scripts in an environment such as IPython
and then import the code into the module.
Single data frame output
The Python entry point is only permitted to return a single data frame as output. It is not currently possible to
return arbitrary Python objects such as trained models directly back to the Studio runtime. Like Execute R Script,
which has the same limitation, it is possible in many cases to pickle objects into a byte array and then return that
inside of a data frame.
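The pickling workaround mentioned above might look like the following sketch. The helper names, the column name payload, and the dictionary standing in for a trained model are all hypothetical. Encoding the pickled bytes as base64 text is also an assumption, made so that the value fits in an ordinary string column.

```python
import base64
import pickle

import pandas as pd


def object_to_frame(obj):
    """Pickle an object and wrap it in a one-row, one-column data frame."""
    raw_bytes = pickle.dumps(obj)
    # base64 turns the bytes into plain ASCII text that a string column can hold.
    payload = base64.b64encode(raw_bytes).decode("ascii")
    return pd.DataFrame({"payload": [payload]})


def frame_to_object(frame):
    """Reverse of object_to_frame: decode the text and unpickle the object."""
    payload = frame["payload"].iloc[0]
    return pickle.loads(base64.b64decode(payload))


# A plain dictionary stands in for a trained model in this sketch.
model = {"weights": [0.5, -1.25], "bias": 0.1}
frame = object_to_frame(model)
restored = frame_to_object(frame)
```

A downstream Execute Python Script module could apply frame_to_object to its input to recover the object. Only unpickle data that you produced yourself, because unpickling untrusted data can run arbitrary code.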
Inability to customize Python installation
Currently, the only way to add custom Python modules is via the zip file mechanism described earlier. While this is
feasible for small modules, it's cumbersome for large modules (especially modules with native DLLs) or a large
number of modules.
Next steps
For more information, see the Python Developer Center.
How to choose algorithms for Azure Machine
Learning Studio
3/15/2019
The answer to the question "What machine learning algorithm should I use?" is always "It depends." It depends on
the size, quality, and nature of the data. It depends on what you want to do with the answer. It depends on how the
math of the algorithm was translated into instructions for the computer you are using. And it depends on how
much time you have. Even the most experienced data scientists can't tell which algorithm will perform best before
trying them.
Machine Learning Studio provides state-of-the-art algorithms, such as Scalable Boosted Decision trees, Bayesian
Recommendation systems, Deep Neural Networks, and Decision Jungles developed at Microsoft Research.
Scalable open-source machine learning packages, like Vowpal Wabbit, are also included. Machine Learning Studio
supports machine learning algorithms for multiclass and binary classification, regression, and clustering. See the
complete list of Machine Learning Modules. The documentation provides some information about each algorithm
and how to tune parameters to optimize the algorithm for your use.
NOTE
To download the cheat sheet and follow along with this article, go to Machine learning algorithm cheat sheet for Microsoft
Azure Machine Learning Studio.
This cheat sheet has a very specific audience in mind: a beginning data scientist with undergraduate-level machine
learning, trying to choose an algorithm to start with in Azure Machine Learning Studio. That means that it makes
some generalizations and oversimplifications, but it points you in a safe direction. It also means that there are lots
of algorithms not listed here.
These recommendations are compiled feedback and tips from many data scientists and machine learning experts.
We didn't agree on everything, but we've tried to harmonize our opinions into a rough consensus. Most of the
statements of disagreement begin with "It depends…"
How to use the cheat sheet
Read the path and algorithm labels on the chart as "For <path label>, use <algorithm>." For example, "For speed,
use two-class logistic regression." Sometimes more than one branch applies. Sometimes none of them are a
perfect fit. They're intended to be rule-of-thumb recommendations, so don't worry about it being exact. Several
data scientists we talked with said that the only sure way to find the very best algorithm is to try all of them.
Here's an example from the Azure AI Gallery of an experiment that tries several algorithms against the same data
and compares the results: Compare Multi-class Classifiers: Letter recognition.
TIP
To download an easy-to-understand infographic overview of machine learning basics to learn about popular algorithms
used to answer common machine learning questions, see Machine learning basics with algorithm examples.
Nonlinear class boundary - relying on a linear classification algorithm would result in low accuracy
Data with a nonlinear trend - using a linear regression method would generate much larger errors than
necessary
Despite their dangers, linear algorithms are very popular as a first line of attack. They tend to be algorithmically
simple and fast to train.
Number of parameters
Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect
the algorithm's behavior, such as error tolerance or number of iterations, or options between variants of how the
algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting
just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to
find a good combination.
Alternatively, there is a parameter sweeping module block in Azure Machine Learning Studio that automatically
tries all parameter combinations at whatever granularity you choose. While this is a great way to make sure you've
spanned the parameter space, the time required to train a model increases exponentially with the number of
parameters.
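To see why a full sweep becomes expensive, the sketch below enumerates every combination of a small grid. The parameter names and values are made up for illustration, and the helper is not the Studio sweep module; it only counts how many models an exhaustive grid would require.

```python
from itertools import product


def grid(settings):
    """Return every combination of the listed values, one dict per combination."""
    names = list(settings)
    value_lists = [settings[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical parameters, three candidate values each.
sweep = {
    "learning_rate": [0.01, 0.1, 1.0],
    "iterations": [10, 100, 1000],
    "regularization": [0.0, 0.1, 1.0],
}
candidates = grid(sweep)
print(len(candidates))  # 3 * 3 * 3 = 27 models to train
```

With three values per parameter, each additional parameter multiplies the number of models to train by three, so a fourth parameter would raise the count from 27 to 81.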
The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often
achieve very good accuracy, provided you can find the right combination of parameter settings.
Number of features
For certain types of data, the number of features can be very large compared to the number of data points. This is
often the case with genetics or textual data. The large number of features can bog down some learning algorithms,
making training time unfeasibly long. Support Vector Machines are particularly well suited to this case (see
below).
Special cases
Some learning algorithms make particular assumptions about the structure of the data or the desired results. If
you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster
training times.
ALGORITHM                    | ACCURACY | TRAINING TIME | LINEARITY | PARAMETERS | NOTES
Two-class classification     |          |               |           |            |
logistic regression          |          | ●             | ●         | 5          |
decision forest              | ●        | ○             |           | 6          |
averaged perceptron          | ○        | ○             | ●         | 4          |
Bayes’ point machine         |          | ○             | ●         | 3          |
Multi-class classification   |          |               |           |            |
logistic regression          |          | ●             | ●         | 5          |
decision forest              | ●        | ○             |           | 6          |
Regression                   |          |               |           |            |
linear                       |          | ●             | ●         | 4          |
Bayesian linear              |          | ○             | ●         | 2          |
decision forest              | ●        | ○             |           | 6          |
Anomaly detection            |          |               |           |            |
PCA-based anomaly detection  |          | ○             | ●         | 3          |
K-means                      |          | ○             | ●         | 4          | A clustering algorithm
Algorithm properties:
● - shows excellent accuracy, fast training times, and the use of linearity
○ - shows good accuracy and moderate training times
Algorithm notes
Linear regression
As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the data set. It's a workhorse,
simple and fast, but it may be overly simplistic for some problems.
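As a small illustration of fitting a line to data, the sketch below computes the ordinary least-squares slope and intercept for one input feature. It is a textbook formula written in plain Python, not the Studio Linear Regression module, and the sample points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

When the data follows a curve rather than a line, the same fit still returns a slope and intercept, but the errors grow, which is the "overly simplistic" case mentioned above.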
Next Steps
To download an easy-to-understand infographic overview of machine learning basics to learn about
popular algorithms used to answer common machine learning questions, see Machine learning basics with
algorithm examples.
For a list by category of all the machine learning algorithms available in Machine Learning Studio, see
Initialize Model in the Machine Learning Studio Algorithm and Module Help.
For a complete alphabetical list of algorithms and modules in Machine Learning Studio, see A-Z list of
Machine Learning Studio modules in Machine Learning Studio Algorithm and Module Help.
Machine learning algorithm cheat sheet for Azure
Machine Learning Studio
3/15/2019
The Azure Machine Learning Studio Algorithm Cheat Sheet helps you choose the right algorithm for a
predictive analytics model.
Azure Machine Learning Studio has a large library of algorithms from the regression, classification, clustering,
and anomaly detection families. Each is designed to address a different type of machine learning problem.
Download and print the Machine Learning Studio Algorithm Cheat Sheet in tabloid size to keep it handy and get
help choosing an algorithm.
NOTE
For help in using this cheat sheet for choosing the right algorithm, plus a deeper discussion of the different types of machine
learning algorithms and how they're used, see How to choose algorithms for Microsoft Azure Machine Learning Studio.
Next steps
For a downloadable infographic that describes algorithms and provides examples, see Downloadable
Infographic: Machine learning basics with algorithm examples.
For a list by category of all the machine learning algorithms available in Machine Learning Studio, see
Initialize Model in the Machine Learning Studio Algorithm and Module Help.
For a complete alphabetical list of algorithms and modules in Machine Learning Studio, see A-Z list of
Machine Learning Studio modules in Machine Learning Studio Algorithm and Module Help.
Downloadable Infographic: Machine learning basics
with algorithm examples
3/5/2019
Download this easy-to-understand infographic overview of machine learning basics to learn about popular
algorithms used to answer common machine learning questions. Algorithm examples help the machine learning
beginner understand which algorithms to use and what they're used for.
Azure Machine Learning Studio enables you to build and test a predictive analytic solution. Then you can deploy
the solution as a web service.
Machine Learning Studio web services provide an interface between an application and a Machine Learning
Studio workflow scoring model. An external application can communicate with a Machine Learning Studio
workflow scoring model in real time. A call to a Machine Learning Studio web service returns prediction results
to an external application. To make a call to a web service, you pass an API key that was created when you
deployed the web service. A Machine Learning Studio web service is based on REST, a popular architecture
choice for web programming projects.
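The sketch below shows how a client might prepare such a call with the Python standard library. Everything specific in it is an assumption for illustration: the URL is a placeholder, the column names and values are invented, and the "Bearer" authorization header and the "Inputs"/"ColumnNames"/"Values" body layout should be checked against the API help page of your own service. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a POST request that carries one batch of rows."""
    # Assumed body layout; confirm the schema on your service's API help page.
    body = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        # The API key created at deployment time authorizes the call.
        "Authorization": "Bearer " + api_key,
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


request = build_scoring_request(
    "https://example.invalid/workspaces/WORKSPACE_ID/services/SERVICE_ID/execute",
    "YOUR_API_KEY",
    ["age", "income"],
    [["42", "55000"]],
)
# urllib.request.urlopen(request) would send the call and return the predictions.
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before any call is made against a live service.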
Azure Machine Learning Studio has two types of web services:
Request-Response Service (RRS): A low latency, highly scalable service that scores a single data record.
Batch Execution Service (BES): An asynchronous service that scores a batch of data records.
The input for BES is similar to the data input that RRS uses. The main difference is that BES reads a block of records from
a variety of sources, such as Azure Blob storage, Azure Table storage, Azure SQL Database, HDInsight (hive
query), and HTTP sources.
From a high-level point-of-view, you deploy your model in three steps:
Create a training experiment - In Studio, you can train and test a predictive analytics model using training
data that you supply, using a large set of built-in machine learning algorithms.
Convert it to a predictive experiment - Once your model has been trained with existing data and you're
ready to use it to score new data, you prepare and streamline your experiment for predictions.
Deploy it as a New web service or a Classic web service - When you deploy your predictive experiment
as an Azure web service, users can send data to your model and receive your model's predictions.
For more information on how to perform this conversion, see How to prepare your model for deployment in
Azure Machine Learning Studio.
The following steps describe deploying a predictive experiment as a New web service. You can also deploy the
experiment as Classic web service.
NOTE
To deploy a New web service, you must have sufficient permissions in the subscription to which you are deploying the web
service. For more information, see Manage a Web service using the Azure Machine Learning Web Services portal.
To test your BES, click Batch. On the Batch test page, click Browse under your input and select a CSV file
containing appropriate sample values. If you don't have a CSV file, and you created your predictive experiment
using Machine Learning Studio, you can download the data set for your predictive experiment and use it.
To download the data set, open Machine Learning Studio. Open your predictive experiment and right click the
input for your experiment. From the context menu, select dataset and then select Download.
Click Test. The status of your Batch Execution job displays to the right under Test Batch Jobs.
On the CONFIGURATION page, you can change the description and title, update the storage account key, and
enable sample data for your web service.
On the CONFIGURATION page, you can change the display name of the service and give it a description. The
name and description are displayed in the Azure portal where you manage your web services.
You can provide a description for your input data, output data, and web service parameters by entering a string
for each column under INPUT SCHEMA, OUTPUT SCHEMA, and Web SERVICE PARAMETER. These
descriptions are used in the sample code documentation provided for the web service.
You can enable logging to diagnose any failures that you're seeing when your web service is accessed. For more
information, see Enable logging for Machine Learning Studio web services.
You can also configure the endpoints for the web service in the Azure Machine Learning Web Services portal
similar to the procedure shown previously in the New web service section. The options are different: you can
add or change the service description, enable logging, and enable sample data for testing.
Access your Classic web service
Once you deploy your web service from Machine Learning Studio, you can send data to the service and receive
responses programmatically.
The dashboard provides all the information you need to access your web service. For example, the API key is
provided to allow authorized access to the service, and API help pages are provided to help you get started
writing your code.
For more information about accessing a Machine Learning Studio web service, see How to consume an Azure
Machine Learning Studio Web service.
Manage your Classic web service
There are various actions you can perform to monitor a web service. You can update it and delete it. You can
also add additional endpoints to a Classic web service in addition to the default endpoint that is created when
you deploy it.
For more information, see Manage an Azure Machine Learning Studio workspace and Manage a web service
using the Azure Machine Learning Studio Web Services portal.
NOTE
If you made configuration changes in the original web service, for example, entering a new display name or description,
you will need to enter those values again.
One option for updating your web service is to retrain the model programmatically. For more information, see
Retrain Machine Learning Studio models programmatically.
Next steps
For more technical details on how deployment works, see How a Machine Learning Studio model
progresses from an experiment to an operationalized Web service.
For details on how to get your model ready to deploy, see How to prepare your model for deployment in
Azure Machine Learning Studio.
There are several ways to consume the REST API and access the web service. See How to consume an
Azure Machine Learning Studio web service.
How a Machine Learning Studio model progresses
from an experiment to a Web service
3/15/2019
Azure Machine Learning Studio provides an interactive canvas that allows you to develop, run, test, and iterate an
experiment representing a predictive analysis model. There are a wide variety of modules available that can:
Input data into your experiment
Manipulate the data
Train a model using machine learning algorithms
Score the model
Evaluate the results
Output final values
Once you’re satisfied with your experiment, you can deploy it as a Classic Azure Machine Learning Web service
or a New Azure Machine Learning Web service so that users can send it new data and receive back results.
In this article, we give an overview of the mechanics of how your Machine Learning model progresses from a
development experiment to an operationalized Web service.
NOTE
There are other ways to develop and deploy machine learning models, but this article is focused on how you use Machine
Learning Studio. For example, to read a description of how to create a classic predictive Web service with R, see the blog post
Build & Deploy Predictive Web Apps Using RStudio and Azure Machine Learning studio.
While Azure Machine Learning Studio is designed to help you develop and deploy a predictive analysis model, it’s
possible to use Studio to develop an experiment that doesn’t include a predictive analysis model. For example, an
experiment might just input data, manipulate it, and then output the results. Just like a predictive analysis
experiment, you can deploy this non-predictive experiment as a Web service, but it’s a simpler process because the
experiment isn’t training or scoring a machine learning model. While it’s not typical to use Studio in this way,
we’ll include it in the discussion so that we can give a complete explanation of how Studio works.
NOTE
When you click Predictive Web Service you start an automatic process to convert your training experiment to a predictive
experiment, and this works well in most cases. If your training experiment is complex (for example, you have multiple paths
for training that you join together), you might prefer to do this conversion manually. For more information, see How to
prepare your model for deployment in Azure Machine Learning Studio.
Next steps
For more details on the process of developing an experiment, see the following articles:
converting the experiment - How to prepare your model for deployment in Azure Machine Learning Studio
deploying the Web service - Deploy an Azure Machine Learning web service
retraining the model - Retrain Machine Learning models programmatically
For examples of the whole process, see:
Machine learning tutorial: Create your first experiment in Azure Machine Learning Studio
Walkthrough: Develop a predictive analytics solution for credit risk assessment in Azure Machine Learning
How to prepare your model for deployment in Azure
Machine Learning Studio
3/15/2019
Azure Machine Learning Studio gives you the tools you need to develop a predictive analytics model and then
operationalize it by deploying it as an Azure web service.
To do this, you use Studio to create an experiment - called a training experiment - where you train, score, and edit
your model. Once you're satisfied, you get your model ready to deploy by converting your training experiment to
a predictive experiment that's configured to score user data.
You can see an example of this process in Tutorial 1: Predict credit risk.
This article takes a deep dive into the details of how a training experiment gets converted into a predictive
experiment, and how that predictive experiment is deployed. By understanding these details, you can learn how to
configure your deployed model to make it more effective.
Overview
The process of converting a training experiment to a predictive experiment involves three steps:
1. Replace the machine learning algorithm modules with your trained model.
2. Trim the experiment to only those modules that are needed for scoring. A training experiment includes a
number of modules that are necessary for training but are not needed once the model is trained.
3. Define how your model will accept data from the web service user, and what data will be returned.
TIP
In your training experiment, you've been concerned with training and scoring your model using your own data. But once
deployed, users will send new data to your model and it will return prediction results. So, as you convert your training
experiment to a predictive experiment to get it ready for deployment, keep in mind how the model will be used by others.
When you convert this training experiment to a predictive experiment, some of these modules are no longer
needed, or they now serve a different purpose:
Data - The data in this sample dataset is not used during scoring - the user of the web service will supply
the data to be scored. However, the metadata from this dataset, such as data types, is used by the trained
model. So you need to keep the dataset in the predictive experiment so that it can provide this metadata.
Prep - Depending on the user data that will be submitted for scoring, these modules may or may not be
necessary to process the incoming data. The Set Up Web Service button doesn't touch these - you need
to decide how you want to handle them.
For instance, in this example the sample dataset may have missing values, so a Clean Missing Data module
was included to deal with them. Also, the sample dataset includes columns that are not needed to train the
model. So a Select Columns in Dataset module was included to exclude those extra columns from the data
flow. If you know that the data that will be submitted for scoring through the web service will not have
missing values, then you can remove the Clean Missing Data module. However, since the Select Columns
in Dataset module helps define the columns of data that the trained model expects, that module needs to
remain.
Train - These modules are used to train the model. When you click Set Up Web Service, these modules
are replaced with a single module that contains the model you trained. This new module is saved in the
Trained Models section of the module palette.
Score - In this example, the Split Data module is used to divide the data stream into test data and training
data. In the predictive experiment, we're not training anymore, so Split Data can be removed. Similarly, the
second Score Model module and the Evaluate Model module are used to compare results from the test
data, so these modules are not needed in the predictive experiment. The remaining Score Model module,
however, is needed to return a score result through the web service.
Here is how our example looks after clicking Set Up Web Service:
The work done by Set Up Web Service may be sufficient to prepare your experiment to be deployed as a web
service. However, you may want to do some additional work specific to your experiment.
Adjust input and output modules
In your training experiment, you used a set of training data and then did some processing to get the data in a
form that the machine learning algorithm needed. If the data you expect to receive through the web service will
not need this processing, you can bypass it: connect the output of the Web service input module to a different
module in your experiment. The user's data will now arrive in the model at this location.
For example, by default Set Up Web Service puts the Web service input module at the top of your data flow,
as shown in the figure above. But we can manually position the Web service input past the data processing
modules:
The input data provided through the web service will now pass directly into the Score Model module without any
preprocessing.
Similarly, by default Set Up Web Service puts the Web service output module at the bottom of your data flow.
In this example, the web service will return to the user the output of the Score Model module, which includes the
complete input data vector plus the scoring results. However, if you would prefer to return something different,
then you can add additional modules before the Web service output module.
For example, to return only the scoring results and not the entire vector of input data, add a Select Columns in
Dataset module to exclude all columns except the scoring results. Then move the Web service output module to
the output of the Select Columns in Dataset module. The experiment looks like this:
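As a rough illustration of what this Select Columns in Dataset step does to each returned row, here is a minimal Python sketch. The column names "Scored Labels" and "Scored Probabilities" and the sample row are assumptions for the example; check the actual output columns of your Score Model module.

```python
# Hypothetical scoring column names; your Score Model output may differ.
SCORE_COLUMNS = ("Scored Labels", "Scored Probabilities")


def keep_score_columns(row, score_columns=SCORE_COLUMNS):
    """Return a copy of `row` that contains only the scoring columns."""
    return {name: row[name] for name in score_columns if name in row}


# A made-up scored row: input features plus the scoring results.
scored_row = {
    "age": 39,
    "education": "Bachelors",
    "Scored Labels": "<=50K",
    "Scored Probabilities": 0.12,
}

print(keep_score_columns(scored_row))
```

The input features are dropped and only the scoring results remain, which is the shape of data the web service would return after this change.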
Add or remove additional data processing modules
If there are more modules in your experiment that you know will not be needed during scoring, these can be
removed. For example, because we moved the Web service input module to a point after the data processing
modules, we can remove the Clean Missing Data module from the predictive experiment.
Our predictive experiment now looks like this:
When you create a predictive experiment, you typically add a web service input and output. When you deploy the
experiment, consumers can send and receive data from the web service through the inputs and outputs. For some
applications, a consumer's data may be available from a data feed or already reside in an external data source such
as Azure Blob storage. In these cases, they do not need to read and write data using web service inputs and outputs.
They can, instead, use the Batch Execution Service (BES) to read data from the data source using an Import Data
module and write the scoring results to a different data location using an Export Data module.
The Import Data and Export Data modules can read from and write to various data locations such as a Web URL
via HTTP, a Hive Query, an Azure SQL database, Azure Table storage, Azure Blob storage, a data feed provider, or
an on-premises SQL database.
This topic uses the "Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset" sample and assumes
the dataset has already been loaded into an Azure SQL table named censusdata.
from dbo.censusdata;
8. At the bottom of the experiment canvas, click Run.
NOTE
To deploy a New web service you must have sufficient permissions in the subscription to which you are deploying the web
service. For more information, see Manage a Web service using the Azure Machine Learning Web Services portal.
An Azure Machine Learning web service is created by publishing an experiment that contains modules with
configurable parameters. In some cases, you may want to change the module behavior while the web service is
running. Web Service Parameters allow you to do this task.
A common example is setting up the Import Data module so that the user of the published web service can specify
a different data source when the web service is accessed. Or configuring the Export Data module so that a different
destination can be specified. Some other examples include changing the number of bits for the Feature Hashing
module or the number of desired features for the Filter-Based Feature Selection module.
You can set Web Service Parameters and associate them with one or more module parameters in your experiment,
and you can specify whether they are required or optional. The user of the web service can then provide values for
these parameters when they call the web service.
NOTE
The API documentation for a classic web service is provided through the API help page link in the web service
DASHBOARD in Machine Learning Studio. The API documentation for a new web service is provided through the Azure
Machine Learning Web Services portal on the Consume and Swagger API pages for your web service.
Example
As an example, let's assume we have an experiment with an Export Data module that sends information to Azure
blob storage. We'll define a Web Service Parameter named "Blob path" that allows the web service user to change
the path to the blob storage when the service is accessed.
1. In Machine Learning Studio, click the Export Data module to select it. Its properties are shown in the
Properties pane to the right of the experiment canvas.
2. Specify the storage type:
Under Please specify data destination, select "Azure Blob Storage".
Under Please specify authentication type, select "Account".
Enter the account information for the Azure blob storage.
3. Click the icon to the right of the Path to blob beginning with container parameter. It looks like this:
6. Click Run.
7. Click Deploy Web Service and select Deploy Web Service [Classic] or Deploy Web Service [New] to
deploy the web service.
NOTE
To deploy a New web service you must have sufficient permissions in the subscription to which you are deploying the web
service. For more information, see Manage a Web service using the Azure Machine Learning Web Services portal.
The user of the web service can now specify a new destination for the Export Data module when accessing the web
service.
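A caller would typically supply the new value in the GlobalParameters section of the request body. The following is a minimal sketch, assuming the Web Service Parameter is named "Blob path" as in the example above; the helper name and the blob path value are hypothetical.

```python
import json


def build_request_body(global_parameters, inputs=None):
    """Serialize a request body that carries Web Service Parameter values."""
    return json.dumps({
        "Inputs": inputs or {},
        "GlobalParameters": global_parameters,
    })


# "Blob path" matches the parameter defined above; the value is a made-up example.
body = build_request_body({"Blob path": "mycontainer/results/output.csv"})
print(body)
```

Check the API help page of your own web service for the exact parameter names it expects.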
More information
For a more detailed example, see the Web Service Parameters entry in the Machine Learning Blog.
For more information on accessing a Machine Learning web service, see How to consume an Azure Machine
Learning Web service.
Enable logging for Azure Machine Learning Studio
web services
3/15/2019 • 2 minutes to read
This document provides information on the logging capability of Machine Learning Studio web services. Logging
provides additional information, beyond just an error number and a message, that can help you troubleshoot your
calls to the Machine Learning Studio APIs.
2. On the top menu bar, click Web Services for a New web service, or click Classic Web Services for a
Classic web service.
3. For a New web service, click the web service name. For a Classic web service, click the web service name
and then on the next page click the appropriate endpoint.
4. On the top menu bar, click Configure.
5. Set the Enable Logging option to Error (to log only errors) or All (for full logging).
6. Click Save.
7. For Classic web services, create the ml-diagnostics container.
All web service logs are kept in a blob container named ml-diagnostics in the storage account associated
with the web service. For New web services, this container is created the first time you access the web
service. For Classic web services, you need to create the container if it doesn't already exist.
a. In the Azure portal, go to the storage account associated with the web service.
b. Under Blob Service, click Containers.
c. If the container ml-diagnostics doesn't exist, click +Container, give the container the name "ml-
diagnostics", and select the Access type as "Blob". Click OK.
TIP
For a Classic web service, the Web Services Dashboard in Machine Learning Studio also has a switch to enable logging.
However, because logging is now managed through the Web Services portal, you need to enable logging through the portal
as described in this article. If you already enabled logging in Studio, then in the Web Services Portal, disable logging and
enable it again.
You can manage your Machine Learning New and Classic Web services using the Microsoft Azure Machine
Learning Web Services portal. Since Classic Web services and New Web services are based on different
underlying technologies, you have slightly different management capabilities for each of them.
In the Machine Learning Web Services portal you can:
Monitor how the web service is being used.
Configure the description, update the keys for the web service (New only), update your storage account key
(New only), enable logging, and enable or disable sample data.
Delete the web service.
Create, delete, or update billing plans (New only).
Add and delete endpoints (Classic only)
NOTE
You also can manage Classic web services in Machine Learning Studio on the Web services tab.
Overview
This guide shows you how to quickly get started using API Management to manage your Azure Machine Learning
Studio web services.
Prerequisites
To complete this guide, you need:
An Azure account. If you don’t have an Azure account, click here for details on how to create a free trial account.
An AzureML account. If you don’t have an AzureML account, click here for details on how to create a free trial
account.
The workspace, service, and api_key for an AzureML experiment deployed as a web service. Click here for
details on how to create an AzureML experiment. Click here for details on how to deploy an AzureML
experiment as a web service. Alternately, Appendix A has instructions for how to create and test a simple
AzureML experiment and deploy it as a web service.
The New operation window will be displayed and the Signature tab will be selected by default.
2. Click APIs from the top menu, and then click AzureML Demo API to see the operations available.
5. Click Send.
After an operation is invoked, the developer portal displays the Requested URL from the back-end service, the
Response status, the Response headers, and any Response content.
First, using a browser of your choice, navigate to: https://studio.azureml.net/ and enter your credentials to log in.
Next, create a new blank experiment.
Rename it to SimpleFeatureHashingExperiment. Expand Saved Datasets and drag Book Reviews from
Amazon onto your experiment.
Expand Data Transformation and Manipulation and drag Select Columns in Dataset onto your experiment.
Connect Book Reviews from Amazon to Select Columns in Dataset.
Click Select Columns in Dataset and then click Launch column selector and select Col2. Click the checkmark
to apply these changes.
Expand Text Analytics and drag Feature Hashing onto the experiment. Connect Select Columns in Dataset to
Feature Hashing.
Type 3 for the Hashing bitsize. This will create 8 (2^3) columns.
At this point, you may want to click Run to test the experiment.
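The relationship between the hashing bitsize and the number of output columns can be checked with a short sketch. The bucket function below is a toy that uses CRC32 purely for illustration; it is not the hash function the Feature Hashing module uses, and both function names are hypothetical.

```python
import zlib


def hashed_column_count(bitsize):
    """Number of output columns produced for a given hashing bitsize."""
    if bitsize < 1:
        raise ValueError("bitsize must be a positive integer")
    return 2 ** bitsize


def toy_bucket(token, bitsize):
    """Map a token to a column index in [0, 2**bitsize). Toy hash only."""
    return zlib.crc32(token.encode("utf-8")) % hashed_column_count(bitsize)


print(hashed_column_count(3))  # prints 8
for word in "This is a good day".split():
    print(word, toy_bucket(word, 3))
```

Every token lands in one of the 8 columns, so different tokens can share a column when the bitsize is small.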
You can find the api_key by clicking your experiment in the web service dashboard.
An easy way to test the RRS endpoint is to click Test on the web service dashboard.
Sample Code
Another way to test your RRS is from your client code. If you click Request/response on the dashboard and scroll
to the bottom, you will see sample code for C#, Python, and R. You will also see the syntax of the RRS request,
including the request URI, headers, and body.
This guide shows a working Python example. You will need to modify it with the workspace, service, and api_key
of your experiment.
import urllib2
import json

workspace = "<REPLACE WITH YOUR EXPERIMENT'S WEB SERVICE WORKSPACE ID>"
service = "<REPLACE WITH YOUR EXPERIMENT'S WEB SERVICE SERVICE ID>"
api_key = "<REPLACE WITH YOUR EXPERIMENT'S WEB SERVICE API KEY>"

# Request body: one input row with the single column the experiment expects
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Col2"],
            "Values": [["This is a good day"]]
        },
    },
    "GlobalParameters": {}
}

url = ("https://ussouthcentral.services.azureml.net/workspaces/" + workspace +
       "/services/" + service + "/execute?api-version=2.0&details=true")
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}
body = str.encode(json.dumps(data))

try:
    req = urllib2.Request(url, body, headers)
    response = urllib2.urlopen(req)
    result = response.read()
    print("result:" + result)
except urllib2.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    print(error.info())
    print(json.loads(error.read()))
import urllib2
import json
import time
from azure.storage import *

workspace = "<REPLACE WITH YOUR WORKSPACE ID>"
service = "<REPLACE WITH YOUR SERVICE ID>"
api_key = "<REPLACE WITH THE API KEY FOR YOUR WEB SERVICE>"
storage_account_name = "<REPLACE WITH YOUR AZURE STORAGE ACCOUNT NAME>"
storage_account_key = "<REPLACE WITH YOUR AZURE STORAGE KEY>"
storage_container_name = "<REPLACE WITH YOUR AZURE STORAGE CONTAINER NAME>"
input_file = "<REPLACE WITH THE LOCATION OF YOUR INPUT FILE>"  # Example: C:\\mydata.csv
output_file = "<REPLACE WITH THE LOCATION OF YOUR OUTPUT FILE>"  # Example: C:\\myresults.csv
input_blob_name = "mydatablob.csv"
output_blob_name = "myresultsblob.csv"


def printHttpError(httpError):
    print("The request failed with status code: " + str(httpError.code))
    print(httpError.info())
    print(json.loads(httpError.read()))
    return


def saveBlobToFile(blobUrl, resultsLabel):
    print("Reading the result from " + blobUrl)
    try:
        response = urllib2.urlopen(blobUrl)
    except urllib2.HTTPError as error:
        printHttpError(error)
        return
    with open(output_file, "w+") as f:
        f.write(response.read())
    print(resultsLabel + " have been written to the file " + output_file)
    return


def processResults(result):
    first = True
    results = result["Results"]
    for outputName in results:
        result_blob_location = results[outputName]
        sas_token = result_blob_location["SasBlobToken"]
        base_url = result_blob_location["BaseLocation"]
        relative_url = result_blob_location["RelativeLocation"]
        print("The results for " + outputName + " are available at the following Azure Storage location:")
        print("BaseLocation: " + base_url)
        print("RelativeLocation: " + relative_url)
        print("SasBlobToken: " + sas_token)
        # Only the first output is downloaded to the local output file
        if (first):
            first = False
            url3 = base_url + relative_url + sas_token
            saveBlobToFile(url3, "The results for " + outputName)
    return


def invokeBatchExecutionService():
    url = "https://ussouthcentral.services.azureml.net/workspaces/" + workspace + "/services/" + service + "/jobs"
    blob_service = BlobService(account_name=storage_account_name, account_key=storage_account_key)
    print("Uploading the input to blob storage...")
    data_to_upload = open(input_file, "r").read()
    blob_service.put_blob(storage_container_name, input_blob_name, data_to_upload, x_ms_blob_type="BlockBlob")
    print("Uploaded the input to blob storage")
    input_blob_path = "/" + storage_container_name + "/" + input_blob_name
    connection_string = ("DefaultEndpointsProtocol=https;AccountName=" + storage_account_name +
                         ";AccountKey=" + storage_account_key)
    payload = {
        "Input": {
            "ConnectionString": connection_string,
            "RelativeLocation": input_blob_path
        },
        "Outputs": {
            "output1": {
                "ConnectionString": connection_string,
                "RelativeLocation": "/" + storage_container_name + "/" + output_blob_name
            },
        },
        "GlobalParameters": {
        }
    }
    body = str.encode(json.dumps(payload))
    headers = {"Content-Type": "application/json", "Authorization": ("Bearer " + api_key)}
    print("Submitting the job...")
    # submit the job
    req = urllib2.Request(url + "?api-version=2.0", body, headers)
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError as error:
        printHttpError(error)
        return
    result = response.read()
    job_id = result[1:-1]  # remove the enclosing double-quotes
    print("Job ID: " + job_id)
    # start the job
    print("Starting the job...")
    req = urllib2.Request(url + "/" + job_id + "/start?api-version=2.0", "", headers)
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError as error:
        printHttpError(error)
        return
    url2 = url + "/" + job_id + "?api-version=2.0"
    # poll the job status until it reaches a final state
    while True:
        print("Checking the job status...")
        # If you are using Python 3+, replace urllib2 with urllib.request in the following code
        req = urllib2.Request(url2, headers={"Authorization": ("Bearer " + api_key)})
        try:
            response = urllib2.urlopen(req)
        except urllib2.HTTPError as error:
            printHttpError(error)
            return
        result = json.loads(response.read())
        status = result["StatusCode"]
        print("status: " + str(status))
        if (status == 0 or status == "NotStarted"):
            print("Job " + job_id + " not yet started...")
        elif (status == 1 or status == "Running"):
            print("Job " + job_id + " running...")
        elif (status == 2 or status == "Failed"):
            print("Job " + job_id + " failed!")
            print("Error details: " + result["Details"])
            break
        elif (status == 3 or status == "Cancelled"):
            print("Job " + job_id + " cancelled!")
            break
        elif (status == 4 or status == "Finished"):
            print("Job " + job_id + " finished!")
            processResults(result)
            break
        time.sleep(1)  # wait one second
    return


invokeBatchExecutionService()
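The status handling in the polling loop above accepts either a numeric code or a name. A small helper can make that mapping explicit. This is a sketch only: the code-to-name pairs are taken from the sample above, and the function names are hypothetical.

```python
# Status names in numeric order, as used by the polling loop above.
STATUS_NAMES = ["NotStarted", "Running", "Failed", "Cancelled", "Finished"]
TERMINAL_STATUSES = {"Failed", "Cancelled", "Finished"}


def normalize_status(status):
    """Convert a numeric or string status into its string name."""
    if isinstance(status, int):
        return STATUS_NAMES[status]
    return status


def is_terminal(status):
    """True when polling should stop for this status."""
    return normalize_status(status) in TERMINAL_STATUSES


for raw in (0, "Running", 4):
    print(raw, normalize_status(raw), is_terminal(raw))
```

With helpers like these, the loop body reduces to one check for a terminal state plus a branch on the final status name.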
Create endpoints for deployed Azure Machine
Learning Studio web services
3/15/2019 • 2 minutes to read
NOTE
This topic describes techniques applicable to a Classic Machine Learning web service.
After a web service is deployed, a default endpoint is created for that service. The default endpoint can be called by
using its API key. You can add more endpoints with their own keys from the Web Services portal. Each endpoint in
the web service is independently addressed, throttled, and managed. Each endpoint is a unique URL with an
authorization key that you can distribute to your customers.
NOTE
If you have added additional endpoints to the web service, you cannot delete the default endpoint.
1. In Machine Learning Studio, on the left navigation column, click Web Services.
2. At the bottom of the web service dashboard, click Manage endpoints. The Azure Machine Learning Web
Services portal opens to the endpoints page for the web service.
3. Click New.
4. Type a name and description for the new endpoint. Endpoint names must be 24 characters or fewer in length and
must consist of lowercase letters or numbers. Select the logging level and whether sample data is
enabled. For more information on logging, see Enable logging for Machine Learning web services.
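If you create endpoints from a script or collect names from users, you may want to check a candidate name against the naming rule in step 4 before submitting it. Here is a minimal sketch; the function name is hypothetical, and the rule encoded is only the one stated above (1 to 24 lowercase letters or digits).

```python
import re

# 1 to 24 characters, lowercase ASCII letters or digits only.
_ENDPOINT_NAME = re.compile(r"[a-z0-9]{1,24}")


def is_valid_endpoint_name(name):
    """Check a candidate endpoint name against the documented naming rule."""
    return _ENDPOINT_NAME.fullmatch(name) is not None


for candidate in ("scoring01", "Scoring01", "my-endpoint", "a" * 25):
    print(candidate, is_valid_endpoint_name(candidate))
```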
Next steps
How to consume an Azure Machine Learning web service.
How to consume an Azure Machine Learning Studio
web service
3/15/2019 • 8 minutes to read
Once you deploy an Azure Machine Learning Studio predictive model as a Web service, you can use a REST API
to send it data and get predictions. You can send the data in real-time or in batch mode.
You can find more information about how to create and deploy a Machine Learning Web service using Machine
Learning Studio here:
For a tutorial on how to create an experiment in Machine Learning Studio, see Create your first experiment.
For details on how to deploy a Web service, see Deploy a Machine Learning Web service.
For more information about Machine Learning in general, visit the Machine Learning Documentation Center.
Overview
With the Azure Machine Learning Web service, an external application communicates with a Machine Learning
workflow scoring model in real time. A Machine Learning Web service call returns prediction results to an
external application. To make a Machine Learning Web service call, you pass an API key that is created when you
deploy a prediction. The Machine Learning Web service is based on REST, a popular architecture choice for web
programming projects.
Azure Machine Learning Studio has two types of services:
Request-Response Service (RRS) – A low latency, highly scalable service that provides an interface to the
stateless models created and deployed from Machine Learning Studio.
Batch Execution Service (BES) – An asynchronous service that scores a batch of data records.
For more information about Machine Learning Web services, see Deploy a Machine Learning Web service.
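The two service types are reached through different request URIs. The sketch below builds both, assuming the URI shapes and the ussouthcentral host that appear in the samples in this documentation; your own service may live in a different region, so prefer the Request URI shown on your web service dashboard. The function names are hypothetical.

```python
# Host taken from the samples in this documentation; yours may differ by region.
BASE = "https://ussouthcentral.services.azureml.net"


def rrs_url(workspace, service):
    """Request URI for a Request-Response Service (RRS) call."""
    return (f"{BASE}/workspaces/{workspace}/services/{service}"
            "/execute?api-version=2.0&details=true")


def bes_jobs_url(workspace, service):
    """Request URI for submitting a Batch Execution Service (BES) job."""
    return f"{BASE}/workspaces/{workspace}/services/{service}/jobs?api-version=2.0"


print(rrs_url("<workspace-id>", "<service-id>"))
print(bes_jobs_url("<workspace-id>", "<service-id>"))
```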
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Formatting;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
namespace CallRequestResponseService
{
class Program
{
static void Main(string[] args)
{
InvokeRequestResponseService().Wait();
}
// Replace these values with your API key and URI found on https://services.azureml.net/
const string apiKey = "<your-api-key>";
const string apiUri = "<your-api-uri>";
if (response.IsSuccessStatusCode)
{
string result = await response.Content.ReadAsStringAsync();
Console.WriteLine("Result: {0}", result);
}
else
{
Console.WriteLine(string.Format("The request failed with status code: {0}",
response.StatusCode));
// Print the headers - they include the request ID and the timestamp,
// which are useful for debugging the failure
Console.WriteLine(response.Headers.ToString());
Python Sample
To connect to a Machine Learning Web service, use the urllib2 library for Python 2.X and urllib.request library
for Python 3.X. You will pass ScoreData, which contains a FeatureVector, an n-dimensional vector of numerical
features that represents the ScoreData. You authenticate to the Machine Learning service with an API key.
To run the code sample
1. Deploy "Sample 1: Download dataset from UCI: Adult 2 class dataset" experiment, part of the Machine
Learning sample collection.
2. Assign apiKey with the key from a Web service. See the Get an Azure Machine Learning Studio
authorization key section near the beginning of this article.
3. Assign serviceUri with the Request URI.
Here is what a complete request will look like.
import urllib2  # urllib.request for Python 3.X
import json

data = {
    "Inputs": {
        "input1":
        [
            {
                'column1': "value1",
                'column2': "value2",
                'column3': "value3"
            }
        ],
    },
    "GlobalParameters": {}
}
body = str.encode(json.dumps(data))

# Replace this with the URI and API Key for your web service
url = '<your-api-uri>'
api_key = '<your-api-key>'
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}

# "urllib.request.Request(url, body, headers)" for Python 3.X
req = urllib2.Request(url, body, headers)

try:
    # "urllib.request.urlopen(req)" for Python 3.X
    response = urllib2.urlopen(req)
    result = response.read()
    print(result)
# "urllib.error.HTTPError as error" for Python 3.X
except urllib2.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(json.loads(error.read()))
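For Python 3.X, the same request can be assembled with urllib.request. The following sketch only builds the request object and does not send it. The helper name and the example.invalid URL are placeholders, not part of the service.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, data):
    """Build (but do not send) a POST request for a scoring call."""
    body = json.dumps(data).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


data = {
    "Inputs": {
        "input1": [
            {"column1": "value1", "column2": "value2", "column3": "value3"}
        ]
    },
    "GlobalParameters": {},
}

# Placeholder URL and key; replace with the values for your web service.
req = build_scoring_request("https://example.invalid/execute", "<your-api-key>", data)
print(req.get_method(), req.full_url)
```

To send it, pass req to urllib.request.urlopen inside a try block and catch urllib.error.HTTPError, mirroring the error handling in the sample above.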
R Sample
To connect to a Machine Learning Web Service, use the RCurl and rjson libraries to make the request and
process the returned JSON response. You will pass ScoreData, which contains a FeatureVector, an n-dimensional
vector of numerical features that represents the ScoreData. You authenticate to the Machine Learning service
with an API key.
Here is what a complete request will look like.
library("RCurl")
library("rjson")
h = basicTextGatherer()
hdr = basicHeaderGatherer()
req = list(
Inputs = list(
"input1" = list(
list(
'column1' = "value1",
'column2' = "value2",
'column3' = "value3"
)
)
),
GlobalParameters = setNames(fromJSON('{}'), character(0))
)
body = enc2utf8(toJSON(req))
api_key = "<your-api-key>" # Replace this with the API key for the web service
authz_hdr = paste('Bearer', api_key, sep=' ')
h$reset()
curlPerform(url = "<your-api-uri>",
httpheader=c('Content-Type' = "application/json", 'Authorization' = authz_hdr),
postfields=body,
writefunction = h$update,
headerfunction = hdr$update,
verbose = TRUE
)
headers = hdr$value()
httpStatus = headers["status"]
if (httpStatus >= 400)
{
  print(paste("The request failed with status code:", httpStatus, sep=" "))
  # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
  print(headers)
}
print("Result:")
result = h$value()
print(fromJSON(result))
JavaScript Sample
To connect to a Machine Learning Web Service, use the request npm package in your project. You will also use
the JSON object to format your input and parse the result. Install using npm install request --save, or add
"request": "*" to your package.json under dependencies and run npm install.
let data = {
"Inputs": {
"input1":
[
{
'column1': "value1",
'column2': "value2",
'column3': "value3"
}
],
},
"GlobalParameters": {}
}
const options = {
uri: uri,
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + apiKey,
},
body: JSON.stringify(data)
}
Azure Machine Learning Studio makes it easy to call web services directly from Excel without the need to write any
code.
If you are using Excel 2013 (or later) or Excel Online, then we recommend that you use the Excel add-in.
Steps
Publish a web service. Tutorial 3: Deploy credit risk model explains how to do it. Currently the Excel workbook
feature is only supported for Request/Response services that have a single output (that is, a single scoring label).
Once you have a web service, click on the WEB SERVICES section on the left of the studio, and then select the
web service to consume from Excel.
Classic Web Service
1. On the DASHBOARD tab for the web service is a row for the REQUEST/RESPONSE service. If this
service had a single output, you should see the Download Excel Workbook link in that row.
3. A Security Warning appears. Click on the Enable Content button to run macros on your spreadsheet.
4. Once macros are enabled, a table is generated. Columns in blue are required as input into the RRS web
service, or PARAMETERS. Note the output of the RRS service, PREDICTED VALUES in green. When all
columns for a given row are filled, the workbook automatically calls the scoring API, and displays the scored
results.
5. To score more than one row, fill the second row with data and the predicted values are produced. You can
even paste several rows at once.
You can use any of the Excel features (graphs, power map, conditional formatting, etc.) with the predicted values to
help visualize the data.
Automatic updates
An RRS call is made in these two situations:
1. The first time a row has content in all of its PARAMETERS
2. Any time any of the PARAMETERS changes in a row that had all of its PARAMETERS entered.
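The two trigger conditions above can be expressed as a small decision function. This is a sketch of the described behavior, not the workbook's actual macro logic. The function names are hypothetical, and it assumes a call is only made when the edited row is still complete.

```python
def row_is_complete(row):
    """True when every PARAMETERS cell in the row has content."""
    return all(value not in (None, "") for value in row.values())


def should_call_rrs(previous_row, current_row):
    """Decide whether an edit to a row would trigger a scoring call."""
    if not row_is_complete(current_row):
        return False
    if previous_row is None or not row_is_complete(previous_row):
        return True  # situation 1: the row has just become complete
    return previous_row != current_row  # situation 2: a parameter changed


print(should_call_rrs({"age": 39, "hours": ""}, {"age": 39, "hours": 40}))  # True
print(should_call_rrs({"age": 39, "hours": 40}, {"age": 39, "hours": 40}))  # False
print(should_call_rrs({"age": 39, "hours": 40}, {"age": 39, "hours": 45}))  # True
```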
Excel Add-in for Azure Machine Learning Studio web
services
3/15/2019 • 2 minutes to read
Excel makes it easy to call web services directly without the need to write any code.
NOTE
You will see the list of the Web Services related to the file and, at the bottom, a checkbox for "Auto-predict". If you
enable auto-predict, the predictions of all your services are updated every time the inputs change. If it is
unchecked, you have to click "Predict All" to refresh. To enable auto-predict at the service level, go to step 6.
2. Choose the web service by clicking it - "Titanic Survivor Predictor (Excel Add-in Sample) [Score]" in this
example.
3. This takes you to the Predict section. This workbook already contains sample data, but for a blank
workbook you can select a cell in Excel and click Use sample data.
4. Select the data with headers and click the input data range icon. Make sure the "My data has headers" box
is checked.
5. Under Output, enter the cell number where you want the output to be, for example "H1" here.
6. Click Predict. If you select the "auto-predict" checkbox any change on the selected areas (the ones specified
as input) will trigger a request and an update of the output cells without the need for you to press the
predict button.
Deploy a web service or use an existing Web service. For more information on deploying a web service, see
Tutorial 3: Deploy credit risk model.
Get the API key for your web service. Where you perform this action depends on whether you published a Classic
Machine Learning web service or a New Machine Learning web service.
Use a Classic web service
1. In Machine Learning Studio, click the WEB SERVICES section in the left pane, and then select the web
service.
3. On the DASHBOARD tab for the web service, click the REQUEST/RESPONSE link.
4. Look for the Request URI section. Copy and save the URL.
NOTE
It is now possible to sign into the Azure Machine Learning Web Services portal to obtain the API key for a Classic Machine
Learning web service.
Machine Learning Batch Pool processing provides customer-managed scale for the Azure Machine Learning Batch
Execution Service. Classic batch processing for machine learning takes place in a multi-tenant environment, which
limits the number of concurrent jobs you can submit, and jobs are queued on a first-in-first-out basis. This
uncertainty means that you can't accurately predict when your job will run.
Batch Pool processing allows you to create pools on which you can submit batch jobs. You control the size of the
pool, and to which pool the job is submitted. Your BES job runs in its own processing space providing predictable
processing performance and the ability to create resource pools that correspond to the processing load that you
submit.
NOTE
You must have a New Resource Manager based Machine Learning web service to create a pool. Once created, you can run
any BES web service, both New Resource Manager based and classic, on the pool.
"Input":{
"ConnectionString":"DefaultEndpointsProtocol=https;BlobEndpoint=https://sampleaccount.blob.core.windows.net/;TableEndpoint=https://sampleaccount.table.core.windows.net/;QueueEndpoint=https://sampleaccount.queue.core.windows.net/;FileEndpoint=https://zhguiml.file.core.windows.net/;AccountName=sampleaccount;AccountKey=<Key>;",
"BaseLocation":null,
"RelativeLocation":"testint/AdultCensusIncomeBinaryClassificationDataset.csv",
"SasBlobToken":null
},
"GlobalParameters":{ },
"Outputs":{
"output1":{
"ConnectionString":"DefaultEndpointsProtocol=https;BlobEndpoint=https://sampleaccount.blob.core.windows.net/;TableEndpoint=https://sampleaccount.table.core.windows.net/;QueueEndpoint=https://sampleaccount.queue.core.windows.net/;FileEndpoint=https://sampleaccount.file.core.windows.net/;AccountName=sampleaccount;AccountKey=<Key>",
"BaseLocation":null,
"RelativeLocation":"testintoutput/testint_results.csv",
"SasBlobToken":null
},
"AzureBatchPoolId":"8dfc151b0d3e446497b845f3b29ef53b"
Use Batch Pool processing when:
You need to run a large number of jobs, or
You need to know that your jobs will run immediately, or
You need guaranteed throughput. For example, you need to run a number of jobs in a given time frame, and want to
scale out your compute resources to meet your needs.
Use classic batch processing when:
You are running just a few jobs, and
You don’t need the jobs to run immediately.
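A job request like the one shown earlier can be assembled programmatically. The following Python sketch only builds the request body; the helper names `blob_reference` and `build_bes_job_request` are hypothetical, the connection strings are placeholders, and the step of submitting the body to the service is not shown because the endpoint is not given here.

```python
import json


def blob_reference(connection_string, relative_location):
    """Describe one blob the way the sample request does (unused fields stay null)."""
    return {
        "ConnectionString": connection_string,
        "BaseLocation": None,
        "RelativeLocation": relative_location,
        "SasBlobToken": None,
    }


def build_bes_job_request(input_blob, output_blobs, pool_id, global_parameters=None):
    """Assemble a BES job request body that names the pool the job should run on."""
    return {
        "Input": input_blob,
        "GlobalParameters": global_parameters or {},
        "Outputs": output_blobs,
        "AzureBatchPoolId": pool_id,
    }


request = build_bes_job_request(
    blob_reference("<connection string>", "testint/AdultCensusIncomeBinaryClassificationDataset.csv"),
    {"output1": blob_reference("<connection string>", "testintoutput/testint_results.csv")},
    "8dfc151b0d3e446497b845f3b29ef53b",
)
print(json.dumps(request, indent=2))
```

Keeping the pool ID as an explicit argument makes it easy to send different jobs to different pools.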
Analyze Customer Churn using Azure Machine
Learning Studio
4/24/2019 • 12 minutes to read
Overview
This article presents a reference implementation of a customer churn analysis project that is built by using Azure
Machine Learning Studio. In this article, we discuss associated generic models for holistically solving the problem
of industrial customer churn. We also measure the accuracy of models that are built by using Machine Learning,
and assess directions for further development.
Acknowledgements
This experiment was developed and tested by Serge Berger, Principal Data Scientist at Microsoft, and Roger Barga,
formerly Product Manager for Microsoft Azure Machine Learning Studio. The Azure documentation team
gratefully acknowledges their expertise and thanks them for sharing this white paper.
NOTE
The data used for this experiment is not publicly available. For an example of how to build a machine learning model for churn
analysis, see: Retail churn model template in Azure AI Gallery
This forward-looking approach is the best way to treat churn, but it comes with complexity: we have to develop a
multi-model archetype and trace dependencies between the models. The interaction among models can be
encapsulated as shown in the following diagram:
Figure 4: Unified multi-model archetype
Interaction between the models is key if we are to deliver a holistic approach to customer retention. Each model
necessarily degrades over time; therefore, the architecture is an implicit loop (similar to the archetype set by the
CRISP-DM data mining standard [3]).
The overall cycle of risk-decision-marketing segmentation/decomposition is still a generalized structure, which is
applicable to many business problems. Churn analysis is simply a strong representative of this group of problems
because it exhibits all the traits of a complex business problem that does not allow a simplified predictive solution.
The social aspects of the modern approach to churn are not particularly highlighted in the approach, but the social
aspects are encapsulated in the modeling archetype, as they would be in any model.
An interesting addition here is big data analytics. Today's telecommunication and retail businesses collect
exhaustive data about their customers, and we can easily foresee that the need for multi-model connectivity will
become common, given emerging trends such as the Internet of Things and ubiquitous devices, which
allow businesses to employ smart solutions at multiple layers.
Note that this data is private and therefore the model and data cannot be shared. However, for a similar model
using publicly available data, see this sample experiment in the Azure AI Gallery: Telco Customer Churn.
To learn more about how you can implement a churn analysis model using Cortana Intelligence Suite, we also
recommend this video by Senior Program Manager Wee Hyong Tok.
Results
In this section, we present our findings about the accuracy of the models, based on the scoring dataset.
Accuracy and precision of scoring
Generally, the implementation in Azure Machine Learning Studio is behind SAS in accuracy by about 10-15%
(Area Under Curve, or AUC).
However, the most important metric in churn is the misclassification rate: that is, of the top N churners as predicted
by the classifier, which of them actually did not churn, and yet received special treatment? The following diagram
compares this misclassification rate for all the models:
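The top-N misclassification rate described above can be computed as in the following minimal Python sketch. The function name, the scores, and the labels are illustrative assumptions and are not data from the experiment.

```python
def top_n_misclassification_rate(scores, churned, n):
    """Fraction of the n highest-scored customers who did not actually churn."""
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of customers")
    # Rank customers from the highest predicted churn score to the lowest.
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    # Count predicted churners in the top n who stayed (and were treated anyway).
    false_alarms = sum(1 for _, did_churn in ranked[:n] if not did_churn)
    return false_alarms / n


# Hypothetical classifier scores and observed outcomes.
scores = [0.91, 0.85, 0.40, 0.77, 0.10]
churned = [True, False, True, True, False]
rate = top_n_misclassification_rate(scores, churned, 3)
print(rate)
```

A lower rate means fewer retention offers are spent on customers who would have stayed anyway.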
Performance comparison
We compared the speed at which data was scored using the Machine Learning Studio models and a comparable
model created by using the desktop edition of SAS Enterprise Miner 12.1.
The following table summarizes the performance of the algorithms:
Table 1. General performance (accuracy) of the algorithms
LR BT AP SVM
The models hosted in Machine Learning Studio outperformed SAS by 15-25% for speed of execution, but accuracy
was largely on par.
Conclusion
This paper describes a sensible approach to tackling the common problem of customer churn by using a generic
framework. We considered a prototype for scoring models and implemented it by using Azure Machine Learning
Studio. Finally, we assessed the accuracy and performance of the prototype solution with regard to comparable
algorithms in SAS.
References
[1] Predictive Analytics: Beyond the Predictions, W. McKnight, Information Management, July/August 2011, p. 18-20.
[2] Wikipedia article: Accuracy and precision
[3] CRISP-DM 1.0: Step-by-Step Data Mining Guide
[4] Big Data Marketing: Engage Your Customers More Effectively and Drive Value
[5] Telco churn model template in Azure AI Gallery
Appendix
You can use Azure Machine Learning Studio to build and operationalize text analytics models. These models can
help you solve, for example, document classification or sentiment analysis problems.
In a text analytics experiment, you would typically:
1. Clean and preprocess the text dataset
2. Extract numeric feature vectors from the preprocessed text
3. Train a classification or regression model
4. Score and validate the model
5. Deploy the model to production
In this tutorial, you learn these steps as we walk through a sentiment analysis model using the Amazon Book Reviews
dataset (see the research paper “Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for
Sentiment Classification” by John Blitzer, Mark Dredze, and Fernando Pereira; Association of Computational
Linguistics (ACL), 2007). This dataset consists of review scores (1-2 or 4-5) and free-form text. The goal is to
predict the review score: low (1-2) or high (4-5).
You can find experiments covered in this tutorial at Azure AI Gallery:
Predict Book Reviews
Predict Book Reviews - Predictive Experiment
What if you want to use a custom list of stopwords? You can pass it in as an optional input. You can also use a custom
regular expression in C# syntax to replace substrings, and remove words by part of speech: nouns, verbs, or
adjectives.
After the preprocessing is complete, we split the data into train and test sets.
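The preprocessing and splitting steps can be approximated outside Studio as in the following Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the regular expressions use Python syntax rather than C# syntax, and the tokenizer is far simpler than the Studio module.

```python
import random
import re


def preprocess(text, stopwords, replacements=()):
    """Lowercase the text, apply regex replacements, and drop custom stopwords."""
    text = text.lower()
    for pattern, substitute in replacements:
        text = re.sub(pattern, substitute, text)
    tokens = re.findall(r"[a-z']+", text)
    return [token for token in tokens if token not in stopwords]


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]


# Hypothetical review text, stopword list, and replacement rule.
tokens = preprocess("The plot was GREAT, 10/10!", {"the", "was"}, [(r"\d+/\d+", " ")])
train, test = train_test_split(range(10))
print(tokens, len(train), len(test))
```

Fixing the seed keeps the split stable between runs, which makes later model comparisons fair.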
Now we have an experiment that can be published as a web service and called using request-response or batch
execution APIs.
Next Steps
Learn about text analytics modules from MSDN documentation.
Migrate analytics from Excel to Azure Machine
Learning Studio
3/15/2019 • 7 minutes to read
Kate Baroni and Ben Boatman are enterprise solution architects in Microsoft’s Data Insights Center of
Excellence. In this article, they describe their experience migrating an existing regression analysis suite to a
cloud-based solution using Azure Machine Learning Studio.
Goal
Our project started with two goals in mind:
1. Use predictive analytics to improve the accuracy of our organization’s monthly revenue projections
2. Use Azure Machine Learning Studio to confirm and optimize our results, and to increase their velocity and scale.
Like many businesses, our organization goes through a monthly revenue forecasting process. Our small team of
business analysts was tasked with using Azure Machine Learning Studio to support the process and improve
forecast accuracy. The team spent several months collecting data from multiple sources and running the data
attributes through statistical analysis to identify the key attributes relevant to services sales forecasting. The next step
was to begin prototyping statistical regression models on the data in Excel. Within a few weeks, we had an Excel
regression model that was outperforming the current field and finance forecasting processes. This became the
baseline prediction result.
We then took the next step of moving our predictive analytics over to Studio to find out how Studio could improve
on predictive performance.
EXCEL STUDIO
Performance
When we ran our process and results by the developers and data scientists on the Machine Learning team, they
quickly provided some useful tips.
When you use the Linear Regression module in Studio, two methods are provided:
Online Gradient Descent: May be more suitable for larger-scale problems
Ordinary Least Squares: This is the method most people think of when they hear linear regression. For
small datasets, Ordinary Least Squares can be a better choice.
Consider tweaking the L2 Regularization Weight parameter to improve performance. It is set to 0.001 by
default, but for our small dataset we set it to 0.005 to improve performance.
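To make the difference between the two methods concrete, here is a minimal single-feature Python sketch. It is not the Studio implementation: the function names, the learning rate, and the way the L2 weight enters each objective are assumptions, so the two functions are not guaranteed to produce identical regularized results.

```python
def fit_least_squares(xs, ys, l2=0.0):
    """Closed-form fit of y = slope * x + intercept, with an L2 penalty on the slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + l2)  # the penalty shrinks the slope toward zero
    return slope, mean_y - slope * mean_x


def fit_online_gradient_descent(xs, ys, l2=0.0, learning_rate=0.01, epochs=2000):
    """Update the parameters one example at a time instead of solving in one step."""
    slope = intercept = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = slope * x + intercept - y
            slope -= learning_rate * (error * x + l2 * slope)
            intercept -= learning_rate * error
    return slope, intercept


# Hypothetical noiseless data generated from y = 2x + 1.
xs, ys = [0, 1, 2, 3, 4], [1, 3, 5, 7, 9]
ols = fit_least_squares(xs, ys)
ridge = fit_least_squares(xs, ys, l2=0.005)
ogd = fit_online_gradient_descent(xs, ys)
print(ols, ridge, ogd)
```

On a small dataset the closed-form solution is exact and cheap, while the online method needs many passes and a suitable learning rate to get close to it.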
Mystery solved!
When we applied the recommendations, we achieved the same baseline performance in Studio as with Excel:
Learner Excel -> Data Analysis -> Linear Regression. Linear Regression
Regression
Performance
EXCEL STUDIO (INITIAL) STUDIO W/ LEAST SQUARES
In addition, the Excel coefficients compared well to the feature weights in the Azure trained model:
Next Steps
We wanted to consume the Machine Learning web service within Excel. Our business analysts rely on Excel and we
needed a way to call the Machine Learning web service with a row of Excel data and have it return the predicted
value to Excel.
We also wanted to optimize our model, using the options and algorithms available in Studio.
Integration with Excel
Our solution was to operationalize our Machine Learning regression model by creating a web service from the
trained model. Within a few minutes, the web service was created and we could call it directly from Excel to return
a predicted revenue value.
The Web Services Dashboard section includes a downloadable Excel workbook. The workbook comes pre-
formatted with the web service API and schema information embedded. When you click Download Excel
Workbook, the workbook opens and you can save it to your local computer.
With the workbook open, copy your predefined parameters into the blue Parameter section as shown below. Once
the parameters are entered, Excel calls out to the Machine Learning web service and the predicted scored labels
will display in the green Predicted Values section. The workbook will continue to create predictions for parameters
based on your trained model for all row items entered under Parameters. For more information on how to use this
feature, see Consuming an Azure Machine Learning Web Service from Excel.
Optimization and further experiments
Now that we had a baseline with our Excel model, we moved ahead to optimize our Machine Learning Linear
Regression Model. We used the module Filter-Based Feature Selection to improve on our selection of initial data
elements, and it helped us achieve a 4.6% improvement in Mean Absolute Error. For future projects,
we will use this feature, which could save us weeks of iterating through data attributes to find the right set of
features to use for modeling.
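The idea behind filter-based feature selection and the Mean Absolute Error metric can be sketched in a few lines of Python. Pearson correlation is used here as one possible scoring metric; the function names and the sample columns are illustrative assumptions, not the data or the settings used in this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, k):
    """Keep the k columns whose absolute correlation with the target is highest."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)
    return ranked[:k]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical candidate features and target.
target = [10, 20, 30, 40]
columns = {"units": [1, 2, 3, 4], "noise": [5, 1, 4, 2], "discount": [4, 3, 2, 1]}
selected = select_features(columns, target, 2)
print(selected, mean_absolute_error([1, 2, 3], [1, 3, 5]))
```

Because each feature is scored independently of any model, this kind of filter is cheap to run before training.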
Next we plan to include additional algorithms like Bayesian or Boosted Decision Trees in our experiment to
compare performance.
If you want to experiment with regression, a good dataset to try is the Energy Efficiency Regression sample dataset,
which has lots of numerical attributes. The dataset is provided as part of the sample datasets in Studio. You can use
a variety of learning modules to predict either Heating Load or Cooling Load. The chart below is a performance
comparison of different regression learners against the Energy Efficiency dataset predicting for the target variable
Cooling Load:
Key Takeaways
We learned a lot from running Excel regression and Studio experiments in parallel. Creating the baseline model
in Excel and comparing it to models using Machine Learning Linear Regression helped us learn Studio, and we
discovered opportunities to improve data selection and model performance.
We also found that it is advisable to use Filter-Based Feature Selection to accelerate future prediction projects. By
applying feature selection to your data, you can create an improved model in Studio with better overall
performance.
Being able to move predictive forecasting from Studio into Excel in a systematic way makes it much easier to
deliver results to a broad audience of business users.
Resources
Here are some resources for helping you work with regression:
Regression in Excel. If you’ve never tried regression in Excel, this tutorial makes it easy:
https://www.excel-easy.com/examples/regression.html
Regression vs forecasting. Tyler Chessman wrote a blog article explaining how to do time series forecasting in
Excel, which contains a good beginner’s description of linear regression.
http://sqlmag.com/sql-server-analysis-services/understanding-time-series-forecasting-concepts
Ordinary Least Squares Linear Regression: Flaws, Problems and Pitfalls. For an introduction and discussion of
Regression:
https://www.clockbackward.com/2009/06/18/ordinary-least-squares-linear-regression-flaws-problems-and-pitfalls/
Export and delete in-product user data from Azure
Machine Learning Studio
3/15/2019 • 2 minutes to read
You can delete or export in-product data stored by Azure Machine Learning Studio by using the Azure portal, the
Studio interface, PowerShell, and authenticated REST APIs. This article tells you how.
Telemetry data can be accessed through the Azure Privacy portal.
NOTE
If you’re interested in viewing or deleting personal data, please see the Azure Data Subject Requests for the GDPR article. If
you’re looking for general info about GDPR, see the GDPR section of the Service Trust portal.
NOTE
This article provides steps for how to delete personal data from the device or service and can be used to support your
obligations under the GDPR. If you’re looking for general info about GDPR, see the GDPR section of the Service Trust portal.
Next steps
For documentation covering web services and commitment plan billing, see Azure Machine Learning Studio REST
API reference.
View and delete in-product user data from Azure AI
Gallery
3/29/2019 • 3 minutes to read
You can view and delete your in-product user data from Azure AI Gallery using the interface or AI Gallery Catalog
API. This article tells you how.
NOTE
If you’re interested in viewing or deleting personal data, please see the Azure Data Subject Requests for the GDPR article. If
you’re looking for general info about GDPR, see the GDPR section of the Service Trust portal.
NOTE
This article provides steps for how to delete personal data from the device or service and can be used to support your
obligations under the GDPR. If you’re looking for general info about GDPR, see the GDPR section of the Service Trust portal.
https://catalog.cortanaanalytics.com/users/[AuthorID]
https://catalog.cortanaanalytics.com/users/99F1F5C6260295F1078187FA179FBE08B618CB62129976F09C6AF0923B02A5BA
{
  "entities_count": 9,
  "contribution_score": 86.351575190956922,
  "scored_at": "2018-05-07T14:30:25.9305671+00:00",
  "contributed_at": "2018-05-07T14:26:55.0381756+00:00",
  "created_at": "2017-12-15T00:49:15.6733094+00:00",
  "updated_at": "2017-12-15T00:49:15.6733094+00:00",
  "name": "First Last",
  "slugs": ["First-Last"],
  "tenant_id": "14b2744cf8d6418c87ffddc3f3127242",
  "community_id": "9502630827244d60a1214f250e3bbca7",
  "id": "99F1F5C6260295F1078187FA179FBE08B618CB62129976F09C6AF0923B02A5BA",
  "_links": {
    "self": "https://catalog.azureml.net/tenants/14b2744cf8d6418c87ffddc3f3127242/communities/9502630827244d60a1214f250e3bbca7/users/99F1F5C6260295F1078187FA179FBE08B618CB62129976F09C6AF0923B02A5BA"
  },
  "etag": "\"2100d185-0000-0000-0000-5af063010000\""
}
https://catalog.cortanaanalytics.com/entities?$filter=author/id eq '[AuthorId]'
For example:
https://catalog.cortanaanalytics.com/entities?$filter=author/id eq '99F1F5C6260295F1078187FA179FBE08B618CB62129976F09C6AF0923B02A5BA'
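If you build these Catalog API URLs in a script, the spaces and quotes in the filter expression need to be percent-encoded. The following Python sketch shows one way to do that; the helper names are hypothetical, and the sketch only constructs the URLs. It does not send a request or attach an access token.

```python
from urllib.parse import quote

CATALOG_BASE = "https://catalog.cortanaanalytics.com"


def user_url(author_id):
    """URL of the user record for the given author ID."""
    return f"{CATALOG_BASE}/users/{quote(author_id, safe='')}"


def entities_by_author_url(author_id):
    """URL that lists the entities published by the given author ID."""
    filter_expression = f"author/id eq '{author_id}'"
    # Keep "/" readable; encode spaces and single quotes.
    return f"{CATALOG_BASE}/entities?$filter={quote(filter_expression, safe='/')}"


author_id = "99F1F5C6260295F1078187FA179FBE08B618CB62129976F09C6AF0923B02A5BA"
print(user_url(author_id))
print(entities_by_author_url(author_id))
```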
TIP
If unlisted entities are not showing up in responses from the Catalog API, the user may have an invalid or expired access
token. Sign out of the Azure AI Gallery, and then repeat the steps in Get your access token to renew the token.
PowerShell modules for Azure Machine Learning
Studio
5/6/2019 • 2 minutes to read
Using PowerShell modules, you can programmatically manage your Studio resources and assets such as
workspaces, datasets, and web services.
You can interact with Studio resources using three PowerShell modules:
Azure PowerShell Az, released in 2018, which includes all functionality of AzureRM, although with different cmdlet
names
AzureRM, released in 2016 and replaced by PowerShell Az
Azure Machine Learning PowerShell classic, released in 2016
Although these PowerShell modules have some similarities, each is designed for particular scenarios. This article
describes the differences between the PowerShell modules, and helps you decide which ones to choose.
Check the support table below to see which resources are supported by each module.
PowerShell classic
The Studio PowerShell classic module allows you to manage resources deployed using the classic deployment
model. These resources include Studio user assets, "classic" web services, and "classic" web service endpoints.
However, Microsoft recommends that you use the Resource Manager deployment model for all future resources
to simplify the deployment and management of resources. If you would like to learn more about the deployment
models, see the Azure Resource Manager vs. classic deployment article.
To get started with PowerShell classic, download the release package from GitHub and follow the instructions for
installation. The instructions explain how to unblock the downloaded/unzipped DLL and then import it into your
PowerShell environment.
PowerShell classic can be installed alongside either Az or AzureRM to cover both "new" and "classic" resource
types.
Next steps
Consult the full documentation for these PowerShell modules:
PowerShell classic
Azure PowerShell Az
Azure Machine Learning Studio REST API Error
Codes
3/15/2019 • 7 minutes to read
The following error codes could be returned by an operation on an Azure Machine Learning Studio web service.
ERROR CODE | USER MESSAGE
BadParameterValue | The parameter value supplied does not satisfy the parameter rule on the parameter
BadSubscriptionId | The subscription Id that is used to score is not the one present in the resource
BadVersionCall | Invalid version parameter was passed during the API call: {0}. Check the API help page for passing the correct version and try again.
BatchJobInputsNotSpecified | The following required input(s) were not specified with the request: {0}. Please ensure all input data is specified and try again.
BatchJobInputsTooManySpecified | The request specified more inputs than defined in the service. List of accepted input(s): {0}. Please ensure all input data is specified correctly and try again.
BlobNameTooLong | Azure blob storage path provided for diagnostic output is too long: {0}. Shorten the path and try again.
BlobNotFound | Unable to access the provided Azure blob - {0}. Azure error message: {1}.
ContainerSegmentInvalid | Invalid container name. Provide a valid container name and try again.
DuplicateInputInBatchCall | The batch request is invalid. Cannot specify both single and multiple input at the same time. Remove one of these items from the request and try again.
ExpiryTimeInThePast | Expiry time provided is in the past: {0}. Provide a future expiry time in UTC and try again. To never expire, set expiry time to NULL.
InvalidBlob | Invalid blob specification for blob: {0}. Verify that connection string / relative path or SAS token specification is correct and try again.
InvalidBlobExtension | The blob reference: {0} has an invalid or missing file extension. Supported file extensions for this output type are: "{1}".
InvalidOutputOverrideName | Invalid output override name: {0}. The service does not have an output node with this name. Please pass in a correct output node name to override (case sensitivity applies).
MissingRequestInput | The web service expects an input, but no input was provided. Ensure valid inputs are provided based on the published input ports in the model and try again.
MissingRequiredGlobalParameters | Not all required web service parameter(s) provided. Verify the parameter(s) expected for the module(s) are correct and try again.
ModelPackageIdInvalid | Invalid model package Id. Verify that the model package Id is correct and try again.
UserParameterInvalid | {0}
WebServiceTypeInvalid | Invalid web service type provided. Verify the valid web service type is correct and try again. Valid web service types: {0}.
InputParseError | Parsing of input vector failed. Verify the input vector has the correct number of columns and data types. Additional details: {0}.
UserParameterInvalid | {0}
AdminRequestUnauthorized | Unauthorized
ManagementRequestUnauthorized | Unauthorized
WebServiceNotFound | Web service not found. Verify the web service Id is correct and try again.
ModelOutputMetadataMismatch | Invalid output parameter name. Try using the metadata editor module to rename columns and try again.
AdminAuthenticationFailed
BackendArgumentError
BackendBadRequest
ClusterConfigBlobMisconfigured
DeleteWebServiceResourceFailed
ExceptionDeserializationError
FailedGettingApiDocument
FailedStoringWebService
InvalidResourceCacheConfiguration
InvalidResourceDownloadConfiguration
InvalidWebServiceResources
ModelPackageInvalid
ModuleExecutionFailed
ModuleLoadFailed
ModuleObjectCloneFailed
OutputConversionFailed
ResourceDownload
ResourceLoadFailed
ServiceUrisNotFound
UnexpectedScoreStatus
UnknownBackendErrorResponse
UnknownError
UnknownModuleError
UpdateWebServiceResourceFailed
WebServiceGroupNotFound
WorkerAuthorizationFailed
WorkerUnreachable
TooManyHostsBeingInitialized | Too many hosts being initialized at the same time. Consider throttling / retrying.
TooManyHostsBeingInitializedPerModel | Too many hosts being initialized at the same time. Consider throttling / retrying.
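The last two entries suggest throttling or retrying. A client might handle them with exponential backoff, as in the following Python sketch. The shape of `send_request` (a callable that returns an error code and a result) is a hypothetical stand-in for however your client reads the error code from a response; the attempt count and delays are also assumptions.

```python
import time

# Error codes from the table above whose message suggests throttling or retrying.
RETRYABLE_CODES = {
    "TooManyHostsBeingInitialized",
    "TooManyHostsBeingInitializedPerModel",
}


def call_with_retry(send_request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request until it succeeds, backing off on retryable error codes.

    send_request is a hypothetical callable returning (error_code, result),
    where error_code is None when the call succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        error_code, result = send_request()
        if error_code is None:
            return result
        if error_code not in RETRYABLE_CODES or attempt == max_attempts:
            raise RuntimeError(f"web service call failed: {error_code}")
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.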
This article provides information on how to learn more about Azure Machine Learning Studio and get support for
your issues and questions.
Net# is a language developed by Microsoft that is used to define complex neural network architectures such as
deep neural networks or convolutions of arbitrary dimensions. You can use complex structures to improve
learning on data such as image, video, or audio.
You can use a Net# architecture specification in these contexts:
All neural network modules in Microsoft Azure Machine Learning Studio: Multiclass Neural Network, Two-
Class Neural Network, and Neural Network Regression
Neural network functions in Microsoft ML Server: NeuralNet and rxNeuralNet for the R language, and
rx_neural_network for Python.
This article describes the basic concepts and syntax needed to develop a custom neural network using Net#:
Neural network requirements and how to define the primary components
The syntax and keywords of the Net# specification language
Examples of custom neural networks created using Net#
Supported customizations
The architecture of neural network models that you create in Azure Machine Learning Studio can be extensively
customized by using Net#. You can:
Create hidden layers and control the number of nodes in each layer.
Specify how layers are to be connected to each other.
Define special connectivity structures, such as convolutions and weight sharing bundles.
Specify different activation functions.
For details of the specification language syntax, see Structure Specification.
For examples of defining neural networks for some common machine learning tasks, from simple to complex,
see Examples.
General requirements
There must be exactly one output layer, at least one input layer, and zero or more hidden layers.
Each layer has a fixed number of nodes, conceptually arranged in a rectangular array of arbitrary dimensions.
Input layers have no associated trained parameters and represent the point where instance data enters the
network.
Trainable layers (the hidden and output layers) have associated trained parameters, known as weights and
biases.
The source and destination nodes must be in separate layers.
Connections must be acyclic; in other words, there cannot be a chain of connections leading back to the initial
source node.
The output layer cannot be a source layer of a connection bundle.
Structure specifications
A neural network structure specification is composed of three sections: the constant declaration, the layer
declaration, and the connection declaration. There is also an optional share declaration section. The sections can
be specified in any order.
Constant declaration
A constant declaration is optional. It provides a means to define values used elsewhere in the neural network
definition. The declaration statement consists of an identifier followed by an equal sign and a value expression.
For example, the following statement defines a constant X:
Const X = 28;
To define two or more constants simultaneously, enclose the identifier names and values in braces, and separate
them by using semicolons. For example:
Const { X = 28; Y = 4; }
The right-hand side of each assignment expression can be an integer, a real number, a Boolean value (True or
False), or a mathematical expression. For example:
Const { X = 17 * 2; Y = true; }
Layer declaration
The layer declaration is required. It defines the size and source of the layer, including its connection bundles and
attributes. The declaration statement starts with the name of the layer (input, hidden, or output), followed by the
dimensions of the layer (a tuple of positive integers). For example:
The product of the dimensions is the number of nodes in the layer. In this example, there are two dimensions
[5,20], which means there are 100 nodes in the layer.
The layers can be declared in any order, with one exception: If more than one input layer is defined, the order in
which they are declared must match the order of features in the input data.
To specify that the number of nodes in a layer be determined automatically, use the auto keyword. The auto
keyword has different effects, depending on the layer:
In an input layer declaration, the number of nodes is the number of features in the input data.
In a hidden layer declaration, the number of nodes is the number that is specified by the parameter value for
Number of hidden nodes.
In an output layer declaration, the number of nodes is 2 for two-class classification, 1 for regression, and equal
to the number of output nodes for multiclass classification.
For example, the following network definition allows the size of all layers to be automatically determined:
A layer declaration for a trainable layer (the hidden or output layers) can optionally include the output function
(also called an activation function), which defaults to sigmoid for classification models, and linear for regression
models. Even if you use the default, you can explicitly state the activation function, if desired for clarity.
The following output functions are supported:
sigmoid
linear
softmax
rlinear
square
sqrt
srlinear
abs
tanh
brlinear
For example, the following declaration uses the softmax function:
output Result [100] softmax from Hidden all;
Connection declaration
Immediately after defining the trainable layer, you must declare connections among the layers you have defined.
The connection bundle declaration starts with the keyword from, followed by the name of the bundle's source
layer and the kind of connection bundle to create.
Currently, five kinds of connection bundles are supported:
Full bundles, indicated by the keyword all
Filtered bundles, indicated by the keyword where, followed by a predicate expression
Convolutional bundles, indicated by the keyword convolve, followed by the convolution attributes
Pooling bundles, indicated by the keywords max pool or mean pool
Response normalization bundles, indicated by the keyword response norm
Full bundles
A full connection bundle includes a connection from each node in the source layer to each node in the destination
layer. This is the default network connection type.
Filtered bundles
A filtered connection bundle specification includes a predicate, expressed syntactically, much like a C# lambda
expression. The following example defines two filtered bundles:
In the predicate for ByRow, s is a parameter representing an index into the rectangular array of nodes of
the input layer, Pixels, and d is a parameter representing an index into the array of nodes of the hidden
layer, ByRow. The type of both s and d is a tuple of integers of length two. Conceptually, s ranges over
all pairs of integers with 0 <= s[0] < 10 and 0 <= s[1] < 20, and d ranges over all pairs of integers with
0 <= d[0] < 10 and 0 <= d[1] < 12.
On the right-hand side of the predicate expression, there is a condition. In this example, for every value of
s and d such that the condition is True, there is an edge from the source layer node to the destination
layer node. Thus, this filter expression indicates that the bundle includes a connection from the node
defined by s to the node defined by d in all cases where s[0] is equal to d[0].
Optionally, you can specify a set of weights for a filtered bundle. The value for the Weights attribute must be a
tuple of floating point values with a length that matches the number of connections defined by the bundle. By
default, weights are randomly generated.
Weight values are grouped by the destination node index. That is, if the first destination node is connected to K
source nodes, the first K elements of the Weights tuple are the weights for the first destination node, in source
index order. The same applies for the remaining destination nodes.
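To see which connections the ByRow predicate selects, and the order in which a Weights tuple would have to list them, the bundle can be enumerated in plain Python. This is only an illustration of the semantics described above, not Net# itself. The function name is hypothetical, and row-major ordering of node indices is an assumption.

```python
from itertools import product


def filtered_connections(source_shape, dest_shape, predicate):
    """List (source, destination) index pairs accepted by the predicate.

    Pairs are grouped by destination node and, within each group, listed in
    source index order, which mirrors the weight ordering described above.
    """
    connections = []
    for d in product(*(range(size) for size in dest_shape)):
        for s in product(*(range(size) for size in source_shape)):
            if predicate(s, d):
                connections.append((s, d))
    return connections


# Pixels is a 10 x 20 input layer and ByRow is a 10 x 12 hidden layer.
by_row = filtered_connections((10, 20), (10, 12), lambda s, d: s[0] == d[0])
print(len(by_row))
```

Each of the 120 destination nodes is connected to the 20 source nodes in its own row, so a Weights tuple for this bundle would need 2400 values.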
It's possible to specify weights directly as constant values. For example, if you learned the weights previously, you
can specify them as constants using this syntax:
const Weights_1 = [0.0188045055, 0.130500451, ...]
Convolutional bundles
When the training data has a homogeneous structure, convolutional connections are commonly used to learn
high-level features of the data. For example, in image, audio, or video data, spatial or temporal dimensionality can
be fairly uniform.
Convolutional bundles employ rectangular kernels that are slid through the dimensions. Essentially, each kernel
defines a set of weights applied in local neighborhoods, referred to as kernel applications. Each kernel
application corresponds to a node in the source layer, which is referred to as the central node. The weights of a
kernel are shared among many connections. In a convolutional bundle, each kernel is rectangular and all kernel
applications are the same size.
Convolutional bundles support the following attributes:
InputShape defines the dimensionality of the source layer for the purposes of this convolutional bundle. The
value must be a tuple of positive integers. The product of the integers must equal the number of nodes in the
source layer, but otherwise, it does not need to match the dimensionality declared for the source layer. The length
of this tuple becomes the arity value for the convolutional bundle. Typically arity refers to the number of
arguments or operands that a function can take.
To define the shape and locations of the kernels, use the attributes KernelShape, Stride, Padding, LowerPad,
and UpperPad:
KernelShape: (required) Defines the dimensionality of each kernel for the convolutional bundle. The value
must be a tuple of positive integers with a length that equals the arity of the bundle. Each component of
this tuple must be no greater than the corresponding component of InputShape.
Stride: (optional) Defines the sliding step sizes of the convolution (one step size for each dimension), that is,
the distance between the central nodes. The value must be a tuple of positive integers with a length that is
the arity of the bundle. Each component of this tuple must be no greater than the corresponding
component of KernelShape. The default value is a tuple with all components equal to one.
Sharing: (optional) Defines the weight sharing for each dimension of the convolution. The value can be a
single Boolean value or a tuple of Boolean values with a length that is the arity of the bundle. A single
Boolean value is extended to be a tuple of the correct length with all components equal to the specified
value. The default value is a tuple that consists of all True values.
MapCount: (optional) Defines the number of feature maps for the convolutional bundle. The value can be
a single positive integer or a tuple of positive integers with a length that is the arity of the bundle. A single
integer value is extended to be a tuple of the correct length with the first component equal to the specified
value and all the remaining components equal to one. The default value is one. The total number of feature
maps is the product of the components of the tuple. The factoring of this total number across the
components determines how the feature map values are grouped in the destination nodes.
Weights: (optional) Defines the initial weights for the bundle. The value must be a tuple of floating point
values with a length that is the number of kernels times the number of weights per kernel, as defined later
in this article. The default weights are randomly generated.
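The extension and validation rules above can be sketched in Python. This is only an illustration of the rules as stated, not how Net# implements them; the function names (expand_sharing, expand_map_count, validate_geometry) and the 28x28 input with a 5x5 kernel are hypothetical.

```python
from math import prod


def expand_sharing(sharing, arity):
    """Extend a single Boolean to a tuple of length `arity` (default: all True)."""
    if sharing is None:
        return (True,) * arity
    if isinstance(sharing, bool):
        return (sharing,) * arity
    if len(sharing) != arity:
        raise ValueError("Sharing needs one Boolean per dimension")
    return tuple(sharing)


def expand_map_count(map_count, arity):
    """Extend a single integer to (map_count, 1, ..., 1) (default: one)."""
    if map_count is None:
        map_count = 1
    if isinstance(map_count, int):
        return (map_count,) + (1,) * (arity - 1)
    if len(map_count) != arity:
        raise ValueError("MapCount needs one integer per dimension")
    return tuple(map_count)


def validate_geometry(source_nodes, input_shape, kernel_shape, stride=None):
    """Check the InputShape, KernelShape, and Stride constraints; return (arity, stride)."""
    arity = len(input_shape)
    if prod(input_shape) != source_nodes:
        raise ValueError("product of InputShape must equal the source node count")
    if stride is None:
        stride = (1,) * arity  # default: all components equal to one
    if len(kernel_shape) != arity or len(stride) != arity:
        raise ValueError("KernelShape and Stride must have length equal to the arity")
    if any(k > i for k, i in zip(kernel_shape, input_shape)):
        raise ValueError("KernelShape components must not exceed InputShape")
    if any(s > k for s, k in zip(stride, kernel_shape)):
        raise ValueError("Stride components must not exceed KernelShape")
    return arity, tuple(stride)


# Hypothetical bundle: a 784-node source layer viewed as 1 x 28 x 28.
arity, stride = validate_geometry(784, (1, 28, 28), (1, 5, 5))
maps = expand_map_count(5, arity)
print(arity, stride, maps, prod(maps))
```

The total number of feature maps is the product of the expanded MapCount tuple, which is why a single integer is padded with ones rather than repeated.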
Two mutually exclusive sets of properties control padding:
Padding: (optional) Determines whether the input should be padded by using a default padding
scheme. The value can be a single Boolean value, or it can be a tuple of Boolean values with a length that is
the arity of the bundle.
A single Boolean value is extended to be a tuple of the correct length with all components equal to the
specified value.
If the value for a dimension is True, the source is logically padded in that dimension with zero-valued cells
to support additional kernel applications, such that the central nodes of the first and last kernels in that
dimension are the first and last nodes in that dimension in the source layer. Thus, the number of "dummy"
nodes in each dimension is determined automatically, to fit exactly (InputShape[d] - 1) / Stride[d] + 1
kernels into the padded source layer.
If the value for a dimension is False, the kernels are defined so that the number of nodes on each side that
are left out is the same (up to a difference of 1). The default value of this attribute is a tuple with all
components equal to False.
UpperPad and LowerPad: (optional) Provide greater control over the amount of padding to use.
Important: These attributes can be defined if and only if the Padding property above is not defined. The
values should be integer-valued tuples with lengths that are the arity of the bundle. When these attributes
are specified, "dummy" nodes are added to the lower and upper ends of each dimension of the input layer.
The number of nodes added to the lower and upper ends in each dimension is determined by LowerPad[i]
and UpperPad[i] respectively.
To ensure that kernels correspond only to "real" nodes and not to "dummy" nodes, the following conditions
must be met:
Each component of LowerPad must be strictly less than KernelShape[d]/2.
Each component of UpperPad must be no greater than KernelShape[d]/2.
The default value of these attributes is a tuple with all components equal to 0.
The setting Padding = true allows as much padding as is needed to keep the "center" of the kernel
inside the "real" input. This changes the math a bit for computing the output size. Generally, the
output size D is computed as D = (I - K) / S + 1, where I is the input size, K is the kernel size,
S is the stride, and / is integer division (round toward zero). For example, take an input size of 28,
a kernel size of 5, and a stride of 2. If you set UpperPad = [1, 1], the input size I is effectively 29,
and thus D = (29 - 5) / 2 + 1 = 13. However, when Padding = true, essentially I gets bumped up by
K - 1; hence D = ((28 + 4) - 5) / 2 + 1 = 27 / 2 + 1 = 13 + 1 = 14. By specifying values for UpperPad
and LowerPad you get much more control over the padding than if you just set Padding = true.
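The output-size arithmetic above can be checked with a short Python sketch for a single dimension. The function name output_size and its parameters are hypothetical; the formula is the one given in the text, with automatic padding modeled as growing the input by K - 1.

```python
def output_size(input_size, kernel, stride, lower_pad=0, upper_pad=0, padding=False):
    """Number of kernel applications along one dimension: D = (I - K) / S + 1."""
    if padding:
        if lower_pad or upper_pad:
            raise ValueError("Padding and LowerPad/UpperPad are mutually exclusive")
        # Automatic padding effectively increases the input size by K - 1.
        effective = input_size + kernel - 1
    else:
        # Explicit padding adds "dummy" nodes at the lower and upper ends.
        effective = input_size + lower_pad + upper_pad
    # Floor division matches round-toward-zero here because the numerator is non-negative.
    return (effective - kernel) // stride + 1


print(output_size(28, 5, 2))                # 12, no padding
print(output_size(28, 5, 2, upper_pad=1))   # 13, UpperPad = 1
print(output_size(28, 5, 2, padding=True))  # 14, Padding = true
```

With Padding = true the result also agrees with the kernel count (InputShape[d] - 1) / Stride[d] + 1, which gives 27 / 2 + 1 = 14 for the same input.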
For more information about convolutional networks and their applications, see these articles:
http://deeplearning.net/tutorial/lenet.html
https://research.microsoft.com/pubs/68920/icdar03.pdf
Pooling bundles
A pooling bundle uses geometry similar to convolutional connectivity, but it applies predefined functions to
source node values to derive the destination node value. Hence, pooling bundles have no trainable state (weights
or biases). Pooling bundles support all the convolutional attributes except Sharing, MapCount, and Weights.
Typically, the kernels summarized by adjacent pooling units do not overlap. If Stride[d] is equal to KernelShape[d]
in each dimension, the layer obtained is the traditional local pooling layer, which is commonly employed in
convolutional neural networks. Each destination node computes the maximum or the mean of the activities of its
kernel in the source layer.
The following example illustrates a pooling bundle:
The arity of the bundle is 3: that is, the length of the tuples InputShape, KernelShape, and Stride.
The number of nodes in the source layer is 5 * 24 * 24 = 2880.
This is a traditional local pooling layer because KernelShape and Stride are equal.
The number of nodes in the destination layer is 5 * 12 * 12 = 1440.
Response normalization bundles
Response normalization bundles support all the convolutional attributes except Sharing, MapCount, and
Weights.
If the kernel contains neurons in the same map as x, the normalization scheme is referred to as same map
normalization. To define same map normalization, the first coordinate in InputShape must have the
value 1.
If the kernel contains neurons in the same spatial position as x, but the neurons are in other maps, the
normalization scheme is called across maps normalization. This type of response normalization
implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for
big activation levels amongst neuron outputs computed on different maps. To define across maps
normalization, the first coordinate must be an integer greater than one and no greater than the number of
maps, and the rest of the coordinates must have the value 1.
Because response normalization bundles apply a predefined function to source node values to determine the
destination node value, they have no trainable state (weights or biases).
NOTE
The nodes in the destination layer correspond to neurons that are the central nodes of the kernels. For example, if
KernelShape[d] is odd, then KernelShape[d]/2 corresponds to the central kernel node. If KernelShape[d] is even, the
central node is at KernelShape[d]/2 - 1. Therefore, if Padding[d] is False, the first and the last KernelShape[d]/2
nodes do not have corresponding nodes in the destination layer. To avoid this situation, define Padding as
[true, true, …, true].
In addition to the four attributes described earlier, response normalization bundles also support the following
attributes:
Alpha: (required) Specifies a floating-point value that corresponds to α in the previous formula.
Beta: (required) Specifies a floating-point value that corresponds to β in the previous formula.
Offset: (optional) Specifies a floating-point value that corresponds to k in the previous formula. It defaults to
1.
The following example defines a response normalization bundle using these attributes:
hidden RN1 [5, 10, 10]
  from P1 response norm {
    InputShape = [ 5, 12, 12];
    KernelShape = [ 1, 3, 3];
    Alpha = 0.001;
    Beta = 0.75;
  }
The source layer includes five maps, each with a dimension of 12x12, for a total of 720 nodes.
The value of KernelShape indicates that this is a same map normalization layer, where the neighborhood is a
3x3 rectangle.
The default value of Padding is False, thus the destination layer has only 10 nodes in each spatial dimension. To
include one node in the destination layer that corresponds to every node in the source layer, add Padding =
[true, true, true]; and change the size of RN1 to [5, 12, 12].
Share declaration
Net# optionally supports defining multiple bundles with shared weights. The weights of any two bundles can be
shared if their structures are the same. The following syntax defines bundles with shared weights:
share-declaration:
share { layer-list }
share { bundle-list }
share { bias-list }
layer-list:
layer-name , layer-name
layer-list , layer-name
bundle-list:
bundle-spec , bundle-spec
bundle-list , bundle-spec
bundle-spec:
layer-name => layer-name
bias-list:
bias-spec , bias-spec
bias-list , bias-spec
bias-spec:
1 => layer-name
layer-name:
identifier
For example, the following share-declaration specifies the layer names, indicating that both weights and biases
should be shared:
Const {
  InputSize = 37;
  HiddenSize = 50;
}
input {
  Data1 [InputSize];
  Data2 [InputSize];
}
hidden {
  H1 [HiddenSize] from Data1 all;
  H2 [HiddenSize] from Data2 all;
}
output Result [2] {
  from H1 all;
  from H2 all;
}
share { H1, H2 } // share both weights and biases
The input features are partitioned into two equal sized input layers.
The hidden layers then compute higher level features on the two input layers.
The share-declaration specifies that H1 and H2 must be computed in the same way from their respective
inputs.
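The effect of sharing can be sketched outside Net# in plain Python: both hidden layers are computed from their own inputs using one set of weights and one set of biases. This is a conceptual illustration, not the Net# runtime; the dense helper, the sigmoid activation, and the random initialization are assumptions made for the example.

```python
import math
import random

INPUT_SIZE, HIDDEN_SIZE = 37, 50  # same constants as the Net# example


def dense(inputs, weights, biases):
    """Fully connected layer; the sigmoid activation is an assumed choice."""
    return [
        1.0 / (1.0 + math.exp(-(bias + sum(w * x for w, x in zip(row, inputs)))))
        for row, bias in zip(weights, biases)
    ]


rng = random.Random(0)

# One weight matrix and one bias vector, used by both hidden layers.
shared_weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(INPUT_SIZE)] for _ in range(HIDDEN_SIZE)
]
shared_biases = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]

data1 = [rng.random() for _ in range(INPUT_SIZE)]
data2 = [rng.random() for _ in range(INPUT_SIZE)]

h1 = dense(data1, shared_weights, shared_biases)  # H1 from Data1
h2 = dense(data2, shared_weights, shared_biases)  # H2 from Data2

print(len(h1), len(h2))
```

Because the parameters are shared, feeding the same vector to both layers would produce identical activations, and training has HIDDEN_SIZE * INPUT_SIZE weights plus HIDDEN_SIZE biases to learn for the pair rather than twice that number.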
Alternatively, this could be specified with two separate share-declarations as follows:
share { Data1 => H1, Data2 => H2 } // share weights
share { 1 => H1, 1 => H2 } // share biases
You can use the short form only when the layers contain a single bundle. In general, sharing is possible only when
the relevant structure is identical, meaning that they have the same size, same convolutional geometry, and so
forth.
// Define the first two hidden layers, using data only from the Pixels input
hidden ByRow [10, 12] from Pixels where (s,d) => s[0] == d[0];
hidden ByCol [5, 20] from Pixels where (s,d) => abs(s[1] - d[1]) <= 1;
// Define the third hidden layer, which uses as source the hidden layers ByRow and ByCol
hidden Gather [100]
{
  from ByRow all;
  from ByCol all;
}
This example illustrates several features of the neural networks specification language:
The structure has two input layers, Pixels and MetaData.
The Pixels layer is a source layer for two connection bundles, with destination layers ByRow and ByCol.
The layers Gather and Result are destination layers in multiple connection bundles.
The output layer, Result, is a destination layer in two connection bundles: one with the second-level hidden
layer Gather as a source layer, and the other with the input layer MetaData as a source layer.
The hidden layers, ByRow and ByCol, specify filtered connectivity by using predicate expressions. More
precisely, the node in ByRow at [x, y] is connected to the nodes in Pixels that have the first index coordinate
equal to the node's first coordinate, x. Similarly, the node in ByCol at [x, y] is connected to the nodes in Pixels
that have the second index coordinate within one of the node's second coordinate, y.
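The filtered connectivity described above can be explored with a small Python sketch that enumerates which source nodes each destination node connects to. The connections helper is hypothetical, and the Pixels shape of [10, 20] is an assumption, because the input declaration does not appear in the snippet above.

```python
from itertools import product


def connections(source_shape, dest_shape, predicate):
    """Map each destination index d to the source indices s where predicate(s, d) holds."""
    sources = list(product(*(range(n) for n in source_shape)))
    return {
        d: [s for s in sources if predicate(s, d)]
        for d in product(*(range(n) for n in dest_shape))
    }


PIXELS_SHAPE = (10, 20)  # assumed shape of the Pixels input layer

# hidden ByRow [10, 12] from Pixels where (s,d) => s[0] == d[0];
by_row = connections(PIXELS_SHAPE, (10, 12), lambda s, d: s[0] == d[0])

# hidden ByCol [5, 20] from Pixels where (s,d) => abs(s[1] - d[1]) <= 1;
by_col = connections(PIXELS_SHAPE, (5, 20), lambda s, d: abs(s[1] - d[1]) <= 1)

print(len(by_row[(3, 7)]))  # 20: every pixel in row 3
print(len(by_col[(2, 0)]))  # 20: columns 0 and 1, all 10 rows
print(len(by_col[(2, 5)]))  # 30: columns 4, 5, and 6, all 10 rows
```

Under the assumed shape, every ByRow node sees one full row of Pixels, while a ByCol node sees a band of two or three adjacent columns depending on whether it sits at the edge.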
Define a convolutional network for multiclass classification: digit recognition example
The definition of the following network is designed to recognize numbers, and it illustrates some advanced
techniques for customizing a neural network.
The total number of nodes can be calculated by using the declared dimensionality of the layer, [50, 5, 5], as
follows: MapCount * NodeCount[0] * NodeCount[1] * NodeCount[2] = 10 * 5 * 5 * 5
Because Sharing[d] is False only for d == 0, the number of kernels is
MapCount * NodeCount[0] = 10 * 5 = 50.
Acknowledgements
The Net# language for customizing the architecture of neural networks was developed at Microsoft by Shon
Katzenberger (Architect, Machine Learning) and Alexey Kamenev (Software Engineer, Microsoft Research). It is
used internally for machine learning projects and applications ranging from image detection to text analytics. For
more information, see Neural Nets in Azure Machine Learning studio - Introduction to Net#