Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

Stephen F. Elston
Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

by Stephen F. Elston

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Nicholas Adams
Proofreader: Nicholas Adams
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2015: First Edition
Revision History for the First Edition
2015-09-01: First Release
2015-11-21: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93634-4
[LSI]
Chapter 1. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update
This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools.

In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset.

You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.
Before we get started, let’s review a few of the benefits Azure ML provides for machine learning solutions:
Solutions can be quickly and easily deployed as web services.

Models run in a highly scalable and secure cloud environment.

Azure ML is integrated with the powerful Microsoft Cortana Analytics Suite, which includes massive storage and processing capabilities. It can read data from and write data to Cortana storage at significant volume. Azure ML can even be employed as the analytics engine for other components of the Cortana Analytics Suite.

Machine learning algorithms and data transformations are extendable using the R language, for solution-specific functionality.

Analytics written in the R and Python languages can be rapidly operationalized.

Code and data are maintained in a secure cloud environment.
For our example, we will be using the Bike Rental UCI dataset available in Azure ML. This data is also preloaded in the Azure ML Studio environment, or you can download this data as a .csv file from the UCI website. The reference for this data is Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

The R code for our example can be found at GitHub.
Working Between Azure ML and RStudio
Azure ML is a production environment. It is ideally suited to publishing machine learning models. In contrast, Azure ML is not a particularly good development environment.

In general, you will find it easier to perform preliminary editing, testing, and debugging in RStudio. In this way, you take advantage of the powerful development resources and perform your final testing in Azure ML.

Downloads for R and RStudio are available for Windows, Mac, and Linux. This report assumes the reader is familiar with the basics of R. If you are not familiar with using R in Azure ML, check out the Quick Start Guide to R in Azure ML.

The R source code for the data science example in this report can be run in either Azure ML or RStudio. Read the comments in the source files to see the changes required to work between these two environments.
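For instance, the source files in this report use a simple flag to switch between the two environments. A minimal sketch of that pattern follows; the .csv file name is illustrative, and maml.mapInputPort() is the Azure ML input-port reader discussed later in this report:

Azure <- FALSE  # set to TRUE when running in an Azure ML module
if(Azure){
  ## In Azure ML, read the data from the module's input port.
  BikeShare <- maml.mapInputPort(1)
}else{
  ## In RStudio, read a local .csv file for testing.
  BikeShare <- read.csv("BikeSharing.csv", stringsAsFactors = FALSE)
}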
Overview of Azure ML
This section provides a short overview of Azure Machine Learning. You can find more details and specifics, including tutorials, at the Microsoft Azure web page. Additional learning resources can be found on the Azure Machine Learning documentation site. Deeper and broader introductions can be found in the following video.

As we work through our data science example in subsequent sections, we include specific examples of the concepts presented here. We encourage you to go to this page to create your own free-tier account, and to try these examples on your own using that account.
Azure ML Studio
Azure ML models are built and tested in the web-based Azure ML Studio. Figure 1-1 below shows an example of the Azure ML Studio.

Figure 1-1 Azure ML Studio

A workflow of the model appears in the center of the Studio window. A dataset and an Execute R Script module are on the canvas. On the left side of the Studio display, you see datasets and a series of tabs containing various types of modules. Properties of whichever dataset or module has been clicked on can be seen in the right panel. In this case, you can see the R code contained in the Execute R Script module.
Build your own experiment
Building your own experiment in Azure ML is quite simple. Click the + symbol in the lower lefthand corner of the Studio window. You will see a display resembling Figure 1-2 below. Select either a blank experiment or one of the sample experiments.

Figure 1-2 Creating a New Azure ML Experiment

If you choose a blank experiment, start dragging and dropping modules and datasets onto your canvas. Connect the module outputs to inputs to build an experiment.
Getting Data In and Out of Azure ML
Let’s discuss how we get data into and out of Azure ML. Azure ML supports several data I/O options, including:

Web services

HTTP connections

Azure SQL tables

Azure Blob storage

Azure Tables: NoSQL key-value tables

Data at volume is read from and written to these storage components using the Reader and Writer modules. Figure 1-3 shows an example of configuring the Reader module to read data from a hypothetical Azure SQL table. Similar capabilities are available in the Writer module for outputting data at volume.
Figure 1-3 Configuring the Reader Module for an Azure SQL Query
Modules and Datasets
Mixing native modules and R in Azure ML
Azure ML provides a wide range of modules for data transformation, machine learning, and model evaluation. Most native Azure ML modules are computationally efficient and scalable. As a general rule, these native modules should be your first choice.

The deep and powerful R language extends Azure ML to meet the requirements of specific data science problems. For example, solution-specific data transformation and cleaning can be coded in R. R language scripts contained in Execute R Script modules can be run in-line with native Azure ML modules. Additionally, the R language gives Azure ML powerful data visualization capabilities. With the Create R Model module, you can train and score models from numerous R packages within an experiment with relatively little work.
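As a hedged sketch of how this works, the Create R Model module takes a pair of short scripts, assuming the dataset, model, and scores variables the module predefines; the formula here is illustrative:

## Trainer R script (sketch): Azure ML supplies the training
## data as `dataset` and expects the fitted model in `model`.
model <- lm(cnt ~ temp + hum + windspeed, data = dataset)

## Scorer R script (sketch): Azure ML supplies `model` and the
## data to score as `dataset`, and expects a data frame `scores`.
scores <- data.frame(scoredLabels = predict(model, newdata = dataset))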
As we work through the examples, you will see how to mix native Azure ML modules and Execute R Script modules to create a complete solution.
Execute R Script Module I/O
In the Azure ML Studio, input ports are located above module icons, and output ports are located below module icons.
TIP

If you move your mouse over the ports of a module, you will see a “tool tip” showing the type of data for that port.
The Execute R Script module has five ports:

The Dataset1 and Dataset2 ports are inputs for rectangular Azure data tables.

The Script Bundle port accepts a zipped R script file (.R file) or R dataset.

The Result Dataset output port produces an Azure rectangular data table from a data frame.

The R Device port produces output of text or graphics from R.
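To make the port plumbing concrete, here is a minimal, hedged skeleton of the body of an Execute R Script module; the transformation line is illustrative:

## Read the data frame arriving at the Dataset1 input port.
BikeShare <- maml.mapInputPort(1)

## Transformations go here; this line is illustrative.
BikeShare$isWorking <- ifelse(BikeShare$workingday & !BikeShare$holiday, 1, 0)

## Printed text and plotted graphics appear at the R Device port.
print(summary(BikeShare$isWorking))

## Return the data frame at the Result Dataset output port.
maml.mapOutputPort("BikeShare")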
Within experiments, workflows are created by connecting the appropriate ports between modules—output port to input port. Connections are made by dragging your mouse from the output port of one module to the input port of another module.
Azure ML Workflows
Model training workflow
Figure 1-4 shows a generalized workflow for training, scoring, and evaluating a machine learning model in Azure ML. This general workflow is the same for most regression and classification algorithms. The model definition can be a native Azure ML module or R code in a Create R Model module.
Figure 1-4 A generalized model training workflow for Azure ML models.
Key points on the model training workflow:

Data input can come from a variety of interfaces, including web services, HTTP connections, Azure SQL, and Hive Query. These data sources can be within the Cortana suite or external to it. In most cases, for training and testing models, you use a saved dataset.

Transformations of the data can be performed using a combination of native Azure ML modules and the R language.

A Model Definition module defines the model type and properties. On the lefthand pane of the Studio you will see numerous choices for models. The parameters of the model are set in the properties pane. R model training and scoring scripts can be provided in a Create R Model module.

The Training module trains the model. The trained model is scored in the Score module, and performance summary statistics are computed in the Evaluate module.
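As a rough R analogue of this train/score/evaluate pattern (a sketch only, not Azure ML module code; trainData and testData stand for a hypothetical split of the bike data, and the formula and metric are illustrative):

## "Train": fit a regression model to the training data.
fit <- lm(cnt ~ temp + hum + windspeed, data = trainData)

## "Score": compute predictions on the test data.
scores <- predict(fit, newdata = testData)

## "Evaluate": summarize performance, here with the root
## mean squared error of the predictions.
rmse <- sqrt(mean((testData$cnt - scores)^2))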
The following sections include specific examples of each of the steps illustrated in Figure 1-4.
Publishing a model as a web service
Once you have developed and evaluated a satisfactory model, you can publish it as a web service. You will need to create a streamlined workflow for promotion to production. A generalized example is shown in Figure 1-5.
Figure 1-5 Workflow for an Azure ML model published as a web service
Here are some key points of the workflow for publishing a web service:

Typically, you will use transformations you created and saved when you were training the model. These include saved transformations from the various Azure ML data transformation modules and modified R transformation code.

The product of the training processes (discussed above) is the trained model.

You can apply transformations to results produced by the model. Examples of transformations include deleting unneeded columns and converting units of numerical results.
A Regression Example
Problem and Data Overview
Demand and inventory forecasting are fundamental business processes. Forecasting is used for supply chain management, staff level management, production management, and many other applications.

In this example, we will construct and test models to forecast hourly demand for a bicycle rental system. The ability to forecast demand is important for the effective operation of this system. If insufficient bikes are available, regular users will be inconvenienced. The users become reluctant to use the system, lacking confidence that bikes will be available when needed. If too many bikes are available, operating costs increase unnecessarily.

In data science problems, it is always important to gain an understanding of the objectives of the end-users. In this case, having a reasonable number of extra bikes on hand is far less of an issue than having an insufficient inventory. Keep this fact in mind as we are evaluating models.
For this example, we’ll use a dataset containing a time series of demand information for the bicycle rental system. These data contain hourly demand figures over a two-year period, for both registered and casual users. There are nine features, also known as predictor, or independent, variables. The dataset contains a total of 17,379 rows, or cases.
The first, and possibly most important, task in creating effective predictive analytics models is determining the feature set. Feature selection is usually more important than the specific choice of machine learning model. Feature candidates include variables in the dataset, transformed or filtered values of these variables, or new variables computed from the variables in the dataset. The process of creating the feature set is sometimes known as feature selection or feature engineering.
In addition to feature engineering, data cleaning and editing are critical in most situations. Filters can be applied to both the predictor and response variables.
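For instance, a one-line sketch of such a filter (the condition is illustrative):

## Keep only the cases with positive demand, dropping
## hypothetical invalid rows.
BikeShare <- BikeShare[BikeShare$cnt > 0, ]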
The dataset is available in the Azure ML sample datasets. You can also download it as a .csv file either from Azure ML, or from the University of California Machine Learning Repository.
A first set of transformations
For our first step, we’ll perform some transformations on the raw input data using the code shown below in an Azure ML Execute R Script module:
## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to FALSE the code will run
## in R or RStudio.
Azure <- FALSE

if(Azure){
  ## If we are in Azure, source the utilities from the zip
  ## file and read the dataset from the first input port.
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  ## Convert the date-time character string to POSIXct with
  ## a helper function defined in utilities.R.
  BikeShare$dteday <- to.POSIXct(BikeShare)
}else{
  ## Read the data from a .csv file for testing purposes;
  ## the file name here is illustrative.
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- char.toPOSIXct(BikeShare)

  ## Select the columns we need
  cols <- c("dteday", "mnth", "hr", "holiday",
            "workingday", "weathersit", "temp",
            "hum", "windspeed", "cnt")
  BikeShare <- BikeShare[, cols]

  ## Normalize the numeric predictors
  cols <- c("temp", "hum", "windspeed")
  BikeShare[, cols] <- scale(BikeShare[, cols])
}

## Create a new variable to indicate workday
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                              !BikeShare$holiday, 1, 0)

## Add a column of the count of months, which could
## help model trend
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week,
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                        levels = c("Monday", "Tuesday", "Wednesday",
                                   "Thursday", "Friday",
                                   "Saturday", "Sunday")))
When Azure is set to TRUE, the maml.mapInputPort() function reads the data frame from the input port of the Execute R Script module. The argument 1 indicates the first input port. R functions from a zip file are brought into the R environment by the source() function. The R file is read from the src directory. The date-time character string is converted to a POSIXct time series object by the to.POSIXct function.

If, on the other hand, Azure is set to FALSE, the other code path is executed. This code path allows us to test the code in RStudio. The data are read from a .csv file. The argument stringsAsFactors = FALSE ensures that string columns are retained as such, as they will be in Azure ML. Column selection and normalization of certain numeric columns are executed; in Azure ML, these transformations are accomplished with native modules. The date-time column is converted to a time series object with the char.toPOSIXct function.
This code creates several new columns, or features. As we explore the data we will determine if any of these features improve our models:

A column indicating whether it’s a workday or not.

A column, added by the month.count function, indicating the number of months from the beginning of the time series.

A column indicating the day of the week as an ordered factor.
The utilities.R file contains the functions used for the transformations. The listing of the month.count function is shown below:

month.count <- function(inFrame){
  ## Compute the count of months from the start of
  ## the time series. yearCount is the number of whole
  ## years elapsed since the first year in the data.
  yearCount <- as.numeric(format(inFrame$dteday, "%Y")) -
    min(as.numeric(format(inFrame$dteday, "%Y")))
  inFrame$monthCount <- 12 * yearCount + inFrame$mnth
  inFrame
}

These functions are in a file called utilities.R. This file is packaged into a zip file and uploaded into Azure ML Studio. The R code in the zip file is then available in any Execute R Script module in the experiment.
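For example, once the zip file is connected to the Script Bundle port, the functions can be brought into scope with a single call (the src path follows the convention described earlier):

## Source the utilities packaged in the zip file attached
## to the Script Bundle port.
source("src/utilities.R")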
Exploring the data
Let’s have a first look at the data by walking through a series of exploratory plots.

An additional Execute R Script module with the visualization code is added to the experiment. At this point, our Azure ML experiment looks like Figure 1-6. The first Execute R Script module, titled “Transform Data,” contains the code shown in the previous code listing.

Figure 1-6 The Azure ML experiment in Studio

The Execute R Script module, shown at the bottom of this experiment, runs code for exploring the data, using output from the Execute R Script module that transforms the data.
Our first step is to read the transformed data and create a correlation matrix using the following code:
## This code will create a series of data visualizations
## to explore the bike rental dataset. This code is
## intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.
Azure <- FALSE
if(Azure) BikeShare <- maml.mapInputPort(1)  # read the transformed data

## Look at the correlation between the predictors and
## between predictors and demand. Use a linear
## time series regression to detrend the demand.
Time <- BikeShare$dteday
BikeShare$count <- BikeShare$cnt - fitted(
  lm(BikeShare$cnt ~ Time, data = BikeShare))
In this code, we use lm() to compute a linear model used for detrending the response variable column in the data frame. Detrending removes a source of bias in the correlation estimates. We are particularly interested in the correlation of the predictor variables with this detrended response.
NOTE

The levelplot() function from the lattice package is wrapped by a call to plot(). This is required since, in some cases, Azure ML suppresses automatic printing, and hence plotting. Suppressing printing is desirable in a production environment, as automatically produced output will not clutter the result. As a result, you may need to wrap expressions you intend to produce as printed or plotted output with the print() or plot() functions.
This code requires one function, which is defined in the utilities.R file.
Using the cor() function, we’ll compute the correlation matrix. This correlation matrix is displayed using the levelplot() function in the lattice package.
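A minimal sketch of this step follows; the column selection is illustrative:

library(lattice)

## Compute the correlation matrix of the candidate predictors
## and the detrended response.
cols <- c("mnth", "hr", "holiday", "isWorking", "monthCount",
          "weathersit", "temp", "hum", "windspeed",
          "dayWeek", "count")
corMat <- cor(BikeShare[, cols], method = "pearson")

## Zero the self-correlations, as in Figure 1-7, and display
## the matrix with levelplot(), wrapped in plot() so the
## graphic is emitted in Azure ML.
diag(corMat) <- 0.0
plot(levelplot(corMat, xlab = NULL, ylab = NULL,
               main = "Correlation matrix"))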
A plot of the correlation matrix, showing the relationship between the predictors, and between the predictors and the response variable, can be seen in Figure 1-7. If you run this code in an Azure ML Execute R Script module, you can see the plots at the R Device port.
Figure 1-7 Plot of correlation matrix
This plot is dominated by the strong correlation between dayWeek and isWorking—which is hardly surprising. It’s clear that we don’t need to include both of these variables in any model, as they are proxies for each other.

To get a better look at the correlations between other variables, see the second plot, in Figure 1-8, with the dayWeek variable removed.
Figure 1-8 Plot of correlation matrix without dayWeek variable
In this plot we can see that a few of the features exhibit fairly strong correlation with the response. The hour (hr), temp, and month (mnth) are positively correlated, whereas humidity (hum) and the overall weather (weathersit) are negatively correlated. The variable windspeed is nearly uncorrelated. For this plot, the correlation of a variable with itself has been set to 0.0. Note that the scale is asymmetric.

We can also see that several of the predictor variables are highly correlated—for example, hum and weathersit, or hr and hum. These correlated variables could cause problems for some types of predictive models.
WARNING

You should always keep in mind the pitfalls in the interpretation of correlation. First, and most importantly, correlation should never be confused with causation. A highly correlated variable may or may not imply causation. Second, a highly correlated or nearly uncorrelated variable may, or may not, be a good predictor. The variable may be nearly collinear with some other predictor, or the relationship with the response may be nonlinear.

Next, we’ll make time series plots of bike demand for several selected hours of the day, using code like the following:

library(ggplot2)

## Make time series plots of bike demand for certain
## hours of the day; the set of hours is illustrative.
times <- c(7, 9, 12, 15, 18, 20, 22)
lapply(times, function(times){
  print(ggplot(BikeShare[BikeShare$hr == times, ],
               aes(x = dteday, y = log(cnt))) +
    geom_line() +
    ylab("Log number of bikes") +
    labs(title = paste("Bike demand at ",
           as.character(times), ":00", sep = "")) +
    theme(text = element_text(size = 20)))
  }
)
This code uses the ggplot2 package to create the time series plots. An anonymous R function, wrapped in lapply(), generates the plots at the selected hours.

Two examples of the time series plots, for two specific hours of the day, are shown in Figures 1-9 and 1-10.
Figure 1-9 Time series plot of bike demand for the 0700 hour
Figure 1-10 Time series plot of bike demand for the 1800 hour
Notice the differences in the shape of these curves at the two different hours. Also, note the outliers at the low side of demand.

Next, we’ll create a number of box plots for some of the factor variables, using code like the following:
## Convert dayWeek back to an ordered factor so the plot is in
## time order.
BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek)

## This code gives a first look at the predictor values vs.
## the demand for bikes.
labels <- list("Box plots of hourly bike demand",
               "Box plots of monthly bike demand",
               "Box plots of bike demand by weather factor",
               "Box plots of bike demand by workday vs holiday",
               "Box plots of bike demand by day of the week")
xAxis <- list("hr", "mnth", "weathersit",
              "isWorking", "dayWeek")
Map(function(X, label){
  ## Wrap the ggplot object in print() so the plot is
  ## produced in Azure ML as well as in RStudio.
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt", group = X)) +
          geom_boxplot() +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)
If you are not familiar with using Map(), this code may look a bit intimidating. When faced with functional code like this, always read from the inside out. On the inside, you can see the ggplot2 package functions. This code is contained in an anonymous function with two arguments. Map() iterates over the two argument lists to produce the series of plots.
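For instance, a toy use of Map() with a two-argument anonymous function:

## Map() applies the anonymous function elementwise over the
## two argument lists, returning a list of results.
Map(function(x, y) paste(x, "=", y),
    c("a", "b", "c"), 1:3)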
The utility function that creates the day of week factor with meaningful names is shown in the listing below:
fact.conv <- function(inVec){
  ## Function gives the day variable meaningful
  ## level names.
  outVec <- as.factor(inVec)
  levels(outVec) <- c("Monday", "Tuesday", "Wednesday",
                      "Thursday", "Friday", "Saturday",
                      "Sunday")
  outVec
}
Three of the resulting box plots are shown in Figures 1-11, 1-12, and 1-13.
Figure 1-11 Box plots showing the relationship between bike demand and hour of the day
Figure 1-12 Box plots showing the relationship between bike demand and weather situation
Figure 1-13 Box plots showing the relationship between bike demand and day of the week
From these plots, you can see a significant difference in the likely predictive power of these three variables. Significant and complex variation in hourly bike demand can be seen in Figure 1-11. In contrast, it looks doubtful that weathersit is going to be very helpful in predicting bike demand, despite the relatively high (negative) correlation value observed. The result shown in Figure 1-13 is surprising—we expected bike demand to depend on the day of the week.

Once again, the outliers at the low end of bike demand can be seen in the box plots.
TIP

In our example, we make heavy use of the ggplot2 package. To learn more about ggplot2, we recommend R Graphics Cookbook: Practical Recipes for Visualizing Data by Winston Chang (O’Reilly). There is also an excellent ggplot2 cheat sheet.
Finally, we’ll create some plots to explore the continuous variables, using code like the following:
## Look at the relationship between predictors and bike demand.
labels <- c("Bike demand vs temperature",
            "Bike demand vs humidity",
            "Bike demand vs windspeed",
            "Bike demand vs hour of the day")
xAxis <- c("temp", "hum", "windspeed", "hr")
Map(function(X, label){
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt")) +
          geom_point(aes(colour = cnt), alpha = 0.2) +
          geom_smooth(method = "loess") +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)
This code is quite similar to the code used for the box plots. We have included a loess smoothed line on each of these plots. Also, note that we have added a color scale and increased the point transparency, so we get a feel for the number of overlapping data points.
TIP

When plotting a large number of points, overplotting is a significant problem. Overplotting makes it difficult to tell the actual point density, as points lie on top of each other. Methods like color scales, point transparency, and hexbinning can all be applied to situations with significant overplotting.
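As a brief sketch of one such mitigation (geom_hex() requires the hexbin package to be installed):

library(ggplot2)
library(hexbin)  # needed by geom_hex()

## Bin points into hexagonal cells; the fill color encodes
## the count of points falling in each cell.
print(ggplot(BikeShare, aes(x = temp, y = cnt)) +
        geom_hex(bins = 30) +
        ggtitle("Bike demand vs temperature, hexbinned"))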
WARNING

The loess method in R is quite memory intensive. Depending on how much memory you have on your local machine, you may or may not be able to run this code. Fortunately, Azure ML runs on servers with 60 GB of RAM, which is more than up to the job.