Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

Stephen F. Elston
Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

by Stephen F. Elston

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Nicholas Adams
Proofreader: Nicholas Adams
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2015: First Edition
Revision History for the First Edition
2015-09-01: First Release
2015-11-21: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93634-4
[LSI]
Chapter 1. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update
This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools.

In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset.

You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.
Before we get started, let’s review a few of the benefits Azure ML provides for machine learning solutions:
Solutions can be quickly and easily deployed as web services.

Models run in a highly scalable and secure cloud environment.

Azure ML is integrated with the powerful Microsoft Cortana Analytics Suite, which includes massive storage and processing capabilities. It can read data from and write data to Cortana storage at significant volume. Azure ML can even be employed as the analytics engine for other components of the Cortana Analytics Suite.

Machine learning algorithms and data transformations are extendable using the R language, for solution-specific functionality.

Analytics written in the R and Python languages can be rapidly operationalized.

Code and data are maintained in a secure cloud environment.
For our example, we will be using the Bike Rental UCI dataset available in Azure ML. This data is also preloaded in the Azure ML Studio environment, or you can download this data as a .csv file from the UCI website. The reference for this data is Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

The R code for our example can be found at GitHub.
Working Between Azure ML and RStudio
Azure ML is a production environment. It is ideally suited to publishing machine learning models. In contrast, Azure ML is not a particularly good development environment.

In general, you will find it easier to perform preliminary editing, testing, and debugging in RStudio. In this way, you take advantage of the powerful development resources and perform your final testing in Azure ML.

Downloads for R and RStudio are available for Windows, Mac, and Linux. This report assumes the reader is familiar with the basics of R. If you are not familiar with using R in Azure ML, check out the Quick Start Guide to R in Azure ML.

The R source code for the data science example in this report can be run in either Azure ML or RStudio. Read the comments in the source files to see the changes required to work between these two environments.
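For instance, the source files in this report use a simple flag to switch between the two environments. A minimal sketch of that pattern follows; the .csv file name is illustrative, and maml.mapInputPort() is the Azure ML input-port reader discussed later in this report:

Azure <- FALSE  # set to TRUE when running in an Azure ML module
if(Azure){
  ## In Azure ML, read the data from the module's input port.
  BikeShare <- maml.mapInputPort(1)
}else{
  ## In RStudio, read a local .csv file for testing.
  BikeShare <- read.csv("BikeSharing.csv", stringsAsFactors = FALSE)
}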
Overview of Azure ML
This section provides a short overview of Azure Machine Learning. You can find more details and specifics, including tutorials, at the Microsoft Azure web page. Additional learning resources can be found on the Azure Machine Learning documentation site. Deeper and broader introductions can be found in the following video.

As we work through our data science example in subsequent sections, we include specific examples of the concepts presented here. We encourage you to go to this page to create your own free-tier account, and to try these examples on your own using that account.
Azure ML Studio
Azure ML models are built and tested in the web-based Azure ML Studio. Figure 1-1 below shows an example of the Azure ML Studio.

Figure 1-1 Azure ML Studio

A workflow of the model appears in the center of the Studio window. A dataset and an Execute R Script module are on the canvas. On the left side of the Studio display, you see datasets and a series of tabs containing various types of modules. Properties of whichever dataset or module has been clicked on can be seen in the right panel. In this case, you can see the R code contained in the Execute R Script module.
Build your own experiment
Building your own experiment in Azure ML is quite simple. Click the + symbol in the lower lefthand corner of the Studio window. You will see a display resembling Figure 1-2 below. Select either a blank experiment or one of the sample experiments.

Figure 1-2 Creating a New Azure ML Experiment

If you choose a blank experiment, start dragging and dropping modules and datasets onto your canvas. Connect the module outputs to inputs to build an experiment.
Getting Data In and Out of Azure ML
Let’s discuss how we get data into and out of Azure ML. Azure ML supports several data I/O options, including:

Web services

HTTP connections

Azure SQL tables

Azure Blob storage

Azure Tables: NoSQL key-value tables

Data at volume is read from and written to these storage components using the Reader and Writer modules. Figure 1-3 shows an example of configuring the Reader module to read data from a hypothetical Azure SQL table. Similar capabilities are available in the Writer module for outputting data at volume.
Figure 1-3 Configuring the Reader Module for an Azure SQL Query
Modules and Datasets
Mixing native modules and R in Azure ML
Azure ML provides a wide range of modules for data transformation, machine learning, and model evaluation. Most native Azure ML modules are computationally efficient and scalable. As a general rule, these native modules should be your first choice.

The deep and powerful R language extends Azure ML to meet the requirements of specific data science problems. For example, solution-specific data transformation and cleaning can be coded in R. R language scripts contained in Execute R Script modules can be run in-line with native Azure ML modules. Additionally, the R language gives Azure ML powerful data visualization capabilities. With the Create R Model module, you can train and score models from numerous R packages within an experiment with relatively little work.
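As a hedged sketch of how this works, the Create R Model module takes a pair of short scripts, assuming the dataset, model, and scores variables the module predefines; the formula here is illustrative:

## Trainer R script (sketch): Azure ML supplies the training
## data as `dataset` and expects the fitted model in `model`.
model <- lm(cnt ~ temp + hum + windspeed, data = dataset)

## Scorer R script (sketch): Azure ML supplies `model` and the
## data to score as `dataset`, and expects a data frame `scores`.
scores <- data.frame(scoredLabels = predict(model, newdata = dataset))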
As we work through the examples, you will see how to mix native Azure ML modules and Execute R Script modules to create a complete solution.
Execute R Script Module I/O
In the Azure ML Studio, input ports are located above module icons, and output ports are located below module icons.
TIP

If you move your mouse over the ports of a module, you will see a “tool tip” showing the type of data for that port.
The Execute R Script module has five ports:

The Dataset1 and Dataset2 ports are inputs for rectangular Azure data tables.

The Script Bundle port accepts a zipped R script file (.R file) or R dataset.

The Result Dataset output port produces an Azure rectangular data table from a data frame.

The R Device port produces output of text or graphics from R.
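To make the port plumbing concrete, here is a minimal, hedged skeleton of the body of an Execute R Script module; the transformation line is illustrative:

## Read the data frame arriving at the Dataset1 input port.
BikeShare <- maml.mapInputPort(1)

## Transformations go here; this line is illustrative.
BikeShare$isWorking <- ifelse(BikeShare$workingday & !BikeShare$holiday, 1, 0)

## Printed text and plotted graphics appear at the R Device port.
print(summary(BikeShare$isWorking))

## Return the data frame at the Result Dataset output port.
maml.mapOutputPort("BikeShare")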
Within experiments, workflows are created by connecting the appropriate ports between modules—output port to input port. Connections are made by dragging your mouse from the output port of one module to the input port of another module.
Azure ML Workflows
Model training workflow
Figure 1-4 shows a generalized workflow for training, scoring, and evaluating a machine learning model in Azure ML. This general workflow is the same for most regression and classification algorithms. The model definition can be a native Azure ML module or R code in a Create R Model module.
Figure 1-4 A generalized model training workflow for Azure ML models.
Key points on the model training workflow:

Data input can come from a variety of interfaces, including web services, HTTP connections, Azure SQL, and Hive Query. These data sources can be within the Cortana suite or external to it. In most cases, for training and testing models, you use a saved dataset.

Transformations of the data can be performed using a combination of native Azure ML modules and the R language.

A Model Definition module defines the model type and properties. On the lefthand pane of the Studio you will see numerous choices for models. The parameters of the model are set in the properties pane. R model training and scoring scripts can be provided in a Create R Model module.

The Training module trains the model. The trained model is scored in the Score module, and performance summary statistics are computed in the Evaluate module.
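As a rough R analogue of this train/score/evaluate pattern (a sketch only, not Azure ML module code; trainData and testData stand for a hypothetical split of the bike data, and the formula and metric are illustrative):

## "Train": fit a regression model to the training data.
fit <- lm(cnt ~ temp + hum + windspeed, data = trainData)

## "Score": compute predictions on the test data.
scores <- predict(fit, newdata = testData)

## "Evaluate": summarize performance, here with the root
## mean squared error of the predictions.
rmse <- sqrt(mean((testData$cnt - scores)^2))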
The following sections include specific examples of each of the steps illustrated in Figure 1-4.
Publishing a model as a web service
Once you have developed and evaluated a satisfactory model, you can publish it as a web service. You will need to create a streamlined workflow for promotion to production. A generalized example is shown in Figure 1-5.
Figure 1-5 Workflow for an Azure ML model published as a web service
Here are some key points of the workflow for publishing a web service:

Typically, you will use transformations you created and saved when you were training the model. These include saved transformations from the various Azure ML data transformation modules and modified R transformation code.

The product of the training processes (discussed above) is the trained model.

You can apply transformations to results produced by the model. Examples of transformations include deleting unneeded columns and converting units of numerical results.
A Regression Example
Problem and Data Overview
Demand and inventory forecasting are fundamental business processes. Forecasting is used for supply chain management, staff level management, production management, and many other applications.

In this example, we will construct and test models to forecast hourly demand for a bicycle rental system. The ability to forecast demand is important for the effective operation of this system. If insufficient bikes are available, regular users will be inconvenienced. The users become reluctant to use the system, lacking confidence that bikes will be available when needed. If too many bikes are available, operating costs increase unnecessarily.

In data science problems, it is always important to gain an understanding of the objectives of the end-users. In this case, having a reasonable number of extra bikes on hand is far less of an issue than having an insufficient inventory. Keep this fact in mind as we are evaluating models.
For this example, we’ll use a dataset containing a time series of demand information for the bicycle rental system. These data contain hourly demand figures over a two-year period, for both registered and casual users. There are nine features, also known as predictor, or independent, variables. The dataset contains a total of 17,379 rows, or cases.
The first, and possibly most important, task in creating effective predictive analytics models is determining the feature set. Feature selection is usually more important than the specific choice of machine learning model. Feature candidates include variables in the dataset, transformed or filtered values of these variables, or new variables computed from the variables in the dataset. The process of creating the feature set is sometimes known as feature selection or feature engineering.
In addition to feature engineering, data cleaning and editing are critical in most situations. Filters can be applied to both the predictor and response variables.
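For instance, a one-line sketch of such a filter (the condition is illustrative):

## Keep only the cases with positive demand, dropping
## hypothetical invalid rows.
BikeShare <- BikeShare[BikeShare$cnt > 0, ]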
The dataset is available in the Azure ML sample datasets. You can also download it as a .csv file either from Azure ML, or from the University of California Machine Learning Repository.
A first set of transformations
For our first step, we’ll perform some transformations on the raw input data using the code shown below in an Azure ML Execute R Script module:
## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to FALSE the code will run
## in R or RStudio.
Azure <- FALSE

if(Azure){
  ## If we are in Azure, source the utilities from the zip
  ## file and read the dataset from the first input port.
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  ## Convert the date-time character string to POSIXct with
  ## a helper function defined in utilities.R.
  BikeShare$dteday <- to.POSIXct(BikeShare)
}else{
  ## Read the data from a .csv file for testing purposes;
  ## the file name here is illustrative.
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- char.toPOSIXct(BikeShare)

  ## Select the columns we need
  cols <- c("dteday", "mnth", "hr", "holiday",
            "workingday", "weathersit", "temp",
            "hum", "windspeed", "cnt")
  BikeShare <- BikeShare[, cols]

  ## Normalize the numeric predictors
  cols <- c("temp", "hum", "windspeed")
  BikeShare[, cols] <- scale(BikeShare[, cols])
}

## Create a new variable to indicate workday
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                              !BikeShare$holiday, 1, 0)

## Add a column of the count of months, which could
## help model trend
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week,
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                        levels = c("Monday", "Tuesday", "Wednesday",
                                   "Thursday", "Friday",
                                   "Saturday", "Sunday")))
When Azure is set to TRUE, the maml.mapInputPort() function reads the data frame from the input port of the Execute R Script module. The argument 1 indicates the first input port. R functions from a zip file are brought into the R environment by the source() function. The R file is read from the src directory. The date-time character string is converted to a POSIXct time series object by the to.POSIXct function.

If, on the other hand, Azure is set to FALSE, the other code path is executed. This code path allows us to test the code in RStudio. The data are read from a .csv file. The argument stringsAsFactors = FALSE ensures that string columns are retained as such, as they will be in Azure ML. Column selection and normalization of certain numeric columns are executed; in Azure ML, these transformations are accomplished with native modules. The date-time column is converted to a time series object with the char.toPOSIXct function.
This code creates several new columns, or features. As we explore the data we will determine if any of these features improve our models:

A column indicating whether it’s a workday or not.

A column, added by the month.count function, indicating the number of months from the beginning of the time series.

A column indicating the day of the week as an ordered factor.
The utilities.R file contains the functions used for the transformations. The listing of the month.count function is shown below:

month.count <- function(inFrame){
  ## Compute the count of months from the start of
  ## the time series. yearCount is the number of whole
  ## years elapsed since the first year in the data.
  yearCount <- as.numeric(format(inFrame$dteday, "%Y")) -
    min(as.numeric(format(inFrame$dteday, "%Y")))
  inFrame$monthCount <- 12 * yearCount + inFrame$mnth
  inFrame
}

These functions are in a file called utilities.R. This file is packaged into a zip file and uploaded into Azure ML Studio. The R code in the zip file is then available in any Execute R Script module in the experiment.
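For example, once the zip file is connected to the Script Bundle port, the functions can be brought into scope with a single call (the src path follows the convention described earlier):

## Source the utilities packaged in the zip file attached
## to the Script Bundle port.
source("src/utilities.R")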
Exploring the data
Let’s have a first look at the data by walking through a series of exploratory plots.

An additional Execute R Script module with the visualization code is added to the experiment. At this point, our Azure ML experiment looks like Figure 1-6. The first Execute R Script module, titled “Transform Data,” contains the code shown in the previous code listing.

Figure 1-6 The Azure ML experiment in Studio

The Execute R Script module, shown at the bottom of this experiment, runs code for exploring the data, using output from the Execute R Script module that transforms the data.
Our first step is to read the transformed data and create a correlation matrix using the following code:
## This code will create a series of data visualizations
## to explore the bike rental dataset. This code is
## intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.
Azure <- FALSE
if(Azure) BikeShare <- maml.mapInputPort(1)  # read the transformed data

## Look at the correlation between the predictors and
## between predictors and demand. Use a linear
## time series regression to detrend the demand.
Time <- BikeShare$dteday
BikeShare$count <- BikeShare$cnt - fitted(
  lm(BikeShare$cnt ~ Time, data = BikeShare))
In this code, we use lm() to compute a linear model used for detrending the response variable column in the data frame. Detrending removes a source of bias in the correlation estimates. We are particularly interested in the correlation of the predictor variables with this detrended response.
NOTE

The levelplot() function from the lattice package is wrapped by a call to plot(). This is required since, in some cases, Azure ML suppresses automatic printing, and hence plotting. Suppressing printing is desirable in a production environment, as automatically produced output will not clutter the result. As a result, you may need to wrap expressions you intend to produce as printed or plotted output with the print() or plot() functions.
This code requires one function, which is defined in the utilities.R file.
Using the cor() function, we’ll compute the correlation matrix. This correlation matrix is displayed using the levelplot() function in the lattice package.
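A minimal sketch of this step follows; the column selection is illustrative:

library(lattice)

## Compute the correlation matrix of the candidate predictors
## and the detrended response.
cols <- c("mnth", "hr", "holiday", "isWorking", "monthCount",
          "weathersit", "temp", "hum", "windspeed",
          "dayWeek", "count")
corMat <- cor(BikeShare[, cols], method = "pearson")

## Zero the self-correlations, as in Figure 1-7, and display
## the matrix with levelplot(), wrapped in plot() so the
## graphic is emitted in Azure ML.
diag(corMat) <- 0.0
plot(levelplot(corMat, xlab = NULL, ylab = NULL,
               main = "Correlation matrix"))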
A plot of the correlation matrix, showing the relationship between the predictors, and between the predictors and the response variable, can be seen in Figure 1-7. If you run this code in an Azure ML Execute R Script module, you can see the plots at the R Device port.
Figure 1-7 Plot of correlation matrix
This plot is dominated by the strong correlation between dayWeek and isWorking—which is hardly surprising. It’s clear that we don’t need to include both of these variables in any model, as they are proxies for each other.

To get a better look at the correlations between other variables, see the second plot, in Figure 1-8, with the dayWeek variable removed.
Figure 1-8 Plot of correlation matrix without dayWeek variable
In this plot we can see that a few of the features exhibit fairly strong correlation with the response. The hour (hr), temp, and month (mnth) are positively correlated, whereas humidity (hum) and the overall weather (weathersit) are negatively correlated. The variable windspeed is nearly uncorrelated. For this plot, the correlation of a variable with itself has been set to 0.0. Note that the scale is asymmetric.

We can also see that several of the predictor variables are highly correlated—for example, hum and weathersit, or hr and hum. These correlated variables could cause problems for some types of predictive models.
WARNING

You should always keep in mind the pitfalls in the interpretation of correlation. First, and most importantly, correlation should never be confused with causation. A highly correlated variable may or may not imply causation. Second, a highly correlated or nearly uncorrelated variable may, or may not, be a good predictor. The variable may be nearly collinear with some other predictor, or the relationship with the response may be nonlinear.

Next, we’ll make time series plots of bike demand for several selected hours of the day, using code like the following:

library(ggplot2)

## Make time series plots of bike demand for certain
## hours of the day; the set of hours is illustrative.
times <- c(7, 9, 12, 15, 18, 20, 22)
lapply(times, function(times){
  print(ggplot(BikeShare[BikeShare$hr == times, ],
               aes(x = dteday, y = log(cnt))) +
    geom_line() +
    ylab("Log number of bikes") +
    labs(title = paste("Bike demand at ",
           as.character(times), ":00", sep = "")) +
    theme(text = element_text(size = 20)))
  }
)
This code uses the ggplot2 package to create the time series plots. An anonymous R function, wrapped in lapply(), generates the plots at the selected hours.

Two examples of the time series plots, for two specific hours of the day, are shown in Figures 1-9 and 1-10.
Figure 1-9 Time series plot of bike demand for the 0700 hour
Figure 1-10 Time series plot of bike demand for the 1800 hour
Notice the differences in the shape of these curves at the two different hours. Also, note the outliers at the low side of demand.

Next, we’ll create a number of box plots for some of the factor variables, using code like the following:
## Convert dayWeek back to an ordered factor so the plot is in
## time order.
BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek)

## This code gives a first look at the predictor values vs.
## the demand for bikes.
labels <- list("Box plots of hourly bike demand",
               "Box plots of monthly bike demand",
               "Box plots of bike demand by weather factor",
               "Box plots of bike demand by workday vs holiday",
               "Box plots of bike demand by day of the week")
xAxis <- list("hr", "mnth", "weathersit",
              "isWorking", "dayWeek")
Map(function(X, label){
  ## Wrap the ggplot object in print() so the plot is
  ## produced in Azure ML as well as in RStudio.
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt", group = X)) +
          geom_boxplot() +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)
If you are not familiar with using Map(), this code may look a bit intimidating. When faced with functional code like this, always read from the inside out. On the inside, you can see the ggplot2 package functions. This code is contained in an anonymous function with two arguments. Map() iterates over the two argument lists to produce the series of plots.
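For instance, a toy use of Map() with a two-argument anonymous function:

## Map() applies the anonymous function elementwise over the
## two argument lists, returning a list of results.
Map(function(x, y) paste(x, "=", y),
    c("a", "b", "c"), 1:3)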
The utility function that creates the day of week factor with meaningful names is shown in the listing below:
fact.conv <- function(inVec){
  ## Function gives the day variable meaningful
  ## level names.
  outVec <- as.factor(inVec)
  levels(outVec) <- c("Monday", "Tuesday", "Wednesday",
                      "Thursday", "Friday", "Saturday",
                      "Sunday")
  outVec
}
Three of the resulting box plots are shown in Figures 1-11, 1-12, and 1-13.
Figure 1-11 Box plots showing the relationship between bike demand and hour of the day
Figure 1-12 Box plots showing the relationship between bike demand and weather situation
Figure 1-13 Box plots showing the relationship between bike demand and day of the week
From these plots, you can see a significant difference in the likely predictive power of these three variables. Significant and complex variation in hourly bike demand can be seen in Figure 1-11. In contrast, it looks doubtful that weathersit is going to be very helpful in predicting bike demand, despite the relatively high (negative) correlation value observed. The result shown in Figure 1-13 is surprising—we expected bike demand to depend on the day of the week.

Once again, the outliers at the low end of bike demand can be seen in the box plots.
TIP

In our example, we make heavy use of the ggplot2 package. To learn more about ggplot2, we recommend R Graphics Cookbook: Practical Recipes for Visualizing Data by Winston Chang (O’Reilly). There is also an excellent ggplot2 cheat sheet.
Finally, we’ll create some plots to explore the continuous variables, using code like the following:
## Look at the relationship between predictors and bike demand.
labels <- c("Bike demand vs temperature",
            "Bike demand vs humidity",
            "Bike demand vs windspeed",
            "Bike demand vs hour of the day")
xAxis <- c("temp", "hum", "windspeed", "hr")
Map(function(X, label){
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt")) +
          geom_point(aes(colour = cnt), alpha = 0.2) +
          geom_smooth(method = "loess") +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)
This code is quite similar to the code used for the box plots. We have included a loess smoothed line on each of these plots. Also, note that we have added a color scale and increased the point transparency, so we get a feel for the number of overlapping data points.
TIP

When plotting a large number of points, overplotting is a significant problem. Overplotting makes it difficult to tell the actual point density, as points lie on top of each other. Methods like color scales, point transparency, and hexbinning can all be applied to situations with significant overplotting.
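As a brief sketch of one such mitigation (geom_hex() requires the hexbin package to be installed):

library(ggplot2)
library(hexbin)  # needed by geom_hex()

## Bin points into hexagonal cells; the fill color encodes
## the count of points falling in each cell.
print(ggplot(BikeShare, aes(x = temp, y = cnt)) +
        geom_hex(bins = 30) +
        ggtitle("Bike demand vs temperature, hexbinned"))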
WARNING

The loess method in R is quite memory intensive. Depending on how much memory you have on your local machine, you may or may not be able to run this code. Fortunately, Azure ML runs on servers with 60 GB of RAM, which is more than up to the job.