Data Science in the Cloud with Microsoft Azure Machine Learning and R

Stephen F. Elston

[LSI]

Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

by Stephen F. Elston

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt

Production Editor: Nicholas Adams

Proofreader: Nicholas Adams

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

September 2015: First Edition

Revision History for the First Edition

2015-09-01: First Release

2015-11-21: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Table of Contents

1. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

Introduction

Overview of Azure ML

A Regression Example

Improving the Model and Transformations

Improving Model Parameter Selection in Azure ML

Using an R Model in Azure ML

Cross Validation

Some Possible Next Steps

Publishing a Model as a Web Service

Summary

CHAPTER 1

Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

Introduction

This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools.

In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset.

You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.

Before we get started, let’s review a few of the benefits Azure ML provides for machine learning solutions:

• Solutions can be quickly and easily deployed as web services.

• Models run in a highly scalable and secure cloud environment.

• Azure ML is integrated with the powerful Microsoft Cortana Analytics Suite, which includes massive storage and processing capabilities. It can read data from and write data to Cortana storage at significant volume. Azure ML can even be employed as the analytics engine for other components of the Cortana Analytics Suite.

• Machine learning algorithms and data transformations are extendable using the R language, for solution-specific functionality.

• Rapidly operationalized analytics are written in the R and Python languages.

• Code and data are maintained in a secure cloud environment.

Downloads

For our example, we will be using the Bike Rental UCI dataset available in Azure ML. This data is also preloaded in the Azure ML Studio environment, or you can download this data as a csv file from the UCI website. The reference for this data is Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

The R code for our example can be found at GitHub.

Working Between Azure ML and RStudio

Azure ML is a production environment. It is ideally suited to publishing machine learning models. In contrast, Azure ML is not a particularly good development environment.

In general, you will find it easier to perform preliminary editing, testing, and debugging in RStudio. In this way, you take advantage of the powerful development resources and perform your final testing in Azure ML. Downloads for R and RStudio are available for Windows, Mac, and Linux.

This report assumes the reader is familiar with the basics of R. If you are not familiar with using R in Azure ML, check out the Quick Start Guide to R in AzureML.

The R source code for the data science example in this report can be run in either Azure ML or RStudio. Read the comments in the source files to see the changes required to work between these two environments.

Overview of Azure ML

This section provides a short overview of Azure Machine Learning. You can find more details and specifics, including tutorials, at the Microsoft Azure web page. Additional learning resources can be found on the Azure Machine Learning documentation site.

Deeper and broader introductions can be found in the following video classes:

• Data Science with Microsoft Azure and R, Working with Cloud-based Predictive Analytics and Modeling by Stephen Elston from O’Reilly Media, provides an in-depth exploration of doing data science with Azure ML and R.

• Data Science and Machine Learning Essentials, an edX course by Stephen Elston and Cynthia Rudin, provides a broad introduction to data science using Azure ML, R, and Python.

As we work through our data science example in subsequent sections, we include specific examples of the concepts presented here. We encourage you to go to this page, create your own free-tier account, and try these examples on your own using this account.

Azure ML Studio

Azure ML models are built and tested in the web-based Azure ML Studio. Figure 1-1 shows an example of the Azure ML Studio.

Figure 1-1 Azure ML Studio

A workflow of the model appears in the center of the Studio window. A dataset and an Execute R Script module are on the canvas. On the left side of the Studio display, you see datasets and a series of tabs containing various types of modules. Properties of whichever dataset or module has been clicked on can be seen in the right panel. In this case, you can see the R code contained in the Execute R Script module.

Build your own experiment

Building your own experiment in Azure ML is quite simple. Click the + symbol in the lower lefthand corner of the Studio window. You will see a display resembling Figure 1-2. Select either a blank experiment or one of the sample experiments.

Figure 1-2 Creating a New Azure ML Experiment

If you choose a blank experiment, start dragging and dropping modules and data sets onto your canvas. Connect the module outputs to inputs to build an experiment.

Getting Data In and Out of Azure ML

Let’s discuss how we get data into and out of Azure ML

Azure ML supports several data I/O options, including:

• Web services

• HTTP connections

• Azure SQL tables

• Azure Blob storage

• Azure Tables; NoSQL key-value tables

Azure SQL table. Similar capabilities are available in the Writer module for outputting data at volume.

Figure 1-3 Configuring the Reader Module for an Azure SQL Query

Modules and Datasets

Mixing native modules and R in Azure ML

Azure ML provides a wide range of modules for data transformation, machine learning, and model evaluation. Most native Azure ML modules are computationally efficient and scalable. As a general rule, these native modules should be your first choice.

The deep and powerful R language extends Azure ML to meet the requirements of specific data science problems. For example, solution-specific data transformation and cleaning can be coded in R. R language scripts contained in Execute R Script modules can be run in-line with native Azure ML modules. Additionally, the R language gives Azure ML powerful data visualization capabilities. With the Create R Model module, you can train and score models from numerous R packages within an experiment with relatively little work.

As we work through the examples, you will see how to mix native Azure ML modules and Execute R Script modules to create a complete solution.

Execute R Script Module I/O

In the Azure ML Studio, input ports are located above module icons, and output ports are located below module icons.

If you move your mouse over the ports of a module, you will see a “tool tip” showing the type of data for that port.

The Execute R Script module has five ports:

• The Dataset1 and Dataset2 ports are inputs for rectangular Azure data tables.

• The Script Bundle port accepts a zipped R script file (.R file) or

Connections are made by dragging your mouse from the output port of one module to the input port of another module.

Azure ML Workflows

Model training workflow

Figure 1-4 shows a generalized workflow for training, scoring, and evaluating a machine learning model in Azure ML. This general workflow is the same for most regression and classification algorithms. The model definition can be a native Azure ML module or R code in a Create R Model module.

Figure 1-4 A generalized model training workflow for Azure ML models.

Key points on the model training workflow:

• Data input can come from a variety of interfaces, including web services, HTTP connections, Azure SQL, and Hive Query. These data sources can be within the Cortana suite or external to it. In most cases, for training and testing models, you use a saved dataset.

• Transformations of the data can be performed using a combination of native Azure ML modules and the R language.

• A Model Definition module defines the model type and properties. On the lefthand pane of the Studio you will see numerous choices for models. The parameters of the model are set in the properties pane. R model training and scoring scripts can be provided in a Create R Model module.

• The Training module trains the model. Training of the model is scored in the Score module and performance summary statistics are computed in the Evaluate module.

The following sections include specific examples of each of the steps illustrated in Figure 1-4.
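To make the train, score, and evaluate steps concrete, here is a minimal local sketch in plain R. It is only an analogue of the workflow: it does not call any Azure ML module, the data are synthetic, and the names toy, train, test, and the choice of a linear model are assumptions made for illustration.

```r
# Local, illustrative analogue of the train / score / evaluate steps.
# The data are synthetic; nothing here calls Azure ML.
set.seed(42)
n <- 200
toy <- data.frame(temp = runif(n), hum = runif(n))
toy$cnt <- 100 + 80 * toy$temp - 40 * toy$hum + rnorm(n, sd = 5)

# Split the rows into a training set and a held-out test set
train_rows <- sample(n, size = 0.7 * n)
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# "Model definition" and "training": here, an ordinary linear regression
model <- lm(cnt ~ temp + hum, data = train)

# "Scoring": predictions for the held-out rows
scored <- predict(model, newdata = test)

# "Evaluation": summary statistics of the prediction errors
rmse <- sqrt(mean((test$cnt - scored)^2))
mae  <- mean(abs(test$cnt - scored))
```

In Azure ML the same roles are played by separate modules connected on the canvas; the sketch only shows how the pieces relate.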

Publishing a model as a web service

Once you have developed and evaluated a satisfactory model, you can publish it as a web service. You will need to create a streamlined workflow for promotion to production. A generalized example is shown in Figure 1-5.

Figure 1-5 Workflow for an Azure ML model published as a web service

Here are some key points of the workflow for publishing a web service:

• Typically, you will use transformations you created and saved when you were training the model. These include saved transformations from the various Azure ML data transformation modules and modified R transformation code.

• The product of the training processes (discussed above) is the trained model.

• You can apply transformations to results produced by the model. Examples of transformations include deleting unneeded columns and converting units of numerical results.

A Regression Example

Problem and Data Overview

Demand and inventory forecasting are fundamental business processes. Forecasting is used for supply chain management, staff level management, production management, and many other applications.

In this example, we will construct and test models to forecast hourly demand for a bicycle rental system. The ability to forecast demand is important for the effective operation of this system. If insufficient bikes are available, regular users will be inconvenienced. The users become reluctant to use the system, lacking confidence that bikes will be available when needed. If too many bikes are available, operating costs increase unnecessarily.

In data science problems, it is always important to gain an understanding of the objectives of the end-users. In this case, having a reasonable number of extra bikes on-hand is far less of an issue than having an insufficient inventory. Keep this fact in mind as we are evaluating models.
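One way to keep this asymmetry in view when comparing models is an error measure that penalizes under-forecasts more than over-forecasts. The sketch below is only an illustration: the function name asym_error and the weight of 3 are hypothetical choices, not something the report prescribes.

```r
# Hypothetical asymmetric error: under-forecasts (too few bikes) are
# weighted more heavily than over-forecasts. The weight of 3 is arbitrary.
asym_error <- function(actual, predicted, under_weight = 3) {
  err <- actual - predicted              # positive when demand was under-forecast
  w <- ifelse(err > 0, under_weight, 1)  # heavier weight for shortfalls
  mean(w * abs(err))
}

# One forecast is 10 bikes short, the other 10 bikes over
asym_error(actual = c(100, 100), predicted = c(90, 110))
```

With equal weights both errors would count the same; here the shortfall contributes three times as much to the average.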

For this example, we’ll use a dataset containing a time series of demand information for the bicycle rental system. These data contain hourly demand figures over a two-year period, for both registered and casual users. There are nine features, also known as predictor, or independent, variables. The data set contains a total of 17,379 rows or cases.

The first, and possibly most important, task in creating effective predictive analytics models is determining the feature set. Feature selection is usually more important than the specific choice of machine learning model. Feature candidates include variables in the dataset, transformed or filtered values of these variables, or new variables computed from the variables in the dataset. The process of creating the feature set is sometimes known as feature selection or feature engineering.

In addition to feature engineering, data cleaning and editing are critical in most situations. Filters can be applied to both the predictor and response variables.

The data set is available in the Azure ML sample data sets. You can also download it as a csv file either from Azure ML, or from the University of California Machine Learning Repository.

A first set of transformations

For our first step, we’ll perform some transformations on the raw input data using the code shown below in an Azure ML Execute R Script module:

## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## If we are in Azure, source the utilities from the zip
## file. The next lines of code read in the dataset, either
## in Azure ML or from a csv file for testing purposes.

  ## Select the columns we need
  cols <- c("dteday", "mnth", "hr", "holiday",
            "workingday", "weathersit", "temp",

  ## Normalize the numeric predictors
  cols <- c("temp", "hum", "windspeed")
  BikeShare[, cols] <- scale(BikeShare[, cols])
}

## Create a new variable to indicate workday
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                              !BikeShare$holiday, 1, 0)

## Add a column of the count of months which could
## help model trend
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                       levels = c("Monday", "Tuesday", "Wednesday",
                                  "Thursday", "Friday", "Saturday",
                                  "Sunday")))

## Output the transformed data frame if in Azure ML.
if(Azure) maml.mapOutputPort('BikeShare')

Notice the conditional statement at the beginning of this code listing. When the logical variable Azure is set to TRUE, the maml.mapInputPort(1) function reads the data frame from the input port of the Execute R Script module. The argument 1 indicates the first input port. R functions from a zip file are brought into the R environment by the source() function. The R file is read from a directory /src. The date-time character string is converted to a POSIXct time series object by the to.POSIXct function.

If, on the other hand, Azure is set to FALSE, the other code path is executed. This code path allows us to test the code in RStudio. The data are read from a csv file. The argument stringsAsFactors = FALSE ensures that string columns are retained as such, as they will be in Azure ML. Column selection and normalization of certain numeric columns are executed. These transformations are accomplished with the Azure module in that environment. The date-time column is converted to a time series object with the char.toPOSIXct function.
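The two code paths described above can be sketched as follows. This is a minimal illustration, not the report's listing: the helper name load_bike_data, the csv_path argument, and the tiny stand-in file are all hypothetical, and the Azure branch can only run inside an Execute R Script module, where maml.mapInputPort() is defined.

```r
Azure <- FALSE  # set to TRUE only inside an Execute R Script module

# Hypothetical helper; the function name and "csv_path" are illustrative.
load_bike_data <- function(azure, csv_path = NULL) {
  if (azure) {
    # maml.mapInputPort() is available only inside Azure ML
    maml.mapInputPort(1)
  } else {
    # Keep strings as character columns, as they will be in Azure ML
    read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
  }
}

# Local test with a tiny stand-in file (made-up values)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(dteday = c("2011-01-01", "2011-01-02"),
                     temp = c(0.24, 0.22),
                     hum = c(0.81, 0.80),
                     cnt = c(16, 40)),
          tmp, row.names = FALSE)
BikeShare <- load_bike_data(Azure, tmp)

# Normalize the numeric predictors, as the local code path does
cols <- c("temp", "hum")
BikeShare[, cols] <- scale(BikeShare[, cols])
```

Keeping the environment switch in a single logical variable means only one line changes when the script moves between RStudio and Azure ML.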

This code creates five new columns, or features. As we explore the data we will determine if any of these features improve our models.

• A column indicating whether it’s a workday or not.

• The month.count function adds a column indicating the number of months from the beginning of the time series.

• A column indicating the day of the week as an ordered factor.
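The three features listed above can be sketched on a toy data frame. The report's own month.count function is not shown here, so the month count below is a stand-in that assumes counting starts at 1 in the first month of the series. The day of the week is computed from POSIXlt fields rather than weekdays(), an assumption made so that the result does not depend on the locale's day names.

```r
# Toy data: four timestamps with workingday/holiday flags (made-up values)
toy <- data.frame(
  dteday = as.POSIXct(c("2011-01-01", "2011-01-03",
                        "2011-02-14", "2012-01-02"), tz = "UTC"),
  workingday = c(0, 1, 1, 1),
  holiday    = c(0, 0, 0, 1))

# Workday indicator: a working day that is not a holiday
toy$isWorking <- ifelse(toy$workingday & !toy$holiday, 1, 0)

lt <- as.POSIXlt(toy$dteday, tz = "UTC")

# Stand-in month count: months elapsed since the first row, starting at 1
toy$monthCount <- 12 * (lt$year - lt$year[1]) + (lt$mon - lt$mon[1]) + 1

# Day of week as a number, 1 = Monday ... 7 = Sunday
# (POSIXlt wday runs 0 = Sunday ... 6 = Saturday)
toy$dayWeek <- ifelse(lt$wday == 0, 7, lt$wday)
```

Storing the day of the week as a number mirrors the listing, which converts the ordered factor to numeric for compatibility with Azure ML table types.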

In most cases, Azure ML will treat R POSIXct formatted character columns as having a date-time type. R may not interpret the Azure ML date-time type as POSIXct. To be consistent, a type conversion is required. If you encounter errors with date-time fields when working with R in Azure ML, check that the type conversions are working as expected.
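A quick way to check the type conversion is to test the class of the column and convert only when needed. The helper below is a hypothetical sketch, not one of the report's utility functions, and it assumes the timestamps are in the default "YYYY-MM-DD HH:MM:SS" form and in UTC.

```r
# Hypothetical helper: return the input as POSIXct, converting if necessary.
ensure_posixct <- function(x, tz = "UTC") {
  if (inherits(x, "POSIXct")) return(x)   # already the right type
  as.POSIXct(as.character(x), tz = tz)    # convert character (or factor) input
}

d1 <- ensure_posixct(c("2011-01-01 00:00:00", "2011-01-01 01:00:00"))
d2 <- ensure_posixct(d1)  # a second call leaves the values unchanged

class(d1)
```

Calling class() on the column before and after the conversion is usually enough to confirm what type R actually received.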

The utilities.R file contains the functions used for the transformations. The listing of these functions is shown below:

set.asPOSIXct <- function(inFrame) {
  dteday <- as.POSIXct(
Exploring the data

Let’s have a first look at the data by walking through a series of exploratory plots.

An additional Execute R Script module with the visualization code is added to the experiment. At this point, our Azure ML experiment looks like Figure 1-6. The first Execute R Script module, titled “Transform Data,” contains the code shown in the previous code listing.

Figure 1-6 The Azure ML experiment in Studio

The Execute R Script module, shown at the bottom of this experiment, runs code for exploring the data, using output from the Execute R Script module that transforms the data.

Our first step is to read the transformed data and create a correlation matrix using the following code:

## This code will create a series of data visualizations
## to explore the bike rental dataset. This code is
## intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.

## Look at the correlation between the predictors and
## between predictors and the response. Use a linear
## time series regression to detrend the demand.
Time <- BikeShare$dteday
BikeShare$count <- BikeShare$cnt - fitted(
  lm(BikeShare$cnt ~ Time, data = BikeShare))
cor.BikeShare.all <- cor(BikeShare[, c("mnth",
                                       "hr",
                                       "weathersit",

detrending the response variable column in the data frame. Detrending removes a source of bias in the correlation estimates.

We are particularly interested in the correlation of the predictor variables with this detrended response.
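The effect of detrending on correlation estimates can be seen in a small synthetic example. The data below are made up for illustration (a linear trend plus a humidity effect plus noise); only the pattern of subtracting the fitted values of lm() follows the listing above.

```r
# Synthetic series: linear trend in time, a negative humidity effect, noise
set.seed(1)
n <- 500
time <- seq_len(n)
hum <- runif(n)
cnt <- 50 + 0.2 * time - 30 * hum + rnorm(n, sd = 5)

# Detrend: subtract the fitted values of a linear regression on time
detrended <- cnt - fitted(lm(cnt ~ time))

cor(time, detrended)       # effectively zero: the linear trend is gone
cor_raw <- cor(hum, cnt)   # diluted by the variance of the trend
cor_det <- cor(hum, detrended)
```

Because the trend no longer inflates the variance of the response, the correlation between humidity and the detrended demand is noticeably stronger than the correlation with the raw demand.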

The levelplot() function from the lattice package is wrapped by a call to plot(). This is required since, in some cases, Azure ML suppresses automatic printing, and hence plotting. Suppressing printing is desirable in a production environment, as automatically produced output will not clutter the result. As a result, you may need to wrap expressions you intend to produce as printed or plotted output with the print() or plot() functions.
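A minimal sketch of the wrapping pattern follows, using a small made-up data frame rather than the bike rental data. The call to pdf(NULL) is an assumption added only so the sketch runs without opening a graphics window; it is not needed in an Execute R Script module.

```r
library(lattice)

# Correlation matrix of a small made-up data frame
cm <- cor(data.frame(temp = c(1, 2, 3, 4, 5),
                     hum  = c(5, 3, 4, 1, 2),
                     cnt  = c(2, 4, 5, 8, 9)))
diag(cm) <- 0  # zero the self-correlations so they do not dominate the scale

# levelplot() only builds a trellis object; nothing is drawn yet
p <- levelplot(cm, main = "Correlation matrix", xlab = NULL, ylab = NULL)

pdf(NULL)  # null graphics device, used here only to run headless
print(p)   # explicit print draws the plot; plot(p) works as well
dev.off()
```

The same applies to ggplot2 objects: a plot created inside a function or loop is drawn only when it is explicitly printed.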

A plot of the correlation matrix showing the relationship between the predictors, and the predictors and the response variable, can be seen in Figure 1-7. If you run this code in an Azure ML Execute R Script, you can see the plots at the R Device port.

Figure 1-7 Plot of correlation matrix

This plot is dominated by the strong correlation between dayWeek and isWorking, which is hardly surprising. It’s clear that we don’t need to include both of these variables in any model, as they are proxies for each other.

To get a better look at the correlations between other variables, see the second plot, in Figure 1-8, with the dayWeek variable removed.

Figure 1-8 Plot of correlation matrix without dayWeek variable

In this plot we can see that a few of the features exhibit fairly strong

(mnth) are positively correlated, whereas humidity (hum) and the overall weather (weathersit) are negatively correlated. The variable windspeed is nearly uncorrelated. For this plot, the correlation of a variable with itself has been set to 0.0. Note that the scale is asymmetric.

We can also see that several of the predictor variables are highly correlated; for example, hum and weathersit or hr and hum. These correlated variables could cause problems for some types of predictive models.
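Pairs of strongly correlated predictors can also be listed directly from the correlation matrix instead of being read off the plot. The sketch below uses made-up columns a, b, and c, and the cutoff of 0.9 is an arbitrary assumption.

```r
# Made-up predictors: b is nearly a multiple of a, c is only loosely related
preds <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                    b = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
                    c = c(3, 1, 4, 1, 5, 9))
cm <- cor(preds)

# Look only above the diagonal so each pair is reported once
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r    = cm[high])
pairs
```

Each row of the result names two predictors that carry largely the same information, which makes them candidates for dropping one of the pair.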

You should always keep in mind the pitfalls in the interpretation of correlation. First, and most importantly, correlation should never be confused with causation. A highly correlated variable may or may not imply causation. Second, a highly correlated or nearly uncorrelated variable may, or may not, be a good predictor. The variable may be nearly collinear with some other predictor, or the relationship with the response may be nonlinear.
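The nonlinear case is easy to demonstrate with a made-up example: a response that is completely determined by a predictor can still show essentially zero linear correlation with it.

```r
# y depends on x exactly, but the relationship is symmetric about zero
x <- seq(-3, 3, by = 0.5)
y <- x^2

cor(x, y)    # essentially zero: no linear relationship
cor(x^2, y)  # 1: a transformed version of x predicts y perfectly
```

A near-zero entry in the correlation matrix is therefore not enough, on its own, to rule a variable out as a predictor.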

Next, time series plots for selected hours of the day are created, using the following code:

## Make time series plots for certain hours of the day

      ylab("Log number of bikes") +
      labs(title = paste("Bike demand at ",
                         as.character(times), ":00", sep = "")) +
      theme(text = element_text(size=20))
  }
)

This code uses the ggplot2 package to create the time series plots. An anonymous R function, wrapped in lapply, generates the plots at the selected hours.
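The pattern of an anonymous function wrapped in lapply can be sketched as follows. This is an illustration on synthetic data, not the report's listing: the toy data frame, the two hours chosen, and the axis label are assumptions, and the sketch requires the ggplot2 package to be installed.

```r
library(ggplot2)

# Synthetic hourly demand for two hours of the day over 30 days
set.seed(7)
toy <- expand.grid(day = 1:30, hr = c(7, 18))
toy$dteday <- as.POSIXct("2011-01-01", tz = "UTC") + (toy$day - 1) * 86400
toy$cnt <- rpois(nrow(toy), lambda = ifelse(toy$hr == 7, 60, 150))

times <- c(7, 18)

# One plot per selected hour; lapply returns the plots as a list
plots <- lapply(times, function(tm) {
  ggplot(toy[toy$hr == tm, ], aes(x = dteday, y = cnt)) +
    geom_line() +
    ylab("Number of bikes") +
    labs(title = paste("Bike demand at ", as.character(tm), ":00", sep = "")) +
    theme(text = element_text(size = 20))
})
```

At the top level of an interactive session the returned list prints, and so draws, each plot; in an environment that suppresses automatic printing, each element needs an explicit print().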

Two examples of the time series plots for two specific hours of the day are shown in Figures 1-9 and 1-10.

Figure 1-9 Time series plot of bike demand for the 0700 hour

Figure 1-10 Time series plot of bike demand for the 1800 hour

Notice the differences in the shape of these curves at the two different hours. Also, note the outliers at the low side of demand.

Next, we’ll create a number of box plots for some of the factor variables, using the following code:

## Convert dayWeek back to an ordered factor so the plot is in
## time order.
BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek)

## This code gives a first look at the predictor values vs the demand for bikes.
labels <- list("Box plots of hourly bike demand",
               "Box plots of monthly bike demand",
               "Box plots of bike demand by weather factor",
               "Box plots of bike demand by workday vs holiday",
               "Box plots of bike demand by day of the week")
xAxis <- list("hr", "mnth", "weathersit",
If you are not familiar with using Map(), this code may look a bit intimidating. When faced with functional code like this, always read from the inside out. On the inside, you can see the ggplot2 package functions. This code is contained in an anonymous function with two arguments. Map() iterates over the two argument lists to produce the series of plots.
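How Map() walks two lists in parallel is easier to see with the plotting stripped out. In this sketch the anonymous function returns a short text summary instead of a plot; the toy data frame and the labels are made up for illustration.

```r
xAxis  <- list("hr", "mnth")
labels <- list("Demand by hour", "Demand by month")
toy <- data.frame(hr = c(0, 1, 2), mnth = c(1, 1, 2))

# Map() calls the function once per position: (xAxis[[1]], labels[[1]]),
# then (xAxis[[2]], labels[[2]]), and collects the results in a list
out <- Map(function(X, label) {
  paste(label, ": ", length(unique(toy[[X]])),
        " distinct values of ", X, sep = "")
}, xAxis, labels)

out[[1]]
out[[2]]
```

Replacing the body of the anonymous function with ggplot2 calls gives the plotting version: the column name selects the variable and the label becomes the title.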

The utility function that creates the day of week factor with meaningful names is shown in the listing below:

outVec

Three of the resulting box plots are shown in Figures 1-11, 1-12, and 1-13.

Figure 1-11 Box plots showing the relationship between bike demand and hour of the day

Figure 1-12 Box plots showing the relationship between bike demand and weather situation

Figure 1-13 Box plots showing the relationship between bike demand and day of the week

From these plots, you can see a significant difference in the likely predictive power of these three variables. Significant and complex variation in hourly bike demand can be seen in Figure 1-11. In contrast, it looks doubtful that weathersit is going to be very helpful in predicting bike demand, despite the relatively high (negative) correlation value observed. The result shown in Figure 1-13 is surprising: we expected bike demand to depend on the day of the week.

Once again, the outliers at the low end of bike demand can be seen in the box plots.

In our example, we make heavy use of the ggplot2 package. To learn more about ggplot2, we recommend R Graphics Cookbook: Practical Recipes for Visualizing Data by Winston Chang (O’Reilly). There is also an excellent ggplot2 ‘cheat sheet’.

Finally, we’ll create some plots to explore the continuous variables, using the following code:

## Look at the relationship between predictors and bike demand
labels <- c("Bike demand vs temperature",
            "Bike demand vs humidity",
            "Bike demand vs windspeed",
            "Bike demand vs hr")
xAxis <- c("temp", "hum", "windspeed", "hr")
Map(function(X, label){
  ggplot(BikeShare, aes_string(x = X, y = "cnt")) +
    geom_point(aes_string(colour = "cnt"), alpha = 0.1) +
    scale_colour_gradient(low = "green", high = "blue") +
    geom_smooth(method = "loess") +
    ggtitle(label) +
    theme(text = element_text(size=20)) },
  xAxis, labels)

This code is quite similar to the code used for the box plots. We have included a loess smoothed line on each of these plots. Also, note that we have added a color scale and increased the point transparency. Therefore, we get a feel for the number of overlapping data points.

When plotting a large number of points, overplotting is a significant problem. Overplotting makes it difficult to tell the actual point density, as points lie on top of each other. Methods like color scales, point transparency, and hexbinning can all be applied to situations with significant overplotting.
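The idea behind binning is to replace individual points with counts per cell, so that density becomes visible. The sketch below uses rectangular bins built with base R's cut() and table() rather than hexagons, which is a simplification chosen to avoid extra packages; the data are synthetic.

```r
# Synthetic data: many points, demand decreasing with humidity
set.seed(3)
n <- 10000
hum <- runif(n)
cnt <- rpois(n, lambda = 200 * (1 - 0.5 * hum))

# Count the points falling in each cell of a 10 x 10 rectangular grid
counts <- table(cut(hum, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE),
                cut(cnt, breaks = 10))

dim(counts)  # 10 humidity bins by 10 demand bins
sum(counts)  # every point lands in exactly one cell
```

The resulting table can be drawn as a heat map, where cell color shows how many points overlap in that region.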

The loess method in R is quite memory intensive. Depending on how much memory you have on your local machine, you may or may not be able to run this code. Fortunately, Azure ML runs on servers with 60 GB of RAM, which is more than up to the job.
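If a local machine does run short of memory, one possible workaround (not something the report itself uses) is to fit the loess curve on a random subsample of the rows. The sketch below uses synthetic data, and the subsample size of 2,000 is an arbitrary assumption.

```r
# Synthetic data: a smooth daily pattern plus noise
set.seed(11)
n <- 50000
hr <- runif(n, 0, 24)
cnt <- 100 + 80 * sin(hr / 24 * 2 * pi) + rnorm(n, sd = 20)
full <- data.frame(hr = hr, cnt = cnt)

# Fit loess on a random subsample instead of all rows
sub <- full[sample(n, 2000), ]
fit <- loess(cnt ~ hr, data = sub)

# Evaluate the smoothed curve on a grid of hours
smooth <- predict(fit, newdata = data.frame(hr = 1:23))
```

The curve from the subsample is noisier than one fitted to all rows, so it is best treated as an exploratory aid rather than a replacement for the full fit.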

Examples of the resulting scatter plots are shown in Figures 1-14 and 1-15.

Figure 1-14 Scatter plot of bike demand versus humidity

Figure 1-14 shows a clear trend of generally decreasing bike demand with increased humidity. However, at the low end of humidity, the data are sparse and the trend is less certain. We will need to proceed with care.

Figure 1-15 Scatter plot of bike demand versus hour of the day

Figure 1-15 shows the scatter plot of bike demand vs. hour of the day. Note that the loess smoother does not fit parts of these data very well. This is a warning that we may have trouble modeling this complex behavior.

Once again, in both scatter plots we see the prevalence of outliers at the low end of bike demand.

Exploring a potential interaction

Perhaps there is an interaction between time of day and day of the week. A day of week effect is not apparent from Figure 1-13, but we may need to look in more detail. This idea is easy to explore. Adding the following code to the visualization Execute R Script module creates box plots for working and non-working days for peak demand hours:

## Explore the interaction between time of day
## and working or non-working days.
labels <- list("Box plots of bike demand at 0900 for \n working and non-working days",
               "Box plots of bike demand at 1800 for \n working and non-working days")
Times <- list(8, 17)
Map(function(time, label){
  ggplot(BikeShare[BikeShare$hr == time, ],
         aes(x = isWorking, y = cnt, group = isWorking)) +
    geom_boxplot( ) + ggtitle(label) +
    theme(text = element_text(size=18)) },
  Times, labels)

The result of running this code can be seen in Figures 1-16 and 1-17.

Figure 1-16 Box plots of bike demand at 0900 for working and non-working days
