Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

by Stephen F. Elston
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Nicholas Adams
Proofreader: Nicholas Adams
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2015: First Edition
Revision History for the First Edition
2015-09-01: First Release
2015-11-21: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93634-4
[LSI]
Chapter 1. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update
Introduction
This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools.
In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the example.

Before we get started, let’s review some of the advantages Azure ML provides for machine learning solutions:
Solutions can be quickly and easily deployed as web services.
Models run in a highly scalable and secure cloud environment.
Azure ML is integrated with the powerful Microsoft Cortana Analytics Suite, which includes massive storage and processing capabilities. It can read data from and write data to Cortana storage at significant volume. Azure ML can even be employed as the analytics engine for other components of the Cortana Analytics Suite.
Machine learning algorithms and data transformations are extendable using the R language, for solution-specific functionality.
Analytics written in the R and Python languages can be rapidly operationalized.
Code and data are maintained in a secure cloud environment.
Downloads
For our example, we will be using the Bike Rental UCI dataset available in Azure ML. This data is also preloaded in the Azure ML Studio environment, or you can download this data as a .csv file from the UCI website. The reference for this data is Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
The R code for our example can be found at GitHub.
Working Between Azure ML and RStudio
Azure ML is a production environment. It is ideally suited to publishing machine learning models. In contrast, Azure ML is not a particularly good development environment.
In general, you will find it easier to perform preliminary editing, testing, and debugging in RStudio. In this way, you take advantage of the powerful development resources and perform your final testing in Azure ML. Downloads for R and RStudio are available for Windows, Mac, and Linux.
This report assumes the reader is familiar with the basics of R. If you are not familiar with using R in Azure ML, check out the Quick Start Guide to R in Azure ML.
The R source code for the data science example in this report can be run in either Azure ML or RStudio. Read the comments in the source files to see the changes required to work between these two environments.
Overview of Azure ML
This section provides a short overview of Azure Machine Learning. You can find more details and specifics, including tutorials, at the Microsoft Azure web page. Additional learning resources can be found on the Azure Machine Learning documentation site.
Deeper and broader introductions can be found in the following video classes:
Data Science with Microsoft Azure and R: Working with Cloud-based Predictive Analytics and Modeling, by Stephen Elston from O’Reilly Media, provides an in-depth exploration of doing data science with Azure ML and R.
Data Science and Machine Learning Essentials, an edX course by Stephen Elston and Cynthia Rudin, provides a broad introduction to data science using Azure ML, R, and Python.
As we work through our data science example in subsequent sections, we include specific examples of the concepts presented here. We encourage you to go to this page, create your own free-tier account, and try these examples on your own.
Azure ML Studio
Azure ML models are built and tested in the web-based Azure ML Studio. Figure 1-1 below shows an example of the Azure ML Studio.
Figure 1-1. Azure ML Studio
A workflow of the model appears in the center of the studio window. A dataset and an Execute R Script module are on the canvas. On the left side of the Studio display, you see datasets and a series of tabs containing various types of modules. Properties of whichever dataset or module has been clicked on can be seen in the right panel. In this case, you can see the R code contained in the Execute R Script module.
Build your own experiment
Building your own experiment in Azure ML is quite simple. Click the + symbol in the lower left-hand corner of the studio window. You will see a display resembling Figure 1-2 below. Select either a blank experiment or one of the sample experiments.
Figure 1-2. Creating a New Azure ML Experiment
If you choose a blank experiment, start dragging and dropping modules and datasets onto your canvas. Connect the module outputs to inputs to build an experiment.
Getting Data In and Out of Azure ML
Let’s discuss how we get data into and out of Azure ML.
Azure ML supports several data I/O options, including:
Web services
HTTP connections
Azure SQL tables
Azure Blob storage
Azure Tables: NoSQL key-value tables
Hive queries
These data I/O capabilities enable interaction with external applications and other components of the Cortana Analytics Suite.
We will investigate web service publishing in another section of this report.
Data I/O at scale is supported by the Azure ML Reader and Writer modules. The Reader and Writer modules provide an interface with Cortana data storage components. Figure 1-3 shows an example of configuring the Reader module to read data from a hypothetical Azure SQL table. Similar capabilities are available in the Writer module for outputting data at volume.
Figure 1-3. Configuring the Reader Module for an Azure SQL Query
Modules and Datasets
Mixing native modules and R in Azure ML
Azure ML provides a wide range of modules for data transformation, machine learning, and model evaluation. Most native Azure ML modules are computationally efficient and scalable. As a general rule, these native modules should be your first choice.
The deep and powerful R language extends Azure ML to meet the requirements of specific data science problems. For example, solution-specific data transformation and cleaning can be coded in R.
R language scripts contained in Execute R Script modules can be run in-line with native Azure ML modules. Additionally, the R language gives Azure ML powerful data visualization capabilities. With the Create R Model module, you can train and score models from numerous R packages within an experiment with relatively little work.
As we work through the examples, you will see how to mix native Azure ML modules and Execute R Script modules to create a complete solution.
Execute R Script Module I/O
In the Azure ML Studio, input ports are located above module icons, and output ports are located
below module icons.
TIP
If you move your mouse over the ports of a module, you will see a “tool tip” showing the type of data for that port.
The Execute R Script module has five ports:
The Dataset1 and Dataset2 ports are inputs for rectangular Azure data tables.
The Script Bundle port accepts a zipped R script file (.R file) or R dataset file.
The Result Dataset output port produces an Azure rectangular data table from a data frame.
The R Device port produces text or graphics output from R.
Within experiments, workflows are created by connecting the appropriate ports between modules: output port to input port. Connections are made by dragging your mouse from the output port of one module to the input port of another module.
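As a minimal sketch of how these ports are used in code (the data frame name df here is arbitrary):

## Read the data table from the Dataset1 input port as a data frame.
df <- maml.mapInputPort(1)

## Transformations on df go here.

## Return the data frame at the Result Dataset output port.
maml.mapOutputPort("df")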
Azure ML Workflows
Model training workflow
Figure 1-4 shows a generalized workflow for training, scoring, and evaluating a machine learning model in Azure ML. This general workflow is the same for most regression and classification algorithms. The model definition can be a native Azure ML module or R code in a Create R Model module.
Figure 1-4. A generalized model training workflow for Azure ML models
Key points on the model training workflow:
Data input can come from a variety of interfaces, including web services, HTTP connections, Azure SQL, and Hive Query. These data sources can be within the Cortana suite or external to it. In most cases, for training and testing models, you use a saved dataset.
Transformations of the data can be performed using a combination of native Azure ML modules and the R language.
A Model Definition module defines the model type and properties. On the left-hand pane of the Studio you will see numerous choices for models. The parameters of the model are set in the properties pane. R model training and scoring scripts can be provided in a Create R Model module.
The Training module trains the model. The trained model is then scored in the Score module, and performance summary statistics are computed in the Evaluate module.
The following sections include specific examples of each of the steps illustrated in Figure 1-4.
Publishing a model as a web service
Once you have developed and evaluated a satisfactory model, you can publish it as a web service. You will need to create a streamlined workflow for promotion to production. A generalized example is shown in Figure 1-5.
Figure 1-5. Workflow for an Azure ML model published as a web service
Here are some key points of the workflow for publishing a web service:
Typically, you will use transformations you created and saved when you were training the model. These include saved transformations from the various Azure ML data transformation modules and modified R transformation code.
The product of the training processes (discussed above) is the trained model.
You can apply transformations to results produced by the model. Examples of transformations include deleting unneeded columns and converting units of numerical results.
A Regression Example
Problem and Data Overview
Demand and inventory forecasting are fundamental business processes. Forecasting is used for supply chain management, staff level management, production management, and many other applications.

In data science problems, it is always important to gain an understanding of the objectives of the end-users. In this case, having a reasonable number of extra bikes on-hand is far less of an issue than having an insufficient inventory. Keep this fact in mind as we are evaluating models.
For this example, we’ll use a dataset containing a time series of demand information for the bicycle rental system. These data contain hourly demand figures over a two-year period, for both registered and casual users. There are nine features, also known as predictor, or independent, variables. The dataset contains a total of 17,379 rows, or cases.
The first, and possibly most important, task in creating effective predictive analytics models is determining the feature set. Feature selection is usually more important than the specific choice of machine learning model. Feature candidates include variables in the dataset, transformed or filtered values of these variables, or new variables computed from the variables in the dataset. The process of creating the feature set is sometimes known as feature selection or feature engineering.
In addition to feature engineering, data cleaning and editing are critical in most situations. Filters can be applied to both the predictor and response variables.
The data set is available in the Azure ML sample data sets. You can also download it as a .csv file, either from Azure ML or from the University of California Machine Learning Repository.
A first set of transformations
For our first step, we’ll perform some transformations on the raw input data, using the code shown below in an Azure ML Execute R Script module:
## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to FALSE the code will run
## in R or RStudio.
Azure <- FALSE

## If we are in Azure, source the utilities from the zip
## file. The next lines of code read in the dataset, either
## in Azure ML or from a csv file for testing purposes.
if(Azure){
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  ## Convert the date-time character string to POSIXct.
  BikeShare$dteday <- to.POSIXct(BikeShare)
}else{
  ## The file names here are assumptions; point these at your
  ## local copies of utilities.R and the UCI bike rental data.
  source("utilities.R")
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)

  ## Select the columns we need.
  cols <- c("dteday", "mnth", "hr", "holiday",
            "workingday", "weathersit", "temp",
            "hum", "windspeed", "cnt")
  BikeShare <- BikeShare[, cols]

  ## Convert the date-time column to a POSIXct time series.
  BikeShare$dteday <- char.toPOSIXct(BikeShare)

  ## Normalize the numeric predictors.
  cols <- c("temp", "hum", "windspeed")
  BikeShare[, cols] <- scale(BikeShare[, cols])
}

## Create a new variable to indicate workday.
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                              !BikeShare$holiday, 1, 0)

## Add a column of the count of months, which could
## help model trend.
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week,
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                       levels = c("Monday", "Tuesday", "Wednesday",
                                  "Thursday", "Friday", "Saturday",
                                  "Sunday")))
Notice the conditional statement at the beginning of this code listing. When the logical variable Azure is set to TRUE, the maml.mapInputPort(1) function reads the data frame from the input port of the Execute R Script module; the argument 1 indicates the first input port. R functions from a zip file are brought into the R environment by the source() function; the R file is read from the src directory. The date-time character string is converted to a POSIXct time series object by the to.POSIXct function.
If, on the other hand, Azure is set to FALSE, the other code path is executed. This code path allows us to test the code in RStudio. The data are read from a .csv file. The argument stringsAsFactors = FALSE ensures that string columns are retained as such, as they will be in Azure ML. Column selection and normalization of certain numeric columns are also executed in this path; when running in Azure ML, these transformations are accomplished with native Azure ML modules. The date-time column is converted to a time series object with the char.toPOSIXct function.
This code creates several new columns, or features. As we explore the data, we will determine if any of these features improve our models:

A column indicating whether it’s a workday or not.
A column, added by the month.count function, indicating the number of months from the beginning of the time series.
A column indicating the day of the week as an ordered factor.
TIP
In most cases, Azure ML will treat R POSIXct formatted character columns as having a date-time type. R may not interpret the Azure ML date-time type as POSIXct. To be consistent, a type conversion is required. If you encounter errors with date-time fields when working with R in Azure ML, check that the type conversions are working as expected.
The utilities.R file contains the functions used for these transformations. The file is packaged into a zip file and uploaded into Azure ML Studio. The R code in the zip file is then available in any Execute R Script module in the experiment.
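As a rough, illustrative sketch only (the function bodies below are assumptions, not the author’s exact code; see the GitHub repository for the real utilities.R), the date conversion and month-count utilities might look like this:

char.toPOSIXct <- function(inFrame){
  ## Combine the dteday character column with the hour and
  ## convert to a POSIXct date-time.
  as.POSIXct(strptime(paste(inFrame$dteday, " ",
                            inFrame$hr, ":00:00", sep = ""),
                      "%Y-%m-%d %H:%M:%S"))
}

month.count <- function(inFrame){
  ## Add a column counting months from the start of the series.
  yr  <- as.numeric(strftime(inFrame$dteday, format = "%Y"))
  mon <- as.numeric(strftime(inFrame$dteday, format = "%m"))
  inFrame$monthCount <- mon + 12 * (yr - min(yr))
  inFrame
}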
Exploring the data
Let’s have a first look at the data by walking through a series of exploratory plots.
An additional Execute R Script module with the visualization code is added to the experiment. At this point, our Azure ML experiment looks like Figure 1-6. The first Execute R Script module, titled “Transform Data,” contains the code shown in the previous code listing.
Figure 1-6. The Azure ML experiment in Studio
The Execute R Script module, shown at the bottom of this experiment, runs code for exploring the data, using output from the Execute R Script module that transforms the data.
Our first step is to read the transformed data and create a correlation matrix using the following code:
## This code will create a series of data visualizations
## to explore the bike rental dataset. This code is
## intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.
Azure <- FALSE
if(Azure){
  ## Source the zipped utility file.
  source("src/utilities.R")
  ## Read the transformed data from the input port.
  BikeShare <- maml.mapInputPort(1)
}

## Look at the correlation between the predictors and
## between predictors and demand. Use a linear
## time series regression to detrend the demand.
Time <- BikeShare$dteday
BikeShare$count <- BikeShare$cnt - fitted(
                     lm(BikeShare$cnt ~ Time, data = BikeShare))
In this code, we use lm() to compute a linear model used for detrending the response variable column in the data frame. Detrending removes a source of bias in the correlation estimates. We are particularly interested in the correlation of the predictor variables with this detrended response.
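The correlation matrix is then computed and plotted with the lattice levelplot() function. The following is a sketch of that step; the exact column list and plot options are assumptions consistent with the plots described below:

library(lattice)
## Column list assumed from the features created so far.
cols <- c("mnth", "hr", "holiday", "workingday", "weathersit",
          "temp", "hum", "windspeed", "isWorking",
          "monthCount", "dayWeek", "count")
cors <- cor(BikeShare[, cols], method = "pearson")
## Set each variable's self-correlation to 0.0 for plot scaling.
diag(cors) <- 0.0
## Wrap levelplot() in plot() so the plot renders in Azure ML.
plot(levelplot(cors, xlab = NULL, ylab = NULL,
               scales = list(x = list(rot = 90)),
               main = "Correlation matrix"))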
NOTE
The levelplot() function from the lattice package is wrapped by a call to plot(). This is required since, in some cases, Azure ML suppresses automatic printing, and hence plotting. Suppressing printing is desirable in a production environment, as automatically produced output will not clutter the result. As a result, you may need to wrap expressions you intend to produce printed or plotted output with the print() or plot() functions.
This code requires one function, which is defined in the utilities.R file.
A plot of the correlation matrix, showing the relationship between the predictors, and between the predictors and the response variable, can be seen in Figure 1-7. If you run this code in an Azure ML Execute R Script module, you can see the plots at the R Device port.
Figure 1-7. Plot of correlation matrix
This plot is dominated by the strong correlation between dayWeek and isWorking, which is hardly surprising. It’s clear that we don’t need to include both of these variables in any model, as they are proxies for each other.
To get a better look at the correlations between the other variables, see the second plot, in Figure 1-8, with the dayWeek variable removed.
Figure 1-8. Plot of correlation matrix without dayWeek variable
In this plot, we can see that a few of the features exhibit fairly strong correlation with the response. The hour (hr), temp, and month (mnth) are positively correlated, whereas humidity (hum) and the overall weather (weathersit) are negatively correlated. The variable windspeed is nearly uncorrelated. For this plot, the correlation of a variable with itself has been set to 0.0. Note that the scale is asymmetric.

We can also see that several of the predictor variables are highly correlated, for example, hum and weathersit, or hr and hum. These correlated variables could cause problems for some types of models, such as linear regression models.
Next, time series plots for selected hours of the day are created, using the following code:
## Make time series plots for certain hours of the day.
require(ggplot2)
times <- c(7, 9, 12, 15, 18, 20, 22)
# BikeShare$Time <- Time
lapply(times, function(times){
  ## print() ensures the plot renders in Azure ML.
  print(ggplot(BikeShare[BikeShare$hr == times, ],
               aes(x = dteday, y = cnt)) +
          geom_line() +
          ylab("Log number of bikes") +
          labs(title = paste("Bike demand at ",
                             as.character(times), ":00", sep = "")) +
          theme(text = element_text(size = 20)))
})
This code uses the ggplot2 package to create the time series plots. An anonymous R function, wrapped in lapply, generates the plots at the selected hours.
Two examples of the time series plots, for two specific hours of the day, are shown in Figures 1-9 and 1-10.
Figure 1-9. Time series plot of bike demand for the 0700 hour
Figure 1-10. Time series plot of bike demand for the 1800 hour
Notice the differences in the shape of these curves at the two different hours. Also, note the outliers at the low side of demand.
Next, we’ll create a number of box plots for some of the factor variables, using the following code:
## Convert dayWeek back to an ordered factor so the plot is in
## time order.
BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek)

## This code gives a first look at the predictor values vs the demand for bikes.
labels <- list("Box plots of hourly bike demand",
               "Box plots of monthly bike demand",
               "Box plots of bike demand by weather factor",
               "Box plots of bike demand by workday vs holiday",
               "Box plots of bike demand by day of the week")
xAxis <- list("hr", "mnth", "weathersit",
              "isWorking", "dayWeek")
Map(function(X, label){
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt",
                                     group = X)) +
          geom_boxplot() +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)

When reading functional code like this, always read from the inside out. On the inside, you can see the ggplot2 package functions. This code is contained in an anonymous function with two arguments. Map() iterates over the two argument lists to produce the series of plots.
The utility function that creates the day of week factor with meaningful names is shown in the listing below:
fact.conv <- function(inVec){
  ## Function gives the day variable meaningful
  ## level names.
  outVec <- as.factor(inVec)
  levels(outVec) <- c("Monday", "Tuesday", "Wednesday",
                      "Thursday", "Friday", "Saturday",
                      "Sunday")
  outVec
}
Three of the resulting box plots are shown in Figures 1-11, 1-12, and 1-13.
Figure 1-11. Box plots showing the relationship between bike demand and hour of the day
Figure 1-12. Box plots showing the relationship between bike demand and weather situation
Figure 1-13. Box plots showing the relationship between bike demand and day of the week
From these plots, you can see a significant difference in the likely predictive power of these three variables. Significant and complex variation in hourly bike demand can be seen in Figure 1-11. In contrast, it looks doubtful that weathersit is going to be very helpful in predicting bike demand, despite the relatively high (negative) correlation value observed. The result shown in Figure 1-13 is surprising: we expected bike demand to depend on the day of the week.
Once again, the outliers at the low end of bike demand can be seen in the box plots.
TIP
In our example, we make heavy use of the ggplot2 package. To learn more about ggplot2, we recommend R Graphics Cookbook: Practical Recipes for Visualizing Data by Winston Chang (O’Reilly). There is also an excellent ggplot2 cheat sheet.
Finally, we’ll create some plots to explore the continuous variables, using the following code:
## Look at the relationship between predictors and bike demand.
labels <- c("Bike demand vs temperature",
            "Bike demand vs humidity",
            "Bike demand vs windspeed",
            "Bike demand vs hr")
xAxis <- c("temp", "hum", "windspeed", "hr")
This code is quite similar to the code used for the box plots. We have included a loess smoothed line on each of these plots. Also, note that we have added a color scale and increased the point transparency, so we get a feel for the number of overlapping data points.
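The plotting itself follows the same Map() pattern as the box plots. In this sketch, the point color, transparency value, and smoother choice are assumptions consistent with the description above:

Map(function(X, label){
  ## Wrap in print() so the plots render in Azure ML.
  print(ggplot(BikeShare, aes_string(x = X, y = "cnt")) +
          ## Color and transparency give a feel for point density.
          geom_point(aes(color = cnt), alpha = 0.1) +
          ## Add a loess smoothed line to show the trend.
          geom_smooth(method = "loess") +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, xAxis, labels)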
TIP
When plotting a large number of points, overplotting is a significant problem. Overplotting makes it difficult to tell the actual point density, as points lie on top of each other. Methods like color scales, point transparency, and hexbinning can all be applied to situations with significant overplotting.
WARNING
The loess method in R is quite memory intensive. Depending on how much memory you have on your local machine, you may or may not be able to run this code. Fortunately, Azure ML runs on servers with 60 GB of RAM, which is more than up to the job.
Examples of the resulting scatter plots are shown in Figures 1-14 and 1-15.
Figure 1-14. Scatter plot of bike demand versus humidity
Figure 1-14 shows a clear trend of generally decreasing bike demand with increased humidity. However, at the low end of humidity, the data is sparse and the trend is less certain. We will need to proceed with care.
Figure 1-15. Scatter plot of bike demand versus hour of the day
Figure 1-15 shows the scatter plot of bike demand versus hour of the day. Note that the loess smoother does not fit parts of these data very well. This is a warning that we may have trouble modeling this complex behavior.
Once again, in both scatter plots, we see the prevalence of outliers at the low end of bike demand.
Exploring a potential interaction
Perhaps there is an interaction between time of day and day of the week. A day-of-week effect is not apparent from Figure 1-13, but we may need to look in more detail. This idea is easy to explore. Adding the following code to the visualization Execute R Script module creates box plots for working and non-working days at peak demand hours:
## Explore the interaction between time of day
## and working or non-working days. The Map() call below is
## a sketch; the peak hours are taken from the plot labels.
labels <- list("Box plots of bike demand at 0900 for \n working and non-working days",
               "Box plots of bike demand at 1800 for \n working and non-working days")
Times <- list(9, 18)
Map(function(time, label){
  print(ggplot(BikeShare[BikeShare$hr == time, ],
               aes(x = as.factor(isWorking), y = cnt)) +
          geom_boxplot() +
          ggtitle(label) +
          theme(text = element_text(size = 18)))
}, Times, labels)

Figure 1-16. Box plots of bike demand at 0900 for working and non-working days
Figure 1-17. Box plots of bike demand at 1800 for working and non-working days
Now we can clearly see what we were missing in the initial set of features: there is different demand between working and non-working days at peak demand hours.
Creating new features
We need a new feature that differentiates the time of the day by working and non-working days. To do this, we will add the following code to the transform Execute R Script module:
## Add a variable with unique values for time of day for
## working and non-working days.
BikeShare$workTime <- ifelse(BikeShare$isWorking,
                             BikeShare$hr,
                             BikeShare$hr + 24)
NOTE
We have created the new variable using working versus non-working days. This leads to 48 levels (2 × 24) in this variable. We could have used the day of the week, but this approach would have created 168 levels (7 × 24). Reducing the number of levels reduces complexity and the chance of overfitting, generally leading to a better model.
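As a quick sanity check, you can count the distinct values after the transform above has run:

## 24 hours x 2 day types = at most 48 distinct values.
length(unique(BikeShare$workTime))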
Transformed time: Other new features
As noted earlier, the complex hour-to-hour variation in bike demand, shown in Figures 1-10 and 1-15, may be difficult for some models to deal with. Perhaps, if we shift the time axis, we will create new features where demand is closer to a simple hump shape. The following code shifts the time axis by five hours to create one new feature, and shifts the workTime feature by five hours to create another new feature:
## Shift the order of the hour variable so that it is smoothly
## "humped" over 24 hours.
BikeShare$xformHr <- ifelse(BikeShare$hr > 4,
                            BikeShare$hr - 5,
                            BikeShare$hr + 19)

## Add a variable with unique values for time of day for
## working and non-working days.
BikeShare$xformWorkHr <- ifelse(BikeShare$isWorking,
                                BikeShare$xformHr,
                                BikeShare$xformHr + 24)
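A quick spot-check at the console shows how the shift wraps the overnight hours to the end of the transformed axis:

## Hour 5 maps to 0; hour 4 wraps around to 23.
ifelse(c(5, 4) > 4, c(5, 4) - 5, c(5, 4) + 19)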
We add two more plots to the series we created in the visualization module, with the following code:
## Look at the relationship between predictors and bike demand.
labels <- c("Bike demand vs temperature",
            "Bike demand vs humidity",
            "Bike demand vs windspeed",
            "Bike demand vs hr",
            "Bike demand vs xformHr",
            "Bike demand vs xformWorkHr")
xAxis <- c("temp", "hum", "windspeed", "hr",
           "xformHr", "xformWorkHr")
Map(function(X, label){
  ggplot(BikeShare, aes_string(x = X, y = "cnt")) +