Machine Learning
■ Machine learning is a scientific discipline that explores the construction and study of algorithms that learn from data
■ ML is used to train computers to do things that are difficult or impossible to program explicitly in advance (e.g., handwriting recognition, fraud detection)
■ ML is an important part of Data Mining, KDD, and Data Science
■ ML has strong ties to statistics and mathematical optimization; statistics and optimization techniques are usually at the core of ML algorithms
Examples
■ Predicting stock prices based on current and historical data
■ Predicting how much inventory to stock in the case of hurricanes (Walmart)
■ How to group customers based on their characteristics and buying behaviors?
■ Email classification (spam vs. non-spam)
■ Predicting which customers will quit using your service (MegaTelCo)
Machine Learning Tasks
■ Supervised learning
■ Unsupervised learning
■ Reinforcement learning
Supervised Learning
■ Given a set of example input-output pairs (the training data), the goal of supervised learning is to learn/find a general rule that maps inputs to outputs
– Classification: target variable is discrete (e.g., spam email)
– Regression: target variable is real-valued (e.g., stock price)
[Diagram: a learning algorithm takes a training set as input and produces a mapping function f]
Example: User's Preferences
The dataset's columns are: Example, Author, Thread, Length, Where Read, and User Action (the target).
These are some training and test examples obtained from observing a user deciding whether to read articles posted to a threaded discussion board, depending on whether the author is known or not (source: http://artint.info/html/ArtInt_171.html).
Example: Write-off
Figure 3-1. Data mining terminology for a supervised classification problem. The problem is supervised because it has a target attribute and some "training" data where we know the value for the target attribute. It is a classification (rather than regression) problem because the target is a category (yes or no) rather than a number.
…the Black-Scholes model of option pricing, and so on. Each of these abstracts away details that are not relevant to their main purpose and keeps those that are.
In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two. Given our division of supervised data mining into classification and regression, we will consider classification models (and class-probability estimation models) and regression models.
Terminology: Prediction
In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past. Predictive models for credit scoring estimate the likelihood that a potential customer will default (become a write-off). Predictive models for spam filtering estimate whether a given piece of email is spam. Predictive models for fraud detection judge whether an account has been defrauded.
(Provost and Fawcett, 2013)
Supervised learning algorithms include:
➡ Artificial Neural Network (ANN)
➡ Support Vector Machine (SVM)
Example: Write-off
■ Solving the write-off problem (binary classification) with a decision tree algorithm (see the sketch below)
Figure 3-3. Entropy of a two-class set as a function of p(+).

entropy(S) = -0.7 × log2(0.7) - 0.3 × log2(0.3)
           ≈ -0.7 × (-0.51) - 0.3 × (-1.74)
           ≈ 0.88
Entropy is only part of the story. We would like to measure how informative an attribute is with respect to our target: how much gain in information it gives us about the value of the target variable. An attribute segments a set of instances into several subsets. Entropy only tells us how impure one individual subset is. Fortunately, with entropy to measure how disordered any set is, we can define information gain (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates. Strictly speaking, information gain measures the change in entropy due to any amount of new information being added; here, in the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute. Let's say the attribute we split on has k different values. Let's call the original set of examples the parent set, and the result of splitting on the attribute values the k children sets. Thus, information gain is a function of both a parent set and of the children.
entropy = -p1 × log2(p1) - p2 × log2(p2) - …
where pi is the probability (relative proportion) of class i in the set

IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + …]
where p(ci) is the proportion of the parent's instances that fall into child ci
Example: Write-off
[Diagram: the parent set has entropy 0.99 (high impurity); splitting on an attribute yields child segments with entropies 0.79 and 0.39]
Example: Write-off
Figure 3-15. A classification tree and the partitions it imposes in instance space. The black dots correspond to instances of the class Write-off; the plus signs correspond to instances of class non-Write-off. The shading shows how the tree leaves correspond to segments of the population in instance space.
(Provost and Fawcett, 2013)
Example: Write-off
■ Linear discriminant function: perceptron, logistic regression, support vector machine
Figure 4-5. Many different possible linear boundaries can separate the two groups of points of Figure 4-4.
Unfortunately, it's not trivial to choose the "best" line to separate the classes. Let's consider a simple case, illustrated in Figure 4-4. Here the training data can indeed be separated by class using a linear discriminant. However, as shown in Figure 4-5, there actually are many different linear discriminants that can separate the classes perfectly. They have very different slopes and intercepts, and each represents a different model of the data. In fact, there are infinitely many lines (models) that classify this training set perfectly. Which should we pick?

Optimizing an Objective Function
This brings us to one of the most important fundamental ideas in data mining—one that surprisingly is often overlooked even by data scientists themselves: we need to ask, what should be our goal or objective in choosing the parameters? In our case, this would allow us to answer the question: what weights should we choose? Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function. What can easily be overlooked is that these weights are "best" only if we believe that the objective function truly represents what we want to achieve, or practically speaking, is the best proxy we can come up with. We will return to this later in the book.
Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith² and experience. (² And sometimes it can be surprisingly hard for them to admit it.)
f(x) = w0 + w1·x1 + w2·x2 + …
Unsupervised Learning
■ Unsupervised learning studies how systems can learn to
represent particular input patterns (unlabeled) in a way that reflects the statistical structure of the overall collection of
input patterns
- Clustering
- Principal components analysis (PCA)
- Self-organizing map (SOM)
- Evolutionary Computation
Cluster Analysis
■ Cluster analysis aims to search for patterns in a data set by
grouping the (multivariate) observations into clusters
■ The goal is to find an optimal grouping for which the
observations or objects within each cluster are similar, but the
clusters are dissimilar to each other
K-means clustering
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   - For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   - Assign each observation to the cluster whose centroid is closest (based on Euclidean distance).
A minimal sketch of this loop follows below.
Example: K = 3
FIGURE 10.6. The progress of the K-means algorithm on the example of Figure 10.5 with K=3. Top left: the observations are shown. Top center: in Step 1 of the algorithm, each observation is randomly assigned to a cluster. Top right: in Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random. Bottom left: in Step 2(b), each observation is assigned to the nearest centroid. Bottom center: Step 2(a) is once again performed, leading to new cluster centroids. Bottom right: the results obtained after ten iterations.

Because K-means finds a local rather than a global optimum, it is important to run the algorithm multiple times from different random initial configurations. Then one selects the best solution, i.e., that for which the objective (10.11) is smallest. Figure 10.7 shows the local optima obtained by running K-means clustering six times using six different initial cluster assignments, using the toy data from Figure 10.5. In this case, the best clustering is the one with an objective value of 235.8.

As we have seen, to perform K-means clustering, we must decide how many clusters we expect in the data. The problem of selecting K is far from simple. This issue, along with other practical considerations that arise in performing K-means clustering, is addressed in Section 10.3.3.
Example: group similar news
Source: http://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/

Example: Facebook friends
Reinforcement Learning
■ Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
■ Learning about, from, and while interacting with an external environment
■ The learner is not told which actions to take, as in most forms
of machine learning, but instead must discover which actions yield the most reward by trying them
Practical Issues
■ Feature scaling/normalization: standardize the range of independent variables or features of the data (see the sketch below)
■ Feature manipulation: includes feature selection and feature construction
■ Interpretability: how easily we can explain the results/models obtained by ML algorithms
A quick demonstration
■ Titanic
[Timeline figure: Francis Galton and Karl Pearson quantify the relationship between offspring and parental characteristics (regression); R. A. Fisher uses linear discriminant function analysis to solve a taxonomic problem; 1950s-2000s: evolution of AI, machine learning, and …]
KDD/Data Mining
■ Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [Fayyad]
■ Data Mining is a problem-solving methodology that finds a logical or mathematical description, of a complex nature, of patterns and regularities in a set of data [Decker and Focardi]
■ Data Mining is often related to learning/adaptive algorithms and methods
■ KDD/DM is not a set of new techniques but rather a multi-disciplinary field of research: many fields all make a contribution (more later)
Business Analytics
■ Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning [Bartlett, 2013]
■ Using data to make better decisions; basically operations research with an emphasis on data
[Venn diagram: Business Analytics at the intersection of Econometrics and Operations Research]

[Venn diagram: Business Analytics at the intersection of Econometrics, Operations Research, and Machine Learning]
Common applications of BA
Competing on Analytics
…that they recognize those methods' limitations—which factors are being weighed and which ones aren't. When the CEOs need help grasping quantitative techniques, they turn to experts who understand the business and how analytics can be applied to it. We interviewed several leaders who had retained such advisers, and these executives stressed the need to find someone who can explain things in plain language and be trusted not to spin the numbers.

A few CEOs we spoke with had surrounded themselves with very analytical people—professors, consultants, MIT graduates, and the like. But that was a personal preference rather than a necessary practice.

Of course, not all decisions should be grounded in analytics—at least not wholly so. Personnel matters, in particular, are often well and appropriately informed by instinct and anecdote. More organizations are subjecting recruiting and hiring decisions to statistical analysis (see the sidebar "Going to Bat for Stats"). But research shows that human beings can make quick, surprisingly accurate assessments of personality and character based on simple observations. For analytics-minded leaders, then, the challenge boils down to knowing when to run with the numbers and when to run with their guts.

Their Sources of Strength
Analytics competitors are more than simple number-crunching factories. Certainly, they apply technology—with a mixture of brute force and finesse—to multiple business problems. But they also direct their energies toward finding the right focus, building the right culture, and hiring the right people to make optimal use of the data they constantly churn. In the end, people and strategy, as much as information technology, give such organizations strength.

The right focus. Although analytics competitors encourage universal fact-based decisions, they must choose where to direct resource-intensive efforts. Generally, they pick several functions or initiatives that together serve an overarching strategy. Harrah's, for example, has aimed much of its analytical activity at increasing customer loyalty, customer service, and related areas like pricing and promotions. UPS has broadened its focus from logistics to customers, in the interest of providing superior service. While such multipronged strate…
THINGS YOU CAN COUNT ON
Analytics competitors make expert use of statistics and modeling to improve a wide variety of functions. Here are some common applications:
- Supply chain: simulate and optimize supply chain flows; reduce inventory and stock-outs. (Dell, Wal-Mart, Amazon)
- Customer selection, loyalty, and service: identify customers with the greatest profit potential; increase the likelihood that they will want the product or service offering; retain their loyalty. (Harrah's, Capital One, Barclays)
- Pricing: identify the price that will maximize yield, or profit. (Progressive, Marriott)
- Human capital: select the best employees for particular tasks or jobs, at particular compensation levels. (New England Patriots, Oakland A's, Boston Red Sox)
- Product and service quality: detect quality problems early and minimize them. (Honda, Intel)
- Financial performance: better understand the drivers of financial performance and the effects of nonfinancial factors. (MCI, Verizon)
- Research and development: improve quality, efficacy, and, where applicable, safety of products and services. (Novartis, Amazon, Yahoo)
Thomas H. Davenport (2006), "Competing on Analytics", Harvard Business Review