Figure 7.11 Running a neural network on 10 examples from the validation set can help determine how to interpret results. (The vertical axis shows the network's output on a scale from –1.0 to 1.0.)
Neural Networks for Time Series
In many business problems, the data naturally falls into a time series. Examples of such series are the closing price of IBM stock, the daily value of the Swiss franc to U.S. dollar exchange rate, or a forecast of the number of customers who will be active on any given date in the future. For financial time series, someone who is able to predict the next value, or even whether the series is heading up or down, has a tremendous advantage over other investors. Although predominant in the financial industry, time series appear in other areas, such as forecasting and process control. Financial time series, though, are the most studied, since a small advantage in predictive power translates into big profits.
Neural networks are easily adapted for time-series analysis, as shown in Figure 7.12. The network is trained on the time-series data, starting at the oldest point in the data. The training then moves to the second oldest point, and the oldest point goes to the next set of units in the input layer, and so on. The network trains like a feed-forward, back propagation network trying to predict the next value in the series at each step.
Figure 7.12 A time-delay neural network remembers the previous few training examples and uses them as input into the network. The network then works like a feed-forward, back propagation network. (The input layer holds each value at times t, t-1, and t-2 in its historical units; these feed a hidden layer that predicts value 1 at time t+1.)
Notice that the time-series network is not limited to data from just a single time series. It can take multiple inputs. For instance, to predict the value of the Swiss franc to U.S. dollar exchange rate, other time-series information might be included, such as the volume of the previous day's transactions, the U.S. dollar to Japanese yen exchange rate, the closing value of the stock exchange, and the day of the week. In addition, non-time-series data, such as the reported inflation rate in the countries over the period of time under investigation, might also be candidate features.
The number of historical units controls the length of the patterns that the network can recognize. For instance, keeping 10 historical units on a network predicting the closing price of a favorite stock will allow the network to recognize patterns that occur within 2-week time periods (since stock prices are set only on weekdays). Relying on such a network to predict the value 3 months in the future is not recommended.
Actually, by modifying the input, a feed-forward network can be made to work like a time-delay neural network. Consider the time series with 10 days of history, shown in Table 7.5. The network will include two features: the day of the week and the closing price.
Creating a time series with a time lag of three requires adding new features for the historical, lagged values. (Day-of-the-week does not need to be copied, since it does not really change.) The result is Table 7.6. This data can now be input into a feed-forward, back propagation network without any special support for time series.
Table 7.5 Time Series
Table 7.6 Time Series with Time Lag
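To make the transformation concrete, here is a minimal sketch in Python using pandas. The column names and price values are illustrative, not taken from Table 7.5:

```python
import pandas as pd

# Hypothetical 10-day series: day of week (1-5) and closing price.
data = pd.DataFrame({
    "day_of_week": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "close": [40.25, 41.00, 39.80, 40.10, 40.75,
              41.20, 41.10, 40.90, 41.50, 42.00],
})

# Add lagged copies of the closing price (a time lag of three).
# Day-of-the-week is not copied, since it does not really change.
for lag in (1, 2, 3):
    data[f"close_t-{lag}"] = data["close"].shift(lag)

# The target is the next value in the series.
data["close_t+1"] = data["close"].shift(-1)

# Rows made incomplete by the shifting are dropped; what remains has the
# layout of Table 7.6 and can feed an ordinary feed-forward network.
print(data.dropna())
```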
If only we could ask it to tell us how it is making its decision in the form of rules. Unfortunately, the same nonlinear characteristics of neural network nodes that make them so powerful also make them unable to produce simple rules. Eventually, research into rule extraction from networks may bring unequivocally good results. Until then, the trained network itself is the rule, and other methods are needed to peer inside to understand what is going on.
A technique called sensitivity analysis can be used to get an idea of how opaque models work. Sensitivity analysis does not provide explicit rules, but it does indicate the relative importance of the inputs to the result of the network. Sensitivity analysis uses the test set to determine how sensitive the output of the network is to each input. The following are the basic steps:
1. Find the average value of each input.
2. Measure the output of the network when all inputs are at their average value.
3. Measure the output of the network when each input is modified, one at a time, to be at its minimum and maximum values (usually –1 and 1, respectively).
For some inputs, the output of the network changes very little for the three values (minimum, average, and maximum). The network is not sensitive to these inputs (at least when all other inputs are at their average value). Other inputs have a large effect on the output of the network. The network is sensitive to these inputs. The amount of change in the output measures the sensitivity of the network for each input. Using these measures for all the inputs creates a relative measure of the importance of each feature. Of course, this method is entirely empirical and is looking only at each variable independently. Neural networks are interesting precisely because they can take interactions between variables into account.
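The steps above translate directly into code. The sketch below assumes a trained network is available behind a `predict` function mapping a vector of inputs to a single output; the function name and the scaling of inputs to the range –1 to 1 are assumptions, not part of any particular library:

```python
import numpy as np

def sensitivity_analysis(predict, X_test):
    """Rank each input by how much it moves the network's output.

    predict: function mapping an (n_features,) vector to a scalar output
    X_test:  test-set inputs, shape (n_rows, n_features), scaled to -1..1
    """
    averages = X_test.mean(axis=0)        # step 1: average of each input
    baseline = predict(averages)          # step 2: output at the averages
    sensitivities = []
    for i in range(X_test.shape[1]):      # step 3: vary one input at a time
        probe = averages.copy()
        probe[i] = -1.0                   # minimum value
        low = predict(probe)
        probe[i] = 1.0                    # maximum value
        high = predict(probe)
        sensitivities.append(max(abs(low - baseline), abs(high - baseline)))
    # Larger values mean the network is more sensitive to that input.
    return np.array(sensitivities)
```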
There are variations on this procedure. It is possible to modify the values of two or three features at the same time to see if combinations of features have a particular importance. Sometimes, it is useful to start from a location other than the center of the test set. For instance, the analysis might be repeated for the minimum and maximum values of the features to see how sensitive the network is at the extremes. If sensitivity analysis produces significantly different results for these three situations, then there are higher order effects in the network that are taking advantage of combinations of features.
When using a feed-forward, back propagation network, sensitivity analysis can take advantage of the error measures calculated during the learning phase instead of having to test each feature independently. The validation set is fed into the network to produce the output, and the output is compared to the predicted output to calculate the error. The network then propagates the error back through the units, not to adjust any weights but to keep track of the sensitivity of each input. The error is a proxy for the sensitivity, determining how much each input affects the output in the network. Accumulating these sensitivities over the entire test set determines which inputs have the larger effect on the output. In our experience, though, the values produced in this fashion are not particularly useful for understanding the network.
T I P Neural networks do not produce easily understood rules that explain how they arrive at a given result. Even so, it is possible to understand the relative importance of inputs into the network by using sensitivity analysis. Sensitivity analysis can be a manual process where each feature is tested one at a time relative to the other features. It can also be more automated by using the sensitivity information generated by back propagation. In many situations, understanding the relative importance of inputs is almost as good as having explicit rules.
Self-organizing maps (SOMs) are a variant of neural networks used for undirected data mining tasks such as cluster detection. The Finnish researcher Dr. Teuvo Kohonen invented self-organizing maps, which are also called Kohonen Networks. Although used originally for images and sounds, these networks can also recognize clusters in data. They are based on the same underlying units as feed-forward, back propagation networks, but SOMs are quite different in two respects: they have a different topology, and the back propagation method of learning is no longer applicable. They have an entirely different method for training.
What Is a Self-Organizing Map?
The self-organizing map (SOM), an example of which is shown in Figure 7.13, is a neural network that can recognize unknown patterns in the data. Like the networks we’ve already looked at, the basic SOM has an input layer and an output layer. Each unit in the input layer is connected to one source, just as in the networks for predictive modeling. Also, like those networks, each unit in the SOM has an independent weight associated with each incoming connection (this is actually a property of all neural networks). However, the similarity between SOMs and feed-forward, back propagation networks ends here.

The output layer consists of many units instead of just a handful. Each of the units in the output layer is connected to all of the units in the input layer. The output layer is arranged in a grid, as if the units were in the squares on a checkerboard. Even though the units are not connected to each other in this layer, the grid-like structure plays an important role in the training of the SOM, as we will see shortly.
How does an SOM recognize patterns? Imagine one of the booths at a carnival where you throw balls at a wall filled with holes. If the ball lands in one of the holes, then you have your choice of prizes. Training an SOM is like being at the booth blindfolded when the wall initially has no holes, very similar to the situation when you start looking for patterns in large amounts of data and don’t know where to start. Each time you throw the ball, it dents the wall a little bit. Eventually, when enough balls land in the same vicinity, the indentation breaks through the wall, forming a hole. Now, when another ball lands at that location, it goes through the hole. You get a prize—at the carnival, this is a cheap stuffed animal; with an SOM, it is an identifiable cluster.
Figure 7.14 shows how this works for a simple SOM. When a member of the training set is presented to the network, the values flow forward through the network to the units in the output layer. The units in the output layer compete with each other, and the one with the highest value “wins.” The reward is to adjust the weights leading up to the winning unit to strengthen its response to the input pattern. This is like making a little dent in the network.
Figure 7.13 The self-organizing map is a special kind of neural network that can be used to detect clusters. (The input layer is connected to the inputs; the output layer is laid out like a grid, with each output unit connected to all the input units but not to each other; the output units compete with each other for the output of the network.)
There is one more aspect to the training of the network. Not only are the weights for the winning unit adjusted, but the weights for units in its immediate neighborhood are also adjusted to strengthen their response to the inputs. This adjustment is controlled by a neighborliness parameter that controls the size of the neighborhood and the amount of adjustment. Initially, the neighborhood is rather large, and the adjustments are large. As the training continues, the neighborhoods and adjustments decrease in size. Neighborliness actually has several practical effects. One is that the output layer behaves more like a connected fabric, even though the units are not directly connected to each other. Clusters similar to each other should be closer together than more dissimilar clusters. More importantly, though, neighborliness allows for a group of units to represent a single cluster. Without this neighborliness, the network would tend to find as many clusters in the data as there are units in the output layer—introducing bias into the cluster detection.
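A sketch of a single training step may make this concrete. The text describes the winner as the output unit with the highest value; the equivalent and more common formulation below picks the unit whose weight vector is closest to the input. The Gaussian neighborliness term, learning rate, and radius are illustrative choices; real implementations shrink the learning rate and radius as training proceeds:

```python
import numpy as np

def som_training_step(weights, grid, x, learning_rate=0.5, radius=2.0):
    """One training step for a self-organizing map.

    weights: (n_output_units, n_inputs) float array, each unit's weights
    grid:    (n_output_units, 2) array, each unit's (row, col) position
    x:       (n_inputs,) array, one training example
    """
    # The output units compete: the unit whose weights best match the
    # input wins (closest weight vector rather than highest output).
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))

    # Adjust the winner and, to a lesser degree, its grid neighbors.
    # The neighborliness factor decays with distance on the grid, so the
    # output layer behaves like a connected fabric.
    grid_dist = np.linalg.norm(grid - grid[winner], axis=1)
    neighborliness = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights += learning_rate * neighborliness[:, None] * (x - weights)
    return winner
```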
Figure 7.14 An SOM finds the output unit that does the best job of recognizing a particular input. (The figure highlights the winning output unit and the path of weights leading to it.)
Typically, an SOM identifies fewer clusters than it has output units. This is inefficient when using the network to assign new records to the clusters, since the new inputs are fed through the network to unused units in the output layer. To determine which units are actually used, we apply the SOM to the validation set. The members of the validation set are fed through the network, keeping track of the winning unit in each case. Units with no hits or with very few hits are discarded. Eliminating these units increases the run-time performance of the network by reducing the number of calculations needed for new instances.
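The pruning pass can be sketched in a few lines. The `min_hits` threshold is an assumption, since the text says only that units with no hits or very few hits are discarded:

```python
import numpy as np
from collections import Counter

def find_used_units(weights, validation_set, min_hits=5):
    """Feed the validation set through a trained SOM and keep only the
    output units that win often enough to represent a real cluster."""
    hits = Counter()
    for x in validation_set:
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        hits[winner] += 1
    # Units with no hits never appear in the counter; units with very
    # few hits fall below the threshold and are discarded.
    return {unit for unit, count in hits.items() if count >= min_hits}
```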
Once the final network is in place—with the output layer restricted only to the units that identify specific clusters—it can be applied to new instances.
Example: Finding Clusters
A large bank is interested in increasing the number of home equity loans that it sells; this situation provides an illustration of the practical use of clustering. The bank decides that it needs to understand customers who currently have home equity loans to determine the best strategy for increasing its market share. To start this process, demographics are gathered on 5,000 customers who have home equity loans and 5,000 customers who do not have them. Even though the proportion of customers with home equity loans is less than 50 percent, it is a good idea to have equal weights in the training set.
The data that is gathered has fields like the following:
■■ Appraised value of house
■■ Amount of credit available
■■ Amount of credit granted
■■ Marital status
■■ Number of children
■■ Household income

This data forms a good training set for clustering. The input values are mapped so they all lie between –1 and +1; these are used to train an SOM. The network identifies five clusters in the data, but it does not give any information about the clusters. What do these clusters mean?
A common technique to compare different clusters that works particularly well with neural network techniques is the average member technique. Find the most average member of each of the clusters—the center of the cluster. This is similar to the approach used for sensitivity analysis. To do this, find the average value for each feature in each cluster. Since all the features are numbers, this is not a problem for neural networks.

For example, say that half the members of a cluster are male and half are female, and that male maps to –1.0 and female to +1.0. The average member for this cluster would have a value of 0.0 for this feature. In another cluster,
there may be nine females for every male. For this cluster, the average member would have a value of 0.8. This averaging works very well with neural networks since all inputs have to be mapped into a numeric range.
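A sketch of the average member calculation, assuming each customer has already been assigned to a cluster (for instance, by recording the winning SOM unit). The data values are made up for illustration; the first cluster reproduces the half-male, half-female example above:

```python
import pandas as pd

# Made-up scaled features (all mapped into -1.0 to +1.0), with each
# customer assigned to a cluster by the SOM's winning output unit.
customers = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "gender":  [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0],
    "income":  [0.2, 0.4, 0.1, 0.3, 0.7, 0.8, 0.9, 0.6, 0.7],
})

# The most average member of each cluster is the per-cluster mean of
# every feature, which works because all inputs are numeric. Cluster 0
# is half male, half female, so its gender value comes out to 0.0.
centers = customers.groupby("cluster").mean()
print(centers)
```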
T I P Self-organizing maps, a type of neural network, can identify clusters but they do not identify what makes the members of a cluster similar to each other.
A powerful technique for comparing clusters is to determine the center or average member in each cluster. Using the test set, calculate the average value for each feature in the data. These average values can then be displayed in the same graph to determine the features that make a cluster unique.
These average values can then be plotted using parallel coordinates as in Figure 7.15, which shows the centers of the five clusters identified in the banking example. In this case, the bank noted that one of the clusters was particularly interesting, consisting of married customers in their forties with children. A bit more investigation revealed that these customers also had children in their late teens. Members of this cluster had more home equity lines than members of other clusters.
Figure 7.15 The centers of five clusters are compared on the same graph. This simple visualization technique (called parallel coordinates) helps identify interesting clusters. (The axes are available credit, credit balance, age, marital status, number of children, and income, each scaled from –1.0 to 1.0. A callout flags one interesting cluster: high-income customers with children in the middle age group who are taking out large loans.)
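For readers who want to reproduce such a plot, pandas ships a parallel coordinates helper. The cluster centers below are made-up placeholders, not the values from Figure 7.15:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Made-up cluster centers, one row per cluster, features scaled to -1..+1.
centers = pd.DataFrame({
    "cluster": [0, 1, 2],
    "available_credit": [0.3, -0.5, 0.8],
    "credit_balance": [0.1, -0.2, 0.7],
    "age": [-0.4, 0.2, 0.1],
    "income": [0.0, -0.6, 0.9],
})

# Each center becomes one line across the feature axes; a line that
# stands apart from the others flags an interesting cluster.
parallel_coordinates(centers, "cluster")
plt.ylim(-1.0, 1.0)
plt.show()
```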
The story continues with the Marketing Department of the bank concluding that these people had taken out home equity loans to pay college tuition fees. The department arranged a marketing program designed specifically for this market, selling home equity loans as a means to pay for college education. The results from this campaign were disappointing; the marketing program was not successful.

Since the marketing program failed, it may seem as though the clusters did not live up to their promise. In fact, the problem lay elsewhere. The bank had initially used only general customer information. It had not combined information from the many different systems servicing its customers. The bank returned to the problem of identifying customers, but this time it included more information—from the deposits system, the credit card system, and so on.

The basic methods remained the same, so we will not go into detail about the analysis. With the additional data, the bank discovered that the cluster of customers with college-age children did actually exist, but a fact had been overlooked. When the additional data was included, the bank learned that the customers in this cluster also tended to have business accounts as well as personal accounts. This led to a new line of thinking: when the children leave home to go to college, the parents have the opportunity to start a new business by taking advantage of the equity in their home.

With this insight, the bank created a new marketing program targeted at the parents, about starting a new business in their empty nest. This program succeeded, and the bank saw improved performance from its home equity loans group. The lesson of this case study is that, although SOMs are powerful tools for finding clusters, neural networks really are only as good as the data that goes into them.
Lessons Learned
Neural networks are a versatile data mining tool. Across a large number of industries and a large number of applications, neural networks have proven themselves over and over again. These results come in complicated domains, such as analyzing time series and detecting fraud, that are not easily amenable to other techniques. The largest neural network developed for production is probably the system that AT&T developed for reading numbers on checks. This neural network has hundreds of thousands of units organized into seven layers. The foundation of neural networks is based on biological models of how brains work. Although the basic ideas predate digital computers, they have proven useful. In biology, neurons fire after their inputs reach a certain threshold. This model can be implemented on a computer as well. The field has really taken off since the 1980s, when statisticians started to use neural networks and understand them better.
A neural network consists of artificial neurons connected together. Each neuron mimics its biological counterpart, taking various inputs, combining them, and producing an output. Since digital neurons process numbers, the activation function characterizes the neuron. In most cases, this function takes the weighted sum of its inputs and applies an S-shaped function to it. The result is a node that sometimes behaves in a linear fashion, and sometimes behaves in a nonlinear fashion—an improvement over standard statistical techniques.
The most common network is the feed-forward network for predictive modeling. Although originally a breakthrough, the back propagation training method has been replaced by other methods, notably conjugate gradient. These networks can be used for both categorical and continuous inputs. However, neural networks learn best when input fields have been mapped to the range between –1 and +1. This is a guideline to help train the network; neural networks still work when a small amount of data falls outside the range and for more limited ranges, such as 0 to 1.
Neural networks do have several drawbacks. First, they work best when there are only a few input variables, and the technique itself does not help choose which variables to use. Variable selection is an issue; other techniques, such as decision trees, can come to the rescue. Also, when training a network, there is no guarantee that the resulting set of weights is optimal. To increase confidence in the result, build several networks and take the best one.

Perhaps the biggest problem, though, is that a neural network cannot explain what it is doing. Decision trees are popular because they can provide a list of rules. There is no way to get an accurate set of rules from a neural network. A neural network is explained by its weights and a very complicated mathematical formula. Unfortunately, making sense of this is beyond our human powers of comprehension.

Variations on neural networks, such as self-organizing maps, extend the technology to undirected clustering. Overall, neural networks are very powerful and can produce good models; they just can’t tell us how they do it.
Nearest Neighbor Approaches: Memory-Based Reasoning and Collaborative Filtering
You hear someone speak and immediately guess that she is from Australia. Why? Because her accent reminds you of other Australians you have met. Or you try a new restaurant expecting to like it because a friend with good taste recommended it. Both cases are examples of decisions based on experience. When faced with new situations, human beings are guided by memories of similar situations they have experienced in the past. That is the basis for the data mining techniques introduced in this chapter.

Nearest neighbor techniques are based on the concept of similarity. Memory-based reasoning (MBR) results are based on analogous situations in the past—much like deciding that a new friend is Australian based on past examples of Australian accents. Collaborative filtering adds more information, using not just the similarities among neighbors, but also their preferences. The restaurant recommendation is an example of collaborative filtering.
Central to all these techniques is the idea of similarity. What really makes situations in the past similar to a new situation? Along with finding the similar records from the past, there is the challenge of combining the information from the neighbors. These are the two key concepts for nearest neighbor approaches.
This chapter begins with an introduction to MBR and an explanation of how it works. Since measures of distance and similarity are important to nearest neighbor techniques, there is a section on distance metrics, including a discussion of the meaning of distance for data types, such as free text, that have no obvious geometric interpretation. The ideas of MBR are illustrated through a case study showing how MBR has been used to attach keywords to news stories. The chapter then looks at collaborative filtering, a popular approach to making recommendations, especially on the Web. Collaborative filtering is also based on nearest neighbors, but with a slight twist—instead of grouping restaurants or movies into neighborhoods, it groups the people recommending them.
Memory-Based Reasoning
The human ability to reason from experience depends on the ability to recognize appropriate examples from the past. A doctor diagnosing diseases, a claims analyst flagging fraudulent insurance claims, and a mushroom hunter spotting morels are all following a similar process. Each first identifies similar cases from experience and then applies their knowledge of those cases to the problem at hand. This is the essence of memory-based reasoning. A database of known records is searched to find preclassified records similar to a new record. These neighbors are used for classification and estimation.
Applications of MBR span many areas:

■■ Fraud detection

■■ Customer response prediction

■■ Medical treatments. The most effective treatment for a given patient is probably the treatment that resulted in the best outcomes for similar patients. MBR can find the treatment that produces the best outcome.

■■ Classifying responses. Free-text responses, such as those on the U.S. Census form for occupation and industry or complaints coming from customers, need to be classified into a fixed set of codes. MBR can process the free text and assign the codes.
One of the strengths of MBR is its ability to use data “as is.” Unlike other data mining techniques, it does not care about the format of the records. It only cares about the existence of two operations: a distance function capable of calculating a distance between any two records, and a combination function capable of combining results from several neighbors to arrive at an answer. These functions are readily defined for many kinds of records, including records with complex or unusual data types such as geographic locations, images, and free text that are usually difficult to handle with other analysis techniques. A case study later in the chapter shows MBR’s successful application to the classification of news stories—an example that takes advantage of the full text of the news story to assign subject codes.
Another strength of MBR is its ability to adapt. Merely incorporating new data into the historical database makes it possible for MBR to learn about new categories and new definitions of old ones. MBR also produces good results without a long period devoted to training or to massaging incoming data into the right format.
These advantages come at a cost. MBR tends to be a resource hog, since a large amount of historical data must be readily available for finding neighbors. Classifying new records can require processing all the historical records to find the most similar neighbors—a more time-consuming process than applying an already-trained neural network or an already-built decision tree. There is also the challenge of finding good distance and combination functions, which often requires a bit of trial and error and intuition.
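To make the two operations concrete, here is a minimal MBR sketch. The Euclidean distance and majority-vote combination function shown are common choices rather than the only ones, and the record layout is an assumption:

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function: how far apart two numeric records are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_mbr(new_record, historical, k=3):
    """Find the k nearest preclassified neighbors of new_record and
    combine their classes by majority vote."""
    nearest = sorted(
        historical,
        key=lambda rec: distance(new_record, rec["fields"]),
    )[:k]
    votes = Counter(rec["class"] for rec in nearest)
    return votes.most_common(1)[0][0]

# Example: classify a new record against a tiny historical database.
history = [
    {"fields": (0.1, 0.9), "class": "responder"},
    {"fields": (0.2, 0.8), "class": "responder"},
    {"fields": (0.9, 0.1), "class": "non-responder"},
]
print(classify_mbr((0.15, 0.85), history))   # responder
```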
Example: Using MBR to Estimate Rents in Tuxedo, New York
The purpose of this example is to illustrate how MBR works by estimating the cost of renting an apartment in the target town by combining data on rents in several similar towns—its nearest neighbors.
MBR works by first identifying neighbors and then combining information from them. Figure 8.1 illustrates the first of these steps. The goal is to make predictions about the town of Tuxedo in Orange County, New York, by looking at its neighbors. Not its geographic neighbors along the Hudson and Delaware rivers, but rather its neighbors based on descriptive variables—in this case, population and median home value. The scatter plot shows New York towns arranged by these two variables. Figure 8.1 shows that measured this way, Brooklyn and Queens are close neighbors, and both are far from Manhattan. Although Manhattan is nearly as populous as Brooklyn and Queens, its home prices put it in a class by itself.
T I P Neighborhoods can be found in many dimensions. The choice of dimensions determines which records are close to one another. For some purposes, geographic proximity might be important. For other purposes, home price, average lot size, or population density might be more important. The choice of dimensions and the choice of a distance metric are crucial to any nearest-neighbor approach.
The first stage of MBR finds the closest neighbor on the scatter plot shown in Figure 8.1. Then the next closest neighbor is found, and so on until the desired number are available. In this case, the number of neighbors is two, and the nearest ones turn out to be Shelter Island (which really is an island) way out by the tip of Long Island’s North Fork, and North Salem, a town in northern Westchester near the Connecticut border. These towns fall at about the middle of a list sorted by population and near the top of one sorted by home value. Although they are many miles apart, along these two dimensions Shelter Island and North Salem are very similar to Tuxedo.
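A sketch of this neighbor-finding step appears below. The population and home-value figures are placeholders rather than actual census values, and the ad hoc scaling is just one way to keep population from swamping home value:

```python
# (town, population, median home value): placeholder values, not census data.
towns = [
    ("Shelter Island", 2_200, 350_000),
    ("North Salem", 5_100, 370_000),
    ("Brooklyn", 2_465_000, 180_000),
    ("Manhattan", 1_537_000, 500_000),
]
tuxedo = ("Tuxedo", 3_300, 360_000)

def scaled_distance(a, b):
    # Rescale each dimension before measuring distance, so that raw
    # population counts do not swamp home values.
    return abs(a[1] - b[1]) / 1_000_000 + abs(a[2] - b[2]) / 500_000

neighbors = sorted(towns, key=lambda t: scaled_distance(tuxedo, t))[:2]
print([t[0] for t in neighbors])   # ['Shelter Island', 'North Salem']
```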
Once the neighbors have been located, the next step is to combine information from the neighbors to infer something about the target. For this example, the goal is to estimate the cost of renting a house in Tuxedo. There is more than one reasonable way to combine data from the neighbors. The census provides information on rents in two forms. Table 8.1 shows what the 2000 census reports about rents in the two towns selected as neighbors. For each town, there is a count of the number of households paying rent in each of several price bands, as well as the median rent for each town. The challenge is to figure out how best to use this data to characterize rents in the neighbors, and then how to combine information from the neighbors to come up with an estimate that characterizes rents in Tuxedo in the same way.
Tuxedo’s nearest neighbors, the towns of North Salem and Shelter Island, have quite different distributions of rents even though the median rents are similar. In Shelter Island, a plurality of homes, 34.6 percent, rent in the $500 to $750 range. In the town of North Salem, the largest number of homes, 30.9 percent, rent in the $1,000 to $1,500 range. Furthermore, while only 3.1 percent of homes in Shelter Island rent for over $1,500, 24.2 percent of homes in North Salem do. On the other hand, at $804, the median rent in Shelter Island is above the $750 ceiling of the most common range, while the median rent in North Salem, $1,150, is below the floor of the most common range for that town. If the average rent were available, it too would be a good candidate for characterizing the rents in the various towns.
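Whichever characterization is chosen, the simplest combination function averages the neighbors’ values. A sketch using the median rents quoted above:

```python
# Median rents for the two nearest neighbors, as quoted in the text.
neighbor_median_rents = {"Shelter Island": 804, "North Salem": 1150}

# The simplest combination function averages the neighbors' medians to
# produce one estimate of a comparable figure for Tuxedo.
estimate = sum(neighbor_median_rents.values()) / len(neighbor_median_rents)
print(f"Estimated median rent in Tuxedo: ${estimate:.0f}")   # $977
```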
Table 8.1 The Neighbors