TABLE 8.3 The effect of missing values (?.??) on the summary values of x and y.

A missing x value destroys the ability to know the sums for x, x², and xy! What to do?
Since getting the aggregated values correct is critical, the modeler requires some method to determine the appropriate values, even with missing values. This sounds a bit like pulling oneself up by one's bootstraps! Estimate the missing values to estimate the missing values! However, things are not quite so difficult.
In a representative sample, for any particular joint distribution, the ratios between the various values Σx and Σx², and Σy and Σy², remain constant. So too do the ratios between Σx and Σxy, and Σy and Σxy. When these ratios are found, they are the equivalent of setting the value of n to 1. One way to see why this is so is that in any representative sample the ratios are constant, regardless of the number of instance values, and that includes n = 1. More mathematically, the effect of the number of instances cancels out. The end result is that when using ratios, n can be set to unity. In the linear regression formulae, values are multiplied by n, and multiplying a value by 1 leaves the original value unchanged. When multiplying by n = 1, the n can be left out of the expression. In the calculations that follow, that piece is dropped since it has no effect on the result.
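The stability of these ratios is easy to check empirically. The following Python sketch (invented data, not part of the book's demonstration software) computes the aggregates Σx, Σx², and Σxy for a subsample of 50 instances and for the full sample of 1000; the ratios barely move:

```python
import random

random.seed(0)

def sums(pairs):
    """The aggregate values used in the linear regression formulae."""
    sx  = sum(x for x, _ in pairs)
    sx2 = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    return sx, sx2, sxy

# A representative sample from a roughly linear relationship y = 1 - x.
xs = [random.uniform(0.1, 0.9) for _ in range(1000)]
pairs = [(x, 1.0 - x + random.gauss(0, 0.01)) for x in xs]

# The ratios barely change between a subsample of 50 and the full sample.
ratios = {}
for n in (50, len(pairs)):
    sx, sx2, sxy = sums(pairs[:n])
    ratios[n] = (sx2 / sx, sxy / sx)
```

Because the ratios are (nearly) independent of sample size, they can be treated as properties of the joint distribution itself, which is exactly what lets n be set to unity.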
The key to building the regression equations lies in discovering the needed ratios for those values that are jointly present. Given the present and missing values that are shown in the lower part of Table 8.3, what are the ratios?
Table 8.4 shows the ratios determined from the three instance values where x and y are both present. Using the expressions for linear regression and these ratios, what is the estimated value for the missing y value from Table 8.3?
TABLE 8.4 Ratios of the values that are present in the lower part of Table 8.3.

The mean values are taken for the values of each variable that are jointly present, as shown in Table 8.5.

TABLE 8.5 Mean values of x and y for estimating missing values.
TABLE 8.6 Showing ratio-derived estimated values for Σx² and Σxy.

Σx = 0.43; estimated Σx² = 0.43 × 0.45 = 0.1935; estimated Σxy = 0.43 × 0.61 = 0.2623
Plugging these values into the expression to find b gives

So b = –1. The negative sign indicates that values of y will decrease as values of x increase. Given this value for b, a can be found:
The a value is 1.06. With suitable values discovered for a and b, and using the formula for a straight line, an expression can be built that will provide an appropriate estimate for any missing value of y, given a value of x. That expression is

y = a + bx = 1.06 – x
The estimated values for y are 0.32, 0.83, 0.74, and 0.09.
These estimates of y are quite close to the original values in this example. The error, the difference between the original value and the estimate, is small compared to the actual value.
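The whole replacement calculation can be sketched in a few lines of Python. The numbers here are invented for illustration (they are not the values from the book's tables); the formulae are the standard least-squares expressions for b and a:

```python
def fit_line(pairs):
    """Least-squares slope b and intercept a from jointly present (x, y) pairs."""
    n = len(pairs)
    sx  = sum(x for x, _ in pairs)
    sy  = sum(y for _, y in pairs)
    sx2 = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    b = (n * sxy - sx * sy) / (n * sx2 - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Instances where both x and y are present (invented values).
present = [(0.2, 0.85), (0.5, 0.57), (0.7, 0.35)]
a, b = fit_line(present)

def replace_missing_y(x):
    """Estimate a missing y from the straight-line expression y = a + bx."""
    return a + b * x

estimate = replace_missing_y(0.4)
```

With these invented points, which lie close to the line y ≈ 1.05 – x, the fitted slope comes out near –1 and the intercept near 1.05, echoing the shape of the worked example above.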
Multiple Linear Regression
The equations used for performing multiple regression are extensions of those already used for linear regression. They are built from the same components as linear regression (Σx, Σx², Σxy) for every pair of variables included in the multiple regression. (Each variable becomes x in turn, and for that x, each of the other variables becomes y in turn.) All of these values can be estimated by finding the ratio relationships for those variables' values that are jointly present in the initial sample data set. With this information available, good linear estimates of the missing values of any variable can be made using whatever variable instance values are actually present.
With the ratio information known for all of the variables, a suitable multiple regression can be constructed for any pattern of missing values, whether it was ever experienced before or not. Appropriate equations for the instance values that are present in any instance can be easily constructed from the ratio information. These equations are then used to predict the missing values.
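One way to see how stored aggregates support any pattern of missing values is to keep the full matrix of sums and cross-products and solve the normal equations for whichever variable happens to be missing. The sketch below (Python with NumPy; the data and coefficients are invented) builds the regression on the fly from the stored sums alone:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented complete sample: x2 depends linearly on x0 and x1.
data = rng.uniform(size=(500, 3))
data[:, 2] = 0.5 * data[:, 0] - 0.3 * data[:, 1] + rng.normal(0, 0.01, 500)

# Stored once: the Gram matrix of sums and cross-products (constant term included).
aug = np.column_stack([np.ones(len(data)), data])
G = aug.T @ aug

def estimate_missing(values, missing):
    """Build a regression for the missing column from the stored sums alone."""
    present = [0] + [i + 1 for i in range(data.shape[1]) if i != missing]
    A = G[np.ix_(present, present)]         # normal-equation matrix
    bvec = G[present, missing + 1]          # cross-products with the target
    coef = np.linalg.solve(A, bvec)
    xrow = [1.0] + [values[i] for i in range(data.shape[1]) if i != missing]
    return float(np.array(xrow) @ coef)

# An instance with x2 missing: regress x2 on x0 and x1 on the fly.
est = estimate_missing({0: 0.4, 1: 0.6}, missing=2)
```

Because only the cross-product sums are stored, the same bookkeeping serves every missing-value pattern without training a separate model for each one.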
For a statistician trying to build predictions, or glean inferences from a data set, this technique presents certain problems. However, the problems facing the modeler when replacing data are very different, for the modeler requires a computationally tractable method that introduces as little bias as is feasible when replacing missing values. The missing-value replacements themselves should contribute no information to the model. What they do is allow the information that is present (the nonempty instance values) to be used by the modeling tool, adding as little extraneous distortion to a data set as possible.
It may seem strange that the replacement values should contribute no information to a data set. However, any replacement value can only be generated from information that is already present in the form of other instance values. The regression equations fit the replacement value in such a way that it least distorts the linear relationships already discovered. Since the replacement value is derived exclusively from information that is already present in the data set, it can only reexpress the information that is already present. New information, being new, changes what is already known to a greater or lesser degree, actually defining the relationship. Replacement values should contribute as little as possible to changing the shape of the relationships that already exist. The existing relationship is what the modeler needs to explore, not some pattern artificially constructed by replacing missing values!
Alternative Methods of Missing-Value Replacement
Preserving joint variability between variables is far more effective at providing unbiased replacement values than methods that do not preserve variability. In practice, many variables do have essentially linear between-variable relationships. Even where the relationship is nonlinear, a linear estimate, for the purpose of finding a replacement for a missing value, is often perfectly adequate. The minute amount of bias introduced is often below the noise level in the data set anyway and is effectively unnoticeable.
Compared to finding nonlinear relationships, discovering linear relationships is both fast and easy. This means that linear techniques can be implemented to run fast on modern computers, even when the dimensionality of a data set is high. Considering the small amount of distortion usually associated with linear techniques, the trade-offs in terms of speed and flexibility are heavily weighted in favor of their use. The replacement values can be generated dynamically (on the fly) at run time and substituted as needed.
However, there are occasions when the relationship is clearly nonlinear, and when a linear estimate for a replacement value may introduce significant bias. If the modeler knows that the relationship exists, some special replacement procedure for missing values can be used. The real problem arises when a significantly nonlinear relationship exists that is unknown to the modeler and domain expert. Mining will discover this relationship, but if there are missing values, linear estimates for replacements will produce bias and distortion. Addressing these problems is outside the scope of the demonstration software, which is intended only to illustrate the principles involved in data preparation.
There are several possible ways to address the problem. Speed in finding replacement values is important for deployed production systems. In a typical small direct marketing application, for instance, a solicitation mailing model may require replacing anything from 1 million to 20 million values. As another example, large-scale, real-time fraud detection systems may need from tens to hundreds of millions of replacement values daily.
Tests of Nonlinearity: Extending the Ratio Method of Estimation
There are tests to determine nonlinearity in a relationship. One of the easiest is to simply try nonlinear regressions and see if the fit improves as the nonlinearity of the expression increases. This is certainly not foolproof. Highly nonlinear relationships may well not gradually improve their fit as the nonlinearity of the expression is increased.
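A minimal version of this test can be sketched by fitting polynomial regressions of increasing degree and comparing the residual error. The data below is invented, with a deliberately cubic relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = x**3 - 0.5 * x + rng.normal(0, 0.02, 200)   # a clearly nonlinear relationship

def fit_error(degree):
    """Sum of squared residuals of a polynomial regression of the given degree."""
    coefs = np.polyfit(x, y, degree)
    return float(np.sum((np.polyval(coefs, x) - y) ** 2))

# If the error keeps dropping as the degree rises, the relationship is nonlinear.
errors = {d: fit_error(d) for d in (1, 2, 3)}
```

Here the degree-3 fit leaves only the noise as residual, while the degree-1 fit leaves most of the cubic structure behind. As the text warns, though, a relationship that is highly nonlinear in some other way may show no such gradual improvement.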
An advantage of this method is that the ratio method already described can be extended to capture nonlinear relationships. The level of computational complexity increases considerably, but not as much as with some other methods. The difficulty is that choosing the degree of nonlinearity to use is fairly arbitrary. There are robust methods to determine the amount of nonlinearity that can be captured at any chosen degree of nonlinearity without requiring that the full nonlinear multiple regressions be built at every level. This allows a form of optimization to be included in the nonlinearity estimation and capture. However, there is still no guarantee that nonlinearities that are actually present will be captured. The amount of data that has to be captured is quite considerable but relatively modest compared with other methods, and remains quite tractable.
At run time, missing-value estimates can be produced very quickly using various optimization techniques. The missing-value replacement rate is highly dependent on many factors, including the dimensionality of the data set and the speed of the computer, to name only two. However, in practical deployed production systems, replacement rates exceeding 1000 replacements per second, even in large or high-throughput data sets, can be easily achieved on modern PCs.
Nonlinear Submodels
Another method of capturing the nonlinearities is to use a modeling tool that supports such a model. Neural networks work well (described briefly in Chapter 10). In this case, for each variable in the data set, a subsample is created that has no missing values. This is required because unmodified neural networks do not handle missing values; they assume that all inputs have some value. A predictive model for every variable is constructed from all of the other variables, and for the MVPs. When a missing value is encountered, the appropriate model is used to predict its value from the available variable values.
There are significant drawbacks to such a method. The main flaw is that it is impossible to train a network for every possible pattern of missing values. Training networks for all of the detected missing patterns in the sample may itself be an enormous task. Even when done, there is no prediction possible when the population produces a previously unencountered MVP, since there is no network trained for that configuration. Similarly, the storage requirements for the number of networks may be unrealizable.
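The bookkeeping, one model per missing-value pattern, can be sketched as follows. For brevity, simple least-squares models stand in for the neural networks the text describes; the data and the two trained patterns are invented. Note how an unseen MVP has no model, so the lookup fails:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented sample of three variables; the third is determined by the others.
full = rng.uniform(size=(300, 3))
full[:, 2] = 0.7 * full[:, 0] + 0.2 * full[:, 1]

def train_submodel(inputs, target):
    """A least-squares submodel standing in for a trained neural network."""
    A = np.column_stack([full[:, inputs], np.ones(len(full))])
    coef, *_ = np.linalg.lstsq(A, full[:, target], rcond=None)
    return coef

# One model per missing-value pattern (MVP) detected in the sample.
submodels = {(2,): train_submodel([0, 1], 2),
             (1,): train_submodel([0, 2], 1)}

def impute(row, missing):
    """Look up the model for this MVP; an unseen MVP raises KeyError."""
    coef = submodels[tuple(missing)]
    present = [i for i in range(3) if i not in missing]
    return float(np.append(row[present], 1.0) @ coef)

est = impute(np.array([0.5, 0.4, np.nan]), missing=(2,))
```

With d variables there are up to 2^d − 1 possible MVPs, which is the combinatorial explosion that makes the per-pattern approach so hard to scale.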
A modification of this method builds fewer models by using subsets of variables as inputs. If the subset inputs are carefully selected, models can be constructed such that, among them, there is a very high probability that at least one will be applicable. This approach requires constructing multiple, relatively small networks for each variable. However, such an approach can become intractable very quickly as the dimensionality of the data set increases.
An additional problem is that it is hard to determine the appropriate level of complexity. Missing-value estimates are produced slowly at run time since, for every value, the appropriate network has to be looked up, loaded, and run, and its output produced.
Autoassociative Neural Networks
Autoassociative neural networks are briefly described in Chapter 10. In this architecture, all of the inputs are also used as predicted outputs. Using such an architecture, only a single neural network need be built. When one or more missing values are detected, the network can be used in a back-propagation mode, but not a training mode, as no internal weights are adjusted. Instead, the errors are propagated all the way back to the inputs. At the input, an appropriate value can be derived for each missing value so that it least disturbs the internal structure of the network. The values so derived for any set of inputs reflect, and least disturb, the nonlinear relationship captured by the autoassociative neural network.
As with any neural network, its internal complexity determines the network's ability to capture nonlinear relationships. Determining that any particular network has, in fact, captured the extant nonlinear relationship is difficult. The autoassociative neural network approach has been used with success in replacing missing values for data sets of modest dimensionality (tens and very low hundreds of inputs), but building such networks for moderate- to high-dimensionality data sets is problematic and slow. The amount of data required to build a robust network becomes prohibitive, and for replacement-value generation a robust network that actually reflects nonlinearities is needed.
At run time, replacement values can be produced fairly quickly.
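The idea of propagating error back to the inputs can be sketched with a linear stand-in for the trained network. In the sketch below (Python with NumPy; the data is invented), a principal-component projection plays the role of the autoassociator, since a linear bottleneck autoassociator learns exactly that subspace. The missing input is adjusted by gradient descent, with the known inputs held fixed, until the reconstruction error is minimized:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented training data lying near a plane: x2 close to 0.6*x0 + 0.4*x1.
X = rng.uniform(size=(400, 3))
X[:, 2] = 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.01, 400)

# Stand-in for the trained network: projection onto the top two principal
# components (a linear bottleneck autoassociator learns this same subspace).
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:2].T @ Vt[:2]

def reconstruct(x):
    return mu + (x - mu) @ P

def fill_missing(x, j, steps=300, lr=0.4):
    """Hold the known inputs fixed; descend the error gradient on input j."""
    x = x.copy()
    x[j] = mu[j]                        # start the missing input at its mean
    for _ in range(steps):
        r = reconstruct(x) - x          # error propagated back to the inputs
        grad = 2.0 * r @ (P[j] - np.eye(3)[j])
        x[j] -= lr * grad
    return float(x[j])

# Replace a missing x2 so that it least disturbs the captured structure.
est = fill_missing(np.array([0.3, 0.5, 0.0]), j=2)
```

The derived value settles on the point that best fits the structure the "network" has captured, here the plane through the data, which is the sense in which the replacement least disturbs the learned relationship.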
Nearest-Neighbor Estimators
Nearest-neighbor methods rely on having the training set available at run time. The method requires finding the point in state space best represented by the partially complete instance, finding the neighbors nearest to that point, and using some metric to derive the missing values. It depends on the assumption that representative near neighbors can be found despite the fact that one or more dimensional values are missing. This can make it difficult to determine a point in state space that is representative, given that its position in the dimensions whose values are missing is unknown. Nonetheless, such methods can produce good estimates for missing values, and they are inherently nonlinear so long as representative near neighbors can be found.
The main drawbacks are that having the training data set available, even in some collapsed form, may require very significant storage, and that lookup times for neighbors can be very slow, which makes finding replacement values slow as well.
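A bare-bones nearest-neighbor replacement can be sketched as follows (pure Python; the training data is invented). Distance is measured only over the dimensions that are present, and the missing dimension is averaged over the k nearest neighbors:

```python
import math
import random

random.seed(4)

# Invented training set with a nonlinear relationship: y = x0 * x1.
train = [(x0, x1, x0 * x1)
         for x0, x1 in [(random.random(), random.random()) for _ in range(2000)]]

def knn_impute(row, missing, k=5):
    """Average the missing dimension over the k nearest neighbors, measuring
    distance only in the dimensions that are actually present."""
    present = [i for i in range(3) if i != missing]
    def dist(r):
        return math.dist([row[i] for i in present], [r[i] for i in present])
    neighbors = sorted(train, key=dist)[:k]
    return sum(r[missing] for r in neighbors) / k

# Estimate the missing third value of the instance (0.5, 0.8, ?).
est = knn_impute((0.5, 0.8, None), missing=2)
```

The full scan over the training set in this sketch is exactly the slow lookup the text warns about; production systems need spatial indexes or collapsed representations to make it tolerable.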
Chapter 9: Series Variables
Overview
Series variables have a number of characteristics that are sufficiently different from other types of variables that they need examining in more detail. Series variables are always at least two-dimensional, although one of the dimensions may be implicit. The most common type of series variable is a time series, in which a series of values of some feature or event are recorded over a period of time. The series may consist of only a list of measurements, giving the appearance of a single dimension, but the ordering is by time, which, for a time series, is the implicit variable.
The series values are always measured on one of the scales already discussed, nominal through ratio, and are presented as an ordered list. It is the ordering, the expression of the implied variable, that requires series data to be prepared for mining using techniques in addition to those discussed for nonseries data. Without these additional techniques, the miner will not be able to best expose the available information. This is because series variables carry additional information within the ordering that is not exposed by the techniques discussed so far.
Up to this point in the book we have developed precise descriptions of features of nonseries data, and various methods for manipulating the identified features to expose information content. This chapter does the same for series data and so has two main tasks:
1. Find unambiguous ways to describe the component features of a series data set so that it can be accurately and completely characterized.
2. Find methods for manipulating the unique features of series data to expose the information content to mining tools.
Series data has features that require more involvement by the miner in the preparation process than does nonseries data. Where miner involvement is required, fully automated preparation tools cannot be used. The miner simply has to be involved in the preparation and must exercise judgment and experience. Much of the preparation requires visualizing the data set and manipulating the series features discussed. There are a number of excellent commercial tools for series data visualization and manipulation, so the demonstration software does not include support for these functions. Thus, instead of concluding with implementation notes discussing how the features described in the chapter are put into practice, this chapter concludes with a suggested checklist of actions for preparing series data for the miner to use.
9.1 Here There Be Dragons!
Mariners and explorers of old used fanciful and not always adequate maps. In unexplored or unknown territory, the map warned of dragons, the terrors of the unknown. So it is when preparing data, for the miner knows at least some of the territory. Many data explorers have passed this way. A road exists. Signposts point the way. Maybe the dragons were chased away, but still be warned: “Danger, quicksand!” Trouble lurks inside series data; the road of data preparation is rocky and uncertain, sometimes ending mired in difficulties. It is all too easy to seriously damage data, render it useless, or, worse, create wonderful-looking distortions that are but chimera that melt away when exposed to the bright light of reality. Like all explorers faced with uncertainty, the miner needs to exercise care and experience here more than elsewhere. The road is rough and not always well marked. Unfortunately, the existing signposts, with the best of intentions, can still lead the miner seriously astray. Tread this path with caution!
9.2 Types of Series
Nonseries multivariable measurements are taken without any particular note of their ordering. Ordering is a critical feature of a series: unless ordered, it’s not a series. One of the variables (called the displacement variable, and described in a moment) is always monotonic, either constantly increasing or constantly decreasing. Whether there are one or several other variables in the series, their measurements are taken at defined points on the range of the monotonic variable. The key ordering feature is the change in the monotonic variable as its values change across part or all of its range.
Time series are by far the most common type of series. Measurements of one variable are taken at different times and ordered such that an earlier measurement always comes before a later measurement. For a time series, time is the displacement variable; the measurements of the other variable (or variables) are made as time is “displaced,” or changed. The displacement variable is also called the index variable, because the points along the displacement variable at which the measurements are taken are called the index points.
Dimensions other than time can serve as the displacement dimension. Distance, for instance, can be used. For example, measuring the height of the American continent above sea level at different points on a line extending from the Atlantic to the Pacific produces a distance-displacement series.
Since time series are the most common series, where this chapter makes assumptions, a time series will be assumed. The issues and techniques described for time series also apply to any other displacement series. Series, however indexed, share many features in common, and techniques that apply to one type of series usually apply to other types. Although the exact nature of the displacement variable may make little difference to the preparation, and even, to some degree, the analysis of the series itself, it makes all the difference to the interpretation of the result!
9.3 Describing Series Data
Series data differs from the forms of data so far discussed mainly in the way in which the data enfolds the information. The main difference is that the ordering of the data carries information. This ordering, naturally, precludes random sampling, since random sampling deliberately avoids, and actually destroys, any ordering. Preserving the ordering is the main reason that series data has to be prepared differently from nonseries data.
There is a large difference between preparing data for modeling and actually modeling the data. This book focuses almost entirely on how to prepare the data for modeling, leaving aside almost all of the issues about the actual modeling, insofar as is practical. The same approach will apply to series data. Some of the tools needed to address the data preparation problems may look similar, indeed are similar, to those used to model and glean information and insight from series data. However, they are put to different purposes when preparing data. That said, in order to understand some of the potential problems and how to address them, some precise method of describing a series is needed. A key question is, what are the features of series data?
To answer this question, the chapter will first identify some consistent, recognizable, and useful features of series data. The features described have to be consistent and recognizable as well as useful. The useful features are those that best help the miner in preparing series data for modeling. The miner also needs these same features when modeling. This is not surprising, as finding the best way to expose the features of interest for modeling is the main objective of data preparation.
1. The feature or event is recorded as numerical information.
2. The index point information is either recorded, or at least the displacements are defined.
3. The index, if recorded, is recorded numerically.
It is quite possible to record a time series using alpha labels for the nondisplacement dimension, but this is extremely rare. Numerating such alpha values within the series is possible, although it requires extremely complex methods. While it is very unusual indeed to encounter series with one alpha dimension, it is practically unknown to find a series with an alpha-denominated displacement variable. The displacement dimension has to be at least an ordinal variable (more likely ratio), and these are invariably numerical. Because series with all dimensions numerical are so prevalent, we will focus entirely on those.
It is also quite possible to record multivariable series sharing a common displacement variable, in other words, capturing several features or events at each index mark. An example is collecting figures for sales, backlog, new orders, and inventory level every week. “Time” is the displacement variable for all the measurements, and the index point is weekly. The index point corresponds to the validating event referred to in Chapter 2. There is no reason at all why several features should not be captured at each index, just as in any nonseries multidimensional data set. However, just as each of the variables can be considered separately from the others during much of the nonseries data preparation process, so too can each series variable in a multidimensional series be considered separately during preparation.
9.3.2 Features of a Series
By its nature a series has some implicit pattern within the ordering. That pattern may repeat itself over a period. Often, time series are thought of by default as repetitive, or cyclic, but there is no reason that any repeating pattern should in fact exist. There is, for example, a continuing debate about whether the stock market exhibits a repetitive pattern or is simply the result of a random walk (touched on later). Enormous effort has been put into detecting any cyclic pattern that may exist, and still the debate continues. There is, nonetheless, a pattern in series data, albeit not necessarily a repeating one. One of the objectives of analyzing series data is to describe that pattern, identify it as recognizable if possible, and find any parts that are repetitive. Preparing series data for modeling, then, must preserve the nature of the pattern that exists. Preparation also includes putting the data into a form in which the desired information is best exposed to a modeling tool. Once again, a warning: this is not always easy!
Before looking at how series data may be prepared, and what problems may be detected and corrected, the focus now turns to finding some way to unambiguously describe the series.
9.3.3 Describing a Series—Fourier
Jean Baptiste Joseph Fourier was not a professional mathematician. Nonetheless, he exerted an influence on the mathematicians and scientists of his day second only to that of Sir Isaac Newton. Until Fourier revealed new tools for analyzing data, several scientists lamented that the power of mathematics seemed to be just about exhausted. His insights reinvigorated the field. Such is the power of Fourier’s insight that its impact continues to reverberate in the modern world today. Indeed, Fourier provided the key insights and methods for developing the tools responsible for building the modern technology that we take for granted.
To be fair, the techniques today brought under the umbrella description of Fourier analysis were not all entirely due to Fourier. He drew on the work of others, and subsequent work enormously extends his original insight. His name remains linked to these techniques, and deservedly so, because he had the key insights from which all else flowed.
One of the tools he devised is a sort of mathematical prism. Newton used a prism to discover that white light consists of component parts (colors). Fourier’s prism scatters the information in a series into component parts. It is a truly amazing device that hinges on two insights: waves can be added together, and adding enough simple sine and cosine waves of different frequencies, phases, and amplitudes is sufficient to create any series shape. Any series shape!
When adding waveforms together, several things can be varied. The three key items are
• The frequency, or how many times a waveform repeats its pattern in a given time.
• The phase, that is, where the peaks and troughs of a wave occur in relation to the peaks and troughs of other waves.
• The amplitude, or distance between the highest and lowest values of a wave.
Figure 9.1 shows how two waveforms can be added together to produce a third. The frequency simply measures how many waves, or cycles, occur in a given time. The top two waveforms are symmetrical and uniform; it is easy to see where each begins to repeat its pattern. The two top waveforms also have different frequencies, which is shown by the identified wavelengths tracing out different lengths on the graph. Whether the lower, composite waveform has completed a repeating pattern within the width of the graph cannot be determined just by looking at it.
Figure 9.1 The addition of two waveforms shows how to create a new waveform. The values of the cosine and sine waveforms are added together and the result plotted. The resulting waveform may look nothing like the sources that created it.
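The pointwise addition illustrated in Figure 9.1 is easy to reproduce. In the sketch below (Python with NumPy; the particular frequencies and amplitudes are arbitrary), two source waveforms are summed to form a composite:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000)

# Two source waveforms with different frequencies and amplitudes.
wave1 = np.cos(3 * x)
wave2 = 0.5 * np.sin(5 * x)

# The composite waveform is simply the pointwise sum of its sources.
composite = wave1 + wave2
```

Plotting `composite` would show the same effect as the lower trace in Figure 9.1: a shape that may look nothing like either of the sources that created it.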
Figure 9.2 shows two waveforms that are both complete cycles, and both are identical in length. The waveforms illustrate the sine and the cosine functions:

y = sine(x°)

and

y = cosine(x°)
Figure 9.2 Values of the functions “sine” and “cosine” plotted for the number of degrees shown on the central vinculum.
When used in basic trigonometry, both of these functions return specific values for any number of degrees. They are shown plotted over 360°, the range of a circle. Because 0° represents the same circular position as 360° (both are due north on a compass, for example), this has to represent one complete cycle for sine and cosine waveforms; they begin an identical repetition after that point. Looking at the two waveforms shows that the sine has values identical to those of the cosine, but occurring 90° later (further to the right on the graph). The sine is an identical waveform to the cosine, shifted 90°. “Shifted” here literally means moved to the right by a distance corresponding to 90°. This shift is called a phase shift, and the two waveforms are said to be 90° out of phase with each other.
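The 90° phase relationship can be checked numerically (a NumPy sketch): shifting the cosine 90° to the right reproduces the sine exactly.

```python
import numpy as np

degrees = np.arange(0, 360)
sine = np.sin(np.deg2rad(degrees))
cosine = np.cos(np.deg2rad(degrees))

# The sine is the cosine shifted 90 degrees to the right: sin(x) = cos(x - 90).
shifted_cosine = np.cos(np.deg2rad(degrees - 90))
```

Shifting by a full 360° instead would reproduce the original waveform unchanged, which is why 0° and 360° mark one complete cycle.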
The two upper images in Figure 9.3 show the effect of changing amplitude. Six sine and cosine waveforms, three of each, are added together. The frequencies of each corresponding waveform in the two upper images are identical; all that has changed is the amplitude of each of the waveforms. This makes a very considerable difference to the resulting waveform shown at the bottom of each image. The lower two images show the amplitudes held constant, but the frequency of each contributing waveform differs. The resulting waveforms, the lower waveform of each frame, show very considerable differences.
Figure 9.3 These four images show the result of summing six waveforms. In the top two images, the frequencies of the source waveforms are the same; only their amplitude differs. In both of the two lower images, all waveforms have similar amplitude.
It was Fourier’s insight that by combining enough of these two types of waveforms, varying their amplitude, phase, and frequency as needed, any desired resultant waveform can be built. Fourier analysis is the “prism” that takes in a complex waveform and “splits” it into its component parts, just as a crystal prism takes in white light and splits it into the various colors. And just as there is only one rainbow of colors, so too, for any specific input waveform, there is a single “rainbow” of outputs.
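The prism is available directly as the discrete Fourier transform. In the sketch below (Python with NumPy; the three components are arbitrary), a composite waveform is built from three known sine and cosine components, and the transform recovers their frequencies and amplitudes exactly:

```python
import numpy as np

n = 1024
t = np.arange(n)

# A composite waveform built from three components with known frequencies
# (5, 12, and 30 cycles per window) and amplitudes (2.0, 1.0, 0.5).
signal = (2.0 * np.sin(2 * np.pi * 5 * t / n)
          + 1.0 * np.cos(2 * np.pi * 12 * t / n)
          + 0.5 * np.sin(2 * np.pi * 30 * t / n))

# The "prism": the discrete Fourier transform splits the composite back into
# its components; normalizing by n/2 recovers the amplitudes.
spectrum = np.abs(np.fft.rfft(signal)) / (n / 2)
peaks = sorted(int(i) for i in np.argsort(spectrum)[-3:])
```

The three largest spectral magnitudes sit exactly at the frequencies that went in, with their original amplitudes: the single "rainbow" for this particular input waveform.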