Each degree of freedom represents some feature of the system that can affect the outcome. The more of them there are, the more likely it is that, purely by happenstance, some particular, but actually meaningless, pattern will show up. The number of variables in a data set, or the number of weights in a neural network, all represent things that can change. So, yet again, high-dimensionality problems turn up, this time expressed as degrees of freedom. Fortunately, for the purposes of data preparation, a definition of degrees of freedom is not needed, as this is a problem previously encountered in many guises. Much discussion, particularly in this chapter, has been about reducing the dimensionality/combinatorial explosion problem (which is degrees of freedom in disguise) by reducing dimensionality. Nonetheless, a data set always has some dimensionality, for if it does not, there is no data set! And having some particular dimensionality, or number of degrees of freedom, implies some particular chance that spurious patterns will turn up. It also has implications for how much data is needed to ensure that any spurious patterns are swamped by valid, real-world patterns. The difficulty is that the calculations are not exact because several needed measures, such as the number of significant system states, while definable in theory, seem impossible to pin down in practice. Also, each modeling tool introduces its own degrees of freedom (weights in a neural network, for example), which may be unknown to the miner.
The ideal, if the miner has access to software that can make the measurements (such as data surveying software), requires use of a multivariable sample determined to be representative to a suitable degree of confidence. Failing that, as a rule of thumb for the minimum amount of data to accept for mining (as opposed to data preparation), use at least twice the number of instances required for a data preparation representative sample. The key is to have enough representative instances of data to swamp the spurious patterns. Each significant system state needs sufficient representation, and having a truly representative sample of data is the best way to assure that.
10.7 Beyond Joint Distribution
So far, so good. Capturing the multidimensional distribution captures a representative sample of data. What more is needed? On to modeling!
Unfortunately, things are not always quite so easy. Having a representative sample in hand is a really good start, but it does not assure that the data set is modelable! Capturing a representative sample is an essential minimum—that, and knowing what degree of confidence is justified in believing the sample to be representative. However, the miner needs a modelable representative sample, and the sample simply being representative of the population may not be enough. How so?
Actually, there are any number of reasons, all of them domain specific, why the minimum representative sample may not suffice—or indeed, why a nonrepresentative sample is needed. (Heresy! All this trouble to ensure that a fully representative sample is collected, and now we are off after a nonrepresentative sample. What goes on here?)
Suppose a marketing department needs to improve a direct-mail marketing campaign. The normal response rate for the random mailings so far is 1.5%. Mailing rolls out, results trickle in. A (neophyte) data miner is asked to improve response. "Aha!" says the miner, "I have just the thing. I'll whip up a quick response model, infer who's responding, and redirect the mail to similar likely responders. All I need is a genuinely representative sample, and I'll be all set!" With this terrific idea, the miner applies the modeling tools, and after furiously mining, the best prediction is that no one at all will respond! Panic sets in; staring failure in the face, the neophyte miner begins the balding process by tearing out hair in chunks while wondering what to do next.
Fleeing the direct marketers with a modicum of hair, the miner tries an industrial chemical manufacturer. Some problem in the process occasionally curdles a production batch. The exact nature of the process failure is not well understood, but the COO just read a business magazine article extolling the miraculous virtues of data mining. Impressed by the freshly minted data miner (who has a beautiful certificate attesting to skill in mining), the COO decides that this is a solution to the problem. Copious quantities of data are available, and plenty more if needed. The process is well instrumented, and continuous chemical batches are being processed daily. Oodles of data representative of the process are on hand. Wielding mining tools furiously, the miner conducts an onslaught designed to wring every last confession of failure from the recalcitrant data. Using every art and artifice, the miner furiously pursues the problem until, staring again at failure and with desperation setting in, the miner is forced to fly from the scene, yet more tufts of hair flying.
Why has the now mainly hairless miner been so frustrated? The short answer is that while the data is representative of the population, it isn't representative of the problem.
Consider the direct marketing problem. With a response rate of 1.5%, any predictive system has an accuracy of 98.5% if it uniformly predicts "No response here!" Same thing with the chemical batch processing—lots of data in general, little data about the failure conditions.
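The arithmetic behind this trap is easy to reproduce. The following minimal Python sketch uses made-up numbers (a hypothetical file of 100,000 mailings at the 1.5% response rate mentioned in the text) to show that a "model" that uniformly predicts no response scores about 98.5% accuracy while identifying not a single responder:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Hypothetical campaign outcomes: True = responded (about 1.5% of instances)
responded = rng.random(n) < 0.015

# A "model" that uniformly predicts "no response"
predictions = np.zeros(n, dtype=bool)

accuracy = np.mean(predictions == responded)
print(f"Accuracy of the do-nothing model: {accuracy:.3%}")          # about 98.5%
print("Responders correctly identified:", int(np.sum(predictions & responded)))  # 0

The accuracy looks excellent precisely because the feature of interest is so rare, which is why raw accuracy says nothing about whether the responders have been captured at all.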
Both of these examples are based on real applications, and in spite of the light manner of introducing the issue, the problem is difficult to solve. The feature to be modeled is insufficiently represented for modeling in a data set that is representative of the population. Yet, if the mining results are to be valid, the data set mined must be representative of the population, or the results will be biased and may well be useless in practice. What to do?
10.7.1 Enhancing the Data Set
When the density of the feature to be modeled is very low, clearly the density of that feature needs to be increased—but in a way that does least violence to the distribution of the population as a whole. Using the direct marketing response model as an example, simply increasing the proportion of responders in the sample may not help. It's assumed that there are some other features in the sample that actually do vary as response varies. It's just that they're swamped by spurious patterns, but only because of their low density in the sample. Enhancing the density of responders is intended to enhance the variability of connected features. The hope is that when enhanced, these other features become visible to the predictive mining tool and, thus, are useful in predicting likely responders.
These assumptions are to some extent true. Some performance improvement may be obtained this way, although usually more by happenstance than design. The problem is that low-density features have more than just low-level interactions with other, potentially predictive features. The instances with the low-density feature represent some small proportion of the whole sample and form a subsample—the subsample containing only those instances that have the required feature. Considered alone, because it is so small, the subsample almost certainly does not represent the sample as a whole—let alone the population. There is, therefore, a very high probability that the subsample contains much noise and bias that are in fact totally unrelated to the feature itself, but are simply concomitant to it in the sample taken for modeling.
Simply increasing the desired feature density also increases the noise and bias patterns that the subsample carries with it—and those noise and bias patterns will then appear to be predictive of the desired feature. Worse, the enhanced noise and bias patterns may swamp any genuinely predictive feature that is present.
This is a tough nut to crack. It is very similar to any problem of extracting information from noise, and that is the province of information theory, discussed briefly in Chapter 11 in the context of the data survey. One of the purposes of the data survey is to understand the informational structure of the data set, particularly in terms of any identified predictive variables. However, a practical approach to solving the problem does not depend on the insights of the data survey, helpful though they might be. The problem is to construct a sample data set that represents the population as much as possible while enhancing some particular feature.
Feature Enhancement with Plentiful Data
If there is plenty of data to draw upon, instances of data with the desired feature may also be plentiful. This is the case in the first example above. The mailing campaign produces many responses. The problem is their low density as a proportion of the sample. There may be thousands or tens of thousands of responses, even though the response rate is only 1.5%.
In such a circumstance, the shortage of instances with the desired feature is not the problem, only their relative density in the mining sample. With plenty of data available, the miner constructs two data sets, both fully internally representative of the population—except for the desired feature. To do this, divide the source data set into two subsets such that one subset has only instances that contain the feature of interest and the other subset has no instances that contain the feature of interest. Use the techniques already described (Chapter 5) to extract a representative sample from each subset, ignoring the effect of the key feature. This results in two separate subsets, both similar to each other and representative of the population as a whole when ignoring the effect of the key feature. They are effectively identical except that one has the key feature and the other does not.
Any difference in distribution between the two subsets is due either to noise, to bias, or to the effect of the key feature. Whatever differences there are should be investigated and validated whatever else is done, but this procedure minimizes noise and bias since both data sets are representative of the population, save for the effect of the key feature. Adding the two subsets together gives a composite data set that has an enhanced presence of the desired feature, yet is as free from other bias and noise as possible.
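A minimal sketch of this construction in Python with pandas follows. The data set, the column name "responded", and the subset sizes are hypothetical, and plain random sampling stands in for the Chapter 5 representative-sampling techniques, purely for brevity:

import pandas as pd

def enhance_feature_density(df, feature_col, n_per_subset, seed=0):
    """Build a composite sample with enhanced density of a rare feature.

    Split the source data on the key feature, draw a (here, simple random)
    sample from each side while ignoring the key feature, and recombine.
    The composite is then as representative as possible apart from the
    deliberately increased feature density.
    """
    with_feature = df[df[feature_col] == 1]
    without_feature = df[df[feature_col] == 0]

    sample_with = with_feature.sample(
        n=min(n_per_subset, len(with_feature)), random_state=seed)
    sample_without = without_feature.sample(
        n=min(n_per_subset, len(without_feature)), random_state=seed)

    # Shuffle the combined instances so the two subsets are interleaved
    composite = pd.concat([sample_with, sample_without]).sample(
        frac=1.0, random_state=seed)
    return composite

# Hypothetical usage: a campaign file with a 0/1 "responded" column
# campaign = pd.read_csv("campaign.csv")
# training = enhance_feature_density(campaign, "responded", n_per_subset=5000)

The 50/50 balance shown here is only one possible choice; the miner sets the enhanced density to whatever level the modeling tool seems to need.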
Feature Enhancement with Limited Data
Feature enhancement is more difficult when there is only limited data available. This is the case in the second example of the chemical processor. The production staff bends every effort to prevent the production batch from curdling, which only happens very infrequently. The reasons for the batch failure are not well understood anyway (that is what is to be investigated), so it may not be reliably reproducible. Whether possible or not, batch failure is a highly expensive event, hitting directly at the bottom line, so deliberately introducing failure is simply not an option management will countenance. The miner was constrained to work with the small amount of failure data already collected.
As in the case where data is plentiful, small subsamples that have the feature of interest are very likely also to carry much noise and bias. Since more data with the key feature is unavailable, the miner is constrained to work with the data at hand. There are several modeling techniques that are used to extract the maximum information from small subsamples, such as multiway cross-validation on the small feature sample itself, and intersampling and resampling techniques. These techniques do not affect data preparation since they are only properly applied to already prepared data. However, there is one data preparation technique used when data instances with a key feature are particularly low in density: data multiplication.
The problem with low feature-containing instance counts is that the mining tool might learn the specific pattern in each instance and take those specific patterns as predictive. In other words, low key feature counts prevent some mining tools from generalizing from the few instances available. Instead of generalizing, the mining tool learns the particular instance configurations—which is particularizing rather than generalizing. Data multiplication is the process of creating additional data instances that appear to have the feature of interest. White (or colorless) noise is added to the key feature subset, producing a second data subset. (See Chapter 9 for a discussion of noise and colored noise.) The interesting thing about the second subset is that its variables all have the same mean values, distributions, and so on, as the original data set—yet no two instance values, except by some small chance, are identical. Of course, the noise-added data set can be made as large as the miner needs. If duplicates do exist, they should be removed.
When added to the original data set, these now appear as more instances with the feature, increasing the apparent count and increasing the feature density in the overall data set. The added density means that mining tools will generalize their predictions from the multiplied data set. A problem is that any noise or bias present will be multiplied too. Can this be reduced? Maybe.
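The sketch below illustrates the idea in Python with NumPy. The data and the noise_scale knob are hypothetical, and Gaussian white noise is one simple choice; the point is only that jittered copies of the rare-feature instances approximately share the originals' means and distributions while (almost certainly) duplicating no exact values:

import numpy as np

def multiply_instances(feature_subset, n_copies, noise_scale=0.05, seed=0):
    """Create synthetic instances by adding white noise to rare-feature rows.

    feature_subset : 2-D array of the (already prepared, numeric) instances
                     that contain the key feature.
    n_copies       : how many noisy copies of the subset to generate.
    noise_scale    : noise standard deviation as a fraction of each
                     variable's standard deviation (an assumed knob).
    """
    rng = np.random.default_rng(seed)
    stds = feature_subset.std(axis=0)
    multiplied = []
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale * stds, size=feature_subset.shape)
        multiplied.append(feature_subset + noise)
    synthetic = np.vstack(multiplied)
    # Drop any exact duplicates, as the text advises
    synthetic = np.unique(synthetic, axis=0)
    return synthetic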
A technique called color matching helps. Adding white noise multiplies everything exactly as it is, warts and all. Instead of white noise, specially constructed colored noise can be added. The multidimensional distribution of a data sample representative of the population determines the precise color. Color matching adds noise that matches the multivariable distribution found in the representative sample (i.e., it is the same color, or has the same spectrum). Any noise or bias present in the original key feature subsample is still present, but color matching attempts to avoid duplicating the effect of the original bias, even diluting it somewhat in the multiplication.
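One simple way to approximate color matching, sketched below, is to draw the added noise from a zero-mean multivariate normal distribution whose covariance is estimated from the representative sample, so that the noise carries the sample's correlation structure rather than being white. This is only an illustration of the idea, not the full spectral matching the text alludes to, and the scale parameter is an assumed knob:

import numpy as np

def color_matched_noise(representative_sample, n_rows, scale=0.05, seed=0):
    """Generate noise whose covariance structure follows the representative sample.

    representative_sample : 2-D numeric array representative of the population.
    n_rows                : number of noise rows to generate.
    scale                 : overall noise magnitude relative to the data (assumed).
    """
    rng = np.random.default_rng(seed)
    cov = np.cov(representative_sample, rowvar=False)
    mean = np.zeros(representative_sample.shape[1])
    return scale * rng.multivariate_normal(mean, cov, size=n_rows)

def multiply_with_color(feature_subset, representative_sample, n_copies, seed=0):
    """Multiply the rare-feature subset using color-matched rather than white noise."""
    copies = []
    for i in range(n_copies):
        noise = color_matched_noise(representative_sample,
                                    len(feature_subset), seed=seed + i)
        copies.append(feature_subset + noise)
    return np.unique(np.vstack(copies), axis=0)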
As always, whenever adding bias to a data set, the miner should put up mental warning flags. Data multiplication and color matching add features to, or change features of, the data set that simply are not present in the real world—or, if present, not at the density found after modification. Sometimes there is no choice but to modify the data set, and frequently the results are excellent, robust, and applicable. Sometimes good results are even achieved where none at all were possible without making modifications. Nonetheless, biasing data calls for extreme caution, with much validation and verification of the results before applying them.
10.7.2 Data Sets in Perspective
Constructing a composite data set enhances the visibility of some pertinent feature in the data set that is of interest to the miner. Such a data set is no longer an unbiased sample, even if the original source data allowed a truly unbiased sample to be taken in the first place. Enhancing data makes it useful only from one particular point of view, or from a particular perspective. While more useful in particular circumstances, it is nonetheless not so useful in general. It has been biased, but with a purposeful bias deliberately introduced. Such data has a perspective.
When mining perspectival data sets, it is very important to use nonperspectival test and evaluation sets. With the best of intentions, the mining data has been distorted and, to at least that extent, no longer accurately represents the population. The only way to ensure that the inferences or predictions do not carry an unacceptable distortion through into the real world is to test them against data that is as undistorted—that is, as representative of the real world—as possible.
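A minimal sketch of that discipline, continuing the hypothetical pandas example from above: reserve a representative, untouched test set before any enhancement, train on the perspectival (enhanced) data, and judge the model only against the untouched holdout. The function names and fractions are illustrative assumptions:

import pandas as pd

def split_before_enhancing(df, test_frac=0.25, seed=0):
    """Reserve a representative test set before any feature enhancement."""
    test = df.sample(frac=test_frac, random_state=seed)   # nonperspectival holdout
    train_source = df.drop(test.index)                    # only this part is enhanced
    return train_source, test

# Hypothetical usage, continuing the earlier campaign example:
# train_source, holdout = split_before_enhancing(campaign)
# training = enhance_feature_density(train_source, "responded", n_per_subset=5000)
# ...fit the model on training, then measure its performance on holdout only.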
10.8 Implementation Notes
Of the four topics covered in this chapter, the demonstration code implements algorithms for the problems that can be adjusted automatically without high risk of unintended damage to the data set. Some of the problems discussed are only very rarely encountered, or could cause more damage than benefit to the data if corrected without care. Where no preparation code is available, this section includes pointers to procedures the miner can follow to perform the particular preparation activity.
10.8.1 Collapsing Extremely Sparsely Populated Variables
The demonstration code has no explicit support for collapsing extremely sparsely populated variables. It is usual to ignore such variables, and only in special circumstances do they need to be collapsed. Recall that these variables are usually populated at levels of small fractions of 1%, so considerably more than 99% of the values are missing (or empty).
While the full tool from which the demonstration code was drawn will fully collapse such variables if needed, it is easy to collapse them manually using the statistics file and the complete-content file produced by the demonstration code, along with a commercial data manipulation tool, say, an implementation of SQL. Most commercial statistical packages also provide all of the necessary tools to discover the problem, manipulate the data, and create the derived variables:
1. If using the demonstration code, start with the "stat" file.
2. Identify the population density for each variable.
3. Check the number of discrete values for each candidate sparse variable.
4. Look in the complete-content file, which lists all of the values for all of the variables.
5. Extract the lists for the sparse variables.
6. Access the sample data set with your tool of choice and search for, and list, those cases where the sparse variables simultaneously have values. (This won't happen often, even in sparse data sets.)
7. Create unique labels for each specific present-value pattern (PVP).
8. Numerate the PVPs.
Now comes the only tricky part. Recall that the PVPs were built from the representative sample. (It's representative only to some selected degree of confidence.) The execution data set may, and if large enough almost certainly will, contain a PVP that was not in the sample data set. Whether that matters is a judgment call that only the domain of the problem can answer. If it is important, create labels for all of the possible PVPs and assign them appropriate values. Otherwise, it may be that you can ignore any unrecognized PVPs or, more likely, flag them when they are found. The sketch below illustrates steps 6 through 8, together with the flagging of unrecognized PVPs.
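This is a minimal illustration in Python with pandas; the column names are hypothetical, drawing the representative sample itself is left to the Chapter 5 techniques, and the fallback label for unseen patterns reflects the "flag them" option just described:

import pandas as pd

def build_pvp_labels(sample, sparse_cols):
    """Map each present-value pattern (PVP) over the sparse columns to a numeric label."""
    # A pattern is the tuple recording which sparse variables have a value in a row
    patterns = sample[sparse_cols].notna().apply(tuple, axis=1)
    # Numerate the distinct patterns found in the representative sample
    return {pvp: i for i, pvp in enumerate(sorted(patterns.unique()))}

def collapse_sparse(df, sparse_cols, pvp_labels, unseen_label=-1):
    """Replace the sparse columns with a single numerated PVP column.

    Patterns not seen in the representative sample get unseen_label,
    i.e., they are flagged rather than silently assigned a code.
    """
    patterns = df[sparse_cols].notna().apply(tuple, axis=1)
    out = df.drop(columns=sparse_cols).copy()
    out["pvp"] = patterns.map(lambda p: pvp_labels.get(p, unseen_label))
    return out

# Hypothetical usage with sparse variables named "s1", "s2", "s3":
# labels = build_pvp_labels(sample_df, ["s1", "s2", "s3"])
# prepared = collapse_sparse(execution_df, ["s1", "s2", "s3"], labels)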
10.8.2 Reducing Excessive Dimensionality
Neural networks comprise a vast topic on their own. The brief introduction in this chapter only touched the surface. In keeping with all of the other demonstration code segments, the neural network design is intended mainly for humans to read and understand. Obviously, it also has to be read (and executed) by computer systems, but the primary focus is that the internal working of the code be as clearly readable as possible. Of all the demonstration code, this requirement for clarity most affects the network code. The network is not optimized for speed, performance, or efficiency. The sparsity mechanism is modified random assignment without any dynamic interconnection. The compression factor (hidden-node count) is discovered by random search.
The included code demonstrates the key principles involved and compresses information. Code for a fully optimized autoassociative neural network, including dynamic connection search with modified cascade hidden-layer optimization, is an impenetrable beast! The full version, from which the demonstration is drawn, also includes many other obfuscating (as far as clarity of reading goes) "bells and whistles." For instance, it includes modifications to allow maximum compression of information into the hidden layer, rather than spreading it between hidden and output layers, as well as modifications to remove linear relationships and represent those separately. While improving performance and compression, such features completely obscure the underlying principles.
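The underlying principle can be illustrated in a few lines. The following NumPy sketch is not the demonstration code itself; the layer size, learning rate, and training loop are arbitrary choices. It trains a small autoassociative network to reproduce its inputs, and the hidden-layer activations then serve as the compressed, reduced-dimensionality representation:

import numpy as np

def train_autoassociative(X, n_hidden, epochs=2000, lr=0.1, seed=0):
    """Train a one-hidden-layer autoassociative network on X (rows scaled to 0..1).

    Returns the weight matrices; the hidden activations are the compressed
    representation of each instance.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))    # input -> hidden
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))    # hidden -> output (reconstruction)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(epochs):
        H = sigmoid(X @ W1)                      # hidden activations
        Y = sigmoid(H @ W2)                      # reconstructed inputs
        err = Y - X                              # reconstruction error
        # Backpropagate the squared-error gradient through both layers
        dY = err * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dY) / len(X)
        W1 -= lr * (X.T @ dH) / len(X)
    return W1, W2

def compress(X, W1):
    """Project instances onto the hidden layer (the reduced-dimensionality view)."""
    return 1.0 / (1.0 + np.exp(-(X @ W1)))

# Hypothetical usage on an already prepared numeric array:
# W1, W2 = train_autoassociative(prepared_array, n_hidden=3)
# reduced = compress(prepared_array, W1)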
10.8.3 Measuring Variable Importance
Everything just said about neural networks for data compression applies when using the demonstration code to measure variable importance. For explanatory ease, both data compression and variable importance estimation use the same code segment. A network optimized for importance search can, once again, improve performance, but the principles are as well demonstrated by any SCANN-type BP-ANN.
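Building on the sketch above, one way to illustrate importance measurement is sensitivity by perturbation: scramble one variable at a time and see how much the reconstruction degrades. This is an illustrative stand-in only, not necessarily how the full tool estimates importance:

import numpy as np

def variable_importance(X, W1, W2, seed=0):
    """Rank variables by how much shuffling each one degrades reconstruction."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    reconstruct = lambda A: sigmoid(sigmoid(A @ W1) @ W2)

    base_error = np.mean((reconstruct(X) - X) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        rng.shuffle(X_shuffled[:, j])            # destroy variable j's relationships
        importance[j] = np.mean((reconstruct(X_shuffled) - X) ** 2) - base_error
    return importance                            # larger value = more important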
10.8.4 Color Matching
Generating the color-matched noise needed for data multiplication (described in Section 10.7.1) is outside the scope of the present book. It involves significant multivariable frequency modeling to reproduce a characteristic noise pattern emulating the sample multivariable distribution. Many statistical analysis software packages provide the basic tools for the miner to characterize the distribution and develop the necessary noise generation function.
10.9 Where Next?
A pause at this point. Data preparation, the focus of this book, is now complete. By applying all of the insights and techniques covered so far, raw data in almost any form is turned into clean, prepared data ready for modeling. Many of the techniques are illustrated with computer code on the accompanying CD-ROM, and so far as data preparation for data mining is concerned, the journey ends here.
However, the data is still unmined. The ultimate purpose of preparing data is to gain understanding of what the data "means" or predicts. The prepared data set still has to be used. How is this data used? The last two chapters look not at preparing data, but at surveying and using prepared data.
Chapter 11: The Data Survey
Overview
Suppose that three separate families are planning a vacation. The Abbott family really enjoys lake sailing. Their ideal vacation includes an idyllic mountain lake, surrounded by trees, with plenty of wildlife and perhaps a small town or two nearby in case supplies are needed. They need only a place to park their car and boat trailer and a place to launch the boat, and they are happy.
The Bennigans are amateur archeologists. There is nothing they like better than to find an ancient encampment, or other site, and spend their time exploring for artifacts. Their four-wheel-drive cruiser can manage most terrain and haul all they need to be entirely self-sufficient for a couple of weeks of exploring—and the farther from civilization, the better they like it.
The Calloways like to stay in touch with their business, even while on vacation. Their ideal is to find a luxury hotel in the sun, preferably near the beach but with nightlife. Not just any nightlife; they really enjoy cabaret, and they would like to find museums to explore and other places of interest to fill their days.
These three families all have very different interests and desires for their perfect vacation. Can they all be satisfied? Of course. The locations that each family would like to find and enjoy exist in many places; their only problem is to find them and narrow down the possibilities to a final choice. The obvious starting point is with a map. Any map of the whole country indicates broad features—mountains, forests, deserts, lakes, cities, and probably roads. The Abbotts will find, perhaps, the Finger Lakes in upstate New York a place to focus their attention. The Bennigans may look at the deserts of the Southwest, while the Calloways look to Florida. Given their different interests, each family starts by narrowing down the area of search for their ideal vacation to those general areas of the country that seem likely to meet their needs and interests.
Once they have selected a general area, a more detailed map of the particular territory lets each family focus in more closely. Eventually, each family will decide on the best choice they can find and leave for their various vacations. Each family explores its own vacation site in detail. While the explorations do not seem to produce maps, they reveal small details—the very details that the vacations are aimed at. The Abbotts find particular lake coves, see particular trees, and watch specific birds and deer. The Bennigans find individual artifacts in specific places. The Calloways enjoy particular cabaret performers and see specific exhibits at particular museums. It is these detailed explorations that each family feels to be the whole purpose of their vacations.
Each family started with a general search to find places likely to be of interest. Their initial search was easy. The U.S. Geological Survey has already done the hard work for them. Other organizations, some private survey companies, have embellished maps in particular ways and for particular purposes—road maps, archeological surveys, sailing maps (called "charts"), and so on. Eventually, the level of detail that each family needed was more than a general map could provide. Then the families constructed their own maps through detailed exploration.
What does this have to do with data mining? The whole purpose of the data survey is to help the miner draw a high-level map of the territory. With this map, a data miner discovers the general shape of the data, as well as areas of danger, of limitation, and of usefulness. With a map, the Abbotts avoided having to explore Arizona to see if any lakes suitable for sailing were there. With a data survey, a miner can avoid trying to predict the stock market from meteorological data. "Everybody knows" that there are no lakes in Arizona. "Everybody knows" that the weather doesn't predict the stock market. But these "everybodies" only know that through experience—mainly the experience of others who have been there first. Every territory needed exploring by pioneers—people who entered the territory first to find out what there was in general—blazing the trail for the detailed explorations to follow. The data survey provides a miner with a map of the territory that guides further exploration and locates the areas of particular interest, the areas suitable for mining. On the other hand, just as with looking for lakes in Arizona, if there is no value to be found, that is well to know as early as possible.
11.1 Introduction to the Data Survey
This chapter deals entirely with the data survey, a topic at least as large as data preparation. The introduction to the use, purposes, and methods of data surveying in this chapter discusses how prepared data is used during the survey. Most, if not all, of the surveying techniques can be automated. Indeed, the full suite of programs from which the data preparation demonstration code is drawn is a full data preparation and survey tool set. This chapter touches only on the main topics of data surveying. It is an introduction to the territory itself. The introduction starts with understanding the concept of "information."
This book mentions "information" in several places. "Information is embedded in a data set." "The purpose of data preparation is to best expose information to a mining tool." "Information is contained in variability." Information, information, information. Clearly, "information" is a key feature of data preparation. In fact, information—its discovery, exposure, and understanding—is what the whole preparation-survey-mining endeavor is about. A data set may represent information in a form that is not easily, or even at all, understandable by humans. When the data set is large, understanding significant and salient points becomes even more difficult. Data mining is devised as a tool to transform the impenetrable information embedded in a data set into understandable relationships or predictions.
However, it is important to keep in mind that mining is not designed to extract information. Data, or the data set, enfolds information. This information describes many and various relationships that exist enfolded in the data. When mining, the information is being mined for what it contains—an explanation or prediction based on the embedded relationships. It is almost always an explanation or prediction of specific details that solves a problem, or answers a question, within the domain of inquiry—very often a business problem. What is required as the end result is human understanding (enabling, if necessary, some action). Examining the nature of, and the relationships in, the information content of a data set is a part of the task of the data survey. It prepares the path for the mining that follows.
Some information is always present in the data—understandable or not. Mining finds relationships or predictions embedded in the information inherent in a data set. With luck, they are not just the obvious relationships. With more luck, they are also useful. In discovering and clarifying some novel and useful relationship embedded in data, data mining has its greatest success. Nonetheless, the information exists prior to mining. The data set enfolds it. It has a shape, a substance, a structure. In some places it is not well defined; in others it is bright and clear. It addresses some topics well; others poorly. In some places, the relationships are to be relied on; in others not. Finding the places, defining the limits, and understanding the structures is the purpose of data surveying.
The fundamental question posed by the data survey is, "Just what information is in here anyway?"
11.2 Information and Communication
Everything begins with information. The data set embeds it. The data survey surveys it. Data mining translates it. But what exactly is information? The Oxford English Dictionary begins its definition with "The act of informing," and continues in the same definition a little later, "Communication of instructive knowledge." The act referred to is clearly one where this thing, "information," is passed from one person to another. The latter part of the definition explicates this by saying it is "communication." It is in this sense of communicating intelligence—transferring insight and understanding—that the term "information" is used in data mining. Data possesses information only in its latent form. Mining provides the mechanism by which any insight potentially present is explicated. Since information is so important to this discussion, it is necessary to try to clarify, and if possible quantify, the concept.
Because information enables the transferring of insight and understanding, there is a sense in which the quantity of information relates to the amount of insight and understanding generated; that is, more information produces greater insight. But what is it that creates greater insight?
A good mystery novel—say, a detective story—sets up a situation. The situation described includes all of the necessary pieces to solve the mystery, but in a nonobvious way. Insight comes when, at the end of the story, some key information throws all of the established structure into a suddenly revealed, surprising new relationship. The larger and more complex the situation that the author can create, the greater the insight when the true situation is revealed. But in addition to the complexity of the situation, it seems to be true that the more surprising or unexpected the solution, the greater the insight.
The detective story illustrates the two key ingredients for insight. The first is what, for a detective story, is described as "the situation." The situation comprises a number of individual components and the relationship between the components. For a detective story, these components are typically the characters, the attributes of characters, their relationship to one another, and the revealed actions taken by each during the course of the narrative. These various components, together with their relationships, form a knowledge structure. The second ingredient is the communication of a key insight that readjusts the knowledge structure, changing the relationship between the components. The amount of insight seems intuitively related to how much readjustment of the knowledge structure is needed to include the new insight, and to the degree to which the new information is unexpected.
As an example, would you be surprised if you learned that, to the best of modern scientific knowledge, the moon really is made of green cheese? Why? For a start, it is completely unexpected. Can you honestly say that you have ever given the remotest credence to the possibility that the moon might really be made of green cheese? If true, such a simple communication carries an enormous amount of information. It would probably require you to reconfigure a great deal of your knowledge of the world. After all, what sort of possible rational explanation could be constructed to explain the existence of such a phenomenon? In fact, it is so unlikely that it would almost certainly take much repetition of the information in different contexts (more evidence) before you would accept it as valid. (Speaking personally, it would take an enormous readjustment of my world view to accept any rational explanation that includes several trillion tons of curdled milk products hanging in the sky a quarter of a million miles distant!)
These two very fundamental points about information—how surprising the communication is, and how much existing knowledge requires revision—both indicate something about how much information is communicated. But these seem very subjective measures, and indeed they are, which is partly why information is so difficult to define.
Claude E. Shannon did come to grips with the problem in 1948. In what has turned out to be one of the seminal scientific papers of the twentieth century, "A Mathematical Theory of Communication," he grappled directly with the problem. This was published the next year as a book and established a whole field of endeavor, now called "information theory." Shannon himself referred to it as "communication theory," but its effects and applicability have reached out into a vast number of areas, far beyond communications. In at least one sense it is only about communications, because unless information is communicated, it informs nothing and no one. Nonetheless, information theory has come to describe information as if it were an object rather than a process. A more detailed look at information will assume where needed, at least for the sake of explanation, that it exists as a thing in itself.
11.2.1 Measuring Information: Signals and Dictionaries
Information comes in two pieces: 1) an informing communication and 2) a framework in which to interpret the information. For instance, in order to understand that the moon is made of green cheese, you have to know what "green cheese" is, what "the moon" is, what "is made of" means, and so on. So the first piece of information is a signal of some sort that indicates the informing communication, and the second is a dictionary that defines the interpretation of the signaled communication. It is the dictionary that allows the signaled information to be placed into context within a framework of existing knowledge.
Paul Revere, in his famous ride, exemplified all of the basic principles with the "One if by land, two if by sea" dictionary. Implicit in this is "None if not coming." The number of lamps shown in the Old North Church tower in Boston—0, 1, or 2, indicating the direction of British advance—formed the dictionary for the communication system. The actual signal consisted of 0, 1, or 2 lamps showing in the tower window.
11.2.2 Measuring Information: Signals
A signal is a system state that indicates a defined communication. A system can have any number of signals. The English language has many thousands—each word carrying, or signaling, a unique meaning. Paul Revere came close to using the least possible signal. The least possible signal is a system state that is either present or not present. Any light in the Old North Church signaled that the British were coming—no light, no British coming. This minimal amount of signaled information can be indicated by any two-state arrangement: on and off, 1 and 0, up and down, present and absent. It is from this two-state system of signal information that we get the now ubiquitous binary digit, or bit, of information. Modern computer systems are all built from many millions of two-state switches, each of which can represent this minimal signal.
Back to the Old North Church tower. How many bits did Paul Revere's signal need? Well, there are three defined system states: 0 lamps = no sign of the British, 1 lamp = British coming by land, 2 lamps = British coming by sea. One bit can carry only two system states: state 0 = (say) no British coming, state 1 = (say) land advance. (Note that there is no necessary connection between the number of lamps showing and the number of bits.) There is no more room in one bit to define more than two system states. So in addition to one bit signaling two states—no advance or land advance—at least one more bit is needed to indicate a sea advance. With two bits, up to four system states can be encoded, as shown in Table 11.1.
TABLE 11.1 Only three system states are needed to carry Paul Revere's message (using two bits leaves one state undefined).
Bit 1 state   Bit 2 state   Tower lights   Meaning
0             0             0              No British advance
0             1             1              Land advance
1             0             2              Sea advance
1             1              -             (Undefined)
When Paul Revere constructed his signaling system, he directly faced the problem that, in practice, two bits are needed. When the signals were devised, Paul used one lighted lantern to indicate the state of one bit. He needed only 1 1/2 lights, but what does it mean to show 1/2 a light? His solution introduced a redundant system state, as shown in Table 11.2.
TABLE 11.2 Paul Revere’s signaling system used redundancy in having two states carry the same message.
Bit 1 state   Bit 2 state   Tower lights   Meaning
0             0             0              No British advance
0             1             1              Land advance
1             0             1              Land advance
1             1             2              Sea advance
With this signaling system, land advance is indicated by two separate system states. Each state could have been used to carry a separate message, but instead of having an undefined system state, an identical meaning was assigned to multiple system states. Since the entire information content of the communication system could be carried by about 1 1/2 bits, there is roughly 1/2 a bit of redundancy in this system.
Redundancy measures duplicate information in system states. Most information-carrying systems have redundancy—the English language is estimated to be approximately 50% redundant. Tht is why yu cn undrstnd ths sntnce, evn thgh mst f th vwls are mssng! It is also what allows data set compression—squeezing out some of the redundancy.
There are many measures that can be used to quantify information content, but the use of bits has gained wide currency and is one of the most common. It is also convenient because one bit carries the least possible amount of information. So the information content of a data set is conveniently measured as the number of bits of information it carries. But given a data set, how can we discover how many bits of information it does carry?
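A brief worked illustration in Python: if Paul Revere's three system states are treated as equally likely, the message needs log2(3), about 1.58 bits, so a two-bit code carries roughly 0.42 ("roughly 1/2") of a bit of redundancy. More generally, when the observed states are not equally likely, the Shannon entropy of their frequencies gives the average number of bits per instance. The example state column below is hypothetical:

import math
from collections import Counter

def entropy_bits(states):
    """Average information per instance, in bits, from observed system states."""
    counts = Counter(states)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Three equally likely signal states need log2(3) bits
print(math.log2(3))              # about 1.585 bits
print(2 - math.log2(3))          # about 0.415 bits of redundancy in a 2-bit code

# Entropy of a hypothetical column of observed states
print(entropy_bits(["none", "land", "sea", "none", "none", "land"]))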
11.2.3 Measuring Information: Bits of Information
When starting out to measure the information content of a data set, what can be easily discovered within a data set is its number of system states—not (at least directly) the number of bits needed to carry the information. As an understandable example, however, imagine two data sets. The first, set A, is a two-bit data set: it comprises two variables, each of which can take values of 0 or 1. The second data set, set B, comprises one one-bit variable, which can take on values of 0 or 1. If these two data sets are merged to form a joint data set, the resulting data set must carry three bits of information.
To see that this is so, consider that set A has four possible system states, as shown in Table 11.3. Set B, on the other hand, has two possible system states, as shown in Table 11.4.
TABLE 11.3 Data set A, using two bits, has four discrete states.
Set A variable 1   Set A variable 2   System state
0                  0                  1
0                  1                  2
1                  0                  3
1                  1                  4