
Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 23


200 Paola Sebastiani, Maria M Abad, and Marco F Ramoni

Variable     Description            Type      State description
Region       Hoh birth region       Nominal   England, Scotland and Wales
Ad fems      No. of adult females   Ordinal   0, 1, ≥ 2
Ad males     No. of adult males     Ordinal   0, 1, ≥ 2
Children     No. of children        Ordinal   0, 1, 2, 3, ≥ 4
Hoh age      Age of Hoh             Numeric   17-36; 36-50; 50-66; 66-98
Hoh gend     Gender of Hoh          Nominal   M, F
Accomod      Accommodation          Nominal   Room, Flat, House, Other
Bedrms       No. of bedrooms        Ordinal   1, 2, 3, ≥ 4
Ncars        No. of cars            Ordinal   1, 2, 3, ≥ 4
Tenure       House status           Nominal   Rent, Owned, Soc-Sector
Hoh reslen   Length of residence    Numeric   0-3; 3-9; 9-19; ≥ 19 (months)
Hoh origin   Hoh ethnicity          Nominal   Caucas., Black, Chin., Indian, Other
Hoh status   Status of Hoh          Nominal   Active, Inactive, Retired

Table 10.2 Description of the variables used in the analysis. Hoh denotes the Head of the Household. Numbers of adult males, females and children refer to the household.

of the household increases. The dependency of the gender of the household head on the ethnic group shows that Blacks have the smallest probability of having a male head of the household (64%), while Indians have the largest probability (89%). Other interesting discoveries are that the age of the head of the household depends directly on the number of adult males and females, and that households with no females and two or more males are more likely to be headed by a young male, while households with no males and two or more females are more likely to be headed by a middle-aged female. There appear to be more single households headed by an elderly female than by an elderly male. The composition of the household also varies across ethnic groups: Indians have the smallest probability of living in a household with no adult males (10%), while Blacks have the largest probability (32%).

By propagating the network, one may investigate other undirected associations and discover that, for example, the typical Caucasian mid family with two children has a 77% chance of being headed by a male who, with probability 0.57, is aged between 36 and 50 years. The probability that the head of the household is active is 0.84, and the probability that the household is in an owned house is 0.66. Results of these queries are displayed in Figure 10.11. These figures are slightly different if the head of the household is, for example, Black: the probability that the head of the household is male (given that there are two children in the household) is only 0.62, and the probability that he is active is 0.79. If the head of the household is Indian, then the probability that he is male is 0.90, and the probability that he is active is 0.88. On average, the ethnic group changes only slightly the probability of the household being in accommodation provided by the social services (26% for Blacks, 23% for Chinese, 20% for Indians and 24% for Caucasians). Similarly, Black household heads are more likely to be inactive than household heads from other ethnic groups (16% for Blacks, 10% for Indians, 14% for Caucasians and Chinese) and to be living in a less wealthy household, as shown by the larger probability of living in accommodation with a smaller number of bedrooms and of having a smaller number of cars. The overall picture is that households headed by a Black person are less wealthy than others, and this is the conclusion one would reach if the gender of the head of the household were not taken into account. However, the dependency structure discovered shows that the gender of the head of the household and the number of adult females make all the other variables independent of the ethnic group. Thus, the extracted model supports the hypothesis that differences in household wealth are more likely explained by differences in household composition, and in particular by the gender of the head of the household, than by racial factors.

Fig. 10.11 An example of probabilistic reasoning using the Bayesian network induced from the 13 variables extracted from the 1996 General Household Survey.
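The way a mediating variable can make other variables independent of the ethnic group may be easier to see on a toy chain: group, then gender of the household head, then wealth. The following is a minimal sketch with made-up probabilities; the names `p_group`, `p_male_head` and `p_wealthy` and all numbers are hypothetical and are not taken from the survey. It only shows how a marginal association can vanish once the mediating variable is conditioned on.

```python
# Toy chain: ethnic group -> gender of household head -> wealth.
# All probabilities are hypothetical, chosen only for illustration.
p_group = {"A": 0.5, "B": 0.5}        # P(group)
p_male_head = {"A": 0.9, "B": 0.6}    # P(head is male | group)
p_wealthy = {"M": 0.7, "F": 0.4}      # P(wealthy | gender of head)


def joint(group, gender, wealthy):
    """Joint probability under the chain factorization."""
    p_gender = p_male_head[group] if gender == "M" else 1 - p_male_head[group]
    p_w = p_wealthy[gender] if wealthy else 1 - p_wealthy[gender]
    return p_group[group] * p_gender * p_w


def prob_wealthy(group, gender=None):
    """P(wealthy | group) or, if gender is given, P(wealthy | group, gender)."""
    genders = ["M", "F"] if gender is None else [gender]
    numerator = sum(joint(group, g, True) for g in genders)
    denominator = sum(joint(group, g, w) for g in genders for w in (True, False))
    return numerator / denominator


for group in p_group:
    print(group,
          "marginal:", round(prob_wealthy(group), 3),
          "given male head:", round(prob_wealthy(group, "M"), 3),
          "given female head:", round(prob_wealthy(group, "F"), 3))
```

In this toy model the marginal probability of being wealthy differs between the two groups, but the difference disappears as soon as the gender of the head is fixed, which is the pattern the dependency structure in the text describes.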

10.6.2 Customer Profiling

A typical problem of direct mail fund raising campaigns is the low response rate. Recent studies have shown that adding incentives or gifts to the mailing can increase the response rate. This is the strategy implemented by an American charity in the June ’97 renewal campaign. The mailing included a gift of personalized name and address labels plus an assortment of 10 note cards and envelopes. Each mailing cost the charity 0.68 dollars and resulted in a response rate of about 5% in the group of so-called lapsed donors, that is, individuals who made their last donation more than a year before the ’97 renewal mail. Since the donations received from the respondents ranged between 2 and 200 dollars, and the median donation was 13 dollars, the fund raiser needed to decide when it was worth sending the renewal mail to a donor, on the basis of the information available about him from the in-house database. Furthermore, the charity was interested in strategies to recapture lapsed donors and, therefore, in building a profile from which to understand the motivations behind their lack of response.
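The mailing decision can be framed as a simple break-even calculation. The sketch below uses the cost, response rate and median donation quoted in the text; treating the median donation as a stand-in for the expected gift is an assumption made here for illustration (the mean may well differ, given the 2 to 200 dollar range), and the function names are hypothetical.

```python
COST_PER_MAIL = 0.68     # dollars per mailing (from the text)
MEDIAN_DONATION = 13.0   # dollars (from the text); assumed stand-in for the expected gift
RESPONSE_RATE = 0.05     # about 5% among lapsed donors (from the text)


def expected_net_return(p_response, expected_gift, cost=COST_PER_MAIL):
    """Expected profit, in dollars, of mailing one donor."""
    return p_response * expected_gift - cost


def break_even_probability(expected_gift, cost=COST_PER_MAIL):
    """Smallest response probability at which a mailing pays for itself."""
    return cost / expected_gift


print(round(break_even_probability(MEDIAN_DONATION), 4))
print(round(expected_net_return(RESPONSE_RATE, MEDIAN_DONATION), 2))
```

Under these assumptions the break-even response probability is 0.68/13, a little above 5%, so a donor with the average response rate and a median-sized gift would cost slightly more to mail than he returns. This is why a model of the individual response probability matters for the decision.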

We addressed these issues in (Sebastiani et al., 2000) by building two causal models. The first model captured the dependency of the probability of response to the mailing campaign on the independent variables in the database. The second one modeled the dependency of the dollar amount of the gift, and it was built by using only the 5% of donors who responded to the ’97 mailing campaign. We focus here on the first model, depicted in Figure 10.12, which shows that the probability of a donation (variable Target-B in the top-left corner) is directly affected by the wealth rating (variable Wealth1) and the donor’s neighborhood (variable Domain1). The network shows that, marginally, only 5% of those who received the renewal mail are likely to respond. Persons living in suburbs, cities or towns have about 5% probability of responding, while donors living in rural or urban neighborhoods respond with probability 5%. The wealth rating of the donor neighborhood has a positive effect on the response rate of donors living in urban, suburban or city areas, with donors living in wealthier neighborhoods being more likely to respond than donors living in poorer neighborhoods. The probability of responding rises to about 6% for donors living in wealthy city neighborhoods. The variable Domain1 is closely related to the variable Domain2, which represents an indicator of the socio-economic status of the donor neighborhood, and it shows that donors living in suburbs or cities are more likely to live in neighborhoods having a highly rated socio-economic status. Therefore, they may be more sensitive to political and social issues. The model also shows that donors living in neighborhoods with a high presence of males active in the military (Malemili) are more likely to respond. Again, since the charity collects funds for military veterans, this fact supports the hypothesis that sensitivity to the problem for which funds are collected has a large effect on the probability of response. On the other hand, the wealth rating of donors living in rural neighborhoods has the opposite effect: the higher the wealth rating, the smaller the probability that the donor responds, and the least likely to respond (3.8%) are donors living in wealthy rural areas. A curiosity is that persons living in poor rural neighborhoods are more likely to respond positively to mail including a gift than donors living in wealthy city neighborhoods.

By querying the network, we can profile respondents: they are more likely than nonrespondents to live in a wealthy neighborhood located in a suburb, and they are less likely to have made a donation in the last 6 months. One feature that discriminates respondents from nonrespondents is household income: respondents are 1.20 times more likely than nonrespondents to be living in wealthy neighborhoods and to be on a higher income.
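Profiling of this kind amounts to reversing the direction of the model with Bayes' theorem: from the probability of responding given a feature to the probability of the feature given a response. A minimal sketch follows, assuming a single binary feature; the helper `posterior` and all input numbers are hypothetical and are not the values of the network in Figure 10.12.

```python
def posterior(prior, lik_if_true, lik_if_false):
    """P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' theorem."""
    evidence = prior * lik_if_true + (1 - prior) * lik_if_false
    return prior * lik_if_true / evidence


# Hypothetical inputs, for illustration only.
p_wealthy = 0.30          # P(wealthy neighborhood)
p_resp_wealthy = 0.060    # P(respond | wealthy)
p_resp_other = 0.045      # P(respond | not wealthy)

# P(wealthy | respond) and P(wealthy | no response)
p_wealthy_resp = posterior(p_wealthy, p_resp_wealthy, p_resp_other)
p_wealthy_nonresp = posterior(p_wealthy, 1 - p_resp_wealthy, 1 - p_resp_other)

# How many times more likely respondents are to live in a wealthy neighborhood.
ratio = p_wealthy_resp / p_wealthy_nonresp
print(round(p_wealthy_resp, 3), round(p_wealthy_nonresp, 3), round(ratio, 2))
```

A ratio above 1 is the kind of quantity reported in the text as "1.20 times more likely"; in a full network the same inversion is carried out by the propagation algorithm over all variables at once.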


Fig. 10.12 The Bayesian network induced from the data warehouse to profile likely respondents to mail solicitations.

10.7 Conclusions and Future Research Directions

Bayesian networks are a representation formalism born at the intersection of statistics and Artificial Intelligence. Thanks to their solid statistical foundations, they have been successfully turned into a powerful Data Mining and knowledge discovery tool able to uncover complex models of interactions from large databases. Their highly symbolic nature makes them easily understandable to human operators. Contrary to standard classification methods, Bayesian networks do not require the preliminary identification of an outcome variable of interest, and they are able to draw probabilistic inferences on any variable in the database.

Notwithstanding these attractive properties, there are still several theoretical issues that limit the range of applicability of Bayesian networks to the practice of science and engineering. This chapter has described methods to learn Bayesian networks from databases with either discrete or continuous variables. How to induce Bayesian networks from databases containing both types of variables is still very much an open research issue. Imposing the assumption that discrete variables can only be parent nodes in the network, but cannot be children of any continuous Gaussian node, leads to a closed form solution for the computation of the marginal likelihood (Lauritzen, 1992). This property has been applied, for example, to model-based clustering by (Ramoni et al., 2002), and it is commonly used in classification problems (Cheeseman and Stutz, 1996). However, this restriction can quickly become unrealistic and greatly limit the set of models to explore. As a consequence, common practice is still to discretize continuous variables, with possible loss of information, particularly when the continuous variables are highly skewed.
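The loss of information from discretizing a skewed variable can be seen by comparing two common binning rules. This is a minimal sketch on simulated data; the log-normal variable, the number of bins `K` and the helper names are assumptions made here for illustration, and neither rule is claimed to be the one used in the chapter.

```python
import bisect
import random
import statistics

rng = random.Random(0)
# Hypothetical right-skewed variable (log-normal), not data from the chapter.
values = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
K = 4  # number of bins


def equal_width_counts(data, k):
    """Count cases in k bins of equal width between the minimum and maximum."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts


def equal_frequency_counts(data, k):
    """Count cases in k bins delimited by the sample quantiles."""
    cuts = statistics.quantiles(data, n=k)   # k - 1 cut points
    counts = [0] * k
    for x in data:
        counts[bisect.bisect_right(cuts, x)] += 1
    return counts


ew = equal_width_counts(values, K)
ef = equal_frequency_counts(values, K)
print("equal width:    ", ew)
print("equal frequency:", ef)
```

With a long right tail, equal-width bins put almost every case into the first bin, so the discretized variable carries very little information; quantile-based bins keep the states balanced but hide how far apart the values in the top bin are.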

Another challenging research issue is how to learn Bayesian networks from incomplete data. The received view of the effect of missing data on statistical inference is based on the approach described by Rubin in (Rubin, 1987). This approach classifies the missing data mechanism as ignorable or not, according to whether the data are missing completely at random (MCAR), missing at random (MAR), or informatively missing (IM). According to this approach, data are MCAR if the probability that an entry is missing is independent of both observed and unobserved values. They are MAR if this probability is at most a function of the observed values in the database and, in all other cases, data are IM. The received view is that, when data are either MCAR or MAR, the missing data mechanism is ignorable for parameter estimation, but it is not when data are IM. An important but overlooked issue is whether the missing data mechanism generating data that are MAR is ignorable for model selection (Rubin, 1996, Sebastiani and Ramoni, 2001A). We have shown that this is not the case for regression-type graphical models and introduced two approaches to model selection with partially ignorable missing data mechanisms: ignorable imputation and model folding. Contrary to standard imputation schemes (Geiger et al., 1995, Little and Rubin, 1987, Schafer, 1997, Tanner, 1996, Thibaudeau and Winkler, 2002), ignorable imputation accounts for the missing-data mechanism and produces, asymptotically, a proper imputation model as defined by Rubin (Rubin, 1987, Rubin et al., 1995). However, the computational effort can be very demanding. Model folding is a deterministic method to approximate the exact marginal likelihood that reaches high accuracy at a low computational cost, because the complexity of the model search is not affected by the presence of incomplete cases. Both ignorable imputation and model folding reconstruct a completion of the incomplete data by taking into account the variables responsible for the missing data. This property is in agreement with the suggestion put forward in (Heitjan and Rubin, 1991, Little and Rubin, 1987, Rubin, 1976) that the variables responsible for the missing data should be kept in the model. However, our approach allows us to also evaluate the likelihoods of models that do not depend explicitly on these variables.

Although this work provides the analytical foundations for a proper treatment of missing data when the inference task is model selection, it is limited to the very special situation in which only one variable is partially observed, data are supposed to be only MCAR or MAR, and the set of Bayesian networks is limited to those in which the partially observed variable is a child of the other variables. Research is needed to extend these results to more general graphical structures, in which several variables can be partially observed and data can be MCAR, MAR or IM. These two issues, learning mixed-variable networks and handling incomplete databases, are still unsolved, and they offer challenging research opportunities.


This work was supported in part by the National Science Foundation (ECS-0120309), the Spanish State Office of Education and Universities, the European Social Fund and the Fulbright Program of the US State Department.

References

S. G. Bottcher and C. Dethlefsen. Deal: A package for learning Bayesian networks. Available from http://www.jstatsoft.org/v08/i20/deal.pdf, 2003.

U. M. Braga-Neto and E. R. Dougherty. Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20:374–380, 2004.

E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, New York, NY, 1997.

E. Charniak. Belief networks without tears. AI Magazine, pages 50–62, 1991.

P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153–180. MIT Press, Cambridge, MA, 1996.

J. Cheng and M. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. J. Artif. Intell. Res., 13:155–188, 2000.

D. M. Chickering. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res., 2:445–498, February 2002.

G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell., 42:297–346, 1990.

G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn., 9:309–347, 1992.

R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, New York, NY, 1999.

A. P. Dawid and S. L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Stat., 21:1272–1317, 1993. Correction ibidem, (1995), 23, 1864.

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, NY, 1973.

N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303:799–805, 2004.

N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Mach. Learn., 29:131–163, 1997.

N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn., 50:95–125, 2003.

N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 139–147, San Francisco, CA, 1998. Morgan Kaufmann Publishers.

D. Geiger and D. Heckerman. Learning Gaussian networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), San Francisco, 1994. Morgan Kaufmann.


D. Geiger and D. Heckerman. A characterization of Dirichlet distributions through local and global independence. Ann. Stat., 25:1344–1368, 1997.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, UK, 1995.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE T. Pattern Anal., 6:721–741, 1984.

W. R. Gilks and G. O. Roberts. Strategies for improving MCMC. In W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice, pages 89–114. Chapman and Hall, London, UK, 1996.

C. Glymour, R. Scheines, P. Spirtes, and K. Kelly. Discovering Causal Structure: Artificial Intelligence, Philosophy of Science, and Statistical Modeling. Academic Press, San Diego, CA, 1987.

I. J. Good. Rational decisions. J. Roy. Stat. Soc. B, 14:107–114, 1952.

I. J. Good. The Estimation of Probability: An Essay on Modern Bayesian Methods. MIT Press, Cambridge, MA, 1968.

D. J. Hand. Construction and Assessment of Classification Rules. Wiley, New York, NY, 1997.

D. J. Hand, N. M. Adams, and R. J. Bolton. Pattern Detection and Discovery. Springer, New York, 2002.

D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, 2001.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

D. Heckerman. Bayesian networks for Data Mining. Data Min. Knowl. Disc., 1:79–119, 1997.

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combinations of knowledge and statistical data. Mach. Learn., 20:197–243, 1995.

D. F. Heitjan and D. B. Rubin. Ignorability and coarse data. Ann. Stat., 19:2244–2253, 1991.

R. E. Kass and A. Raftery. Bayes factors. J. Am. Stat. Assoc., 90:773–795, 1995.

P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223–228, Menlo Park, CA, 1992. AAAI Press.

P. Larranaga, C. Kuijpers, R. Murga, and Y. Yurramendi. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE T. Pattern Anal., 26:487–493, 1996.

S. L. Lauritzen. Propagation of probabilities, means and variances in mixed graphical association models. J. Am. Stat. Assoc., 87(420):1098–108, 1992.

S. L. Lauritzen. Graphical Models. Oxford University Press, Oxford, UK, 1996.

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Roy. Stat. Soc. B, 50:157–224, 1988.

R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, NY, 1987.

D. Madigan and A. E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam’s window. J. Am. Stat. Assoc., 89:1535–1546, 1994.

D. Madigan and G. Ridgeway. Bayesian data analysis for Data Mining. In Handbook of Data Mining, pages 103–132. MIT Press, 2003.

D. Madigan and J. York. Bayesian graphical models for discrete data. Int. Stat. Rev., pages 215–232, 1995.


P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.

A. O’Hagan. Bayesian Inference. Kendall’s Advanced Theory of Statistics. Arnold, London, UK, 1994.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.

M. Ramoni, A. Riva, M. Stefanelli, and V. Patel. An ignorant belief network to forecast glucose concentration from clinical databases. Artif. Intell. Med., 7:541–559, 1995.

M. Ramoni and P. Sebastiani. Bayesian methods. In Intelligent Data Analysis: An Introduction, pages 131–168. Springer, New York, NY, 2nd edition, 2003.

M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121–6, 2002.

L. Rokach, M. Averbuch, and O. Maimon. Information retrieval system for medical narrative reports. Lecture Notes in Artificial Intelligence, 3055, pages 217–228. Springer-Verlag, 2004.

D. B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, NY, 1987.

D. B. Rubin. Multiple imputation after 18 years. J. Am. Stat. Assoc., 91:473–489, 1996.

D. B. Rubin, H. S. Stern, and V. Vehovar. Handling “don’t know” survey responses: the case of the Slovenian plebiscite. J. Am. Stat. Assoc., 90:822–828, 1995.

M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the 2nd Int. Conf. on Knowledge Discovery & Data Mining, 1996.

J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman and Hall, London, UK, 1997.

P. Sebastiani, M. Abad, and M. F. Ramoni. Bayesian networks for genomic analysis. In E. R. Dougherty, I. Shmulevich, J. Chen, and Z. J. Wang, editors, Genomic Signal Processing and Statistics, Series on Signal Processing and Communications. EURASIP, 2004.

P. Sebastiani and M. Ramoni. Analysis of survey data with Bayesian networks. Technical Report, Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes MK7 6AA, 2000. Available from authors.

P. Sebastiani and M. Ramoni. Bayesian selection of decomposable models with incomplete data. J. Am. Stat. Assoc., 96(456):1375–1386, 2001A.

P. Sebastiani and M. Ramoni. Common trends in European school populations. Res. Offic. Statist., 4(1):169–183, 2001B.

P. Sebastiani and M. F. Ramoni. On the use of Bayesian networks to analyze survey data. Res. Offic. Statist., 4:54–64, 2001C.

P. Sebastiani and M. Ramoni. Generalized gamma networks. Technical report, University of Massachusetts, Department of Mathematics and Statistics, 2003.

P. Sebastiani, M. Ramoni, and A. Crea. Profiling customers from in-house data. ACM SIGKDD Explorations, 1:91–96, 2000.

P. Sebastiani, M. Ramoni, and I. Kohane. BADGE: Technical notes. Technical report, Department of Mathematics and Statistics, University of Massachusetts at Amherst, 2003.

P. Sebastiani, M. F. Ramoni, V. Nolan, C. Baldwin, and M. H. Steinberg. Discovery of complex traits associated with overt stroke in patients with sickle cell anemia by Bayesian network modeling. In 27th Annual Meeting of the National Sickle Cell Disease Program, 2004. To appear.


P. Sebastiani, Y. H. Yu, and M. F. Ramoni. Bayesian machine learning and its potential applications to the genomic study of oral oncology. Adv. Dent. Res., 17:104–108, 2003.

R. D. Shachter. Evaluating influence diagrams. Operations Research, 34:871–882, 1986.

M. Singh and M. Valtorta. Construction of Bayesian network structures from data: A brief survey and an efficient algorithm. Int. J. Approx. Reason., 12:111–131, 1995.

D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:157–224, 1990.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Springer, New York, 1993.

M. A. Tanner. Tools for Statistical Inference. Springer, New York, NY, third edition, 1996.

Y. Thibaudeau and W. E. Winkler. Bayesian networks representations, generalized imputation, and synthetic microdata satisfying analytic restraints. Technical report, Statistical Research Division report RR 2002/09, 2002. http://www.census.gov/srd/www/byyear.html

A. Thomas, D. J. Spiegelhalter, and W. R. Gilks. Bugs: A program to perform Bayesian inference using Gibbs Sampling. In J. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4, pages 837–42. Oxford University Press, Oxford, UK, 1992.

J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, New York, NY, 1990.

S. Wright. The theory of path coefficients: a reply to Niles’ criticism. Genetics, 8:239–255, 1923.

S. Wright. The method of path coefficients. Annals of Mathematical Statistics, 5:161–215, 1934.

J. Yu, V. Smith, P. Wang, A. Hartemink, and E. Jarvis. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks. In International Conference on Systems Biology 2002 (ICSB02), 2002.

H. Zhou and S. Sakane. Sensor planning for mobile robot localization using Bayesian network inference. J. of Advanced Robotics, 16, 2002. To appear.


Data Mining within a Regression Framework

Richard A Berk

Department of Statistics

UCLA

berk@stat.ucla.edu

Summary. Regression analysis can imply a far wider range of statistical procedures than often appreciated. In this chapter, a number of common Data Mining procedures are discussed within a regression framework. These include non-parametric smoothers, classification and regression trees, bagging, and random forests. In each case, the goal is to characterize one or more of the distributional features of a response conditional on a set of predictors.

Key words: regression, smoothers, splines, CART, bagging, random forests

11.1 Introduction

Regression analysis can imply a broader range of techniques than ordinarily appreciated. Statisticians commonly define regression so that the goal is to understand “as far as possible with the available data how the conditional distribution of some response y varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg, 1999). For example, if there is a single categorical predictor such as male or female, a legitimate regression analysis has been undertaken if one compares two income histograms, one for men and one for women. Or, one might compare summary statistics from the two income distributions: the mean incomes, the median incomes, the two standard deviations of income, and so on. One might also compare the shapes of the two distributions with a Q-Q plot.
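The descriptive comparison in this example can be written in a few lines. The sketch below is only an illustration: the income figures are invented, the helper `describe` is a hypothetical name, and the quartiles stand in numerically for the comparison of shapes that a Q-Q plot would show.

```python
import statistics

# Hypothetical incomes (in thousands of dollars), for illustration only.
incomes = {
    "men":   [28, 35, 41, 44, 52, 58, 63, 75, 90, 120],
    "women": [25, 31, 36, 40, 45, 49, 55, 62, 70, 95],
}


def describe(sample):
    """Summary features of the conditional distribution of the response."""
    return {
        "mean": statistics.fmean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
        "quartiles": statistics.quantiles(sample, n=4),
    }


# One summary per subpopulation defined by the categorical predictor.
summary = {group: describe(sample) for group, sample in incomes.items()}
for group, stats in summary.items():
    print(group, stats)
```

No model, test or causal claim is involved: the output is simply a description of how the distribution of the response differs across the two values of the predictor.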

There is no requirement in regression analysis for there to be a “model” by which the data were supposed to be generated. There is no need to address cause and effect. And there is no need to undertake statistical tests or construct confidence intervals. The definition of a regression analysis can be met by pure description alone. Construction of a “model,” often coupled with causal and statistical inference, is a supplement to a regression analysis, not a necessary component (Berk, 2003). Given such a definition of regression analysis, a wide variety of techniques and approaches can be applied. In this chapter, I will consider a range of procedures

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_11, © Springer Science+Business Media, LLC 2010
