OLAP Versus Data Mining. From our earlier discussion on OLAP, you have a clear idea about the features of OLAP. With OLAP queries and analysis, users are able to obtain results and derive interesting patterns from the data.
Data mining also enables users to uncover interesting patterns, but there is an essential difference in the way the results are obtained. Figure 9-31 points out the essential difference between the two approaches.
Although both OLAP and data mining are complex information delivery systems, the basic difference lies in the interaction of the users with the systems. OLAP is a user-driven methodology; data mining is a data-driven approach. Data mining is a fairly automatic knowledge discovery process.
Data Mining: Knowledge Discovery. The knowledge discovery in data mining technology may be broken down into the following basic steps:
. Define business objectives
. Prepare data
. Launch data mining tools
. Evaluate results
. Present knowledge discoveries
. Incorporate usage of discoveries
Figure 9-32 amplifies the knowledge discovery process and shows the relevant data repositories.
FIGURE 9-31 OLAP and data mining.
Data Mining/Data Warehousing. How and where does data mining fit in a data warehousing environment? The data warehouse is a valuable and easily available data source for data mining operations. Data in the data warehouse is already cleansed and consolidated. Data for data mining may be extracted from the data warehouse.
Figure 9-33 illustrates data mining in the data warehouse environment. Observe the movement of data for data mining operations.
FIGURE 9-32 Knowledge discovery process.
FIGURE 9-33 Data mining in the data warehouse environment.
Data Mining Techniques
Although a discussion of major data mining techniques might be somewhat useful, our primary concentration is on the data and how to model the data for data mining applications. Detailed study of data mining techniques and algorithms is, therefore, outside the scope of our study. These techniques and algorithms are complex and highly technical. However, we will just touch on major functions, application areas, and techniques.
Functions and Techniques. Refer to Figure 9-34 showing data mining functions and techniques.
Look at the four columns in the figure and try to understand the connections. Review the following statements.
. Data mining algorithms are part of data mining techniques.
. Data mining techniques are used to carry out data mining functions. While performing specific data mining functions, you are applying data mining processes.
. A certain data mining function is generally suitable to a given application area.
. Each application area is a major area in business where data mining is actively used.
Applications. In order to appreciate the tremendous usefulness of data mining, let us list a few major applications of data mining in the business area.
Customer Segmentation. This is one of the most widespread applications. Businesses use data mining to understand their customers. Cluster detection algorithms discover clusters of customers sharing the same buying characteristics.
FIGURE 9-34 Data mining functions and techniques.
Market Basket Analysis. This is a very useful application for retail. Link analysis algorithms uncover affinities between products that are bought together. Other businesses such as upscale auction houses use these algorithms to find customers to whom they can sell higher-value items.
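To make the idea of product affinity concrete, here is a minimal sketch that counts how often pairs of items appear in the same basket. It is a simple co-occurrence count, not the link analysis algorithms of any particular data mining tool; the function name `pair_support` and the sample baskets are hypothetical and invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    total = len(baskets)
    if total == 0:
        return {}
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so every pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
support = pair_support(baskets)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

A pair with high support is a candidate affinity; a real application would also weigh how often each item sells on its own before drawing conclusions.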
Risk Management. Insurance companies and mortgage businesses use data mining to discover risks associated with potential customers.
Delinquency Tracking. Loan companies use the technology to track customers who are likely to be delinquent on loan repayments.
Demand Prediction. Retail and other distribution businesses use data mining to match demand and supply trends to forecast demand for specific products.
Data Preparation and Modeling
The data for your mining operations depends on the business objectives, that is, on what you expect to get out of the data mining technique being applied to the data. You will be able to come up with a set of data elements whose values are required as input into the data mining tool. To get the values, you need to determine the data sources.
Go back and revisit Figure 9-33, which shows data preparation from the enterprise data warehouse. You know the sources that feed the data warehouse. It is assumed that the data warehouse contains data that has been integrated and combined from several sources. The data warehouse is also expected to have clean data with all the impurities removed in the data staging area. If data mining algorithms are allowed to work on inconsistent data, the results could be totally useless.
In this subsection, we will concentrate on a box that indicates data selected, extracted, transformed, and prepared for data mining. We will discuss how the data selection and preparation are done. Once the data is prepared, you need to store the prepared data suitably for feeding the data mining applications. What are good methods for this data storage? How do you prepare the data model for this data repository?
Data Preprocessing
Depending on the particular data mining application, you may find that the needed data elements are not present in your data warehouse. Be prepared to look to other outside and internal sources for additional data. Further, incomplete, noisy, and inconsistent data is not infrequent in corporate databases and large data warehouses. Do not simply assume the correctness of available data and just extract data from these sources and feed the data mining application.
Data preprocessing generally consists of the following processes:
. Selection of data
. Preparation of data
. Transformation of data
Let us discuss these briefly. That would give us an idea of the data content that should be reflected in the data model for the preprocessed source data for data mining.
Data Selection. Of course, what data is needed depends on the business objectives and the nature of the data mining application. Remember, data mining algorithms work on data at the lowest grain or level of detail. Based on a list of data elements, you need to identify the sources. Maybe most of the data can be extracted from the data warehouse. Otherwise, determine the secondary sources.
Data mining algorithms work on data variables. Values of selected active variables are fed into the data mining system to perform the required operations. Active variables would be data attributes that may be found within the fact and dimension tables of the data warehouse repository.
Suppose your data mining application wants to perform market basket analysis, that is, to determine what a typical customer is likely to put in a market basket and go to the checkout counter of a supermarket. The active variables in this case would possibly be number of visits and variables to describe each basket such as household identification (from supermarket card), date of purchase, items purchased, basket value, quantities purchased, and promotion code.
Active variables generally fall into the following categories:
Nominal Variable. This has a limited number of values with no significance attached to the values in terms of ranking. Example: gender (male or female).
Ordinal Variable. This has a limited number of values with values signifying ranking. Example: customer education (high school or college or graduate school).
Continuous Measure Variable. Differences in values of the variable are measurable; variations are continuous. Examples: purchase price, number of items. Values for this variable are real numbers.
Discrete Measure Variable. Differences in values of the variable are measurable; variations are discrete. Example: number of market basket items. Values for this variable are integers.
Data Preparation. This step basically entails cleansing the selected data. It begins with a general review of the structure of the data in question and the choice of a method to measure quality. Usually, data quality is measured by a combination of statistical methods and data visualization techniques.
The most common data problems appear to be the following:
Missing Values. No values are recorded for many instances. The missing values need to be filled in before the variable is used. Several techniques are available to estimate and fill in the missing values.
Noisy Data. A few instances have values completely out of line. Example: daily wages exceeding a million dollars. Several smoothing techniques are available to deal with noisy data.
Inconsistent Data. Synonyms and homonyms in various source systems may produce incorrect and inconsistent data. Sources must be reviewed and inconsistencies removed.
Removal of data problems signals the end of the data preparation step. Once the selected data is cleansed, it is ready for transformation.
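As a rough illustration of the first two data problems, the following sketch fills missing values with the median of the recorded values and clamps out-of-line values to a plausible range. Both choices are assumptions made for this example: median fill is only one of the available estimation techniques, and clamping is a simple stand-in for the smoothing techniques mentioned above. The function names, the wage figures, and the range limits are hypothetical.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the recorded values."""
    recorded = [v for v in values if v is not None]
    fill = median(recorded)
    return [fill if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Clamp values that fall outside the plausible range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical daily wages: one missing value, one value completely out of line.
daily_wages = [120.0, None, 135.0, 1_500_000.0, 110.0]
filled = fill_missing(daily_wages)
cleaned = clip_outliers(filled, lower=0.0, upper=1_000.0)
print(cleaned)
```

The median is used here rather than the mean because a single extreme value, such as the wage above, would distort a mean-based estimate.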
Data Transformation. The prepared data is getting ready to be used as input to the data mining algorithm. The data transformation step converts the prepared data into a format suitable for data mining. You may say that data transformation changes the prepared data into a type of analytical model. The analytical model is simply an information structure representing integrated and time-dependent formatting of prepared data.
For example, if a supermarket wants to analyze customer purchases, it must first be decided if the analysis will be done at the store level or at the level of individual purchases. The analytical model includes the variables and the levels of detail.
Following the identification of the analytical model, detailed data transformation takes place. The objective is to transform the data to fit the exact requirements stipulated by the data mining algorithm. Data transformation may include a number of substeps such as:
. Data recoding
. Data format conversion
. Householding (linking of data of customers in the same household)
. Data reduction (by combining correlated variables)
. Scaling of parameter values to a range acceptable by data mining algorithms
. Discretization (conversion of quantitative variables into categorical variable groups)
. Conversion of categorical variables into a numeric representation
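Three of the substeps listed above (scaling, discretization, and conversion of categorical values to numbers) can be sketched in a few lines. This is only an illustration under stated assumptions: min-max scaling to the range 0 to 1 is one common choice, and the function names, cut points, and labels are hypothetical rather than requirements of any specific data mining algorithm.

```python
def min_max_scale(values):
    """Rescale numbers linearly into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Map a number to a label. cut_points are ascending, exclusive upper
    bounds; labels must have exactly one more entry than cut_points."""
    for bound, label in zip(cut_points, labels):
        if value < bound:
            return label
    return labels[-1]


def encode_categories(values):
    """Assign each distinct category an integer code (alphabetical order)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


# Hypothetical basket values and education levels, for illustration only.
basket_values = [20.0, 50.0, 80.0]
scaled = min_max_scale(basket_values)
bands = [discretize(v, [30.0, 60.0], ["low", "medium", "high"])
         for v in basket_values]
encoded, codes = encode_categories(["college", "high school", "college"])
print(scaled, bands, encoded)
```

Note that alphabetical codes carry no ranking. For an ordinal variable you would normally assign the codes by hand so that they follow the intended order.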
Data Modeling
Data modeling for data mining applications involves representations of the pertinent data repositories. In our discussions of data mining so far, we have been referring to the data requirements for data mining applications. We pointed out certain advantages of data extraction from the data warehouse. However, the data warehouse is not a required source; you may directly extract data from operational systems and other sources.
Figure 9-35 shows the data movements, the data preprocessing phase, and the data repositories.
Study the figure carefully and note the following data repositories for which we need to create data models as suggested below:
DM Source Repository. This is a general data store for all possible data mining applications. Periodically, data is extracted from the data warehouse and stored in this repository.
Data Model. Normalized relational data model to represent low-level data content for all possible active variables available from the data warehouse.
Application Analytical Repository. This is a data store for a specific data mining application. Data is extracted from the above DM Source Repository and other sources and stored in this repository. Only the required active variables are selected.
Data Model. Normalized relational data model is recommended to represent data content at the desired data level for only those active variables relevant to the specific data mining application.
Data Mining Input Extract. This data store is meant to be used as input to the data mining algorithm. Data in the Application Analytical Repository is transformed and moved into this data store. This data store contains transformed values for only the required active variables.
Data Model. Flat file or normalized relational data model with two or three tables to represent data content to be used as direct input to the data mining algorithm.
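The flow through the three repositories can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: the table and column names (`household`, `basket`, `app_basket`) are hypothetical, the sample rows are invented, and a single SQLite database stands in for what would normally be separate data stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DM source repository: normalized, low-level data (hypothetical tables).
    CREATE TABLE household (household_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE basket (
        basket_id     INTEGER PRIMARY KEY,
        household_id  INTEGER REFERENCES household(household_id),
        purchase_date TEXT,
        basket_value  REAL
    );
    INSERT INTO household VALUES (1, 'north'), (2, 'south');
    INSERT INTO basket VALUES
        (10, 1, '2007-01-05', 40.0),
        (11, 1, '2007-01-12', 60.0),
        (12, 2, '2007-01-07', 25.0);

    -- Application analytical repository: only the active variables
    -- needed by one specific data mining application.
    CREATE TABLE app_basket AS
        SELECT household_id, basket_value FROM basket;
""")

# Data mining input extract: one flat, transformed row per household.
extract = conn.execute("""
    SELECT household_id, COUNT(*) AS visits, SUM(basket_value) AS total_value
    FROM app_basket
    GROUP BY household_id
    ORDER BY household_id
""").fetchall()
print(extract)
conn.close()
```

The point of the sketch is the narrowing at each stage: all low-level data in the source repository, selected variables in the analytical repository, and a flat, transformed structure in the input extract.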
FIGURE 9-35 Data mining: data movements and repositories.
. Data warehousing is the most common of the decision-support systems widely used today. It is a blend of several technologies. Major components of a data warehouse are source data, data staging, data storage, and information delivery.
. Decision makers view business in terms of business dimensions for analysis. Therefore, data modeling for a data warehouse must take into account business dimensions and the business metrics. Dimensional modeling technique is used.
. A dimensional data model, known as a STAR schema, consists of several dimension entity types and a fact entity type in the middle. Each of the dimension entity types is in a one-to-many relationship with the common fact entity type. The STAR schema is not normalized. A snowflake schema, sometimes useful, is a normalized version. The data model for a given data warehouse usually consists of families of STARS.
. The conceptual data model in the form of a STAR schema is transformed into a logical model. If the data warehouse is implemented using a relational DBMS, the logical model in this case is a relational model.
. OLAP systems provide complex dimensional analysis. Data modeling for MOLAP: representation of multidimensional arrays suitable for the particular MDDBMS selected. Data modeling for ROLAP: E-R model of summarized data as required.
. Data mining is a fairly automatic knowledge discovery system. Data modeling for data mining systems consists of modeling for the data repositories: DM source repository, application analytical repository, and DM input extract.
REVIEW QUESTIONS
1. Match the column entries:
    1. Informational systems          A. Uses MDDBMS
    2. Data staging area              B. Semiadditive
    3. Dimension hierarchies          C. Normalized
    4. Fact entity type               D. Knowledge discovery
    5. Dimension table                E. Decision support
    6. Profit margin percentage       F. For drill-down analysis
    7. Snowflake schema               G. Data cleansed and transformed
   10. Data mining                    J. Represents multiple dimensions
2. A data warehouse is a decision-support environment, not a product. Discuss.
3. What data does an information package contain? Give a simple example.
4. Explain why the E-R modeling technique is not completely suitable for the data warehouse. How is dimensional modeling different?
5. Describe the composition of the primary keys for the dimension and fact tables. Give simple examples.
6. Describe the nature of the columns in a dimension table transformed from the corresponding conceptual STAR schema. Give typical examples.
7. What is your understanding of a value chain and a value circle in terms of families of STARS? What are the implications for data modeling?
8. Describe the main features of a ROLAP system. Explain how data modeling is done for this.
9. Distinguish between OLAP and data mining with regard to data modeling.
10. Discuss data preprocessing for data mining. What data repositories are involved and how do you model these?
PRACTICAL APPROACH TO DATA MODELING
ENSURING QUALITY IN THE DATA MODEL
CHAPTER OBJECTIVES
Establish the significance of quality in a data model
Explore approaches to good data modeling
Study instituting quality in model definitions
Introduce quality dimensions in a data model
Examine dimensions of accuracy, completeness, and clarity
Highlight features and benefits of a high-quality model
Discuss quality assurance process and the results
Understand data model review and assessment
We are at a crucial point in our discussions of data modeling. We have traveled quite far, covering much ground. You have a strong grip on data modeling by now. You are an expert on the components of a data model. You know how to translate the information requirements of an organization into a suitable data model using model components. You have studied a number of examples of data models. In short, you now possess a thorough knowledge of what data modeling is all about. What more is left to be covered?
In this chapter, we turn our attention to an important aspect of data modeling: ensuring quality of the model. Having gone through the multifarious facets of data modeling, it is just fitting to bring all that to a logical conclusion by stressing data model quality.
In recent decades, organizational user groups and information technology professionals have realized the overwhelming significance of data modeling. A data modeling effort precedes every database implementation. However, what we see in practice is a number of bad or inadequate models out there. Many models are totally incorrect representations of the information requirements. It is not merely that a bad model lacks a few attributes here and there or portrays a small number of relationships incorrectly. Many bad models lead to disastrous database implementations. Some bad data models are abandoned midstream and shelved because of improper quality control. The efforts of many weeks and months are down the drain.
Data Modeling Fundamentals. By Paulraj Ponniah
Copyright © 2007 John Wiley & Sons, Inc.
This chapter addresses the critical issues of data model quality. First, you will get to appreciate the significance of data model quality. Next, we will move to a discussion of quality in the definitions of various components. We will then explore the dimensions and characteristics of high-quality models and learn how to apply the fundamental principles of model quality. Quality assurance is a distinct process in the modeling effort; you will cover quality assurance in sufficient detail.
SIGNIFICANCE OF QUALITY
It is obvious that high quality in anything we create is essential. That goes without having to mention it specifically. Is that maxim not true for data modeling as well? Why emphasize quality in data modeling separately? There are some special reasons.
The concepts of data modeling are not that easy to comprehend. Data modeling is a specialized effort needing special skills. A data modeler must be a business analyst, draftsman, documentation expert, and database specialist, all rolled into one. It takes skill and experience to gain a high degree of proficiency in data modeling. It is easy to overlook the essentials and produce bad data models. In a large organization, piecing together the various components into a good data model requires enormous discipline and skill. It is not difficult to slip on model quality. We need to pay a high degree of special attention to quality.
Why Emphasize Quality?
Recall the fundamental purposes of a data model. Go back to the reasons for creating a data model in the first place. What is the role a data model plays in the development process of the data system for an organization?
First, a data model is meant as a communication tool for confirming the information requirements with the user groups. Next, a data model serves as a blueprint for the design and implementation of the data system for the organization. We have covered these two themes in elaborate detail. Figure 10-1 summarizes these two essential purposes of a data model.
Good Communication Tool. Quality in a data model is essential because the model has to be a good and effective means of communication with the user groups. As a data modeler, you delve into the information requirements; you want the data content of the ultimate database system to reflect the information requirements exactly. How do you ensure this?
You create a data model as a true replica of the information requirements. Then you use the data model as a tool for communication with the user groups. You have to show them that you have captured all the information requirements properly. You need to point out the various components of the data model to the user groups and get their confirmation. You can do this correctly and effectively only if your data model is good and of high quality.
Good Database Blueprint. The database of an organization is built and implemented from the data model. Every component of the data model gets transformed into one or more parts of the database. If the entity types in a data model are erroneous or incomplete, the resulting database will be incorrect. If the data model does not establish the relationships correctly, the database will link the data structures incorrectly.
For building and implementing the database of an organization accurately and completely, the data model should be of good quality. A good data model is a good blueprint; a good blueprint ensures a good database, the end product.
Good and Bad Models
You have again noted the two primary purposes of a data model. Figure 10-1 presents the two purposes again. Now, the question arises: If our goal is to produce a good data model, can we recognize good and bad models?
In order to examine and find out if a model is good or bad, let us get back to the two primary purposes of a data model. We will examine a data model from the view of these purposes and note the characteristics of good and bad models. The first purpose is the use of a data model as a communication tool for working with the user groups and stakeholders, that is, people outside the IT community. The second purpose of a data model is its use as a blueprint for database design and implementation.
FIGURE 10-1 Purposes of a data model.
Communication Tool. A good data model has distinct characteristics including:
. Symbols used in the model have specific and unambiguous meanings.
. Users can intuitively understand the data model diagram.
. The model diagram conveys correct semantics.
. The layout of the model diagram is clear, uncluttered, and appealing.
. Users are able to understand the representations noted in the data model.
. Users can easily relate model components to their information requirements.
. The data model reflects the business rules correctly.
. Users are able to notice problems of representations, if any, easily.
. Users are able to note any missing representations without difficulty.
. Users are able to suggest additions, deletions, and modifications easily.
. The model is free from hidden or ambiguous meanings.
. The data model is able to facilitate back-and-forth communication with user groups effectively.
. The data model diagram and accompanying documentation complement each other and work well together as a joint communication tool.
A bad data model does not possess the above characteristics. Further, a data model may be dismissed as bad if, in addition to the absence of the above characteristics, the model has specific negative features including the following:
. The data model diagram is confusing and convoluted.
. The data model is incomplete and does not represent the complete information requirements.
. There are several components in the data model that are vague and ambiguous.
. The symbols lack clarity; meanings are not distinct and clear.
. The layout of the data model diagram is horrific.
. Users find numerous representation errors in the data model.
Blueprint for Database. A good data model from the point of view of its use as a detailed plan for the database has distinct characteristics including:
. Component-to-component mapping between the conceptual data model and the logical data model is easy.
. Each component in the conceptual data model has an equivalent part or parts in the logical model.
. All symbols are clear for making the transition.
. All meanings are easily transferable between the conceptual and logical data models.
. The data model is a complete blueprint.
. The data model can be broken down into cohesive parts for possible partial implementations.
. The connections or links between model components are easily defined.
. The business rules may easily be transposed from the conceptual data model to the logical data model.
A bad data model does not possess the above characteristics. Moreover, a data model may be considered bad if, in addition to the absence of the above characteristics, the model has specific negative features including the following:
. The data model contains insufficient information for all the transitions to the logical data model to be rendered possible.