DATA MODELING FUNDAMENTALS (P13)

OLAP Versus Data Mining. From our earlier discussion on OLAP, you have a clear idea about the features of OLAP. With OLAP queries and analysis, users are able to obtain results and derive interesting patterns from the data.

Data mining also enables users to uncover interesting patterns, but there is an essential difference in the way the results are obtained. Figure 9-31 points out the essential difference between the two approaches.

Although both OLAP and data mining are complex information delivery systems, the basic difference lies in the interaction of the users with the systems. OLAP is a user-driven methodology; data mining is a data-driven approach. Data mining is a fairly automatic knowledge discovery process.

Data Mining: Knowledge Discovery. The knowledge discovery in data mining technology may be broken down into the following basic steps:

- Define business objectives
- Prepare data
- Launch data mining tools
- Evaluate results
- Present knowledge discoveries
- Incorporate usage of discoveries

Figure 9-32 amplifies the knowledge discovery process and shows the relevant data repositories.

FIGURE 9-31 OLAP and data mining.

Data Mining/Data Warehousing. How and where does data mining fit in a data warehousing environment? The data warehouse is a valuable and easily available data source for data mining operations. Data in the data warehouse is already cleansed and consolidated. Data for data mining may be extracted from the data warehouse.

Figure 9-33 illustrates data mining in the data warehouse environment. Observe the movement of data for data mining operations.

FIGURE 9-32 Knowledge discovery process.

FIGURE 9-33 Data mining in the data warehouse environment.

Data Mining Techniques

Although a discussion of major data mining techniques might be somewhat useful, our primary concentration is on the data and how to model the data for data mining applications. Detailed study of data mining techniques and algorithms is, therefore, outside the scope of our study. These techniques and algorithms are complex and highly technical. However, we will just touch on major functions, application areas, and techniques.

Functions and Techniques. Refer to Figure 9-34 showing data mining functions and techniques.

Look at the four columns in the figure and try to understand the connections. Review the following statements.

- Data mining algorithms are part of data mining techniques.
- Data mining techniques are used to carry out data mining functions. While performing specific data mining functions, you are applying data mining processes.
- A certain data mining function is generally suitable to a given application area.
- Each application area is a major area in business where data mining is actively used.

Applications. In order to appreciate the tremendous usefulness of data mining, let us list a few major applications of data mining in the business area.

Customer Segmentation. This is one of the most widespread applications. Businesses use data mining to understand their customers. Cluster detection algorithms discover clusters of customers sharing the same buying characteristics.
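To make cluster detection concrete, here is a minimal sketch using plain k-means, which is only one of several cluster detection techniques and is not named in the text. The customer figures, the choice of two clusters, and the function names `nearest` and `k_means` are all assumptions made for illustration.

```python
import math

# Hypothetical customers: (visits per month, average basket value).
customers = [(2, 20.0), (3, 25.0), (2, 22.0), (12, 90.0), (11, 95.0), (13, 88.0)]

def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, centers, rounds=10):
    """Plain k-means: assign points to centers, then move centers to the mean."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return [nearest(p, centers) for p in points]

# Start from the first and last customers as initial centers (an arbitrary choice).
labels = k_means(customers, [customers[0], customers[-1]])
print(labels)
```

In a real application the variables would normally be scaled first so that no single variable dominates the distance; that step is left out here to keep the sketch short.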

FIGURE 9-34 Data mining functions and techniques.

Market Basket Analysis. This is a very useful application for retail. Link analysis algorithms uncover affinities between products that are bought together. Other businesses such as upscale auction houses use these algorithms to find customers to whom they can sell higher-value items.
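The idea of product affinity can be illustrated with a very small sketch that counts how often two items appear in the same basket. This is a simplification of link analysis, not a full algorithm; the baskets and the function name `pair_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each set holds the items bought in one visit.
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter", "eggs"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

Pairs with high support are candidates for items that are bought together; a production tool would also weigh how often each item is bought on its own.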

Risk Management. Insurance companies and mortgage businesses use data mining to discover risks associated with potential customers.

Delinquency Tracking. Loan companies use the technology to track customers who are likely to be delinquent on loan repayments.

Demand Prediction. Retail and other distribution businesses use data mining to match demand and supply trends to forecast demand for specific products.

Data Preparation and Modeling

Data for your mining operations depends on the business objectives, that is, on what you expect to get out of the data mining technique being adopted to work on the data. You will be able to come up with a set of data elements whose values are required as input into the data mining tool. For getting the values, you need to determine the data sources.

Go back and revisit Figure 9-33, which shows data preparation from the enterprise data warehouse. You know the sources that feed the data warehouse. It is assumed that the data warehouse contains data that has been integrated and combined from several sources. The data warehouse is also expected to have clean data with all the impurities removed in the data staging area. If data mining algorithms are allowed to work on inconsistent data, the results could be totally useless.

In this subsection, we will concentrate on a box that indicates data selected, extracted, transformed, and prepared for data mining. We will discuss how the data selection and preparation are done. Once the data is prepared, you need to store the prepared data suitably for feeding the data mining applications. What are good methods for this data storage? How do you prepare the data model for this data repository?

Data Preprocessing

Depending on the particular data mining application, you may find that the needed data elements are not present in your data warehouse. Be prepared to look to other outside and internal sources for additional data. Further, incomplete, noisy, and inconsistent data is not infrequent in corporate databases and large data warehouses. Do not simply assume the correctness of available data and just extract data from these sources and feed the data mining application.

Data preprocessing generally consists of the following processes:

- Selection of data
- Preparation of data
- Transformation of data

Let us discuss these briefly. That would give us an idea of the data content that should be reflected in the data model for the preprocessed source data for data mining.

Data Selection. Of course, what data is needed depends on the business objectives and the nature of the data mining application. Remember, data mining algorithms work on data at the lowest grain or level of detail. Based on a list of data elements, you need to identify the sources. Maybe most of the data can be extracted from the data warehouse. Otherwise, determine the secondary sources.

Data mining algorithms work on data variables. Values of selected active variables are fed into the data mining system to perform the required operations. Active variables would be data attributes that may be found within the fact and dimension tables of the data warehouse repository.

Suppose your data mining application wants to perform market basket analysis, that is, to determine what a typical customer is likely to put in a market basket and go to the checkout counter of a supermarket. The active variables in this case would possibly be number of visits and variables to describe each basket such as household identification (from supermarket card), date of purchase, items purchased, basket value, quantities purchased, and promotion code.
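As a rough illustration, the active variables listed for market basket analysis could be held in a record like the one below. The class name `BasketRecord`, the field names, and the sample values are assumptions; the sketch also assumes one basket per visit, so that the number of visits can be derived by counting baskets.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional

@dataclass
class BasketRecord:
    """One market basket described by the active variables listed above."""
    household_id: str                      # from the supermarket card
    purchase_date: date
    items: Dict[str, int] = field(default_factory=dict)  # item -> quantity purchased
    basket_value: float = 0.0
    promotion_code: Optional[str] = None   # None when no promotion applied

def visits_per_household(records):
    """Derive the number of visits by counting baskets per household."""
    return Counter(r.household_id for r in records)

# Hypothetical sample data.
records = [
    BasketRecord("H1", date(2024, 1, 5), {"milk": 2, "bread": 1}, 7.50, "P10"),
    BasketRecord("H1", date(2024, 1, 12), {"eggs": 12}, 3.20),
    BasketRecord("H2", date(2024, 1, 6), {"milk": 1}, 1.90),
]
visits = visits_per_household(records)
print(visits["H1"], visits["H2"])
```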

Active variables generally fall into the following categories:

Nominal Variable. This has a limited number of values with no significance attached to the values in terms of ranking. Example: gender (male or female).

Ordinal Variable. This has a limited number of values with values signifying ranking. Example: customer education (high school or college or graduate school).

Continuous Measure Variable. The difference in values of the variable is measurable, with continuous variations. Examples: purchase price, number of items. Values for this variable are real numbers.

Discrete Measure Variable. The difference in values of the variable is measurable, with discrete variations. Example: number of market basket items. Values for this variable are integers.

Data Preparation. This step basically entails cleansing the selected data. First, this step begins with a general review of the structure of the data in question and choosing a method to measure quality. Usually, measuring data quality gets done by a combination of statistical methods and data visualization techniques.

Most common data problems appear to be the following:

Missing Values. No recorded values for many instances. Need to fill in the missing values before using the variable. Several techniques are available to estimate and fill in the missing values.

Noisy Data. A few instances have values completely out of line. Example: daily wages exceeding a million dollars. Several smoothing techniques are available to deal with noisy data.

Inconsistent Data. Synonyms and homonyms in various source systems may produce incorrect and inconsistent data. Sources must be reviewed and inconsistencies removed.

Removal of data problems signals the end of the data preparation step. Once the selected data is cleansed, it is ready for transformation.
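The first two problems can be sketched with a few lines of code. The text does not say which estimation or smoothing techniques to use, so the choices here (fill with the median, cap values at ten times the median) are assumptions picked for simplicity, as are the sample wages and the function names.

```python
from statistics import median

# Hypothetical daily wages; None marks a missing value.
daily_wages = [120.0, None, 135.0, 128.0, 1_500_000.0, None, 131.0]

def fill_missing(values):
    """Replace missing entries with the median of the recorded ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def clip_outliers(values, factor=10.0):
    """Cap values above factor times the median (a crude smoothing rule)."""
    cap = factor * median(values)
    return [min(v, cap) for v in values]

cleaned = clip_outliers(fill_missing(daily_wages))
print(cleaned)
```

The median is used rather than the mean because a single out-of-line value, such as the million-dollar wage, would distort the mean badly.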

Data Transformation. The prepared data is getting ready to be used as input to the data mining algorithm. The data transformation step converts the prepared data into a format suitable for data mining. You may say that data transformation changes the prepared data into a type of analytical model. The analytical model is simply an information structure representing integrated and time-dependent formatting of prepared data.

For example, if a supermarket wants to analyze customer purchases, it must first be decided if the analysis will be done at the store level or at the level of individual purchases. The analytical model includes the variables and the levels of detail.
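The choice of level of detail can be shown with a small sketch that rolls individual purchases up to the store level. The row layout, the sample values, and the function name `to_store_level` are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase rows at the lowest grain: (store, purchase id, amount).
purchases = [
    ("S1", "P100", 25.0),
    ("S1", "P101", 40.0),
    ("S2", "P200", 15.0),
]

def to_store_level(rows):
    """Roll individual purchases up to one total per store."""
    totals = defaultdict(float)
    for store, _purchase_id, amount in rows:
        totals[store] += amount
    return dict(totals)

store_totals = to_store_level(purchases)
print(store_totals)
```

Once data is summarized to the store level, the individual purchases cannot be recovered from it, which is why the level has to be decided before the transformation is carried out.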

Following the identification of the analytical model, detailed data transformation takes place. The objective is to transform the data to fit the exact requirements stipulated by the data mining algorithm. Data transformation may include a number of substeps such as:

- Data recoding
- Data format conversion
- Householding (linking of data of customers in the same household)
- Data reduction (by combining correlated variables)
- Scaling of parameter values to a range acceptable by data mining algorithms
- Discretization (conversion of quantitative variables into categorical variable groups)
- Conversion of categorical variables into a numeric representation
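Three of the substeps above (scaling, discretization, and conversion of a categorical variable to numbers) can be sketched as follows. The function names, cut points, labels, and sample values are assumptions; the education levels are taken from the ordinal variable example earlier in the text.

```python
def min_max_scale(values):
    """Scale values linearly into the range 0 to 1.

    Assumes the values are not all equal, otherwise the divisor is zero.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def discretize(value, cut_points, labels):
    """Map a quantitative value to the label of the first band it falls under."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

# Ordinal encoding keeps the ranking of the education levels.
EDUCATION_RANK = {"high school": 0, "college": 1, "graduate school": 2}

basket_values = [20.0, 60.0, 100.0]  # hypothetical basket values
scaled = min_max_scale(basket_values)
bands = [discretize(v, [50.0, 90.0], ["low", "medium", "high"]) for v in basket_values]
encoded = [EDUCATION_RANK[e] for e in ["college", "high school"]]
print(scaled, bands, encoded)
```

A rank-based encoding suits ordinal variables only; a nominal variable such as gender has no ranking, so it would need a different numeric representation.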

Data Modeling

Data modeling for data mining applications involves representations of the pertinent data repositories. In our discussions of data mining so far, we have been referring to the data requirements for data mining applications. We pointed out certain advantages of data extraction from the data warehouse. However, the data warehouse is not a required source; you may directly extract data from operational systems and other sources.

Figure 9-35 shows the data movements, the data preprocessing phase, and the data repositories.

Study the figure carefully and note the following data repositories for which we need to create data models as suggested below:

DM Source Repository. This is a general data store for all possible data mining applications. Periodically, data is extracted from the data warehouse and stored in this repository.

Data Model. Normalized relational data model to represent low-level data content for all possible active variables available from the data warehouse.

Application Analytical Repository. This is a data store for a specific data mining application. Data is extracted from the above DM Source Repository and other sources and stored in this repository. Only the required active variables are selected.

Data Model. Normalized relational data model is recommended to represent data content at the desired data level for only those active variables relevant to the specific data mining application.

Data Mining Input Extract. This data store is meant to be used as input to the data mining algorithm. Data in the Application Analytical Repository is transformed and moved into this data store. This data store contains transformed values for only the required active variables.

Data Model. Flat file or normalized relational data model with two or three tables to represent data content to be used as direct input to the data mining algorithm.
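The movement from a normalized repository to a flat input extract can be sketched with an in-memory SQLite database. The table names, columns, and sample rows are hypothetical and stand in for an Application Analytical Repository; the CSV text at the end stands in for the Data Mining Input Extract.

```python
import csv
import io
import sqlite3

# Normalized tables standing in for the Application Analytical Repository.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE household (household_id TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE basket (
        basket_id INTEGER PRIMARY KEY,
        household_id TEXT REFERENCES household(household_id),
        basket_value REAL
    );
    INSERT INTO household VALUES ('H1', 'north'), ('H2', 'south');
    INSERT INTO basket VALUES (1, 'H1', 30.0), (2, 'H1', 50.0), (3, 'H2', 20.0);
""")

# Join and summarize so that each household becomes one flat row.
rows = conn.execute("""
    SELECT h.household_id, h.region, COUNT(*) AS visits,
           SUM(b.basket_value) AS total_value
    FROM household h JOIN basket b ON b.household_id = h.household_id
    GROUP BY h.household_id, h.region
    ORDER BY h.household_id
""").fetchall()

# Write the flat extract as CSV text; a real job would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["household_id", "region", "visits", "total_value"])
writer.writerows(rows)
flat_extract = buffer.getvalue()
print(flat_extract)
```

Keeping the repository normalized and flattening only at the last step means the same repository can feed extracts at different levels of detail.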

FIGURE 9-35 Data mining: data movements and repositories.

- Data warehousing is the most common of the decision-support systems widely used today. It is a blend of several technologies. Major components of a data warehouse are source data, data staging, data storage, and information delivery.
- Decision makers view business in terms of business dimensions for analysis. Therefore, data modeling for a data warehouse must take into account business dimensions and the business metrics. Dimensional modeling technique is used.
- A dimensional data model, known as a STAR schema, consists of several dimension entity types and a fact entity type in the middle. Each of the dimension entity types is in a one-to-many relationship with the common fact entity type. The STAR schema is not normalized. A snowflake schema, sometimes useful, is a normalized version. The data model for a given data warehouse usually consists of families of STARS.
- The conceptual data model in the form of a STAR schema is transformed into a logical model. If the data warehouse is implemented using a relational DBMS, the logical model in this case is a relational model.
- OLAP systems provide complex dimensional analysis. Data modeling for MOLAP: representation of multidimensional arrays suitable for the particular MDDBMS selected. Data modeling for ROLAP: E-R model of summarized data as required.
- Data mining is a fairly automatic knowledge discovery system. Data modeling for data mining systems consists of modeling for the data repositories: DM source repository, application analytical repository, and DM input extract.

REVIEW QUESTIONS

1. Match the column entries:

   1. Informational systems         A. Uses MDDBMS
   2. Data staging area             B. Semiadditive
   3. Dimension hierarchies         C. Normalized
   4. Fact entity type              D. Knowledge discovery
   5. Dimension table               E. Decision support
   6. Profit margin percentage      F. For drill-down analysis
   7. Snowflake schema              G. Data cleansed and transformed
   10. Data mining                  J. Represents multiple dimensions

2. A data warehouse is a decision-support environment, not a product. Discuss.

3. What data does an information package contain? Give a simple example.

4. Explain why the E-R modeling technique is not completely suitable for the data warehouse. How is dimensional modeling different?

5. Describe the composition of the primary keys for the dimension and fact tables. Give simple examples.

6. Describe the nature of the columns in a dimension table transformed from the corresponding conceptual STAR schema. Give typical examples.

7. What is your understanding of a value chain and a value circle in terms of families of STARS? What are the implications for data modeling?

8. Describe the main features of a ROLAP system. Explain how data modeling is done for this.

9. Distinguish between OLAP and data mining with regard to data modeling.

10. Discuss data preprocessing for data mining. What data repositories are involved and how do you model these?

PRACTICAL APPROACH TO DATA MODELING

ENSURING QUALITY IN THE DATA MODEL

CHAPTER OBJECTIVES

Establish the significance of quality in a data model

Explore approaches to good data modeling

Study instituting quality in model definitions

Introduce quality dimensions in a data model

Examine dimensions of accuracy, completeness, and clarity

Highlight features and benefits of a high-quality model

Discuss quality assurance process and the results

Understand data model review and assessment

We are at a crucial point in our discussions of data modeling. We have traveled quite far, covering much ground. You have a strong grip on data modeling by now. You are an expert on the components of a data model. You know how to translate the information requirements of an organization into a suitable data model using model components. You have studied a number of examples of data models. In short, you now possess a thorough knowledge of what data modeling is all about. What more is left to be covered?

In this chapter, we are now ready to turn our attention to an important aspect of data modeling: ensuring quality of the model. Having gone through the multifarious facets of data modeling, it is just fitting to bring all that to a logical conclusion by stressing data model quality.

In recent decades, organizational user groups and information technology professionals have realized the overwhelming significance of data modeling. A data modeling effort precedes every database implementation. However, what we see in practice is a number of bad or inadequate models out there. Many models are totally incorrect representations of the information requirements. It is not that a bad model lacks a few attributes here and there or portrays a small number of relationships incorrectly. Many bad models lead to disastrous database implementations. Some bad data models are abandoned midstream and shelved because of improper quality control. The efforts of many weeks and months are down the drain.

Data Modeling Fundamentals. By Paulraj Ponniah. Copyright © 2007 John Wiley & Sons, Inc.

This chapter addresses the critical issues of data model quality. First, you will get to appreciate the significance of data model quality. Next, we will move to a discussion of quality in the definitions of various components. We will then explore the dimensions and characteristics of high-quality models and learn how to apply the fundamental principles of model quality. Quality assurance is a distinct process in the modeling effort; you will cover quality assurance in sufficient detail.

SIGNIFICANCE OF QUALITY

It is obvious that high quality in anything we create is essential. That goes without having to mention it specifically. Then is not that maxim true for data modeling as well? Why emphasize quality in data modeling separately? There are some special reasons.

The concepts of data modeling are not that easy to comprehend. Data modeling is a specialized effort needing special skills. A data modeler must be a business analyst, draftsman, documentation expert, and a database specialist, all rolled into one. It takes skill and experience to gain a high degree of proficiency in data modeling. It is easy to overlook the essentials and produce bad data models. In a large organization, piecing together the various components into a good data model requires enormous discipline and skill. It is not difficult to slip on model quality. We need to pay a high degree of special attention to quality.

Why Emphasize Quality?

Recall the fundamental purposes of a data model. Go back to the reasons for creating a data model in the first place. What is the role a data model plays in the development process of the data system for an organization?

First, a data model is meant as a communication tool for confirming the information requirements with the user groups. Next, a data model serves as a blueprint for the design and implementation of the data system for the organization. We have covered these two themes in elaborate detail. Figure 10-1 summarizes these two essential purposes of a data model.

Good Communication Tool. Quality in a data model is essential because the model has to be a good and effective means of communication with the user groups. As a data modeler, you delve into the information requirements; you want the data content of the ultimate database system to reflect the information requirements exactly. How do you ensure this?

You create a data model as a true replica of the information requirements. Then you use the data model as a tool for communication with the user groups. You have to show them that you have captured all the information requirements properly. You need to point out the various components of the data model to the user groups and get their confirmation. You can do this correctly and effectively only if your data model is good and of high quality.

Good Database Blueprint. The database of an organization is built and implemented from the data model. Every component of the data model gets transformed into one or more parts of the database. If the entity types in a data model are erroneous or incomplete, the resulting database will be incorrect. If the data model does not establish the relationships correctly, the database will link the data structures incorrectly.

For building and implementing the database of an organization accurately and completely, the data model should be of good quality. A good data model is a good blueprint; a good blueprint ensures a good database, the end product.

Good and Bad Models

You have again noted the two primary purposes of a data model. Figure 10-1 presents the two purposes again. Now, the question arises: If our goal is to produce a good data model, can we recognize good and bad models?

In order to examine and find out if a model is good or bad, let us get back to the two primary purposes of a data model. We will examine a data model from the view of these purposes and note the characteristics of good and bad models. The first purpose is the use of a data model as a communication tool for working with the user groups and stakeholders, that is, people outside the IT community. The second purpose of a data model is its use as a blueprint for database design and implementation.

FIGURE 10-1 Purposes of a data model.

Communication Tool. A good data model has distinct characteristics including:

- Symbols used in the model have specific and unambiguous meanings.
- Users can intuitively understand the data model diagram.
- The model diagram conveys correct semantics.
- The layout of the model diagram is clear, uncluttered, and appealing.
- Users are able to understand the representations noted in the data model.
- Users can easily relate model components to their information requirements.
- The data model reflects the business rules correctly.
- Users are able to notice problems of representations, if any, easily.
- Users are able to note any missing representations without difficulty.
- Users are able to suggest additions, deletions, and modifications easily.
- The model is free from hidden or ambiguous meanings.
- The data model is able to facilitate back-and-forth communication with user groups effectively.
- The data model diagram and accompanying documentation complement each other and work well together as a joint communication tool.

A bad data model does not possess the above characteristics. Further, a data model may be dismissed as bad if, in addition to the absence of the above characteristics, the model has specific negative features including the following:

- The data model diagram is confusing and convoluted.
- The data model is incomplete and does not represent the complete information requirements.
- There are several components in the data model that are vague and ambiguous.
- The symbols lack clarity; meanings are not distinct and clear.
- The layout of the data model diagram is horrific.
- Users find numerous representation errors in the data model.

Blueprint for Database. A good data model from the point of view of its use as a detailed plan for the database has distinct characteristics including:

- Component-to-component mapping between the conceptual data model and the logical data model is easy.
- Each component in the conceptual data model has an equivalent part or parts in the logical model.
- All symbols are clear for making the transition.
- All meanings are easily transferable between the conceptual and logical data models.
- The data model is a complete blueprint.
- The data model can be broken down into cohesive parts for possible partial implementations.
- The connections or links between model components are easily defined.
- The business rules may easily be transposed from the conceptual data model to the logical data model.

A bad data model does not possess the above characteristics. Moreover, a data model may be considered bad if, in addition to the absence of the above characteristics, the model has specific negative features including the following:

- The data model contains insufficient information for all the transitions to the logical data model to be rendered possible.
