Figure 17.11 Training data set (attributes: Rec#, Weather, Temperature, Time, Day, and Jog (Target Class); 15 records, e.g., record 14: Thunderstorm, Mild, Sunset, Weekend, No)
For example, the possible values for weather are fine, shower, and thunderstorm, whereas the possible values for temperature are hot, mild, and cool. Continuous values are real numbers (e.g., the height of a person in centimetres). Figure 17.11 shows the training data set for the decision tree shown previously. This training data set consists of only 15 records. For simplicity, only categorical attributes are used in this example. Examining the first record and matching it with the decision tree in Figure 17.10, the target is a Yes for fine weather and mild temperature, disregarding the other two attributes. This is because all records in this training data set follow this rule (see records 1 and 10). Other records, such as records 9 and 13, use all four attributes.
17.3.2 Decision Tree Classification: Processes
Decision Tree Algorithm
There are many different algorithms to construct a decision tree, such as ID3, C4.5, Sprint, etc. Constructing a decision tree is generally a recursive process. At the start, all training records are at the root node. The algorithm then partitions the training records recursively by choosing one attribute at a time, and the process is repeated for each partitioned data set. The recursion stops when a stopping condition is reached, which is when all of the training records in the partition have the same target class label.
Figure 17.12 shows an algorithm for constructing a decision tree. The decision tree construction algorithm uses a divide-and-conquer method. It constructs the tree in a depth-first fashion. Branching can be binary (only 2 branches) or multiway (≥ 2 branches).
Algorithm: Decision Tree Construction
Procedure DTConstruct(D):
1.  T = Ø
2.  Determine the best splitting attribute
3.  T = Create a root node and label it with the splitting attribute
4.  T = Add an arc to the root node for each split predicate, with a label
5.  For each arc do
6.      D = Data set created by applying the splitting predicate to D
7.      If the stopping point has been reached for this path Then
8.          T' = Create a leaf node and label it with the appropriate class
9.      Else
10.         T' = DTConstruct(D)
11.     T = Add T' to the arc
Figure 17.12 Decision tree algorithm
Note that in the algorithm shown in Figure 17.12, the key element is the splitting attribute selection (line 2). The splitting attribute is the attribute chosen to split the training data set into a number of partitions. The splitting attribute step is also often known as feature selection, because the algorithm needs to select a feature (or an attribute) of the training data set to create a node. As mentioned earlier, choosing a different attribute as a splitting attribute will cause the resulting decision tree to be different. The difference in the decision trees produced by an algorithm lies in how the features or input attributes are positioned. Hence, choosing a splitting attribute that will result in an optimum decision tree is desirable. The way in which a splitting node is determined is described in greater detail in the following.
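The recursion in Figure 17.12 can also be expressed as a minimal Python sketch. The function names, the dict-of-dicts tree representation, and the idea of passing the feature-selection criterion in as a function are illustrative choices, not taken from the text; the gain criterion described in the next subsection can be plugged in as that function.

from collections import Counter

def dt_construct(records, attributes, target, select_splitting_attribute):
    # records: list of dicts; attributes: remaining feature attributes;
    # target: the target-class attribute name;
    # select_splitting_attribute: a feature-selection function.
    labels = [r[target] for r in records]
    # Stopping condition: all records in this partition share one class label.
    if len(set(labels)) == 1:
        return labels[0]
    # No feature attributes left: label the leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Line 2 of Figure 17.12: determine the best splitting attribute.
    split = select_splitting_attribute(records, attributes, target)
    node = {split: {}}
    # One arc per value of the splitting attribute; recurse on each partition.
    for value in set(r[split] for r in records):
        partition = [r for r in records if r[split] == value]
        remaining = [a for a in attributes if a != split]
        node[split][value] = dt_construct(partition, remaining, target,
                                          select_splitting_attribute)
    return node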
Splitting Attributes or Feature Selection
When constructing a decision tree, it is necessary to have a means of determining the importance of the attributes for the classification. Hence, a calculation is needed to find the best splitting attribute at a node. All possible splitting attributes are evaluated with a feature selection criterion to find the best attribute. The feature selection criterion still does not guarantee the best decision tree, however, because it also relies on the completeness of the training data set and on whether or not the training data set provides enough information.
The main aim of feature selection, or choosing the right splitting attribute at some point in a decision tree, is to create a tree that is as simple as possible and gives the correct classification. Consequently, poor selection of an attribute can result in a poor decision tree.
At each node, the available attributes are evaluated on the basis of separating the classes of the training records. For example, looking at the training records in Figure 17.11, we note that if Time = Dawn, then the answer is always No (see records 4, 7, and 11–13). This means that if Time is chosen as the first splitting attribute, at the next stage we do not need to process these 5 records (records 4, 7, and 11–13). We need to process only those records with Time = Sunset or Midday (10 records altogether), making the gain for choosing attribute Time as a splitting attribute quite high and, hence, desirable.
Let us look at another possible attribute, namely, Weather. Notice also that when Weather = Thunderstorm, the target class is always No (see records 4, 8, 14, and 15). If attribute Weather is chosen as a splitting attribute in the beginning, these four records (records 4, 8, 14, and 15) will not be processed in the next stage; we need to process only the other 11 records. So, the gain in choosing attribute Weather as a splitting attribute is not bad, but it is not as good as that of attribute Time, because Time prunes out a higher number of records.
Therefore, the main goal in choosing the best splitting attribute is to choose the attribute that will prune out as many records as possible at the early stages, so that fewer records need to be processed in the subsequent stages. We can also say that the best splitting attribute is the one that will result in the smallest tree.
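This pruning effect can be made concrete with a small sketch. The helper name is illustrative, and the Figure 17.11 records (not reproduced here) are assumed to be available as a list of dicts keyed by attribute name; for that data set the counts below match the 5 and 4 records discussed above.

from collections import Counter

def records_pruned(records, attribute, target):
    # Count the records that fall under attribute values whose partition is
    # already pure (a single target class) and so needs no further processing.
    pruned = 0
    for value in set(r[attribute] for r in records):
        classes = Counter(r[target] for r in records if r[attribute] == value)
        if len(classes) == 1:
            pruned += sum(classes.values())
    return pruned

# records_pruned(training_data, "Time", "Jog")    -> 5 (all Dawn records are No)
# records_pruned(training_data, "Weather", "Jog") -> 4 (all Thunderstorm records are No)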
There are various kinds of feature selection criteria for determining the best splitting attribute. The basic feature selection criterion is called the gain criterion, which was designed for the original decision tree algorithms (i.e., ID3/C4.5). Heuristically, the best splitting attribute will produce the "purest" nodes. A popular impurity criterion is information gain. Information gain increases with the average purity of the subsets that an attribute produces. Therefore, the strategy is to choose the attribute that results in the greatest information gain.
The gain criterion basically consists of four important calculations (a short code sketch illustrating them follows this list).
• Given a probability distribution, the information required to predict an event is the distribution's entropy. Entropy for the given probabilities p1, p2, ..., pn of the target classes is calculated with equation 17.2: entropy(p1, p2, ..., pn) = Σ pi × log(1/pi).
Let us use the training data set in Figure 17.11. There are two target classes: Yes and No. With 15 records in the training data set, 5 records have target class Yes and the other 10 records have target class No. The probability of falling into a Yes is 5/15, whereas the No probability is 10/15. The entropy for the given probabilities of the two target classes is then calculated as follows:
entropy(Yes, No) = 5/15 × log(15/5) + 10/15 × log(15/10) = 0.2764 (17.3)
At the next iteration, when the training data set is partitioned into smaller subsets, we need to calculate the entropy based on the number of training records in the partition, not the total number of records in the original training data set.
• For each of the possible attributes to be chosen as a splitting attribute, we need to calculate the entropy value for each of the possible values of that particular attribute. Equation 17.2 can be used, but the number of records is not the total number of training records but rather the number of records possessing that particular attribute value:
For example, for Weather = Fine, there are 4 records with target class Yes and 3 records with target class No. Hence the entropy for Weather = Fine is:
entropy(Weather = Fine) = 4/7 × log(7/4) + 3/7 × log(7/3) = 0.2966 (17.4)
Similarly, for Weather = Shower, there is 1 record with target class Yes and 3 records with target class No, so the entropy for Weather = Shower is:
entropy(Weather = Shower) = 1/4 × log(4/1) + 3/4 × log(4/3) = 0.2442 (17.5)
Note that the entropy calculations for the two examples above use a different total number of records. For Weather = Fine the number of records is 7, whereas for Weather = Shower the number of records is only 4. This number of records is important, because it affects the probability of having a target class. For example, for target class Yes in Fine weather the probability is 4/7, whereas for the same target class Yes in Shower weather the probability is only 1/4.
For each of the attribute values, we need to calculate the entropy. In other words, for attribute Weather, because there are three attribute values (i.e., Fine, Shower, and Thunderstorm), each of these three values must have an entropy value. For attribute Temperature, for instance, we need an entropy value calculated for each of the values Hot, Mild, and Cool.
• The entropy values for each attribute must be combined with a weighted sum, so that each attribute ends up with one entropy value. Because each attribute value has an individual entropy value (e.g., attribute Weather has three entropy values, one for each weather value), and the entropy of each attribute value is based on a different probability distribution, their individual weights must be considered when we combine all the entropy values of the same attribute.
To calculate the weighted sum, each entropy value must be multiplied by the proportion of records with that value out of the total number of training records in the partition. For example, the weighted entropy value for Fine weather is 7/15 × 0.2966.
There are 7 records out of the 15 records with Fine weather, and the entropy for Fine weather is 0.2966, as calculated earlier (see equation 17.4).
Using the same method, the weighted sum for Shower weather is 4/15 × 0.2442, as there are only 4 records out of the 15 records in the training data set with Shower weather, and the original entropy for Shower, as calculated in equation 17.5, is 0.2442.
After each individual entropy value has been weighted, we can sum them for each individual attribute. For example, the weighted sum for attribute Weather is:
weighted sum entropy(Weather) = weighted entropy(Fine) + weighted entropy(Shower) + weighted entropy(Thunderstorm)
= 7/15 × 0.2966 + 4/15 × 0.2442 + 4/15 × 0
• Finally, the gain for an attribute can be calculated by subtracting the weighted sum of the attribute entropy from the overall entropy. For example, the gain for attribute Weather is:
gain(Weather) = entropy(training data set D) − entropy(attribute Weather)
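As a concrete check of these four calculations, the following minimal Python sketch (names illustrative) reproduces the figures quoted in the text for attribute Weather; the quoted values correspond to base-10 logarithms. A function of this kind could also serve as the select_splitting_attribute criterion in the earlier sketch by picking the attribute with the largest gain.

import math

def entropy(counts):
    # Entropy of a class-count distribution, log base 10 as in the text.
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

e_all = entropy([5, 10])          # whole training set: 5 Yes, 10 No -> 0.2764 (17.3)

e_fine    = entropy([4, 3])       # Fine: 4 Yes, 3 No                -> 0.2966 (17.4)
e_shower  = entropy([1, 3])       # Shower: 1 Yes, 3 No              -> 0.2442 (17.5)
e_thunder = entropy([0, 4])       # Thunderstorm: 4 No               -> 0

# Weighted sum over the 15 training records                          -> 0.2035 (17.6)
weighted = 7/15 * e_fine + 4/15 * e_shower + 4/15 * e_thunder

gain_weather = e_all - weighted   # 0.2764 - 0.2035 = 0.0729         (17.7)
print(round(e_all, 4), round(weighted, 4), round(gain_weather, 4))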
A Walk-Through Example
Using the sample training data set in Figure 17.11, the following gives a complete walk-through of the process to create a decision tree.
Step 1: Calculate entropy for the training data set in Figure 17.11. The result was previously calculated as 0.2764 (see equation 17.3).
Step 2: Process attribute Weather
• Calculate weighted sum entropy of attribute Weather:
weighted sum entropy(Weather) = 0.2035 (equation 17.6)
• Calculate information gain for attribute Weather:
gain(Weather) = 0.2764 − 0.2035 = 0.0729 (17.7)
Step 3: Process attribute Temperature
• Calculate weighted sum entropy of attribute Temperature:
entropy(Hot) = 2/5 × log(5/2) + 3/5 × log(5/3) = 0.2923
entropy(Mild) = entropy(Hot) = 0.2923
entropy(Cool) = 1/5 × log(5/1) + 4/5 × log(5/4) = 0.2173
weighted sum entropy(Temperature) = 5/15 × 0.2923 + 5/15 × 0.2923 + 5/15 × 0.2173 = 0.2674
• Calculate information gain for attribute Temperature:
gain(Temperature) = 0.2764 − 0.2674 = 0.009 (17.8)
Step 4: Process attribute Time
• Calculate weighted sum entropy of attribute Time:
entropy(Dawn) = 0 + 5/5 × log(5/5) = 0
entropy(Midday) = 2/6 × log(6/2) + 4/6 × log(6/4) = 0.2764
entropy(Sunset) = 3/4 × log(4/3) + 1/4 × log(4/1) = 0.2442
weighted sum entropy(Time) = 5/15 × 0 + 6/15 × 0.2764 + 4/15 × 0.2442 = 0.1757
• Calculate information gain for attribute Time:
gain(Time) = 0.2764 − 0.1757 = 0.1007 (17.9)
Step 5: Process attribute Day
• Calculate weighted sum entropy of attribute Day:
entropy(Weekday) = 4/10 × log(10/4) + 6/10 × log(10/6) = 0.2923
entropy(Weekend) = 1/5 × log(5/1) + 4/5 × log(5/4) = 0.2173
weighted sum entropy(Day) = 10/15 × 0.2923 + 5/15 × 0.2173 = 0.2673
• Calculate information gain for attribute Day:
gain(Day) = 0.2764 − 0.2673 = 0.0091 (17.10)
Figure 17.13 Partial decision tree with Time as the root node (arcs Dawn, Midday, and Sunset)
Comparing equations 17.7, 17.8, 17.9, and 17.10 for the gains of the four attributes (Weather, Temperature, Time, and Day), the biggest gain is that of Time, with gain value = 0.1007 (see equation 17.9); as a result, attribute Time is chosen as the first splitting attribute. A partial decision tree with the root node Time is shown in Figure 17.13.
The next stage is to process partition D1, consisting of the records with Time = Midday. Training data set partition D1 consists of 6 records, with record numbers 3, 6, 8, 9, 10, and 15. The next task is to determine the splitting attribute for partition D1, whether it is Weather, Temperature, or Day. The process, similar to the calculation of entropy and information gain above, is summarized as follows:
Step 1: Calculate entropy for the training data set partition D1:
entropy(D1) = 2/6 × log(6/2) + 4/6 × log(6/4) = 0.2764 (17.11)
Step 2: Process attribute Weather
• Calculate weighted sum entropy of attribute Weather:
entropy(Fine) = 2/3 × log(3/2) + 1/3 × log(3/1) = 0.2764
entropy(Shower) = entropy(Thunderstorm) = 0
weighted sum entropy(Weather) = 3/6 × 0.2764 = 0.1382
• Calculate information gain for attribute Weather:
gain(Weather) = 0.2764 − 0.1382 = 0.1382 (17.12)
Step 3: Process attribute Temperature
• Calculate weighted sum entropy of attribute Temperature:
entropy(Hot) = 0
entropy(Mild) = entropy(Cool) = 1/2 × log(2/1) + 1/2 × log(2/1) = 0.3010
weighted sum entropy(Temperature) = 2/6 × 0 + 2/6 × 0.3010 + 2/6 × 0.3010 = 0.2007
• Calculate information gain for attribute Temperature:
gain(Temperature) = 0.2764 − 0.2007 = 0.0757 (17.13)
Step 4: Process attribute Day
• Calculate weighted sum entropy of attribute Day:
entropy(Weekday) = 2/6 × log(6/2) + 4/6 × log(6/4) = 0.2764
entropy(Weekend) = 0
weighted sum entropy(Day) = 0.2764
• Calculate information gain for attribute Day:
gain(Day) = 0.2764 − 0.2764 = 0 (17.14)
The best splitting node for partition D1 is attribute Weather, with an information gain value of 0.1382 (see equation 17.12). Continuing from Figure 17.13, Figure 17.14 shows the temporary decision tree.
For partition D2, the splitting attribute is also Weather. The entropy and information gain calculations are summarized as follows:
entropy(D2) = 0.2443
weighted sum entropy(Weather) = 0
gain(Weather) = 0.2443 ⇒ highest information gain
weighted sum entropy(Temperature) = 0.1505
gain(Temperature) = 0.0938
weighted sum entropy(Day) = 0.1505
gain(Day) = 0.0938
And for partition D11, the splitting attribute is Temperature. The entropy and information gain calculations are summarized as follows:
entropy(D11) = 0.2546
weighted sum entropy(Temperature) = 0
gain(Temperature) = 0.2546 ⇒ highest information gain
weighted sum entropy(Day) = 0.2546
gain(Day) = 0
Figure 17.15 Final decision tree
Because each of the partitions has branches that reach the target class node, a complete decision tree is generated. Figure 17.15 shows the final decision tree. Note that the decision tree in Figure 17.15 looks different from the decision tree in Figure 17.10, and yet both correctly represent all rules from the training data set in Figure 17.11. The decision tree in Figure 17.15 is more compact and is better than the one previously shown in Figure 17.10. Also note that Figure 17.15 does not use attribute Day as a splitting attribute at all (as the training data set is limited), and all rules can be generated without the need for attribute Day.
17.3.3 Decision Tree Classification: Parallel Processing
Since the structure of a decision tree is similar to a query tree in query optimization, parallelization of a decision tree would be quite similar to subqueries execution scheduling in parallel query optimization (refer to Chapter 9). In subqueries execution scheduling for query tree optimization, there are serial subqueries execution scheduling and parallel subqueries execution scheduling, whereas for parallel data mining, this chapter introduces data parallelism and result parallelism. A parallel decision tree combines both concepts, subqueries execution scheduling and parallel data mining, because both deal with tree parallelism. Data parallelism for a decision tree is basically similar to serial subqueries execution scheduling, whereas result parallelism is identical to parallel subqueries execution scheduling. Both data parallelism and result parallelism for a decision tree are described below.
Data Parallelism for Decision Tree
There are many terms used to describe data parallelism for a decision tree, including synchronous tree construction, feature/attribute partitioning, and intratree node parallelism. All of these basically describe data parallelism from a different angle. As we discuss data parallelism for a decision tree, we will note how these other names arise.
Data parallelism is created because of data partitioning. Previously, particularly in parallel association rules, parallel sequential patterns, and parallel clustering, data parallelism employed horizontal data partitioning, whereby different records from the data set are distributed to different processors. Each processor will have a disjoint partitioned data set, each of which consists of a number of records with the complete attributes.
Data parallelism for a decision tree employs another type of data partitioning, namely vertical data partitioning. Note that basic data partitioning, covering horizontal and vertical data partitioning, was explained in Chapter 3 on the parallel searching operation (or parallel selection operation). For a parallel decision tree using data parallelism, the training data set is vertically partitioned, so that each partition will have one or more feature attributes, the target class, and the record number. In other words, the feature attributes are vertically partitioned, but the record number and target class are replicated to all partitions. Figure 17.16 illustrates the vertical data partitioning of a training data set.
The target class needs to be replicated to all partitions because only by having the target class can the partitions be glued together. The record numbers will be used in the subsequent iterations in building the tree, as the partition size will shrink because of further partitioning of each partition.
In data parallelism for a decision tree, as in any other data parallelism, the complete temporary result, in this case the decision tree, will be maintained in each processor. In other words, at the end of each stage of building the decision tree, the same temporary decision tree will exist in all processors. This is the same as any other data parallelism, such as data parallelism for association rules, where, in count distribution, at the end of each iteration the frequent itemset is the same in each processor. This is also the same in data parallelism for k-means clustering, where each processor will have the same clusters at the end of each iteration.
Figure 17.17 shows an illustration of data parallelism for a decision tree. At level 1, the root node is processed and determined. At the end of level 1, each processor will have the same root node.
At level 2, if the root node has n branches, there will be n sublevels of level 2. In the example shown in Figure 17.17, there are 3 branches from the root node. Consequently, there will be levels 2a, 2b, and 2c.
Figure 17.16 Vertical data partitioning of a training data set (the feature attributes are split across Partition 1, Partition 2, and Partition 3, while the record number and target class are replicated to each partition)
Each sublevel of level 2 will be
processed one after another, but when processing a sublevel of level 2, parallel processors are employed. In this sense, it is similar to serial subqueries execution scheduling. Parallelism is within a node, and hence it is intratree node parallelism.
The sublevel processing also applies to the subsequent levels. For example, Figure 17.17 shows the processing of level 3a. To highlight the node currently being processed within a sublevel, that node is filled in black in the decision tree in Figure 17.17. All other nodes are not filled.
Using the training data set in Figure 17.11, assume that 2 processors are to be employed in the parallel decision tree construction. As there are four feature attributes, these attributes are vertically partitioned over the two processors: processor 1 receives the first two attributes, Weather and Temperature, whereas processor 2 receives the other two attributes, Time and Day. Figure 17.18 shows the parallel processing of level 1 (the root node), in which each processor locally calculates the information gain values for its own feature attributes, the processors exchange this information to determine the splitting attribute, and a distribution of the selected record numbers is carried out. All of these activities are information sharing activities, similar to count distribution in parallel association rules.
Figure 17.17 Data parallelism of parallel decision tree construction (levels 1, 2a, 2b, 2c, and 3a)
In a parallel decision tree, these information sharing activities can be thought of as a means to "synchronize" the decision tree, and hence data parallelism for a parallel decision tree is also known as the synchronous tree construction approach.
Once the tree has been synchronized, each processor will have the same decision tree. Then the next stage (i.e., level 2a) starts. Note that each partition now has a smaller number of records (i.e., only 6 records in each partition). Furthermore, because attribute Time has already been processed, this attribute is eliminated from the partition (see the shaded Time attribute in Fig. 17.18). In this case, processor 2 will have only one feature attribute (i.e., Day) to process, whereas processor 1 still has its original two feature attributes (i.e., Weather and Temperature).
If all of the feature attributes from one partition (one processor) have been processed in the previous stages, then there are two options. Option one is to leave the processor idle, and option two is to request other processors to send or to share one of their feature attributes. The latter is the subject of load balancing, which has been discussed in Chapter 9 on parallel query optimization. So, although theoretically data parallelism does not require any data movement, in some cases where load balancing needs to be performed, data movement among processors may happen.
pro-If, in the first place, the number of processors is more than the available number
of feature attributes, then a few processors may share the same feature attribute.
Figure 17.18 Data parallelism in decision tree. Level 1 (root node): processor 1 (records 1, 2, ..., 15) locally calculates the information gain values for Weather and Temperature, while processor 2 (records 1, 2, ..., 15) locally calculates the information gain values for Time and Day. Global information sharing stage: (a) share target class counts to calculate the data set entropy value; (b) exchange the entropy/gain values to determine the splitting attribute (e.g., attribute Time is decided to be the splitting attribute); (c) distribute the selected record numbers to all processors for the next phase (e.g., records 3, 6, 8, 9, 10, and 15 for Time = Midday, and records 1, 2, 5, and 14 for Time = Sunset). The decision tree for level 1 is then the same in every processor.
Once level 2a processing starts, each processor will work independently, and afterward information sharing or tree synchronization is carried out. The process is repeated for all nodes. In this case, level 2b will commence once level 2a has completed its task.
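A toy sketch of this synchronization step is given below; plain function calls stand in for the message passing a real system would use, the helper names are illustrative, and the numeric gains are those from the level-1 walk-through above.

def local_best(gains):
    # Each processor evaluates only the feature attributes in its own
    # vertical partition and reports its best local candidate.
    return max(gains.items(), key=lambda kv: kv[1])

def synchronize(local_candidates):
    # Global information sharing: all processors agree on the attribute with
    # the highest information gain among the local winners.
    return max(local_candidates, key=lambda kv: kv[1])[0]

p1 = local_best({"Weather": 0.0729, "Temperature": 0.009})   # processor 1
p2 = local_best({"Time": 0.1007, "Day": 0.0091})             # processor 2
print(synchronize([p1, p2]))   # -> 'Time' is chosen as the splitting attribute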
Result Parallelism for Decision Tree
As opposed to data parallelism, where the parallelism is intratree node, result parallelism for the decision tree is intertree node parallelism. Hence, if there are multiple nodes on a level, parallelism is achieved by processing the nodes concurrently on several processors.
con-Analogous to subqueries execution scheduling in parallel query optimization,
if data parallelism is serial subqueries execution scheduling, result parallelism is
parallel subqueries execution scheduling So, there is some degree of similarity
between parallel decision tree construction and parallel query tree optimization.
Figure 17.18 (continued) Level 2a (Time = Midday; records 3, 6, 8, 9, 10, and 15): processor 1 locally calculates the information gain values for Weather and Temperature, while processor 2 locally calculates the information gain value for Day. Global information sharing stage: (a) share the target class counts of each partition to calculate the data set entropy value; (b) exchange the entropy/gain values to determine the splitting attribute (e.g., attribute Weather is decided to be the splitting attribute); (c) distribute the selected record numbers to all processors for the next phase. The resulting decision tree for level 2 is then the same in every processor.
Because the tree itself is partitioned among the processors, result parallelism for a parallel decision tree is also known as "partitioned tree construction." Figure 17.19 gives an illustration of how a decision tree is partitioned. Logically, partitioning a decision tree is similar to the partially replicated index (PRI) described in Chapter 7 on parallel indexing. The main rule is that the processor that processes a child node in a tree will also process its parent nodes. Consequently, the root node is processed by all processors.
Figure 17.19 shows that at the root node level the root node processing is shared by all three processors. On level 2, the three nodes below the root node are each handled by a different processor, so that parallelism at this level is internode parallelism.
In summary, if the number of processors is less than the number of nodes, intranode parallelism is applied. If not, then internode parallelism is employed. The decision tree partitioning in Figure 17.19 can be redrawn as Figure 17.20, emphasizing the load of each processor. The dark shaded nodes indicate the node being processed by each processor at a particular level.
Figure 17.20 Result parallelism of parallel decision tree construction (levels 1 to 4, with dark shaded nodes indicating the node processed by each processor at a particular level)
In result parallelism, the training data set is horizontally partitioned: with two processors, processor 1 gets the first 8 records, and processor 2 gets the last 7 records.
Since entropy and information gain calculations need global information from the entire training data set, each processor needs to exchange counts with other processors, and this is global information exchange. Once each processor receives the necessary information to calculate the entropy and information gain values, it decides the best splitting attribute.
Before level 2 processing starts, each processor needs to know which records are to be processed next. In this case, processor 1 will process the node pointed to by the Midday time arc, whereas processor 2 will process the node pointed to by the Sunset time arc. Processor 1 needs to know which records to process, and so does processor 2. In this example, processor 1 will obtain a data set partition containing records 3, 6, 8, 9, 10, and 15, whereas processor 2 will obtain records 1, 2, 5, and 14. At this stage, there will be record movement from one processor to the other, since each processor may require records from other processors to process the node allocated to it. For example, processor 1 now needs record 15, which was initially located in partition 2 (processor 2). Once data movement is complete, level 2 processing can commence.
Note that the decision tree from level 1 is shown in each processor. A dotted line indicates that the path is processed by another processor. The arc Sunset, which is dotted in processor 1, means that this arc is processed by processor 2; on the other hand, the arc Midday, which is a dotted line in processor 2, refers to the path being processed by processor 1.
During level 2 processing, global information sharing is also needed, as in level 1 processing. The global information sharing is needed to calculate the entropy and information gain values in order to determine the next splitting attribute. After the splitting attribute has been determined, the records need to be redistributed again.
In our example in Figure 17.21, level 3 processing requires only processor 1 to work. This is because processor 2 has completed its part and all the necessary target class nodes have been generated. Processor 1 in level 3 processing will obtain records 6, 9, and 10, which are a subset of its previous partition in level 2. Figure 17.21 shows the entire process of result parallelism of the parallel decision tree.
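A minimal sketch of the node-to-processor assignment in this partitioned tree construction is shown below; the helper name is illustrative, and a real implementation would also perform the record redistribution and global information sharing described above.

def assign_children(arcs, processors):
    # Distribute the child nodes created by a split over the processors
    # (intertree node parallelism); each processor receives the record
    # numbers that follow its arc.
    assignment = {}
    for i, (arc_value, record_ids) in enumerate(arcs.items()):
        assignment[arc_value] = {
            "processor": processors[i % len(processors)],
            "records": sorted(record_ids),
        }
    return assignment

# Level 2 of the walk-through: the root split on Time leaves two arcs to
# expand (the Dawn arc already ends in a leaf).
arcs = {"Midday": {3, 6, 8, 9, 10, 15}, "Sunset": {1, 2, 5, 14}}
print(assign_children(arcs, processors=[1, 2]))
# {'Midday': {'processor': 1, 'records': [3, 6, 8, 9, 10, 15]},
#  'Sunset': {'processor': 2, 'records': [1, 2, 5, 14]}}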
17.4 Summary
This chapter presents two more data mining techniques, namely clustering and classification. For clustering, the k-means method is chosen, whereas for classification, the decision tree method is used.
Parallel k-means and the parallel decision tree both adopt data parallelism and result parallelism.
Figure 17.21 Result parallelism in decision tree. Horizontal data partitioning: processor 1 holds records 1 to 8 and processor 2 holds records 9 to 15. Level 1 (root node): (a) count the target classes on each partition; (b) perform intranode parallelism, the same as for data parallelism, to share the target class counts, exchange the entropy values to determine the splitting attribute, and distribute the selected record numbers to the processors for the next phase. Level 2: processor 1 processes records 3, 6, 8, 9, 10, and 15 (the Midday arc) and processor 2 processes records 1, 2, 5, and 14 (the Sunset arc), each followed by the same global information sharing stage; the resulting level-2 decision trees split on Weather. Level 3: processor 1 processes records 6, 9, and 10, after which the final decision tree is obtained.
Data parallelism in clustering is based on data partitioning, whereby each processor builds local clusters based on its data partition, whereas result parallelism in clustering is based on allocating different final clusters to different processors to construct them.
Data parallelism in a decision tree is based on vertical data partitioning, as opposed to the horizontal data partitioning commonly used by other data parallelism models (e.g., data parallelism of association rules, data parallelism of clustering, etc.). Vertical data partitioning in a decision tree is necessary so that each processor may focus on different feature attributes of the training data set. Result parallelism in a decision tree is based on tree partitioning. This resembles the parallel index partitioning explained in Chapter 7. Both data parallelism and result
parallelism for a decision tree have a concept similar to the subqueries execution scheduling explained in Chapter 9 on parallel query optimization.
All parallelism methods for the various data mining techniques show some similarities with those of query processing, index partitioning, and query optimization. All of these parallelism methods are designed for data-intensive applications, including database query processing, data warehousing, and OLAP, as well as data mining.
17.5 Bibliographical Notes
Foti et al. (2000) presented parallel clustering for multicomputers. Recent work on parallel clustering includes that of Qiang et al. (2005), who proposed a window-based incremental parallel clustering method, and Fiolet and Toursel (2005), who also described progressive clustering, but for the Grid. Kim et al. (WAIM 2006) also focused on clustering algorithms for the Grid.
17.6 Exercises
17.1 One of the main differences between clustering and classification is that in classification each class or category is predefined, whereas in clustering the label of each cluster is not predefined. Elaborate on this concept with an example.
17.2 One of the main differences between clustering and decision trees is that in decision trees a record that falls into a certain class or category is identifiable through its features or attributes, whereas in clustering records are grouped within a cluster because they are "similar" to each other, without necessarily knowing what their common properties are. Elaborate on this concept with an example.
17.3 Clustering exercises:
a. Given a data set D = {55, 30, 68, 39, 1, 4, 49, 90, 34, 76, 82, 56, 31, 25, 78, 56, 38, 32, 88, 9, 44, 98, 11, 70, 66, 89, 99, 22, 23, 26}, use the k-means serial algorithm to cluster the data into three clusters.
b. Now choose a different set of centroid values, and perform the k-means clustering again. Analyze whether the clusters are different as a result of choosing different centroid values.
c. Use the k-means serial algorithm to cluster the data above into four clusters. Observe the clusters' composition and how they differ should there be only three clusters.
d. Use the k-means data parallelism algorithm to cluster the data into three clusters using three processors.
e. Now use the k-means result parallelism algorithm to cluster the data into three clusters using three processors.
17.4 Classification exercises:
Data set: Rec#, Employment, Marital, Gender, Age, Approved (Target Class)
a. Using this data set, show a walk-through of how a decision tree is built with a serial decision tree algorithm.
b. Assuming that there are three available processors, demonstrate with a walk-through how a decision tree is built with a data parallelism decision tree algorithm.
c. Now use a result parallelism decision tree algorithm to build the decision tree.