KNOWLEDGE-ORIENTED APPLICATIONS IN DATA MINING


KNOWLEDGE-ORIENTED APPLICATIONS IN DATA MINING

Edited by Kimito Funatsu and Kiyoshi Hasegawa


Published by InTech

Janeza Trdine 9, 51000 Rijeka, Croatia

All chapters are Open Access articles distributed under the Creative Commons Non Commercial Share Alike Attribution 3.0 license, which permits to copy, distribute, transmit, and adapt the work in any medium, so long as the original work is properly cited. After this work has been published by InTech, authors have the right to republish it, in whole or part, in any publication of which they are the author, and to make other personal use of the work. Any republication, referencing or personal use of the work must explicitly identify the original source.

Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published articles. The publisher assumes no responsibility for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained in the book.

Publishing Process Manager Ana Nikolic

Technical Editor Teodora Smiljanic

Cover Designer Martina Sirotic

Image Copyright agsandrew, 2010. Used under license from Shutterstock.com

First published January, 2011

Printed in India

A free online edition of this book is available at www.intechopen.com

Additional hard copies can be obtained from orders@intechweb.org

Knowledge-Oriented Applications in Data Mining, Edited by Kimito Funatsu

and Kiyoshi Hasegawa

p. cm.

ISBN 978-953-307-154-1


Data Mining Classification Techniques for Human Talent Forecasting
Hamidah Jantan, Abdul Razak Hamdan and Zulaiha Ali Othman

New Implementations of Data Mining in a Plethora of Human Activities
Alberto Ochoa, Julio Ponce, Francisco Ornelas, Rubén Jaramillo, Ramón Zataraín, María Barrón, Claudia Gómez, José Martínez and Arturo Elias

Data Mining Techniques for Explaining Social Events
Krivec Jana and Gams Matjaž

Mining Enrolment Data Using Predictive and Descriptive Approaches
Fadzilah Siraj and Mansour Ali Abdoulha

Online Insurance Consumer Targeting and Lifetime Value Evaluation - A Mathematics and Data Mining Approach
Yuanya Li, Gail Cook and Oliver Wreford

Data Mining Using RFM Analysis
Derya Birant

Seasonal Climate Prediction for the Australian Sugar Industry Using Data Mining Techniques
Lachlan McKinna and Yvette Everingham

Monthly River Flow Forecasting by Data Mining Process
Özlem Terzi

Monitoring of Water Quality Using Remote Sensing Data Mining
Xing-Ping Wen and Xiao-Feng Yang

Applications of Data Mining to Diagnosis and Control of Manufacturing Processes
Marcin Perzyk, Robert Biernacki, Andrzej Kochanski, Jacek Kozlowski and Artur Soroczynski

Atom Coloring for Chemical Interpretation and De Novo Design for Molecular Design
Kiyoshi Hasegawa, Keiya Migita and Kimito Funatsu

Hyperspectral Data Analysis and Visualisation
Maarten A. Hogervorst and Piet B.W. Schwering

Data Retrieval and Visualization for Setting Research Priorities in Biomedical Research
Hailin Chen and Vincent VanBuren

DNA Microarray Applied to Data Mining of Bradyrhizobium elkanii Genome and Prospection of Active Genes
Jackson Marcondes and Eliana G. M. Lemos

Visual Gene Ontology Based Knowledge Discovery in Functional Genomics
Stefan Götz and Ana Conesa

Data Mining in Neurology
Antonio Candelieri, Giuliano Dolce, Francesco Riganello and Walter G. Sannita

Glucose Prediction in Type 1 and Type 2 Diabetic Patients Using Data Driven Techniques
Eleni I. Georga, Vasilios C. Protopappas and Dimitrios I. Fotiadis

Data Mining Based Establishment and Evaluation of Porcine Model for Syndrome in Traditional Chinese Medicine in the Context of Unstable Angina (Myocardial Ischemia)
Huihui Zhao, Jianxin Chen, Qi Shi and Wei Wang

Results of Data Mining Technique Applied to a Home Enteral Nutrition Database
Maria Eliana M. Shieferdecker, Carlos Henrique Kuretzki, José Simão de Paula Pinto, Antônio Carlos Ligoki Campos and Osvaldo Malafaia

Data Mining in Personalized Speech Disorder Therapy Optimisation
Danubianu Mirela, Tobolcea Iolanda and Stefan Gheorghe Pentiuc

Data Mining Method for Energy System Applications
Reşat Selbaş, Arzu Şencan and Ecir U. Küçüksille

Regression
Mohsen Hajsalehi Sichani and Saeed Khalafinejad

Data Mining: Machine Learning and Statistical Techniques
Alfonso Palmer, Rafael Jiménez and Elena Gervilla

Dynamic Data Mining: Synergy of Bio-Inspired Clustering Methods
Elena N. Benderskaya and Sofya V. Zhukova

Exploiting Inter-Sample Information and Exploring Visualization in Data Mining: from Bioinformatics to Anthropology and Aesthetics Disciplines
Kuan-ming Lin and Jung-Hua Liu

Data Mining Industrial Applications
Waldemar Wójcik and Konrad Gromaszek


Data mining, a branch of computer science and artificial intelligence, is the process of extracting patterns from data. Data mining is seen as an increasingly important tool to transform a huge amount of data into a knowledge form giving an informational advantage. Reflecting this conceptualization, people consider data mining to be just one step in a larger process known as knowledge discovery in databases (KDD). Data mining is currently used in a wide range of practices from business to scientific discovery.

The progress of data mining technology and its large public popularity establish a need for a comprehensive text on the subject. The series of books entitled ‘Data Mining’ addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications.

The first book (New Fundamental Technologies in Data Mining) is organized into two parts. The first part presents database management systems (DBMS). Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns. For this purpose, some unique DBMS have been developed over past decades. They consist of software that operates databases, providing storage, access, security, backup and other facilities. DBMS can be categorized according to the database model that they support, such as relational or XML; the types of computer they support, such as a server cluster or a mobile phone; the query languages that access the database, such as SQL or XQuery; performance trade-offs, such as maximum scale or maximum speed; or others.

The second part explains new data analysis techniques. Data mining involves the use of sophisticated data analysis techniques to discover relationships in large data sets. In general, these techniques involve four classes of tasks: (1) Clustering is the task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data. Data visualization tools are applied after clustering operations. (2) Classification is the task of generalizing known structure to apply to new data. (3) Regression attempts to find a function which models the data with the least error. (4) Association rule learning searches for relationships between variables.


The second book (Knowledge-Oriented Applications in Data Mining) introduces several scientific applications using data mining. Data mining is used for a variety of purposes in both private and public sectors. Industries such as banking, insurance, medicine, and retailing use data mining to reduce costs, enhance research, and increase sales. For example, pharmaceutical companies use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. In the public sector, data mining applications were initially used as a means to detect fraud and waste, but they have grown also to be used for purposes such as measuring and improving program performance. It has been reported that data mining has helped the federal government recover millions of dollars in fraudulent Medicare payments.

In data mining, there are implementation and oversight issues that can influence the success of an application. One issue is data quality, which refers to the accuracy and completeness of the data. The second issue is the interoperability of the data mining techniques and databases being used by different people. The third issue is mission creep, or the use of data for purposes other than those for which the data were originally collected. The fourth issue is privacy. Questions that may be considered include the degree to which government agencies should use and mix commercial data with government data, and whether data sources are being used for purposes other than those for which they were originally designed.

In addition to helping readers understand each part deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.

January, 2011


Data Mining Classification Techniques for Human Talent Forecasting

1Faculty of Computer and Mathematical Sciences UiTM,

Terengganu, 23000 Dungun, Terengganu,

2Faculty of Information Science and Technology UKM,

in society as a whole. This technique is currently receiving great attention in data analysis and has been recognized as a newly emerging analysis tool (Osei-Bryson, 2010; Park, 2001; Sinha, 2008; Tso & Yau, 2007; Wan, 2009; Zanakis, 2005; Zhuang et al., 2009). The major tasks in data mining include classification and prediction, concept description, rule association, cluster analysis, outlier analysis, trend and evaluation analysis, statistical analysis and others. Classification and prediction are among the most popular of these tasks and are widely used in many areas, especially for trend analysis and future planning. Classification is a supervised learning technique, in which the class label or prediction target is already known. The classification process therefore constructs a classification model that is represented through rule structures. The constructed model represents valuable knowledge and can be used for future planning.

Many areas have adopted this approach to solve their problems, such as finance, medicine, marketing, stock trading, telecommunication, manufacturing, health care and customer relationship management. However, data mining applications have not attracted much attention from people in the Human Resource (HR) field (Chien & Chen, 2008; Ranjan, 2008). Besides that, our previous study found that most prediction applications are used to predict stock, demand, rate, risk, events and so on, while studies on human prediction are quite limited. In addition, prediction applications are mainly developed in business and industrial fields, and few studies involve human talent in an organization (Jantan et al., 2009). HR data can provide a rich resource for knowledge discovery and for decision support system development.

Organizations today have to compete effectively in terms of cost, quality, service or innovation. All these depend on having enough of the right people with the right skills, employed in the appropriate locations at the appropriate point in time. One of the challenges for HR professionals is managing an organization's talent, known as talent management. Talent management involves many managerial decisions, and these decisions are very uncertain and difficult. Besides that, these decisions depend on various factors such as human experience, knowledge, preference and judgment. Identifying the existing talent in an organization is among the top talent management challenges and is an important issue (A TP Track Research Report 2005). In addition, talent management is defined as an outcome to ensure the right person is in the right job (Cubbingham, 2007). Talent in an organization is evaluated based on the position that a person holds, and the position reflects the ability that the person has. For those reasons, this study attempts to use classification techniques in data mining to handle the issue of talent forecasting.

In this study, academic talent data from a higher learning institution has been chosen as the dataset to represent human talent. The purpose of this article is to suggest potential classification techniques for human talent forecasting through experiments using selected classification algorithms.

This chapter is organized as follows. The second section describes the related work on classification and prediction in data mining, research on data mining in HR, especially for talent management, and human talent forecasting using data mining techniques. The third section discusses the experiment setup in this study. Next, the fourth section shows the experiment results and discussion. Then, section five suggests some related future work. Finally, the chapter ends at Section 6 with the concluding remarks.

2 Related work

2.1 Classification and prediction in data mining

Data mining tasks are generally categorized as clustering, association, classification and prediction (Chien & Chen, 2008; Ranjan, 2008). Over the years, data mining has evolved various techniques to perform these tasks, including database oriented techniques, statistics, machine learning, pattern recognition, neural networks, rough sets and others. Databases and data warehouses are rich with hidden information that can be used to support intelligent decision making. An intelligent decision refers to the ability to make an automated decision that is quite similar to a human decision. Classification and prediction in machine learning are among the techniques that can produce intelligent decisions. At this time, many classification and prediction techniques have been proposed by researchers in machine learning, pattern recognition and statistics.

Classification and prediction in data mining are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends (Han & Kamber, 2006). The classification process has two phases. The first phase is the learning process, in which the training data are analyzed by the classification algorithm. The learned model or classifier is represented in the form of classification rules. The second phase is the classification process, in which the test data are used to estimate the accuracy of the classification model or classifier. If the accuracy is considered acceptable, the rules can be applied to the classification of new data (Fig. 1).
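The two phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-attribute rule learner, the `publications` attribute, the toy records and the 0.7 acceptance threshold are all hypothetical and are not taken from the chapter's experiments.

```python
from collections import Counter, defaultdict

def learn_one_rule(rows, labels, attribute):
    """Phase 1 (learning): map each value of one attribute to its majority class."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[row[attribute]][label] += 1
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return rules, default

def classify(model, row, attribute):
    """Apply the learned rules to a single record."""
    rules, default = model
    return rules.get(row[attribute], default)

def accuracy(model, rows, labels, attribute):
    """Phase 2 (classification): share of test records classified correctly."""
    hits = sum(classify(model, r, attribute) == y for r, y in zip(rows, labels))
    return hits / len(rows)

# Hypothetical training and test records (not real HR data).
train_rows = [{"publications": v} for v in ["high", "high", "low", "low", "low", "high"]]
train_labels = ["professor", "professor", "lecturer", "lecturer", "lecturer", "lecturer"]
test_rows = [{"publications": v} for v in ["high", "low", "low", "high"]]
test_labels = ["professor", "lecturer", "lecturer", "lecturer"]

model = learn_one_rule(train_rows, train_labels, "publications")
test_accuracy = accuracy(model, test_rows, test_labels, "publications")

ACCEPTABLE_ACCURACY = 0.7  # assumed threshold; the chapter does not fix one
if test_accuracy >= ACCEPTABLE_ACCURACY:
    # Only an accepted model is applied to new, unlabeled data.
    new_prediction = classify(model, {"publications": "high"}, "publications")
```

The learned model here is literally a set of rules (`high` maps to `professor`, `low` maps to `lecturer`), which mirrors the rule-based representation mentioned in the text, although real classifiers such as C4.5 learn rules over many attributes.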

Techniques used for data classification include decision trees, Bayesian methods, Bayesian networks, rule-based algorithms, neural networks, support vector machines, association rule mining, k-nearest-neighbor, case-based reasoning, genetic algorithms, rough sets, and fuzzy logic. In this study, we attempt to use three main classification techniques, i.e. decision tree, neural network and k-nearest-neighbor. Decision trees and neural networks have been found useful in developing predictive models in many fields (Tso & Yau, 2007). The advantage of the decision tree technique is that it does not require any domain knowledge or parameter setting, and it is appropriate for exploratory knowledge discovery. The second technique is the neural network, which has a high tolerance of noisy data as well as the ability to classify patterns on which it has not been trained. It can be used when we have little knowledge of the relationship between attributes and classes. Next, the k-nearest-neighbor technique is an instance-based learner that uses a distance metric to measure the similarity of instances. All three classification techniques have their own advantages and disadvantages; for that reason, this study endeavors to explore these classification techniques for human talent data. Besides that, data mining techniques have been applied in many fields, but their application in HR is very rare (Chien & Chen, 2008).
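As an illustration of instance-based learning with a distance metric, the sketch below implements a plain k-nearest-neighbor vote with Euclidean distance. It is not the K-Star algorithm used later in the experiments, and the two features (years of service, evaluation mark) and all values are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (years of service, evaluation mark).
points = [(2, 70), (3, 72), (4, 75), (20, 90), (22, 88), (25, 92)]
labels = ["lecturer", "lecturer", "lecturer", "professor", "professor", "professor"]

senior_guess = knn_predict(points, labels, (21, 89))
junior_guess = knn_predict(points, labels, (3, 71))
```

No model is built in advance: all the work happens at prediction time, which is what "instance-based" means. In practice the features would usually be rescaled first so that one attribute does not dominate the distance.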

Fig. 1. Classification and Prediction in Data Mining

Recently, some research has shown great interest in solving HR problems using a data mining approach (Ranjan, 2008). Table 1 lists some of the tasks in human resource that use data mining techniques, and it shows that studies on data mining in the human resource domain are quite limited. In addition, until now there has been little discussion of talent management tasks, such as talent forecasting, career planning and talent recruitment, that use a data mining approach. In HR, the data mining techniques used focus on personnel selection, especially on choosing the right candidates for a job. Classification and prediction in data mining for HR problems are infrequent; examples include predicting the length of service, sales premium and persistence indices of insurance agents, and analyzing mis-operation behaviors of operators (Chien & Chen, 2008). For these reasons, this study attempts to use data mining classification techniques to forecast potential employees, a substantial talent management task, using knowledge from past experience.

HR activity | Data mining technique
Personnel selection | Decision tree (Chien & Chen, 2008); Fuzzy Logic and Data Mining (Tai & Hsu, 2005); Rough Set Theory (Chien & Chen, 2007)
Training | Association rule mining (Chen et al., 2007)
Employee Development | Fuzzy Data Mining and Fuzzy Artificial Neural Network (Huang et al., 2006); Decision Tree (Tung et al., 2005)
Performance Evaluation | Potential to use Decision Tree (Zhao, 2008)

Table 1. Data Mining Techniques in HRM

2.2 Talent management and data mining

In any organization, talent management has become an increasingly crucial approach in HR functions. Talent is considered as the capability of any individual to make a significant difference to the current and future performance of the organization (Lynne, 2005). In fact, managing talent involves human resource planning that emphasizes processes for managing people in the organization. Besides that, talent management can be defined as a process to ensure leadership continuity in key positions and encourage individual advancement, and as decisions to manage the supply, demand and flow of talent through the human capital engine (Cubbingham, 2007). Talent management is very crucial and needs attention from HR professionals. The TP Track Research Report found that the top current and future talent management challenges include developing existing talent; forecasting talent needs; attracting and retaining the right leadership talent; engaging talent; identifying existing talent; attracting and retaining the right leadership and key contributors; deploying existing talent; lack of leadership capability at senior levels; and ensuring a diverse talent pool (A TP Track Research Report 2005). The talent management process consists of recognizing the key talent areas in the organization, identifying the people in the organization who constitute key talent, and conducting development activities for the talent pool to retain and engage them and to have them ready to move into more significant roles (Cubbingham, 2007) (Fig. 2). These processes involve HR activities that need to be integrated into an effective system (CHINA UPDATE, 2007) (Fig. 2).

In this study, we focus on one of the talent management challenges, i.e. identifying the existing key talent in an organization by predicting performance from previous employee performance records in databases. In this case, we apply a classification technique in data mining to past employee data related to their talent.

Fig. 2. Data mining and Talent Management

2.3 Human talent forecasting

Recently, with new demands and increased visibility, HR seeks a more strategic role by turning to data mining methods (Ranjan, 2008). This can be done by discovering patterns in the existing data in HR databases and treating them as useful knowledge. Thus, this study concentrates on identifying the patterns that relate to human talent. The patterns can be generated by using some of the major data mining techniques. Clustering can be used to list employees with similar characteristics, to group performances, and so on. With the association technique, the discovered patterns can be used to associate an employee's profile with the most appropriate program or job, to associate an employee's attitude with performance, and so on. In prediction and classification tasks, the discovered patterns can be used to predict the percentage accuracy of an employee's performance, behavior and attitudes, to predict performance progress throughout the performance period, and to identify the best profile for different employees (Fig. 3). The match between data mining problems and talent management needs is very crucial. Therefore, it is very important to determine the suitable data mining techniques for talent management problems.

Fig. 3. Data mining Tasks for Talent Management

3 Experiment setup

This experiment attempts to propose a potential data mining classifier for human talent data. The proposed classifier can be used to generate talent performance classification patterns from employee performance databases. Subsequently, the generated classification patterns can be employed in a decision support tool for human talent prediction. The basic process for classification and prediction in data mining has been discussed in the related work (Fig. 1). The experiment setup in this study has several tasks, namely simulated data construction, outlier placing, attribute reduction and determination of model accuracy, as shown in Fig. 4. Because real data are difficult to obtain from HR departments owing to confidentiality and security issues, this exploratory study simulates two human talent datasets using the dataset rule generator shown in Table 2. The first dataset contains one hundred records (dataset1) and the second contains a thousand performance records (dataset2), both based on human talent performance factors. In many cases, simulated or synthetic data is ideal data and can produce a good data mining model. To handle that issue, this study applies an outlier placing task to dataset1, and the resulting dataset is known as dataset3.

Fig. 4. Experiment Setup

In this experiment, the selected classification techniques are based on the common techniques used for classification and prediction in data mining. As mentioned earlier in the related work, the chosen techniques are the neural network, which is quite popular in the data mining community and is used as a pattern classification technique (Witten & Frank, 2005); the decision tree, a ‘divide-and-conquer’ approach that builds a classifier from a set of independent instances; and the nearest neighbor, which classifies instances based on a distance metric. Table 3 summarizes the selected classification techniques. In this study, we attempt to use C4.5 and Random Forest for the decision tree category; Multilayer Perceptron (MLP) and Radial Basis Function Network (RBFN) for the neural network category; and K-Star for the nearest neighbor category.

Background/Demographic (D1-D8) (a1-a8); class level: D4/a4

D1 = RANDBETWEEN(1950, 1983)
D2 = RANDBETWEEN(1, 4)
D3 = RANDBETWEEN(0, 1)
D4 = RANDBETWEEN(1, 4)
D5 = RANDBETWEEN(1975, 2008) and G2 = IF(D5 - D1 < 25 THEN D1 + 25 ELSE D5)
I2 = G2 + RANDBETWEEN(5, 10); D6 = IF(I2 > 2008 THEN 0 ELSE I2)
K2 = G2 + RANDBETWEEN(6, 15); D7 = IF(K2 > 2008 THEN 0 ELSE K2)
M2 = G2 + RANDBETWEEN(10, 30); D8 = IF(M2 > 2008 THEN 0 ELSE M2)

Previous performance evaluation
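The background rules of the rule generator can be transcribed into Python roughly as follows. This is a sketch of one reading of the rules, not the authors' generator: `random.randint` is used as the equivalent of RANDBETWEEN (both are inclusive at each end), the function name and the fixed seed are hypothetical, and G2 is kept as a separate field because the fragment does not say whether it overwrites D5.

```python
import random

LAST_YEAR = 2008  # upper bound used by the rules in the table

def capped(year):
    """Mirror IF(x > 2008 THEN 0 ELSE x): 0 marks an event that has not happened."""
    return 0 if year > LAST_YEAR else year

def generate_background(rng):
    """Generate one simulated background record from the D1-D8 rules."""
    d1 = rng.randint(1950, 1983)
    d2 = rng.randint(1, 4)
    d3 = rng.randint(0, 1)
    d4 = rng.randint(1, 4)            # class attribute
    d5 = rng.randint(1975, 2008)
    # G2 keeps D5 unless it falls less than 25 years after D1.
    g2 = d1 + 25 if d5 - d1 < 25 else d5
    d6 = capped(g2 + rng.randint(5, 10))
    d7 = capped(g2 + rng.randint(6, 15))
    d8 = capped(g2 + rng.randint(10, 30))
    return {"D1": d1, "D2": d2, "D3": d3, "D4": d4,
            "D5": d5, "G2": g2, "D6": d6, "D7": d7, "D8": d8}

rng = random.Random(0)  # fixed seed so the simulation is repeatable
dataset1_background = [generate_background(rng) for _ in range(100)]
```

Because D1 never exceeds 1983, G2 can never exceed 2008, so only the three derived years D6 to D8 are ever replaced by 0.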

Data Mining Technique | Classification Algorithm

Decision Tree
• C4.5 (Decision tree induction: the target is nominal and the inputs may be nominal or interval. Sometimes the size of the induced trees is significantly reduced when a different pruning strategy is adopted)
• Random forest (Chooses a test based on a given number of random features at each node, performing no pruning. Random forest constructs a forest by bagging ensembles of random trees)

Neural Network
• Multilayer Perceptron (An accurate predictor for the underlying classification problem. Given a fixed network structure, we must determine appropriate weights for the connections in the network)
• Radial Basis Function Network (Another popular type of feed forward network, which has two layers, not counting the input layer, and differs from a multilayer perceptron in the way that the hidden units perform computations)

Nearest Neighbor
• K-Star (An instance-based learner using a distance metric to measure the similarity of instances, with a generalized distance function based on transformation)

Table 3. Selected Classification Algorithms
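To make the decision tree entry more concrete: C4.5 selects the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes this for a nominal attribute only; the function names and toy values are hypothetical, and numeric attributes, missing values and pruning are left out.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting the class labels on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes attributes with many values
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["x", "x", "y", "y"]
perfect = gain_ratio(["a", "a", "b", "b"], labels)   # separates the classes
useless = gain_ratio(["a", "b", "a", "b"], labels)   # tells nothing about the class
constant = gain_ratio(["a", "a", "a", "a"], labels)  # cannot split at all
```

At each node the tree builder would evaluate every candidate attribute this way and split on the one with the highest ratio, then recurse on each branch.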

The human talent factor in this case study is academic talent in a higher learning institution. The academic talent factors are extracted from common evaluation practice, performance evaluation documents and expert experience. Besides the human performance factors, the talent background and management skill are also considered in the process of identifying potential talent. In this experiment, the training dataset contains 53 related attributes from the five performance factors shown in Table 4.

The target class for the dataset is the academic position (D4), which takes the values professor, associate professor, senior lecturer and lecturer. The classifiers are trained and tested using 10-fold cross validation. In this experiment, the data mining tools used are WEKA and the ROSETTA toolkit. The experiment has two phases. The first phase identifies the possible techniques by running the selected classifier algorithms on the full set of attributes. In this case, we use all the attributes defined earlier for the full dataset.
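The 10-fold cross validation procedure can be sketched as below. This is a generic, unstratified illustration rather than the procedure of any particular toolkit; the helper names, the majority-class baseline and the 70/30 toy labels are all hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the record indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(rows, labels, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_rows = [rows[i] for i in range(len(rows)) if i not in held_out]
        train_labels = [labels[i] for i in range(len(rows)) if i not in held_out]
        model = fit(train_rows, train_labels)
        correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)

# A deliberately trivial classifier: always predict the most common class.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

rows = [{"id": i} for i in range(100)]          # hypothetical records
labels = ["lecturer"] * 70 + ["professor"] * 30
baseline_accuracy = cross_val_accuracy(rows, labels, fit_majority, predict_majority)
```

Every record is used for testing exactly once, so the pooled figure is the percentage of correctly classified test samples. A majority-class baseline like this one is a useful floor against which the accuracy of a real classifier can be judged.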

Besides that, this experiment concentrates on the accuracy of the selected classifiers in order to identify a potential classifier algorithm for the datasets. The accuracy of a classifier is the percentage of test set samples that are correctly classified. The second phase of the experiment compares the accuracy of the classifiers under attribute reduction. In this case, the Boolean reasoning technique is used to select the most relevant or important attributes from the dataset. The attribute reduction phase is divided into two stages. The first stage is attribute reduction using the shortest length attributes, an approach used by many researchers in the attribute reduction process. The aim of this process is to determine the important attributes for the dataset, which yields the attribute reduction dataset (AR). The second stage uses the combination of important attributes, which yields the importance attributes dataset (IA). In this case, we attempt to study the accuracy of the classifier using all importance attributes. Finally, the experiment results for each phase are evaluated using a statistical significance test in order to determine the most significant classifier for each dataset, which is then considered the potential classifier for human talent data.

Factor and Attributes | Variable Name | Meaning
Background | | Age, Race, Gender, Year of service, Year of Promotion 1, Year of Promotion 2, Year of Promotion 3
| | Performance evaluation marks for 15 years
Knowledge and skill (20) | PQA, PQC1, PQC2, PQC3, PQD1, PQD2, PQD3, PQE1, PQE2, PQE3, PQE4, PQE5, PQF1, PQF2, PQG1, PQG2, PQH1, PQH2, PQH3, PQH4 | Professional qualification (Teaching, supervising, research, publication and conferences)
Management skill | |
Individual Quality | |

Table 4. Factors and Attributes for Academic Talent

4 Result and discussion

In this experiment, the accuracy of the classification techniques is based on the selected classifier algorithms. In the first phase, the accuracy of each classifier algorithm with full attributes for the three datasets is shown in Table 5. With full attributes, C4.5 gives the highest model accuracy (95.14%, 99.90% and 90.54%), which can be considered an indicator of a potential classification algorithm for human talent data (Fig. 5).

Classification Algorithm Dataset1 Dataset2 Dataset3

Table 5. Accuracy of Model for Full Attributes

Fig. 5. Accuracy of Model for Full Attributes (Dataset1, Dataset2 and Dataset3)

The results for full attributes show that the more data used in the training process (dataset2), the higher the accuracy of the model that can be developed. Besides that, the accuracy for dataset3, which contains outliers, is slightly lower for all classifiers; this demonstrates the effect of outliers in a dataset on the accuracy of the model. The second phase of the experiment is a relevance analysis that determines the accuracy of the selected classification techniques on datasets with attribute reduction. In this phase, we focus on dataset1 and dataset2. The purpose of the attribute reduction process is to select the most relevant attributes in the dataset. The reduction process is implemented using the Boolean reasoning technique. Through attribute reduction, we can decrease the preprocessing and processing time and space. Table 6 shows the relevance analysis results for attribute reduction: five (5) attributes are selected, and all of them are from the background factor. The second phase of the experiment is carried out using these reduced attributes. Its aim is to find out the accuracy of the classification techniques with attribute reduction, using the shortest length attributes and the combination of the important attributes after the reduction process.

Year of service, Year of Promotion 1, Year of Promotion 2, Year of Promotion 3

Table 6. Important Attributes from Attribute Reduction

Table 7 shows the accuracy of the classification algorithms with attribute reduction using the shortest length method (AR dataset). The C4.5 classifier has the highest accuracy in the first stage of the second phase (Table 7), but its accuracy is lower at this stage than with full attributes.

Classification Algorithm Dataset1 Dataset2

Table 7. Accuracy of Model for Attribute Reduction

In this experiment, the result indicates that the number of attributes used in the dataset affects the accuracy of the classifier. Consequently, this result illustrates that most of the attributes in the dataset are important and should be considered. However, with the combination of attributes from the reduction process (IA dataset) in the second stage of the experiment, the accuracy of the classifier is higher than with the shortest length attributes (AR dataset).

Table 8 shows the accuracy of the classifiers with importance attributes for dataset1 and dataset2. The C4.5 classifier has the highest accuracy for both datasets at this stage of the experiment. Fig. 6 shows the accuracy of the models for the AR datasets and IA datasets in the second phase of the experiment.

Fig. 6. The Accuracy of Model for Attribute Reduction and Importance Attributes

Subsequently, to propose a potential classifier for human talent data, a statistical significance test is conducted using t-test evaluation. In the paired t-test shown in Table 9, the positive mean differences in accuracy show that C4.5 has the highest positive mean and is significantly better than the other classifiers. For the accuracy criterion, C4.5 is significantly better than Random Forest and MLP, with a p-value < 0.05. In addition, a decision tree can produce a model that represents interpretable rules or logic statements, can be built without complicated computations, and can be used for both continuous and categorical variables. This technique is more suitable for predicting categorical outcomes and less appropriate for time series data (Tso & Yau, 2007). Besides that, decision tree classifiers are quite popular because constructing the tree does not require any domain knowledge or parameter setting, and the technique is therefore appropriate for exploratory knowledge discovery.
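The paired t-test used for this comparison can be sketched as follows. The accuracy values are hypothetical percentages over ten runs and are not the values behind Table 9; the critical value 2.262 is the standard two-sided 5% value of the t distribution with 9 degrees of freedom, and the function name is an assumption made for the illustration.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)          # sample standard deviation
    t = mean_diff / (std_diff / math.sqrt(n))
    return mean_diff, t, n - 1

# Hypothetical accuracies (%) of two classifiers on the same ten folds.
c45_acc = [95, 93, 96, 94, 95, 97, 94, 96, 95, 93]
other_acc = [90, 91, 92, 90, 93, 92, 91, 93, 92, 90]

mean_diff, t_stat, dof = paired_t_statistic(c45_acc, other_acc)
T_CRITICAL_9DF_5PCT = 2.262                      # two-sided, alpha = 0.05
significant = abs(t_stat) > T_CRITICAL_9DF_5PCT
print(mean_diff, round(t_stat, 2), dof, significant)
```

A positive mean difference with a t statistic above the critical value corresponds to the "significantly better" reading described above.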

Table 9 Pair T-Test Result on Accuracy of Model for C4.5

In these experiments, we observe great potential for using the C4.5 classification algorithm in the next stage of the data mining process, i.e. prediction using the constructed classification model. These results also show the suitability of the C4.5 classifier for the human talent datasets.

5 Future works

In this study, because human talent data are difficult to obtain, we had to simulate the data for exploratory purposes and set up the classification experiment using the simulated data. Consequently, the knowledge discovered and the classification model constructed by the proposed classifier cannot be used to represent real problems. In future work, a similar experimental setup can be applied to real data in order to use the classification model constructed by the proposed classifier. Besides that, other data mining techniques such as Support Vector Machine (SVM), fuzzy logic and Artificial Immune System (AIS) should also be considered for future work on classification using the same dataset.

In some cases, attribute relevancy also becomes a factor in the accuracy of the classification algorithm. In the next experiment, other attribute reduction techniques should be applied in order to confirm whether the number of attributes affects the accuracy of the classifier. In addition, although the C4.5 classifier has the highest accuracy in the experiment, the accuracy of other decision tree classifiers also needs to be tested in order to validate these findings.

6 Conclusion

This article has described the significance of using data mining for talent management, especially for classification and prediction. However, more data mining techniques should be applied to the different problem domains in the HR field of research in order to broaden the horizon of academic and practical work on data mining in HR. In addition, the C4.5 classifier is the potential classifier in this experiment. Thus, this technique can be used for real human talent data in the next prediction phase, i.e. classification rule construction. The generated classification rules can be used to predict the potential talent for a specific task in an organization. In HRM, there are several tasks that can be solved using this approach, for example selecting new employees, matching people to jobs, planning career paths, planning training needs for new and senior employees, predicting employee performance and predicting future employees. In conclusion, the ability to continuously change and obtain new understanding about classification and prediction in the HR field has thus become the major contribution to HR data mining.

7 References

A TP Track Research Report (2005). Talent Management: A State of the Art. Towers Perrin HR Services.

Chen, K. K., Chen, M. Y., Wu, H. J., & Lee, Y. L. (2007). Constructing a Web-based Employee Training Expert System with Data Mining Approach. Paper presented at the 9th IEEE International Conference on E-Commerce Technology and the 4th IEEE International Conference on Enterprise Computing, E-Commerce and E-Services (CEC-EEE 2007).

Chien, C. F., & Chen, L. F. (2007). Using Rough Set Theory to Recruit and Retain High-Potential Talents for Semiconductor Manufacturing. IEEE Transactions on Semiconductor Manufacturing, 20(4), 528-541.

Chien, C. F., & Chen, L. F. (2008). Data mining to improve personnel selection and enhance human capital: A case study in high-technology industry. Expert Systems with Applications, 34(1), 280-290.

CHINA UPDATE (2007). HR News for Your Organization: The Towers Perrin Asia Talent Management Study. Retrieved from www.towersperrin.com, 7/1/2008.

Cunningham, I. (2007). Talent Management: Making it real. Development and Learning in Organizations, 21(2), 4-6.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publisher.

Huang, M. J., Tsou, Y. L., & Lee, S. C. (2006). Integrating fuzzy data mining and fuzzy artificial neural networks for discovering implicit knowledge. Knowledge-Based Systems, 19(6), 396-403.

Jantan, H., Hamdan, A. R., & Othman, Z. A. (2008). Data Mining Techniques for Performance Prediction in Human Resource Application. Paper presented at the 1st Seminar on Data Mining and Optimization, Selangor.

Jantan, H., Hamdan, A. R., & Othman, Z. A. (2009, 25-27 February). Knowledge Discovery Techniques for Talent Forecasting in Human Resource Application. Paper presented at the World Academy of Science, Engineering and Technology, Penang, Malaysia.

Lynne, M. (2005). Talent Management Value Imperatives: Strategies for Execution. The Conference Board.

Osei-Bryson, K.-M. (2010). Towards supporting expert evaluation of clustering results using a data mining process model. Information Sciences, 180(3), 414-431.

Park, S. C., Piramuthu, S., & Shaw, M. J. (2001). Dynamic rule refinement in knowledge-based data mining systems. Decision Support Systems, 31(2), 205-222.

Ranjan, J. (2008). Data Mining Techniques for better decisions in Human Resource Management Systems. International Journal of Business Information Systems, 3(5), 464-481.

Sinha, A. P., & Zhao, H. (2008). Incorporating domain knowledge into data mining classifiers: An application in indirect lending. Decision Support Systems, 46(1), 287-299.

Tai, W. S., & Hsu, C. C. (2005). A Realistic Personnel Selection Tool Based on Fuzzy Data Mining Method. Retrieved from www.atlantis-press.com/php/download_papaer?id=46, 9/1/2008.

Tso, G. K. F., & Yau, K. K. W. (2007). Predicting electricity energy consumption: A comparison of regression analysis, decision tree and neural networks. Energy, 32, 1761-1768.

Tung, K. Y., Huang, I. C., Chen, S. L., & Shih, C. T. (2005). Mining the Generation Xer's job attitudes by artificial neural network and decision tree - empirical evidence in Taiwan. Expert Systems with Applications, 29(4), 783-794.

Wan, S., & Lei, T. C. (2009). A knowledge-based decision support system to analyze the debris-flow problems at Chen-Yu-Lan River, Taiwan. Knowledge-Based Systems, 22(8), 580-588.

Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann Publisher.

Zanakis, S. H., & Fernandez, I. B. (2005). Competitiveness of nations: A knowledge discovery examination. European Journal of Operational Research, 166(1), 185-211.

Zhao, X. (2008). An Empirical Study of Data Mining in Performance Evaluation of HRM. Paper presented at the International Symposium on Intelligent Information Technology Application Workshops, Hangzhou, China.

Zhuang, Z. Y., Churilov, L., Burstein, F., & Sikaris, K. (2009). Combining data mining and case-based reasoning for intelligent decision support for pathology ordering by general practitioners. European Journal of Operational Research, 195, 662-675.

New Implementations of Data Mining in a Plethora of Human Activities

1Juarez City University

2UNICAMP Instituto de Computacão

2Brazil

1 Introduction

Societies are growing fast, along with the development and use of technology. As a result, a great deal of information is now available that can be analyzed in search of relevant information for making predictions or decisions. Knowledge Discovery and Data Mining are powerful data analysis tools. The term Data Mining is used to describe the non-trivial extraction of implicit knowledge from data: it is a discovery process over large and complex data sets and refers to extracting knowledge from databases. Data mining is a multidisciplinary field with many techniques. With these techniques you can create a mining model that describes the data you will use (Ponce et al., 2009a).

Typical Data Mining techniques include clustering, association rule mining, classification, and regression

We give an overview of some algorithms that use data mining to solve problems arising from human activities, such as Electrical Power Design, Trash Collector Routes, Fraud in Saving Houses and the Vehicle Routing Problem.

One of the reasons why Data Mining techniques are widely used is the need to transform large amounts of data into useful information and knowledge.

Having a large amount of data without tools that can process it has been described as being rich in data but poor in information (Han & Kamber, 2006). This steady growth of data, which is stored in large databases, has exceeded the ability of human beings to understand it. Moreover, various problems may present a constant stream of data, which makes the information even more difficult to analyze.

1.1 Decision trees to improve electrical power design

A decision tree (DT) is a directed acyclic graph consisting of a node called the root, which has no input arcs, and a set of nodes that each have one input arc. Nodes with output arcs are called internal nodes or test nodes, and those with no output arcs are known as leaf nodes or terminal (decision) nodes (Rokach & Maimon, 2005).

The main objectives pursued by creating a DT (Safavian & Landgrebe, 1991) are:

• Correctly classify the largest number of objects in the training set (TS)

• Generalize, during construction of the tree, the TS to ensure that new objects are classified with the highest percentage of correct answers possible

• If the dataset is dynamic, the structure of the DT should be easy to update

An algorithm for decision tree generation consists of two stages: the first is the induction of the tree and the second is classification. In the first stage, the decision tree is constructed from the training set. Commonly, each internal node of the tree holds a test attribute, and the portion of the training set present at the node is divided according to the values that this attribute can take. Construction starts by generating the root node, choosing a test attribute and partitioning the training set into two or more subsets; a new node is generated for each partition, and so on. When a node contains objects of more than one class, it becomes an internal node; when it contains objects of a single class, it becomes a leaf, which is assigned the class label. In the second stage of the algorithm, each new object is classified by the constructed tree: the tree is traversed from the root to a leaf node, from which the class membership of the object is determined. The path followed through the tree is determined by the decisions made at each internal node, according to its test attribute.

In Pattern Recognition, one of the most studied problems is supervised classification, where it is known that a universe of objects is grouped into a given number of classes, a sample of objects known to belong to each class is available, and the problem is, given a new object, to establish its relationship with each of those classes (Ruiz et al., 1999).
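The two stages described above, induction and classification, can be sketched as follows. This is a minimal illustration that assumes categorical attributes and chooses the test attribute by the lowest weighted entropy (an information-gain criterion in the style of ID3); the text does not fix a selection criterion, and the function names and toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, attrs, target):
    """Pick the attribute whose partition has the lowest weighted entropy."""
    def remainder(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attrs, key=remainder)

def induce(rows, attrs, target):
    """Stage 1: build the tree. A leaf is a class label, an internal node is (attr, branches)."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attrs, target)
    remaining = [a for a in attrs if a != attr]
    branches = {}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        branches[value] = induce(subset, remaining, target)
    return (attr, branches)

def classify(tree, obj):
    """Stage 2: walk from the root to a leaf following the object's attribute values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[obj[attr]]   # raises KeyError for values unseen in training
    return tree

# Hypothetical training set (assumption, for illustration only).
training = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = induce(training, ["outlook", "windy"], "play")
print(classify(tree, {"outlook": "sunny", "windy": "no"}))
```

Nodes holding objects of a single class become leaves, and every other node tests one attribute and receives one branch per observed value, as in the description above.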

Supervised classification algorithms are designed to determine the membership of an object (described by a set of attributes) to one or more classes, based on the information contained in a previously classified set of objects (training set - TS).

Among the algorithms used for solving supervised classification are decision trees. A decision tree is a structure that consists of nodes (internal nodes and leaves) and arcs. Each internal node is characterized by one or more test attributes, and one or more arcs emerge from it. Each arc has an associated value of the test attribute, and these values determine which path to follow when traversing the tree.

Leaf nodes contain the information that determines which class an object belongs to. The main characteristics of a decision tree are: simple construction; no need to predetermine parameters for its construction; the ability to treat multi-class problems in the same way as two-class problems; the ability to be represented by a set of rules; and the easy interpretation of its structure.
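The representation of a tree as a set of rules can be sketched by collecting the attribute tests along each path from the root to a leaf. The nested-tuple tree encoding, the function names and the example tree are assumptions made for this illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, class label) pair per root-to-leaf path."""
    if not isinstance(tree, tuple):          # a leaf holds the class label
        return [(list(conditions), tree)]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

def format_rule(rule):
    """Render a rule as an if-then statement."""
    conditions, label = rule
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN class = {label}"

# Hypothetical tree: internal nodes are (attribute, {value: subtree}) tuples.
tree = ("outlook", {
    "sunny": ("windy", {"no": "yes", "yes": "no"}),
    "rainy": "no",
})
rules = tree_to_rules(tree)
for rule in rules:
    print(format_rule(rule))
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves.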

1.1.1 Classifications of decision trees

There are various classifications of decision trees, for example according to the number of test attributes in their internal nodes there are two types of trees:

• Single-valued: contain only one test attribute in each node. Examples of these algorithms include ID3 (Mitchell, 1997), C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), FACT (Vanichsetakul & Loh, 1988), QUEST (Shis & Loh, 1988), Model Trees (Shou et al., 2005), CTC (Perez et al., 2007), ID5R (Utgoff, 1989), ITI (Utgoff et al., 1997), UFFT (Gama & Medes, 2005), StreamTree (Jin & Agrawal, 2003), FDT (Janikowo, 1998), G-DT (Pedrycz, 2005) and Spider (Wang et al., 2007)

• Multivalued: test a subset of attributes in each node. Examples are PT2 (Utgoff & Brodley, 1990), LMDT (Utgoff & Brodley, 1995), GALE (Llora & Wilson, 2004) and C-DT

According to the type of decision made by the tree, there are two types of trees:

• Fuzzy: give a degree of membership of each class of the data set, for example, C-DT, FDT, G-DT and Spider

• Crisp: assign the object to only one class, so the object either belongs or does not belong to a class. Examples of such algorithms are ID3, C4.5, CART, FACT, QUEST, Model Trees, CTC, LMDT, GALE, ID5R, ITI, UFFT, StreamTree and PT2

The algorithms for generating decision trees can also be classified according to their ability to process dynamic data sets, i.e. sets to which new objects can be added.

According to this there are two types of algorithms for generation of decision trees:

• Incremental: can handle dynamic data sets, obtaining a partial solution as the objects are examined. Examples of such algorithms are ID5R, ITI, UFFT, StreamTree and PT2

• Non-incremental: can only work on static data sets, since the solution requires the dataset in its entirety. Examples include ID3, C4.5, CART, FACT, QUEST, Model Trees, CTC, FDT, G-DT, Spider, LMDT, GALE and C-DT

1.1.2 Decision tree application

To diagnose electric power apparatus, the decision tree method is a highly recommended classification tool because it provides visible if-then rules, which makes it possible to connect the physical phenomena to the observed signals. The most important point in constructing the diagnosis system is to make clear the relations between the faults and the corresponding signals. Such a database system can be built in the laboratory using a model electric power apparatus, and we have built one. The next important step is feature extraction (Llora & Wilson, 2004).

2 Trash collectors routes organized by profiles

Waste is something that we produce as part of everyday living, but we do not normally think much about it. Today many cities generate a waste stream of great complexity, toxicity and volume (see Fig. 1). It includes municipal solid waste, industrial solid waste, hazardous waste and other specialty wastes, such as medical, nuclear, mining and agricultural waste, construction and demolition (C&D) waste, household waste, etc. (OECD, 2008).

In solid waste management, the problem with household waste is the individual decision-making over waste generation and disposal. When people decide how much and what to consume, they do not take into account how much waste they produce.

Therefore garbage collection is a very complex task (even though in most cases we do not perceive it), as it involves not only identifying the routes used by the collection vehicles (which is by itself highly complex, since many factors must be taken into consideration, including the capacity of the vehicles, the amount of waste each container can hold, the type of waste held in each container, the distance between containers, street addresses, etc.), but also determining the best way to carry out the collection (Marquez, 2009).

Currently a major concern in the world is how the waste we produce must be stored, recycled or destroyed (studies indicate that the daily waste production per person is about a kilogram in addition to the waste produced in manufacturing the products we use daily), a process that starts with garbage collection.

Many algorithms and techniques are being used to improve the collection process by creating different routes based on the profiles of those who generate the garbage and on the type of waste. Some of these are Ant Colony Algorithms, Hybrid Genetic Algorithms and Data Mining, among others.

Fig 1 Example of composition by weight of household garbage

3 Fraud analysis in saving houses

Fraud is an illegal activity that has many variants and is almost as old as mankind. The fraudster tries to gain some advantage, usually economic, at the expense of the victim. In the specific case of plastic card fraud there are several variants (Sánchez et al., 2009). The total cost of plastic card fraud is higher than that of fraud in other forms of payment. The first line of defence against fraud is based on preventive measures such as Chip and PIN cards. The next step consists of methods used to identify potential fraud and so minimize potential losses. These methods are called fraud detection systems (FDS), and they use a variety of ways to detect the behavior most likely to be fraudulent.

3.1 Techniques for detection of frauds

There are two major frameworks for detecting fraud through statistical methods. If fraud is conducted in a known way, pattern recognition techniques are typically used, especially supervised classification schemes (Whitrow et al., 2009). On the other hand, if the way in which fraud is conducted is not known, for example when there are new fraudulent behaviors, outlier analysis methods are recommended (Kou et al., 2004). Previous research has established that outlier analysis is one of the best techniques for detecting fraud in general. Some studies show simple anomaly detection techniques for discovering plastic card fraud (Juszczak et al., 2008). However, once patterns for identifying anomalies are established, fraudsters learn them and then change the way they commit fraud. Another problem with this approach is that abnormal behaviors are not always fraudulent, so a successful system must locate the true positive events, that is, transactions that are detected as fraud and really are fraud rather than only appearing to be fraudulent. Time is also a factor, because to reduce losses fraud detection should be done as quickly as possible. In practical applications it is possible to use supervised and unsupervised methods together.

3.1.1 Clustering

Clustering is primarily an unsupervised technique, although semi-supervised clustering has also been studied frequently (Basu et al., 2004). Although clustering and anomaly detection often appear to be fundamentally different from one another, many techniques have been developed to detect anomalies based on clustering. They can be grouped into three categories, which depend on three different assumptions (Chandola et al., 2009):

a. Normal data instances belong to a cluster, while anomalies do not belong to any cluster

b. Normal data instances are close to the cluster centroids, while anomalies are far away from these centroids

c. Normal data instances belong to large, dense clusters, whereas anomalies belong to small and sparse clusters

Each of the above assumptions leads to its own ways of detecting outliers, each with advantages and disadvantages.
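Assumption (b) can be illustrated with a short sketch: an instance is flagged when its distance to the nearest cluster centroid exceeds a threshold. The centroids, the points, the threshold and the function names are hypothetical, and the clusters are assumed to have been computed beforehand by any clustering algorithm.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def flag_anomalies(points, centroids, threshold):
    """Return the points lying farther than `threshold` from every centroid."""
    return [p for p in points
            if nearest_centroid_distance(p, centroids) > threshold]

# Hypothetical centroids and observations (assumption, for illustration only).
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0), (0.1, -0.4)]
anomalies = flag_anomalies(points, centroids, threshold=2.0)
print(anomalies)
```

The choice of threshold controls the trade-off between missed anomalies and false alarms, which is one of the disadvantages of this family of methods.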

3.1.2 Hybrid systems

As in many areas of artificial intelligence, hybridization is a very current trend in detecting abnormalities. The reason is that many developed algorithms do not entirely follow the concepts of a single classical metaheuristic (Lozano et al., 2010); instead, they look for the best combination of metaheuristics (and any other kind of optimization method) that work together to complement each other and produce a profitable synergy, which is called hybridization (Raidl, 2006).

Some possible reasons for the hybridization are (Grosan et al., 2007):

1 Improve the performance of evolutionary algorithms

2 Improve the quality of solutions obtained by evolutionary algorithms

3 Incorporate evolutionary algorithms as part of a larger system

Evolutionary Algorithms (EAs) have been the hybridization technique most frequently used for clustering. However, previous research in this respect has been limited to the single-objective case: criteria based on cluster compactness have been the most commonly employed objectives, as these measures provide smooth incremental guidance in all parts of the search space.

For many years there has been a growing interest in developing and applying EAs in multi-objective optimization (Deb, 2001).

Recent studies on evolutionary algorithms have shown that population-based algorithms are potential candidates for solving multi-objective optimization problems and can be used efficiently to eliminate most of the difficulties of classical single-objective methods, such as the sensitivity to the shape of the Pareto-optimal front and the need for multiple runs to find multiple Pareto-optimal solutions.

In general, the goal of a multi-objective optimization algorithm is not only to guide the search towards the Pareto-optimal front but also to maintain population diversity in the set of the Pareto optimal solutions In this way the following three main goals need to be achieved:

• Maximize the number of elements of the Pareto optimal set found

• Minimize the distance of the Pareto front produced by the algorithm with respect to the true (global) Pareto front (assuming we know its location)

• Maximize the spread of the solutions found, so that we can have a distribution of vectors as smooth and uniform as possible (Dehuri et al., 2009)
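The notion of a Pareto front used in these goals can be made concrete with a small sketch that keeps only the non-dominated solutions. It assumes that every objective is to be minimized; the candidate vectors and function names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical objective vectors, both objectives minimized (assumption).
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(candidates)
print(front)
```

The filter compares every pair of solutions, which is adequate for small populations; evolutionary multi-objective algorithms typically combine such a dominance test with a diversity measure to address the spread goal.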

It therefore seems a good proposal to develop an FDS founded on multi-objective clustering, which places the fraud detection problem in a realistic context.

In the same way, the system is strengthened through hybridization, using PSO to create the clusters and then finding the anomalies using the clustering outlier concept. The FDS runs at the plastic card issuing institution. When a transaction arrives, it is sent to the FDS to be verified. The FDS receives the card details and purchase value and verifies whether the transaction is genuine by calculating the anomalies, based on the expenditure profile of each cardholder, purchasing and billing locations, time of purchase, etc. When the FDS confirms that the transaction is malicious, it activates an alarm and the financial institution declines the transaction. The cardholder concerned is contacted and alerted to the possibility that the card is at risk.

To find information dynamically from the individual transactions of the cardholder, the stored transactions are subjected to a clustering algorithm. In general, transactions are stored in a database of the financial institution and contain too many attributes. Although there are several factors to consider, many proposals work only with the transaction amount, with the idea of reducing the dimensionality of the problem. However, to improve the accuracy of the system it is recommended to use other factors, such as the location and time of the transaction. Thus a purchase amount that exceeds a certain value, a short time between uses of the card, or distant locations for different transactions are facts to consider when activating the alarm. The alarm must therefore be activated with a high level of accuracy. Overall accuracy is simply the percentage of correct predictions of a classifier on a test set of “ground truth”. TP means the rate of predicting “true positives” (the ratio of correctly predicted frauds over all of the true frauds), and FP means the rate of predicting “false positives” (the ratio of incorrectly predicted frauds over those test examples that were not frauds, otherwise known as the “false alarm rate”) (Stolfo et al., 1997).

Two other rates are considered for the results delivered by the FDS: FN means the rate of predicting “false negatives” (the ratio of frauds not predicted over all the true frauds) and TN means the rate of predicting “true negatives” (the ratio of normal transactions detected). Table 1 shows the classification of the results obtained by the FDS after analyzing a transaction. Once the clusters are established, each new transaction is entered and evaluated in the FDS to see whether it belongs to a cluster or lies outside the clusters, in which case the transaction is seen as an anomaly and becomes a candidate to be fraudulent. All this requires calculating anomalies by clustering the transaction information through a multi-objective Pareto front with the support of Particle Swarm Optimization (PSO).

Outcome        Classification
False Alarm    False Positive (FP)
Hit            True Positive (TP)
Normal         True Negative (TN)

Table 1 Classification rate of results

The accuracy of the FDS is represented as the fraction of total transactions (both genuine and fraudulent) that are detected correctly, which can be expressed as follows (Stolfo et al., 2000). Equation 1 shows how to compute the precision.

$$\text{Precision} = \frac{\#\text{ of TN} + \#\text{ of TP}}{\text{total } \#\text{ of transactions}} \qquad (1)$$
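The rates defined above can be computed from the four outcome counts as in the following sketch. The counts are hypothetical, the function name is an assumption, and the TN rate is taken as the share of genuine transactions recognized as normal, by analogy with the other definitions.

```python
def fds_rates(tp, fp, tn, fn):
    """Compute the overall accuracy and the four rates from outcome counts."""
    total = tp + fp + tn + fn
    frauds = tp + fn          # all true frauds
    genuine = fp + tn         # all non-fraudulent transactions
    return {
        "accuracy": (tn + tp) / total,   # fraction of transactions classified correctly
        "tp_rate": tp / frauds,          # frauds correctly predicted
        "fn_rate": fn / frauds,          # frauds not predicted
        "fp_rate": fp / genuine,         # false alarm rate
        "tn_rate": tn / genuine,         # normal transactions recognized (assumption)
    }

# Hypothetical counts for 1000 analyzed transactions.
rates = fds_rates(tp=40, fp=10, tn=940, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With heavily imbalanced data such as card transactions, the overall accuracy can stay high even when many frauds are missed, which is why the TP and FP rates are reported alongside it.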

Fig. 2 shows the full flow of the process proposed for the FDS. As shown in the figure, the FDS is divided into two parts: the first involves the creation of clusters and the second the detection of anomalies.

Transactions outside the clusters are candidates to be considered fraudulent. However, as mentioned above, the accuracy of the system is a factor to be considered, and it should be maximized in order to increase the functionality of the FDS.

Fig 2 Research model

4 Data mining in vehicle routing problem

With the rapid development of the World-Wide Web (WWW) and the increased popularity and ease of use of its tools, the World-Wide Web is becoming the most important medium for collecting, sharing and distributing information. Progress in digital data acquisition and storage technology has resulted in the growth of huge distributed databases. As a result, interest has grown in the possibility of tapping these data and extracting from them information that might be of value to the owner of the database.

The discipline concerned with this task has become known as data mining: the analysis of observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures and recurrent patterns in time series.

These patterns provide knowledge on the application domain that is represented by the document collection. Such a pattern can also be seen as a query, or as implying a query, that, when addressed to the collection, retrieves a set of documents. Thus the data mining tools also identify interesting queries which can be used to browse the collection. The system searches for interesting concept sets and relations between concept sets, using explicit bias for capturing interestingness. A set of concepts (terms, phrases or keywords) directly corresponds to a query that can be placed to the document collection for retrieving those documents that contain all the concepts of the set.

In this work, a new ant-colony algorithm, Adaptive Neighboring-Ant Search (AdaNAS), for the semantic query routing problem (SQRP) in a P2P network is presented. The proposed algorithm incorporates an adaptive control parameter tuning technique for runtime estimation of the time-to-live (TTL) of the ants. AdaNAS uses three strategies that take advantage of the local environment: learning, characterization, and exploration. Two classical learning rules are used to gain experience on past performance, using three new learning functions based on the distance travelled and the resources found by the ants. These strategies are aimed at producing a greater number of results in less time. The time-to-live (TTL) parameter is tuned at runtime through a deterministic rule based on the information acquired by these three local strategies.

4.1 Semantic Query Routing Problem (SQRP)

SQRP is the problem of locating information in a network based on a query formed by keywords. The goal in SQRP is to determine shorter routes from a node that issues a query to those nodes of the network that can appropriately answer the query by providing the requested information. Each query traverses the network, moving from the initiating node to a neighboring node and then to a neighbor of a neighbor and so forth, until it locates the requested resource or gives up in its absence. Due to the complexity of the problem (Amaral, 2004; Lui et al., 2005; Tempich et al., 2004; Wu et al., 2006), solutions proposed for SQRP are typically limited to special cases.

The general strategies of SQRP algorithms are the following. Each node maintains a local database of documents ri called the repository. The search mechanism is based on nodes sending messages to their neighboring nodes to query the contents of their repositories. The queries qi are messages that contain keywords, which the receiving node examines against its repository for possible matches. If this examination produces results for the query, the node responds by creating another message informing the node that launched the query of the resources available in the responding node. If there are no results or too few results, the node that received the query forwards it to one or more of its neighbors. This process is repeated until some predefined stopping criterion is reached. An important observation is that in a P2P network the connection pattern varies across the network (heterogeneous topology); moreover, the connections may change over time, and this may alter the routes available for messages to take. As shown in Figure 1, each node has an associated database of documents ri (repository). These are available to all nodes connected to the network. A node seeks information in the repositories by sending messages to its neighboring nodes.
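The general search mechanism can be sketched as follows. This is a simplified, centrally simulated illustration, not an ant-based algorithm: it assumes that a query is forwarded to every unvisited neighbor, that the stopping criteria are a time-to-live counter and a wanted number of results, and that each document is a pair of a title and a keyword set. All names and data are hypothetical.

```python
from collections import deque

def route_query(network, repositories, start, keyword, ttl, wanted):
    """Forward a keyword query hop by hop until enough hits are found or the TTL runs out."""
    hits = []
    visited = {start}
    frontier = deque([(start, ttl)])
    while frontier and len(hits) < wanted:
        node, remaining = frontier.popleft()
        # Each node examines its own repository for matching documents.
        for title, keywords in repositories.get(node, []):
            if keyword in keywords:
                hits.append((node, title))
        # Forward the query to neighbors while hops remain.
        if remaining > 0:
            for neighbor in network[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append((neighbor, remaining - 1))
    return hits

# Hypothetical P2P topology and repositories (assumption, for illustration only).
network = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
repositories = {
    "B": [("doc-b1", {"ant", "colony"})],
    "D": [("doc-d1", {"ant", "routing"}), ("doc-d2", {"graph"})],
}
near = route_query(network, repositories, "A", "ant", ttl=1, wanted=5)
far = route_query(network, repositories, "A", "ant", ttl=2, wanted=5)
print(near, far)
```

Raising the TTL lets the query reach more distant repositories at the cost of more messages, which is the trade-off that routing algorithms for SQRP try to manage.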

4.2 Neighboring-Ant Search (NAS)

NAS (Cruz et al., 2008) is also an ant-colony system, but it incorporates a local structural measure to guide the ants towards nodes that have better connectivity. The algorithm has three main phases: an evaluation phase that examines the local repository and incorporates the classical lookahead technique (Mihail et al., 2004), a transition phase in which the query propagates in the network until its TTL is reached, and a retrieval phase in which the pheromone tables are updated.

The most relevant aspects of former works have been incorporated into the proposed NAS algorithm. The framework of the AntNet algorithm is modified to correspond to the problem conditions: in AntNet the final addresses are known, while the NAS algorithm does not have a priori knowledge of where the resources are located. In contrast to AntSearch, the SemAnt algorithm and NAS focus on the same problem conditions, and both are based on the AntNet algorithm. However, the difference between SemAnt and NAS is that SemAnt only learns from past experience, whereas NAS also takes advantage of the local environment. This means that the search in NAS relies on the classic local exploration method of Lookahead (Mihail et al., 2004), the local structural metric DDC (Ortega, 2009), which measures the differences between the degree of a node and the degree of its neighbors, and three local functions of the past algorithm performance.

4.3 Adaptative Neighboring-Ant Search (AdaNAS)

AdaNAS is a metaheuristic algorithm in which a set of independent agents called ants cooperate indirectly and sporadically to achieve a common goal.

The algorithm has two objectives: it seeks to maximize the number of resources found by the ants and to minimize the number of steps taken by the ants. AdaNAS guides the queries toward nodes that have better connectivity using the local structural metric degree; in addition, it uses the well-known lookahead technique, which, by means of data structures, makes the repository contents of the neighboring nodes of a specific node known.

The algorithm performs all the queries in parallel using query ants. Each node has only one query ant, which generates a Forward Ant to attend a single user query, assigning the searched keyword t to the Forward Ant. Moreover, the query ant periodically performs the local pheromone evaporation of the node where it resides. The Algorithm shows the process carried out by the Forward Ant. As can be observed, all Forward Ants act in parallel. In an initial phase (lines 4-8), the ant checks the local repository, and if it finds matching documents it creates a backward ant. Afterwards, it carries out the search process (lines 9-25) while it is alive and has not found R documents. The search process has three sections: evaluation of results, evaluation and application of the extension of TTL, and selection of the next node (lines 24-28).


The first section, the evaluation of results (lines 10-15), implements the classical lookahead technique. That is, the ant x located in a node r checks the lookahead structure, which indicates how many matching documents are in each neighbor node of r. This function needs three parameters: the current node (r), the keyword (t) and the set of nodes known by the ant (known). The set known indicates which nodes the lookahead function should ignore, because their matching documents have already been taken into account. If some resource is found, the Forward Ant creates a backward ant and updates the quantity of found matching documents.
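The lookahead check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text only names the parameters r, t and known, so the extra arguments `neighbors` and `repositories`, the dictionary return type, and the substring match used to decide whether a document matches the keyword are assumptions introduced for the example.

```python
def lookahead(r, t, known, neighbors, repositories):
    """Count the documents matching keyword t in each neighbor of node r.

    Nodes listed in `known` are skipped, because their matching documents
    have already been taken into account by the ant.
    neighbors: dict node -> list of adjacent nodes (assumed layout).
    repositories: dict node -> list of documents as strings (assumed layout).
    Returns a dict {neighbor: number of matching documents}, omitting zeros.
    """
    results = {}
    for n in neighbors[r]:
        if n in known:
            continue
        count = sum(1 for doc in repositories.get(n, ()) if t in doc)
        if count > 0:
            results[n] = count
    return results


neighbors = {"r": ["a", "b", "c"]}
repositories = {
    "a": ["ants and p2p"],
    "b": ["ants", "ants again"],
    "c": ["music"],
}
# Node "a" is already known, so only "b" contributes new matches.
la_results = lookahead("r", "ants", {"a"}, neighbors, repositories)
```

In this sketch the caller would add the reported neighbors to `known` after each call, so that the same documents are not counted again on a later hop.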

Algorithm: Forward ant algorithm

1 in parallel for each Forward Ant x(r,t,R)
2   initialization: TTL = TTLmax, hops = 0
3   initialization: path = r, Λ = r, known = r
4   results = get_local_documents(r)
5   if results > 0 then
6     create backward ant y(path, results, t)
7     activate y
8   end
9   while TTL > 0 and results < R do
10    la_results = lookahead(r,t,known)

Fig 4 AdaNAS algorithm

The second section (lines 16-23) is the evaluation and application of the extension of TTL. In this section the ant verifies whether TTL has reached zero; if so, the ant tries to extend its life, and if it can, it changes the normal transition rule by modifying some parameters (line 21) in order to create the modified transition rule. The third section of the search process is the selection of the next node. Here, the transition rule (normal or modified) is applied to select the next node, and some structures are updated. The final phase occurs when the search process finishes; then, the Forward Ant creates an update ant to perform the pheromone update.

Figure 5 shows the results of the experiments applied to NAS and AdaNAS over thirty runs for each of ninety different instances generated with the characteristics shown in (Cruz et al., 2004). It can be seen that on all the instances the AdaNAS algorithm outperforms NAS. On average, AdaNAS had an efficiency 81% better than NAS. The topology and the repositories were created statically, whereas the queries were launched randomly during the simulation. Each simulation was run for 15,000 queries during 500 time units of 100 ms each. The average performance was studied by computing three performance measures over each 100 queries, among them the average efficiency, defined as the average number of resources found per traversed edge (hits/hops).
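As a rough illustration of the efficiency measure, the sketch below computes hits/hops over consecutive windows of queries. It rests on assumptions: the text does not say whether the average is a ratio of totals or a mean of per-query ratios, so the ratio of total hits to total hops within each window is used here, and the function name and the `(hits, hops)` record layout are hypothetical.

```python
def average_efficiency(records, window=100):
    """Return one hits/hops value per window of `window` consecutive queries.

    records: list of (hits, hops) tuples, one per query (assumed layout).
    A window with zero traversed edges is reported as 0.0 to avoid division by zero.
    """
    values = []
    for start in range(0, len(records), window):
        chunk = records[start:start + window]
        total_hits = sum(hits for hits, _ in chunk)
        total_hops = sum(hops for _, hops in chunk)
        values.append(total_hits / total_hops if total_hops else 0.0)
    return values


# Four queries grouped in windows of two: 3 hits over 8 hops, then 3 hits over 12 hops.
print(average_efficiency([(2, 4), (1, 4), (3, 6), (0, 6)], window=2))  # [0.375, 0.25]
```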

Fig 5 Comparison between NAS and AdaNAS experimenting with 90 instances

5 Text mining in the media

Today it is common to use computational tools to retrieve information; in fact, it is an everyday and in many cases necessary activity. Information retrieval is performed on structured or unstructured data. IR systems have commonly retrieved information from unstructured text (text without markup), while database systems have been created to query relational data (sets of records that have values for predefined attributes). The principal differences between them lie in the retrieval model, data structures and query language (Christopher et al., 2009). According to the literature reviewed, there are currently no Natural Language Processing techniques that achieve 100% accurate results, with either the statistical approach or the linguistic approach; for this reason, some researchers have blended both techniques (Chaudhuri et al., 2006) (Gonzalez et al., 2007) (Vallez & Pedraza, 2007). For example, (Sayyadian, 2004) proposes several methods to exploit structured information in databases and presents a query expansion mechanism based on information extraction from structured data. The experimental results show that using more structured information to expand the textual queries improves performance in the retrieval of entities in texts.


The amount of data with which one interacts is often considerably large, and in some cases it would be very difficult to work with it manually. In addition, these digital resources increase rapidly every day, which is one reason the World Wide Web has become so popular, and the growth of information systems is equally notable. Because of this, it is very important to retrieve information efficiently (Hristidis & Papakonstantinou, 2002).

Google's search engine is the clearest example of how a computational tool can make information retrieval easier for a user. Unfortunately, it does not handle elaborate searches well, since it is designed mainly to operate with keywords on document databases. Email servers are another very useful and popular type of tool.

Owing to the diversity of existing digital media (heterogeneous data), research has been carried out in diverse areas, in information retrieval as much as in natural language processing, whose final objective is to facilitate access to information and improve performance. (Vallez & Pedraza, 2007) classify the research areas as follows:

• Information extraction is the extraction, from a text or a set of texts, of entities, events and the relationships between existing elements.

• Summary generation condenses the most relevant information of a text. The techniques used vary according to the compression rate, the purpose of the summary, the genre of the text, and the language (or languages) of the source texts, among other factors.

• Question answering aims to give a concrete answer to the question raised by the user; it is important that the information need be well defined: dates, places, people, etc.

• Multilingual information retrieval consists of retrieving information even though the question and/or the documents are in different languages, a situation that currently prevails on the Web.

• Automatic text classification techniques automatically assign a set of documents to predefined classification categories, mainly by using statistical techniques, processing and parameterization.

IR systems seek to identify not just one object in a collection but several items that can answer the query and satisfy the user's requirements. The objects are usually text documents, but they may also be multimedia content such as images, video or audio. For retrieval to be efficient, the data are transformed into an adequate representation. In addition, to answer the user's demands satisfactorily, the system can use various techniques and models, for example the statistical processing that represents the classical model of information retrieval systems. (Noy, 2006) uses data mining to test an analytical approach, whereas (Oren, 2002) uses the genetic programming paradigm with satisfactory results.
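To make the idea of statistical processing concrete, the sketch below ranks documents with tf-idf, one common statistical weighting. The chapter does not name a specific weighting scheme, so the choice of tf-idf, the whitespace tokenization, and the function name `tfidf_rank` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_rank(query, documents):
    """Rank documents for a query by a simple tf-idf score.

    Each document becomes a bag of lowercase words (its "adequate representation").
    Returns the indices of documents with a positive score, best first;
    ties are broken by document order.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in tf
        )
        scored.append((score, idx))
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in ranked if score > 0]


docs = ["ant colony search", "genetic algorithm search", "ant ant pheromone"]
print(tfidf_rank("ant", docs))  # [2, 0]
```

Note that the function returns several items rather than a single best match, which mirrors the point made above that an IR system answers a query with a set of candidate objects.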

In (Iskandar, 2007): “The retrieval strategy has been evaluated using Wikipedia, a social media collection that is an online encyclopedia. Social media describes the online tools and platforms that people use to share opinions, insights, experiences, and perspectives with each other. Social media can take many different forms, including text, images, audio, and video. Popular social mediums include blogs, message boards, podcasts, wikis, and blogs” (see Figure 6).

5.1 Experiments

Fig 6 Social Media Retrieval using image features and structured text

By means of the developed tool, WREID, we simulated the expectations of success in a wrestling circuit and the interest in obtaining popularity, based on performance associated with specific features. One of the most interesting characteristics observed in the experimental analysis was the diversity of cultural patterns established by each society, owing to the selection of different attributes in a potential best wrestler: agility, ability to fight, emotional control, force, stamina, speed and intelligence. The structured scenes associated with the agents cannot be reproduced in general, since their time and space belong to a given moment. They represent a unique, necessary and innovative form of adaptive behavior, which solves a computational problem that follows a complex change of relations. Using social data mining implemented with agents, it was possible to simulate the behavior of many followers in the selection of a best wrestler and to determine which people support this professional career.

With respect to node attributes, we summarize the measures required to describe individual nodes of a graph. They allow elements to be identified by their topological properties. The degree, or connectivity, (ki) of a node vi is defined as the number of edges of this node. From the adjacency matrix, we easily obtain the degree of a given node as:

$$k_i = \sum_{j=1}^{N} a_{ij}$$
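Computing the degree from the adjacency matrix amounts to summing a row. The short sketch below illustrates this for an undirected graph stored as a list of lists of 0/1 entries; the function name and the storage layout are assumptions made for the example.

```python
def degrees(adjacency):
    """Return the degree of every node as the row sums of a 0/1 adjacency matrix."""
    return [sum(row) for row in adjacency]


# Node 0 is linked to nodes 1 and 2; nodes 1 and 2 are linked only to node 0.
adjacency = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(degrees(adjacency))  # [2, 1, 1]
```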

See examples of k values in Figure 7. For directed graphs, we distinguish between incoming and outgoing links. Thus, we split the degree of a node into its indegree, $k_i^{in}$, and its outdegree, $k_i^{out}$. The clustering coefficient $C_i$ is a local measure quantifying the likelihood that neighboring nodes of vi are connected with each other. It is calculated by dividing the number of neighbors of vi that are actually connected among them, n, by all possible combinations excluding autoloops, i.e., ki(ki-1). Formally, we have:

$$C_i = \frac{n}{k_i (k_i - 1)}$$
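The clustering coefficient can be sketched in the same style. One point is an assumption of this example rather than something the text states: because the denominator ki(ki-1) counts ordered pairs of neighbors, n is counted here as the number of ordered neighbor pairs (u, v) that are linked, so that a fully connected neighborhood yields a coefficient of 1. Nodes with fewer than two neighbors are given a coefficient of 0 by convention.

```python
def clustering_coefficient(adjacency, i):
    """Local clustering coefficient of node i in a 0/1 adjacency matrix.

    n counts ordered pairs of distinct neighbors of i that are linked,
    matching the k*(k-1) denominator (an assumption of this sketch).
    """
    neighbors = [j for j, a in enumerate(adjacency[i]) if a and j != i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    n = sum(1 for u in neighbors for v in neighbors if u != v and adjacency[u][v])
    return n / (k * (k - 1))


# A triangle 0-1-2 with an extra node 3 attached only to node 0.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
coefficients = [clustering_coefficient(adjacency, i) for i in range(4)]
```

In this example node 1 has both of its neighbors linked to each other, so its coefficient is 1, while node 0 has three neighbors of which only one pair is linked.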


Wrestler: Scott Steel

6 Data mining with Ant Colony and Genetic Algorithm

6.1 Artificial Ant Colony

This section describes the principles of any Ant System (AS), a meta-heuristic algorithm based on the way natural ants find food sources. The description starts with the ant metaphor, which is a model of this behavior. It then follows with a discussion of how AS has evolved and shows how ant algorithms can be applied to the data mining process. The Ant System was inspired by the collective behavior of certain real ants (forager ants). While they are traveling in search of food, they deposit a chemical substance called pheromone on the traversed path. Communication through pheromone is an effective way of coordinating the activities of these insects. For this reason, pheromone rapidly influences the behavior of each ant: ants will choose the paths with the highest pheromone concentration. The behavior of real ants searching for food is modeled as a probabilistic process. When there are paths without any pheromone, the ants explore the neighboring area in a totally random way. In the presence of pheromone, the ants follow a path with a probability based on the pheromone concentration. The ants deposit additional pheromone during their travels. Since the pheromone evaporates, the
