Data Mining
SUSHMITA MITRA
Machine Intelligence Unit
Indian Statistical Institute
Kolkata, India
TINKU ACHARYA
Senior Executive Vice President
Chief Science Officer
Avisere Inc
Tucson, Arizona
and
Adjunct Professor
Department of Electrical Engineering
Arizona State University
Copyright © 2003 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, e-mail: permreq@wiley.com.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Mitra, Sushmita
Data mining : multimedia, soft computing, and bioinformatics /
Sushmita Mitra and Tinku Acharya.
To Ma, who made me what I am,
and to Somosmita, who let me feel like a supermom.
—Sushmita
To Lisa, Arita, and Arani
—Tinku Acharya
Contents

Preface

1 Introduction to Data Mining
    1.1 Introduction
    1.2 Knowledge Discovery and Data Mining
    1.3 Data Compression
    1.4 Information Retrieval
    1.5 Text Mining
    1.6 Web Mining
    1.7 Image Mining
    1.8 Classification
    1.9 Clustering
    1.10 Rule Mining
    1.11 String Matching
    1.12 Bioinformatics
    1.13 Data Warehousing
    1.14 Applications and Challenges
    1.15 Conclusions and Discussion
    References

2 Soft Computing
    2.1 Introduction
    2.2 What is Soft Computing?
        2.2.1 Relevance
        2.2.2 Fuzzy sets
        2.2.3 Neural networks
        2.2.4 Neuro-fuzzy computing
        2.2.5 Genetic algorithms
        2.2.6 Rough sets
        2.2.7 Wavelets
    2.3 Role of Fuzzy Sets in Data Mining
        2.3.1 Clustering
        2.3.2 Granular computing
        2.3.3 Association rules
        2.3.4 Functional dependencies
        2.3.5 Data summarization
        2.3.6 Image mining
    2.4 Role of Neural Networks in Data Mining
        2.4.1 Rule extraction
        2.4.2 Rule evaluation
        2.4.3 Clustering and self-organization
        2.4.4 Regression
        2.4.5 Information retrieval
    2.5 Role of Genetic Algorithms in Data Mining
        2.5.1 Regression
        2.5.2 Association rules
    2.6 Role of Rough Sets in Data Mining
    2.7 Role of Wavelets in Data Mining
    2.8 Role of Hybridizations in Data Mining
    2.9 Conclusions and Discussion
    References

3 Multimedia Data Compression
    3.1 Introduction
    3.2 Information Theory Concepts
        3.2.1 Discrete memoryless model and entropy
        3.2.2 Noiseless Source Coding Theorem
    3.3 Classification of Compression Algorithms
    3.4 A Data Compression Model
    3.5 Measures of Compression Performance
        3.5.1 Compression ratio and bits per sample
        3.5.2 Quality metric
        3.5.3 Coding complexity
    3.6 Source Coding Algorithms
        3.6.1 Run-length coding
        3.6.2 Huffman coding
    3.7 Principal Component Analysis for Data Compression
    3.8 Principles of Still Image Compression
        3.8.1 Predictive coding
        3.8.2 Transform coding
        3.8.3 Wavelet coding
    3.9 Image Compression Standard: JPEG
    3.10 The JPEG Lossless Coding Algorithm
    3.11 Baseline JPEG Compression
        3.11.1 Color space conversion
        3.11.2 Source image data arrangement
        3.11.3 The baseline compression algorithm
        3.11.4 Decompression process in baseline JPEG
        3.11.5 JPEG2000: Next generation still picture coding standard
    3.12 Text Compression
        3.12.1 The LZ77 algorithm
        3.12.2 The LZ78 algorithm
        3.12.3 The LZW algorithm
        3.12.4 Other applications of Lempel-Ziv coding
    3.13 Conclusions and Discussion
    References

4 String Matching
    4.1 Introduction
        4.1.1 Some definitions and preliminaries
        4.1.2 String matching problem
        4.1.3 Brute force string matching
    4.2 Linear-Order String Matching Algorithms
        4.2.1 String matching with finite automata
        4.2.2 Knuth-Morris-Pratt algorithm
        4.2.3 Boyer-Moore algorithm
        4.2.4 Boyer-Moore-Horspool algorithm
        4.2.5 Karp-Rabin algorithm
    4.3 String Matching in Bioinformatics
    4.4 Approximate String Matching
        4.4.1 Basic definitions
        4.4.2 Wagner-Fischer algorithm for computation of string distance
        4.4.3 Text search with k-differences
    4.5 Compressed Pattern Matching
    4.6 Conclusions and Discussion
    References

5 Classification in Data Mining
    5.1 Introduction
    5.2 Decision Tree Classifiers
        5.2.1 ID3
        5.2.2 IBM IntelligentMiner
        5.2.3 Serial PaRallelizable INduction of decision Trees (SPRINT)
        5.2.4 RainForest
        5.2.5 Overfitting
        5.2.6 PrUning and BuiLding Integrated in Classification (PUBLIC)
        5.2.7 Extracting classification rules from trees
        5.2.8 Fusion with neural networks
    5.3 Bayesian Classifiers
        5.3.1 Bayesian rule for minimum risk
        5.3.2 Naive Bayesian classifier
        5.3.3 Bayesian belief network
    5.4 Instance-Based Learners
        5.4.1 Minimum distance classifiers
        5.4.2 k-nearest neighbor (k-NN) classifier
        5.4.3 Locally weighted regression
        5.4.4 Radial basis functions (RBFs)
        5.4.5 Case-based reasoning (CBR)
        5.4.6 Granular computing and CBR
    5.5 Support Vector Machines
    5.6 Fuzzy Decision Trees
        5.6.1 Classification
        5.6.2 Rule generation and evaluation
        5.6.3 Mapping of rules to fuzzy neural network
        5.6.4 Results
    5.7 Conclusions and Discussion
    References

6 Clustering in Data Mining
    6.1 Introduction
    6.2 Distance Measures and Symbolic Objects
        6.2.1 Numeric objects
        6.2.2 Binary objects
        6.2.3 Categorical objects
        6.2.4 Symbolic objects
    6.3 Clustering Categories
        6.3.1 Partitional clustering
        6.3.2 Hierarchical clustering
        6.3.3 Leader clustering
    6.4 Scalable Clustering Algorithms
        6.4.1 Clustering large applications
        6.4.2 Density-based clustering
        6.4.3 Hierarchical clustering
        6.4.4 Grid-based methods
        6.4.5 Other variants
    6.5 Soft Computing-Based Approaches
        6.5.1 Fuzzy sets
        6.5.2 Neural networks
        6.5.3 Wavelets
        6.5.4 Rough sets
        6.5.5 Evolutionary algorithms
    6.6 Clustering with Categorical Attributes
        6.6.1 Sieving Through Iterated Relational Reinforcements (STIRR)
        6.6.2 Robust Hierarchical Clustering with Links (ROCK)
        6.6.3 c-modes algorithm
    6.7 Hierarchical Symbolic Clustering
        6.7.1 Conceptual clustering
        6.7.2 Agglomerative symbolic clustering
        6.7.3 Cluster validity indices
        6.7.4 Results
    6.8 Conclusions and Discussion
    References

7 Association Rules
    7.1 Introduction
    7.2 Candidate Generation and Test Methods
        7.2.1 A priori algorithm
        7.2.2 Partition algorithm
        7.2.3 Some extensions
    7.3 Depth-First Search Methods
    7.4 Interesting Rules
    7.5 Multilevel Rules
    7.6 Online Generation of Rules
    7.7 Generalized Rules
    7.8 Scalable Mining of Rules
    7.9 Other Variants
        7.9.1 Quantitative association rules
        7.9.2 Temporal association rules
        7.9.3 Correlation rules
        7.9.4 Localized associations
        7.9.5 Optimized association rules
    7.10 Fuzzy Association Rules
    7.11 Conclusions and Discussion
    References

8 Rule Mining with Soft Computing
    8.1 Introduction
    8.2 Connectionist Rule Generation
        8.2.1 Neural models
        8.2.2 Neuro-fuzzy models
        8.2.3 Using knowledge-based networks
    8.3 Modular Hybridization
        8.3.1 Rough fuzzy MLP
        8.3.2 Modular knowledge-based network
        8.3.3 Evolutionary design
        8.3.4 Rule extraction
        8.3.5 Results
    8.4 Conclusions and Discussion
    References

9 Multimedia Data Mining
    9.1 Introduction
    9.2 Text Mining
        9.2.1 Keyword-based search and mining
        9.2.2 Text analysis and retrieval
        9.2.3 Mathematical modeling of documents
        9.2.4 Similarity-based matching for documents and queries
        9.2.5 Latent semantic analysis
        9.2.6 Soft computing approaches
    9.3 Image Mining
        9.3.1 Content-Based Image Retrieval
        9.3.2 Color features
        9.3.3 Texture features
        9.3.4 Shape features
        9.3.5 Topology
        9.3.6 Multidimensional indexing
        9.3.7 Results of a simple CBIR system
    9.4 Video Mining
        9.4.1 MPEG-7: Multimedia content description interface
        9.4.2 Content-based video retrieval system
    9.5 Web Mining
        9.5.1 Search engines
        9.5.2 Soft computing approaches
    9.6 Conclusions and Discussion
    References

10 Bioinformatics: An Application
    10.1 Introduction
    10.2 Preliminaries from Biology
        10.2.1 Deoxyribonucleic acid
        10.2.2 Amino acids
        10.2.3 Proteins
        10.2.4 Microarray and gene expression
    10.3 Information Science Aspects
        10.3.1 Protein folding
        10.3.2 Protein structure modeling
        10.3.3 Genomic sequence analysis
        10.3.4 Homology search
    10.4 Clustering of Microarray Data
        10.4.1 First-generation algorithms
        10.4.2 Second-generation algorithms
    10.5 Association Rules
    10.6 Role of Soft Computing
        10.6.1 Predicting protein secondary structure
        10.6.2 Predicting protein tertiary structure
        10.6.3 Determining binding sites
        10.6.4 Classifying gene expression data
    10.7 Conclusions and Discussion
    References

Index

About the Authors

Preface

The success of the digital revolution and the growth of the Internet have ensured that huge volumes of high-dimensional multimedia data are available all around us. This information is often mixed, involving different datatypes such as text, image, audio, speech, hypertext, graphics, and video components interspersed with each other. The World Wide Web has played an important role in making the data, even from geographically distant locations, easily accessible to users all over the world. However, most of this data is often of little interest to most users. The problem is to mine useful information or patterns from the huge datasets. Data mining refers to this process of extracting knowledge that is of interest to the user.
Data mining is an evolving and growing area of research and development, in both academia and industry. It involves interdisciplinary research and development encompassing diverse domains. In our view, this area is far from being saturated, with newer techniques and directions being proposed in the literature every day. In this age of multimedia data exploration, data mining should no longer be restricted to the mining of knowledge from large volumes of high-dimensional datasets in traditional databases only. Researchers need to pay attention to the mining of different datatypes, including numeric and alphanumeric formats, text, images, video, voice, speech, graphics, and also their mixed representations. Efficient management of such high-dimensional very large databases also influences the performance of data mining systems. Data compression technologies can play a significant role. It is also important that special multimedia data compression techniques, especially suitable for data mining applications, are explored.
With the completion of the Human Genome Project, we have access to large databases of biological information. Proper analysis of such huge data, involving decoding of genes in the DNA and the three-dimensional protein structures, holds immense promise in Bioinformatics. The applicability of data mining in this domain cannot be denied, given the lifesaving prospects of effective drug design. This is also of practical interest to the pharmaceutical industry.
Different forms of ambiguity or uncertainty inherent in real-life data need to be handled appropriately using soft computing. The goal is to arrive at a low-cost, reasonably good solution, instead of a high-cost, best solution. Fuzzy sets provide the uncertainty handling capability, inherent in human reasoning, while artificial neural networks help incorporate learning to minimize error. Genetic algorithms introduce effective parallel searching in the high-dimensional problem space.
Since these aspects are not covered in such detail in the books currently available in the market, we wanted to emphasize them in this book. Along with the traditional concepts and functions of data mining, like classification, clustering, and rule mining, we wish to highlight the current and pressing issues related to mining in multimedia applications and Bioinformatics. Because storage of such huge datasets is more feasible in the compressed domain, we also devote a reasonable portion of the text to data mining in the compressed domain. Topics like text mining, image mining, and Web mining are covered specifically.
Current trends show that the advances in data mining need not be constrained to stochastic, combinatorial, and/or classical so-called hard optimization-based techniques. We dwell, in greater detail, on the state of the art of soft computing approaches, advanced signal processing techniques such as Wavelet Transformation, data compression principles for both lossless and lossy techniques, access of data using matching pursuits in both raw and compressed data domains, fundamentals and principles of classical string matching algorithms, and how all these areas possibly influence data mining and its future growth. We cover aspects of advanced image compression, string matching, content based image retrieval, etc., which can influence future developments in data mining, particularly for multimedia data mining.
There are 10 chapters in the book. The first chapter provides an introduction to the basics of data mining and outlines its major functions and applications. This is followed in the second chapter by a discussion on soft computing and its different tools, including fuzzy sets, artificial neural networks, genetic algorithms, wavelet transforms, rough sets, and their hybridizations, along with their roles in data mining.
We then present some advanced topics and new aspects of data mining related to the processing and retrieval of multimedia data. These have direct applications to information retrieval, Web mining, image mining, and