Data Mining and Knowledge Discovery Handbook, 2nd Edition



Term of “statistical” in SAS Text Miner using SASPDF-SYNONYMS text file (Woodfield, 2004)

SAS Text Miner uses a “drag-and-drop” principle: the user drags a selected icon from the tool set and drops it into the workspace. The workspace of SAS Text Miner was constructed with a data icon of selected animal data that was provided by SAS in their Instructor’s Trainer Kit, as shown in Figure 24. Figure 25 shows the results of using SAS Text Miner, with individual plots for “role by frequency”, “number of documents by frequency”, “frequency by weight”, “attribute by frequency”, and a “number of documents by frequency” scatter plot. Figure 26 shows the “Concept Linking Figure” as generated by SAS Text Miner using the SASPDF-SYNONYMS text file.

65.5.2 Megaputer PolyAnalyst

Previous work by the authors (Segall and Zhang, 2006) utilized Megaputer PolyAnalyst for data mining. The new release, PolyAnalyst version 6.0, includes text mining and, specifically, new features for text OLAP (on-line analytical processing) and taxonomy-based categorization, as discussed in Megaputer Intelligence Inc. (2007). The latter notes that taxonomy-based classifications are useful when dealing with large collections of unstructured documents, for example when tracking the number of known issues in product repair notes and customer support letters. According to Megaputer Intelligence Inc. (2007), PolyAnalyst “provides simple means for creating, importing, and managing taxonomies, and carries out automated categorization of text records against existing taxonomies.” Megaputer Intelligence Inc. (2007) provides examples of applications for executives, customer support specialists, and analysts. According to Megaputer Intelligence Inc. (2007), “executives are able to make better business decisions upon viewing a concise report on the distribution of tracked issues during the latest observation period.”
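The idea of categorizing text records against an existing taxonomy and reporting the distribution of tracked issues can be illustrated with a minimal Python sketch. This is not PolyAnalyst's implementation: the taxonomy, the trigger terms, the sample repair notes, and the function names `categorize` and `issue_distribution` are all hypothetical, and matching is reduced to simple word overlap.

```python
from collections import Counter

# Hypothetical taxonomy: category name -> trigger terms (illustrative only).
TAXONOMY = {
    "battery": {"battery", "charge", "charger"},
    "screen": {"screen", "display", "cracked"},
    "billing": {"invoice", "refund", "billing"},
}


def categorize(record, taxonomy=TAXONOMY):
    """Return the sorted categories whose trigger terms occur in the record."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = {w.strip(".,;:!?()\"'").lower() for w in record.split()}
    return sorted(cat for cat, terms in taxonomy.items() if words & terms)


def issue_distribution(records, taxonomy=TAXONOMY):
    """Count how many records fall under each taxonomy category."""
    counts = Counter()
    for record in records:
        counts.update(categorize(record, taxonomy))
    return counts


# Hypothetical product repair notes.
notes = [
    "Battery will not charge after update.",
    "Customer reports cracked screen; requests refund.",
    "Display flickers when the charger is connected.",
]

print(issue_distribution(notes))
```

A record may match several categories, so the counts describe how often each issue is mentioned rather than a partition of the records. A production system would presumably add stemming, phrase matching, and hierarchical categories.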

This chapter provides several screen shots of Megaputer PolyAnalyst version 6.0 for text mining: Figure 27 shows the text mining workspace of Megaputer PolyAnalyst, Figure 28 shows the “Suffix Tree Clustering” report for the text cluster (desk; front), and Figure 29 shows the “Link Term” report for hotel customer survey text. Megaputer PolyAnalyst can also provide drill-down text analysis and histogram plots of text analysis.


Fig 65.27 Workspace for Text Mining in Megaputer PolyAnalyst

Fig 65.28 Clustering Results in Megaputer PolyAnalyst

Fig 65.29 Link Term Report using Text Analysis in Megaputer PolyAnalyst

65.6 Web Mining Software

Two selected software packages are reviewed and compared in terms of data preparation, data analysis, and results reporting (see Table 4). As the table shows, Megaputer PolyAnalyst has the unique feature of a data and text mining tool integrated with web site data source input, while SPSS Clementine takes a linguistic approach rather than a statistics-based approach. Table 4 summarizes the differences and similarities between the two packages.


Table 4 (fragment)

Data Understand product and content affinities (link analysis)

Analysis Predict user propensity to convert, buy, or churn

x

Keyword and Search Engine

Reporting Support for multiple languages

tool integrated with web site data source input

Linguistic approach rather than statistics-based approach

65.6.1 Megaputer PolyAnalyst

Megaputer PolyAnalyst is an enterprise analytical system that integrates Web mining with data and text mining; it does not have a separate module for Web mining. Web pages or sites can be input directly to Megaputer PolyAnalyst as data source nodes. Megaputer PolyAnalyst has the standard data and text mining functionalities, such as categorization, clustering, prediction, link analysis, keyword and entity extraction, pattern discovery, and anomaly detection. These functional nodes can be connected directly to the web data source node to perform web mining analysis. The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis scenarios without loading data into the system, thus saving the analyst’s time. According to Megaputer Intelligence Inc. (2007), whatever data sources are used, PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can load data from disparate data sources, including all popular databases, statistical systems, and spreadsheet systems. In addition, it can load collections of documents in html, doc, pdf, and txt formats, as well as load data from an internet web source. PolyAnalyst offers visual “on-the-fly integration” and merging of data coming from disparate sources to create data marts for further analysis. It supports incremental data appending and referencing data sets in previously created PolyAnalyst projects.

Figures 30-32 are screen shots illustrating applications of Megaputer PolyAnalyst for web mining to available data sets. Figure 30 shows an expanded view of the PolyAnalyst workspace. Figure 31 shows a screen shot of PolyAnalyst using the website of Arkansas State University (ASU) as the web data source. Figure 32 shows a keyword extraction report from the undergraduate admission web page of the ASU website.
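To make the notion of a keyword extraction report from a web page more concrete, the following Python sketch ranks the most frequent non-stopword terms in the visible text of an HTML document. It is an illustration under stated assumptions, not the algorithm used by PolyAnalyst: the sample page, the stopword list, and the names `TextExtractor` and `extract_keywords` are hypothetical, and a raw term-frequency ranking stands in for whatever weighting the commercial tool applies.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "is", "on", "at", "our"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_keywords(html, top_n=5):
    """Return the top_n (term, count) pairs from the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


# Hypothetical page content; a real run would load the HTML of the target site.
SAMPLE_PAGE = """
<html><head><title>Undergraduate Admission</title>
<style>body { color: black; }</style></head>
<body><h1>Admission</h1>
<p>Apply for undergraduate admission. Admission requirements and
scholarship deadlines are listed for undergraduate applicants.</p>
<script>var tracking = "admission admission";</script>
</body></html>
"""

print(extract_keywords(SAMPLE_PAGE, top_n=2))
# → [('admission', 4), ('undergraduate', 3)]
```

The page is embedded as a string so that the sketch runs without network access; the words inside the script element are deliberately ignored so that they do not inflate the counts.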

Fig 65.30 PolyAnalyst workspace with Internet data source

Fig 65.31 PolyAnalyst using www.astate.edu as web data source

Fig 65.32 Keyword extraction report


Fig 65.33 SPSS Clementine workspace

Fig 65.34 Decision rules for determining clusters of web data

SPSS (2007) claims four key data mining capabilities: segmentation, sequence detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) indicates six Web analysis application modules within SPSS Clementine: search engine optimization, automated user and visit segmentation, Web site activity and user behavior analysis, home page activity, activity sequence analysis, and propensity analysis.

Unlike other platforms used for Web mining that provide only simple frequency counts (e.g., number of visits, ad hits, top pages, total purchase visits, and top click streams), SPSS Clementine (SPSS, 2007) provides more meaningful customer intelligence, such as likelihood to convert by individual visitor, likelihood to respond by individual prospect, content clusters by customer value, missed cross-sell opportunities, and event sequences by outcome. Figures 33-35 are screen shots illustrating applications of SPSS Clementine for web mining to available data sets. Figure 33 shows the SPSS Clementine workspace. Different user modes can be defined, including research mode, shopping mode, search mode, evaluation mode, and so on. Decision rules for determining clusters of web data are demonstrated in Figure 34. Figure 35 exhibits decision tree results with classifiers using different model types (e.g., CHAID, logistic, neural).

Fig 65.35 Decision tree results

65.7 Conclusion and Future Research

The conclusions of this research include the fact that each of the software packages selected has its own unique characteristics and properties that can be displayed when applied to the available data sets. As indicated, each software package has its own set of algorithm types to which it can be applied.

Comparing the five data mining software packages, BioDiscovery GeneSight focuses on cluster analysis and is able to provide a variety of data mining visualization charts and colors. BioDiscovery GeneSight has fewer data mining functions than the other four. SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner each employ the same algorithms, as illustrated in Table 1, except that SAS has a separate software package, SAS Text Miner, for text analysis. The regression results obtained using these software packages are comparable. The cluster analysis results for SAS Enterprise Miner, BioDiscovery GeneSight, and Megaputer PolyAnalyst are each unique in how they represent their results.

In conclusion, SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner offer the greatest diversification of data mining algorithms.

This chapter has discussed commercial data mining software that is applicable to supercomputing for 3-D visualization and very large microarray databases. Specifically, it illustrated the applications of supercomputing for data visualization using two selected software packages, Avizo and JMP Genomics. Avizo is a general supercomputing software package, and JMP Genomics is a specialized package for genetic data. Supercomputing data mining for 3-D visualization with Avizo is applied to diverse applications such as the human skull for medical research and atomic structure, which can be used for multipurpose applications such as chemical or nuclear. We have also presented, using JMP Genomics, the data distributions of condition, patient, frequencies,


has standalone Text Analyst software for text mining.

Regarding web mining software, PolyAnalyst can mine web data within an integrated data mining enterprise analytical system and provides visual tools such as link analysis of the critical terms of the text. SPSS Clementine can be used for graphical illustrations of customer web activities as well as for link analysis of different data categories such as campaign, age, gender, and income. The selection of appropriate web mining software should be based on both its available web mining technologies and the type of data to be encountered.

The future direction of the research is to investigate other data, text, web, and supercomputing mining software for analyzing various types of data and to compare the capabilities of these software packages. This future research would also include the acquisition of other data sets to perform these new analyses and comparisons.

Acknowledgement The authors would like to acknowledge the support provided by a 2009 Summer Faculty Research Grant awarded to them by the College of Business of Arkansas State University, without whose program and support this work could not have been done. The authors also want to acknowledge each of the software manufacturers for their support of this research.

References

AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction, Software and Data, retrieved from http://www.cs.rpi.edu/~goebel/ss02/software-and-data.html

Ceccato, M., Marin, M., Mens, K., Moonen, L., et al. (2006), Applying and combining three different aspect mining techniques, Software Quality Journal 14(3), 209-214

Chang, J. and Lee, W. (2006), Finding frequent itemsets over online data streams, Information and Software Technology 48(7), 606-619

Chou, C., Sinha, A. and Zhao, H. (2008), A text mining approach to Internet abuse detection, Information Systems and eBusiness Management 6(4), 419-440

Curry, C., Grossman, R., Locke, D., Vejcik, S., and Bugajski, J. (2007), Detecting changes in large data sets of payment card data: A case study, KDD’07, August 12-15, San Jose, CA


Data Intelligence Group (1995), An overview of data mining at Dun & Bradstreet, DIG White Paper 95/01, retrieved from http://www.thearling.com.text/wp9501/wp9501.htm

Davi, A., Haughton, D., Nasr, N., Shah, G., et al. (2005), A Review of Two Text-Mining Packages: SAS TextMining and WordStat, The American Statistician 59(1), 89-104

Davies, A. (2007), Identification of spurious results generated via data mining using an Internet distributed supercomputer grant, Duquesne University Donahue School of Business, http://www.business.duq.edu/Research/details.asp?id=83

Deshmukah, A. V. (1997), Software review: ModelQuest Expert 1.0, ORMS Today, December 1997, retrieved from http://www.lionhrtpub.com/orms/orms-12-97/software-review.html

Ducatelle, F. (2006), Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html

Ganapathy, S., Ranganathan, C. and Sankaranarayanan, B. (2004), Visualization strategies and tools for enhancing customer relationship management, Communications of the ACM 47(11), 92-98

Grossman, R. (2007), Data grids, data clouds and data webs: a survey of high performance and distributed data mining, HPC Workshop: Hardware and software for large-scale biological computing in the next decade, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html

Hearst, M. A. (2003), What is Data Mining?, http://www.ischool.berkeley.edu/~hearstr/ text mining.html

IBM DB2 Intelligent Miner Visualization: Using the Intelligent Miner Visualizers Version 8.2 SH12, Second Edition, August 2004

Kim, S., Whitehead, E. J., Jr. and Zhang, Y. (2008), Classifying Software Changes: Clean or Buggy? IEEE Transactions on Software Engineering 34(2), 181-197

Lau, K., Lee, K. and Ho, Y. (2005), Text Mining for the Hotel Industry, Cornell Hotel and Restaurant Administration Quarterly 46(3), 344-363

Lazarevic, A., Fiea, T., and Obradovic, Z. (2006), A software system for spatial data analysis and modeling, retrieved from http://www.ist.temple.edu?~zoran/papers/lazarevic00.pdf

Leung, Y. F. (2004), My microarray software comparison - Data mining software, September 2004, Chinese University of Hong Kong, retrieved from http://www.ihome.cuhk.edu.hk/~b400559/arraysoft mining specific.html

Megaputer Intelligence Inc. (2007), Data Mining, Text Mining, and Web Mining Software, http://www.megaputer.com

Mesrobian, E., Muntz, R., Shek, E., Mechoso, C. R., Farrara, J. D., Spahr, J. A., Stolorz, P. (1995), Real time data mining, management, and visualization of GCM output, IEEE Computer Society, v.81, http://dml.cs.ucla.edu/~shek/publications/sc 94.ps.gz

Metz, C. (2003), Software: Text Mining, PC Magazine, July 1, http://www.pcmag.com/print article2/0,1217.a=43573,00.asp

National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health, NCBI tools for data mining, retrieved from http://www.ncbi.nlm.nih.gov/Tools/

Nayak, R. (2008), Data Mining in Web Services Discovery and Monitoring, International Journal of Web Services Research 5(1), 63-82

Nisbet, R. A. (2006), Data mining tools: Which one is best for CRM? Part 3, DM Review, March 21, 2006, retrieved from http://www.dmreview.com/editorial/dmreview/ print action.cfm?articleId=1049954


Narrative Reports, Lecture Notes in Artiﬁcial intelligence 3055, page 217-228 Springer-Verlag, 2004

Sanchez, E. (1996), Speedier: Penn researchers to link supercomputers to community problems, The Compass, v. 43, n. 4, p. 14, September 17, http://www.upenn.edu/pennnews/ features/1996/091796/research

Sanchez, M., Moreno, M., Segrera, S. and Lopez, V. (2008), Framework for the development of a personalised recommender system with integrated web-mining functionalities, International Journal of Computer Applications in Technology 33(4), 312-327

SAS (2009), JMP Genomics 4.0 Product Brief, http://www.jmp.com/software/genomics /pdf/103112 jmpg4 prodbrief.pdf

Segall, R. and Zhang, Q. (2006), Data visualization and data mining of continuous numerical and discrete nominal-valued microarray databases for biotechnology, Kybernetes: International Journal of Systems and Cybernetics 35(9/10), 1538-1566

Seigle, G. (2002), CIA, FBI developing intelligence supercomputer, Global Security

Sekijima, M. (2007), Application of HPC to the analysis of disease related protein and the design of novel proteins, HPC Workshop: “Hardware and software for large-scale biological computing in the next decade”, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html

SPSS (2009a), PASW Modeler 13: Overview Demo, http://www.spss.com/media/demos/ modeler/ demo-modeler-overview/index.htm

SPSS (2009b), PASW Modeler Auto Cluster and Cluster Viewer, http://www.spss.com/media/demos/modeler/demo-modeler-autocluster/index.htm

SPSS (2007), Web Mining for Clementine, http://www.spss.com/web mining for clementine, viewed 16 May 2007

StatSoft, Inc. (2006), Electronic textbook, retrieved from http://www.statsoft.com/textbook/glosa.html

VSG Visualization Sciences Group (2009), Avizo: The 3D visualization software for scientific and industrial data, http://www.vsg3d.com/vsg prod avizo overview.php

Wikipedia (2006), Supercomputers, retrieved May 19, 2009 from BookRags.com: http://www.bookrags.com/wiki/Supercomputer

Wikipedia (2007), Web mining, http://en.wikipedia.org/wiki/Web mining

Woodfield, T. (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS Institute, Inc., Cary, NC

Zhang, Q. and Segall, R. (2008), Web mining: a survey of current research, techniques, and software, International Journal of Information Technology & Decision Making 7(4), 683-720


Weka-A Machine Learning Workbench for Data Mining

Eibe Frank¹, Mark Hall¹, Geoffrey Holmes¹, Richard Kirkby¹, Bernhard Pfahringer¹, Ian H. Witten¹, and Len Trigg²

¹ Department of Computer Science, University of Waikato, Hamilton, New Zealand

{eibe, mhall, geoff, rkirkby, bernhard, ihw}@cs.waikato.ac.nz

² Reel Two, P.O. Box 1538, Hamilton, New Zealand

len@reeltwo.com

Summary The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

Key words: machine learning software, Data Mining, data preprocessing, data visualization, extensible workbench

66.1 Introduction

Experience shows that no single machine learning method is appropriate for all possible learning problems. The universal learner is an idealistic fantasy. Real datasets vary, and to obtain accurate models the bias of the learning algorithm must match the structure of the domain. The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that users can quickly try out existing machine learning methods on new datasets in very flexible ways. It provides extensive support for the whole process of experimental Data Mining, including preparing the input data, evaluating learning schemes statistically, and visualizing both the input data and the result of learning. This has been accomplished by including a wide variety of algorithms for learning different types of concepts, as well as a wide range of preprocessing methods. This diverse and comprehensive set of tools can be invoked through a common interface, making it possible for users

O Maimon, L Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
