Term of “statistical” in SAS Text Miner using SASPDF-SYNONYMS text file (Woodfield, 2004)
SAS Text Miner uses the “drag-and-drop” principle: the user drags a selected icon from the tool set and drops it into the workspace. The workspace of SAS Text Miner was constructed with a data icon for the animal data set that SAS provides in its Instructor’s Trainer Kit, as shown in Figure 24. Figure 25 shows the results of using SAS Text Miner, with individual plots for “role by frequency”, “number of documents by frequency”, “frequency by weight”, “attribute by frequency”, and a “number of documents by frequency” scatter plot. Figure 26 shows the “Concept Linking Figure” generated by SAS Text Miner using the SASPDF-SYNONYMS text file.
65.5.2 Megaputer PolyAnalyst
Previous work by the authors (Segall and Zhang, 2006) utilized Megaputer PolyAnalyst for data mining. The new release, PolyAnalyst version 6.0, includes text mining and, specifically, new features for text OLAP (on-line analytical processing) and taxonomy based categorization. As discussed in Megaputer Intelligence Inc (2007), taxonomy based classification is useful when dealing with large collections of unstructured documents, such as tracking the number of known issues in product repair notes and customer support letters. According to Megaputer Intelligence Inc (2007), PolyAnalyst “provides simple means for creating, importing, and managing taxonomies, and carries out automated categorization of text records against existing taxonomies.” Megaputer Intelligence Inc (2007) provides examples of applications for executives, customer support specialists, and analysts. According to Megaputer Intelligence Inc (2007), “executives are able to make better business decisions upon viewing a concise report on the distribution of tracked issues during the latest observation period.”
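The idea of categorizing text records against an existing taxonomy and then reporting the distribution of tracked issues can be illustrated with a minimal Python sketch. It is not PolyAnalyst's method, which the source does not describe; the taxonomy, its keywords, the function names, and the sample repair notes are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category name -> trigger keywords (illustrative only).
TAXONOMY = {
    "battery": {"battery", "charge", "charging"},
    "screen": {"screen", "display", "cracked"},
    "billing": {"invoice", "refund", "charged"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def categorize(record, taxonomy=TAXONOMY):
    """Return the sorted categories whose keywords appear in the record."""
    tokens = set(tokenize(record))
    return sorted(cat for cat, words in taxonomy.items() if tokens & words)

def issue_distribution(records, taxonomy=TAXONOMY):
    """Count how many records fall under each taxonomy category."""
    counts = Counter()
    for record in records:
        counts.update(categorize(record, taxonomy))
    return counts

# Made-up repair notes standing in for a collection of unstructured documents.
notes = [
    "Battery will not hold a charge after two hours.",
    "Customer reports a cracked screen and asks for a refund.",
    "Display flickers when charging.",
]
distribution = issue_distribution(notes)
print(distribution)
```

A record may match several categories, so the counts describe how often each issue is mentioned rather than a partition of the records.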
This chapter provides several screen shots of Megaputer PolyAnalyst version 6.0 for text mining. Figure 27 shows the text mining workspace of Megaputer PolyAnalyst, Figure 28 shows the “Suffix Tree Clustering” report for the text cluster (desk; front), and Figure 29 shows the “Link Term” report for hotel customer survey text. Megaputer PolyAnalyst can also provide drill-down text analysis and histogram plots of text analysis.
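A link report over survey text can be thought of as a list of terms that tend to appear together. The Python sketch below counts term pairs that co-occur within the same comment. This is a generic co-occurrence count offered as an assumption, not the algorithm behind the PolyAnalyst “Link Term” report, and the stop-word list and survey comments are invented for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stop-word list (an assumption, not PolyAnalyst's).
STOP_WORDS = {"the", "a", "was", "and", "at", "is", "very", "were"}

def terms(comment):
    """Return the set of distinct non-stop-word terms in one comment."""
    words = re.findall(r"[a-z]+", comment.lower())
    return {w for w in words if w not in STOP_WORDS}

def term_links(comments, min_count=2):
    """Count term pairs sharing a comment; keep pairs seen at least min_count times."""
    pair_counts = Counter()
    for comment in comments:
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(terms(comment)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Made-up hotel survey comments.
survey = [
    "The front desk staff was friendly.",
    "Front desk was slow at check in.",
    "The room was clean and the staff were friendly.",
]
links = term_links(survey)
print(links)
```

With these three comments, only the pairs (desk, front) and (friendly, staff) occur in two comments, which echoes the (desk; front) cluster mentioned above.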
Fig 65.27 Workspace for Text Mining in Megaputer PolyAnalyst
Fig 65.28 Clustering Results in Megaputer PolyAnalyst
Fig 65.29 Link Term Report using Text Analysis in Megaputer PolyAnalyst
65.6 Web Mining Software
Two selected software packages are reviewed and compared in terms of data preparation, data analysis, and results reporting (see Table 4). As shown in the table, Megaputer PolyAnalyst has the unique feature of a data and text mining tool integrated with web site data source input, while SPSS Clementine takes a linguistic approach rather than a statistics based approach. Table 4 summarizes the differences and similarities between the two packages.
Table 4 (partially recovered)
Data Analysis: Understand product and content affinities (link analysis); Predict user propensity to convert, buy, or churn; Keyword and Search Engine
Reporting: Support for multiple languages
Unique feature of Megaputer PolyAnalyst: Data and text mining tool integrated with web site data source input
Unique feature of SPSS Clementine: Linguistic approach rather than statistics based approach
65.6.1 Megaputer PolyAnalyst
Megaputer PolyAnalyst is an enterprise analytical system that integrates Web mining with data and text mining; it does not have a separate module for Web mining. Web pages or sites can be input directly to Megaputer PolyAnalyst as data source nodes. Megaputer PolyAnalyst has the standard data and text mining functionalities, such as Categorization, Clustering, Prediction, Link Analysis, Keyword and entity extraction, Pattern discovery, and Anomaly detection. These functional nodes can be connected directly to the web data source node to perform web mining analysis. The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis scenarios without loading data in the system, thus saving the analyst’s time. According to Megaputer Intelligence Inc (2007), whatever data sources are used, PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can load data from disparate data sources, including all popular databases, statistical, and spreadsheet systems. In addition, it can load collections of documents in html, doc, pdf and txt formats, as well as load data from an internet web source. PolyAnalyst offers visual “on-the-fly integration” and merging of data coming from disparate sources to create data marts for further analysis. It supports incremental data appending and referencing data sets in previously created PolyAnalyst projects.
Figures 30-32 are screen shots illustrating the application of Megaputer PolyAnalyst for web mining to available data sets. Figure 30 shows an expanded view of the PolyAnalyst workspace. Figure 31 shows a screen shot of PolyAnalyst using the website of Arkansas State University (ASU) as the web data source. Figure 32 shows a keyword extraction report from the undergraduate admission web page of the Arkansas State University (ASU) website.
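To make the notion of a keyword extraction report from a web page concrete, the following Python sketch strips the markup from an HTML string and ranks the remaining words by frequency. It uses only the standard library, works on an inline HTML snippet rather than a live site, and is a simple frequency ranking assumed for illustration; it does not reproduce PolyAnalyst's keyword extraction. The page content, stop-word list, and names such as `TextExtractor` and `extract_keywords` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

# Illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "to", "for", "and", "of", "a", "is", "your"}

def extract_keywords(html, top_n=3):
    """Return the top_n most frequent non-stop-words in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up page standing in for a web data source.
PAGE = """
<html><head><title>Undergraduate Admission</title>
<style>body { color: black; }</style></head>
<body><h1>Admission</h1>
<p>Apply for admission and submit your transcript.</p>
<p>Admission deadlines and transcript requirements.</p>
</body></html>
"""
keywords = extract_keywords(PAGE, top_n=2)
print(keywords)
```

In practice the HTML would come from a fetched page, and a fuller approach would also weight terms against a background corpus rather than rely on raw counts.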
Fig 65.30 PolyAnalyst workspace with Internet data source
Fig 65.31 PolyAnalyst using www.astate.edu as web data source
Fig 65.32 Keyword extraction report
Fig 65.33 SPSS Clementine workspace
Fig 65.34 Decision rules for determining clusters of web data
SPSS (2007) claims four key data mining capabilities: segmentation, sequence detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) indicates six Web analysis application modules within SPSS Clementine: search engine optimization, automated user and visit segmentation, Web site activity and user behavior analysis, home page activity, activity sequence analysis, and propensity analysis.
Unlike other platforms used for Web mining that provide only simple frequency counts (e.g., number of visits, ad hits, top pages, total purchase visits, and top click streams), SPSS Clementine (SPSS, 2007) provides more meaningful customer intelligence, such as likelihood to convert by individual visitor, likelihood to respond by individual prospect, content clusters by customer value, missed cross-sell opportunities, and event sequences by outcome. Figures 33-35 are screen shots illustrating the application of SPSS Clementine for web mining to available data sets. Figure 33 shows the SPSS Clementine workspace. Different user modes can be defined, including research mode, shopping mode, search mode, evaluation mode, and so on. Decision rules for determining clusters of web data are demonstrated in Figure 34. Figure 35 exhibits decision tree results with classifiers using different model types (e.g., CHAID, logistic, neural).
Fig 65.35 Decision tree results
65.7 Conclusion and Future Research
The conclusions of this research include the fact that each of the software packages selected has its own unique characteristics and properties that can be displayed when applied to the available data sets. As indicated, each software package has its own set of algorithm types that can be applied.
Comparing the five data mining software packages, BioDiscovery GeneSight focuses on cluster analysis and is able to provide a variety of data mining visualization charts and colors. BioDiscovery GeneSight has fewer data mining functions than the other four. SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner each employ the same algorithms, as illustrated in Table 1, except that SAS has a separate software package, SAS Text Miner, for text analysis. The regression results obtained using these software packages are comparable. The cluster analysis results for SAS Enterprise Miner, BioDiscovery GeneSight, and Megaputer PolyAnalyst are each unique in how they are represented.
In conclusion, SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner offer the greatest diversification of data mining algorithms.
This chapter has discussed commercial data mining software that is applicable to supercomputing for 3-D visualization and very large microarray databases. Specifically, it illustrated the applications of supercomputing for data visualization using two selected software packages, Avizo and JMP Genomics. Avizo is a general supercomputing software package and JMP Genomics is a special software package for genetic data. Supercomputing data mining for 3-D visualization with Avizo is applied to diverse applications such as the human skull for medical research, and the atomic structure that can be used for multipurpose applications such as chemical or nuclear. We have also presented, using JMP Genomics, the data distributions of condition, patient, frequencies,
has standalone Text Analyst software for text mining.
Regarding web mining software, PolyAnalyst can mine web data integrated within a data mining enterprise analytical system and provide visual tools such as link analysis of the critical terms of the text. SPSS Clementine can be used for graphical illustrations of customer web activities as well as for link analysis of different data categories such as campaign, age, gender, and income. The selection of appropriate web mining software should be based on both its available web mining technologies and the type of data to be encountered.
The future direction of the research is to investigate other data, text, web, and supercomputing mining software for analyzing various types of data and to compare the capabilities of these software packages with one another. This future research would also include the acquisition of other data sets to perform these new analyses and comparisons.
Acknowledgement The authors would like to acknowledge the support provided by a 2009 Summer Faculty Research Grant awarded to them by the College of Business of Arkansas State University, without whose program and support this work could not have been done. The authors also want to acknowledge each of the software manufacturers for their support of this research.
References
AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction, Software and Data, retrieved from http: //www.cs.rpi.edu/˜goebel/ss02/software-and-data.html
Ceccato, M., M Marin, K Mens, L Moonen, et al., (2006), Applying and combining three different aspect Mining Techniques, Software Quality Journal 14(3), 209-214
Chang, J and Lee, W (2006), Finding frequent itemsets over online data streams, Information and Software Technology 48(7), 606-619
Chou, C., Sinha, A and Zhao, H (2008), A text mining approach to Internet abuse detection, Information Systems and eBusiness Management 6(4), 419-440
Curry, C., Grossman, R., Locke, D., Vejcik, S., and Bugajski, J (2007), Detecting changes in large data sets of payment card data: A case study, KDD’07, August 12-15, San Jose, CA
Data Intelligence Group (1995), An overview of data mining at Dun & Bradstreet, DIG White Paper 95/01, retrieved from http://www.thearling.com.text/wp9501/wp9501.htm
Davi, A, Dominique Haughton, Nada Nasr, Gaurav Shah, et al (2005), A Review of Two Text-Mining Packages: SAS TextMining and WordStat, The American Statistician 59(1), 89-104
Davies, A (2007), Identification of spurious results generated via data mining using an Internet distributed supercomputer grant, Duquesne University Donahue School of Business, http://www.business.duq.edu/Research/details.asp?id=83
Deshmukah, A V (1997), Software review: ModelQuest Expert 1.0, ORMS Today, December 1997, retrieved from http://www.lionhrtpub.com/orms/orms-12-97/software-review.html
Ducatelle, F. (2006), Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html
Ganapathy, S., Ranganathan, C and Sankaranarayanan, B (2004), Visualization strategies and tools for enhancing customer relationship management, Communications of the ACM 47(11), 92-98
Grossman, R (2007), Data grids, data clouds and data webs: a survey of high performance and distributed data mining, HPC Workshop: Hardware and software for large-scale biological computing in the next decade, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html
Hearst, M A (2003), What is Data Mining?, http://www.ischool.berkeley.edu/˜hearstr/ text mining.html
IBM DB2 Intelligent Miner Visualization: Using the Intelligent Miner Visualizers Version 8.2 SH12, Second Edition, August 2004
Kim, S., E James Whitehead Jr and Yi Zhang, (2008), Classifying Software Changes: Clean or Buggy? IEEE Transactions on Software Engineering 34(2), 181-197
Lau, K., Lee, K and Ho, Y (2005), Text Mining for the Hotel Industry, Cornell Hotel and Restaurant Administration Quarterly 46(3), 344-363
Lazarevic A., Fiea T., & Obradovic, Z., (2006), A software system for spatial data analysis and modeling, retrieved from http://www.ist.temple.edu?˜zoran/papers/lazarevic00.pdf
Leung, Y F (2004), My microarray software comparison - Data mining software, September 2004, Chinese University of Hong Kong, retrieved from http://www.ihome.cuhk.edu.hk/˜b400559/arraysoft mining specific.html
Megaputer Intelligence Inc (2007), Data Mining, Text Mining, and Web Mining Software, http:///www.megaputer.com
Mesrobian, E., Muntz, R., Shek, E., Mechoso, C R., Farrara, J.D., Spahr, J.A., Stolorz, P. (1995), Real time data mining, management, and visualization of GCM output, IEEE Computer Society, v.81, http://dml.cs.ucla.edu/˜shek/publications/sc 94.ps.gz
Metz, C. (2003), Software: Text Mining, PC Magazine, July 1, http://www.pcmag.com/print article2/0,1217.a=43573,00.asp
National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health, NCBI tools for data mining, retrieved from http://www.ncbi.nlm,nih.gov/Tools/
Nayak, R (2008), Data Mining in Web Services Discovery and Monitoring, International Journal of Web Services Research 5(1), 63-82
Nisbet, R A (2006), Data mining tools: Which one is best for CRM? Part 3, DM Review, March 21, 2006, retrieved from http://www.dmreview.com/editorial/dmreview/ print action.cfm?articleId=1049954
Narrative Reports, Lecture Notes in Artificial Intelligence 3055, pages 217-228, Springer-Verlag, 2004
Sanchez, E (1996), Speedier: Penn researchers to link supercomputers to community problems, The Compass, v 43, n 4, p 14, September 17, http://www.upenn.edu/pennnews/ features/1996/091796/research
Sanchez, M., Moreno, M., Segrera, S. and Lopez, V (2008), Framework for the development of a personalised recommender system with integrated web-mining functionalities, International Journal of Computer Applications in Technology, 33(4), 312-327
SAS (2009), JMP Genomics 4.0 Product Brief, http://www.jmp.com/software/genomics /pdf/103112 jmpg4 prodbrief.pdf
Segall, R and Zhang, Q (2006), Data visualization and data mining of continuous numerical and discrete nominal-valued microarray databases for biotechnology, Kybernetes: International Journal of Systems and Cybernetics, 35(9/10), 1538-1566
Seigle, G (2002), CIA, FBI developing intelligence supercomputer, Global Security
Sekijima, M (2007), Application of HPC to the analysis of disease related protein and the design of novel proteins, HPC Workshop: “Hardware and software for large-scale biological computing in the next decade”, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html
SPSS (2009a), PASW Modeler 13: Overview Demo, http://www.spss.com/media/demos/ modeler/ demo-modeler-overview/index.htm
SPSS (2009b), PASW Modeler Auto Cluster and Cluster Viewer, http://www.spss.com/media/demos/modeler/demo-modeler-autocluster/index.htm
SPSS (2007), Web Mining for Clementine, http://www.spss.com/web mining for clementine, viewed 16 May 2007
StatSoft, Inc (2006), Electronic textbook, retrieved from http://www.statsoft.com/textbook/glosa.html
VSG Visualization Sciences Group (2009), Avizo The 3D visualization software for scientific and industrial data, http://www.vsg3d.com/vsg prod avizo overview.php
Wikipedia (2006), Supercomputers, Retrieved May 19, 2009 from BookRags.com: http://www.bookrags.com/wiki/Supercomputer
Wikipedia (2007), Web mining, http://en.wikipedia.org/wiki/Web mining
Woodfield, Terry (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS Institute, Inc., Cary, NC
Zhang, Q and Segall, R (2008), Web mining: a survey of current research, techniques, and software, International Journal of Information Technology & Decision Making, 7(4), 683-720
Weka-A Machine Learning Workbench for Data Mining
Eibe Frank1, Mark Hall1, Geoffrey Holmes1, Richard Kirkby1, Bernhard Pfahringer1, Ian H Witten1, and Len Trigg2
1 Department of Computer Science, University of Waikato, Hamilton, New Zealand
{eibe, mhall, geoff, rkirkby, bernhard, ihw}@cs.waikato.ac.nz
2 Reel Two, P O Box 1538, Hamilton, New Zealand
len@reeltwo.com
Summary The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.
Key words: machine learning software, Data Mining, data preprocessing, data visualization, extensible workbench
66.1 Introduction
Experience shows that no single machine learning method is appropriate for all possible learning problems. The universal learner is an idealistic fantasy. Real datasets vary, and to obtain accurate models the bias of the learning algorithm must match the structure of the domain. The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that users can quickly try out existing machine learning methods on new datasets in very flexible ways. It provides extensive support for the whole process of experimental Data Mining, including preparing the input data, evaluating learning schemes statistically, and visualizing both the input data and the result of learning. This has been accomplished by including a wide variety of algorithms for learning different types of concepts, as well as a wide range of preprocessing methods. This diverse and comprehensive set of tools can be invoked through a common interface, making it possible for users
O Maimon, L Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_66, © Springer Science+Business Media, LLC 2010