I-Hsien Ting and Hui-Ju Wu (Eds.)
Web Mining Applications in E-Commerce and E-Services
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
Vol 150 Roger Lee (Ed.)
Software Engineering Research, Management and
Applications, 2008
ISBN 978-3-540-70774-5
Vol 151 Tomasz G Smolinski, Mariofanna G Milanova
and Aboul-Ella Hassanien (Eds.)
Computational Intelligence in Biomedicine and Bioinformatics,
2008
ISBN 978-3-540-70776-9
Rough – Granular Computing in Knowledge Discovery and Data
Mining, 2008
ISBN 978-3-540-70800-1
Vol 153 Carlos Cotta and Jano van Hemert (Eds.)
Recent Advances in Evolutionary Computation for
Combinatorial Optimization, 2008
ISBN 978-3-540-70806-3
Vol 154 Oscar Castillo, Patricia Melin, Janusz Kacprzyk and
Witold Pedrycz (Eds.)
Soft Computing for Hybrid Intelligent Systems, 2008
ISBN 978-3-540-70811-7
Vol 155 Hamid R Tizhoosh and M Ventresca (Eds.)
Oppositional Concepts in Computational Intelligence, 2008
ISBN 978-3-540-70826-1
Vol 156 Dawn E Holmes and Lakhmi C Jain (Eds.)
Innovations in Bayesian Networks, 2008
ISBN 978-3-540-85065-6
Vol 157 Ying-ping Chen and Meng-Hiot Lim (Eds.)
Linkage in Evolutionary Computation, 2008
ISBN 978-3-540-85067-0
Vol 158 Marina Gavrilova (Ed.)
Generalized Voronoi Diagram: A Geometry-Based Approach to
Computational Intelligence, 2009
ISBN 978-3-540-85125-7
Vol 159 Dimitri Plemenos and Georgios Miaoulis (Eds.)
Artificial Intelligence Techniques for Computer Graphics, 2009
ISBN 978-3-540-85127-1
Vol 160 P Rajasekaran and Vasantha Kalyani David
Pattern Recognition using Neural and Functional Networks,
2009
ISBN 978-3-540-85129-5
Vol 161 Francisco Baptista Pereira and Jorge Tavares (Eds.)
Bio-inspired Algorithms for the Vehicle Routing Problem, 2009
Inhibitory Rules in Data Analysis, 2009
ISBN 978-3-540-85637-5
Vol 164 Nadia Nedjah, Luiza de Macedo Mourelle, Janusz Kacprzyk, Felipe M.G França
and Alberto Ferreira de Souza (Eds.)
Intelligent Text Categorization and Clustering, 2009
ISBN 978-3-540-85643-6
Vol 165 Djamel A Zighed, Shusaku Tsumoto, Zbigniew W Ras and Hakim Hacid (Eds.)
Mining Complex Data, 2009
ISBN 978-3-540-88066-0
Vol 166 Constantinos Koutsojannis and Spiros Sirmakessis (Eds.)
Tools and Applications with Artificial Intelligence, 2009
ISBN 978-3-540-88068-4
Vol 167 Ngoc Thanh Nguyen and Lakhmi C Jain (Eds.)
Intelligent Agents in the Evolution of Web and Applications, 2009
ISBN 978-3-540-88070-7
Vol 168 Andreas Tolk and Lakhmi C Jain (Eds.)
Complex Systems in Knowledge-based Environments: Theory, Models and Applications, 2009
ISBN 978-3-540-88074-5
Vol 169 Nadia Nedjah, Luiza de Macedo Mourelle and Janusz Kacprzyk (Eds.)
Innovative Applications in Data Mining, 2009
ISBN 978-3-540-88044-8
Vol 170 Lakhmi C Jain and Ngoc Thanh Nguyen (Eds.)
Knowledge Processing and Decision Making in Agent-Based Systems, 2009
ISBN 978-3-540-88048-6
Vol 171 Chi-Keong Goh, Yew-Soon Ong and Kay Chen Tan (Eds.)
Multi-Objective Memetic Algorithms, 2009
ISBN 978-3-540-88050-9
Vol 172 I-Hsien Ting and Hui-Ju Wu (Eds.)
Web Mining Applications in E-Commerce and E-Services, 2009
ISBN 978-3-540-88080-6
National University of Kaohsiung
No 700, Kaohsiung University Road
Kaohsiung City, 811
Taiwan
Email: iting@nuk.edu.tw
Dr Hui-Ju Wu
Institute of Human Resource Management
National Changhua University of Education
No.2, Shi-Da Road
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2008935505
© 2009 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt Ltd., Chennai, India.
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
Preface
Web mining has become a popular area of research, integrating the different research areas of data mining and the World Wide Web. According to the taxonomy of Web mining, there are three sub-fields of Web-mining research: Web usage mining, Web content mining and Web structure mining. These three research fields cover most content and activities on the Web. With the rapid growth of the World Wide Web, Web mining has become a hot topic and is now part of the mainstream of Web research, such as Web information systems and Web intelligence. Among all of the possible applications in Web research, e-commerce and e-services have been identified as important domains for Web-mining techniques. Web-mining techniques also play an important role in e-commerce and e-services, proving to be useful tools for understanding how e-commerce and e-service Web sites and services are used, enabling the provision of better services for customers and users. Thus, this book focuses upon Web-mining applications in e-commerce and e-services.
Some chapters in this book are extended from papers presented at WMEE 2008 (the 2nd International Workshop for E-commerce and E-services). In addition, we also invited well-known researchers in this area to contribute to this book. The chapters of this book are introduced as follows:

In chapter 1, Peter I. Hofgesang presents an introduction to online web usage mining and provides background information, followed by a comprehensive overview of the related work. In addition, he outlines the major, and as yet mostly unsolved, challenges in the field.

In chapter 2, Gulden Uchyigit presents an overview of some of the techniques, algorithms and methodologies, along with the challenges, of using semantic information in the representation of domain knowledge, user needs and the recommendation algorithms.
In chapter 3, Bettina Berendt and Daniel Trümper describe a novel method for analyzing large corpora. Using an ontology created with methods of global analysis, a corpus is divided into groups of documents sharing similar topics. The introduced local analysis allows the user to examine the relationships between documents in a more detailed way.

In chapter 4, Jean-Pierre Norguet et al. propose a method based on output page mining and present a solution to the need for summarized and conceptual audience metrics in Web analytics. The authors describe several methods for collecting the Web pages output by Web servers; aggregating the occurrences of taxonomy terms in these pages can provide audience metrics for the Web site topics.
In chapter 5, Leszek Borzemski presents empirical experience learnt from Web performance mining research, in particular the development of predictive models describing Web performance behavior from the perspective of end-users. The author evaluates Web performance from the perspective of Web clients; therefore, Web performance is considered in the sense of Web server-to-browser throughput or Web resource download speed.
In chapter 6, Ali Mroue and Jean Caussanel describe an approach for automatically finding the prototypic browsing behavior of web users. User access logs are examined in order to extract the most significant user navigation access patterns. Such an approach gives us an efficient way to better understand how users act, and leads us to improve the structure of websites for better navigation.

In chapter 7, Istvan K. Nagy and Csaba Gaspar-Papanek investigate the time spent on web pages (TSP) as a disregarded indicator of the quality of online content. The authors present the factors influencing the TSP measure and give a TSP data preprocessing methodology that eliminates the effects of these factors. In addition, the authors introduce the concepts of sequential browsing and revisitation to more exactly restore users' navigation patterns based on TSP and the restored browser stack.
In chapter 8, Yingzi Jin et al. describe an attempt to learn a ranking of companies from a social network that has been mined from the web. The authors conduct an experiment using the social network among 312 Japanese companies related to the electrical products industry to learn and predict the ranking of companies according to their market capitalization. This study specifically examines a new approach to using web information for advanced analysis by integrating multiple relations among named entities.
In chapter 9, Jun Shen and Shuai Yuan propose a modelling-based approach to design and develop a P2P-based service coordination system and its components. The peer profiles are described with the WSMO (Web Service Modelling Ontology) standard, mainly for quality of service and geographic features of the e-services, which would be invoked by various peers. To fully explore the usability of service categorization and mining, the authors implement an ontology-driven unified algorithm to select the most appropriate peers. The UOW-SWS prototype also shows that the enhanced peer coordination is more adaptive and effective in dynamic business processes.

In chapter 10, I-Hsien Ting and Hui-Ju Wu provide a study of the issues of using web mining techniques for on-line social network analysis. Techniques and concepts of web mining and social network analysis are introduced and reviewed in this chapter, together with a discussion of how to use web mining techniques for on-line social network analysis. Moreover, a process for using web mining for on-line social network analysis is proposed, which can be treated as a general process in this research area. Discussions of the challenges and future research are also included in this chapter.

In summary, this book's content sets out to highlight the trends in theory and practice which are likely to influence e-commerce and e-services practices in web mining research. Through applying Web-mining techniques to e-commerce and e-services, value is enhanced and the research fields of Web mining, e-commerce and e-services can be expanded.

I-Hsien Ting
Hui-Ju Wu
Semantics-Based Analysis and Navigation of Heterogeneous
Bettina Berendt, Daniel Trümper 45
Semantic Analysis of Web Site Audience by Integrating Web
Usage Mining and Web Content Mining
Jean-Pierre Norguet, Esteban Zimányi, Ralf Steinberger 65
Towards Web Performance Mining
Leszek Borzemski 81
Anticipate Site Browsing to Anticipate the Need
Ali Mroue, Jean Caussanel 103
User Behaviour Analysis Based on Time Spent on Web Pages
Istvan K Nagy, Csaba Gaspar-Papanek 117
Ranking Companies on the Web Using Social Network Mining
Yingzi Jin, Yutaka Matsuo, Mitsuru Ishizuka 137
Adaptive E-Services Selection in P2P-Based Workflow with
Multiple Property Specifications
Jun Shen, Shuai Yuan 153
Web Mining Techniques for On-Line Social Networks Analysis:
An Overview
I-Hsien Ting, Hui-Ju Wu 169
Author Index 181
Peter I. Hofgesang
VU University Amsterdam, Department of Computer Science
De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
hpi@few.vu.nl
Abstract. In recent years, web usage mining techniques have helped online service providers to enhance their services, and restructure and redesign their websites in line with the insights gained. The application of these techniques is essential in building intelligent, personalised online services. More recently, it has been recognised that the shift from traditional to online services – and so the growing numbers of online customers and the increasing traffic generated by them – brings new challenges to the field. Highly demanding real-world E-commerce and E-services applications, where the rapid, and possibly changing, large-volume data streams do not allow offline processing, motivate the development of new, highly efficient real-time web usage mining techniques. This chapter provides an introduction to online web usage mining and presents an overview of the latest developments. In addition, it outlines the major, and as yet mostly unsolved, challenges in the field.
Keywords: Online web usage mining, survey, incremental algorithms, data stream mining.
1 Introduction
In the case of traditional, "offline" web usage mining (WUM), usage and other user-related data are analysed and modelled offline. The mining process is not time-limited, the entire process typically takes days or weeks, and the entire data set is available upfront, prior to the analysis. Algorithms may perform several iterations on the entire data set and thus data instances can be read more than once. However, as the number of online users – and the traffic generated by them – greatly increases, these techniques become inapplicable. Services with more than a critical amount of user access traffic need to apply highly efficient, real-time processing techniques that are constrained both computationally and in terms of memory requirements. Real-time, or online, WUM techniques (as we refer to them throughout this chapter) that provide solutions to these problems have received great attention recently, both from academics and the industry.

Figure 1 provides a schematic overview of the online WUM process. User interactions with the web server are presented as a continuous flow of usage data; the data are pre-processed – including being filtered and sessionised – on-the-fly; models are incrementally updated when new data instances arrive and refreshed
I.-H Ting, H.-J Wu (Eds.): Web Mining Appl in E-Commerce & E-Services, SCI 172, pp 1–23.
Fig. 1. An overview of online WUM. User interactions with a web server are pre-processed continuously and fed into online WUM systems that process the data and update the models in real-time. The outputs of these models are used to, e.g., monitor user behaviour in real-time, to support online decision making, and to update personalised services on-the-fly.
models are applied, e.g. to update (personalised) websites, to instantly alert on detected changes in user behaviour, and to report on performance analysis or on results of monitoring user behaviour to support online decision making.
This book chapter is intended as an introduction to online WUM and it aims to provide an overview of the latest developments in the field; in this respect, it is – to the best of our knowledge – the first survey on the topic. The remainder of this chapter is organised as follows. In Section 2, we provide a brief general introduction to WUM and the new online challenges. We survey the literature related to online WUM in three sections (Sections 3, 4, and 5): Section 3 overviews the efficient and compact structures used in (or even developed for) online WUM; Section 4 overviews online algorithms for WUM; while Section 5 presents work related to real-time monitoring systems. The most important (open) challenges are described in Section 6. Finally, the last section provides a discussion.
2 Background
This section provides a background to traditional WUM; describes incremental learning to efficiently update WUM models in a single pass over the clickstream; and, finally, it motivates the need for highly efficient real-time, change-aware algorithms for high-volume, streaming web usage data through a description of web dynamics, characterising changing websites and usage data.

Web or application servers log all relevant information available on user–server interaction. These log data, also known as web user access or clickstream data,
can be used to explore, model, and predict user behaviour. WUM is the application of data mining techniques to perform these steps, to discover and analyse patterns automatically in (enriched) clickstream data. Its applications include customer profiling, personalisation of online services, product and content recommendations, and various other applications in E-commerce and web marketing. There are three major stages in the WUM process (see Figure 2): (I) data collection and pre-processing, (II) pattern discovery, and (III) pattern analysis (see, for example, [18, 51, 67]).

Web Usage Data Sources. The clickstream data contain information on each user click, such as the date and time of the clicks, the URI of visited web resources, and some sort of user identifier (IP, browser type and, in the case of authentication-required sites, login names). An example of (artificially designed) user access log data can be seen in Table 1.
In addition to server-side log data, some applications allow the installation of special software on the client side (see, for example, [3]) to collect various other information (e.g. scrolling activity, active window) and, in some cases, more reliable information (e.g. actual page view time). Web access information can be further enriched by, for example, user registration information, search queries, and geographic and demographic information.
Pre-processing. Raw log data need to be pre-processed: first, by filtering all irrelevant data and possible noise, then by identifying unique visitors, and by recovering
Fig. 2. An overview of the web usage mining process
Table 1. An example of user access log data entries

IP address  Time stamp           Request (URI)      Status  Size   User agent
1.2.3.4     2008-04-28 22:24:14  GET index.html     200     5054   MSIE+6.0
1.3.4.5     2008-04-28 22:24:51  GET index.html     200     5054   Mozilla/5.0
1.2.3.4     2008-04-28 22:25:04  GET content1.html  200     880    MSIE+6.0
1.2.3.4     2008-04-28 22:27:46  GET content2.html  200     23745  MSIE+6.0
1.3.4.5     2008-04-28 22:28:02  GET content5.html  200     6589   Mozilla/5.0
1.2.3.4     2008-04-29 08:18:43  GET index.html     200     5054   MSIE+6.0
1.2.3.4     2008-04-29 08:22:17  GET content2.html  200     23745  MSIE+6.0
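Entries shaped like those in Table 1 can be parsed into structured records before any mining step. The following is a minimal sketch that assumes the whitespace-separated field layout of the sample data above; a real server log format (e.g. the Apache combined format) would need a different parser, and the function name and record keys are illustrative.

```python
from datetime import datetime

def parse_entry(line):
    # Field layout assumed from Table 1:
    # IP, date, time, method, URI, status, size, user agent
    ip, date, time_, method, uri, status, size, agent = line.split()
    return {
        "ip": ip,
        "timestamp": datetime.strptime(f"{date} {time_}", "%Y-%m-%d %H:%M:%S"),
        "uri": uri,
        "status": int(status),
        "size": int(size),
        "agent": agent,
        # IP address + User agent together serve as a crude user identifier
        "user": (ip, agent),
    }

entry = parse_entry("1.2.3.4 2008-04-28 22:24:14 GET index.html 200 5054 MSIE+6.0")
print(entry["user"])  # ('1.2.3.4', 'MSIE+6.0')
```

The `user` field anticipates the user identification heuristic discussed below, which combines the IP address and User agent fields.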
user sessions¹. Due to browser and proxy server caching, some references are missing from the log entries; here we can use information about the site structure along with certain heuristics to recover original sessions (e.g. [19]). Different resources (typically, distinct web pages) on a website also need to be mapped to distinct indices. Page mapping itself is a challenging task. In the case of advanced websites with dynamically generated pages – as in the case of most E-commerce sites – URIs contain many application-specific parameters, and their mapping requires 1) a complete overview of the page generation logic, and 2) application-oriented decisions for determining page granularities. Pages can be mapped to predefined categories by content-based classification as well (e.g. [6]).

User identification based on the combination of the IP address and the User agent fields identifies two distinct users ("1.2.3.4, MSIE+6.0" and "1.3.4.5, Mozilla/5.0") in the sample entries (Table 1). If we take all visited URIs (Request field) for both users, ordered ascendingly by the Time stamp field, and then form user sessions by the time frame identification method (see [19]), using e.g. a 30-minute timeout, the individual entries would break into two separate sessions in the case of the first user and into a single session for the second. Having the visited pages mapped to distinct indices – e.g. by assigning integer numbers increasingly, starting from 1, to each unique page by its appearance, i.e. index.html→1, content1.html→2, content2.html→3, content5.html→4 – we can denote the two sessions of the first user as user1 s1: 1,2,3 and user1 s2: 1,3, and the session of the other user as user2 s1: 1,4. Data in this format, i.e. ordered sequences of page IDs, can directly be used in numerous WUM methods and can easily be transformed into e.g. histogram or binary vector representations for the application of others. For complete and detailed overviews on pre-processing web usage data, we refer the reader to [19, 51].
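The worked example above (time-frame identification with a 30-minute timeout, plus site-wide page-index mapping by order of first appearance) can be sketched as follows; the function and variable names are illustrative, not taken from the chapter.

```python
from datetime import datetime, timedelta

page_ids = {}  # URI -> integer index, assigned by order of first appearance

def page_id(uri):
    return page_ids.setdefault(uri, len(page_ids) + 1)

def sessionise(clicks, timeout=timedelta(minutes=30)):
    """Time-frame identification: `clicks` is one user's time-ordered list of
    (timestamp, uri); a gap longer than `timeout` starts a new session."""
    sessions, current, last = [], [], None
    for ts, uri in clicks:
        if last is not None and ts - last > timeout:
            sessions.append(current)
            current = []
        current.append(page_id(uri))
        last = ts
    if current:
        sessions.append(current)
    return sessions

t = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
user1 = [(t("2008-04-28 22:24:14"), "index.html"),
         (t("2008-04-28 22:25:04"), "content1.html"),
         (t("2008-04-28 22:27:46"), "content2.html"),
         (t("2008-04-29 08:18:43"), "index.html"),    # > 30 min gap: new session
         (t("2008-04-29 08:22:17"), "content2.html")]
print(sessionise(user1))  # [[1, 2, 3], [1, 3]]
```

Run on the first user of Table 1, this reproduces the two sessions denoted user1 s1: 1,2,3 and user1 s2: 1,3 above.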
WUM Techniques. There is a vast amount of related work on traditional WUM. The most popular research areas include frequent itemset and association rule mining, sequential pattern mining, classification, clustering, and personalisation. For an overview of these techniques and of related work, see, for example, Mobasher et al. [54], Eirinaki and Vazirgiannis [22], Anand and Mobasher [2], Pierrakos et al. [63], and Liu [51].

¹ A session is a timely ordered sequence of pages visited by a user during one visit.
Modelling Individual User Behaviour. Most related work processes user sessions without a distinction of individual origin, i.e. which session belongs to which user, either due to a lack of reliable user identification or because the application does not require it. For some applications, however, it is beneficial to process sessions with their individual origin preserved. Model maintenance for each individual poses necessary constraints on model sizes; real-world applications with (tens of) thousands of individuals require compact models.
In the case of "traditional" WUM, the complete training set is available prior to the analysis. In many real-world scenarios, however, it is more appropriate to assume that data instances arrive in a continuous fashion over time, and we need to process information as it flows in and to update our models accordingly. We can identify a task as an incremental learning task when the application does not allow for waiting and gathering all data instances, or when the data flow is potentially infinite – and we gain information by processing more data points.
We say that a learning algorithm is incremental (see, for example, [30]) if at each stage our current model is dependent only on the current data instance and the previous model. More formally, given the first i training instances (x1, ..., xi) in a data flow, the incremental algorithm builds models M0, M1, ..., Mi such that each Mj is dependent only on Mj−1 and xj, where M0 is an initial model and 1 ≤ j ≤ i. We can generally allow a batch of the last n instances to be processed, where n is relatively small compared with the size of the stream, instead of only the last instance.
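A toy illustration of this definition, assuming a deliberately simple "model" consisting of page visit counts and a running mean of session length: each update builds Mj from only Mj−1 and xj, without revisiting past instances. The class name and the choice of statistics are made up for illustration.

```python
class IncrementalUsageModel:
    """Each update() call derives the new model state from only the
    previous state and the incoming session -- no pass over old data."""

    def __init__(self):          # M0: the initial, empty model
        self.n = 0               # sessions seen so far
        self.page_counts = {}    # page id -> visit count
        self.mean_length = 0.0   # running mean session length

    def update(self, session):   # builds M_j from M_{j-1} and x_j
        self.n += 1
        for page in session:
            self.page_counts[page] = self.page_counts.get(page, 0) + 1
        # single-pass mean update: no stored history needed
        self.mean_length += (len(session) - self.mean_length) / self.n

model = IncrementalUsageModel()
for session in [[1, 2, 3], [1, 3], [1, 4]]:   # a stream of sessions
    model.update(session)
print(round(model.mean_length, 3))  # 2.333
```

Real incremental WUM models (e.g. the tree structures of Section 3) follow the same contract, just with richer state.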
Alternatively, even if we have an incremental learning task at hand, we may choose to execute the entire model building process again on the complete data set (that is, complete at that given time), but in many cases our algorithm is limited by calculation time and/or computational resources. Incremental learning tasks include many real-world problems and they are best solved by incremental algorithms. Continuous streams of mass or individual usage data from websites require continuous processing, and so incremental learning algorithms are required for online responses.
Note, however, that in the description above we assume that the underlying data generation process is constant over time and that we gain information by continuously updating our models with new data instances. This assumption is not realistic in most real-world data streams: the underlying data distribution is likely to change over time due to hidden or explicit influential factors. In the following sections, we first outline the major influential factors in online WUM and then describe data stream mining, an emergent new field of data mining that deals with massive dynamic data flows.
The web and, therefore, web content, structure and usage data are dynamic in nature – as are most real-world data sets. Websites are changed every day: pages are removed, new pages are added, new links are created or removed, and the contents of pages are also updated. At the same time, the behaviour of users may change as well. Related work on web dynamics is motivated mostly by search engine design and maintenance, and investigates the evolution of the web, tracking structural and content evolution of websites over time [25, 60]. In the following, we outline the most influential factors for the dynamic web.
Website Structural Changes. Most popular websites change rapidly. However, the type of probable changes may differ across domains. In the case of news portals and online retail shops, for instance, the general news and product categories change infrequently; however, if we would like to capture more detailed information, and we identify page granularity at the very article or product level, we face the daily emergence of new articles and product pages and the likely disappearance of many others at the same time. As a side-effect, new links are created and old ones are removed. Links may be maintained independently as well; for example, in the case of a website that uses automated cross-product recommendations, or provides personalised pages. The evolution of web graphs is the subject of the article by Desikan and Srivastava [21].
Changes in Page Content. In addition to structural changes of a website, which are due to maintenance, content is also likely to change over time. For some web pages this means minor updates, but for others it means radical alteration. For instance, Wikipedia articles evolve over time as more and more content is added and the information is smoothed; however, their general subjects mostly remain. Other pages may undergo drastic content modifications or be merged with other pages, both of which most probably lead to changes in page categories. Evolving web content has been the subject of numerous studies; see, for example, [33, 42, 52, 71].
The evolution of website structure and content over time raises many practical problems. How do we maintain the semantic links between old pages and their successors? We should still be able to identify a new home page with different content, and most likely a different URI, as the same home page with the same indices mapped to it, perhaps flagged as changed and the change quantified, if necessary. How do we synchronise an evolving website with user access data? Usage data always refer to the actual website, and so an earlier snapshot of the structure and its page mapping may be obsolete. Automatic maintenance of semantics and synchronisation of web structure, content, and access data is an area largely unexplored.
Changing User Behaviour. Changes in a dynamic website are, of course, reflected in user access patterns. However, the interests and behaviours of individuals are likely to change independently over time as well. An individual planning to buy a TV may browse through the available selection of an online retail shop for several days or weeks and abandon the topic completely for years after the purchase. In the case of an investment bank, we can imagine a client who fills up his portfolio over several days ("investing behaviour") and then only seldom checks their account ("account checking behaviour"), to get an overview, for an extended period of time before getting involved in transactions again. Additionally, behavioural patterns of users may re-occur over time (e.g. alternating account checking and investing behaviour), and seasonal effects (e.g. Christmas or birthday shopping) are also likely to influence user behaviour. Detecting behavioural changes is essential in triggering model updates; and identifying re-occurrence and seasonal patterns helps to apply knowledge gained in the past.
Incremental mining tasks require single-pass algorithms (Section 2.2). Data stream mining (DSM) [1, 4, 26, 39, 68] tasks impose further harsh constraints on the methods that solve them. The high-volume flow of data streams allows only a single-time processing of data instances, or at most a few passes using a relatively small buffer of recent data, and the mining process is limited both computationally and in terms of memory requirements. In many real-world applications, the underlying data distribution or the structure of the data stream changes over time, as described for WUM in the previous section. Such applications require DSM algorithms to account for changes and provide solutions to handle them. Temporal aspects of data mining are surveyed in Roddick and Spiliopoulou [64], without a focus on efficiency and DSM constraints. The work emphasises the necessity for many applications to incorporate temporal knowledge into the data mining process, so that temporal correlations in patterns can be explored. Such a temporal pattern may be, for example, that certain products are more likely to be purchased together during winter.
Concept Drift. In online supervised learning, concept drift [27, 72] refers to changes in the context underlying the target, or concept, variable. More generally, we refer to "concept" drift in unsupervised learning as well. A drifting concept deteriorates the model and, to recover it, we need to get rid of outdated information and base our model only on the most recent data instances that belong to the new concept. Applying a fixed-size moving window on the data stream and considering only the latest instances is a simple and widely used solution to this problem. However, in practice, we cannot assume that any fixed value of window size, however carefully selected, is able to capture a sufficient amount of "good" instances. A dynamic window size, adjusted to the changing context, is desirable, but it requires sophisticated algorithms to detect the points of change. An alternative to the sliding window approach is to exponentially discount old data instances and update models accordingly.
Incremental and stream mining algorithms need an online feed of pre-processed data. Although we did not find related work on real-time pre-processing of
clickstreams and related data sources, we assume traditional pre-processing methods to have straightforward extensions to perform all necessary steps, including filtering, user identification, and sessionisation, described in Section 2.1. To support online pre-processing, we further need to maintain look-up tables, including a user table with user identification and related user data, a page mapping table, and a table with filtering criteria, which holds, for example, an up-to-date list of robot patterns or pages to remove. The automatic maintenance of a page mapping consistent with both the website and the access data is a non-trivial task, as mentioned in Section 2.3.
3 Compact and Efficient Incremental Structures to Maintain Usage Data
To support real-time clickstream mining, we need to employ flexible and tive structures to maintain web usage data These structures need to be memoryefficient and compact, and they need to support efficient self-maintenance, i.e.insertion and deletion operators and updates in some applications We stress effi-ciency requirements especially for applications where individual representation isneeded How should web usage data be represented to meet these requirements?Much of the related work applies tree-like structures to maintain sequential
adap-or “market basket”2(MB) type data User sessions tend to have a low branchingproperty on the first few pages, i.e the variation of pages in session prefixes ismuch lower than in the suffixes This property reflects the hierarchy of websitesand that most users in general visit only a small set of popular pages In practice,this property assures compactness in prefix-tree-like representations where thesame prefixes of sessions share the same branches in the tree In addition, treesare inherently easy to maintain incrementally A simplest, generic prefix-treeconsists of a root node, which may contain some general information aboutthe tree (e.g sum frequency) and references to its children nodes Every othernode in the tree contains fields with local information about the node (e.g pageidentification or node label, a frequency counter) and reference to its parent andchildren nodes An insertion operator, designed to assure that the same prefixesshare the same branches in the tree, turns this simple structure into a prefix-tree.This structure can be used to store sequences and, by applying some canonical(e.g lexicographic or frequency-based) order on items prior to insertion, MBtype data as well
Related work mostly extends this structure to suit specific applications. The majority of the following structures were originally proposed to maintain (frequent) itemsets, but they can be used to store sessions, preserving the ordering information. Whenever the application allows (e.g. user profiling) or requires (e.g. frequent itemset mining) the transformation of sessions to MB type, the loss of information results in highly compact trees, with the size reduced by large margins. This section focuses only on the structures – their original application is mostly ignored.

2 "Market basket" type data is common in E-commerce applications. With the ordering information disregarded, sessions turn into sets of items or "market basket" type data sets; cardinality of pages within single sessions or sets is often disregarded as well.
FP-Tree [34] was designed to facilitate frequent pattern mining. The structure includes a header table to easily access similar items in the tree; nodes are ordered by their frequency. It was designed for offline mining, using two scans over the whole data set, and therefore no operators for online maintenance are defined in the paper. Cheung and Zaïane [16] introduced an FP-Tree variant, CATS Tree, for incremental mining. In CATS Tree, sub-trees are optimised locally to improve compression, and nodes are sorted in descending order according to local frequencies. AFPIM [43] extends FP-Tree by enabling online mining and providing the necessary maintenance operations on the tree. However, if we apply a minimum threshold to the frequency, the algorithm would still need a complete scan of the data set in case of the emergence of "prefrequent" items not yet represented in the tree. FP-stream [29], another extension of FP-Tree, stores frequent patterns over tilted-time windows in an FP-Tree structure with tree nodes extended to embed information about the window.

CanTree [47] is a simple tree structure to store MB type data, all ordered by the same criteria prior to insertion. In this way, the order of insertion of sequences will not have any effect on the final structure. It is designed to support single-pass algorithms; however, it does not apply a minimum threshold either, which would require multiple scans. The authors extended this work in [46] and proposed DSTree to support frequent itemset mining in data streams.

Xie et al. [41] proposed FIET (frequent itemset enumeration tree), a structure for frequent itemset mining. Nodes represent frequent itemsets and have an active or inactive status to deal with potentially frequent itemsets. Rojas and Nasraoui [65] presented a prefix tree with efficient single-pass maintenance to summarize evolving data streams of transactional data. Along with the tree structure, an algorithm to construct and maintain prefix trees with dynamic ranking, i.e. with an ordering criterion that changes with time, was provided.
The structures mentioned so far were designed to store MB type data and thus, if applied with the original intention, they spoil the sequential information of sessions. The following structures were inherently designed to store sequences.

CST [32] is a simple generic prefix-tree for compact session representation. Chen et al. [14] used a simple prefix-tree for incremental sequential pattern mining. El-Sayed et al. [23] proposed FS-Tree, a frequent sequences tree structure, to store potentially frequent sequences. A simple tree structure is extended by a header table that stores information about frequent and potentially frequent sequences in the data, with a chain of pointers to sequences in the tree. A non-frequent links table stores information about non-frequent links to support incremental mining. In Li et al. [49], TKP-forest, a top-k path forest, is used to maintain essential information about the top-k path traversal patterns. A TKP-forest consists of a set of traversal pattern trees, where a tree is assigned to each character in the alphabet and contains sequences with their first element equal to this character. All possible suffixes of each incoming session are added
to the appropriate tree. Each tree maintains general statistics over its sequences, and the same items are linked together within trees to support efficient mining.

Although this is mostly not covered in the literature, we can assume that maintenance of data over a variable or fixed-size sliding window can be implemented easily by, for instance, maintaining a list of references for the last n sessions pointing to the last pages of the sessions in the tree. Sessions can easily be eliminated by following these pointers. Figures 3 and 4 present an example of tree evolution, based on the simple generic prefix-tree we described above, both for ordered session data (Figure 3) and for its MB type data representation (Figure 4), using the data in Table 2. The simple tree structure is extended by a list of references pointing to the last pages of sessions in the tree.

P.I. Hofgesang

Fig. 3. Simple prefix-tree representation of the original sessions with a list of references pointing to the last pages of sessions

Fig. 4. Simple prefix-tree representation of sessions transformed into ascendingly ordered MB-type data with a list of references pointing to the last pages of sessions

Table 2. Sample session data both in original and MB-type format

ID  Original Session  MB-type
s1  1 1 2 5 5 5       1 2 5
s2  1 2 2 9           1 2 9
s4  1 2               1 2
s5  1 2 3 3           1 2 3
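The reference-list deletion scheme described above might be implemented as follows (our own minimal sketch; the `insert`/`eliminate` helpers and the window size n are illustrative, not taken from any cited work):

```python
from collections import deque

class Node:
    def __init__(self, label=None, parent=None):
        self.label, self.parent = label, parent
        self.count, self.children = 0, {}

def insert(root, session):
    """Insert a session into the prefix-tree; return its last-page node."""
    node = root
    for page in session:
        if page not in node.children:
            node.children[page] = Node(page, node)
        node = node.children[page]
        node.count += 1
    return node

def eliminate(last_node):
    """Remove one session by walking its last-page reference upward,
    decrementing counters and pruning branches whose count reaches zero."""
    node = last_node
    while node.parent is not None:
        node.count -= 1
        if node.count == 0:
            del node.parent.children[node.label]
        node = node.parent

# Sliding window over the last n sessions via a list of last-page references.
root, n = Node(), 2
window = deque()
for session in ([1, 2, 5], [1, 2, 9], [1, 2]):
    window.append(insert(root, session))
    if len(window) > n:              # oldest session falls out of the window
        eliminate(window.popleft())
```

After the loop, the contribution of the oldest session [1, 2, 5] has been removed without scanning the tree: the reference leads directly to its last page.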
4 Online WUM Algorithms
This section provides an overview of online WUM algorithms grouped into four categories: frequent (top-k) items, itemsets and association rules mining; discovery of sequential patterns and sequential prediction; clustering; and web user profiling and personalisation. We have attempted to compile a comprehensive list of relevant papers; however, the list may not be complete. Most of the work relates to online WUM, but we also included general DSM methods where the application allows the processing of web usage data.
Extending traditional frequent itemsets and association rules mining methods to DSM environments has been widely studied recently, and it is one of the most popular fundamental research areas in DSM. Just as in traditional itemsets mining, the number of candidate itemsets is exponential, and even the result set is typically huge – and so in DSM we need to apply a minimal support threshold to rule out infrequent candidates. The greatest challenge in finding frequent patterns, and therefore frequent itemsets, in streaming data in an incremental fashion is that previously infrequent patterns may become frequent after new instances flow in and, similarly, previously frequent patterns may become infrequent. There is a vast amount of research proposed to solve this problem. Here we introduce only a few of the pioneer works and several more recent ones. Note that most of these techniques can be applied directly to any kind of MB type data sets, so we do not need to differentiate between WUM and general DSM techniques. Most of the following algorithms use some type of compact and efficient structure to maintain frequent and pre-frequent patterns over time.

Two-pass and "semi-" incremental algorithms. The candidate generation-and-test method of Apriori-like algorithms is efficient for searching among the otherwise exponential number of candidates, but it is not suitable for solving incremental or stream mining tasks. A number of algorithms apply a more efficient, although still not online, approach that scans the database once to find candidates, and to identify the actual frequent sets, with respect to a specified minimum support threshold, in a second scan. FP-Growth, proposed by Han et al. ([35]), requires two scans over the entire database to find frequent itemsets. Its efficient tree structure, FP-Tree, uses header links to connect the same items in the tree. [16, 43] extended this work. Cheung and Zaïane [16] introduced CATS Tree, an extension of FP-Tree with higher compression, with the FELINE algorithm. FELINE allows adjustment to minimal support, to aid interactive mining ("built once, mine many"). AFPIM, proposed by Koh and Shieh [43], stores both frequent and pre-frequent items in an extended FP-Tree. The tree is adjusted according to the inserted and deleted transactions; however, it needs a complete rescan over the database in case a newly emerged frequent item is not yet in the tree.
One-pass algorithms. There are two methods, in related work, to limit the frequent pattern structure size: some algorithms use double thresholds (e.g. [45]), and some apply pruning (e.g. [17]). Lee and Lee [45] applied double thresholds and an additional monitoring prefix-tree to maintain candidates. They evaluated their method both on real and synthetic web log data. Chi et al. [17] presented Moment, an algorithm to maintain all closed frequent itemsets in a sliding window. A closed enumeration tree, CET, is used to record all actual and potential closed frequent itemsets. estWin, by Chang and Lee [11], maintains a sliding window over the itemsets and stores all the currently significant ones in a monitoring tree. This tree is pruned over time to limit its size. Frequent itemsets are mined, upon user request, from the monitoring tree. Calders et al. [10] pointed out that the goodness of online mining methods for frequent itemsets depends highly on the correct parameter settings, i.e. on the size of the sliding window or on the decay factor, if applied. They proposed a max frequency measure of an itemset, which refers to the maximal frequency of the itemset over all possible windows on the stream. They show that, in practice, it is sufficient to calculate max frequencies over some specific points, called borders, and to maintain summary statistics over only these points in order to determine frequent itemsets.

The above papers focus on frequent itemsets mining and do not present methodology to maintain association rules. Although rules can be calculated based on the frequent itemsets, it is not straightforward to maintain them over time given the evolving itemsets and a user-defined confidence threshold.

Yet another, slightly similar, problem is to find supported (top-k) items over a data stream (e.g. the top 10 most-visited web pages). Cormode and Muthukrishnan [20] presented methods to maintain top-k items, and their approximate frequency, based on statistics over random samples, referred to as "group testing". Jin et al. [40] proposed two hash-based approaches, hCount and hCount*, to find a list of most frequent items over a data stream. Charikar et al. [12] presented a one-pass algorithm applied on a novel data structure (count sketch) to estimate the most frequent items using very limited storage space.
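To make the top-k problem concrete, the following sketch uses the standard Space-Saving algorithm – a well-known alternative to the cited group-testing and sketch-based methods, shown here only as an illustration, not as any of the cited algorithms:

```python
def space_saving(stream, k):
    """Approximate top-k item frequencies over a stream using at most k
    counters (the Space-Saving scheme). Estimates are upper bounds on the
    true counts; items that were never evicted are counted exactly."""
    counters = {}                       # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the minimum counter and let the new item inherit its
            # count (+1), which preserves the upper-bound property.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

clicks = ["home", "news", "home", "cart", "home", "news"]
top = space_saving(clicks, k=2)   # approximate top-2 pages with bounds
```

The memory footprint is fixed at k counters regardless of the stream length, which is exactly the kind of guarantee the one-pass methods above aim for.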
In the previous section, on frequent itemsets mining, we ignored the ordering information of user sessions. This section, however, presents methods to discover frequent sequences and sequential relationships. Essentially, the main problem in frequent sequential pattern mining is the same as described in the previous section: how to deal with patterns that become frequent or infrequent over time. Finding frequent sequences online may help to adapt websites in real-time based on the analysis of popular page traversals; and sequential page prediction models may form the basis of online page recommendation systems or page caching mechanisms.

Wang [70] used a dynamic suffix tree structure for incremental pattern updating. Parthasarathy et al. [61] presented ISM, an incremental sequence mining algorithm that maintains the frequent and potentially frequent sequences in a sequence lattice. Massaeglia et al. [53] proposed IseWum to maintain sequential web usage patterns incrementally. However, no guidelines for efficient implementation are provided; the algorithm, as described, needs multiple iterative scans over the entire database. The necessary number of iterations is a multiple of the length of the longest sequence.

Cheng et al. [15] proposed IncSpan to maintain sequential patterns in dynamically changing databases, solving the problem of inserting and appending records to a database – deletion of records is not discussed. The algorithm maintains a buffer of semi-frequent patterns as candidates and stores frequent ones in a sequential pattern tree. The efficiency of the algorithm is optimised through reverse pattern matching and shared projection. Chen et al. [14] argued that IncSpan and its improved variant IncSpan+ [59] fail to detect some potentially frequent sequences and thus, eventually, the method is prone to miss a portion of all frequent sequences. They proposed PBIncSpan to overcome the problem.
El-Sayed et al. [23] presented a tree structure (FS-tree) for frequent sequences. The tree is maintained incrementally; sequences are inserted or deleted based on changes in the database. In Li et al. [48], StreamPath was presented to mine the set of all frequent traversal patterns over a web-click stream in one scan. The authors extended this work in [49] to find the top-k traversal subsequence patterns.
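The suffix-insertion idea behind such traversal pattern structures can be illustrated with a single prefix tree (a simplified sketch of ours, not the cited TKP-forest or StreamPath): inserting every suffix of each session makes the count on any node path equal the number of times that consecutive page sequence was traversed.

```python
from collections import defaultdict

def make_node():
    """A tree node: an occurrence counter plus auto-created children."""
    return {"count": 0, "children": defaultdict(make_node)}

def add_session(root, session):
    """Insert every suffix of the session, so each node path counts the
    occurrences of that consecutive traversal pattern."""
    for start in range(len(session)):
        node = root
        for page in session[start:]:
            node = node["children"][page]
            node["count"] += 1

def pattern_count(root, pattern):
    """Number of times a consecutive page sequence was traversed."""
    node = root
    for page in pattern:
        if page not in node["children"]:
            return 0
        node = node["children"][page]
    return node["count"]

root = make_node()
add_session(root, [1, 2, 5])
add_session(root, [3, 2, 5])
# the consecutive traversal 2 -> 5 occurs in both sessions
```

The price of this simplicity is quadratic insertion work per session, which is why the cited structures add per-character trees, header links, and pruning.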
Yen et al. [75] presented IncWTP to mine web traversal patterns incrementally using an extended lattice structure. The size of the structure is limited by the website link structure: only connected pages are considered to be valid traversals. Gündüz-Öğüdücü and Tamer Özsu [32] presented an incremental web page recommendation model based on a compact tree structure (CST) and similarity-based clustering of user sessions. Li et al. [50] presented DSM-PLW, a projection-based, single-pass algorithm for online incremental mining of path traversal patterns over a continuous stream of maximal forward references using a Landmark Window. Laxman et al. [44] presented space- and time-efficient algorithms for frequency counting under the non-overlapped occurrences-based frequency for episodes.
Markov models are highly popular in offline sequential prediction tasks. Although we found no prior work, we can assume it is straightforward to extend traditional Markov-model-based techniques to online versions. The state transition probability matrix can be updated incrementally and, to keep it compact, state transitions can be represented using efficient tree or hash structures.
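Such an online Markov predictor might look as follows (our own sketch, with transition counts kept in a sparse hash-of-hashes rather than a full matrix):

```python
from collections import defaultdict

class OnlineMarkov:
    """First-order Markov page predictor with incrementally updated
    transition counts, stored sparsely as a hash of hashes."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, session):
        """Incorporate one finished session into the transition counts."""
        for cur, nxt in zip(session, session[1:]):
            self.counts[cur][nxt] += 1

    def predict(self, page):
        """Most likely next page after `page`, or None if page is unseen."""
        nxt = self.counts.get(page)
        return max(nxt, key=nxt.get) if nxt else None

    def prob(self, cur, nxt):
        """Maximum-likelihood transition probability estimate."""
        total = sum(self.counts[cur].values())
        return self.counts[cur][nxt] / total if total else 0.0

m = OnlineMarkov()
m.update([1, 2, 5])
m.update([1, 2, 9])
m.update([1, 2, 5])
```

Each session updates the model in time linear in its length, so prediction quality tracks the stream without any retraining step.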
Clustering partitions data instances into similarity groups, called clusters, such that members of the same cluster are similar, and members of different clusters are dissimilar. To determine the degree of similarity, clustering applies a similarity or distance measure on pairs of instances. Applications of web usage data clustering in E-commerce environments include market segmentation and web personalisation. In a stream mining environment, in addition to the constraints described in Section 2.4, the major challenge in clustering is to handle evolving clusters. New clusters may arise, old ones may disappear or merge, and instances – for example, in case clustered instances are individual users – may change cluster membership over time. Barbará [8] presents requirements for clustering data streams and overviews some of the latest algorithms in the literature.

Ester et al. [24] present an incremental density-based clustering algorithm, Incremental DBSCAN, one of the earliest incremental clustering methods. The relation between objects is defined by assumptions about object density in a given neighbourhood of the object. Effects of incremental updates, insertion and deletion of objects, are considered through their effect in changing these relations. Evaluation includes experiments on web access log data of a computer science department site.

Nasraoui et al. [58] presented TECNO-STREAMS, an immune-system-inspired single-pass method to cluster noisy data streams. The system continuously learns and adapts to new incoming patterns. In [56] the authors extended this work to track and validate evolving clusters and present a case study on the task of mining real evolving web clickstream data and on tracking evolving topic trends in textual stream data.

In Hofgesang [37] user profiles are maintained for each individual incrementally by means of a prefix-tree structure. Clustering of profiles is offline; the work assumes that clusters need to be updated only periodically, on demand. Wu et al. [74] propose a clustering model, to generate and maintain clusters mined from evolving clickstreams, based on dense regions discovery. However, the authors do not disclose details about cluster maintenance issues, and the evaluation, on real-world web usage data, does not cover the evolving aspects either.
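An elementary building block behind many such systems is the incremental centroid update: each new session vector is assigned to the nearest cluster and only that cluster's centre is adjusted, so old data never needs to be revisited. A minimal sketch over binary page-visit vectors (our own illustration, not any cited method):

```python
def assign_and_update(centroids, counts, x):
    """Assign session vector x to the nearest centroid (squared Euclidean
    distance) and move that centroid toward x with a running-mean step,
    without revisiting previously clustered sessions."""
    dists = [sum((c - v) ** 2 for c, v in zip(centroid, x))
             for centroid in centroids]
    j = dists.index(min(dists))
    counts[j] += 1
    eta = 1.0 / counts[j]                      # running-mean step size
    centroids[j] = [c + eta * (v - c) for c, v in zip(centroids[j], x)]
    return j

# Two clusters over a three-page site; vectors mark pages visited in a session.
centroids = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
counts = [1, 1]
assign_and_update(centroids, counts, [1.0, 1.0, 0.0])  # joins cluster 0
assign_and_update(centroids, counts, [0.0, 1.0, 1.0])  # joins cluster 1
```

This is the simplest possible scheme: it cannot create or remove clusters, which is precisely the evolving-cluster capability the methods above add.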
In Suryavanshi et al. [69] the authors extend their previous work, Relational Fuzzy Subtractive Clustering, and propose its incremental version, Incremental RFSC, for adaptive web usage profiling. They define a measure, the impact factor, which quantifies the necessity of reclustering. Their method thus updates clusters incrementally until the model deteriorates and needs a complete re-clustering of the data from scratch.
The following works, despite presenting offline methods, capture the changing environment by incorporating temporal aspects. In Nasraoui et al. [57] the authors present a framework, based on a robust evolutionary clustering approach, for mining, tracking, and validating evolving user profiles on dynamic websites. The session similarity measure for clustering is extended to incorporate website ontology, weighting pages based on their distance in the site hierarchy. MONIC, proposed by Spiliopoulou et al. [66], is a framework for monitoring cluster transitions. In the framework an offline clustering is applied periodically on an accumulating data set. Cluster transitions, such as the emergence and disappearance of clusters and the migration of members from one cluster to another, are tracked between two consecutive cluster sets.
The aim of web personalisation is to help users cope with the information load and to automatically filter relevant, new information. An adaptive, personalised website automatically filters new content according to user preference and adjusts its structure and presentation to improve usability. Personalisation is based on individual user or aggregate group profiles that capture individual or common interest and preference. For an overview of offline personalisation, see [2, 22].
Most current personalisation systems consist of an offline part, to discover user profiles, and an online part, to apply the profiles in real-time. This approach is not suitable in real-time dynamic environments with changing user preferences. In this scenario, user profiles also need to be updated online. User profiles can be based on virtually any of the online techniques presented in the previous sections to extract user-specific patterns, e.g. to maintain a list of the most popular pages or page sets of an individual over time. In the case of group personalisation or collaborative filtering, we may use online clustering to identify aggregate groups and to calculate a centroid or base profile for each of these groups.
Chen [13] presented a self-organising HCMAC neural network that can incrementally update user profiles based on explicit feedback on page relevance given by users browsing a website. The network needs initial training on an initial data set to build a starting model that is updated incrementally later on.

Godoy and Amandi [31] proposed a user profiling technique, based on a web document conceptual clustering algorithm, that supports incremental learning and profile adaptation. The personal agent, PersonalSearcher, adapts its behaviour to interest changes to assist users on the web. Furthermore, profiles can be presented in a readable description so that users can explore their profiles and verify their correctness.
Based on the user profiles, we can build personalised services to provide customised pages and adaptive websites. The notion of an adaptive website was proposed by Perkowitz and Etzioni [62] for websites that automatically improve their organisation and presentation based on user access patterns. Baraglia and Silvestri [7] introduced SUGGEST, which performs online user profiling, model updating, and recommendation building.

In an article by Nasraoui et al. [55], the authors presented two strategies, based on K-Nearest-Neighbors and TECNO-STREAMS (see Section 4.3), for collaborative filtering-based recommendations applied on dynamic, streaming web usage data. They described a methodology to test the adaptability of recommender systems in streaming environments.
5 Online Web Usage Mining Systems
While related work in the previous sections focuses mostly on single algorithms, here we present works that describe complete frameworks for online change detection and monitoring systems.
Baron and Spiliopoulou [9] presented PAM, a framework to monitor changes of a rule base over time. Although the methods are offline, i.e. pattern sets are identified in batches of the data between two consecutive time slices, tracking changes of usage patterns makes this work relevant to our survey. Patterns – association rules – are represented by a generic rule model that captures both statistics and temporal information of rules. Thus each rule is stored together with its timestamp and statistics, such as support, confidence and certainty factor. At each time slice, patterns are compared to the ones discovered in the previous batch: the same rules in the two sets are checked for significant changes in their statistics using a two-sided binomial test. In case a change is detected based on the current and the previous batches, it is labelled either as a short- or long-term change depending on the results of change detection in the following step, i.e. whether the changed value returns to its previous state in the next test or remains the same for at least one more period. Change detection in this form is local; it checks rules that coexist in consecutive pattern sets. To track rules throughout the whole period, several heuristics were given that analyse changes in the time series – formed of consecutive measurements for each rule on all data slices – e.g. to check pattern stability over time and label patterns as permanent, frequent, or temporary changes. The set of rules with changed statistics may be large, and to reduce its size the notion of atomic change was introduced. A rule with an atomic change contains no changed subpart itself. At each step only the set of rules with atomic changes is presented to the user. Experimental evaluation of PAM included analysis of 8 months of server-side access log data of a non-commercial website. The total set was sliced into monthly periods, which seems to be a reasonable setup, although no evaluation was presented of how the selection of window size affects the framework. Furthermore, the authors gave no guidelines to field experts on which heuristics to apply on a particular data set and how to interpret the results of the heuristics.
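The statistical core of this monitoring step – testing whether a rule's support differs significantly between two batches – can be sketched with an exact two-sided binomial test (our own illustration of the idea; PAM's exact formulation may differ):

```python
from math import comb

def binom_two_sided_p(k, n, p0):
    """Exact two-sided binomial p-value: the probability, under support p0,
    of any outcome whose point probability is at most that of observing
    k supporting sessions out of n."""
    pk = comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(n + 1)
               if comb(n, i) * p0 ** i * (1 - p0) ** (n - i) <= pk + 1e-12)

def support_changed(old_support, hits, batch_size, alpha=0.05):
    """Flag a rule whose support in the new batch (hits / batch_size)
    differs significantly from its support in the previous batch."""
    return binom_two_sided_p(hits, batch_size, old_support) < alpha

# A rule supported by 30% of sessions last month, but only 15 of 100 now:
changed = support_changed(0.30, 15, 100)
```

The same test applies unchanged to confidence or any other per-session proportion tracked for a rule.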
In their work, Ganti et al. [28] assumed the data to be kept in a large data warehouse and to be maintained by systematic block evolution, i.e. addition and deletion of blocks of data. They presented DEMON, a framework for mining and monitoring blocks of data in such dynamic environments. A new dimension, called the data span dimension, was introduced on the database, which allows the selection of a window of the w most recent data blocks for analysis. They also specified a selection constraint, the block selection predicate, which allows limiting the analysis to data blocks that satisfy certain criteria, e.g. to select blocks of data added on each Monday. They described three incremental algorithms, including two variants of frequent itemset mining algorithms and a clustering algorithm, on varying selections of data blocks. In addition, they proposed a generic algorithm that can be instantiated by additional incremental algorithms to facilitate their framework. Furthermore, to capture possible cyclic and seasonal effects, a similarity measure between blocks of data was defined.

The topology of a website represents the view of its designer. The actual site usage, which reflects how visitors actually use the site, can confirm the correctness of the site topology or can indicate paths of improvement. It is in the best interest of the site maintainer to match the topology and usage to facilitate efficient navigation on the site. Wu et al. [73] proposed a system to monitor and improve website connectivity online, based on the site topology and usage data. They defined two measures to quantify access efficiency on a website. They assumed that each user session consists of a set of target pages that the particular user wants to visit. The measures define efficiency based on the extra clicks a user has to perform to reach his target pages within a given web graph. These measures are monitored constantly over the incoming sessions and, in case their values drop below a certain threshold, redesign of the website topology is initiated. The redesign phase is facilitated by the access interest measure, which is designed to indicate whether an access pattern is popular but not efficient. Although the concept of target page sets is the basis of their methods, the authors simply assume that these targets can be identified using page view times and the website topology. Unfortunately, since this is a non-trivial task – these pages can only be approximated to a certain extent (e.g. [36]) and can never be completely identified – no guidelines are provided on how to identify these pages within user sessions.
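The extra-click idea can be made concrete with shortest paths over the site graph: efficiency compares the clicks a user actually spent with the fewest clicks needed to reach the targets (a simplified sketch of ours; the paper's exact measures differ):

```python
from collections import deque

def min_clicks(graph, start, target):
    """Fewest clicks from `start` to `target` in the site link graph
    (BFS shortest path); returns None if the target is unreachable."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if page == target:
            return depth
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

def extra_clicks(graph, session, targets):
    """Clicks actually spent in the session minus the fewest clicks needed
    to reach the farthest target from the session's entry page."""
    needed = max(min_clicks(graph, session[0], t) for t in targets)
    return (len(session) - 1) - needed

site = {"home": ["news", "cart"], "news": ["item"], "cart": ["item"]}
# A user who wanted "item" but wandered spends two avoidable clicks:
wasted = extra_clicks(site, ["home", "cart", "home", "news", "item"], {"item"})
```

Monitoring the average of such a measure over incoming sessions, and alerting when it drifts above a threshold, mirrors the redesign trigger described above.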
Hofgesang and Patist [38] provided a framework for online change detection in individual web user behaviour. They defined three goals – detecting changes in individual browsing behaviour, reporting on user actions that may need special care, and detecting changes in visitation frequency – and proposed space- and computationally efficient, real-time solutions. The first problem deals with detecting changes in the navigational patterns of users, i.e. the sets of visited pages of individuals. The solution to the second goal is an integral part of the solution to the first problem: it considers outlier patterns of the first goal and checks whether these patterns are "interesting" based on their "uniqueness" compared to patterns of other individual profiles. The third goal is to detect increased or decreased activity in real-time on individual activity data, i.e. the series formed by the number of sessions for an individual in a given time period (e.g. a day). Changes detected in navigation patterns can be used, e.g., to update personalised websites, while the solution to the second problem provides hints that
an individual may need online assistance. Detecting changes in user activity facilitates customer retention, e.g. decreasing user activity may forecast a defecting customer. If detected in time, a change can be used to take certain marketing actions to possibly retain the customer.

6 Challenges in Online WUM
This section summarises major challenges in online web usage mining to motivate research in the field. There is only a handful of works devoted completely to online web usage mining, and thus more research – to adapt and improve traditional web usage mining tools to meet the severe constraints of stream mining applications and to develop novel online web usage mining algorithms – is much needed. In particular, the most challenging and largely unexplored aspects are:

• Change detection. Characteristics of real-world data, collected over an extended period of time, are likely to change (see Sections 2.3 and 2.4). To account for these changes and to trigger proper actions (e.g. to update models, or send alerts), algorithms for change detection need to be developed.
• Compact models. Many applications require a single model or a single profile maintained for each individual (e.g. individual personalisation and direct marketing). In the case of most E-commerce applications this would lead to the maintenance of (tens or hundreds of) thousands of individual models (see Section 2.1), and therefore efficient, compact representations are required.
• Maintenance of page mapping. [5] shows that in commercial websites over 40% of the content changes each day. How can we maintain consistency between page mapping and usage data (see Section 2.3)? How should we interpret previously discovered patterns that refer to outdated web content in a changed environment? Automated solutions to maintain consistent mappings are required.
• New types of websites. Numerous practical problems arise with the growing number of AJAX and Flash based applications. In the case of Flash, the content is downloaded at once and user interaction is limited to the client side, and thus not tracked by the server. AJAX based web applications refresh only parts of the content. How can we collect complete usage data in these environments, and how do we identify web pages? Ad hoc solutions exist to tackle these problems, but automated solutions, capturing the intentions of website designers, would be highly desirable.
• Public data sets. The lack of publicly available web usage data sets sets back research on online web usage mining. Data sets collected over an extensive amount of time, possibly reflecting web dynamics and user behavioural changes, carefully processed and well documented, with clear individual identification, would greatly facilitate research.

7 Discussion
This work presented an introduction to online web usage mining. It described the problem and provided background information, followed by a comprehensive overview of the related work. As in traditional web usage mining, the most popular research areas in online web usage mining are frequent pattern mining (frequent itemsets and frequent sequential patterns), clustering, and user profiling and personalisation. We motivated research in online web usage mining through the identification of major, and yet mostly unsolved, challenges in the field. Applications of online WUM techniques include many real-world E-commerce scenarios, such as real-time user behaviour monitoring, support of on-the-fly decision making, and real-time personalisation that supports adaptive websites.
References

3. Atterer, R., Wnuk, M., Schmidt, A.: Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction. In: WWW 2006: Proceedings of the 15th International Conference on World Wide Web, pp. 203–212. ACM, New York (2006)
4. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: PODS 2002: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 1–16. ACM, New York (2002)
5. Baldi, P., Frasconi, P., Smyth, P.: Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, Chichester (2003)
6. Balog, K., Hofgesang, P.I., Kowalczyk, W.: Modeling navigation patterns of visitors of unstructured websites. In: AI-2005: Proceedings of the 25th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 116–129. Springer SBM, Heidelberg (2005)
7. Baraglia, R., Silvestri, F.: Dynamic personalization of web sites without user intervention. Commun. ACM 50(2), 63–67 (2007)
8. Barbará, D.: Requirements for clustering data streams. SIGKDD Explor. Newsl. 3(2), 23–27 (2002)
9. Baron, S., Spiliopoulou, M.: Monitoring the evolution of web usage patterns. In: Berendt, B., Hotho, A., Mladenič, D., van Someren, M., Spiliopoulou, M., Stumme, G. (eds.) EWMF 2003. LNCS (LNAI), vol. 3209, pp. 181–200. Springer, Heidelberg (2004)
10. Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: Perner, P. (ed.) ICDM 2007, pp. 83–92. IEEE Computer Society, Los Alamitos (2007)
11. Chang, J.H., Lee, W.S.: estWin: Online data stream mining of recent frequent itemsets by sliding window method. J. Inf. Sci. 31(2), 76–90 (2005)
12. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002)
13 Chen, C.-M.: Incremental personalized web page mining utilizing self-organizingHCMAC neural network Web Intelli and Agent Sys 2(1), 21–38 (2004)
14 Chen, Y., Guo, J., Wang, Y., Xiong, Y., Zhu, Y.: Incremental mining of sequentialpatterns using prefix tree In: Zhou, Z.-H., Li, H., Yang, Q (eds.) PAKDD 2007.LNCS (LNAI), vol 4426, pp 433–440 Springer, Heidelberg (2007)
15 Cheng, H., Yan, X., Han, J.: IncSpan: incremental mining of sequential patterns
in large database In: KDD 2004: Proceedings of the 2004 ACM SIGKDD national conference on Knowledge discovery and data mining, pp 527–532 ACMPress, New York (2004)
inter-16 Cheung, W., Za¨ıane, O.R.: Incremental mining of frequent patterns without date generation or support constraint In: IDEAS 2003: 7th International DatabaseEngineering and Applications Symposium, pp 111–116 IEEE Computer Society,Los Alamitos (2003)
candi-17 Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Moment: Maintaining closed frequentitemsets over a stream sliding window In: ICDM 2004, pp 59–66 IEEE ComputerSociety, Los Alamitos (2004)
Trang 2720 P.I Hofgesang
18 Cooley, R., Mobasher, B., Srivastava, J.: Web mining: Information and pattern covery on the world wide web In: ICTAI 1997: Proceedings of the 9th InternationalConference on Tools with Artificial Intelligence, pp 558–567 IEEE Computer So-ciety, Los Alamitos (1997)
dis-19 Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wideweb browsing patterns Knowledge and Information Systems 1(1), 5–32 (1999)
20 Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most quent items dynamically ACM Trans Database Syst 30(1), 249–278 (2005)
fre-21 Desikan, P., Srivastava, J.: Mining temporally evolving graphs In: Mobasher, B.,Liu, B., Masand, B., Nasraoui, O (eds.) WebKDD 2004: Webmining and WebUsage Analysis (2004)
22 Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization ACM Trans.Inter Tech 3(1), 1–27 (2003)
23 El-Sayed, M., Ruiz, C., Rundensteiner, E.A.: FS-Miner: efficient and incrementalmining of frequent sequence patterns in web logs In: WIDM 2004: Proceedings
of the 6th annual ACM international workshop on Web information and datamanagement, pp 128–135 ACM Press, New York (2004)
24 Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., Xu, X.: Incremental clusteringfor mining in a data warehousing environment In: Gupta, A., Shmueli, O., Widom,
J (eds.) VLDB 1998: Proceedings of 24rd International Conference on Very LargeData Bases, pp 323–333 Morgan Kaufmann, San Francisco (1998)
25 Fetterly, D., Manasse, M., Najork, M., Wiener, J.L.: A large-scale study of theevolution of web pages Softw Pract Exper 34(2), 213–237 (2004)
26 Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review.SIGMOD Rec 34(2), 18–26 (2005)
27 Gama, J., Castillo, G.: Learning with local drift detection In: Li, X., Za¨ıane, O.R.,
Li, Z (eds.) ADMA 2006 LNCS (LNAI), vol 4093, pp 42–55 Springer, Heidelberg(2006)
28 Ganti, V., Gehrke, J., Ramakrishnan, R.: DEMON: Mining and monitoring ing data Knowledge and Data Engineering 13(1), 50–63 (2001)
evolv-29 Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.: Mining Frequent Patterns in DataStreams at Multiple Time Granularities In: Kargupta, H., Joshi, A., Sivakumar,K., Yesha, Y (eds.) Next Generation Data Mining AAAI/MIT (2003)
30 Giraud-Carrier, C.: A note on the utility of incremental learning AI tions 13(4), 215–223 (2000)
Communica-31 Godoy, D., Amandi, A.: User profiling for web page filtering IEEE Internet puting 9(04), 56–64 (2005)
Com-32 G¨und¨uz- ¨Og¨ud¨uc¨u, S., ¨Ozsu, M.T.: Incremental click-stream tree model: Learningfrom new users for web page prediction Distributed and Parallel Databases 19(1),5–27 (2006)
33 Han, J., Han, D., Lin, C., Zeng, H.-J., Chen, Z., Yu, Y.: Homepage live: automaticblock tracing for web personalization In: WWW 2007: Proceedings of the 16thInternational Conference on World Wide Web, pp 1–10 ACM, New York (2007)
34 Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation.In: Chen, W., Naughton, J.F., Bernstein, P.A (eds.) Proceedings of the 2000 ACMSIGMOD International Conference on Management of Data, Dallas, Texas, USA,May 16-18, pp 1–12 ACM, New York (2000)
35 Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidategeneration: A frequent-pattern tree approach Data Min Knowl Discov 8(1),53–87 (2004)
Trang 2836 Hofgesang, P.I.: Methodology for preprocessing and evaluating the time spent onweb pages In: WI 2006: Proceedings of the 2006 IEEE/WIC/ACM InternationalConference on Web Intelligence, pp 218–225 IEEE Computer Society, Los Alami-tos (2006)
37 Hofgesang, P.I.: Web personalisation through incremental individual profilingand support-based user segmentation In: WI 2007: Proceedings of the 2007IEEE/WIC/ACM International Conference on Web Intelligence, pp 213–220.IEEE Computer Society, Washington (2007)
38 Hofgesang, P.I., Patist, J.P.: Online change detection in individual web user haviour In: WWW 2008: Proceedings of the 17th International Conference onWorld Wide Web, pp 1157–1158 ACM, New York (2008)
be-39 Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams In:Proceedings of the Seventh ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, pp 97–106 ACM Press, New York (2001)
40 Jin, C., Qian, W., Sha, C., Yu, J.X., Zhou, A.: Dynamically maintaining frequentitems over a data stream In: CIKM 2003: Proceedings of the twelfth internationalconference on Information and knowledge management, pp 287–294 ACM, NewYork (2003)
41 Xie, Z.-j., Chen, H., Li, C.: MFIS—mining frequent itemsets on data streams In:
Li, X., Za¨ıane, O.R., Li, Z (eds.) ADMA 2006 LNCS, vol 4093, pp 1085–1093.Springer, Heidelberg (2006)
42 Khoury, I., El-Mawas, R.M., El-Rawas, O., Mounayar, E.F., Artail, H.: An efficientweb page change detection system based on an optimized Hungarian algorithm.IEEE Trans Knowl Data Eng 19(5), 599–613 (2007)
43 Koh, J.-L., Shieh, S.-F.: An efficient approach for maintaining association rulesbased on adjusting FP-tree structures1 In: Lee, Y., Li, J., Whang, K.-Y., Lee, D.(eds.) DASFAA 2004 LNCS, vol 2973, pp 417–424 Springer, Heidelberg (2004)
44 Laxman, S., Sastry, P.S., Unnikrishnan, K.P.: A fast algorithm for finding frequentepisodes in event streams In: KDD 2007: Proceedings of the 13th ACM SIGKDDinternational conference on Knowledge discovery and data mining, pp 410–419.ACM, New York (2007)
45 Lee, D., Lee, W.: Finding maximal frequent itemsets over online data streamsadaptively In: ICDM 2005: Proceedings of the 5th IEEE International Conference
on Data Mining, pp 266–273 IEEE Computer Society, Los Alamitos (2005)
46 Leung, C.K.-S., Khan, Q.I.: DSTree: A tree structure for the mining of frequentsets from data streams In: Perner, P (ed.) ICDM 2006: Proceedings of the SixthInternational Conference on Data Mining, pp 928–932 IEEE Computer Society,Los Alamitos (2006)
47 Leung, C.K.-S., Khan, Q.I., Hoque, T.: CanTree: A tree structure for efficientincremental mining of frequent patterns In: ICDM 2005: Proceedings of the 5thIEEE International Conference on Data Mining, pp 274–281 IEEE ComputerSociety, Los Alamitos (2005)
48 Li, H.-F., Lee, S.-Y., Shan, M.-K.: On mining webclick streams for path traversalpatterns In: WWW Alt 2004: Proceedings of the 13th international World WideWeb conference on Alternate track papers & posters, pp 404–405 ACM, New York(2004)
49 Li, H.-F., Lee, S.-Y., Shan, M.-K.: DSM-TKP: Mining top-k path traversal patternsover web click-streams In: WI 2005: Proceedings of the 2005 IEEE/WIC/ACM In-ternational Conference on Web Intelligence, pp 326–329 IEEE Computer Society,Los Alamitos (2005)
Trang 2922 P.I Hofgesang
50 Li, H.-F., Lee, S.-Y., Shan, M.-K.: DSM-PLW: single-pass mining of path traversalpatterns over streaming web click-sequences Comput Netw 50(10), 1474–1487(2006)
51 Liu, B.: Web Data Mining Springer, Heidelberg (2007)
52 Liu, L., Pu, C., Tang, W.: WebCQ-detecting and delivering information changes
on the web In: CIKM 2000: Proceedings of the ninth international conference
on Information and knowledge management, pp 512–519 ACM Press, New York(2000)
53 Masseglia, F., Poncelet, P., Teisseire, M.: Web usage mining: How to efficientlymanage new transactions and new clients In: Zighed, D.A., Komorowski, J.,
˙Zytkow, J.M (eds.) PKDD 2000 LNCS, vol 1910, pp 530–535 Springer, delberg (2000)
Hei-54 Mobasher, B., Dai, H., Luo, T., Sun, Y., Zhu, J.: Integrating web usage and contentmining for more effective personalization In: Bauknecht, K., Madria, S.K., Pernul,
G (eds.) EC-Web 2000 LNCS, vol 1875, pp 165–176 Springer, Heidelberg (2000)
55 Nasraoui, O., Cerwinske, J., Rojas, C., Gonz´alez, F.A.: Performance of mendation systems in dynamic streaming environments In: SDM 2007 SIAM,Philadelphia (2007)
recom-56 Nasraoui, O., Rojas, C., Cardona, C.: A framework for mining evolving trends inweb data streams using dynamic learning and retrospective validation ComputerNetworks 50(10), 1488–1512 (2006)
57 Nasraoui, O., Soliman, M., Saka, E., Badia, A., Germain, R.: A web usage miningframework for mining evolving user profiles in dynamic web sites IEEE Trans.Knowl Data Eng 20(2), 202–215 (2008)
58 Nasraoui, O., Uribe, C.C., Coronel, C.R., Gonz´alez, F.A.: TECNO-STREAMS:Tracking evolving clusters in noisy data streams with a scalable immune systemlearning model In: ICDM 2003: Proceedings of the 3rd IEEE International Confer-ence on Data Mining, pp 235–242 IEEE Computer Society, Los Alamitos (2003)
59 Nguyen, S.N., Sun, X., Orlowska, M.E.: Improvements of incSpan: Incrementalmining of sequential patterns in large database In: Ho, T.-B., Cheung, D., Liu, H.(eds.) PAKDD 2005 LNCS, vol 3518, pp 442–451 Springer, Heidelberg (2005)
60 Ntoulas, A., Cho, J., Olston, C.: What’s new on the web?: the evolution of theweb from a search engine perspective In: WWW 2004: Proceedings of the 13thinternational conference on World Wide Web, pp 1–12 ACM, New York (2004)
61 Parthasarathy, S., Zaki, M.J., Ogihara, M., Dwarkadas, S.: Incremental and active sequence mining In: CIKM 1999: Proceedings of the eighth internationalconference on Information and knowledge management, pp 251–258 ACM Press,New York (1999)
inter-62 Perkowitz, M., Etzioni, O.: Adaptive web sites: automatically synthesizing webpages In: AAAI 1998/IAAI 1998: Proceedings of the fifteenth national/tenth con-ference on Artificial intelligence/Innovative applications of artificial intelligence,
pp 727–732 American Association for Artificial Intelligence, Menlo Park (1998)
63 Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web usagemining as a tool for personalization: A survey User Modeling and User-AdaptedInteraction 13(4), 311–372 (2003)
64 Roddick, J.F., Spiliopoulou, M.: A survey of temporal knowledge discoveryparadigms and methods IEEE Transactions on Knowledge and Data Engineer-ing 14(4), 750–767 (2002)
Trang 3065 Rojas, C., Nasraoui, O.: Summarizing evolving data streams using dynamic prefixtrees In: WI 2007: Proceedings of the IEEE/WIC/ACM International Conference
on Web Intelligence, pp 221–227 IEEE Computer Society, Washington (2007)
66 Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R.: MONIC: modeling andmonitoring cluster transitions In: Proceedings of the Twelfth ACM SIGKDD In-ternational Conference on Knowledge Discovery and Data Mining, pp 706–711.ACM, New York (2006)
67 Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N.: Web usage mining: Discoveryand applications of usage patterns from web data SIGKDD Explorations 1(2), 12–
Nas-70 Wang, K.: Discovering patterns from large and dynamic sequential data J Intell.Inf Syst 9(1), 33–56 (1997)
71 Weinreich, H., Obendorf, H., Herder, E., Mayer, M.: Not quite the average: Anempirical study of web use ACM Trans Web 2(1), 1–31 (2008)
72 Widmer, G., Kubat, M.: Learning in the presence of concept drift and hiddencontexts Machine Learning 23(1), 69–101 (1996)
73 Wu, E.H., Ng, M.K., Huang, J.Z.: On improving website connectivity by usingweb-log data streams In: Lee, Y., Li, J., Whang, K.-Y., Lee, D (eds.) DASFAA
2004 LNCS, vol 2973, pp 352–364 Springer, Heidelberg (2004)
74 Wu, E.H., Ng, M.K., Yip, A.M., Chan, T.F.: A clustering model for mining ing web user patterns in data stream environment In: Yang, Z.R., Yin, H., Ever-son, R.M (eds.) IDEAL 2004 LNCS, vol 3177, pp 565–571 Springer, Heidelberg(2004)
evolv-75 Yen, S.-J., Lee, Y.-S., Hsieh, M.-C.: An efficient incremental algorithm for miningweb traversal patterns In: ICEBE 2005: Proceedings of the IEEE InternationalConference on e-Business Engineering, pp 274–281 IEEE Computer Society, LosAlamitos (2005)
I.-H. Ting, H.-J. Wu (Eds.): Web Mining Appl. in E-Commerce & E-Services, SCI 172, pp. 25–43. springerlink.com © Springer-Verlag Berlin Heidelberg 2009
Gulden Uchyigit
Department of Computer Science and Mathematics, University of Brighton
unprecedented rate, making it very difficult for users to find interesting information. This situation is likely to worsen in the future unless end users have tools available to assist them. Web personalization is a research area which has received great attention in recent years. Web personalization aims to assist users with the information overload problem. One area of web personalization is the so-called recommender systems. Recommender systems make recommendations based on users' individual profiles. Traditionally, user profiles are keyword-based: they work on the premise that items which match certain keywords found in the user's profile will be of interest and relevance to the user, so those items are recommended to the user.
One of the problems with keyword-based profile representation methods is that a lot of useful information is lost during the pre-processing phase. To overcome this problem, eliciting and utilizing semantic information from the domain, rather than individual keywords, within all stages of the personalization process can enhance personalization.
This chapter presents a state-of-the-art survey of the techniques which can be used to semantically enhance the data processing, user modeling and recommendation stages of the personalization process.
1 Introduction
Personalization technologies have been a popular tool for assisting users with the information overload problem. As the number of services and the volume of content continue to grow, personalization technologies are more in demand than ever. Over the years they have been deployed in several different domains, including the entertainment domain and e-commerce.
In recent years, developments in extending the Web with semantic knowledge, in an attempt to gain a deeper insight into the meaning of the data being created, stored and exchanged, have taken the Web to a different level. This has led to the development of semantically rich descriptions to achieve improvements in the area of personalization technologies (Pretschner and Gauch, 2004).
Traditional approaches to personalization include the content-based method (Armstrong et al., 1995), (Balabanovic and Shoham, 1997), (Liberman, 1995), (Mladenic, 1996), (Pazzani and Billsus, 1997), (Lang, 1995). These systems generally infer a user's profile from the contents of the items the user has previously seen and rated. Incoming information is then compared with the user's profile, and those items which are similar to the user's profile are assumed to be of interest to the user and are recommended.
A traditional method for determining whether information matches a user's interests is keyword matching. If a user's interests are described by certain keywords, then the assumption is made that information containing those keywords should be relevant and of interest to the user. Such methods may match a lot of irrelevant information as well as relevant information, mainly because any item which matches the selected keywords will be assumed interesting regardless of its context. For instance, if the word learning exists in a paper about student learning (from the educational literature), then a paper on machine learning (from the artificial intelligence literature) will also be recommended. In order to overcome such problems, it is important to model the semantic meaning of the data in the domain. In recent years, ontologies have been very popular in achieving this.
Ontologies are formal explicit descriptions of concepts and their relationships within a domain. Ontology-based representations are richer, more precise and less ambiguous than ordinary keyword-based or item-based approaches (Middleton et al., 2002). For instance, they can overcome the problem of similar concepts by helping the system understand the relationship between the different concepts within the domain. For example, to find a job as a doctor, an ontology may suggest relevant related terms such as clinician and medicine. Utilizing such semantic information provides a more precise understanding of the application domain, and provides a better means to define the user's needs, preferences and activities with regard to the system, hence improving the personalization process.
2 Background
Web personalization is a popular technique for assisting with the complex process of information discovery on the World Wide Web. Web personalization is of importance both to the service provider and to the end user interacting with the web site. For the service provider, it is used to develop a better understanding of the needs of their customers, so as to improve the design of their web sites. For end users, web personalization is important because they are given customized assistance whilst they are interacting with a web site.
More recently, web usage mining has been used as the underlying approach to web personalization (Mobasher et al., 2004). The goal of web usage mining is to capture and model users' behavioral patterns as they interact with the web site, and to use this data during the personalization process. Web usage patterns reveal the web pages frequently accessed by users of the web site who are in search of a particular piece of information. Using such information, service providers can better understand which information their users are searching for, and how they can assist users during their search by improving the organization and structure of the web site.
Mobasher (Mobasher et al., 2004) classifies web personalization into three groups: manual decision rule systems, content-based recommender systems and collaborative-based recommender systems. Manual decision rule systems allow the web site administrator to specify rules based on user demographics or static profiles (collected through a registration process). Content-based recommender systems make use of user profiles and make recommendations based on these profiles. Collaborative-based recommender systems make use of user ratings and give recommendations based on how other users in the group have rated similar items.
2.1 Recommender Systems
Over the past decade, recommender systems have become very successful in assisting with the information overload problem. They have been very popular in applications including e-commerce, entertainment and the news domains. Recommender systems fall into three main categories: collaborative, content-based and hybrid. The distinction lies in the manner in which the recommendations are made: how the items are perceived by a community of users; how the content of each item compares with the user's individual profile; or a combination of both methods. Collaborative-based systems take in user ratings and make recommendations based on how other users in the group have rated similar items; content-based filtering systems make recommendations based on users' profiles; and hybrid systems combine both the content-based and collaborative-based techniques.
Content-based systems automatically infer the user's profile from the contents of the items the user has previously seen and rated. These profiles are then used as inputs to a classification algorithm along with the new, unseen items from the domain. Those items which are similar in content to the user's profile are assumed to be interesting and are recommended to the user.
A popular and extensively used document and profile representation method employed by many information filtering methods, including the content-based method, is the so-called vector space representation (Chen and Sycara, 1998), (Mladenic, 1996), (Lang, 1995), (Moukas, 1996), (Liberman, 1995), (Kamba and Koseki, 1997), (Armstrong et al., 1995). Content-based systems have their roots in text filtering, and many of their techniques originate there. The content-based recommendation method was developed based on the text filtering model described by Oard (1997). In (Oard, 1997), a generic information filtering model is described as having four components:
a method for representing the documents within the domain; a method for representing the user's information need; a method for making the comparison; and a method for utilizing the results of the comparison process. The vector space method (Baeza-Yates and Ribeiro-Neto, 1999) considers that each document (or profile) is described as a set of keywords. The text document is viewed as a vector in n-dimensional space, n being the number of different words in the document set. Such a representation is often referred to as bag-of-words, because of the loss of word ordering and text structure (see Figure 2). The tuple of weights associated with each word, reflecting the significance of that word for a given document, gives the document's position in the vector space. The weights are related to the number of occurrences of each word within the document. The word weights in the vector space method are ultimately used to compute the degree of similarity between two feature vectors. This method can be used to decide whether a document, represented as a weighted feature vector, and a profile are similar. If they are similar, then an assumption is made that the document is relevant to the user. The vector space model evaluates the similarity of the document d_j with regard to a profile p as the correlation between the vectors d_j and p. This correlation can be quantified by the cosine of the angle between these two vectors. That is,

sim(d_j, p) = cos(d_j, p) = (d_j · p) / (||d_j|| ||p||)
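As an illustration of the comparison just described, the following sketch (a minimal illustration with hypothetical texts; raw term counts stand in for the word weights) computes the cosine similarity between document vectors and a profile vector:

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())

def cosine_similarity(doc, profile):
    """sim(d, p) = (d . p) / (||d|| ||p||), computed over the shared vocabulary."""
    dot = sum(doc[w] * profile[w] for w in doc.keys() & profile.keys())
    norm_d = math.sqrt(sum(c * c for c in doc.values()))
    norm_p = math.sqrt(sum(c * c for c in profile.values()))
    if norm_d == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_d * norm_p)

# Hypothetical profile and documents, for illustration only.
profile = bag_of_words("machine learning artificial intelligence")
doc_a = bag_of_words("a survey of machine learning methods")
doc_b = bag_of_words("student learning in the classroom")

sim_a = cosine_similarity(doc_a, profile)  # shares "machine" and "learning"
sim_b = cosine_similarity(doc_b, profile)  # shares only "learning"
```

Under this scheme doc_a scores higher than doc_b against the profile, so it would be ranked as the more relevant document.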
Collaborative-based systems (Terveen et al., 1997), (Breese et al., 1998), (Konstan et al., 1997), (Balabanovic and Shoham, 1997) were proposed as an alternative to the content-based methods. The basic idea is to move beyond the experience of an individual user profile and instead draw on the experiences of a population or community of users. Collaborative-based systems (Herlocker et al., 1999), (Konstan et al., 1997), (Terveen et al., 1997), (Kautz et al., 1997), (Resnick and Varian, 1997) are built on the assumption that a good way to find interesting content is to find other people who have similar tastes, and recommend the items that those users like. Typically, each target user is associated with a set of nearest-neighbour users by comparing the profile information provided by the target user to the profiles of other users. These users then act as recommendation partners for the target user, and items that occur in their profiles can be recommended to the target user. In this way, items are recommended on the basis of user similarity rather than item similarity. Collaborative recommender systems have several shortcomings, one of which is that users will be recommended new items only if their ratings agree with those of other people within the community. Also, if a new item has not been rated by anyone in the community, it will not get recommended.
To overcome the problems posed by pure content-based and collaborative-based recommender systems, hybrid recommender systems have been proposed. Hybrid systems combine two or more recommendation techniques to overcome the shortcomings of each individual technique (Balabanovic, 1998), (Balabanovic and Shoham, 1997), (Burke, 2002), (Claypool et al., 1999). These systems generally use the content-based component to overcome the new-item start-up problem: if a new item is present, it can still be recommended regardless of whether it has been seen and rated. The collaborative component overcomes the problem of over-specialization, as is the case with pure content-based systems.
2.2 Content-Based Recommender Systems
Content-based recommender systems have been very popular over the past decade. They have mainly been employed in textual domains, and have their roots in information retrieval and text mining. Oard (Oard, 1997) presents a generic information filtering model that is described as having four components: a method for representing the documents within the domain; a method for representing the user's information need; a method for making the comparison; and a method for utilizing the results of the comparison process. Oard's model described text filtering as the process of automating the user's judgments of new textual documents, where the same representation methods are used both for the user profile and for the documents within the domain. The goal of the text filtering model is to automate the filtering process so that the results of the automated comparison process are equal to the user's judgment of the documents.
Content-based systems automatically infer the user's profile from the contents of the documents the user has previously seen and rated. These profiles are then used as input to a classification algorithm along with the new, unseen documents from the domain. Those documents which are similar in content to the user's profile are assumed to be interesting and are recommended to the user.
A popular and extensively used document and profile representation method employed by many information filtering methods, including the content-based method, is the so-called vector space representation (Chen and Sycara, 1998), (Mladenic, 1996), (Lang, 1995), (Moukas, 1996), (Liberman, 1995), (Kamba and Koseki, 1997), (Armstrong et al., 1995). The vector space method (Baeza-Yates and Ribeiro-Neto, 1999) considers that each document (or profile) is described as a set of keywords. The text document is viewed as a vector in n-dimensional space, n being the number of different words in the document set. Such a representation is often referred to as bag-of-words, because of the loss of word ordering and text structure. The tuple of weights associated with each word, reflecting the significance of that word for a given document, gives the document's position in the vector space. The weights are related to the number of occurrences of each word within the document. The word weights in the vector space method are ultimately used to compute the degree of similarity between two feature vectors. This method can be used to decide whether a document, represented as a weighted feature vector, and a profile are similar. If they are similar, then an assumption is made that the document is relevant to the user. The vector space model evaluates the similarity of the document d_j with regard to a profile p as the correlation between the vectors d_j and p, which can be quantified by the cosine of the angle between these two vectors.
2.3 Collaborative-Based Recommender Systems
Collaborative-based systems (Terveen et al., 1997), (Breese et al., 1998), (Konstan et al., 1997), (Balabanovic and Shoham, 1997) are an alternative to the content-based methods. The basic idea is to move beyond the experience of an individual user profile and instead draw on the experiences of a population or community of users. Collaborative-based systems (Herlocker et al., 1999), (Konstan et al., 1997), (Terveen et al., 1997), (Kautz et al., 1997), (Resnick and Varian, 1997) are built on the assumption that a good way to find interesting content is to find other people who have similar tastes, and recommend the items that those users like. Typically, each target user is associated with a set of nearest-neighbour users by comparing the profile information provided by the target user to the profiles of other users. These users then act as recommendation partners for the target user, and items that occur in their profiles can be recommended to the target user. In this way, items are recommended on the basis of user similarity rather than item similarity. Content-based systems suffer from shortcomings in the way they select items for recommendation: items are recommended only if the user has seen and liked similar items in the past.
A user profile effectively delimits a region of the item space from which future recommendations will be drawn. Therefore, future recommendations will display limited diversity. This is particularly problematic for new users, since their recommendations will be based on a very limited set of items represented in their
immature profiles. Items relevant to a user, but bearing little resemblance to the snapshot of items the user has looked at in the past, will never be recommended in the future. Collaborative filtering techniques try to overcome these shortcomings of content-based systems. However, collaborative filtering alone can prove ineffective for several reasons (Claypool et al., 1999). For instance, the early-rater problem arises when a prediction cannot be provided for a given item because it is new and has not yet been rated, and therefore cannot be recommended. The sparsity problem arises due to the sparse nature of the ratings within the rating matrices, making the recommendations inaccurate. The grey-sheep problem arises when there are individuals who do not benefit from the collaborative recommendations because their opinions do not consistently agree or disagree with those of other people in the community.
To overcome the problems posed by pure content-based and collaborative-based recommender systems, hybrid recommender systems have been proposed. Hybrid systems combine two or more recommendation techniques to overcome the shortcomings of each individual technique (Balabanovic, 1998), (Balabanovic and Shoham, 1997), (Burke, 2002), (Claypool et al., 1999). These systems generally use the content-based component to overcome the new-item start-up problem: if a new item is present, it can still be recommended regardless of whether it has been seen and rated. The collaborative component overcomes the problem of over-specialization, as is the case with pure content-based systems.
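One simple way to realize such a hybrid, sketched here under the assumption of a weighted score combination (one of several hybridization strategies; the scores and weight below are hypothetical illustrations, not values from the chapter):

```python
# A minimal sketch of a weighted hybrid recommender: the final score is a
# weighted combination of a content-based score and a collaborative score.

def hybrid_score(content_score, collab_score, n_ratings, alpha=0.5):
    """Blend the two scores; fall back to the content score alone for items
    with no community ratings (the new-item start-up problem)."""
    if n_ratings == 0:
        return content_score          # no collaborative evidence yet
    return alpha * content_score + (1 - alpha) * collab_score

# A brand-new item is still recommendable via its content score:
new_item = hybrid_score(content_score=0.9, collab_score=0.0, n_ratings=0)
# An established item blends both signals:
old_item = hybrid_score(content_score=0.4, collab_score=0.8, n_ratings=120)
```

The fallback branch mirrors the text above: the content-based component keeps new items recommendable, while the collaborative term diversifies recommendations for items the community has already rated.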
2.4 The Semantic Web
The semantic web is an extension of the current Web which aims to provide an easier way to find, share, reuse and combine information. It extends Web documents by adding new data and metadata to them; this extension is what enables Web documents to be processed automatically by machines. To do this, RDF (Resource Description Framework) is used to turn basic Web data into structured data. RDF works on Web pages and also inside applications; it builds on XML technology's capability to define customized tagging schemes, together with RDF's flexible approach to representing data. RDF is a general framework for describing a Web site's metadata, that is, the information about the information on the site. It provides interoperability between applications that exchange machine-understandable information on the Web.
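As a minimal illustration of RDF's subject-property-value structure, the following sketch parses a small, hypothetical RDF/XML fragment with Python's standard library and lists the triples it contains (a real application would use a dedicated RDF library; the resource URI and Dublin Core properties are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RDF/XML fragment: metadata (a Dublin Core title
# and creator) attached to a web resource.
rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>Example Page</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Each rdf:Description element names a subject; its child elements are
# (subject, property, value) triples -- the structured data RDF adds.
triples = []
root = ET.fromstring(rdf_xml)
for desc in root.findall(RDF + "Description"):
    subject = desc.get(RDF + "about")
    for prop in desc:
        triples.append((subject, prop.tag, prop.text))
```

The resulting triples are exactly the machine-readable statements ("this page has this title", "this page has this creator") that applications can exchange and interpret.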
RDF Schema (RDFS)
RDFS is used to create vocabularies that describe groups of related RDF resources and the relationships between those resources. An RDFS vocabulary defines the allowable properties that can be assigned to RDF resources within a given domain. RDFS also allows for the creation of classes of resources that share common properties. In an RDFS vocabulary, resources are defined as instances of classes. A class is a resource too, and any class can be a subclass of another. This hierarchical semantic information is what allows machines to determine the meanings of resources based on their properties and classes.
Web Ontology Language (OWL)
OWL is a W3C specification for creating Semantic Web applications. Building upon RDF and RDFS, OWL defines the types of relationships that can be expressed in RDF, using an XML vocabulary to indicate the hierarchies and relationships between different resources. In fact, this is the very definition of "ontology" in the context of the Semantic Web: a schema that formally defines the hierarchies and relationships between different resources. Semantic Web ontologies consist of a taxonomy and a set of inference rules from which machines can draw logical conclusions.
A taxonomy in this context is a system of classification, such as the scientific kingdom/phylum/class/order system for classifying plants and animals, that groups resources into classes and subclasses based on their relationships and shared properties.
Since taxonomies express the hierarchical relationships that exist between resources, we can use OWL to assign properties to classes of resources and allow their subclasses to inherit the same properties. OWL also makes use of XML Schema data types and supports class axioms such as subClassOf and disjointWith, and class descriptions such as unionOf and intersectionOf. Many other advanced concepts are included in OWL, making it the richest standard ontology description language available today.
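The property inheritance just described can be made concrete with a toy taxonomy in plain Python. This is not an OWL reasoner, only an illustration of how properties flow down a subclass chain; the class names and properties are invented for the example:

```python
# Toy taxonomy: each class maps to (parent, own properties).
# Class names and properties are illustrative only.
taxonomy = {
    "Animal": (None,     {"alive": True}),
    "Mammal": ("Animal", {"has_fur": True}),
    "Dog":    ("Mammal", {"barks": True}),
}

def properties(cls):
    """Collect the properties a class inherits along its subclass chain."""
    props = {}
    while cls is not None:
        parent, own = taxonomy[cls]
        for key, value in own.items():
            props.setdefault(key, value)  # nearer classes take precedence
        cls = parent
    return props
```

Asking for the properties of "Dog" yields its own properties plus those inherited from "Mammal" and "Animal", mirroring how an OWL reasoner propagates properties through subClassOf relations.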
3 Data Preparation: Ontology Learning, Extraction and Pre-processing
As previously described, personalization techniques such as the content-based method extensively employ the vector space representation. This data representation technique is popular because of its simplicity and efficiency. However, it has the disadvantage that much useful information is lost during the representation phase, since the sentence structure is broken down into individual words. To minimize this loss of information, it is important to retain the relationships between the words. One popular technique for doing this is to use conceptual hierarchies. In this section we present an overview of the existing techniques, algorithms and methodologies which have been employed for ontology learning.
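The vector space representation referred to above can be sketched as a simple bag-of-words with cosine similarity. This is a minimal illustration (no stemming, stop-word removal or TF-IDF weighting), and it makes visible the information loss discussed: word order and sentence structure are discarded entirely.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term-frequency vector: sentence structure is lost."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = bow("web mining for e commerce")
b = bow("web mining for e services")
```

Here the two sentences share four of five tokens, so their cosine similarity is high even though "commerce" and "services" carry the distinguishing meaning; conceptual hierarchies aim to recover such lost relationships.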
The main component of ontology learning is the construction of the concept hierarchy. Concept hierarchies are useful because they are an intuitive way to describe information (Lawrie and Croft, 2000). Generally, hierarchies are created manually by domain experts. This is a very cumbersome process requiring specialized knowledge, which necessitates tools for their automatic generation. Research into automatically constructing a hierarchy of concepts directly from data is extensive and includes work from a number of research areas, including machine learning, natural language processing and statistical analysis. One approach is to attempt to induce word categories directly from a corpus based on statistical co-occurrence (Evans et al., 1991), (Finch and Chater, 1994), (McMahon and Smith, 1996), (Nanas et al., 2003a). Another approach is to merge existing linguistic resources such as dictionaries and thesauri (Klavans et al., 1992), (Knight and Luk, 1994), or to tune a thesaurus (e.g. WordNet) using a corpus (Miller et al., 1990a). Other methods use natural language processing (NLP) techniques to extract phrases and keywords from text (Sanderson and Croft, 1999), or map concepts onto an already constructed hierarchy such as Yahoo!
The remaining parts of this section present the machine learning approaches and natural language processing approaches used for ontology learning.
3.1 Machine Learning Approaches
Learning ontologies from unstructured text is not an easy task. The system needs to automatically extract the concepts within the domain as well as the relationships between the discovered concepts. Machine learning approaches, in particular clustering techniques, rule-based techniques, fuzzy logic and formal concept analysis, have been very popular for this purpose. This section presents an overview of the machine learning approaches which have been popular for discovering ontologies from unstructured text.
3.1.1 Clustering Algorithms
Clustering algorithms are very popular in ontology learning. They function by grouping instances together based on their similarity. Clustering algorithms can be divided into hierarchical and non-hierarchical methods. Hierarchical methods construct a tree in which each node represents a subset of the input items (documents) and the root represents the entire item set. Hierarchical methods can in turn be divided into divisive and agglomerative methods. Divisive methods begin with the entire set of items and partition it until only individual items remain. Agglomerative methods work in the opposite way: each item starts as its own cluster, and clusters are merged until a single cluster remains. At the first step of a hierarchical agglomerative clustering (HAC) algorithm, when each instance represents its own cluster, the similarities between clusters are simply the similarities between the instances themselves; thereafter, a chosen linkage rule determines the similarity of the newly merged clusters to each other. Various rules can be applied depending on the data; some of these measures are described below:
Single-Link: In this method the similarity of two clusters is determined by the similarity of the two closest (most similar) instances in the different clusters. So for each pair of clusters S_i and S_j,

sim(S_i, S_j) = max{ cos(d_i, d_j) : d_i ∈ S_i, d_j ∈ S_j }   (2)
Complete-Link: In this method the similarity of two clusters is determined by the similarity of the two least similar instances of the two clusters. This approach performs well in cases where the data forms natural, distinct categories, since it tends to produce tight (cohesive) spherical clusters. It is calculated as:

sim(S_i, S_j) = min{ cos(d_i, d_j) : d_i ∈ S_i, d_j ∈ S_j }   (3)
Average-Link or Group Average: In this method, the similarity between two clusters is calculated as the average similarity between all pairs of instances in the two clusters, i.e. it is an intermediate solution between complete-link and single-link. It can be unweighted, or weighted by the size of the clusters. The weighted form is calculated as:

sim(S_i, S_j) = (1 / (n_i n_j)) ∑_{d_i ∈ S_i, d_j ∈ S_j} cos(d_i, d_j)   (4)

where n_i and n_j refer to the sizes of S_i and S_j respectively.
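The three linkage rules can be dropped into one naive agglomerative loop. The sketch below is illustrative rather than efficient (it recomputes all pairwise similarities each round); for brevity it clusters one-dimensional points with negative distance as the similarity function, instead of document vectors with cosine similarity:

```python
def hac(items, sim, linkage="single"):
    """Naive agglomerative clustering: repeatedly merge the two most
    similar clusters until one remains; returns the merge history
    (lists of item indices)."""
    clusters = [[i] for i in range(len(items))]

    def cluster_sim(a, b):
        sims = [sim(items[i], items[j]) for i in a for j in b]
        if linkage == "single":       # most similar pair, eq. (2)
            return max(sims)
        if linkage == "complete":     # least similar pair
            return min(sims)
        return sum(sims) / len(sims)  # group-average link

    merges = []
    while len(clusters) > 1:
        a, b = max(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append(sorted(a + b))
    return merges

# Toy data: two natural groups; closer numbers are more similar.
points = [1.0, 1.1, 5.0, 5.2]
history = hac(points, lambda x, y: -abs(x - y), linkage="single")
```

The merge history forms the concept hierarchy: the two tight groups are joined first, and the root cluster containing everything is created last.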
Hierarchical clustering methods are popular for ontology learning because they naturally discover the concept hierarchy during the clustering process. Scatter/Gather (Lin and Pantel, 2001) is one of the earlier methods in which clustering is used to create document hierarchies. More recently, new types of hierarchies have been introduced which rely on the terms used by a set of documents to expose some structure of the document collection. One such technique is lexical modification; another is subsumption.
3.1.2 Rule Learning Algorithms
These are algorithms that learn association rules or other attribute-based rules. The algorithms are generally based on a greedy search over the attribute-value tests that can be added to a rule while preserving its consistency with the training instances. The Apriori algorithm is a simple algorithm which learns association rules between objects. Apriori is designed to operate on databases containing transactions (for example, the collections of items bought by customers). As is common in association rule mining, given a set of item sets (for instance, sets of retail transactions, each listing the individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number S_c (the cutoff, or support threshold) of the item sets. Apriori uses a bottom-up approach: frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. One example of an ontology learning tool is OntoEdit (Maedche and Staab, 2001), which assists the ontology engineer during the ontology creation process. It semi-automatically learns to construct an ontology from unstructured text using a method for discovering generalized association rules. The input data for the learner is a set of transactions, each of which consists of a set of items that appear together in that transaction. The algorithm extracts association rules represented by sets of items that occur together sufficiently often and presents these rules to the knowledge engineer. For example, a shopping transaction may include the items purchased together; a generalized association rule may state that snacks are purchased together with drinks, rather than the more specific rule that crisps are purchased with beer.
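The bottom-up candidate-generation loop of Apriori can be sketched as follows. This is a minimal illustration that finds frequent itemsets only (rule extraction from those itemsets is a separate step); the basket contents are invented for the example:

```python
def apriori(transactions, min_support):
    """Find all itemsets appearing in at least min_support transactions.

    Frequent k-itemsets are extended one item at a time (candidate
    generation), and each candidate is re-counted against the data.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    while frequent:
        k = len(next(iter(frequent)))
        # Candidate generation: unions of frequent k-itemsets of size k+1.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

baskets = [{"beer", "crisps"}, {"beer", "crisps", "milk"}, {"beer", "milk"}]
freq = apriori(baskets, min_support=2)
```

With a support threshold of 2, the pairs {beer, crisps} and {beer, milk} survive, while {crisps, milk} (seen only once) is pruned, and with it every superset; this pruning is what makes the bottom-up search tractable.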
3.1.3 Fuzzy Logic
Fuzzy logic provides the opportunity to model systems that are inherently imprecisely defined. It is popular for modeling textual data because of the uncertainty present in such data. Fuzzy logic is built on the theory of fuzzy sets. Fuzzy set theory deals with the representation of classes whose boundaries are not well defined. The key idea is to associate a membership function with the elements of a class. The