I-Hsien Ting and Hui-Ju Wu (Eds.)
Web Mining Applications in E-Commerce and E-Services
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
Vol 150 Roger Lee (Ed.)
Software Engineering Research, Management and
Applications, 2008
ISBN 978-3-540-70774-5
Vol 151 Tomasz G Smolinski, Mariofanna G Milanova
and Aboul-Ella Hassanien (Eds.)
Computational Intelligence in Biomedicine and Bioinformatics,
2008
ISBN 978-3-540-70776-9
Rough – Granular Computing in Knowledge Discovery and Data
Mining, 2008
ISBN 978-3-540-70800-1
Vol 153 Carlos Cotta and Jano van Hemert (Eds.)
Recent Advances in Evolutionary Computation for
Combinatorial Optimization, 2008
ISBN 978-3-540-70806-3
Vol 154 Oscar Castillo, Patricia Melin, Janusz Kacprzyk and
Witold Pedrycz (Eds.)
Soft Computing for Hybrid Intelligent Systems, 2008
ISBN 978-3-540-70811-7
Vol 155 Hamid R Tizhoosh and M Ventresca (Eds.)
Oppositional Concepts in Computational Intelligence, 2008
ISBN 978-3-540-70826-1
Vol 156 Dawn E Holmes and Lakhmi C Jain (Eds.)
Innovations in Bayesian Networks, 2008
ISBN 978-3-540-85065-6
Vol 157 Ying-ping Chen and Meng-Hiot Lim (Eds.)
Linkage in Evolutionary Computation, 2008
ISBN 978-3-540-85067-0
Vol 158 Marina Gavrilova (Ed.)
Generalized Voronoi Diagram: A Geometry-Based Approach to
Computational Intelligence, 2009
ISBN 978-3-540-85125-7
Vol 159 Dimitri Plemenos and Georgios Miaoulis (Eds.)
Artificial Intelligence Techniques for Computer Graphics, 2009
ISBN 978-3-540-85127-1
Vol 160 P Rajasekaran and Vasantha Kalyani David
Pattern Recognition using Neural and Functional Networks,
2009
ISBN 978-3-540-85129-5
Vol 161 Francisco Baptista Pereira and Jorge Tavares (Eds.)
Bio-inspired Algorithms for the Vehicle Routing Problem, 2009
Inhibitory Rules in Data Analysis, 2009
ISBN 978-3-540-85637-5
Vol 164 Nadia Nedjah, Luiza de Macedo Mourelle, Janusz Kacprzyk, Felipe M.G França
and Alberto Ferreira de Souza (Eds.)
Intelligent Text Categorization and Clustering, 2009
ISBN 978-3-540-85643-6
Vol 165 Djamel A Zighed, Shusaku Tsumoto, Zbigniew W Ras and Hakim Hacid (Eds.)
Mining Complex Data, 2009
ISBN 978-3-540-88066-0
Vol 166 Constantinos Koutsojannis and Spiros Sirmakessis (Eds.)
Tools and Applications with Artificial Intelligence, 2009
ISBN 978-3-540-88068-4
Vol 167 Ngoc Thanh Nguyen and Lakhmi C Jain (Eds.)
Intelligent Agents in the Evolution of Web and Applications, 2009
ISBN 978-3-540-88070-7
Vol 168 Andreas Tolk and Lakhmi C Jain (Eds.)
Complex Systems in Knowledge-based Environments: Theory, Models and Applications, 2009
ISBN 978-3-540-88074-5
Vol 169 Nadia Nedjah, Luiza de Macedo Mourelle and Janusz Kacprzyk (Eds.)
Innovative Applications in Data Mining, 2009
ISBN 978-3-540-88044-8
Vol 170 Lakhmi C Jain and Ngoc Thanh Nguyen (Eds.)
Knowledge Processing and Decision Making in Agent-Based Systems, 2009
ISBN 978-3-540-88048-6
Vol 171 Chi-Keong Goh, Yew-Soon Ong and Kay Chen Tan (Eds.)
Multi-Objective Memetic Algorithms, 2009
ISBN 978-3-540-88050-9
Vol 172 I-Hsien Ting and Hui-Ju Wu (Eds.)
Web Mining Applications in E-Commerce and E-Services, 2009
ISBN 978-3-540-88080-6
National University of Kaohsiung
No 700, Kaohsiung University Road
Kaohsiung City, 811
Taiwan
Email: iting@nuk.edu.tw
Dr Hui-Ju Wu
Institute of Human Resource Management
National Changhua University of Education
No.2, Shi-Da Road
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2008935505
© 2009 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt Ltd., Chennai, India.
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
Preface
Web mining has become a popular area of research, integrating the different research areas of data mining and the World Wide Web. According to the taxonomy of Web mining, there are three sub-fields of Web-mining research: Web usage mining, Web content mining and Web structure mining. These three research fields cover most content and activities on the Web. With the rapid growth of the World Wide Web, Web mining has become a hot topic and is now part of the mainstream of Web research, such as Web information systems and Web intelligence. Among all of the possible applications in Web research, e-commerce and e-services have been identified as important domains for Web-mining techniques. Web-mining techniques also play an important role in e-commerce and e-services, proving to be useful tools for understanding how e-commerce and e-service Web sites and services are used, enabling the provision of better services for customers and users. Thus, this book focuses upon Web-mining applications in e-commerce and e-services.
Some chapters in this book are extended from papers presented at WMEE 2008 (the 2nd International Workshop for E-commerce and E-services). In addition, we also invited well-known researchers in this area to contribute to this book. The chapters of this book are introduced as follows:

In chapter 1, Peter I. Hofgesang presents an introduction to online web usage mining and provides background information, followed by a comprehensive overview of the related work. In addition, he outlines the major, and as yet mostly unsolved, challenges in the field.

In chapter 2, Gulden Uchyigit presents an overview of some of the techniques, algorithms and methodologies, along with the challenges, of using semantic information in the representation of domain knowledge, user needs and the recommendation algorithms.
In chapter 3, Bettina Berendt and Daniel Trümper describe a novel method for analyzing large corpora. Using an ontology created with methods of global analysis, a corpus is divided into groups of documents sharing similar topics. The introduced local analysis allows the user to examine the relationships between documents in a more detailed way.

In chapter 4, Jean-Pierre Norguet et al. propose a method based on output page mining and present a solution to the need for summarized and conceptual audience metrics in Web analytics. The authors describe several methods for collecting the Web pages output by Web servers; aggregating the occurrences of taxonomy terms in these pages can provide audience metrics for the Web site topics.
In chapter 5, Leszek Borzemski presents empirical experience learnt from Web performance mining research, in particular the development of predictive models describing Web performance behavior from the perspective of end-users. The author evaluates Web performance from the perspective of Web clients; therefore, Web performance is considered in the sense of Web server-to-browser throughput or Web resource download speed.
In chapter 6, Ali Mroue and Jean Caussanel describe an approach for automatically finding the prototypic browsing behavior of web users. User access logs are examined in order to extract the most significant user navigation access patterns. Such an approach gives us an efficient way to better understand how users act, and leads us to improve the structure of websites for better navigation.

In chapter 7, Istvan K. Nagy and Csaba Gaspar-Papanek investigate the time spent on web pages (TSP) as a disregarded indicator of the quality of online content. The authors present the factors influencing the TSP measure and give a TSP data preprocessing methodology that eliminates the effects of these factors. In addition, the authors introduce the concepts of sequential browsing and revisitation to more exactly restore users' navigation patterns based on TSP and the restored browser stack.
In chapter 8, Yingzi Jin et al. describe an attempt to learn a ranking of companies from a social network that has been mined from the web. The authors conduct an experiment using the social network among 312 Japanese companies related to the electrical products industry to learn and predict the ranking of companies according to their market capitalization. This study specifically examines a new approach to using web information for advanced analysis by integrating multiple relations among named entities.
In chapter 9, Jun Shen and Shuai Yuan propose a modelling-based approach to design and develop a P2P-based service coordination system and its components. The peer profiles are described with the WSMO (Web Service Modelling Ontology) standard, mainly for quality of service and geographic features of the e-services, which would be invoked by various peers. To fully explore the usability of service categorization and mining, the authors implement an ontology-driven unified algorithm to select the most appropriate peers. The UOW-SWS prototype also shows that the enhanced peer coordination is more adaptive and effective in dynamic business processes.

In chapter 10, I-Hsien Ting and Hui-Ju Wu provide a study of the issues of using web mining techniques for on-line social network analysis. Techniques and concepts of web mining and social network analysis are introduced and reviewed in this chapter, together with a discussion of how to use web mining techniques for on-line social network analysis. Moreover, a process for using web mining for on-line social network analysis is proposed, which can be treated as a general process in this research area. Discussions of the challenges and future research are also included in this chapter.

In summary, this book's content sets out to highlight the trends in theory and practice which are likely to influence e-commerce and e-services practices in web mining research. Through applying Web-mining techniques to e-commerce and e-services, value is enhanced and the research fields of Web mining, e-commerce and e-services can be expanded.

I-Hsien Ting
Hui-Ju Wu
Semantics-Based Analysis and Navigation of Heterogeneous
Bettina Berendt, Daniel Trümper 45
Semantic Analysis of Web Site Audience by Integrating Web
Usage Mining and Web Content Mining
Jean-Pierre Norguet, Esteban Zimányi, Ralf Steinberger 65
Towards Web Performance Mining
Leszek Borzemski 81
Anticipate Site Browsing to Anticipate the Need
Ali Mroue, Jean Caussanel 103
User Behaviour Analysis Based on Time Spent on Web Pages
Istvan K Nagy, Csaba Gaspar-Papanek 117
Ranking Companies on the Web Using Social Network Mining
Yingzi Jin, Yutaka Matsuo, Mitsuru Ishizuka 137
Adaptive E-Services Selection in P2P-Based Workflow with
Multiple Property Specifications
Jun Shen, Shuai Yuan 153
Web Mining Techniques for On-Line Social Networks Analysis:
An Overview
I-Hsien Ting, Hui-Ju Wu 169
Author Index 181
Peter I. Hofgesang
VU University Amsterdam, Department of Computer Science
De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
hpi@few.vu.nl
Abstract. In recent years, web usage mining techniques have helped online service providers to enhance their services, and restructure and redesign their websites in line with the insights gained. The application of these techniques is essential in building intelligent, personalised online services. More recently, it has been recognised that the shift from traditional to online services – and so the growing numbers of online customers and the increasing traffic generated by them – brings new challenges to the field. Highly demanding real-world E-commerce and E-services applications, where the rapid, and possibly changing, large-volume data streams do not allow offline processing, motivate the development of new, highly efficient real-time web usage mining techniques. This chapter provides an introduction to online web usage mining and presents an overview of the latest developments. In addition, it outlines the major, and as yet mostly unsolved, challenges in the field.
Keywords: Online web usage mining, survey, incremental algorithms, data stream mining.
1 Introduction
In the case of traditional, "offline" web usage mining (WUM), usage and other user-related data are analysed and modelled offline. The mining process is not time-limited, the entire process typically takes days or weeks, and the entire data set is available upfront, prior to the analysis. Algorithms may perform several iterations on the entire data set and thus data instances can be read more than once. However, as the number of online users – and the traffic generated by them – greatly increases, these techniques become inapplicable. Services with more than a critical amount of user access traffic need to apply highly efficient, real-time processing techniques that are constrained both computationally and in terms of memory requirements. Real-time, or online, WUM techniques (as we refer to them throughout this chapter) that provide solutions to these problems have received great attention recently, both from academics and the industry.

Figure 1 provides a schematic overview of the online WUM process. User interactions with the web server are presented as a continuous flow of usage data; the data are pre-processed – including being filtered and sessionised – on-the-fly; models are incrementally updated when new data instances arrive and refreshed
I.-H Ting, H.-J Wu (Eds.): Web Mining Appl in E-Commerce & E-Services, SCI 172, pp 1–23.
Fig. 1. An overview of online WUM. User interactions with a web server are pre-processed continuously and fed into online WUM systems that process the data and update the models in real-time. The outputs of these models are used to, e.g., monitor user behaviour in real-time, to support online decision making, and to update personalised services on-the-fly.
models are applied, e.g. to update (personalised) websites, to instantly alert on detected changes in user behaviour, and to report on performance analysis or on results of monitoring user behaviour to support online decision making.
This book chapter is intended as an introduction to online WUM and it aims to provide an overview of the latest developments in the field; in this respect, it is – to the best of our knowledge – the first survey on the topic. The remainder of this chapter is organised as follows. In Section 2, we provide a brief general introduction to WUM and the new online challenges. We survey the literature related to online WUM in three sections (Sections 3, 4, and 5): Section 3 overviews the efficient and compact structures used in (or even developed for) online WUM; Section 4 overviews online algorithms for WUM; while Section 5 presents work related to real-time monitoring systems. The most important (open) challenges are described in Section 6. Finally, the last section provides a discussion.
2 Background
This section provides a background to traditional WUM; describes incremental learning to efficiently update WUM models in a single pass over the clickstream; and, finally, it motivates the need for highly efficient real-time, change-aware algorithms for high-volume, streaming web usage data through a description of web dynamics, characterising changing websites and usage data.

Web or application servers log all relevant information available on user–server interaction. These log data, also known as web user access or clickstream data,
can be used to explore, model, and predict user behaviour. WUM is the application of data mining techniques to perform these steps, to discover and analyse patterns automatically in (enriched) clickstream data. Its applications include customer profiling, personalisation of online services, product and content recommendations, and various other applications in E-commerce and web marketing. There are three major stages in the WUM process (see Figure 2): (I) data collection and pre-processing, (II) pattern discovery, and (III) pattern analysis (see, for example, [18, 51, 67]).

Web Usage Data Sources. The clickstream data contain information on each user click, such as the date and time of the clicks, the URI of visited web resources, and some sort of user identifier (IP, browser type and, in the case of authentication-required sites, login names). An example of (artificially designed) user access log data can be seen in Table 1.
In addition to server-side log data, some applications allow the installation of special software on the client side (see, for example, [3]) to collect various other information (e.g. scrolling activity, active window) and, in some cases, more reliable information (e.g. actual page view time). Web access information can be further enriched by, for example, user registration information, search queries, and geographic and demographic information.
Pre-processing. Raw log data need to be pre-processed: first, by filtering all irrelevant data and possible noise, then by identifying unique visitors, and by recovering
Fig. 2. An overview of the web usage mining process
Table 1. An example of user access log data entries

IP address  Time stamp           Request (URI)      Status  Size   User agent
1.2.3.4     2008-04-28 22:24:14  GET index.html     200     5054   MSIE+6.0
1.3.4.5     2008-04-28 22:24:51  GET index.html     200     5054   Mozilla/5.0
1.2.3.4     2008-04-28 22:25:04  GET content1.html  200     880    MSIE+6.0
1.2.3.4     2008-04-28 22:27:46  GET content2.html  200     23745  MSIE+6.0
1.3.4.5     2008-04-28 22:28:02  GET content5.html  200     6589   Mozilla/5.0
1.2.3.4     2008-04-29 08:18:43  GET index.html     200     5054   MSIE+6.0
1.2.3.4     2008-04-29 08:22:17  GET content2.html  200     23745  MSIE+6.0
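Entries shaped like those in Table 1 can be parsed into structured records before any mining step. The following is a minimal sketch that assumes the whitespace-separated field layout of the sample data above; a real server log format (e.g. the Apache combined format) would need a different parser, and the function name and record keys are illustrative.

```python
from datetime import datetime

def parse_entry(line):
    # Field layout assumed from Table 1:
    # IP, date, time, method, URI, status, size, user agent
    ip, date, time_, method, uri, status, size, agent = line.split()
    return {
        "ip": ip,
        "timestamp": datetime.strptime(f"{date} {time_}", "%Y-%m-%d %H:%M:%S"),
        "uri": uri,
        "status": int(status),
        "size": int(size),
        "agent": agent,
        # IP address + User agent together serve as a crude user identifier
        "user": (ip, agent),
    }

entry = parse_entry("1.2.3.4 2008-04-28 22:24:14 GET index.html 200 5054 MSIE+6.0")
print(entry["user"])  # ('1.2.3.4', 'MSIE+6.0')
```

The `user` field anticipates the user identification heuristic discussed below, which combines the IP address and User agent fields.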
user sessions¹. Due to browser and proxy server caching, some references are missing from the log entries; here we can use information about the site structure along with certain heuristics to recover original sessions (e.g. [19]). Different resources (typically, distinct web pages) on a website also need to be mapped to distinct indices. Page mapping itself is a challenging task. In the case of advanced websites with dynamically generated pages – as in the case of most E-commerce sites – URIs contain many application-specific parameters, and their mapping requires 1) a complete overview of the page generation logic, and 2) application-oriented decisions for determining page granularities. Pages can be mapped to predefined categories by content-based classification as well (e.g. [6]).

User identification based on the combination of the IP address and the User agent fields identifies two distinct users ("1.2.3.4, MSIE+6.0" and "1.3.4.5, Mozilla/5.0") in the sample entries (Table 1). If we take all visited URIs (Request field) for both users, ordered ascendingly by the Time stamp field, and then form user sessions by the time frame identification method (see [19]), using e.g. a 30-minute timeout, the individual entries would break into two separate sessions in the case of the first user and into a single session for the second. Having the visited pages mapped to distinct indices – e.g. by assigning integer numbers increasingly, starting from 1, to each unique page by its appearance, i.e. index.html→1, content1.html→2, content2.html→3, content5.html→4 – we can denote the two sessions of the first user as user1 s1: 1,2,3 and user1 s2: 1,3, and the session of the other user as user2 s1: 1,4. Data in this format, i.e. ordered sequences of page IDs, can directly be used in numerous WUM methods and can easily be transformed into e.g. histogram or binary vector representations for the application of others. For complete and detailed overviews on pre-processing web usage data, we refer the reader to [19, 51].
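The worked example above (time-frame identification with a 30-minute timeout, plus site-wide page-index mapping by order of first appearance) can be sketched as follows; the function and variable names are illustrative, not taken from the chapter.

```python
from datetime import datetime, timedelta

page_ids = {}  # URI -> integer index, assigned by order of first appearance

def page_id(uri):
    return page_ids.setdefault(uri, len(page_ids) + 1)

def sessionise(clicks, timeout=timedelta(minutes=30)):
    """Time-frame identification: `clicks` is one user's time-ordered list of
    (timestamp, uri); a gap longer than `timeout` starts a new session."""
    sessions, current, last = [], [], None
    for ts, uri in clicks:
        if last is not None and ts - last > timeout:
            sessions.append(current)
            current = []
        current.append(page_id(uri))
        last = ts
    if current:
        sessions.append(current)
    return sessions

t = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
user1 = [(t("2008-04-28 22:24:14"), "index.html"),
         (t("2008-04-28 22:25:04"), "content1.html"),
         (t("2008-04-28 22:27:46"), "content2.html"),
         (t("2008-04-29 08:18:43"), "index.html"),    # > 30 min gap: new session
         (t("2008-04-29 08:22:17"), "content2.html")]
print(sessionise(user1))  # [[1, 2, 3], [1, 3]]
```

Run on the first user of Table 1, this reproduces the two sessions denoted user1 s1: 1,2,3 and user1 s2: 1,3 above.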
WUM Techniques. There is a vast amount of related work on traditional WUM. The most popular research areas include frequent itemset and association rule mining, sequential pattern mining, classification, clustering, and personalisation. For an overview of these techniques and of related work, see, for example, Mobasher et al. [54], Eirinaki and Vazirgiannis [22], Anand and Mobasher [2], Pierrakos et al. [63], and Liu [51].

¹ A session is a timely ordered sequence of pages visited by a user during one visit.
Modelling Individual User Behaviour. Most related work processes user sessions without a distinction of individual origin, i.e. which session belongs to which user, either due to a lack of reliable user identification or because the application does not require it. For some applications, however, it is beneficial to process sessions with their individual origin preserved. Model maintenance for each individual poses necessary constraints on model sizes; real-world applications with (tens of) thousands of individuals require compact models.
In the case of "traditional" WUM, the complete training set is available prior to the analysis. In many real-world scenarios, however, it is more appropriate to assume that data instances arrive in a continuous fashion over time, and we need to process information as it flows in and to update our models accordingly. We can identify a task as an incremental learning task when the application does not allow for waiting and gathering all data instances, or when the data flow is potentially infinite – and we gain information by processing more data points.
We say that a learning algorithm is incremental (see, for example, [30]) if at each stage our current model is dependent only on the current data instance and the previous model. More formally, given the first i training instances (x1, ..., xi) in a data flow, the incremental algorithm builds models M0, M1, ..., Mi such that each Mj is dependent only on Mj−1 and xj, where M0 is an initial model and 1 ≤ j ≤ i. We can generally allow a batch of the last n instances to be processed, where n is relatively small compared with the size of the stream, instead of only the last instance.
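A toy illustration of this definition, assuming a deliberately simple "model" consisting of page visit counts and a running mean of session length: each update builds Mj from only Mj−1 and xj, without revisiting past instances. The class name and the choice of statistics are made up for illustration.

```python
class IncrementalUsageModel:
    """Each update() call derives the new model state from only the
    previous state and the incoming session -- no pass over old data."""

    def __init__(self):          # M0: the initial, empty model
        self.n = 0               # sessions seen so far
        self.page_counts = {}    # page id -> visit count
        self.mean_length = 0.0   # running mean session length

    def update(self, session):   # builds M_j from M_{j-1} and x_j
        self.n += 1
        for page in session:
            self.page_counts[page] = self.page_counts.get(page, 0) + 1
        # single-pass mean update: no stored history needed
        self.mean_length += (len(session) - self.mean_length) / self.n

model = IncrementalUsageModel()
for session in [[1, 2, 3], [1, 3], [1, 4]]:   # a stream of sessions
    model.update(session)
print(round(model.mean_length, 3))  # 2.333
```

Real incremental WUM models (e.g. the tree structures of Section 3) follow the same contract, just with richer state.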
Alternatively, even if we have an incremental learning task at hand, we may choose to execute the entire model building process again on the complete data set (that is, complete at that given time), but in many cases our algorithm is limited by calculation time and/or computational resources. Incremental learning tasks include many real-world problems and they are best solved by incremental algorithms. Continuous streams of mass or individual usage data from websites require continuous processing, and so incremental learning algorithms are required for online responses.
Note, however, that in the description above we assume that the underlying data generation process is constant over time and that we gain information by continuously updating our models with new data instances. This assumption is not realistic in most real-world data streams: the underlying data distribution is likely to change over time due to hidden or explicit influential factors. In the following sections, we first outline the major influential factors in online WUM and then describe data stream mining, an emergent new field of data mining that deals with massive dynamic data flows.
The web and, therefore, web content, structure and usage data are dynamic in nature – as are most real-world data sets. Websites are changed every day: pages are removed, new pages are added, new links are created or removed, and the contents of pages are also updated. At the same time, the behaviour of users may change as well. Related work on web dynamics is motivated mostly by search engine design and maintenance, and investigates the evolution of the web, tracking structural and content evolution of websites over time [25, 60]. In the following, we outline the most influential factors for the dynamic web.
Website Structural Changes. Most popular websites change rapidly. However, the type of probable changes may differ across domains. In the case of news portals and online retail shops, for instance, the general news and product categories change infrequently; however, if we would like to capture more detailed information, and we identify page granularity at the very article or product level, we face the daily emergence of new articles and product pages and the likely disappearance of many others at the same time. As a side-effect, new links are created and old ones are removed. Links may be maintained independently as well; for example, in the case of a website that uses automated cross-product recommendations, or provides personalised pages. The evolution of web graphs is the subject of the article by Desikan and Srivastava [21].
Changes in Page Content. In addition to structural changes of a website, which are due to maintenance, content is also likely to change over time. For some web pages this means minor updates, but for others it means radical alteration. For instance, Wikipedia articles evolve over time as more and more content is added and the information is smoothed; however, their general subjects mostly remain. Other pages may undergo drastic content modifications or be merged with other pages, both of which most probably lead to changes in page categories. Evolving web content has been the subject of numerous studies; see, for example, [33, 42, 52, 71].
The evolution of website structure and content over time raises many practical problems. How do we maintain the semantic links between old pages and their successors? We should still be able to identify a new home page with different content, and most likely a different URI, as the same home page with the same indices mapped to it, perhaps flagged as changed and the change quantified, if necessary. How do we synchronise an evolving website with user access data? Usage data always refer to the actual website, and so an earlier snapshot of the structure and its page mapping may be obsolete. Automatic maintenance of semantics and synchronisation of web structure, content, and access data is an area largely unexplored.
Changing User Behaviour. Changes in a dynamic website are, of course, reflected in user access patterns. However, the interests and behaviours of individuals are likely to change independently over time as well. An individual planning to buy a TV may browse through the available selection of an online retail shop for several days or weeks and abandon the topic completely for years after the purchase. In the case of an investment bank, we can imagine a client who fills up his portfolio over several days ("investing behaviour") and then only seldom checks their account ("account checking behaviour"), to get an overview, for an extended period of time before getting involved in transactions again. Additionally, behavioural patterns of users may re-occur over time (e.g. alternating account checking and investing behaviour), and seasonal effects (e.g. Christmas or birthday shopping) are also likely to influence user behaviour. Detecting behavioural changes is essential in triggering model updates; and identifying re-occurrence and seasonal patterns helps to apply knowledge gained in the past.
Incremental mining tasks require single-pass algorithms (Section 2.2). Data stream mining (DSM) [1, 4, 26, 39, 68] tasks impose further harsh constraints on the methods that solve them. The high-volume flow of data streams allows only a single-time processing of data instances, or at most a few passes using a relatively small buffer of recent data, and the mining process is limited both computationally and in terms of memory requirements. In many real-world applications, the underlying data distribution or the structure of the data stream changes over time, as described for WUM in the previous section. Such applications require DSM algorithms to account for changes and provide solutions to handle them. Temporal aspects of data mining are surveyed in Roddick and Spiliopoulou [64], without a focus on efficiency and DSM constraints. The work emphasises the necessity for many applications to incorporate temporal knowledge into the data mining process, so that temporal correlations in patterns can be explored. Such a temporal pattern may be, for example, that certain products are more likely to be purchased together during winter.
Concept Drift. In online supervised learning, concept drift [27, 72] refers to changes in the context underlying the target, or concept, variable. More generally, we refer to "concept" drift in unsupervised learning as well. A drifting concept deteriorates the model and, to recover it, we need to get rid of outdated information and base our model only on the most recent data instances that belong to the new concept. Applying a fixed-size moving window on the data stream and considering only the latest instances is a simple and widely used solution to this problem. However, in practice, we cannot assume that any fixed value of window size, however carefully selected, is able to capture a sufficient amount of "good" instances. A dynamic window size, adjusted to the changing context, is desirable, but it requires sophisticated algorithms to detect the points of change. An alternative to the sliding window approach is to exponentially discount old data instances and update models accordingly.
Incremental and stream mining algorithms need an online feed of pre-processed data. Although we did not find related work on real-time pre-processing of
clickstreams and related data sources, we assume traditional pre-processing methods to have straightforward extensions to perform all necessary steps, including filtering, user identification, and sessionisation, described in Section 2.1. To support online pre-processing, we further need to maintain look-up tables, including a user table with user identification and related user data, a page mapping table, and a table with filtering criteria, which holds, for example, an up-to-date list of robot patterns or pages to remove. The automatic maintenance of a page mapping consistent with both the website and the access data is a non-trivial task, as mentioned in Section 2.3.
3 Compact and Efficient Incremental Structures to Maintain Usage Data
To support real-time clickstream mining, we need to employ flexible and tive structures to maintain web usage data These structures need to be memoryefficient and compact, and they need to support efficient self-maintenance, i.e.insertion and deletion operators and updates in some applications We stress effi-ciency requirements especially for applications where individual representation isneeded How should web usage data be represented to meet these requirements?Much of the related work applies tree-like structures to maintain sequential
adap-or “market basket”2(MB) type data User sessions tend to have a low branchingproperty on the first few pages, i.e the variation of pages in session prefixes ismuch lower than in the suffixes This property reflects the hierarchy of websitesand that most users in general visit only a small set of popular pages In practice,this property assures compactness in prefix-tree-like representations where thesame prefixes of sessions share the same branches in the tree In addition, treesare inherently easy to maintain incrementally A simplest, generic prefix-treeconsists of a root node, which may contain some general information aboutthe tree (e.g sum frequency) and references to its children nodes Every othernode in the tree contains fields with local information about the node (e.g pageidentification or node label, a frequency counter) and reference to its parent andchildren nodes An insertion operator, designed to assure that the same prefixesshare the same branches in the tree, turns this simple structure into a prefix-tree.This structure can be used to store sequences and, by applying some canonical(e.g lexicographic or frequency-based) order on items prior to insertion, MBtype data as well
Related work mostly extends this structure to suit specific applications. The majority of the following structures were originally proposed to maintain (frequent) itemsets, but they can be used to store sessions, preserving the ordering information. Whenever the application allows (e.g. user profiling) or requires (e.g. frequent itemset mining) the transformation of sessions to MB type, the loss of information results in highly compact trees, with the size reduced by large margins. This section focuses only on the structures – their original application is mostly ignored.

2 "Market basket" type data is common in E-commerce applications. With the ordering information disregarded, sessions turn into sets of items or "market basket" type data sets; cardinality of pages within single sessions or sets is often disregarded as well.
FP-Tree [34] was designed to facilitate frequent pattern mining. The structure includes a header table to easily access similar items in the tree; nodes are ordered by their frequency. It was designed for offline mining, using two scans over the whole data set, and therefore no operators for online maintenance are defined in the paper. Cheung and Zaïane [16] introduced an FP-Tree variant, CATS Tree, for incremental mining. In CATS Tree, sub-trees are optimised locally to improve compression, and nodes are sorted in descending order according to local frequencies. AFPIM [43] extends FP-Tree by enabling online mining and providing the necessary maintenance operations on the tree. However, if we apply a minimum threshold to the frequency, the algorithm would still need a complete scan of the data set in case of the emergence of "prefrequent" items not yet represented in the tree. FP-stream [29], another extension of FP-Tree, stores frequent patterns over tilted-time windows in an FP-Tree structure with tree nodes extended to embed information about the window.

CanTree [47] is a simple tree structure to store MB type data, all ordered by the same criteria prior to insertion. In this way, the order of insertion of sequences will not have any effect on the final structure. It is designed to support single-pass algorithms; however, it does not apply a minimum threshold either, which would require multiple scans. The authors extended this work in [46] and proposed DSTree to support frequent itemset mining in data streams.

Xie et al. [41] proposed FIET (frequent itemset enumeration tree), a structure for frequent itemset mining. Nodes represent frequent itemsets and have an active or inactive status to deal with potentially frequent itemsets. Rojas and Nasraoui [65] presented a prefix tree with efficient single-pass maintenance to summarize evolving data streams of transactional data. Along with the tree structure, an algorithm to construct and maintain prefix trees with dynamic ranking, i.e. with an ordering criterion that changes with time, was provided.
The structures mentioned so far were designed to store MB type data and thus, if applied with the original intention, they spoil the sequential information of sessions. The following structures were inherently designed to store sequences.

CST [32] is a simple generic prefix-tree for compact session representation. Chen et al. [14] used a simple prefix-tree for incremental sequential pattern mining. El-Sayed et al. [23] proposed FS-Tree, a frequent sequences tree structure, to store potentially frequent sequences. A simple tree structure is extended by a header table that stores information about frequent and potentially frequent sequences in the data, with a chain of pointers to sequences in the tree. A non-frequent links table stores information about non-frequent links to support incremental mining. In Li et al. [49], TKP-forest, a top-k path forest, is used to maintain essential information about the top-k path traversal patterns. A TKP-forest consists of a set of traversal pattern trees, where a tree is assigned to each character in the alphabet and contains sequences with their first element equal to this character. All possible suffixes of each incoming session are added
to the appropriate tree. Each tree maintains general statistics over its sequences, and the same items are linked together within trees to support efficient mining.

Although this is mostly not covered in the literature, we can assume that maintenance of data over a variable or fixed-size sliding window can be implemented easily by, for instance, maintaining a list of references for the last n sessions pointing to the last pages of the sessions in the tree. Sessions can easily be eliminated by following these pointers. Figures 3 and 4 present an example of tree evolution, based on the simple generic prefix-tree we described above, both for ordered session data (Figure 3) and for its MB type data representation (Figure 4), using the data in Table 2. The simple tree structure is extended by a list of references pointing to the last pages of sessions in the tree.

P.I. Hofgesang

Fig. 3. Simple prefix-tree representation of the original sessions with a list of references pointing to the last pages of sessions

Fig. 4. Simple prefix-tree representation of sessions transformed into ascendingly ordered MB-type data with a list of references pointing to the last pages of sessions

Table 2. Sample session data both in original and MB-type format

ID  Original Session  MB-type
s1  1 1 2 5 5 5       1 2 5
s2  1 2 2 9           1 2 9
s4  1 2               1 2
s5  1 2 3 3           1 2 3
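The reference-list deletion scheme described above might be implemented as follows (our own minimal sketch; the `insert`/`eliminate` helpers and the window size n are illustrative, not taken from any cited work):

```python
from collections import deque

class Node:
    def __init__(self, label=None, parent=None):
        self.label, self.parent = label, parent
        self.count, self.children = 0, {}

def insert(root, session):
    """Insert a session into the prefix-tree; return its last-page node."""
    node = root
    for page in session:
        if page not in node.children:
            node.children[page] = Node(page, node)
        node = node.children[page]
        node.count += 1
    return node

def eliminate(last_node):
    """Remove one session by walking its last-page reference upward,
    decrementing counters and pruning branches whose count reaches zero."""
    node = last_node
    while node.parent is not None:
        node.count -= 1
        if node.count == 0:
            del node.parent.children[node.label]
        node = node.parent

# Sliding window over the last n sessions via a list of last-page references.
root, n = Node(), 2
window = deque()
for session in ([1, 2, 5], [1, 2, 9], [1, 2]):
    window.append(insert(root, session))
    if len(window) > n:              # oldest session falls out of the window
        eliminate(window.popleft())
```

After the loop, the contribution of the oldest session [1, 2, 5] has been removed without scanning the tree: the reference leads directly to its last page.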
4 Online WUM Algorithms
This section provides an overview of online WUM algorithms grouped into four categories: frequent (top-k) items, itemsets and association rules mining; discovery of sequential patterns and sequential prediction; clustering; and web user profiling and personalisation. We have attempted to compile a comprehensive list of relevant papers; however, the list may not be complete. Most of the work relates to online WUM, but we also included general DSM methods where the application allows the processing of web usage data.
Extending traditional frequent itemsets and association rules mining methods to DSM environments has been widely studied recently, and it is one of the most popular fundamental research areas in DSM. Just as in traditional itemsets mining, the number of candidate itemsets is exponential, and even the result set is typically huge – and so in DSM we need to apply a minimal support threshold to rule out infrequent candidates. The greatest challenge in finding frequent patterns, and therefore frequent itemsets, in streaming data in an incremental fashion is that previously infrequent patterns may become frequent after new instances flow in and, similarly, previously frequent patterns may become infrequent. There is a vast amount of research proposed to solve this problem. Here we introduce only a few of the pioneer works and several more recent ones. Note that most of these techniques can be applied directly to any kind of MB type data sets, so we do not need to differentiate between WUM and general DSM techniques. Most of the following algorithms use some type of compact and efficient structure to maintain frequent and pre-frequent patterns over time.

Two-pass and "semi-" incremental algorithms. The candidate generation-and-test method of Apriori-like algorithms is efficient for searching among the otherwise exponential number of candidates, but it is not suitable for solving incremental or stream mining tasks. A number of algorithms apply a more efficient, although still not online, approach that scans the database once to find candidates, and to identify the actual frequent sets, with respect to a specified minimum support threshold, in a second scan. FP-Growth, proposed by Han et al. ([35]), requires two scans over the entire database to find frequent itemsets. Its efficient tree structure, FP-Tree, uses header links to connect the same items in the tree. [16, 43] extended this work. Cheung and Zaïane [16] introduced CATS Tree, an extension of FP-Tree with higher compression, with the FELINE algorithm. FELINE allows adjustment to minimal support, to aid interactive mining ("built once, mine many"). AFPIM, proposed by Koh and Shieh [43], stores both frequent and pre-frequent items in an extended FP-Tree. The tree is adjusted according to the inserted and deleted transactions; however, it needs a complete rescan over the database in case a newly emerged frequent item is not yet in the tree.
One-pass algorithms. There are two methods, in related work, to limit the frequent pattern structure size: some algorithms use double thresholds (e.g. [45]), and some apply pruning (e.g. [17]). Lee and Lee [45] applied double thresholds and an additional monitoring prefix-tree to maintain candidates. They evaluated their method both on real and synthetic web log data. Chi et al. [17] presented Moment, an algorithm to maintain all closed frequent itemsets in a sliding window. A closed enumeration tree, CET, is used to record all actual and potential closed frequent itemsets. estWin, by Chang and Lee [11], maintains a sliding window over the itemsets and stores all the currently significant ones in a monitoring tree. This tree is pruned over time to limit its size. Frequent itemsets are mined, upon user request, from the monitoring tree. Calders et al. [10] pointed out that the goodness of online mining methods for frequent itemsets depends highly on the correct parameter settings, i.e. on the size of the sliding window or on the decay factor, if applied. They proposed a max frequency measure of an itemset, which refers to the maximal frequency of the itemset over all possible windows on the stream. They show that, in practice, it is sufficient to calculate max frequencies over some specific points, called borders, and to maintain summary statistics over only these points in order to determine frequent itemsets.

The above papers focus on frequent itemsets mining and do not present methodology to maintain association rules. Although rules can be calculated based on the frequent itemsets, it is not straightforward to maintain them over time given the evolving itemsets and a user-defined confidence threshold.

Yet another, slightly similar, problem is to find supported (top-k) items over a data stream (e.g. the top 10 most-visited web pages). Cormode and Muthukrishnan [20] presented methods to maintain top-k items, and their approximate frequency, based on statistics over random samples, referred to as "group testing". Jin et al. [40] proposed two hash-based approaches, hCount and hCount*, to find a list of most frequent items over a data stream. Charikar et al. [12] presented a one-pass algorithm applied on a novel data structure (count sketch) to estimate the most frequent items using very limited storage space.
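To make the top-k problem concrete, the following sketch uses the standard Space-Saving algorithm – a well-known alternative to the cited group-testing and sketch-based methods, shown here only as an illustration, not as any of the cited algorithms:

```python
def space_saving(stream, k):
    """Approximate top-k item frequencies over a stream using at most k
    counters (the Space-Saving scheme). Estimates are upper bounds on the
    true counts; items that were never evicted are counted exactly."""
    counters = {}                       # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the minimum counter and let the new item inherit its
            # count (+1), which preserves the upper-bound property.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

clicks = ["home", "news", "home", "cart", "home", "news"]
top = space_saving(clicks, k=2)   # approximate top-2 pages with bounds
```

The memory footprint is fixed at k counters regardless of the stream length, which is exactly the kind of guarantee the one-pass methods above aim for.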
In the previous section, on frequent itemsets mining, we ignored the ordering information of user sessions. This section, however, presents methods to discover frequent sequences and sequential relationships. Essentially, the main problem in frequent sequential pattern mining is the same as described in the previous section: how to deal with patterns that become frequent or infrequent over time. Finding frequent sequences online may help to adapt websites in real-time based on the analysis of popular page traversals; and sequential page prediction models may form the basis of online page recommendation systems or page caching mechanisms.

Wang [70] used a dynamic suffix tree structure for incremental pattern updating. Parthasarathy et al. [61] presented ISM, an incremental sequence mining algorithm that maintains the frequent and potentially frequent sequences in a sequence lattice. Massaeglia et al. [53] proposed IseWum to maintain sequential web usage patterns incrementally. However, no guidelines for efficient implementation are provided; the algorithm, as described, needs multiple iterative scans over the entire database. The necessary number of iterations is a multiple of the length of the longest sequence.

Cheng et al. [15] proposed IncSpan to maintain sequential patterns in dynamically changing databases, solving the problem of inserting and appending records to a database – deletion of records is not discussed. The algorithm maintains a buffer of semi-frequent patterns as candidates and stores frequent ones in a sequential pattern tree. The efficiency of the algorithm is optimised through reverse pattern matching and shared projection. Chen et al. [14] argued that IncSpan and its improved variant IncSpan+ [59] fail to detect some potentially frequent sequences and thus, eventually, the method is prone to miss a portion of all frequent sequences. They proposed PBIncSpan to overcome the problem.
El-Sayed et al. [23] presented a tree structure (FS-tree) for frequent sequences. The tree is maintained incrementally; sequences are inserted or deleted based on changes in the database. In Li et al. [48], StreamPath was presented to mine the set of all frequent traversal patterns over a web-click stream in one scan. The authors extended this work in [49] to find the top-k traversal subsequence patterns.
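The suffix-insertion idea behind such traversal pattern structures can be illustrated with a single prefix tree (a simplified sketch of ours, not the cited TKP-forest or StreamPath): inserting every suffix of each session makes the count on any node path equal the number of times that consecutive page sequence was traversed.

```python
from collections import defaultdict

def make_node():
    """A tree node: an occurrence counter plus auto-created children."""
    return {"count": 0, "children": defaultdict(make_node)}

def add_session(root, session):
    """Insert every suffix of the session, so each node path counts the
    occurrences of that consecutive traversal pattern."""
    for start in range(len(session)):
        node = root
        for page in session[start:]:
            node = node["children"][page]
            node["count"] += 1

def pattern_count(root, pattern):
    """Number of times a consecutive page sequence was traversed."""
    node = root
    for page in pattern:
        if page not in node["children"]:
            return 0
        node = node["children"][page]
    return node["count"]

root = make_node()
add_session(root, [1, 2, 5])
add_session(root, [3, 2, 5])
# the consecutive traversal 2 -> 5 occurs in both sessions
```

The price of this simplicity is quadratic insertion work per session, which is why the cited structures add per-character trees, header links, and pruning.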
Yen et al. [75] presented IncWTP to mine web traversal patterns incrementally using an extended lattice structure. The size of the structure is limited by the website link structure: only connected pages are considered to be valid traversals. Gündüz-Öğüdücü and Tamer Özsu [32] presented an incremental web page recommendation model based on a compact tree structure (CST) and similarity-based clustering of user sessions. Li et al. [50] presented DSM-PLW, a projection-based, single-pass algorithm for online incremental mining of path traversal patterns over a continuous stream of maximal forward references using a Landmark Window. Laxman et al. [44] presented space- and time-efficient algorithms for frequency counting under the non-overlapped occurrences-based frequency for episodes.
Markov models are highly popular in offline sequential prediction tasks. Although we found no prior work, we can assume it is straightforward to extend traditional Markov-model-based techniques to online versions. The state transition probability matrix can be updated incrementally and, to keep it compact, state transitions can be represented using efficient tree or hash structures.
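Such an online Markov predictor might look as follows (our own sketch, with transition counts kept in a sparse hash-of-hashes rather than a full matrix):

```python
from collections import defaultdict

class OnlineMarkov:
    """First-order Markov page predictor with incrementally updated
    transition counts, stored sparsely as a hash of hashes."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, session):
        """Incorporate one finished session into the transition counts."""
        for cur, nxt in zip(session, session[1:]):
            self.counts[cur][nxt] += 1

    def predict(self, page):
        """Most likely next page after `page`, or None if page is unseen."""
        nxt = self.counts.get(page)
        return max(nxt, key=nxt.get) if nxt else None

    def prob(self, cur, nxt):
        """Maximum-likelihood transition probability estimate."""
        total = sum(self.counts[cur].values())
        return self.counts[cur][nxt] / total if total else 0.0

m = OnlineMarkov()
m.update([1, 2, 5])
m.update([1, 2, 9])
m.update([1, 2, 5])
```

Each session updates the model in time linear in its length, so prediction quality tracks the stream without any retraining step.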
Clustering partitions data instances into similarity groups, called clusters, such that members of the same cluster are similar, and members of different clusters are dissimilar. To determine the degree of similarity, clustering applies a similarity or distance measure on pairs of instances. Applications of web usage data clustering in E-commerce environments include market segmentation and web personalisation. In a stream mining environment, in addition to the constraints described in Section 2.4, the major challenge in clustering is to handle evolving clusters. New clusters may arise, old ones may disappear or merge, and instances – for example, in case clustered instances are individual users – may change cluster membership over time. Barbará [8] presents requirements for clustering data streams and overviews some of the latest algorithms in the literature.

Ester et al. [24] present an incremental density-based clustering algorithm, Incremental DBSCAN, one of the earliest incremental clustering methods. The relation between objects is defined by assumptions about object density in a given neighbourhood of the object. Effects of incremental updates, insertion and deletion of objects, are considered through their effect in changing these relations. Evaluation includes experiments on web access log data of a computer science department site.

Nasraoui et al. [58] presented TECNO-STREAMS, an immune-system-inspired single-pass method to cluster noisy data streams. The system continuously learns and adapts to new incoming patterns. In [56] the authors extended this work to track and validate evolving clusters and present a case study on the task of mining real evolving web clickstream data and on tracking evolving topic trends in textual stream data.

In Hofgesang [37] user profiles are maintained for each individual incrementally by means of a prefix-tree structure. Clustering of profiles is offline; the work assumes that clusters need to be updated only periodically, on demand. Wu et al. [74] propose a clustering model, to generate and maintain clusters mined from evolving clickstreams, based on dense regions discovery. However, the authors do not disclose details about cluster maintenance issues, and the evaluation, on real-world web usage data, does not cover the evolving aspects either.
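An elementary building block behind many such systems is the incremental centroid update: each new session vector is assigned to the nearest cluster and only that cluster's centre is adjusted, so old data never needs to be revisited. A minimal sketch over binary page-visit vectors (our own illustration, not any cited method):

```python
def assign_and_update(centroids, counts, x):
    """Assign session vector x to the nearest centroid (squared Euclidean
    distance) and move that centroid toward x with a running-mean step,
    without revisiting previously clustered sessions."""
    dists = [sum((c - v) ** 2 for c, v in zip(centroid, x))
             for centroid in centroids]
    j = dists.index(min(dists))
    counts[j] += 1
    eta = 1.0 / counts[j]                      # running-mean step size
    centroids[j] = [c + eta * (v - c) for c, v in zip(centroids[j], x)]
    return j

# Two clusters over a three-page site; vectors mark pages visited in a session.
centroids = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
counts = [1, 1]
assign_and_update(centroids, counts, [1.0, 1.0, 0.0])  # joins cluster 0
assign_and_update(centroids, counts, [0.0, 1.0, 1.0])  # joins cluster 1
```

This is the simplest possible scheme: it cannot create or remove clusters, which is precisely the evolving-cluster capability the methods above add.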
In Suryavanshi et al. [69] the authors extend their previous work, Relational Fuzzy Subtractive Clustering, and propose its incremental version, Incremental RFSC, for adaptive web usage profiling. They define a measure, the impact factor, which quantifies the necessity of reclustering. Their method thus updates clusters incrementally until the model deteriorates and needs a complete re-clustering of the data from scratch.
The following works, despite presenting offline methods, capture the changing environment by incorporating temporal aspects. In Nasraoui et al. [57] the authors present a framework, based on a robust evolutionary clustering approach, for mining, tracking, and validating evolving user profiles on dynamic websites. The session similarity measure for clustering is extended to incorporate website ontology, weighting pages based on their distance in the site hierarchy. MONIC, proposed by Spiliopoulou et al. [66], is a framework for monitoring cluster transitions. In the framework an offline clustering is applied periodically on an accumulating data set. Cluster transitions, such as the emergence and disappearance of clusters and the migration of members from one cluster to another, are tracked between two consecutive cluster sets.
The aim of web personalisation is to help users cope with the information load and to automatically filter relevant, new information. An adaptive, personalised website automatically filters new content according to user preference and adjusts its structure and presentation to improve usability. Personalisation is based on individual user or aggregate group profiles that capture individual or common interest and preference. For an overview of offline personalisation, see [2, 22].
Most current personalisation systems consist of an offline part, to discover user profiles, and an online part, to apply the profiles in real-time. This approach is not suitable in real-time dynamic environments with changing user preferences. In this scenario, user profiles also need to be updated online. User profiles can be based on virtually any of the online techniques presented in the previous sections to extract user-specific patterns, e.g. to maintain a list of the most popular pages or page sets of an individual over time. In the case of group personalisation or collaborative filtering, we may use online clustering to identify aggregate groups and to calculate a centroid or base profile for each of these groups.
Chen [13] presented a self-organising HCMAC neural network that can incrementally update user profiles based on explicit feedback on page relevance given by users browsing a website. The network needs initial training on an initial data set to build a starting model that is updated incrementally later on.

Godoy and Amandi [31] proposed a user profiling technique, based on a web document conceptual clustering algorithm, that supports incremental learning and profile adaptation. The personal agent, PersonalSearcher, adapts its behaviour to interest changes to assist users on the web. Furthermore, profiles can be presented in a readable description so that users can explore their profiles and verify their correctness.
Based on the user profiles, we can build personalised services to provide customised pages and adaptive websites. The notion of an adaptive website was proposed by Perkowitz and Etzioni [62] for websites that automatically improve their organisation and presentation based on user access patterns. Baraglia and Silvestri [7] introduced SUGGEST, which performs online user profiling, model updating, and recommendation building.

In an article by Nasraoui et al. [55], the authors presented two strategies, based on K-Nearest-Neighbors and TECNO-STREAMS (see Section 4.3), for collaborative filtering-based recommendations applied on dynamic, streaming web usage data. They described a methodology to test the adaptability of recommender systems in streaming environments.
5 Online Web Usage Mining Systems
While related work in the previous sections focuses mostly on single algorithms, here we present works that describe complete frameworks for online change detection and monitoring systems.
Baron and Spiliopoulou [9] presented PAM, a framework to monitor changes of a rule base over time. Although the methods are offline, i.e. pattern sets are identified in batches of the data between two consecutive time slices, tracking changes of usage patterns makes this work relevant to our survey. Patterns – association rules – are represented by a generic rule model that captures both statistics and temporal information of rules. Thus each rule is stored together with its timestamp and statistics, such as support, confidence and certainty factor. At each time slice, patterns are compared to the ones discovered in the previous batch: the same rules in the two sets are checked for significant changes in their statistics using a two-sided binomial test. In case a change is detected based on the current and the previous batches, it is labelled either as a short- or long-term change depending on the results of change detection in the following step, i.e. whether the changed value returns to its previous state in the next test or remains the same for at least one more period. Change detection in this form is local; it checks rules that coexist in consecutive pattern sets. To track rules throughout the whole period, several heuristics were given that analyse changes in the time series – formed of consecutive measurements for each rule on all data slices – e.g. to check pattern stability over time and label patterns as permanent, frequent, or temporary changes. The set of rules with changed statistics may be large, and to reduce its size the notion of atomic change was introduced. A rule with an atomic change contains no changed subpart itself. At each step only the set of rules with atomic changes is presented to the user. Experimental evaluation of PAM included analysis of 8 months of server-side access log data of a non-commercial website. The total set was sliced into monthly periods, which seems to be a reasonable setup, although no evaluation was presented of how the selection of window size affects the framework. Furthermore, the authors gave no guidelines to field experts on which heuristics to apply on a particular data set and how to interpret the results of the heuristics.
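The statistical core of this monitoring step – testing whether a rule's support differs significantly between two batches – can be sketched with an exact two-sided binomial test (our own illustration of the idea; PAM's exact formulation may differ):

```python
from math import comb

def binom_two_sided_p(k, n, p0):
    """Exact two-sided binomial p-value: the probability, under support p0,
    of any outcome whose point probability is at most that of observing
    k supporting sessions out of n."""
    pk = comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(n + 1)
               if comb(n, i) * p0 ** i * (1 - p0) ** (n - i) <= pk + 1e-12)

def support_changed(old_support, hits, batch_size, alpha=0.05):
    """Flag a rule whose support in the new batch (hits / batch_size)
    differs significantly from its support in the previous batch."""
    return binom_two_sided_p(hits, batch_size, old_support) < alpha

# A rule supported by 30% of sessions last month, but only 15 of 100 now:
changed = support_changed(0.30, 15, 100)
```

The same test applies unchanged to confidence or any other per-session proportion tracked for a rule.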
In their work, Ganti et al. [28] assumed the data to be kept in a large data warehouse and to be maintained by systematic block evolution, i.e. addition and deletion of blocks of data. They presented DEMON, a framework for mining and monitoring blocks of data in such dynamic environments. A new dimension, called the data span dimension, was introduced on the database, which allows the selection of a window of the w most recent data blocks for analysis. They also specified a selection constraint, the block selection predicate, which allows limiting the analysis to data blocks that satisfy certain criteria, e.g. to select blocks of data added on each Monday. They described three incremental algorithms, including two variants of frequent itemset mining algorithms and a clustering algorithm, on varying selections of data blocks. In addition, they proposed a generic algorithm that can be instantiated by additional incremental algorithms to facilitate their framework. Furthermore, to capture possible cyclic and seasonal effects, a similarity measure between blocks of data was defined.

The topology of a website represents the view of its designer. The actual site usage, which reflects how visitors actually use the site, can confirm the correctness of the site topology or can indicate paths of improvement. It is in the best interest of the site maintainer to match the topology and usage to facilitate efficient navigation on the site. Wu et al. [73] proposed a system to monitor and improve website connectivity online, based on the site topology and usage data. They defined two measures to quantify access efficiency on a website. They assumed that each user session consists of a set of target pages that the particular user wants to visit. The measures define efficiency based on the extra clicks a user has to perform to reach his target pages within a given web graph. These measures are monitored constantly over the incoming sessions and, in case their values drop below a certain threshold, redesign of the website topology is initiated. The redesign phase is facilitated by the access interest measure, which is designed to indicate whether an access pattern is popular but not efficient. Although the concept of target page sets is the basis of their methods, the authors simply assume that these targets can be identified using page view times and the website topology. Unfortunately, since this is a non-trivial task – these pages can only be approximated to a certain extent (e.g. [36]) and can never be completely identified – no guidelines are provided on how to identify these pages within user sessions.
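The extra-click idea can be made concrete with shortest paths over the site graph: efficiency compares the clicks a user actually spent with the fewest clicks needed to reach the targets (a simplified sketch of ours; the paper's exact measures differ):

```python
from collections import deque

def min_clicks(graph, start, target):
    """Fewest clicks from `start` to `target` in the site link graph
    (BFS shortest path); returns None if the target is unreachable."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if page == target:
            return depth
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

def extra_clicks(graph, session, targets):
    """Clicks actually spent in the session minus the fewest clicks needed
    to reach the farthest target from the session's entry page."""
    needed = max(min_clicks(graph, session[0], t) for t in targets)
    return (len(session) - 1) - needed

site = {"home": ["news", "cart"], "news": ["item"], "cart": ["item"]}
# A user who wanted "item" but wandered spends two avoidable clicks:
wasted = extra_clicks(site, ["home", "cart", "home", "news", "item"], {"item"})
```

Monitoring the average of such a measure over incoming sessions, and alerting when it drifts above a threshold, mirrors the redesign trigger described above.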
Hofgesang and Patist [38] provided a framework for online change detection in individual web user behaviour. They defined three goals – detecting changes in individual browsing behaviour, reporting on user actions that may need special care, and detecting changes in visitation frequency – and proposed space- and computationally efficient, real-time solutions. The first problem deals with detecting changes in the navigational patterns of users, i.e. the sets of visited pages of individuals. The solution to the second goal is an integral part of the solution to the first problem: it considers outlier patterns of the first goal and checks whether these patterns are "interesting" based on their "uniqueness" compared to patterns of other individual profiles. The third goal is to detect increased or decreased activity in real-time on individual activity data, i.e. the series formed by the number of sessions for an individual in a given time period (e.g. a day). Changes detected in navigation patterns can be used, e.g., to update personalised websites, while the solution to the second problem provides hints that
an individual may need online assistance. Detecting changes in user activity facilitates customer retention, e.g. decreasing user activity may forecast a defecting customer. If detected in time, a change can be used to take certain marketing actions to possibly retain the customer.

6 Challenges in Online WUM
This section summarises major challenges in online web usage mining to motivate research in the field. There is only a handful of works devoted completely to online web usage mining, and thus more research – to adapt and improve traditional web usage mining tools to meet the severe constraints of stream mining applications and to develop novel online web usage mining algorithms – is much needed. In particular, the most challenging and largely unexplored aspects are:

• Change detection. Characteristics of real-world data, collected over an extended period of time, are likely to change (see Sections 2.3 and 2.4). To account for these changes and to trigger proper actions (e.g. to update models, or send alerts), algorithms for change detection need to be developed.
• Compact models. Many applications require a single model or a single profile maintained for each individual (e.g. individual personalisation and direct marketing). In the case of most E-commerce applications this would lead to the maintenance of (tens or hundreds of) thousands of individual models (see Section 2.1), and therefore efficient, compact representations are required.
• Maintenance of page mapping. [5] shows that in commercial websites over 40% of the content changes each day. How can we maintain consistency between page mapping and usage data (see Section 2.3)? How should we interpret previously discovered patterns that refer to outdated web content in a changed environment? Automated solutions to maintain consistent mappings are required.
• New types of websites. Numerous practical problems arise with the growing number of AJAX and Flash based applications. In the case of Flash, the content is downloaded at once and user interaction is limited to the client side, and thus not tracked by the server. AJAX based web applications refresh only parts of the content. How can we collect complete usage data in these environments, and how do we identify web pages? Ad hoc solutions exist to tackle these problems, but automated solutions, capturing the intentions of website designers, would be highly desirable.
• Public data sets. The lack of publicly available web usage data sets sets back research on online web usage mining. Data sets collected over an extensive amount of time, possibly reflecting web dynamics and user behavioural changes, carefully processed and well documented, with clear individual identification, would greatly facilitate research.

7 Discussion
This work presented an introduction to online web usage mining. It described the problem and provided background information, followed by a comprehensive overview of the related work. As in traditional web usage mining, the most popular research areas in online web usage mining are frequent pattern mining (frequent itemsets and frequent sequential patterns), clustering, and user profiling and personalisation. We motivated research in online web usage mining through the identification of major, and yet mostly unsolved, challenges in the field. Applications of online WUM techniques include many real-world E-commerce scenarios, such as real-time user behaviour monitoring, support of on-the-fly decision making, and real-time personalisation that supports adaptive websites.
References

3. Atterer, R., Wnuk, M., Schmidt, A.: Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction. In: WWW 2006: Proceedings of the 15th International Conference on World Wide Web, pp. 203–212. ACM, New York (2006)
4. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: PODS 2002: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 1–16. ACM, New York (2002)
5. Baldi, P., Frasconi, P., Smyth, P.: Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, Chichester (2003)
6. Balog, K., Hofgesang, P.I., Kowalczyk, W.: Modeling navigation patterns of visitors of unstructured websites. In: AI-2005: Proceedings of the 25th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 116–129. Springer SBM, Heidelberg (2005)
7. Baraglia, R., Silvestri, F.: Dynamic personalization of web sites without user intervention. Commun. ACM 50(2), 63–67 (2007)
8. Barbará, D.: Requirements for clustering data streams. SIGKDD Explor. Newsl. 3(2), 23–27 (2002)
9. Baron, S., Spiliopoulou, M.: Monitoring the evolution of web usage patterns. In: Berendt, B., Hotho, A., Mladenič, D., van Someren, M., Spiliopoulou, M., Stumme, G. (eds.) EWMF 2003. LNCS (LNAI), vol. 3209, pp. 181–200. Springer, Heidelberg (2004)
10. Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: Perner, P. (ed.) ICDM 2007, pp. 83–92. IEEE Computer Society, Los Alamitos (2007)
11. Chang, J.H., Lee, W.S.: estWin: Online data stream mining of recent frequent itemsets by sliding window method. J. Inf. Sci. 31(2), 76–90 (2005)
12. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002)
13 Chen, C.-M.: Incremental personalized web page mining utilizing self-organizingHCMAC neural network Web Intelli and Agent Sys 2(1), 21–38 (2004)
14 Chen, Y., Guo, J., Wang, Y., Xiong, Y., Zhu, Y.: Incremental mining of sequentialpatterns using prefix tree In: Zhou, Z.-H., Li, H., Yang, Q (eds.) PAKDD 2007.LNCS (LNAI), vol 4426, pp 433–440 Springer, Heidelberg (2007)
15 Cheng, H., Yan, X., Han, J.: IncSpan: incremental mining of sequential patterns
in large database In: KDD 2004: Proceedings of the 2004 ACM SIGKDD national conference on Knowledge discovery and data mining, pp 527–532 ACMPress, New York (2004)
inter-16 Cheung, W., Za¨ıane, O.R.: Incremental mining of frequent patterns without date generation or support constraint In: IDEAS 2003: 7th International DatabaseEngineering and Applications Symposium, pp 111–116 IEEE Computer Society,Los Alamitos (2003)
candi-17 Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Moment: Maintaining closed frequentitemsets over a stream sliding window In: ICDM 2004, pp 59–66 IEEE ComputerSociety, Los Alamitos (2004)
Trang 2720 P.I Hofgesang
18 Cooley, R., Mobasher, B., Srivastava, J.: Web mining: Information and pattern covery on the world wide web In: ICTAI 1997: Proceedings of the 9th InternationalConference on Tools with Artificial Intelligence, pp 558–567 IEEE Computer So-ciety, Los Alamitos (1997)
dis-19 Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wideweb browsing patterns Knowledge and Information Systems 1(1), 5–32 (1999)
20 Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most quent items dynamically ACM Trans Database Syst 30(1), 249–278 (2005)
fre-21 Desikan, P., Srivastava, J.: Mining temporally evolving graphs In: Mobasher, B.,Liu, B., Masand, B., Nasraoui, O (eds.) WebKDD 2004: Webmining and WebUsage Analysis (2004)
22 Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization ACM Trans.Inter Tech 3(1), 1–27 (2003)
23 El-Sayed, M., Ruiz, C., Rundensteiner, E.A.: FS-Miner: efficient and incrementalmining of frequent sequence patterns in web logs In: WIDM 2004: Proceedings
of the 6th annual ACM international workshop on Web information and datamanagement, pp 128–135 ACM Press, New York (2004)
24 Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., Xu, X.: Incremental clusteringfor mining in a data warehousing environment In: Gupta, A., Shmueli, O., Widom,
J (eds.) VLDB 1998: Proceedings of 24rd International Conference on Very LargeData Bases, pp 323–333 Morgan Kaufmann, San Francisco (1998)
25 Fetterly, D., Manasse, M., Najork, M., Wiener, J.L.: A large-scale study of theevolution of web pages Softw Pract Exper 34(2), 213–237 (2004)
26 Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review.SIGMOD Rec 34(2), 18–26 (2005)
27 Gama, J., Castillo, G.: Learning with local drift detection In: Li, X., Za¨ıane, O.R.,
Li, Z (eds.) ADMA 2006 LNCS (LNAI), vol 4093, pp 42–55 Springer, Heidelberg(2006)
28 Ganti, V., Gehrke, J., Ramakrishnan, R.: DEMON: Mining and monitoring ing data Knowledge and Data Engineering 13(1), 50–63 (2001)
evolv-29 Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.: Mining Frequent Patterns in DataStreams at Multiple Time Granularities In: Kargupta, H., Joshi, A., Sivakumar,K., Yesha, Y (eds.) Next Generation Data Mining AAAI/MIT (2003)
30 Giraud-Carrier, C.: A note on the utility of incremental learning AI tions 13(4), 215–223 (2000)
Communica-31 Godoy, D., Amandi, A.: User profiling for web page filtering IEEE Internet puting 9(04), 56–64 (2005)
Com-32 G¨und¨uz- ¨Og¨ud¨uc¨u, S., ¨Ozsu, M.T.: Incremental click-stream tree model: Learningfrom new users for web page prediction Distributed and Parallel Databases 19(1),5–27 (2006)
33 Han, J., Han, D., Lin, C., Zeng, H.-J., Chen, Z., Yu, Y.: Homepage live: automaticblock tracing for web personalization In: WWW 2007: Proceedings of the 16thInternational Conference on World Wide Web, pp 1–10 ACM, New York (2007)
34 Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation.In: Chen, W., Naughton, J.F., Bernstein, P.A (eds.) Proceedings of the 2000 ACMSIGMOD International Conference on Management of Data, Dallas, Texas, USA,May 16-18, pp 1–12 ACM, New York (2000)
35 Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidategeneration: A frequent-pattern tree approach Data Min Knowl Discov 8(1),53–87 (2004)
Trang 2836 Hofgesang, P.I.: Methodology for preprocessing and evaluating the time spent onweb pages In: WI 2006: Proceedings of the 2006 IEEE/WIC/ACM InternationalConference on Web Intelligence, pp 218–225 IEEE Computer Society, Los Alami-tos (2006)
37 Hofgesang, P.I.: Web personalisation through incremental individual profilingand support-based user segmentation In: WI 2007: Proceedings of the 2007IEEE/WIC/ACM International Conference on Web Intelligence, pp 213–220.IEEE Computer Society, Washington (2007)
38 Hofgesang, P.I., Patist, J.P.: Online change detection in individual web user haviour In: WWW 2008: Proceedings of the 17th International Conference onWorld Wide Web, pp 1157–1158 ACM, New York (2008)
be-39 Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams In:Proceedings of the Seventh ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, pp 97–106 ACM Press, New York (2001)
40 Jin, C., Qian, W., Sha, C., Yu, J.X., Zhou, A.: Dynamically maintaining frequentitems over a data stream In: CIKM 2003: Proceedings of the twelfth internationalconference on Information and knowledge management, pp 287–294 ACM, NewYork (2003)
41 Xie, Z.-j., Chen, H., Li, C.: MFIS—mining frequent itemsets on data streams In:
Li, X., Za¨ıane, O.R., Li, Z (eds.) ADMA 2006 LNCS, vol 4093, pp 1085–1093.Springer, Heidelberg (2006)
42 Khoury, I., El-Mawas, R.M., El-Rawas, O., Mounayar, E.F., Artail, H.: An efficientweb page change detection system based on an optimized Hungarian algorithm.IEEE Trans Knowl Data Eng 19(5), 599–613 (2007)
43 Koh, J.-L., Shieh, S.-F.: An efficient approach for maintaining association rulesbased on adjusting FP-tree structures1 In: Lee, Y., Li, J., Whang, K.-Y., Lee, D.(eds.) DASFAA 2004 LNCS, vol 2973, pp 417–424 Springer, Heidelberg (2004)
44 Laxman, S., Sastry, P.S., Unnikrishnan, K.P.: A fast algorithm for finding frequentepisodes in event streams In: KDD 2007: Proceedings of the 13th ACM SIGKDDinternational conference on Knowledge discovery and data mining, pp 410–419.ACM, New York (2007)
45 Lee, D., Lee, W.: Finding maximal frequent itemsets over online data streamsadaptively In: ICDM 2005: Proceedings of the 5th IEEE International Conference
on Data Mining, pp 266–273 IEEE Computer Society, Los Alamitos (2005)
46 Leung, C.K.-S., Khan, Q.I.: DSTree: A tree structure for the mining of frequentsets from data streams In: Perner, P (ed.) ICDM 2006: Proceedings of the SixthInternational Conference on Data Mining, pp 928–932 IEEE Computer Society,Los Alamitos (2006)
47 Leung, C.K.-S., Khan, Q.I., Hoque, T.: CanTree: A tree structure for efficientincremental mining of frequent patterns In: ICDM 2005: Proceedings of the 5thIEEE International Conference on Data Mining, pp 274–281 IEEE ComputerSociety, Los Alamitos (2005)
48 Li, H.-F., Lee, S.-Y., Shan, M.-K.: On mining webclick streams for path traversalpatterns In: WWW Alt 2004: Proceedings of the 13th international World WideWeb conference on Alternate track papers & posters, pp 404–405 ACM, New York(2004)
49 Li, H.-F., Lee, S.-Y., Shan, M.-K.: DSM-TKP: Mining top-k path traversal patternsover web click-streams In: WI 2005: Proceedings of the 2005 IEEE/WIC/ACM In-ternational Conference on Web Intelligence, pp 326–329 IEEE Computer Society,Los Alamitos (2005)
Trang 2922 P.I Hofgesang
50 Li, H.-F., Lee, S.-Y., Shan, M.-K.: DSM-PLW: single-pass mining of path traversalpatterns over streaming web click-sequences Comput Netw 50(10), 1474–1487(2006)
51 Liu, B.: Web Data Mining Springer, Heidelberg (2007)
52 Liu, L., Pu, C., Tang, W.: WebCQ-detecting and delivering information changes
on the web In: CIKM 2000: Proceedings of the ninth international conference
on Information and knowledge management, pp 512–519 ACM Press, New York(2000)
53 Masseglia, F., Poncelet, P., Teisseire, M.: Web usage mining: How to efficientlymanage new transactions and new clients In: Zighed, D.A., Komorowski, J.,
˙Zytkow, J.M (eds.) PKDD 2000 LNCS, vol 1910, pp 530–535 Springer, delberg (2000)
Hei-54 Mobasher, B., Dai, H., Luo, T., Sun, Y., Zhu, J.: Integrating web usage and contentmining for more effective personalization In: Bauknecht, K., Madria, S.K., Pernul,
G (eds.) EC-Web 2000 LNCS, vol 1875, pp 165–176 Springer, Heidelberg (2000)
55 Nasraoui, O., Cerwinske, J., Rojas, C., Gonz´alez, F.A.: Performance of mendation systems in dynamic streaming environments In: SDM 2007 SIAM,Philadelphia (2007)
recom-56 Nasraoui, O., Rojas, C., Cardona, C.: A framework for mining evolving trends inweb data streams using dynamic learning and retrospective validation ComputerNetworks 50(10), 1488–1512 (2006)
57 Nasraoui, O., Soliman, M., Saka, E., Badia, A., Germain, R.: A web usage miningframework for mining evolving user profiles in dynamic web sites IEEE Trans.Knowl Data Eng 20(2), 202–215 (2008)
58 Nasraoui, O., Uribe, C.C., Coronel, C.R., Gonz´alez, F.A.: TECNO-STREAMS:Tracking evolving clusters in noisy data streams with a scalable immune systemlearning model In: ICDM 2003: Proceedings of the 3rd IEEE International Confer-ence on Data Mining, pp 235–242 IEEE Computer Society, Los Alamitos (2003)
59 Nguyen, S.N., Sun, X., Orlowska, M.E.: Improvements of incSpan: Incrementalmining of sequential patterns in large database In: Ho, T.-B., Cheung, D., Liu, H.(eds.) PAKDD 2005 LNCS, vol 3518, pp 442–451 Springer, Heidelberg (2005)
60 Ntoulas, A., Cho, J., Olston, C.: What’s new on the web?: the evolution of theweb from a search engine perspective In: WWW 2004: Proceedings of the 13thinternational conference on World Wide Web, pp 1–12 ACM, New York (2004)
61 Parthasarathy, S., Zaki, M.J., Ogihara, M., Dwarkadas, S.: Incremental and active sequence mining In: CIKM 1999: Proceedings of the eighth internationalconference on Information and knowledge management, pp 251–258 ACM Press,New York (1999)
inter-62 Perkowitz, M., Etzioni, O.: Adaptive web sites: automatically synthesizing webpages In: AAAI 1998/IAAI 1998: Proceedings of the fifteenth national/tenth con-ference on Artificial intelligence/Innovative applications of artificial intelligence,
pp 727–732 American Association for Artificial Intelligence, Menlo Park (1998)
63 Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web usagemining as a tool for personalization: A survey User Modeling and User-AdaptedInteraction 13(4), 311–372 (2003)
64 Roddick, J.F., Spiliopoulou, M.: A survey of temporal knowledge discoveryparadigms and methods IEEE Transactions on Knowledge and Data Engineer-ing 14(4), 750–767 (2002)
Trang 3065 Rojas, C., Nasraoui, O.: Summarizing evolving data streams using dynamic prefixtrees In: WI 2007: Proceedings of the IEEE/WIC/ACM International Conference
on Web Intelligence, pp 221–227 IEEE Computer Society, Washington (2007)
66 Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R.: MONIC: modeling andmonitoring cluster transitions In: Proceedings of the Twelfth ACM SIGKDD In-ternational Conference on Knowledge Discovery and Data Mining, pp 706–711.ACM, New York (2006)
67 Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N.: Web usage mining: Discoveryand applications of usage patterns from web data SIGKDD Explorations 1(2), 12–
Nas-70 Wang, K.: Discovering patterns from large and dynamic sequential data J Intell.Inf Syst 9(1), 33–56 (1997)
71 Weinreich, H., Obendorf, H., Herder, E., Mayer, M.: Not quite the average: Anempirical study of web use ACM Trans Web 2(1), 1–31 (2008)
72 Widmer, G., Kubat, M.: Learning in the presence of concept drift and hiddencontexts Machine Learning 23(1), 69–101 (1996)
73 Wu, E.H., Ng, M.K., Huang, J.Z.: On improving website connectivity by usingweb-log data streams In: Lee, Y., Li, J., Whang, K.-Y., Lee, D (eds.) DASFAA
2004 LNCS, vol 2973, pp 352–364 Springer, Heidelberg (2004)
74 Wu, E.H., Ng, M.K., Yip, A.M., Chan, T.F.: A clustering model for mining ing web user patterns in data stream environment In: Yang, Z.R., Yin, H., Ever-son, R.M (eds.) IDEAL 2004 LNCS, vol 3177, pp 565–571 Springer, Heidelberg(2004)
evolv-75 Yen, S.-J., Lee, Y.-S., Hsieh, M.-C.: An efficient incremental algorithm for miningweb traversal patterns In: ICEBE 2005: Proceedings of the IEEE InternationalConference on e-Business Engineering, pp 274–281 IEEE Computer Society, LosAlamitos (2005)
I.-H. Ting, H.-J. Wu (Eds.): Web Mining Appl. in E-Commerce & E-Services, SCI 172, pp. 25–43. springerlink.com © Springer-Verlag Berlin Heidelberg 2009
Gulden Uchyigit
Department of Computer Science and Mathematics, University of Brighton
unprecedented rate, making it very difficult for users to find interesting information. This situation is likely to worsen in the future unless end users have tools available to assist them. Web personalization is a research area which has received great attention in recent years. Web personalization aims to assist users with the information overload problem. One area of web personalization is the so-called recommender systems. Recommender systems make recommendations based on users' individual profiles. Traditionally, user profiles are keyword-based: they work on the premise that items which match certain keywords found in the user's profile will be of interest and relevance to the user, so those items are recommended to the user.
One of the problems with keyword-based profile representation methods is that a lot of useful information is lost during the pre-processing phase. To overcome this problem, eliciting and utilizing semantic information from the domain, rather than individual keywords, within all stages of the personalization process can enhance personalization.
This chapter presents a state-of-the-art survey of the techniques which can be used to semantically enhance the data processing, user modeling and recommendation stages of the personalization process.
1 Introduction
Personalization technologies have been a popular tool for assisting users with the information overload problem. As the number of services and the volume of content continue to grow, personalization technologies are more in demand than ever. Over the years they have been deployed in several different domains, including the entertainment domain and e-commerce.
In recent years, developments in extending the Web with semantic knowledge, in an attempt to gain a deeper insight into the meaning of the data being created, stored and exchanged, have taken the Web to a different level. This has led to the development of semantically rich descriptions to achieve improvements in the area of personalization technologies (Pretschner and Gauch, 2004).
Traditional approaches to personalization include the content-based method (Armstrong et al., 1995), (Balabanovic and Shoham, 1997), (Liberman, 1995), (Mladenic, 1996), (Pazzani and Billsus, 1997), (Lang, 1995). These systems generally infer a user's profile from the contents of the items the user has previously seen and rated. Incoming information is then compared with the user's profile, and those items which are similar to the user's profile are assumed to be of interest to the user and are recommended.
A traditional method for determining whether information matches a user's interests is keyword matching. If a user's interests are described by certain keywords, then the assumption is made that information containing those keywords should be relevant and of interest to the user. Such methods may match a lot of irrelevant information as well as relevant information, mainly because any item which matches the selected keywords will be assumed interesting regardless of its context. For instance, if the word learning exists in a paper about student learning (from the educational literature), then a paper on machine learning (from the artificial intelligence literature) will also be recommended. In order to overcome such problems, it is important to model the semantic meaning of the data in the domain. In recent years, ontologies have been very popular in achieving this.
Ontologies are formal explicit descriptions of concepts and their relationships within a domain. Ontology-based representations are richer, more precise and less ambiguous than ordinary keyword-based or item-based approaches (Middleton et al., 2002). For instance, they can overcome the problem of similar concepts by helping the system understand the relationship between the different concepts within the domain. For example, to find a job as a doctor, an ontology may suggest relevant related terms such as clinician and medicine. Utilizing such semantic information provides a more precise understanding of the application domain, and provides a better means to define the user's needs, preferences and activities with regard to the system, hence improving the personalization process.
2 Background
Web personalization is a popular technique for assisting with the complex process of information discovery on the World Wide Web. Web personalization is of importance both to the service provider and to the end user interacting with the web site. For the service provider, it is used to develop a better understanding of the needs of their customers, so as to improve the design of their web sites. For end users, web personalization is important because they are given customized assistance whilst they are interacting with a web site.
More recently, web usage mining has been used as the underlying approach to web personalization (Mobasher et al., 2004). The goal of web usage mining is to capture and model users' behavioral patterns as they interact with the web site, and to use this data during the personalization process. Web usage patterns reveal the web pages frequently accessed by users of the web site who are in search of a particular piece of information. Using such information, service providers can better understand which information their users are searching for, and how they can assist users during their search by improving the organization and structure of the web site.
Mobasher (Mobasher et al., 2004) classifies web personalization into three groups: manual decision rule systems, content-based recommender systems and collaborative-based recommender systems. Manual decision rule systems allow the web site administrator to specify rules based on user demographics or static profiles (collected through a registration process). Content-based recommender systems make use of user profiles and make recommendations based on these profiles. Collaborative-based recommender systems make use of user ratings and give recommendations based on how other users in the group have rated similar items.
2.1 Recommender Systems
Over the past decade, recommender systems have become very successful in assisting with the information overload problem. They have been very popular in applications including e-commerce, entertainment and the news domains. Recommender systems fall into three main categories: collaborative, content-based and hybrid. The distinction lies in the manner in which the recommendations are made: how the items are perceived by a community of users; how the content of each item compares with the user's individual profile; or a combination of both methods. Collaborative-based systems take in user ratings and make recommendations based on how other users in the group have rated similar items; content-based filtering systems make recommendations based on users' profiles; and hybrid systems combine both the content-based and collaborative-based techniques.
Content-based systems automatically infer the user's profile from the contents of the items the user has previously seen and rated. These profiles are then used as inputs to a classification algorithm along with the new, unseen items from the domain. Those items which are similar in content to the user's profile are assumed to be interesting and are recommended to the user.
A popular and extensively used document and profile representation method employed by many information filtering methods, including the content-based method, is the so-called vector space representation (Chen and Sycara, 1998), (Mladenic, 1996), (Lang, 1995), (Moukas, 1996), (Liberman, 1995), (Kamba and Koseki, 1997), (Armstrong et al., 1995). Content-based systems have their roots in text filtering, and many of their techniques originate there. The content-based recommendation method was developed based on the text filtering model described by Oard (1997). In (Oard, 1997), a generic information filtering model is described as having four components:
a method for representing the documents within the domain; a method for representing the user's information need; a method for making the comparison; and a method for utilizing the results of the comparison process. The vector space method (Baeza-Yates and Ribeiro-Neto, 1999) considers that each document (or profile) is described as a set of keywords. The text document is viewed as a vector in n-dimensional space, n being the number of different words in the document set. Such a representation is often referred to as bag-of-words, because of the loss of word ordering and text structure (see Figure 2). The tuple of weights associated with each word, reflecting the significance of that word for a given document, gives the document's position in the vector space. The weights are related to the number of occurrences of each word within the document. The word weights in the vector space method are ultimately used to compute the degree of similarity between two feature vectors. This method can be used to decide whether a document, represented as a weighted feature vector, and a profile are similar. If they are similar, then an assumption is made that the document is relevant to the user. The vector space model evaluates the similarity of the document d_j with regard to a profile p as the correlation between the vectors d_j and p. This correlation can be quantified by the cosine of the angle between these two vectors. That is,

sim(d_j, p) = cos(d_j, p) = (d_j · p) / (||d_j|| ||p||)
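As an illustration of the comparison just described, the following sketch (a minimal illustration with hypothetical texts; raw term counts stand in for the word weights) computes the cosine similarity between document vectors and a profile vector:

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())

def cosine_similarity(doc, profile):
    """sim(d, p) = (d . p) / (||d|| ||p||), computed over the shared vocabulary."""
    dot = sum(doc[w] * profile[w] for w in doc.keys() & profile.keys())
    norm_d = math.sqrt(sum(c * c for c in doc.values()))
    norm_p = math.sqrt(sum(c * c for c in profile.values()))
    if norm_d == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_d * norm_p)

# Hypothetical profile and documents, for illustration only.
profile = bag_of_words("machine learning artificial intelligence")
doc_a = bag_of_words("a survey of machine learning methods")
doc_b = bag_of_words("student learning in the classroom")

sim_a = cosine_similarity(doc_a, profile)  # shares "machine" and "learning"
sim_b = cosine_similarity(doc_b, profile)  # shares only "learning"
```

Under this scheme doc_a scores higher than doc_b against the profile, so it would be ranked as the more relevant document.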
Collaborative-based systems (Terveen et al., 1997), (Breese et al., 1998), (Konstan et al., 1997), (Balabanovic and Shoham, 1997) were proposed as an alternative to the content-based methods. The basic idea is to move beyond the experience of an individual user profile and instead draw on the experiences of a population or community of users. Collaborative-based systems (Herlocker et al., 1999), (Konstan et al., 1997), (Terveen et al., 1997), (Kautz et al., 1997), (Resnick and Varian, 1997) are built on the assumption that a good way to find interesting content is to find other people who have similar tastes, and recommend the items that those users like. Typically, each target user is associated with a set of nearest-neighbour users by comparing the profile information provided by the target user to the profiles of other users. These users then act as recommendation partners for the target user, and items that occur in their profiles can be recommended to the target user. In this way, items are recommended on the basis of user similarity rather than item similarity. Collaborative recommender systems have several shortcomings, one of which is that users will be recommended new items only if their ratings agree with those of other people within the community. Also, if a new item has not been rated by anyone in the community, it will not get recommended.
To overcome the problems posed by pure content-based and collaborative-based recommender systems, hybrid recommender systems have been proposed. Hybrid systems combine two or more recommendation techniques to overcome the shortcomings of each individual technique (Balabanovic, 1998), (Balabanovic and Shoham, 1997), (Burke, 2002), (Claypool et al., 1999). These systems generally use the content-based component to overcome the new-item start-up problem: if a new item is present, it can still be recommended regardless of whether it has been seen and rated. The collaborative component overcomes the problem of over-specialization, as is the case with pure content-based systems.
2.2 Content-Based Recommender Systems
Content-based recommender systems have been very popular over the past decade. They have mainly been employed in textual domains, and have their roots in information retrieval and text mining. Oard (Oard, 1997) presents a generic information filtering model that is described as having four components: a method for representing the documents within the domain; a method for representing the user's information need; a method for making the comparison; and a method for utilizing the results of the comparison process. Oard's model described text filtering as the process of automating the user's judgments of new textual documents, where the same representation methods are used both for the user profile and for the documents within the domain. The goal of the text filtering model is to automate the filtering process so that the results of the automated comparison process are equal to the user's judgment of the documents.
Content-based systems automatically infer the user's profile from the contents of the documents the user has previously seen and rated. These profiles are then used as input to a classification algorithm along with the new, unseen documents from the domain. Those documents which are similar in content to the user's profile are assumed to be interesting and are recommended to the user.
A popular and extensively used document and profile representation method employed by many information filtering methods, including the content-based method, is the so-called vector space representation (Chen and Sycara, 1998), (Mladenic, 1996), (Lang, 1995), (Moukas, 1996), (Liberman, 1995), (Kamba and Koseki, 1997), (Armstrong et al., 1995). The vector space method (Baeza-Yates and Ribeiro-Neto, 1999) considers that each document (or profile) is described as a set of keywords. The text document is viewed as a vector in n-dimensional space, n being the number of different words in the document set. Such a representation is often referred to as bag-of-words, because of the loss of word ordering and text structure. The tuple of weights associated with each word, reflecting the significance of that word for a given document, gives the document's position in the vector space. The weights are related to the number of occurrences of each word within the document. The word weights in the vector space method are ultimately used to compute the degree of similarity between two feature vectors. This method can be used to decide whether a document, represented as a weighted feature vector, and a profile are similar. If they are similar, then an assumption is made that the document is relevant to the user. The vector space model evaluates the similarity of the document d_j with regard to a profile p as the correlation between the vectors d_j and p, which can be quantified by the cosine of the angle between these two vectors.
2.3 Collaborative-Based Recommender Systems
Collaborative-based systems (Terveen et al., 1997), (Breese et al., 1998), (Konstan et al., 1997), (Balabanovic and Shoham, 1997) are an alternative to the content-based methods. The basic idea is to move beyond the experience of an individual user profile and instead draw on the experiences of a population or community of users. Collaborative-based systems (Herlocker et al., 1999), (Konstan et al., 1997), (Terveen et al., 1997), (Kautz et al., 1997), (Resnick and Varian, 1997) are built on the assumption that a good way to find interesting content is to find other people who have similar tastes, and recommend the items that those users like. Typically, each target user is associated with a set of nearest-neighbour users by comparing the profile information provided by the target user to the profiles of other users. These users then act as recommendation partners for the target user, and items that occur in their profiles can be recommended to the target user. In this way, items are recommended on the basis of user similarity rather than item similarity. Content-based systems suffer from shortcomings in the way they select items for recommendation: items are recommended only if the user has seen and liked similar items in the past.
A user profile effectively delimits a region of the item space from which future recommendations will be drawn. Therefore, future recommendations will display limited diversity. This is particularly problematic for new users, since their recommendations will be based on a very limited set of items represented in their
immature profiles. Items relevant to a user, but bearing little resemblance to the snapshot of items the user has looked at in the past, will never be recommended in the future. Collaborative filtering techniques try to overcome these shortcomings of content-based systems. However, collaborative filtering alone can prove ineffective for several reasons (Claypool et al., 1999). For instance, the early-rater problem arises when a prediction cannot be provided for a given item because it is new and has not yet been rated, and therefore cannot be recommended. The sparsity problem arises due to the sparse nature of the ratings within the rating matrices, making the recommendations inaccurate. The grey-sheep problem arises when there are individuals who do not benefit from the collaborative recommendations because their opinions do not consistently agree or disagree with those of other people in the community.
To overcome the problems posed by pure content-based and collaborative-based recommender systems, hybrid recommender systems have been proposed. Hybrid systems combine two or more recommendation techniques to overcome the shortcomings of each individual technique (Balabanovic, 1998), (Balabanovic and Shoham, 1997), (Burke, 2002), (Claypool et al., 1999). These systems generally use the content-based component to overcome the new-item start-up problem: if a new item is present, it can still be recommended regardless of whether it has been seen and rated. The collaborative component overcomes the problem of over-specialization, as is the case with pure content-based systems.
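One simple way to realize such a hybrid, sketched here under the assumption of a weighted score combination (one of several hybridization strategies; the scores and weight below are hypothetical illustrations, not values from the chapter):

```python
# A minimal sketch of a weighted hybrid recommender: the final score is a
# weighted combination of a content-based score and a collaborative score.

def hybrid_score(content_score, collab_score, n_ratings, alpha=0.5):
    """Blend the two scores; fall back to the content score alone for items
    with no community ratings (the new-item start-up problem)."""
    if n_ratings == 0:
        return content_score          # no collaborative evidence yet
    return alpha * content_score + (1 - alpha) * collab_score

# A brand-new item is still recommendable via its content score:
new_item = hybrid_score(content_score=0.9, collab_score=0.0, n_ratings=0)
# An established item blends both signals:
old_item = hybrid_score(content_score=0.4, collab_score=0.8, n_ratings=120)
```

The fallback branch mirrors the text above: the content-based component keeps new items recommendable, while the collaborative term diversifies recommendations for items the community has already rated.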
2.4 The Semantic Web
The semantic web is an extension of the current Web which aims to provide an easier way to find, share, reuse and combine information. It extends Web documents by adding new data and metadata to them; this extension is what enables Web documents to be processed automatically by machines. To do this, RDF (Resource Description Framework) is used to turn basic Web data into structured data. RDF works on Web pages and also inside applications; it builds on XML technology's capability to define customized tagging schemes, together with RDF's flexible approach to representing data. RDF is a general framework for describing a Web site's metadata, that is, the information about the information on the site. It provides interoperability between applications that exchange machine-understandable information on the Web.
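As a minimal illustration of RDF's subject-property-value structure, the following sketch parses a small, hypothetical RDF/XML fragment with Python's standard library and lists the triples it contains (a real application would use a dedicated RDF library; the resource URI and Dublin Core properties are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RDF/XML fragment: metadata (a Dublin Core title
# and creator) attached to a web resource.
rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>Example Page</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Each rdf:Description element names a subject; its child elements are
# (subject, property, value) triples -- the structured data RDF adds.
triples = []
root = ET.fromstring(rdf_xml)
for desc in root.findall(RDF + "Description"):
    subject = desc.get(RDF + "about")
    for prop in desc:
        triples.append((subject, prop.tag, prop.text))
```

The resulting triples are exactly the machine-readable statements ("this page has this title", "this page has this creator") that applications can exchange and interpret.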
RDF Schema (RDFS)
RDFS is used to create vocabularies that describe groups of related RDF resources and the relationships between those resources. An RDFS vocabulary defines the allowable properties that can be assigned to RDF resources within a given domain. RDFS also allows for the creation of classes of resources that share common properties. In an RDFS vocabulary, resources are defined as instances of classes. A class is a resource too, and any class can be a subclass of another. This hierarchical semantic information is what allows machines to determine the meanings of resources based on their properties and classes.
Web Ontology Language (OWL)
OWL is a W3C specification for creating Semantic Web applications. Building upon RDF and RDFS, OWL defines the types of relationships that can be expressed in RDF, using an XML vocabulary to indicate the hierarchies and relationships between different resources. In fact, this is the very definition of "ontology" in the context of the Semantic Web: a schema that formally defines the hierarchies and relationships between different resources. Semantic Web ontologies consist of a taxonomy and a set of inference rules from which machines can draw logical conclusions.
A taxonomy in this context is a system of classification, such as the scientific kingdom/phylum/class/order system for classifying plants and animals, that groups resources into classes and subclasses based on their relationships and shared properties.
Since taxonomies express the hierarchical relationships that exist between resources, we can use OWL to assign properties to classes of resources and allow their subclasses to inherit the same properties. OWL also makes use of XML Schema data types and supports class axioms such as subClassOf and disjointWith, and class descriptions such as unionOf and intersectionOf. Many other advanced concepts are included in OWL, making it the richest standard ontology description language available today.
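The property inheritance just described can be made concrete with a toy taxonomy in plain Python. This is not an OWL reasoner, only an illustration of how properties flow down a subclass chain; the class names and properties are invented for the example:

```python
# Toy taxonomy: each class maps to (parent, own properties).
# Class names and properties are illustrative only.
taxonomy = {
    "Animal": (None,     {"alive": True}),
    "Mammal": ("Animal", {"has_fur": True}),
    "Dog":    ("Mammal", {"barks": True}),
}

def properties(cls):
    """Collect the properties a class inherits along its subclass chain."""
    props = {}
    while cls is not None:
        parent, own = taxonomy[cls]
        for key, value in own.items():
            props.setdefault(key, value)  # nearer classes take precedence
        cls = parent
    return props
```

Asking for the properties of "Dog" yields its own properties plus those inherited from "Mammal" and "Animal", mirroring how an OWL reasoner propagates properties through subClassOf relations.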
3 Data Preparation: Ontology Learning, Extraction and Pre-processing
As previously described, personalization techniques such as the content-based method extensively employ the vector space representation. This data representation technique is popular because of its simplicity and efficiency. However, it has the disadvantage that much useful information is lost during the representation phase, since the sentence structure is broken down into individual words. To minimize this loss of information, it is important to retain the relationships between the words. One popular technique for doing this is to use conceptual hierarchies. In this section we present an overview of the existing techniques, algorithms and methodologies which have been employed for ontology learning.
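The vector space representation referred to above can be sketched as a simple bag-of-words with cosine similarity. This is a minimal illustration (no stemming, stop-word removal or TF-IDF weighting), and it makes visible the information loss discussed: word order and sentence structure are discarded entirely.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term-frequency vector: sentence structure is lost."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = bow("web mining for e commerce")
b = bow("web mining for e services")
```

Here the two sentences share four of five tokens, so their cosine similarity is high even though "commerce" and "services" carry the distinguishing meaning; conceptual hierarchies aim to recover such lost relationships.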
The main component of ontology learning is the construction of the concept hierarchy. Concept hierarchies are useful because they are an intuitive way to describe information (Lawrie and Croft, 2000). Generally, hierarchies are created manually by domain experts. This is a very cumbersome process requiring specialized knowledge, which necessitates tools for their automatic generation. Research into automatically constructing a hierarchy of concepts directly from data is extensive and includes work from a number of research areas, including machine learning, natural language processing and statistical analysis. One approach is to attempt to induce word categories directly from a corpus based on statistical co-occurrence (Evans et al., 1991), (Finch and Chater, 1994), (McMahon and Smith, 1996), (Nanas et al., 2003a). Another approach is to merge existing linguistic resources such as dictionaries and thesauri (Klavans et al., 1992), (Knight and Luk, 1994), or to tune a thesaurus (e.g. WordNet) using a corpus (Miller et al., 1990a). Other methods use natural language processing (NLP) techniques to extract phrases and keywords from text (Sanderson and Croft, 1999), or map concepts onto an already constructed hierarchy such as Yahoo!
The remaining parts of this section present the machine learning approaches and natural language processing approaches used for ontology learning.
3.1 Machine Learning Approaches
Learning ontologies from unstructured text is not an easy task. The system needs to automatically extract the concepts within the domain as well as the relationships between the discovered concepts. Machine learning approaches, in particular clustering techniques, rule-based techniques, fuzzy logic and formal concept analysis, have been very popular for this purpose. This section presents an overview of the machine learning approaches which have been popular for discovering ontologies from unstructured text.
3.1.1 Clustering Algorithms
Clustering algorithms are very popular in ontology learning. They function by grouping instances together based on their similarity. Clustering algorithms can be divided into hierarchical and non-hierarchical methods. Hierarchical methods construct a tree in which each node represents a subset of the input items (documents) and the root represents the entire item set. Hierarchical methods can in turn be divided into divisive and agglomerative methods. Divisive methods begin with the entire set of items and partition it until only individual items remain. Agglomerative methods work in the opposite way: each item starts as its own cluster, and clusters are merged until a single cluster remains. At the first step of a hierarchical agglomerative clustering (HAC) algorithm, when each instance represents its own cluster, the similarities between clusters are simply the similarities between the instances themselves; thereafter, a chosen linkage rule determines the similarity of the newly merged clusters to each other. Various rules can be applied depending on the data; some of these measures are described below:
Single-Link: In this method the similarity of two clusters is determined by the similarity of the two closest (most similar) instances in the different clusters. So for each pair of clusters S_i and S_j,

sim(S_i, S_j) = max{ cos(d_i, d_j) : d_i ∈ S_i, d_j ∈ S_j }   (2)
Complete-Link: In this method the similarity of two clusters is determined by the similarity of the two least similar instances of the two clusters. This approach performs well in cases where the data forms natural, distinct categories, since it tends to produce tight (cohesive) spherical clusters. It is calculated as:

sim(S_i, S_j) = min{ cos(d_i, d_j) : d_i ∈ S_i, d_j ∈ S_j }   (3)
Average-Link or Group Average: In this method, the similarity between two clusters is calculated as the average similarity between all pairs of instances in the two clusters, i.e. it is an intermediate solution between complete-link and single-link. It can be unweighted, or weighted by the size of the clusters. The weighted form is calculated as:

sim(S_i, S_j) = (1 / (n_i n_j)) ∑_{d_i ∈ S_i, d_j ∈ S_j} cos(d_i, d_j)   (4)

where n_i and n_j refer to the sizes of S_i and S_j respectively.
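The three linkage rules can be dropped into one naive agglomerative loop. The sketch below is illustrative rather than efficient (it recomputes all pairwise similarities each round); for brevity it clusters one-dimensional points with negative distance as the similarity function, instead of document vectors with cosine similarity:

```python
def hac(items, sim, linkage="single"):
    """Naive agglomerative clustering: repeatedly merge the two most
    similar clusters until one remains; returns the merge history
    (lists of item indices)."""
    clusters = [[i] for i in range(len(items))]

    def cluster_sim(a, b):
        sims = [sim(items[i], items[j]) for i in a for j in b]
        if linkage == "single":       # most similar pair, eq. (2)
            return max(sims)
        if linkage == "complete":     # least similar pair
            return min(sims)
        return sum(sims) / len(sims)  # group-average link

    merges = []
    while len(clusters) > 1:
        a, b = max(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append(sorted(a + b))
    return merges

# Toy data: two natural groups; closer numbers are more similar.
points = [1.0, 1.1, 5.0, 5.2]
history = hac(points, lambda x, y: -abs(x - y), linkage="single")
```

The merge history forms the concept hierarchy: the two tight groups are joined first, and the root cluster containing everything is created last.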
Hierarchical clustering methods are popular for ontology learning because they naturally discover the concept hierarchy during the clustering process. Scatter/Gather (Lin and Pantel, 2001) is one of the earlier methods in which clustering is used to create document hierarchies. More recently, new types of hierarchies have been introduced which rely on the terms used by a set of documents to expose some structure of the document collection. One such technique is lexical modification; another is subsumption.
3.1.2 Rule Learning Algorithms
These are algorithms that learn association rules or other attribute-based rules. The algorithms are generally based on a greedy search over the attribute-value tests that can be added to a rule while preserving its consistency with the training instances. The Apriori algorithm is a simple algorithm which learns association rules between objects. Apriori is designed to operate on databases containing transactions (for example, the collections of items bought by customers). As is common in association rule mining, given a set of item sets (for instance, sets of retail transactions, each listing the individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number S_c (the cutoff, or support threshold) of the item sets. Apriori uses a bottom-up approach: frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. One example of an ontology learning tool is OntoEdit (Maedche and Staab, 2001), which assists the ontology engineer during the ontology creation process. It semi-automatically learns to construct an ontology from unstructured text using a method for discovering generalized association rules. The input data for the learner is a set of transactions, each of which consists of a set of items that appear together in that transaction. The algorithm extracts association rules represented by sets of items that occur together sufficiently often and presents these rules to the knowledge engineer. For example, a shopping transaction may include the items purchased together; a generalized association rule may state that snacks are purchased together with drinks, rather than the more specific rule that crisps are purchased with beer.
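The bottom-up candidate-generation loop of Apriori can be sketched as follows. This is a minimal illustration that finds frequent itemsets only (rule extraction from those itemsets is a separate step); the basket contents are invented for the example:

```python
def apriori(transactions, min_support):
    """Find all itemsets appearing in at least min_support transactions.

    Frequent k-itemsets are extended one item at a time (candidate
    generation), and each candidate is re-counted against the data.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    while frequent:
        k = len(next(iter(frequent)))
        # Candidate generation: unions of frequent k-itemsets of size k+1.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

baskets = [{"beer", "crisps"}, {"beer", "crisps", "milk"}, {"beer", "milk"}]
freq = apriori(baskets, min_support=2)
```

With a support threshold of 2, the pairs {beer, crisps} and {beer, milk} survive, while {crisps, milk} (seen only once) is pruned, and with it every superset; this pruning is what makes the bottom-up search tractable.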
3.1.3 Fuzzy Logic
Fuzzy logic provides the opportunity to model systems that are inherently imprecisely defined. It is popular for modeling textual data because of the uncertainty present in such data. Fuzzy logic is built on the theory of fuzzy sets. Fuzzy set theory deals with the representation of classes whose boundaries are not well defined. The key idea is to associate a membership function with the elements of a class. The