Analysis of Dwell Times in Web Usage Mining
Patrick Mair1 and Marcus Hudec2
1 Department of Statistics and Mathematics
2 ec3
Abstract. In this contribution we focus on the dwell times a user spends on various areas of a web site within a session. We assume that dwell times may be adequately modeled by a Weibull distribution, which is a flexible and common approach in survival analysis. Furthermore, we introduce heterogeneity by various parameterizations of the dwell time densities by means of proportional hazards models. According to these assumptions the observed data stem from a mixture of Weibull densities. Estimation is based on the EM-algorithm and model selection may be guided by the BIC. Identification of mixture components corresponds to a segmentation of users/sessions. A real life data set stemming from the analysis of a worldwide operating eCommerce application is provided. The corresponding computations are performed with the mixPHM package in R.
1 Introduction
Web Usage Mining focuses on the analysis of the visiting behavior of users on a web site. A common starting point are the so-called click-stream data, which are derived from web-server logs and may be viewed as the electronic trace a user leaves on a web site. Adequate modeling of the dynamics of browsing behavior is of particular relevance for the optimization of eCommerce applications. Recently, Montgomery et al. (2004) proposed a dynamic multinomial probit model of navigation patterns which led to a remarkable increase of conversion rates. Park and Fader (2004) developed multivariate exponential-gamma models which enhance cross-site customer acquisition. These papers indicate the potential that such approaches offer for web-shop providers.
In this paper we focus on modeling dwell times, i.e., the time a user spends viewing a particular page impression. They are defined by the time span between two subsequent page requests and can be calculated by taking the difference between the two logged time points at which the page requests were issued. For the analysis
of complex web sites which consist of a large number of pages, it is often reasonable to reduce the number of different pages by aggregating individual page impressions to semantically related page categories reflecting meaningful regions of the web site.
Analysis of dwell times is an important source of information with regard to the relevance of the content for different users and the effectiveness of a page in attracting visitors. In this paper we are particularly interested in the segmentation of users into groups which exhibit similar behavior with regard to the dwell times they spend on various areas of the site. Such a segmentation analysis is an important step towards a better understanding of the way a user interacts with a web site. It is therefore of relevance with regard to the prediction of user behavior as well as for a user-specific customization or even personalization of web sites.
2 Model specification and estimation
2.1 Weibull mixture model
Since survival analysis focuses on duration times until some event occurs (e.g., the death of a patient in medical applications), it seems straightforward to apply these concepts to the analysis of dwell times in web usage mining applications.
With regard to dwell time distributions we assume that they follow a Weibull distribution with density function $f(t) = \lambda\gamma t^{\gamma-1}\exp(-\lambda t^{\gamma})$, where $\lambda$ is a scale parameter and $\gamma$ the shape parameter. For modeling the heterogeneity of the observed population, we assume K latent segments of sessions. While the Weibull assumption holds within all segments, different segments exhibit different parameter values. This leads to the underlying idea of a Weibull mixture model. For each page category p (p = 1, ..., P) under consideration the resulting mixture has the following form

\[ f(t_p) = \sum_{k=1}^{K} \pi_k\, \lambda_{k,p}\gamma_{k,p}\, t_p^{\gamma_{k,p}-1}\exp\bigl(-\lambda_{k,p} t_p^{\gamma_{k,p}}\bigr), \]

where $t_p$ represents the dwell time on page category p, with mixing proportions $\pi_k$ which correspond to the relative size of each segment.
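As a purely illustrative sketch (not part of the original analysis; all parameter values below are made up), such a mixture density for one page category could be evaluated in R as follows, using the same parameterization as above:

```r
## Weibull density in the parameterization f(t) = lambda*gamma*t^(gamma-1)*exp(-lambda*t^gamma)
dweib <- function(t, lambda, gamma) {
  lambda * gamma * t^(gamma - 1) * exp(-lambda * t^gamma)
}

## K-component mixture density for one page category;
## pi_k, lambda_k, gamma_k are vectors of length K (one entry per segment)
mix_density <- function(t, pi_k, lambda_k, gamma_k) {
  rowSums(sapply(seq_along(pi_k), function(k) {
    pi_k[k] * dweib(t, lambda_k[k], gamma_k[k])
  }))
}

## Hypothetical example with K = 2 segments (illustrative values only)
t_grid <- seq(1, 300, by = 1)                      # dwell times in seconds
f_mix  <- mix_density(t_grid, pi_k = c(0.6, 0.4),
                      lambda_k = c(0.05, 0.01),
                      gamma_k  = c(0.9, 1.4))
```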
In order to reduce the number of parameters involved, we impose restrictions on the hazard rates of the different components of the mixture and pages, respectively. An elegant way of doing this is offered by the concept of Weibull proportional hazards models (WPHM). The general formulation of a WPHM (see, e.g., Kalbfleisch and Prentice (1980)) is

\[ h(t; Z) = \lambda\gamma t^{\gamma-1}\exp(Z\beta), \]

where Z is a matrix of covariates and β are the regression parameters. The term λγt^{γ−1} is the baseline hazard rate h_0(t) due to the Weibull assumption, and h(t;Z) is the hazard proportional to h_0(t) resulting from the regression part of the model.
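As a small sketch (again not from the paper; function name, covariates, and values are hypothetical), this proportional hazards structure can be written directly in R:

```r
## Weibull proportional hazards: h(t; Z) = h0(t) * exp(Z %*% beta),
## with Weibull baseline hazard h0(t) = lambda * gamma * t^(gamma - 1)
wph_hazard <- function(t, Z, beta, lambda, gamma) {
  h0 <- lambda * gamma * t^(gamma - 1)   # baseline hazard
  as.vector(h0 * exp(Z %*% beta))        # covariates shift the hazard proportionally
}

## Hypothetical example: one binary covariate, beta = 0.5, evaluated at t = 10
Z <- matrix(c(0, 1), ncol = 1)           # two observations
wph_hazard(t = 10, Z = Z, beta = 0.5, lambda = 0.05, gamma = 1.2)
```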
2.2 Parsimonious modeling strategies
We propose five different models with respect to different proportionality restrictions in the hazard rates, so as to reduce the number of parameters. In the mixPHM package by Mair and Hudec (2007) the most general model is called separate: the WPHM is computed for each component and page separately. Hence, the hazard of session i belonging to component k (k = 1, ..., K) on page category p (p = 1, ..., P) is

\[ h(t_{i,p}) = \lambda_{k,p}\,\gamma_{k,p}\, t_{i,p}^{\gamma_{k,p}-1}. \]

The parameter matrices can be represented jointly as the K × P matrices Λ = (λ_{k,p}) and Γ = (γ_{k,p}); the parameters (2 × K × P in total) are the same as if they were estimated directly by using a Weibull mixture model.
Next, we impose a proportionality assumption across the latent components. In the classification version of the EM-algorithm (see next section) we have, in each iteration step, a "crisp" assignment of each session to a component. Thus, if we consider this component vector g as a main effect in the WPHM, i.e., h(t; g), we impose proportional hazards for the components across the pages (main.g in mixPHM). Again, the elements of the matrix Λ of scale parameters can vary freely, whereas the shape parameter matrix reduces to the vector Γ = (γ_{1,1}, ..., γ_{1,P}). Thus, the shape parameters are constant over the components and the number of parameters is reduced to K × P + P.
If we impose page main effects in the WPHM, i.e., h(t; p) or main.p, respectively, then, as before, the elements of Λ are not restricted at all, but this time the shape parameters are constant over the pages, i.e., Γ = (γ_{1,1}, ..., γ_{K,1}). The total number of parameters is K × P + K.
The c- and d-scalars are proportionality constants over the pages and components, respectively. The shape parameters are constant over the components and pages. Thus, Γ reduces to one shape parameter γ, which implies that the hazard rates are proportional over components and pages.
To relax the rather restrictive assumption with respect to Λ, we can extend the main effects model by the corresponding component-page interaction term, i.e., h(t; g ∗ p). In mixPHM notation this model is called int.gp. The elements of Λ can vary freely, whereas Γ is again reduced to one parameter only, leaving us with a total number of K × P + 1 parameters. With respect to the hazard rate this relaxation again implies proportional hazards over components and pages.
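To make the effect of these restrictions concrete, the following R sketch tallies the resulting parameter counts for an example with K components and P page categories; the count for main.p is inferred by symmetry from the text above and should be treated as an assumption:

```r
## Number of Weibull parameters under the proportionality restrictions
## discussed above, for K components and P page categories
K <- 5; P <- 7
c(separate = 2 * K * P,   # lambda and gamma both K x P matrices
  main.g   = K * P + P,   # gamma constant over components (one per page)
  main.p   = K * P + K,   # gamma constant over pages (one per component; assumed)
  int.gp   = K * P + 1)   # a single shape parameter gamma
```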
The probability Pr_{k,p}(s_i) that session s_i in component k visits page category p is estimated by the corresponding relative frequency. The elements of the resulting K × P matrix are model parameters
and have to be taken into account when determining the total number of parameters. The resulting likelihood L_{k,p}(s_i) for session i being in component k, for each page p individually, is

\[ L_{k,p}(s_i) = \begin{cases} f(y_p;\, \hat{\lambda}_{k,p}, \hat{\gamma}_{k,p})\, \Pr_{k,p}(s_i) & \text{if } p \text{ was visited by } s_i,\\ 1 - \Pr_{k,p}(s_i) & \text{if } p \text{ was not visited by } s_i. \end{cases} \qquad (7) \]
To establish the joint likelihood, a crucial assumption is made: independence of the dwell times over page categories. To make this assumption feasible, a well-advised page categorization must be established. For instance, if some page categories were hierarchical, the independence assumption would not hold. Without this independence assumption, a multivariate Weibull mixture model would have to be fitted which takes into account the covariance structure of the observations. This would require that each session has a full observation vector of length P, i.e., that each page category is visited within each session, which seems not to be realistic within the context of dwell times in web usage mining.
However, given a reasonable independence assumption, the likelihood over all pages that session i belongs to component k is given by

\[ L_k(s_i) = \prod_{p=1}^{P} L_{k,p}(s_i). \]
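A minimal sketch of Eq. (7) and of this joint likelihood might look as follows in R, reusing the hypothetical dweib() helper from the sketch in Section 2.1; the data layout (NA for unvisited pages) is an assumption, not the mixPHM internal representation:

```r
## Component likelihood of one session: 'dwell' is a vector of length P with
## NA for page categories not visited; lambda_k, gamma_k, pr_k are the
## component-specific parameter vectors (length P) for component k
lik_session <- function(dwell, lambda_k, gamma_k, pr_k) {
  visited <- !is.na(dwell)
  lp <- numeric(length(dwell))
  lp[visited]  <- dweib(dwell[visited], lambda_k[visited], gamma_k[visited]) *
                  pr_k[visited]                      # Eq. (7), page visited
  lp[!visited] <- 1 - pr_k[!visited]                 # Eq. (7), page not visited
  prod(lp)                                           # product over page categories
}
```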
At this point, the M-step is carried out. The mixPHM package provides three different methods. The classical version of the EM-algorithm (maximization EM; EMoption = "maximization" in mixPHM) computes the posterior probabilities that session i belongs to group k and does not make a group assignment within each iteration step, but rather updates the matrix Q of posterior probabilities. A faster EM version is proposed by Celeux and Govaert (1992), which they call classification EM (EMoption = "classification" in mixPHM): within each iteration step a group assignment is performed according to sup_k L_k(s_i). Hence, the computation of the posterior matrix is not needed. A randomized version of the M-step considers a combination of the approaches above: after the computation of the posterior matrix Q, a randomized group assignment is performed according to the corresponding probability values (EMoption = "randomization").
As usual, the joint likelihood L is updated at each EM-iteration l until a certain convergence criterion ε is reached, i.e., L^{(l)} − L^{(l−1)} < ε. Theoretical issues concerning EM-convergence in Weibull mixture models can be found in Ishwaran (1996) and Jewell (1982).
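The three M-step variants are selected via the EMoption argument of phmclust. A hypothetical call could look as follows; the data object x and the remaining argument names (K, method) are assumptions patterned after the msBIC example in Section 3:

```r
library(mixPHM)

## Hypothetical calls contrasting the three M-step variants discussed above
fit_max  <- phmclust(x, K = 5, method = "main.g", EMoption = "maximization")
fit_cls  <- phmclust(x, K = 5, method = "main.g", EMoption = "classification")
fit_rand <- phmclust(x, K = 5, method = "main.g", EMoption = "randomization")
```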
3 Real life example
In this section we use a real data set of a large Austrian company which runs a web shop to demonstrate our modeling approach. We restrict the empirical analysis to a subset of 333 buying sessions and 7 page categories: bestview, checkout, service, figurines, jewellery, landing, and search. We perform a dwell-time-based clustering with corresponding proportional hazards assumptions by using the mixPHM package in R (R Development Core Team, 2007).
We start with a rather exploratory approach to determine an appropriate proportionality model with an adequate number of clusters K. By using the msBIC statement we can accomplish such a heuristic model search:
> res.bic <- msBIC(x,K=2:5,method="all")
> res.bic
Bayes Information Criteria
Survival distribution: Weibull
It is obvious that the main.g model with K = 5 components fits quite well compared to the other models (if we fit models for K > 5 the BICs do not decrease noticeably anymore). For the sake of demonstrating the imposed hazard proportionalities, we compare this model to the more flexible separate model. First, we fit the two models again by using the phmclust statement, which is the core routine of the mixPHM package, and inspect the matrices of shape parameters Γ_sep and Γ_g, respectively, for the first 5 pages (due to limited space).
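A sketch of such a refit (argument names as assumed above) is:

```r
## Refit the two competing models for subsequent comparison of the
## estimated shape parameters and hazard plots (sketch only)
fit.sep <- phmclust(x, K = 5, method = "separate")
fit.g   <- phmclust(x, K = 5, method = "main.g")
```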
From Figure 2 it is obvious that the hazards are proportional across components for each page. Note that, due to space limitations, in both plots we only used three selected pages to demonstrate the hazard characteristics. The hazard plots allow one to assess the relevance of different page categories with respect to cluster formation. Similar plots for dwell time distributions are available.
4 Conclusion
In this work we presented a flexible framework to analyze dwell times on web pages by adopting concepts from survival analysis for probability-based clustering. Unobserved heterogeneity is modeled by mixtures of Weibull distributed dwell times. Application of the EM-algorithm leads to a segmentation of sessions.
Since the Weibull distribution is rather highly parameterized, it offers a sizeable amount of flexibility for the hazard rates. A more parsimonious modeling may either be achieved by posing proportionality restrictions on the hazards or by making use of simpler distributional assumptions (e.g., for constant hazard rates). The mixPHM package therefore covers additional survival distributions such as Exponential, Rayleigh, Gaussian, and Log-logistic.
A segmentation of sessions as achieved by our method may serve as a starting point for the optimization of a web site. Identification of typical user behavior allows an efficient dynamic modification of content as well as an optimization of adverts for different groups of users.

Fig. 1. Hazard Plot for Model separate
Fig. 2. Hazard Plot for Model main.g
References
CELEUX, G. and GOVAERT, G. (1992): A Classification EM Algorithm for Clustering and Two Stochastic Versions. Computational Statistics & Data Analysis, 14, 315–332.
DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood from Incomplete Data via the EM-Algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
ISHWARAN, H. (1996): Identifiability and Rates of Estimation for Scale Parameters in Location Mixture Models. The Annals of Statistics, 24, 1560–1571.
JEWELL, N.P. (1982): Mixtures of Exponential Distributions. The Annals of Statistics, 10, 479–484.
KALBFLEISCH, J.D. and PRENTICE, R.L. (1980): The Statistical Analysis of Failure Time Data. Wiley, New York.
MAIR, P. and HUDEC, M. (2007): mixPHM: Mixtures of Proportional Hazard Models. R package version 0.5.0, http://CRAN.R-project.org/
MCLACHLAN, G.J. and KRISHNAN, T. (1997): The EM Algorithm and Extensions. Wiley, New York.
MONTGOMERY, A.L., LI, S., SRINIVASAN, K. and LIECHTY, J.C. (2004): Modeling Online Browsing and Path Analysis Using Clickstream Data. Marketing Science, 23, 579–595.
PARK, Y. and FADER, P.S. (2004): Modeling Browsing Behavior at Multiple Websites. Marketing Science, 23, 280–303.
R Development Core Team (2007): R: A Language and Environment for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0.
Classifying Number Expressions in German Corpora
Irene Cramer1, Stefan Schacht2, Andreas Merkel2
1 Dortmund University, Germany
irene.cramer@uni-dortmund.de
2 Saarland University, Germany
{stefan.schacht, andreas.merkel}@lsv.uni-saarland.de
Abstract. Number and date expressions are essential information items in corpora and therefore play a major role in various text mining applications. However, so far number expressions have been investigated in a rather superficial manner. In this paper we introduce a comprehensive number classification and present promising initial results of a classification experiment using various Machine Learning algorithms (amongst others AdaBoost and Maximum Entropy) to extract and classify number expressions in a German newspaper corpus.
1 Introduction
In many natural language processing (NLP) applications such as Information Extraction and Question Answering, number expressions play a major role; e.g., questions about the altitude of a mountain, the final score of a football match, or the opening hours of a museum make up a significant amount of the users' information need. However, common Named Entity task definitions do not consider number and date/time expressions in detail (or, as in the Conference on Computational Natural Language Learning (CoNLL) 2003 (Tjong Kim Sang (2003)), do not incorporate them at all). We therefore present a novel, extended classification scheme for number expressions, which covers all Message Understanding Conference (MUC) (Chinchor (1998a)) types but additionally includes various structures not considered in common Named Entity definitions. In our approach, numbers are classified according to two aspects: their function in the sentence and their internal structure. We argue that our classification covers most of the number expressions occurring in text corpora. Based on this classification scheme we have annotated the German CoNLL 2003 data and trained various machine learning algorithms to automatically extract and classify number expressions. We also plan to incorporate the number extraction and classification system described in this paper into an open-domain, Web-based Question Answering system for German. As mentioned above, the recognition of certain date, time, and number expressions is especially important in the context of Information Extraction and Question Answering. E.g., the MUC Named Entity definitions (Chinchor (1998b)) include the following basic types: date and time (<TIMEX>)
as well as monetary amount and percentage (<NUMEX>), and thus fostered the development of extraction systems able to handle number and date/time expressions. Famous Information Extraction systems developed in conjunction with MUC are, e.g., FASTUS (Appelt et al. (1993)) or LaSIE (Humphreys et al. (1998)). At that time, many researchers used finite-state approaches to extract Named Entities. More recent Named Entity definitions, such as CoNLL 2003 (Tjong Kim Sang (2003)), aiming at the development of Machine Learning based systems, however, again excluded number and date expressions. Nevertheless, due to the increasing interest in Question Answering and the TREC QA tracks (Voorhees et al. (2000)), a number of research groups have recently investigated various techniques to quickly and accurately extract information items of different types from text corpora and the Web, respectively. Many answer typologies naturally include number and date expressions, e.g. the ISI Question Answer Typology (Hovy et al. (2002)). Unfortunately, in the corresponding papers only the performance of the whole Question Answering system is specified; we therefore could not find any performance values which would be directly comparable to our results. A very interesting and partially comparable work (they only consider a small fraction of our classification) by Ahn et al. (2005) investigates the extraction and interpretation of time expressions. Their reported accuracy values range between about 40% and 75%.
Paper Plan: This paper is structured as follows. Section 2 presents our classification scheme and the annotation. Section 3 deals with the features and the experimental setting. Section 4 analyzes the results and comments on future perspectives.
2 Classification of number expressions
Many researchers use regular expressions to find numbers in corpora; however, most numbers are part of a larger construct such as '2,000 miles' or 'Paragraph 249 Bürgerliches Gesetzbuch'. Consequently, the number without its context has no meaning or is highly ambiguous (2,000 miles vs. 2,000 cars). In applications such as Question Answering it is therefore necessary to detect this additional information. Table 1 shows example questions that obviously ask for number expressions as answers. The examples clearly indicate that we are not looking for mere digits but for multi-word units or even phrases consisting of a number and its specifying context. Thus, a number is not a stand-alone piece of information and, as the examples show, might not even look like a number at all. This paper therefore proposes a novel, extended classification that handles number expressions similarly to Named Entities and thus provides a flexible and scalable method to incorporate these various entity types into one generic framework. We classify numbers according to their internal structure (which corresponds to their text extension) and their function (which corresponds to their class).
We also included all MUC types to guarantee that our classification conforms with previous work.
Table 1. Example Questions and Corresponding Types

Q: How far is the Earth from Mars? | miles? light-years?
Q: What are the opening hours of museum X? | daily from 9 am to 5 pm
Q: How did Dortmund score against Cottbus last weekend? | 2:3
2.1 Classification scheme
Based on Web data and a small fraction of online available German newspaper corpora (Frankfurter Rundschau1 and die tageszeitung2) we deduced 5 basic types: date (including date and time expressions), number (covering count and measure expressions), itemization (rank and score), formula, and isPartofNE (such as street number or zip code). As further analyses of the corpora showed, most of the basic types naturally split into sub-types, which also conforms to the requirements imposed on the classification by our applications. The final classification thus comprises the 30 classes shown in Table 2. The table additionally gives various examples and a short explanation of each class's sense and extension.
2.2 Corpora and annotation
According to our findings in Web data and newspaper corpora we developed guidelines which we used to annotate the German CoNLL 2003 data. To ensure a consistent and accurate annotation of the corpus, we worked every part over in several passes and performed a special reviewing process for critical cases. Table 3 shows an exemplary extract of the data. It is structured as follows: the first column represents the token, the second column its corresponding lemma, the third column its part-of-speech tag, and the fourth column specifies the information produced by a chunker. We did not change any of these columns. In column five, typically representing the Named Entity tag, we added our own annotation. We replaced the given tag if we found the tag O (=other) and appended our classification in all other cases.3 While annotating the corpora we met a number of challenges:
• Preprocessing: The CoNLL 2003 corpus exhibits a couple of erroneous sentence and token boundaries. In fact, this is much more problematic for the extraction of number expressions than for Named Entity Recognition, which is not surprising, since it inherently occurs more frequently in the context of numbers.
• Very complex expressions: We found many date.relative and date.regular expressions, which are extremely complex types in terms of length, internal structure, as well as possible phrasing, and therefore difficult to extract and classify. In addition, we also observed very complex number.amount contexts and a couple of broken sports score tables, which we found very difficult to annotate.
1 http://www.fr-online.de/
2 http://www.taz.de/
3 Our annotation is freely available for download. However, we cannot provide the original CoNLL 2003 data, which you need to reconstruct our annotation.
Table 2. Overview of Number Classes

date.period | for 3 hours, two decades | time/date period, start and end point not specified
date.regular | weekdays 10 am to 6 pm | expressions like opening hours etc.
date.time | at around 11 o'clock | common time expressions
date.time.period | 6-10 am | duration, start and end specified
date.time.relative | in two hours | relative specification, tie: e.g. now
date.time.complete | 17:40:34 | time stamp
date.date | October 5 | common date expressions
date.date.period | November 22-29, Wednesday to Friday, 1998/1990 | duration, start and end specified
date.date.relative | next month, in three days | relative specification, tie: e.g. today
date.date.complete | July 21, 1991 | complete date
date.date.day | on Monday | all weekdays
date.date.month | last November | all months
date.date.year | 1993 | year specification
number.amount | 4 books, several thousand spectators | count, number of items
number.amount.age | aged twenty, Peter (27) | age
number.amount.money | 1 Mio Euros, 1,40 | monetary amount
number.amount.complex | 40 children per year | complex counts
number.measure | 18 degrees Celsius | measurements not covered otherwise
number.measure.area | 30.000 acres | specification of area
number.measure.speed | 30 mph | specification of speed
number.measure.length | 100 km bee-line, 10 meters | specification of length, altitude, ...
number.measure.volume | 43,7 l of rainfall, 230.000 cubic meters of water | specification of capacity
number.measure.weight | 52 kg sterling silver, 3600 barrel | specification of weight
number.measure.complex | 67 l per square mile, 30x90x45 cm | complex measurement
number.percent | 32 %, 50 to 60 percent | percentage
number.phone | 069-848436 | phone number
itemization.rank | third rank | ranking, e.g. in competition
itemization.score | 9 points, 23:26 goals | score, e.g. in tournament
formula.variables | cos(x) | generic equations
formula.parameters | y = 4.132 ∗ x3 | specific equations
• Ambiguities: In some cases we needed a very large context window to disambiguate the expressions to be annotated. Additionally, we even found examples which we could not disambiguate at all, e.g. 'über 3 Jahre' with the possible translations 'more than 3 years' or 'for 3 years'. In German such structures are typically disambiguated by prosody.
• Particular text type: A comparison between CoNLL and the corpora we used to develop our guidelines showed that there might be a very particular style. We also had the impression that the CoNLL training and test data differ with respect to type distribution and style. We therefore based our experiments on the complete data and performed cross-validation.