Analysis of Dwell Times in Web Usage Mining
Patrick Mair1 and Marcus Hudec2
1 Department of Statistics and Mathematics
2 ec3
Abstract. In this contribution we focus on the dwell times a user spends on various areas of a web site within a session. We assume that dwell times may be adequately modeled by a Weibull distribution, which is a flexible and common approach in survival analysis. Furthermore, we introduce heterogeneity by various parameterizations of the dwell time densities by means of proportional hazards models. According to these assumptions the observed data stem from a mixture of Weibull densities. Estimation is based on the EM-algorithm and model selection may be guided by the BIC. Identification of mixture components corresponds to a segmentation of users/sessions. A real life data set stemming from the analysis of a worldwide operating eCommerce application is provided. The corresponding computations are performed with the mixPHM package in R.
1 Introduction
Web Usage Mining focuses on the analysis of the visiting behavior of users on a web site. A common starting point are the so-called click-stream data, which are derived from web-server logs and may be viewed as the electronic trace a user leaves on a web site. Adequate modeling of the dynamics of browsing behavior is of particular relevance for the optimization of eCommerce applications. Recently, Montgomery et al. (2004) proposed a dynamic multinomial probit model of navigation patterns which led to a remarkable increase of conversion rates. Park and Fader (2004) developed multivariate exponential-gamma models which enhance cross-site customer acquisition. These papers indicate the potential that such approaches offer for web-shop providers.
In this paper we focus on modeling dwell times, i.e., the time a user spends viewing a particular page impression. They are defined by the time span between two subsequent page requests and can be calculated by taking the difference between the two logged time points at which the page requests were issued. For the analysis
of complex web sites which consist of a large number of pages, it is often reasonable to reduce the number of different pages by aggregating individual page impressions to semantically related page categories reflecting meaningful regions of the web site.
Analysis of dwell times is an important source of information with regard to the relevance of the content for different users and the effectiveness of a page in attracting visitors. In this paper we are particularly interested in the segmentation of users into groups which exhibit similar behavior with regard to the dwell times they spend on various areas of the site. Such a segmentation analysis is an important step towards a better understanding of the way a user interacts with a web site. It is therefore of relevance with regard to the prediction of user behavior as well as for a user-specific customization or even personalization of web sites.
2 Model specification and estimation
2.1 Weibull mixture model
Since survival analysis focuses on duration times until some event occurs (e.g., the death of a patient in medical applications), it seems straightforward to apply these concepts to the analysis of dwell times in web usage mining applications.
With regard to dwell time distributions we assume that they follow a Weibull distribution with density function $f(t) = \lambda\gamma t^{\gamma-1}\exp(-\lambda t^{\gamma})$, where $\lambda$ is a scale parameter and $\gamma$ the shape parameter. For modeling the heterogeneity of the observed population, we assume K latent segments of sessions. While the Weibull assumption holds within all segments, different segments exhibit different parameter values. This leads to the underlying idea of a Weibull mixture model. For each page category p (p = 1, ..., P) under consideration the resulting mixture has the following form

\[ f(t_p) = \sum_{k=1}^{K} \pi_k\, \lambda_{k,p}\gamma_{k,p}\, t_p^{\gamma_{k,p}-1}\exp\bigl(-\lambda_{k,p} t_p^{\gamma_{k,p}}\bigr), \]

where $t_p$ represents the dwell time on page category p, with mixing proportions $\pi_k$ which correspond to the relative size of each segment.
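As a purely illustrative sketch (not part of the original analysis; all parameter values below are made up), such a mixture density for one page category could be evaluated in R as follows, using the same parameterization as above:

```r
## Weibull density in the parameterization f(t) = lambda*gamma*t^(gamma-1)*exp(-lambda*t^gamma)
dweib <- function(t, lambda, gamma) {
  lambda * gamma * t^(gamma - 1) * exp(-lambda * t^gamma)
}

## K-component mixture density for one page category;
## pi_k, lambda_k, gamma_k are vectors of length K (one entry per segment)
mix_density <- function(t, pi_k, lambda_k, gamma_k) {
  rowSums(sapply(seq_along(pi_k), function(k) {
    pi_k[k] * dweib(t, lambda_k[k], gamma_k[k])
  }))
}

## Hypothetical example with K = 2 segments (illustrative values only)
t_grid <- seq(1, 300, by = 1)                      # dwell times in seconds
f_mix  <- mix_density(t_grid, pi_k = c(0.6, 0.4),
                      lambda_k = c(0.05, 0.01),
                      gamma_k  = c(0.9, 1.4))
```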
In order to reduce the number of parameters involved, we impose restrictions on the hazard rates of the different components of the mixture and pages, respectively. An elegant way of doing this is offered by the concept of Weibull proportional hazards models (WPHM). The general formulation of a WPHM (see, e.g., Kalbfleisch and Prentice (1980)) is

\[ h(t; Z) = \lambda\gamma t^{\gamma-1}\exp(Z\beta), \]

where Z is a matrix of covariates and β are the regression parameters. The term λγt^{γ−1} is the baseline hazard rate h_0(t) due to the Weibull assumption, and h(t;Z) is the hazard proportional to h_0(t) resulting from the regression part of the model.
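As a small sketch (again not from the paper; function name, covariates, and values are hypothetical), this proportional hazards structure can be written directly in R:

```r
## Weibull proportional hazards: h(t; Z) = h0(t) * exp(Z %*% beta),
## with Weibull baseline hazard h0(t) = lambda * gamma * t^(gamma - 1)
wph_hazard <- function(t, Z, beta, lambda, gamma) {
  h0 <- lambda * gamma * t^(gamma - 1)   # baseline hazard
  as.vector(h0 * exp(Z %*% beta))        # covariates shift the hazard proportionally
}

## Hypothetical example: one binary covariate, beta = 0.5, evaluated at t = 10
Z <- matrix(c(0, 1), ncol = 1)           # two observations
wph_hazard(t = 10, Z = Z, beta = 0.5, lambda = 0.05, gamma = 1.2)
```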
2.2 Parsimonious modeling strategies
We propose five different models with respect to different proportionality restrictions in the hazard rates, so as to reduce the number of parameters. In the mixPHM package by Mair and Hudec (2007) the most general model is called separate: the WPHM is computed for each component and page separately. Hence, the hazard of session i belonging to component k (k = 1, ..., K) on page category p (p = 1, ..., P) is

\[ h(t_{i,p}) = \lambda_{k,p}\,\gamma_{k,p}\, t_{i,p}^{\gamma_{k,p}-1}. \]

The parameter matrices can be represented jointly as the K × P matrices Λ = (λ_{k,p}) and Γ = (γ_{k,p}); the parameters (2 × K × P in total) are the same as if they were estimated directly by using a Weibull mixture model.
Next, we impose a proportionality assumption across the latent components. In the classification version of the EM-algorithm (see next section) we have, in each iteration step, a "crisp" assignment of each session to a component. Thus, if we consider this component vector g as a main effect in the WPHM, i.e., h(t; g), we impose proportional hazards for the components across the pages (main.g in mixPHM). Again, the elements of the matrix Λ of scale parameters can vary freely, whereas the shape parameter matrix reduces to the vector Γ = (γ_{1,1}, ..., γ_{1,P}). Thus, the shape parameters are constant over the components and the number of parameters is reduced to K × P + P.
If we impose page main effects in the WPHM, i.e., h(t; p) or main.p, respectively, then, as before, the elements of Λ are not restricted at all, but this time the shape parameters are constant over the pages, i.e., Γ = (γ_{1,1}, ..., γ_{K,1}). The total number of parameters is K × P + K.
The c- and d-scalars are proportionality constants over the pages and components, respectively. The shape parameters are constant over the components and pages. Thus, Γ reduces to one shape parameter γ, which implies that the hazard rates are proportional over components and pages.
To relax the rather restrictive assumption with respect to Λ, we can extend the main effects model by the corresponding component-page interaction term, i.e., h(t; g ∗ p). In mixPHM notation this model is called int.gp. The elements of Λ can vary freely, whereas Γ is again reduced to one parameter only, leaving us with a total number of K × P + 1 parameters. With respect to the hazard rate this relaxation again implies proportional hazards over components and pages.
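To make the effect of these restrictions concrete, the following R sketch tallies the resulting parameter counts for an example with K components and P page categories; the count for main.p is inferred by symmetry from the text above and should be treated as an assumption:

```r
## Number of Weibull parameters under the proportionality restrictions
## discussed above, for K components and P page categories
K <- 5; P <- 7
c(separate = 2 * K * P,   # lambda and gamma both K x P matrices
  main.g   = K * P + P,   # gamma constant over components (one per page)
  main.p   = K * P + K,   # gamma constant over pages (one per component; assumed)
  int.gp   = K * P + 1)   # a single shape parameter gamma
```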
The probability Pr_{k,p}(s_i) that session s_i in component k visits page category p is estimated by the corresponding relative frequency. The elements of the resulting K × P matrix are model parameters
and have to be taken into account when determining the total number of parameters. The resulting likelihood L_{k,p}(s_i) for session i being in component k, for each page p individually, is

\[ L_{k,p}(s_i) = \begin{cases} f(y_p;\, \hat{\lambda}_{k,p}, \hat{\gamma}_{k,p})\, \Pr_{k,p}(s_i) & \text{if } p \text{ was visited by } s_i,\\ 1 - \Pr_{k,p}(s_i) & \text{if } p \text{ was not visited by } s_i. \end{cases} \qquad (7) \]
To establish the joint likelihood, a crucial assumption is made: independence of the dwell times over page categories. To make this assumption feasible, a well-advised page categorization must be established. For instance, if some page categories were hierarchical, the independence assumption would not hold. Without this independence assumption, a multivariate Weibull mixture model would have to be fitted which takes into account the covariance structure of the observations. This would require that each session has a full observation vector of length P, i.e., that each page category is visited within each session, which seems not to be realistic within the context of dwell times in web usage mining.
However, given a reasonable independence assumption, the likelihood over all pages that session i belongs to component k is given by

\[ L_k(s_i) = \prod_{p=1}^{P} L_{k,p}(s_i). \]
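A minimal sketch of Eq. (7) and of this joint likelihood might look as follows in R, reusing the hypothetical dweib() helper from the sketch in Section 2.1; the data layout (NA for unvisited pages) is an assumption, not the mixPHM internal representation:

```r
## Component likelihood of one session: 'dwell' is a vector of length P with
## NA for page categories not visited; lambda_k, gamma_k, pr_k are the
## component-specific parameter vectors (length P) for component k
lik_session <- function(dwell, lambda_k, gamma_k, pr_k) {
  visited <- !is.na(dwell)
  lp <- numeric(length(dwell))
  lp[visited]  <- dweib(dwell[visited], lambda_k[visited], gamma_k[visited]) *
                  pr_k[visited]                      # Eq. (7), page visited
  lp[!visited] <- 1 - pr_k[!visited]                 # Eq. (7), page not visited
  prod(lp)                                           # product over page categories
}
```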
At this point, the M-step is carried out. The mixPHM package provides three different methods. The classical version of the EM-algorithm (maximization EM; EMoption = "maximization" in mixPHM) computes the posterior probabilities that session i belongs to group k and does not make a group assignment within each iteration step, but rather updates the matrix Q of posterior probabilities. A faster EM version is proposed by Celeux and Govaert (1992), which they call classification EM (EMoption = "classification" in mixPHM): within each iteration step a group assignment is performed according to sup_k L_k(s_i). Hence, the computation of the posterior matrix is not needed. A randomized version of the M-step considers a combination of the approaches above: after the computation of the posterior matrix Q, a randomized group assignment is performed according to the corresponding probability values (EMoption = "randomization").
As usual, the joint likelihood L is updated at each EM-iteration l until a certain convergence criterion ε is reached, i.e., L^{(l)} − L^{(l−1)} < ε. Theoretical issues concerning EM-convergence in Weibull mixture models can be found in Ishwaran (1996) and Jewell (1982).
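The three M-step variants are selected via the EMoption argument of phmclust. A hypothetical call could look as follows; the data object x and the remaining argument names (K, method) are assumptions patterned after the msBIC example in Section 3:

```r
library(mixPHM)

## Hypothetical calls contrasting the three M-step variants discussed above
fit_max  <- phmclust(x, K = 5, method = "main.g", EMoption = "maximization")
fit_cls  <- phmclust(x, K = 5, method = "main.g", EMoption = "classification")
fit_rand <- phmclust(x, K = 5, method = "main.g", EMoption = "randomization")
```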
3 Real life example
In this section we use a real data set of a large Austrian company which runs a web shop to demonstrate our modeling approach. We restrict the empirical analysis to a subset of 333 buying sessions and 7 page categories: bestview, checkout, service, figurines, jewellery, landing, and search. We perform a dwell-time-based clustering with corresponding proportional hazards assumptions by using the mixPHM package in R (R Development Core Team, 2007).
We start with a rather exploratory approach to determine an appropriate proportionality model with an adequate number of clusters K. By using the msBIC statement we can accomplish such a heuristic model search:
> res.bic <- msBIC(x,K=2:5,method="all")
> res.bic
Bayes Information Criteria
Survival distribution: Weibull
It is obvious that the main.g model with K = 5 components fits quite well compared to the other models (if we fit models for K > 5 the BICs do not decrease noticeably anymore). For the sake of demonstrating the imposed hazard proportionalities, we compare this model to the more flexible separate model. First, we fit the two models again by using the phmclust statement, which is the core routine of the mixPHM package, and inspect the matrices of shape parameters Γ_sep and Γ_g, respectively, for the first 5 pages (due to limited space).
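A sketch of such a refit (argument names as assumed above) is:

```r
## Refit the two competing models for subsequent comparison of the
## estimated shape parameters and hazard plots (sketch only)
fit.sep <- phmclust(x, K = 5, method = "separate")
fit.g   <- phmclust(x, K = 5, method = "main.g")
```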
From Figure 2 it is obvious that the hazards are proportional across components for each page. Note that, due to space limitations, in both plots we only used three selected pages to demonstrate the hazard characteristics. The hazard plots allow one to assess the relevance of different page categories with respect to cluster formation. Similar plots for dwell time distributions are available.
4 Conclusion
In this work we presented a flexible framework to analyze dwell times on web pages by adopting concepts from survival analysis for probability-based clustering. Unobserved heterogeneity is modeled by mixtures of Weibull distributed dwell times. Application of the EM-algorithm leads to a segmentation of sessions.
Since the Weibull distribution is rather highly parameterized, it offers a sizeable amount of flexibility for the hazard rates. A more parsimonious modeling may either be achieved by posing proportionality restrictions on the hazards or by making use of simpler distributional assumptions (e.g., for constant hazard rates). The mixPHM package therefore covers additional survival distributions such as Exponential, Rayleigh, Gaussian, and Log-logistic.
A segmentation of sessions as achieved by our method may serve as a starting point for the optimization of a web site. Identification of typical user behavior allows an efficient dynamic modification of content as well as an optimization of adverts for different groups of users.

Fig. 1. Hazard Plot for Model separate
Fig. 2. Hazard Plot for Model main.g
References
CELEUX, G. and GOVAERT, G. (1992): A Classification EM Algorithm for Clustering and Two Stochastic Versions. Computational Statistics & Data Analysis, 14, 315–332.
DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood from Incomplete Data via the EM-Algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
ISHWARAN, H. (1996): Identifiability and Rates of Estimation for Scale Parameters in Location Mixture Models. The Annals of Statistics, 24, 1560–1571.
JEWELL, N.P. (1982): Mixtures of Exponential Distributions. The Annals of Statistics, 10, 479–484.
KALBFLEISCH, J.D. and PRENTICE, R.L. (1980): The Statistical Analysis of Failure Time Data. Wiley, New York.
MAIR, P. and HUDEC, M. (2007): mixPHM: Mixtures of Proportional Hazard Models. R package version 0.5.0, http://CRAN.R-project.org/
MCLACHLAN, G.J. and KRISHNAN, T. (1997): The EM Algorithm and Extensions. Wiley, New York.
MONTGOMERY, A.L., LI, S., SRINIVASAN, K. and LIECHTY, J.C. (2004): Modeling Online Browsing and Path Analysis Using Clickstream Data. Marketing Science, 23, 579–595.
PARK, Y. and FADER, P.S. (2004): Modeling Browsing Behavior at Multiple Websites. Marketing Science, 23, 280–303.
R Development Core Team (2007): R: A Language and Environment for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0.
Classifying Number Expressions in German Corpora
Irene Cramer1, Stefan Schacht2, Andreas Merkel2
1 Dortmund University, Germany
irene.cramer@uni-dortmund.de
2 Saarland University, Germany
{stefan.schacht, andreas.merkel}@lsv.uni-saarland.de
Abstract. Number and date expressions are essential information items in corpora and therefore play a major role in various text mining applications. However, so far number expressions have been investigated in a rather superficial manner. In this paper we introduce a comprehensive number classification and present promising initial results of a classification experiment using various Machine Learning algorithms (amongst others AdaBoost and Maximum Entropy) to extract and classify number expressions in a German newspaper corpus.
1 Introduction
In many natural language processing (NLP) applications such as Information Extraction and Question Answering, number expressions play a major role; e.g., questions about the altitude of a mountain, the final score of a football match, or the opening hours of a museum make up a significant amount of the users' information need. However, common Named Entity task definitions do not consider number and date/time expressions in detail (or, as in the Conference on Computational Natural Language Learning (CoNLL) 2003 (Tjong Kim Sang (2003)), do not incorporate them at all). We therefore present a novel, extended classification scheme for number expressions, which covers all Message Understanding Conference (MUC) (Chinchor (1998a)) types but additionally includes various structures not considered in common Named Entity definitions. In our approach, numbers are classified according to two aspects: their function in the sentence and their internal structure. We argue that our classification covers most of the number expressions occurring in text corpora. Based on this classification scheme we have annotated the German CoNLL 2003 data and trained various machine learning algorithms to automatically extract and classify number expressions. We also plan to incorporate the number extraction and classification system described in this paper into an open-domain, Web-based Question Answering system for German. As mentioned above, the recognition of certain date, time, and number expressions is especially important in the context of Information Extraction and Question Answering. E.g., the MUC Named Entity definitions (Chinchor (1998b)) include the following basic types: date and time (<TIMEX>)
as well as monetary amount and percentage (<NUMEX>), and thus fostered the development of extraction systems able to handle number and date/time expressions. Famous Information Extraction systems developed in conjunction with MUC are, e.g., FASTUS (Appelt et al. (1993)) or LaSIE (Humphreys et al. (1998)). At that time, many researchers used finite-state approaches to extract Named Entities. More recent Named Entity definitions, such as CoNLL 2003 (Tjong Kim Sang (2003)), aiming at the development of Machine Learning based systems, however, again excluded number and date expressions. Nevertheless, due to the increasing interest in Question Answering and the TREC QA tracks (Voorhees et al. (2000)), a number of research groups have recently investigated various techniques to quickly and accurately extract information items of different types from text corpora and the Web, respectively. Many answer typologies naturally include number and date expressions, e.g. the ISI Question Answer Typology (Hovy et al. (2002)). Unfortunately, in the corresponding papers only the performance of the whole Question Answering system is specified; we therefore could not find any performance values which would be directly comparable to our results. A very interesting and partially comparable work (they only consider a small fraction of our classification) by Ahn et al. (2005) investigates the extraction and interpretation of time expressions. Their reported accuracy values range between about 40% and 75%.
Paper Plan: This paper is structured as follows. Section 2 presents our classification scheme and the annotation. Section 3 deals with the features and the experimental setting. Section 4 analyzes the results and comments on future perspectives.
2 Classification of number expressions
Many researchers use regular expressions to find numbers in corpora; however, most numbers are part of a larger construct such as '2,000 miles' or 'Paragraph 249 Bürgerliches Gesetzbuch'. Consequently, the number without its context has no meaning or is highly ambiguous (2,000 miles vs. 2,000 cars). In applications such as Question Answering it is therefore necessary to detect this additional information. Table 1 shows example questions that obviously ask for number expressions as answers. The examples clearly indicate that we are not looking for mere digits but for multi-word units or even phrases consisting of a number and its specifying context. Thus, a number is not a stand-alone piece of information and, as the examples show, might not even look like a number at all. This paper therefore proposes a novel, extended classification that handles number expressions similarly to Named Entities and thus provides a flexible and scalable method to incorporate these various entity types into one generic framework. We classify numbers according to their internal structure (which corresponds to their text extension) and their function (which corresponds to their class).
We also included all MUC types to guarantee that our classification conforms with previous work.
Table 1. Example Questions and Corresponding Types

Q: How far is the Earth from Mars? | miles? light-years?
Q: What are the opening hours of museum X? | daily from 9 am to 5 pm
Q: How did Dortmund score against Cottbus last weekend? | 2:3
2.1 Classification scheme
Based on Web data and a small fraction of online available German newspaper corpora (Frankfurter Rundschau1 and die tageszeitung2) we deduced 5 basic types: date (including date and time expressions), number (covering count and measure expressions), itemization (rank and score), formula, and isPartofNE (such as street number or zip code). As further analyses of the corpora showed, most of the basic types naturally split into sub-types, which also conforms to the requirements imposed on the classification by our applications. The final classification thus comprises the 30 classes shown in Table 2. The table additionally gives various examples and a short explanation of each class's sense and extension.
2.2 Corpora and annotation
According to our findings in Web data and newspaper corpora we developed guidelines which we used to annotate the German CoNLL 2003 data. To ensure a consistent and accurate annotation of the corpus, we worked every part over in several passes and performed a special reviewing process for critical cases. Table 3 shows an exemplary extract of the data. It is structured as follows: the first column represents the token, the second column its corresponding lemma, the third column its part-of-speech tag, and the fourth column specifies the information produced by a chunker. We did not change any of these columns. In column five, typically representing the Named Entity tag, we added our own annotation. We replaced the given tag if we found the tag O (=other) and appended our classification in all other cases.3 While annotating the corpora we met a number of challenges:
• Preprocessing: The CoNLL 2003 corpus exhibits a couple of erroneous sentence and token boundaries. In fact, this is much more problematic for the extraction of number expressions than for Named Entity Recognition, which is not surprising, since it inherently occurs more frequently in the context of numbers.
• Very complex expressions: We found many date.relative and date.regular expressions, which are extremely complex types in terms of length, internal structure, as well as possible phrasing, and therefore difficult to extract and classify. In addition, we also observed very complex number.amount contexts and a couple of broken sports score tables, which we found very difficult to annotate.
1 http://www.fr-online.de/
2 http://www.taz.de/
3 Our annotation is freely available for download. However, we cannot provide the original CoNLL 2003 data, which you need to reconstruct our annotation.
Table 2. Overview of Number Classes

date.period | for 3 hours, two decades | time/date period, start and end point not specified
date.regular | weekdays 10 am to 6 pm | expressions like opening hours etc.
date.time | at around 11 o'clock | common time expressions
date.time.period | 6-10 am | duration, start and end specified
date.time.relative | in two hours | relative specification, tie: e.g. now
date.time.complete | 17:40:34 | time stamp
date.date | October 5 | common date expressions
date.date.period | November 22-29, Wednesday to Friday, 1998/1990 | duration, start and end specified
date.date.relative | next month, in three days | relative specification, tie: e.g. today
date.date.complete | July 21, 1991 | complete date
date.date.day | on Monday | all weekdays
date.date.month | last November | all months
date.date.year | 1993 | year specification
number.amount | 4 books, several thousand spectators | count, number of items
number.amount.age | aged twenty, Peter (27) | age
number.amount.money | 1 Mio Euros, 1,40 | monetary amount
number.amount.complex | 40 children per year | complex counts
number.measure | 18 degrees Celsius | measurements not covered otherwise
number.measure.area | 30.000 acres | specification of area
number.measure.speed | 30 mph | specification of speed
number.measure.length | 100 km bee-line, 10 meters | specification of length, altitude, ...
number.measure.volume | 43,7 l of rainfall, 230.000 cubic meters of water | specification of capacity
number.measure.weight | 52 kg sterling silver, 3600 barrel | specification of weight
number.measure.complex | 67 l per square mile, 30x90x45 cm | complex measurement
number.percent | 32 %, 50 to 60 percent | percentage
number.phone | 069-848436 | phone number
itemization.rank | third rank | ranking, e.g. in competition
itemization.score | 9 points, 23:26 goals | score, e.g. in tournament
formula.variables | cos(x) | generic equations
formula.parameters | y = 4.132 ∗ x3 | specific equations
• Ambiguities: In some cases we needed a very large context window to disambiguate the expressions to be annotated. Additionally, we even found examples which we could not disambiguate at all, e.g. 'über 3 Jahre' with the possible translations 'more than 3 years' or 'for 3 years'. In German such structures are typically disambiguated by prosody.
• Particular text type: A comparison between CoNLL and the corpora we used to develop our guidelines showed that there might be a very particular style. We also had the impression that the CoNLL training and test data differ with respect to type distribution and style. We therefore based our experiments on the complete data and performed cross-validation.