A trace clustering solution based on using the distance graph model



Quang-Thuy Ha¹ (✉), Hong-Nhung Bui¹,², and Tri-Thanh Nguyen¹

¹ Vietnam National University (VNU), VNU-University of Engineering and Technology (UET), No. 144, Xuan Thuy, Cau Giay, Hanoi, Vietnam
{ntthanh,thuyhq}@vnu.edu.vn, nhungbh79@gmail.com

² Banking Academy of Vietnam, No. 12, Chua Boc, Dong Da, Hanoi, Vietnam

Abstract. Process discovery is the most important task in process mining. Because of the complexity of event logs (i.e., activities of several different processes are written into the same log), the discovered process models may be diffuse and unintelligible. For this reason, the input event log should be clustered into simpler event sub-logs. This work provides a trace clustering solution based on the idea of using the distance graph model for trace representation. Experimental results show that the proposed solution improves both the Fitness and the Precision measures, with the larger effect on Precision.

Keywords: Event log · Process mining · Fitness measure · Precision measure · Process discovering · Trace clustering · Distance graph model

Process discovery is the most important task in process mining. Several algorithms exist for discovering process models from event logs, such as α (Wil M. P. van der Aalst and Boudewijn F. van Dongen [1]), α+ (A. K. A. de Medeiros et al. [9]), α++ (Lijie Wen et al. [17]), and other algorithms [2]. Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. For this reason, a two-phase approach has been proposed for process model discovery. In the first phase, the input event log is refined, typically with clustering algorithms. In the second phase, process discovery algorithms are run on the refined event log to find the model. Several works follow this approach [4, 5, 8, 10, 13, 15, 16].

The distance graph model for text processing was proposed by Charu C. Aggarwal and Peixiang Zhao in 2013 [3]. Distance graphs of order k (k = 0, 1, 2, …) for a document D (a string of words) drawn from a corpus C are a useful representation of D for text mining tasks [3, 7].

Because of the similarity between the graph structure of a process model and the distance graph model, this work focuses on a trace clustering solution based on the idea of using the distance graph model for trace representation. The study aims to contribute a new solution to trace clustering.

The rest of this article is organized as follows. In the next section, a trace clustering solution using the distance graph model is presented. This framework includes three phases: “Trace representation and Clustering”, “Process discovery”, and “Model Evaluation”. Experiments and remarks are described in the third section. Related work is introduced in the fourth section, and conclusions are given in the last section.

N.T. Nguyen et al. (Eds.): ICCCI 2016, Part I, LNAI 9875, pp. 313–322, 2016.
DOI: 10.1007/978-3-319-45243-2_29

Model

2.1 The Problem

The paper proposes a solution to trace clustering in event logs based on the distance graph model [3]. The problem is described as follows.

Let 𝒜 be the activity-name universe in an organization and A ⊆ 𝒜 be the set of all activity names for a business process in the organization. A trace σ is a sequence of activities, i.e., σ ∈ A⁺ (where A⁺ is the set of non-empty sequences of activities in A). Let L be a simple event log of a business process containing a set of traces constructed from A. Process discovery algorithms transform event logs into process models represented in a process modeling language, e.g., Petri nets (WorkFlow nets: WF-nets), BPMN (Business Process Modeling Notation), or YAWL (Yet Another Workflow Language). Several algorithms exist for discovering process models from event logs, such as α [1], α+ [9], α++ [17], and others [2].

For example, let L = [abdeh, adceg, acdefbdeg, adbeh, acdefdcefcdeh, acdeg] (where a = “register request”, b = “examine thoroughly”, c = “examine casually”, d = “check ticket”, e = “decide”, f = “reinitiate request”, g = “pay compensation”, h = “reject request”) be an event log for the compensation request business process within an airline. Figure 1 shows the WorkFlow net discovered from the event log L by applying the α-algorithm [2].

Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. For this reason, a two-phase approach has been proposed for process model discovery. In the first phase, the input event log is refined, typically with clustering algorithms. In the second phase, process discovery algorithms are run on the refined event log to find the process model [6].

Fig. 1. WorkFlow net discovered by the α-algorithm based on L [2]


2.2 The Distance Graph Model

As mentioned in the introduction, the distance graph model (“a distance graph of order k for a document D drawn from a corpus C”) for text processing was proposed by Charu C. Aggarwal and Peixiang Zhao in 2013. Figure 2 illustrates the distance graphs of orders 0, 1, and 2 for the well-known nursery rhyme “Mary had a little lamb” [3].

As stated in [3], the most common method of representing a document D is a vector of distinct terms generated from the corpus C, where each component of the vector is the frequency of a certain term appearing in D. Aggarwal and Zhao proposed to convert a distance graph into a vector-space representation, i.e., each directed edge in the distance graph is used to create a new “token” or “pseudo-word”. For example, the edge from MARY to LITTLE (in the distance graph of order 2) is used to create a new pseudo-word MARY-LITTLE; the pseudo-word created from the edge from LAMB to itself (in the distance graph of order 2) is LAMB-LAMB. The frequency of the edge is used as the frequency of the pseudo-word. These new pseudo-words preserve the order of words in the document; thus, when combined with the distinct terms in the corpus C, they enrich the semantics of the vector representation of the document.
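The edge-to-pseudo-word conversion can be illustrated with a minimal Python sketch. It assumes that an edge (u, v) of the order k graph counts how often v follows u by at most k positions, with distance 0 giving the self-loop; the function names and the five-token toy document are hypothetical and are not taken from [3].

```python
from collections import Counter

def distance_graph_edges(tokens, k):
    """Count directed edges (u, v) where v follows u by at most k positions.

    Distance 0 yields the self-loop (u, u), so order 0 is plain term frequency.
    """
    edges = Counter()
    for i, u in enumerate(tokens):
        last = min(i + k, len(tokens) - 1)
        for j in range(i, last + 1):
            edges[(u, tokens[j])] += 1
    return edges

def pseudo_words(tokens, k):
    """Turn every edge into a 'U-V' pseudo-word carrying the edge frequency."""
    return {f"{u}-{v}": n for (u, v), n in distance_graph_edges(tokens, k).items()}

# Hypothetical toy document (not the full rhyme used in [3]).
tokens = ["MARY", "LITTLE", "LAMB", "LITTLE", "LAMB"]
for word, freq in sorted(pseudo_words(tokens, 2).items()):
    print(word, freq)
```

In this toy document the pseudo-word LAMB-LAMB gets frequency 3: two from the occurrences of LAMB themselves (distance 0) and one from the pair of occurrences that are two positions apart.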

Charu C. Aggarwal and Peixiang Zhao showed some interesting features of the distance graph model, as well as its effectiveness for text classification. Since the order of activities within a trace plays an important role, the characteristic of the distance graph that makes it suitable for trace representation is its ability to preserve the order of words in a document in the form of directed edges.

Fig. 2. Illustration of distance graph representation [3]


2.3 A Three-Phase Process Discovery Framework

Figure 3 describes a process discovery framework that uses the trace clustering solution based on the distance graph model. The framework includes the “Trace representation and Clustering”, “Process discovery”, and “Model evaluation” phases.

The Trace representation and Clustering phase includes two steps. In the Trace Representation step, a dataset for clustering is created, in which a data point is a vector built from the distance graphs (of different orders) of a trace in the event log.

The set A of activities in the event log is treated as the set of “distinct words” in the corpus C, and a trace in the event log is treated as a document D; thus, distance graphs for a trace can be constructed. For the given trace <a c d e f d b e h>:

• The order 0 distance graph is: a(1), c(1), d(2), e(2), f(1), b(1), h(1), where the number denotes the frequency of the directed edge from the node to itself. This graph contains 7 unconnected components.

Fig. 3. A three-phase framework of process discovery


• The order 1 distance graph is constructed from the order 0 graph a(1), c(1), d(2), e(2), f(1), b(1), h(1) by adding the edges between adjacent activities: ac(1), cd(1), de(1), ef(1), fd(1), db(1), be(1), eh(1), where the number denotes the frequency.

• The order 2 distance graph is constructed from the order 1 graph a(1), c(1), d(2), e(2), f(1), b(1), h(1), ac(1), cd(1), de(1), ef(1), fd(1), db(1), be(1), eh(1) by adding the edges between activities that are two positions apart: ad(1), ce(1), df(1), ed(1), fb(1), de(1), bh(1), where the number denotes the added frequency. The edge de already exists in the order 1 graph, so its frequency becomes 2.

• etc

We followed the method of [3] to decompose a distance graph into a set of features for vector representation, with a small modification. A feature is either a vertex or a directed edge of the graph. Our modification is to ignore the edge from a vertex v to itself (i.e., edge vv) in the distance graph of order 0, since every vertex in the order 0 graph always has an edge from itself to itself (self-loop). In addition, an edge from a vertex to itself in a trace should indicate that an activity is repeated. For the above order 1 distance graph of the trace <a c d e f d b e h>, the set of features is {a, c, d, e, f, b, h, ac, cd, de, ef, fd, db, be, eh}. The frequency of each feature in each trace is preserved in the vector representation. Since a higher order distance graph of a trace includes all lower order distance graphs under this representation, the highest order distance graph alone is enough to represent the trace, provided that the self-loops of order 0 are distinguished from the self-loops of higher orders. With this representation, if two graphs share common sub-graphs, this is preserved in the representation. For another trace <a c d e f b h>, the set of order 1 features {a, c, d, e, f, b, h, ac, cd, de, ef, fb, bh} is a subset of the order 2 feature set of the above trace (fb and bh appear there as order 2 edges). Consequently, the two vectors will be close to each other in the vector space. Because event logs reflect the executions of business processes, all distance graphs of traces in an event log include some relation patterns of the discovered process model. That is why the number of features generated from all the traces in an event log L is significantly less than (|A| + |A|*(|A|-1)/2), where |A| denotes the cardinality of the set A of activities.
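The trace representation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, traces are assumed to be strings of single-character activity names (so that an edge such as "ac" cannot be confused with an activity), and vertex frequencies stand in for the ignored order 0 self-loops.

```python
from collections import Counter

def trace_features(trace, k):
    """Feature -> frequency for the order-k distance graph of one trace.

    Vertices are kept as single-activity features; an edge 'xy' counts the
    pairs where y follows x by 1..k positions, so a feature such as 'dd'
    would only appear when an activity is repeated within k steps.
    """
    feats = Counter(trace)
    for i, x in enumerate(trace):
        for y in trace[i + 1 : i + 1 + k]:
            feats[x + y] += 1
    return feats

def vectorize(traces, k):
    """Map traces to frequency vectors over a shared, sorted feature list."""
    bags = [trace_features(t, k) for t in traces]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[f] for f in vocab] for bag in bags]

t1, t2 = "acdefdbeh", "acdefbh"
f1 = trace_features(t1, 2)
f2 = trace_features(t2, 1)
print(f1["de"])             # 2: d->e occurs once at distance 1 and once at distance 2
print(set(f2) <= set(f1))   # True: order 1 features of t2 are order 2 features of t1

vocab, vectors = vectorize([t1, t2], 2)
print(len(vocab), vectors[0][:5])
```

Building all vectors over one shared feature list keeps the components aligned across traces, which is what a vector-space clustering algorithm needs.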

In the Clustering step, a clustering algorithm (e.g., K-Modes or K-means) is applied to the dataset. The output of the Trace Representation and Clustering phase is a set of clusters (sub-logs) of traces (cases) of the event log.
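As an illustration of the Clustering step, the following Python sketch runs a plain Lloyd-style K-means on trace vectors and then splits the log into sub-logs. It is only a sketch under simplifying assumptions (squared Euclidean distance, the first k points as initial centroids, at least k points); the paper does not state which K-means implementation or settings were used, and the vectors and trace names below are hypothetical.

```python
def kmeans(points, k, iters=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic seeding
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def split_log(traces, labels, k):
    """Group traces into k sub-logs according to their cluster labels."""
    sublogs = [[] for _ in range(k)]
    for trace, lab in zip(traces, labels):
        sublogs[lab].append(trace)
    return sublogs

# Hypothetical two-dimensional trace vectors and trace names.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 0.9]]
traces = ["t1", "t2", "t3", "t4"]
labels = kmeans(vectors, 2)
print(split_log(traces, labels, 2))
```

Each resulting sub-log can then be passed to a process discovery algorithm on its own.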

In the Process Discovery phase, a process discovery algorithm (e.g., the α-algorithm) is applied to the clusters (event sub-logs) to obtain process models.

The Model Evaluation phase assesses the quality of the resulting process models. Though there are four common measures for evaluation, i.e., Fitness, Precision, Generalization, and Simplicity [2, 11, 12], this work considers two measures, Fitness and Precision, as described by A. Rozinat and Wil M. P. van der Aalst [11]. The Fitness measure indicates that the discovered model should accept the behaviors seen in the event log, and the Precision measure means that the discovered model should not accept behaviors completely unrelated to what was seen in the event log. Since these measures are calculated on each cluster, an aggregated value for the whole event log should be calculated. This work selects a weighted average value as follows:

$$w_{avg} = \sum_{i=1}^{k} \frac{n_i}{n}\, w_i \qquad (1)$$

where $w_{avg}$ is the aggregated value of the fitness or precision measure, $k$ is the number of clusters, $n$ is the number of traces in the event log, $n_i$ is the number of traces in the $i$-th cluster, and $w_i$ is the value of the measure of the $i$-th cluster.
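The aggregation over clusters can be written as a short Python sketch: each cluster's measure value is weighted by that cluster's share of the traces. The function name and the numbers in the example are hypothetical and are not results from the paper.

```python
def weighted_average(cluster_sizes, cluster_values):
    """Aggregate per-cluster measure values, weighting each by n_i / n."""
    n = sum(cluster_sizes)
    return sum(n_i / n * w_i for n_i, w_i in zip(cluster_sizes, cluster_values))

# Hypothetical example: three clusters of 600, 400, and 200 traces.
sizes = [600, 400, 200]
fitness = [0.9, 0.8, 0.6]
print(round(weighted_average(sizes, fitness), 3))  # about 0.817
```

Weighting by cluster size prevents a very small cluster with an extreme value from dominating the aggregated score.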

This work used the prBm6 event log from the “Conformance Checking in the Large” dataset (see footnote 1) for the experiments. The event log includes 1200 cases with 37961 events. In the Clustering step, two clustering algorithms, K-Modes and K-means, were used. In the Process discovery and Model evaluation phases, ProM [19] was used. After several tests, we selected a maximum distance graph order of 2 for all the experiments.

3.1 The Experiment with K-Modes Algorithm

Since a trace is a sequence of activities and an event log yields a set of activities, a common trace representation is the binary activity vector, i.e., a vector component is 1 if the trace contains a certain activity and 0 otherwise [2, 8]. To evaluate the model, this binary activity-based trace representation was implemented as a baseline. The experimental results are described in Table 1. The values of Average-Fitness and Average-Precision, computed with (1) for the vector-based and the distance graph order 2-based trace representations, appear in the columns titled “Avg” in the table. After several runs, we found that a suitable number of clusters for the data set is 3.
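The binary activity-vector baseline can be sketched in a few lines of Python. The function name is hypothetical, and traces are assumed to be strings of single-character activity names, as in the example log L of Sect. 2.1.

```python
def binary_vectors(traces):
    """Return the sorted activity list and one 0/1 vector per trace."""
    activities = sorted({a for t in traces for a in t})
    vectors = [[1 if a in t else 0 for a in activities] for t in traces]
    return activities, vectors

log = ["abdeh", "adceg", "adbeh"]  # three traces taken from the example log L
activities, vectors = binary_vectors(log)
print(activities)                 # ['a', 'b', 'c', 'd', 'e', 'g', 'h']
print(vectors[0] == vectors[2])   # True: abdeh and adbeh are indistinguishable
```

The last line shows the limitation of this baseline: traces that contain the same activities in a different order receive identical vectors, whereas the edge features of a distance graph keep them apart.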

Experiments with the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are shown in Fig. 4.

3.2 The Experiment with K-Means Algorithm

In this experiment, the K-means clustering algorithm was run on the vector-based and the distance graph-based trace representations. The experimental results are described in Table 2. We also calculated the values of Average-Fitness and Average-Precision with (1) for the activity-based (Vector) and the distance graph-based (Distance Graph) trace representations; they appear in the columns titled “Avg” in the table.

Experiments with the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are shown in Fig. 5.

1 http://data.3tu.nl/repository/uuid:44c32783-15d0-4dbd-af8a-78b97be3de49


3.2.1 Discussions

The results shown in Table 1 (Fig. 4) and Table 2 (Fig. 5) lead to the following findings:

Table 1. Using the K-Modes clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph order 2-based (Distance Graph) trace representations

Fig. 4. Comparison of the discovered process models on the measures of Fitness and Precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-Modes clustering algorithm

Table 2. Using the K-means clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph-based (Distance Graph) trace representations


– In all cases, the distance graph-based trace representation performs better than the vector-based trace representation on both the fitness and the precision measures.

– The effect of the distance graph-based trace representation on the precision measure is higher than that on the fitness measure.

– Distance graph order 2 has a better effect on precision than distance graph order 1.

G. Greco et al. [8] proposed a clustering solution for traces in event logs. They used a vector representation for traces and the K-means algorithm. Their work was the first study on trace clustering within the process mining domain.

R. P. Jagadeesh Chandra Bose [6] and R. P. Jagadeesh Chandra Bose et al. [4, 5] proposed trace clustering solutions based on control-flow context information, i.e., “context-aware” solutions. The Levenshtein distance technique was used.

De Weerdt et al. [15] proposed a two-phase solution that combines trace clustering and text mining for process discovery. In the first phase, an MRA-based semi-supervised clustering technique (the SemSup-MRA algorithm) was applied. This yields two kinds of clusters: clusters of standard behaviors and clusters of atypical behaviors. In the second phase, process mining and text mining techniques were applied. After [15], De Weerdt et al. [16] proposed the ActiTraC algorithm, a three-phase algorithm for clustering an event log into a collection of event logs (clusters). Its three phases are Selection, Look ahead, and Residual trace resolution. They also developed the ActiTraCMRA algorithm, a further version of the ActiTraC algorithm.

Fig. 5. Comparison of the discovered process models on the measures of fitness and precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-means clustering algorithm


T. Thaler et al. [14] provided a survey of trace clustering techniques. They also analyzed and compared the investigated techniques.

This work is the first study on using the distance graph model [3] for trace clustering.

This work provided a trace representation solution based on the distance graph model [3] for clustering traces in event logs. Experiments showed that the distance graph-based trace representation is more effective than the activity-based one.

The experiments in this work are limited, and several tasks remain for future work. Firstly, other distance measures between graphs, e.g., distances in graph theory [18], should be studied in order to cluster traces directly in the form of graphs. Secondly, more clustering algorithms, especially graph-based clustering algorithms, should be considered. Thirdly, experiments on more event log datasets are needed to confirm the reliability of the method.

Acknowledgments. This work was supported in part by VNU Grant QG-15-22.

References

1. van der Aalst, W.M., van Dongen, B.F.: Discovering workflow performance models from timed logs. In: Han, Y., Tai, S., Wikarski, D. (eds.) EDCIS 2002. LNCS, vol. 2480, pp. 45–63. Springer, Heidelberg (2002)

2. van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Heidelberg (2011)

3. Aggarwal, C.C., Zhao, P.: Towards graphical models for text processing. Knowl. Inf. Syst. 36(1), 1–21 (2013)

4. Bose, R.C., van der Aalst, W.M.: Trace clustering based on conserved patterns: towards achieving better process models. In: Rinderle-Ma, S., Sadiq, S., Leymann, F. (eds.) BPM 2009. LNBIP, vol. 43, pp. 170–181. Springer, Heidelberg (2010)

5. Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: SDM 2009, pp. 401–412 (2009)

6. Bose, R.P.J.C.: Process Mining in the Large: Preprocessing, Discovery, and Diagnostics. Ph.D. thesis, Eindhoven University of Technology (2012)

7. Dai, X.-Y., Cheng, C., Huang, S., Chen, J.: Sentiment classification with graph sparsity regularization. In: Gelbukh, A. (ed.) LNCS, vol. 9042, pp. 140–151. Springer, Heidelberg (2015)

8. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)

9. de Medeiros, A.K.A., van Dongen, B.F., van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: extending the alpha-algorithm to mine short loops. BETA Working Paper Series (2004)

10. de Medeiros, A.K.A., Guzzo, A., Greco, G., van der Aalst, W.M., Weijters, A., van Dongen, B.F., Saccà, D.: Process mining based on clustering: a quest for precision. In: Hofstede, A.H., Benatallah, B., Paik, H.-Y. (eds.) BPM Workshops 2007. LNCS, vol. 4928, pp. 17–29. Springer, Heidelberg (2008)


11. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Inf. Syst. 33(1), 64–95 (2008)

12. Buijs, J.C., van Dongen, B.F., van der Aalst, W.M.: On the role of fitness, precision, generalization and simplicity in process discovery. In: Meersman, R., Panetto, H., Dillon, T., Rinderle-Ma, S., Dadam, P., Zhou, X., Pearson, S., Ferscha, A., Bergamaschi, S., Cruz, I.F. (eds.) OTM 2012, Part I. LNCS, vol. 7565, pp. 305–322. Springer, Heidelberg (2012)

13. Song, M., Günther, C.W., van der Aalst, W.M.: Trace clustering in process mining. In: Ardagna, D., Mecella, M., Yang, J. (eds.) Business Process Management Workshops. LNBIP, vol. 17, pp. 109–120. Springer, Heidelberg (2009)

14. Thaler, T., Ternis, S.F., Fettke, P., Loos, P.: A comparative analysis of process instance cluster techniques. In: Wirtschaftsinformatik 2015, pp. 423–437 (2015)

15. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Leveraging process discovery with trace clustering and text mining for intelligent analysis of incident management processes. In: IEEE Congress on Evolutionary Computation, pp. 1–8 (2012)

16. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. 25(12), 2708–2720 (2013)

17. Wen, L., van der Aalst, W.M.P., Wang, J., Sun, J.: Mining process models with non-free-choice constructs. Data Min. Knowl. Discov. 15(2), 145–180 (2007)

18. Deza, M.M., Deza, E.: Distances in Graph Theory. Springer, Heidelberg (2014)

19. http://www.processmining.org/prom/start
