A Trace Clustering Solution Based on Using the Distance Graph Model
Quang-Thuy Ha (1), Hong-Nhung Bui (1, 2), and Tri-Thanh Nguyen (1)
(1) Vietnam National University (VNU), VNU-University of Engineering and Technology (UET), No. 144, Xuan Thuy, Cau Giay, Hanoi, Vietnam
{ntthanh,thuyhq}@vnu.edu.vn, nhungbh79@gmail.com
(2) Banking Academy of Vietnam, No. 12, Chua Boc, Dong Da, Hanoi, Vietnam
Abstract. Process discovery is the most important task in process mining. Because of the complexity of event logs (i.e., activities of several different processes are written into the same log), the discovered process models may be diffuse and unintelligible. That is why the input event logs should be clustered into simpler event sub-logs. This work provides a trace clustering solution based on the idea of using the distance graph model for trace representation. Experimental results show that the proposed solution improves both the Fitness and the Precision measures, with the larger effect on Precision.
Keywords: Event log · Process mining · Fitness measure · Precision measure · Process discovering · Trace clustering · Distance graph model
1 Introduction
Process discovery is the most important task in process mining. There exist several algorithms for discovering process models from event logs, such as α (Wil M. P. van der Aalst and Boudewijn F. van Dongen [1]), α+ (A. K. A. de Medeiros et al. [9]), α++ (Lijie Wen et al. [17]), and other algorithms [2]. Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. That is why a two-phase approach is proposed for process model discovery. In the first phase, the input event log is refined, for which clustering algorithms are commonly used. In the second phase, process discovery algorithms are run on the refined event log to find the model. Several works follow this approach [4, 5, 8, 10, 13, 15, 16].
The distance graph model for text processing was proposed by Charu C. Aggarwal and Peixiang Zhao in 2013 [3]. The distance graph of order k (k = 0, 1, 2, …) for a document (a string of words) D based on the corpus C is a useful representation of D for text mining tasks [3, 7].
Because of the similarity between the graph structure of a process model and the distance graph model, this work focuses on a trace clustering solution based on the idea of using the distance graph model for trace representation. This study aims to contribute a new solution to trace clustering.
The rest of this article is organized as follows. In the next section, a trace clustering solution using the distance graph model is presented. The framework includes three phases: “Trace representation and Clustering”, “Process discovery”, and “Model Evaluation”. Experiments and remarks are described in the third section. In the fourth section, related work is introduced. Conclusions are given in the last section.
© Springer International Publishing Switzerland 2016
N.T. Nguyen et al. (Eds.): ICCCI 2016, Part I, LNAI 9875, pp. 313–322, 2016.
DOI: 10.1007/978-3-319-45243-2_29
Model
2.1 The Problem
The paper proposes a solution to trace clustering in event logs based on the distance graph model [3]. The problem is described as follows.
Let 𝒜 be the activity-name universe in an organization and A ⊆ 𝒜 be the set of all activity names for a business process in the organization. A trace σ is a sequence of activities, i.e., σ ∈ A⁺ (where A⁺ is the set of non-empty sequences of activities in A). Let L be a simple event log of a business process containing a set of traces constructed from A. Process discovery algorithms transform event logs into process models represented in a process modeling language, e.g., Petri nets (WorkFlow nets: WF-nets), BPMN (Business Process Modeling Notation), or YAWL (Yet Another Workflow Language). There exist several algorithms for discovering process models from event logs, such as α [1], α+ [9], α++ [17], and others [2].
For example, let L = [abdeh, adceg, acdefbdeg, adbeh, acdefdcefcdeh, acdeg] (where a = “register request”, b = “examine thoroughly”, c = “examine casually”, d = “check ticket”, e = “decide”, f = “reinitiate request”, g = “pay compensation”, h = “reject request”) be an event log for the request-for-compensation business process within an airline. Figure 1 shows the WorkFlow net discovered from the event log L by applying the α-algorithm [2].
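As a minimal sketch (not part of the original paper), the example log L can be held as a plain list of traces with one character per activity. The variable names are illustrative assumptions; the activity set A is recovered as the union of all activities that occur in some trace.

```python
# Event log L from the example above: each trace is a string,
# one letter per activity (a = "register request", ..., h = "reject request").
event_log = ["abdeh", "adceg", "acdefbdeg", "adbeh", "acdefdcefcdeh", "acdeg"]

# The activity set A: every activity that appears in at least one trace.
activities = sorted(set("".join(event_log)))

print(activities)
print([len(trace) for trace in event_log])  # trace lengths
```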
Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. That is why a two-phase approach is proposed for process model discovery. In the first phase, the input event log is refined, for which clustering algorithms are commonly used. In the second phase, process discovery algorithms are run on the refined event log to find the process model [6].
Fig. 1. WorkFlow net discovered by the α-algorithm based on L [2]
2.2 The Distance Graph Model
As mentioned in the introduction, the distance graph model (“a distance graph of order k for a document D drawn from a corpus C”) for text processing was proposed by Charu C. Aggarwal and Peixiang Zhao in 2013 [3]. Figure 2 illustrates the distance graphs of orders 0, 1, and 2 for the well-known nursery rhyme “Mary had a little lamb” [3]. As stated in [3], the most common method of representing a document D is a vector of distinct terms generated from the corpus C, where each component of the vector is the frequency of a certain term appearing in D. Aggarwal and Zhao proposed to convert a distance graph into a vector-space representation, i.e., each directed edge in the distance graph is used to create a new “token” or “pseudo-word”. For example, the edge from MARY to LITTLE (in the distance graph of order 2) is used to create a new pseudo-word MARY-LITTLE; the pseudo-word created from the edge from LAMB to itself (in the distance graph of order 2) is LAMB-LAMB. The frequency of the edge is used as the frequency of the pseudo-word. These new pseudo-words preserve the order of words in the document; thus, when combined with the distinct terms in the corpus C, they enrich the semantics of the vector representation of the document.
Aggarwal and Zhao showed some interesting features of the distance graph model, as well as the effectiveness of the model when applied to text classification. Since the order of activities within a trace plays an important role, one characteristic of the distance graph that makes it suitable for trace representation is its ability to preserve the order of words in a document in the form of directed edges.
Fig. 2. Illustration of distance graph representation [3]
2.3 A Three-Phase Process Discovery Framework
Figure 3 describes a process discovery framework that uses a trace clustering solution based on the distance graph model. The framework includes the “Trace representation and Clustering”, “Process discovery”, and “Model evaluation” phases.
The Trace Representation and Clustering Phase includes two steps. In the Trace Representation step, a dataset for clustering is created, in which a data point is a vector derived from the distance graphs (of different orders) of a trace in the event log.
Fig. 3. A three-phase framework of process discovery
The set A of activities in the event log is considered as the set of “distinct words” in the corpus C, and a trace in the event log is considered as a document D; thus, distance graphs can be constructed for a trace. For the given trace <a c d e f d b e h>:
• The order 0 distance graph is: a(1), c(1), d(2), e(2), f(1), b(1), h(1), where the number denotes the frequency of the directed edge from the node to itself. This graph contains 7 unconnected components.
• The order 1 distance graph is constructed from the order 0 graph by adding the edges ac(1), cd(1), de(1), ef(1), fd(1), db(1), be(1), eh(1), where the number denotes the frequency.
• The order 2 distance graph is constructed from the order 1 graph by adding the edges ad(1), ce(1), df(1), ed(1), fb(1), de(1), bh(1), where the number denotes the frequency; the edge de already exists in the order 1 graph, so its frequency becomes 2.
• etc.
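The construction in the list above can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the authors' implementation: the function name `distance_graph` is hypothetical, and an edge (u, v) is counted once for every pair of positions at most `order` steps apart.

```python
from collections import Counter

def distance_graph(trace, order):
    """Return the edge frequencies of the distance graph of the given order.

    An edge (u, v) is counted once for every pair of positions i <= j with
    j - i <= order, trace[i] == u and trace[j] == v. Order 0 therefore
    yields only self-loops, whose frequency equals the activity count.
    """
    edges = Counter()
    for i, source in enumerate(trace):
        for gap in range(order + 1):
            if i + gap < len(trace):
                edges[(source, trace[i + gap])] += 1
    return edges

trace = "acdefdbeh"  # the trace <a c d e f d b e h>, one letter per activity
for k in range(3):
    graph = distance_graph(trace, k)
    print(k, {u + v: n for (u, v), n in graph.items()})
```

Each higher order keeps all edges of the lower orders and only adds or increments edges, which is why the order 2 graph reports the edge de with frequency 2.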
We followed the method of [3] to decompose a distance graph into a set of features for vector representation, with a small modification. A feature is either a vertex or a directed edge of the graph. Our modification is to ignore the edge from a vertex v to itself (i.e., edge vv) in the distance graph of order 0, since every vertex in the graph of order 0 always has an edge from itself to itself (self-loop). In addition, an edge from a vertex to itself, in a trace, should indicate that an activity is repeated. For the above order 1 distance graph of the trace <a c d e f d b e h>, the set of features is {a, c, d, e, f, b, h, ac, cd, de, ef, fd, db, be, eh}. The frequency of each feature in a trace is preserved in the vector representation. Since a higher order distance graph of a trace includes all lower order distance graphs under this representation, the highest order distance graph alone is enough to represent the trace, provided that the self-loops of order 0 are distinguished from the self-loops of higher orders. With this representation, if two graphs share common sub-graphs, the shared part is preserved in the representation. For example, for another trace <a c d e f b h>, the order 1 set of features {a, c, d, e, f, b, h, ac, cd, de, ef, fb, bh} is a subset of the order 2 features of the above trace (fb and bh are among its order 2 edges). Consequently, the two vectors will be close to each other in the vector space.
Because event logs reflect the executions of business processes, all distance graphs of traces in an event log include some relation patterns of the discovered process model. That is why the number of features generated from all the traces in an event log L is significantly less than (|A| + |A|*(|A|-1)/2), where |A| denotes the cardinality of the set A of activities.
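The feature decomposition described above can be sketched as follows. The helper names `trace_features` and `to_vectors` are hypothetical, activities are assumed to be single characters, and the sketch follows the stated modification: vertices carry the activity frequency, so order 0 self-loops are not emitted as edges.

```python
from collections import Counter

def trace_features(trace, order):
    """Features of a trace: activities (vertices) plus edges up to `order`.

    Order 0 self-loops are not emitted as edge features; the vertex
    frequency already carries that information. A self-loop of a higher
    order (a repeated activity) appears as a distinct feature such as "dd".
    """
    features = Counter(trace)  # vertex features, e.g. "d" -> 2
    for i, source in enumerate(trace):
        for gap in range(1, order + 1):
            if i + gap < len(trace):
                features[source + trace[i + gap]] += 1  # edge feature, e.g. "de"
    return features

def to_vectors(traces, order):
    """Map every trace onto one shared, sorted feature vocabulary."""
    bags = [trace_features(t, order) for t in traces]
    vocabulary = sorted(set().union(*bags))
    return vocabulary, [[bag[f] for f in vocabulary] for bag in bags]

vocabulary, vectors = to_vectors(["acdefdbeh", "acdefbh"], order=2)
print(vocabulary)
print(vectors)
```

Keeping one shared vocabulary for the whole log means every trace maps to a vector of the same length, with zeros for features the trace does not contain.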
In the Clustering step, a clustering algorithm (e.g., K-Modes or K-means) is applied to the dataset. The output of the Trace Representation and Clustering Phase is a set of clusters (sub-logs) of traces (cases) of the event log.
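To make the Clustering step concrete, the following is a small, self-contained K-means sketch in plain Python. It is not the implementation used in the experiments (the paper does not specify one); the seeding with the first k points and the tiny feature vectors are assumptions chosen only to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means; returns one cluster label per point.

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic but is not a recommended initialisation.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical order 1 feature vectors over [a, b, c, d, ab, ac, bd, cd].
vectors = [
    [1, 1, 0, 0, 1, 0, 0, 0],  # trace <a b>
    [1, 0, 1, 0, 0, 1, 0, 0],  # trace <a c>
    [1, 1, 0, 1, 1, 0, 1, 0],  # trace <a b d>
    [1, 0, 1, 1, 0, 1, 0, 1],  # trace <a c d>
]
labels = kmeans(vectors, k=2)
print(labels)
```

Traces with the same label form one event sub-log, which is then handed to the Process Discovery Phase.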
In the Process Discovery Phase, a process discovery algorithm (i.e., the α-algorithm) is applied to the clusters (event sub-logs) to obtain process models.
The Model Evaluation Phase assesses the quality of the resulting process models. Though there are four common measures for evaluation, i.e., Fitness, Precision, Generalization, and Simplicity [2, 11, 12], this work considers two of them, Fitness and Precision, which were described by A. Rozinat and Wil M. P. van der Aalst [11]. The Fitness measure indicates that the discovered model should accept the behaviors seen in the event log, and the Precision measure means that the discovered model should not accept behaviors completely unrelated to what was seen in the event log. Since these measures are calculated on each cluster, an aggregated value for the whole event log should be calculated. This work selects a weighted average value as follows:
$$ w_{avg} = \sum_{i=1}^{k} \frac{n_i}{n} \, w_i \qquad (1) $$
where $w_{avg}$ is the aggregated value of the fitness or precision measure, $k$ is the number of clusters, $n$ is the number of traces in the event log, $n_i$ is the number of traces in the $i$-th cluster, and $w_i$ is the value of the measure for the $i$-th cluster.
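The weighted average can be computed directly from the per-cluster values and cluster sizes. The sketch below is illustrative: the function name is hypothetical and the fitness values and cluster sizes are made-up numbers, not results from the paper.

```python
def weighted_average(values, sizes):
    """Aggregate per-cluster measure values, weighting each by cluster size."""
    total = sum(sizes)  # n, the number of traces in the whole event log
    return sum(size / total * value for value, size in zip(values, sizes))

# Hypothetical per-cluster fitness values and cluster sizes (not from the paper).
fitness_per_cluster = [0.90, 0.80, 0.70]
traces_per_cluster = [600, 400, 200]
print(round(weighted_average(fitness_per_cluster, traces_per_cluster), 4))
```

Weighting by cluster size prevents a very small cluster with an extreme value from dominating the aggregated score.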
This work used the prBm6 event log from “Conformance Checking in the Large” (footnote 1) for the experiments. The event log includes 1200 cases with 37961 events. In the Clustering step, two clustering algorithms, K-Modes and K-means, were used. In the Process discovery and Model evaluation phases, ProM [19] was used. Based on several tests, we selected a maximum distance graph order of 2 for all the experiments.
3.1 The Experiment with K-Modes Algorithm
Since a trace is a sequence of activities and an event log yields a set of activities, a common trace representation is the binary activity vector, i.e., a vector component is 1 if the trace contains a certain activity and 0 otherwise [2, 8]. To evaluate the model, this binary activity-based trace representation was implemented as a baseline. The experimental results are described in Table 1. The values of Average-Fitness and Average-Precision, computed by (1), for the vector-based and the distance graph order 2-based trace representations are given in the columns titled “Avg” in the table. After several runs, we found that a suitable number of clusters for the data set is 3.
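The binary activity baseline is easy to sketch. The function name below is hypothetical and the two traces are the ones used as examples in Sect. 2.3; the activity set is that of the example log in Sect. 2.1.

```python
def binary_activity_vector(trace, activities):
    """1 if the activity occurs in the trace at least once, otherwise 0."""
    present = set(trace)
    return [1 if activity in present else 0 for activity in activities]

activities = list("abcdefgh")  # activity set of the example log in Sect. 2.1
first = binary_activity_vector("acdefdbeh", activities)
second = binary_activity_vector("acdefbh", activities)
print(first)
print(second)
```

Both traces contain exactly the same activities, so the baseline maps them to the same vector; the ordering and repetition that the distance graph features record are lost.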
Experiments with the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are shown in Fig. 4.
3.2 The Experiment with K-Means Algorithm
In this experiment, the K-means clustering algorithm was run on the vector-based and the distance graph-based trace representations. The experimental results are described in Table 2. We also calculated the values of Average-Fitness and Average-Precision, computed by (1), for the activity-based (Vector) and the distance graph-based (Distance graph) trace representations; they are given in the columns titled “Avg” in the table.
Experiments with the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are shown in Fig. 5.
1 http://data.3tu.nl/repository/uuid:44c32783-15d0-4dbd-af8a-78b97be3de49
3.2.1 Discussions
The results shown in Table 1 (Fig. 4) and Table 2 (Fig. 5) lead to the following findings:
Table 1. Using the K-Modes clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph order 2-based (Distance Graph) trace representations
Fig. 4. Comparison of the discovered process models on the measures of fitness and precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-Modes clustering algorithm
Table 2. Using the K-means clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph-based (Distance Graph) trace representations
– In all cases, the distance graph-based trace representation performs better than the vector-based trace representation on both the fitness and the precision measures.
– The effect of the distance graph-based trace representation on the precision measure is higher than that on the fitness measure.
– Distance graph order 2 has a better effect on precision than distance graph order 1.
G. Greco et al. [8] proposed a clustering solution for traces in event logs. They used a vector representation for traces and the K-means algorithm. That work is the first study on trace clustering within the process mining domain.
R. P. Jagadeesh Chandra Bose [6] and R. P. Jagadeesh Chandra Bose et al. [4, 5] proposed trace clustering solutions based on control-flow context information, i.e., “context-aware” solutions. The Levenshtein distance technique was used.
De Weerdt et al. [15] proposed a two-phase solution that combines trace clustering and text mining for process discovery. In the first phase, an MRA-based semi-supervised clustering technique (the SemSup-MRA algorithm) was applied. This yields two kinds of clusters: clusters of standard behaviors and clusters of atypical behaviors. In the second phase, process mining and text-data mining techniques were applied. After [15], De Weerdt et al. [16] proposed the ActiTraC algorithm, a three-phase algorithm for clustering an event log into a collection of event logs (clusters). Its three phases are Selection, Look ahead, and Residual trace resolution. They also developed the ActiTraCMRA algorithm, a further version of the ActiTraC algorithm.
Fig. 5. Comparison of the discovered process models on the measures of fitness and precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-means clustering algorithm
T. Thaler et al. [14] provided a survey of trace clustering techniques. They also analyzed and compared the investigated trace clustering techniques.
This work is the first study on using the distance graph model [3] for trace clustering.
This work provided a trace representation solution based on the distance graph model [3] for clustering traces in event logs. Experiments showed that the distance graph-based trace representation is more effective than the activity-based one.
The experiments in this work are limited, and several tasks remain for the future. Firstly, other distance measures between graphs, e.g., distances in graph theory [18], should be studied in order to cluster traces directly in the form of graphs. Secondly, more clustering algorithms, especially graph-based clustering algorithms, should be considered. Thirdly, experiments on more event log datasets are needed to confirm the reliability of the method.
Acknowledgments. This work was supported in part by VNU Grant QG-15-22.
References
1. van der Aalst, W.M., van Dongen, B.F.: Discovering workflow performance models from timed logs. In: Han, Y., Tai, S., Wikarski, D. (eds.) EDCIS 2002. LNCS, vol. 2480, pp. 45–63. Springer, Heidelberg (2002)
2. van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Heidelberg (2011)
3. Aggarwal, C.C., Zhao, P.: Towards graphical models for text processing. Knowl. Inf. Syst. 36(1), 1–21 (2013)
4. Bose, R.C., van der Aalst, W.M.: Trace clustering based on conserved patterns: towards achieving better process models. In: Rinderle-Ma, S., Sadiq, S., Leymann, F. (eds.) BPM 2009. LNBIP, vol. 43, pp. 170–181. Springer, Heidelberg (2010)
5. Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: SDM 2009, pp. 401–412 (2009)
6. Bose, R.P.J.C.: Process Mining in the Large: Preprocessing, Discovery, and Diagnostics. Ph.D. thesis, Eindhoven University of Technology (2012)
7. Dai, Xin-Yu., Cheng, C., Huang, S., Chen, J.: Sentiment classification with graph sparsity regularization. In: Gelbukh, A. (ed.) LNCS, vol. 9042, pp. 140–151. Springer, Heidelberg (2015)
8. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)
9. de Medeiros, A.K.A., van Dongen, B.F., van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: extending the alpha-algorithm to mine short loops. BETA Working Paper Series (2004)
10. de Medeiros, A.K.A., Guzzo, A., Greco, G., van der Aalst, W.M., Weijters, A., van Dongen, B.F., Saccà, D.: Process mining based on clustering: a quest for precision. In: Hofstede, A.H., Benatallah, B., Paik, H.-Y. (eds.) BPM Workshops 2007. LNCS, vol. 4928, pp. 17–29. Springer, Heidelberg (2008)
11. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Inf. Syst. 33(1), 64–95 (2008)
12. Buijs, J.C., van Dongen, B.F., van der Aalst, W.M.: On the role of fitness, precision, generalization and simplicity in process discovery. In: Meersman, R., Panetto, H., Dillon, T., Rinderle-Ma, S., Dadam, P., Zhou, X., Pearson, S., Ferscha, A., Bergamaschi, S., Cruz, I.F. (eds.) OTM 2012, Part I. LNCS, vol. 7565, pp. 305–322. Springer, Heidelberg (2012)
13. Song, M., Günther, C.W., van der Aalst, W.M.: Trace clustering in process mining. In: Ardagna, D., Mecella, M., Yang, J. (eds.) Business Process Management Workshops. LNBIP, vol. 17, pp. 109–120. Springer, Heidelberg (2009)
14. Thaler, T., Ternis, S.F., Fettke, P., Loos, P.: A comparative analysis of process instance cluster techniques. In: Wirtschaftsinformatik 2015, pp. 423–437 (2015)
15. De Weerdt, J., van den Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Leveraging process discovery with trace clustering and text mining for intelligent analysis of incident management processes. In: IEEE Congress on Evolutionary Computation, pp. 1–8 (2012)
16. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. 25(12), 2708–2720 (2013)
17. Wen, L., van der Aalst, W.M.P., Wang, J., Sun, J.: Mining process models with non-free-choice constructs. Data Min. Knowl. Discov. 15(2), 145–180 (2007)
18. Deza, M.M., Deza, E.: Distances in Graph Theory. Springer, Heidelberg (2014)
19. http://www.processmining.org/prom/start