Linguistic Vector Similarity Measures and Applications to Linguistic Information Classification
Phong Pham Hong,1,∗ Son Le Hoang2
1 Faculty of Information Technology, National University of Civil Engineering, Hanoi, Vietnam
2 VNU University of Science, Vietnam National University, Hanoi, Vietnam
In this paper, we generalize the similarity degree for linguistic labels to the so-called linguistic similarity measure. The linguistic vector, which can be used to represent objects whose attributes are given in terms of linguistic labels, is defined. Some mathematical properties are stated and proved. The linguistic vector similarity measure is developed and applied to linguistic information classification. Experimental results on real data confirm the effectiveness of the proposed method.
© 2016 Wiley Periodicals, Inc.
In many situations, information is best represented by linguistic labels.¹ The transformation of these labels into numbers is, in many cases, costly or even impossible.²
Example 1. We consider the Car Evaluation Database,³ where each item has seven attributes, each of which takes linguistic values in a corresponding label set (Table I).
The label set, S, can be described mathematically in several ways (g is an even positive integer):
• Finite and totally ordered discrete set:⁴
\[ S = \{s_0, s_1, \ldots, s_g\}; \tag{1} \]
• Subscript-symmetric linguistic evaluation scale:⁵
\[ S = \left\{ s_\alpha \;\middle|\; \alpha = -\tfrac{g}{2}, \ldots, -1, 0, 1, \ldots, \tfrac{g}{2} \right\}; \tag{2} \]
∗Author to whom all correspondence should be addressed; e-mail: phphong84@yahoo.com.
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL 00, 1–15 (2016) © 2016 Wiley Periodicals, Inc.
View this article online at wileyonlinelibrary.com • DOI 10.1002/int.21830
Table I Car evaluation database
Acceptable (22.222%), Good (3.993%), Very good (3.762%)
• Multiplicative linguistic evaluation scale:⁶
\[ S = \left\{ s_\alpha \;\middle|\; \alpha = \tfrac{1}{g/2 + 1}, \ldots, \tfrac{1}{2}, 1, 2, \ldots, \tfrac{g}{2} + 1 \right\}. \tag{3} \]
Linguistic labels can be handled using linguistic aggregation operators, which are classified into four groups:⁷,⁸ the linguistic computational model based on membership functions, the linguistic computational model based on type-2 fuzzy sets, the linguistic symbolic computational models based on ordinal scales, and the linguistic symbolic computational models based on 2-tuple representation.
Xu⁹ introduced a computational model that improves the accuracy of linguistic aggregation by extending the subscript-symmetric linguistic evaluation scale (Equation 2) to the continuous linguistic scale $\bar{S} = \{s_\alpha \mid \alpha \in [-t, t]\}$, where t (t > g) is a sufficiently large positive integer. If $s_\alpha \in S$, then $s_\alpha$ is called an original linguistic label; otherwise, it is called an extended (or virtual) linguistic label.
Example 2.¹⁰ The subscript-symmetric linguistic evaluation scale (original linguistic terms) $S = \{s_{-3}, s_{-2}, s_{-1}, s_0, s_1, s_2, s_3\}$ is extended to the continuous one $\bar{S} = \{s_\alpha \mid \alpha \in [-3, 3]\}$. The label $s_{-0.3} \in \bar{S}$, for example, is a virtual linguistic label.
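The linguistic weighted averaging (LWA) operator, which aggregates weighted continuous labels, was developed:¹¹
\[ \mathrm{LWA}: \bar{S}^n \to \bar{S}, \qquad \mathrm{LWA}(s_{\alpha_1}, \ldots, s_{\alpha_n}) = s_{\bar{\alpha}}, \tag{4} \]
where $\bar{\alpha} = \sum_{j=1}^{n} w_j \alpha_j$ and $w = (w_1, \ldots, w_n)$ is the weight vector, with $w_j \ge 0$ for all $j = 1, \ldots, n$ and $\sum_{j=1}^{n} w_j = 1$.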
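As a minimal sketch of the LWA operator, the following Python function works directly on label subscripts: it takes the subscripts α_j and the weights w_j and returns the subscript ᾱ of the aggregated label. The function name `lwa` and the choice to represent a label s_α by its subscript α are assumptions made for illustration only.

```python
from typing import Sequence


def lwa(alphas: Sequence[float], weights: Sequence[float]) -> float:
    """Return alpha-bar, the subscript of the aggregated label s_{alpha-bar}.

    Labels are represented by their subscripts (an illustrative assumption).
    """
    if len(alphas) != len(weights):
        raise ValueError("alphas and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Weighted average of the subscripts: alpha-bar = sum_j w_j * alpha_j.
    return sum(w * a for w, a in zip(weights, alphas))


# Aggregating s_{-1}, s_0, s_2 with weights (0.5, 0.25, 0.25).
alpha_bar = lwa([-1, 0, 2], [0.5, 0.25, 0.25])
```

The result is generally a non-integer subscript, that is, a virtual label in the continuous scale.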
Xu¹² developed deviation measures and similarity degrees between two linguistic labels, s_α and s_β, in the subscript-symmetric linguistic evaluation scale S given in Equation 2:
• the deviation degree: d(s_α, s_β) = |α − β|
• the similarity degree: ρ(s_α, s_β) = 1 − d(s_α, s_β).
This paper contains two main contributions:
(1) Some new concepts are defined: linguistic similarity measures, linguistic vectors, and linguistic vector similarity measures;
(2) The aforementioned concepts are used to construct an algorithm to classify linguistic information (linguistic classification algorithm, LCA).
The rest of the paper is organized as follows:
• In Section 2, we define the linguistic similarity measure and introduce some instances of this concept;
• In Section 3, the linguistic vector is defined, and linguistic vector similarity measures are then developed as combinations of linguistic similarity measures;
• Section 4 proposes the linguistic classification algorithm and its modified version (LCA and MLCA). The algorithms are applied to a real data set, and some statistical criteria are assessed;
• The last section concludes the paper.
Consider a linguistic scale $S = \{s_0, s_1, \ldots, s_g\}$ such that⁴ $s_i \ge s_j$ iff $i \ge j$, for all $s_i, s_j \in S$.
Definition. A function $\mathrm{sim}: S \times S \to \mathbb{R}$, where $\mathbb{R}$ is the set of all real numbers, is called a linguistic similarity measure if it satisfies the following conditions:
(A1) $\mathrm{sim}(s_i, s_j) = \mathrm{sim}(s_j, s_i)$, for all $s_i, s_j \in S$;
(A2) $0 \le \mathrm{sim}(s_i, s_j) \le 1$, for all $s_i, s_j \in S$;
(A3) $\mathrm{sim}(s_i, s_j) = 1 \iff s_i = s_j$, for all $s_i, s_j \in S$;
(A4) If $s_i \le s_j \le s_k$, then $\mathrm{sim}(s_i, s_j) \ge \mathrm{sim}(s_i, s_k)$ and $\mathrm{sim}(s_j, s_k) \ge \mathrm{sim}(s_i, s_k)$, for all $s_i, s_j, s_k \in S$.
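Because the scale is finite, conditions (A1)–(A4) can be checked by brute force. The sketch below assumes that labels are represented by their integer subscripts 0, …, g and that a candidate measure is given as a Python function of two subscripts; the name `is_linguistic_similarity`, the tolerance `tol`, and the two candidate functions in the usage lines are hypothetical and serve only as illustrations.

```python
from itertools import product
from typing import Callable


def is_linguistic_similarity(
    sim: Callable[[int, int], float], g: int, tol: float = 1e-12
) -> bool:
    """Brute-force check of (A1)-(A4) on the scale {s_0, ..., s_g}."""
    idx = range(g + 1)
    for i, j in product(idx, repeat=2):
        s = sim(i, j)
        if abs(s - sim(j, i)) > tol:          # (A1) symmetry
            return False
        if s < -tol or s > 1 + tol:           # (A2) values in [0, 1]
            return False
        if (abs(s - 1) <= tol) != (i == j):   # (A3) sim = 1 iff labels equal
            return False
    # (A4) monotonicity for every ordered triple i <= j <= k.
    for i in idx:
        for j in range(i, g + 1):
            for k in range(j, g + 1):
                if sim(i, j) < sim(i, k) - tol or sim(j, k) < sim(i, k) - tol:
                    return False
    return True


# A linear candidate on a scale with g = 6 passes; a constant one fails (A3).
ok = is_linguistic_similarity(lambda i, j: 1 - abs(i - j) / 6, 6)
bad = is_linguistic_similarity(lambda i, j: 1.0, 6)
```

Such a check is only a sanity test for one fixed g; it does not replace a proof for arbitrary g.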
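Theorem. The following functions are linguistic similarity measures on S:
(1) $\mathrm{sim}_1(s_i, s_j) = \dfrac{g - |i - j|}{g}$,
(2) $\mathrm{sim}_2(s_i, s_j) = \dfrac{\min\{i, j\}}{\max\{i, j\}}$,
(3) $\mathrm{sim}_3(s_i, s_j) = 1 - \dfrac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)}$, and
(4) $\mathrm{sim}_4(s_i, s_j) = 1 - \dfrac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)}$.
Notice that, to avoid a zero denominator, we set $\frac{0}{0} = 1$ in the definition of $\mathrm{sim}_2$.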
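The four measures can be sketched in Python as follows, assuming labels are represented by their subscripts i, j in {0, …, g}. The function names and the helper `_exp_sim` are illustrative assumptions, not part of the original formulation.

```python
import math


def sim1(i: int, j: int, g: int) -> float:
    """Linear measure: (g - |i - j|) / g."""
    return (g - abs(i - j)) / g


def sim2(i: int, j: int) -> float:
    """Ratio measure: min{i, j} / max{i, j}, with the convention 0/0 = 1."""
    hi = max(i, j)
    return 1.0 if hi == 0 else min(i, j) / hi


def _exp_sim(d: float) -> float:
    """Map a normalized distance d in [0, 1] to 1 - (1 - e^{-d}) / (1 - e^{-1})."""
    return 1 - (1 - math.exp(-d)) / (1 - math.exp(-1))


def sim3(i: int, j: int, g: int) -> float:
    """Exponential measure on the distance |i - j| / g."""
    return _exp_sim(abs(i - j) / g)


def sim4(i: int, j: int, g: int) -> float:
    """Exponential measure on the distance |sqrt(i) - sqrt(j)| / sqrt(g)."""
    return _exp_sim(abs(math.sqrt(i) - math.sqrt(j)) / math.sqrt(g))
```

All four return 1 for identical labels; sim1, sim3, and sim4 return 0 for the two extreme labels s_0 and s_g, while sim2 returns 0 whenever exactly one of the labels is s_0.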
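Proof. It is easily seen that $\mathrm{sim}_1$, $\mathrm{sim}_2$, $\mathrm{sim}_3$, and $\mathrm{sim}_4$ satisfy condition (A1).
(1) Consider $\mathrm{sim}_1$.
(A2) Since $0 \le |i - j| \le g$,
\[
0 \le g - |i - j| \le g
\implies 0 \le \frac{g - |i - j|}{g} \le 1
\implies 0 \le \mathrm{sim}_1(s_i, s_j) \le 1.
\]
(A3)
\[
\mathrm{sim}_1(s_i, s_j) = 1 \iff |i - j| = 0 \iff i = j \iff s_i = s_j.
\]
(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,
\[
|i - j| = j - i \le k - i = |i - k|
\implies g - |i - j| \ge g - |i - k|
\implies \frac{g - |i - j|}{g} \ge \frac{g - |i - k|}{g}.
\]
We thus obtain $\mathrm{sim}_1(s_i, s_j) \ge \mathrm{sim}_1(s_i, s_k)$, and similarly $\mathrm{sim}_1(s_j, s_k) \ge \mathrm{sim}_1(s_i, s_k)$.
(2) Consider $\mathrm{sim}_2$.
(A2)
\[
0 \le \min\{i, j\} \le \max\{i, j\}
\implies 0 \le \frac{\min\{i, j\}}{\max\{i, j\}} \le 1
\implies 0 \le \mathrm{sim}_2(s_i, s_j) \le 1.
\]
(A3)
\[
\mathrm{sim}_2(s_i, s_j) = 1 \iff \min\{i, j\} = \max\{i, j\} \iff i = j \iff s_i = s_j.
\]
(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,
\[
\frac{\min\{i, j\}}{\max\{i, j\}} = \frac{i}{j} \ge \frac{i}{k} = \frac{\min\{i, k\}}{\max\{i, k\}}.
\]
This shows that $\mathrm{sim}_2(s_i, s_j) \ge \mathrm{sim}_2(s_i, s_k)$. Similarly, $\mathrm{sim}_2(s_j, s_k) \ge \mathrm{sim}_2(s_i, s_k)$.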
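(3) Consider $\mathrm{sim}_3$.
(A2)
\[
\begin{aligned}
0 \le |i - j| \le g
&\implies 0 \le \frac{|i - j|}{g} \le 1 \\
&\implies -1 \le -\frac{|i - j|}{g} \le 0 \\
&\implies \exp(-1) \le \exp\left(-\frac{|i - j|}{g}\right) \le 1 \\
&\implies 0 \le 1 - \exp\left(-\frac{|i - j|}{g}\right) \le 1 - \exp(-1) \\
&\implies 0 \le \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le 1 \\
&\implies 0 \le 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le 1 \\
&\implies 0 \le \mathrm{sim}_3(s_i, s_j) \le 1.
\end{aligned}
\]
(A3)
\[
\begin{aligned}
\mathrm{sim}_3(s_i, s_j) = 1
&\iff 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} = 1 \\
&\iff \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} = 0 \\
&\iff 1 - \exp\left(-\frac{|i - j|}{g}\right) = 0 \\
&\iff \exp\left(-\frac{|i - j|}{g}\right) = 1 \\
&\iff -\frac{|i - j|}{g} = 0 \\
&\iff i = j \iff s_i = s_j.
\end{aligned}
\]
(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,
\[
\begin{aligned}
\frac{|i - j|}{g} = \frac{j - i}{g} \le \frac{k - i}{g} = \frac{|i - k|}{g}
&\implies -\frac{|i - j|}{g} \ge -\frac{|i - k|}{g} \\
&\implies \exp\left(-\frac{|i - j|}{g}\right) \ge \exp\left(-\frac{|i - k|}{g}\right) \\
&\implies 1 - \exp\left(-\frac{|i - j|}{g}\right) \le 1 - \exp\left(-\frac{|i - k|}{g}\right) \\
&\implies \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le \frac{1 - \exp\left(-\frac{|i - k|}{g}\right)}{1 - \exp(-1)} \\
&\implies 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \ge 1 - \frac{1 - \exp\left(-\frac{|i - k|}{g}\right)}{1 - \exp(-1)}.
\end{aligned}
\]
This proves $\mathrm{sim}_3(s_i, s_j) \ge \mathrm{sim}_3(s_i, s_k)$. In exactly the same way, $\mathrm{sim}_3(s_j, s_k) \ge \mathrm{sim}_3(s_i, s_k)$.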
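(4) We finally consider $\mathrm{sim}_4$.
(A2)
\[
\begin{aligned}
0 \le |\sqrt{i} - \sqrt{j}| \le \sqrt{g}
&\implies 0 \le \frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \le 1 \\
&\implies -1 \le -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \le 0 \\
&\implies \exp(-1) \le \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 \\
&\implies 0 \le 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 - \exp(-1) \\
&\implies 0 \le \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le 1 \\
&\implies 0 \le 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le 1 \\
&\implies 0 \le \mathrm{sim}_4(s_i, s_j) \le 1.
\end{aligned}
\]
(A3)
\[
\begin{aligned}
\mathrm{sim}_4(s_i, s_j) = 1
&\iff 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} = 1 \\
&\iff \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} = 0 \\
&\iff 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) = 0 \\
&\iff \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) = 1 \\
&\iff -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} = 0 \\
&\iff i = j \iff s_i = s_j.
\end{aligned}
\]
(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,
\[
\begin{aligned}
\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} = \frac{\sqrt{j} - \sqrt{i}}{\sqrt{g}} \le \frac{\sqrt{k} - \sqrt{i}}{\sqrt{g}} = \frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}
&\implies -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \ge -\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}} \\
&\implies \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \ge \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right) \\
&\implies 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right) \\
&\implies \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \\
&\implies 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \ge 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right)}{1 - \exp(-1)}.
\end{aligned}
\]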
This proves $\mathrm{sim}_4(s_i, s_j) \ge \mathrm{sim}_4(s_i, s_k)$. In exactly the same way, $\mathrm{sim}_4(s_j, s_k) \ge \mathrm{sim}_4(s_i, s_k)$.
A linguistic vector of length n is defined as
\[ V = (v_1, \ldots, v_n), \]
where $v_t$, which represents the linguistic value of the t-th attribute, is a linguistic label in the t-th linguistic scale $S_t = \{s^t_1, \ldots, s^t_{g_t}\}$ $(t = 1, \ldots, n)$.
From now on, S denotes the set of all linguistic vectors of length n.
Definition. A function SIM on S × S is called a linguistic vector similarity measure if it satisfies the following conditions:
(B1) SIM(U, V ) = SIM(V, U), for all U, V ∈ S;
(B2) 0 ≤ SIM(U, V ) ≤ 1, for all U, V ∈ S;
(B3) SIM(U, V ) = 1 ⇔ U = V , for all U, V ∈ S;
(B4) If U ≤ V ≤ T , then SIM(U, V ) ≥ SIM(U, T ) and SIM(V, T ) ≥ SIM(U, T ), for all
U , V , T ∈ S (for U = (u1, , u n ) and V = (v1, , v n ), U ≤ V means u t ≤ v t for all t = 1, , n).
Definition. Let $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_n)$ be two linguistic vectors, let sim be a linguistic similarity measure, and let $w = (w_1, \ldots, w_n)$ be a weighting vector satisfying $w_t \ge 0$ for all $t = 1, \ldots, n$ and $\sum_{t=1}^{n} w_t = 1$. We define:
(1) The quadric linguistic similarity measure between U and V:
\[ \mathrm{SIM}_Q(U, V) = \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2}. \]
(2) The arithmetic linguistic similarity measure between U and V:
\[ \mathrm{SIM}_A(U, V) = \sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, v_t). \]
(3) The geometric linguistic similarity measure between U and V:
\[ \mathrm{SIM}_G(U, V) = \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t}. \]
(4) The harmonic linguistic similarity measure between U and V:
\[ \mathrm{SIM}_H(U, V) = \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)} \right)^{-1}. \]
To avoid violating mathematical rules, we set $0^0 = 1$ in the definition of $\mathrm{SIM}_G$ and $\frac{w_t}{0} = 0$ in the definition of $\mathrm{SIM}_H$.
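The four aggregations defined above can be sketched in Python as below. For brevity, each function takes the list of per-attribute similarities sim(u_t, v_t) rather than the vectors themselves; this interface and the function names are assumptions for illustration. The conventions 0^0 = 1 and w_t/0 = 0 follow the text; returning 0 when every term of the harmonic sum is dropped is an extra assumption, since the text does not cover that case.

```python
import math
from typing import Sequence


def sim_q(sims: Sequence[float], w: Sequence[float]) -> float:
    """Quadric measure: square root of the weighted mean of squared similarities."""
    return math.sqrt(sum(wt * s * s for wt, s in zip(w, sims)))


def sim_a(sims: Sequence[float], w: Sequence[float]) -> float:
    """Arithmetic measure: weighted mean of the similarities."""
    return sum(wt * s for wt, s in zip(w, sims))


def sim_g(sims: Sequence[float], w: Sequence[float]) -> float:
    """Geometric measure; Python evaluates 0.0 ** 0.0 as 1.0, matching 0^0 = 1."""
    return math.prod(s ** wt for wt, s in zip(w, sims))


def sim_h(sims: Sequence[float], w: Sequence[float]) -> float:
    """Harmonic measure; terms with sim = 0 are dropped (convention w_t/0 = 0)."""
    total = sum(wt / s for wt, s in zip(w, sims) if s > 0)
    # Assumption: if every term was dropped, report 0 instead of dividing by zero.
    return 1 / total if total > 0 else 0.0


sims = [1.0, 0.5, 0.25]        # hypothetical per-attribute similarities
weights = [0.5, 0.25, 0.25]    # hypothetical weighting vector
values = (sim_q(sims, weights), sim_a(sims, weights),
          sim_g(sims, weights), sim_h(sims, weights))
```

Passing precomputed similarities keeps the aggregation independent of which label-level measure is used.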
Theorem. For all linguistic vectors U and V,
\[ \mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_H(U, V). \]
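Proof.
• Using the Cauchy–Schwarz inequality, $\left(\sum_{t=1}^{n} x_t^2\right)\left(\sum_{t=1}^{n} y_t^2\right) \ge \left(\sum_{t=1}^{n} x_t y_t\right)^2$ for all $(x_1, \ldots, x_n), (y_1, \ldots, y_n) \in \mathbb{R}^n$, and noting that $\sum_{t=1}^{n} (w_t^{1/2})^2 = \sum_{t=1}^{n} w_t = 1$, we have
\[
\begin{aligned}
\sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2
&= \left( \sum_{t=1}^{n} \left(w_t^{1/2}\right)^2 \right) \left( \sum_{t=1}^{n} \left(w_t^{1/2}\, \mathrm{sim}(u_t, v_t)\right)^2 \right) \\
&\ge \left( \sum_{t=1}^{n} w_t^{1/2}\, w_t^{1/2}\, \mathrm{sim}(u_t, v_t) \right)^2 \\
&= \left( \sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, v_t) \right)^2.
\end{aligned}
\]
That means $(\mathrm{SIM}_Q(U, V))^2 \ge (\mathrm{SIM}_A(U, V))^2$, or $\mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_A(U, V)$.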
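• Using the inequality of weighted arithmetic and geometric means (weighted AM-GM inequality), $\left(\sum_{t=1}^{n} w_t x_t\right)/w \ge \left(\prod_{t=1}^{n} x_t^{w_t}\right)^{1/w}$ for all $x_t \ge 0$, $w_t \ge 0$ $(t = 1, \ldots, n)$, $w = \sum_{t=1}^{n} w_t > 0$, we have
\[
\begin{aligned}
\sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, v_t)
&= \left( \sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, v_t) \right) \Big/ \left( \sum_{t=1}^{n} w_t \right) \\
&\ge \left( \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \right)^{1 / \sum_{t=1}^{n} w_t} \\
&= \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t}.
\end{aligned}
\]
So, $\mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_G(U, V)$.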
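• Using the weighted AM-GM inequality,
\[
\begin{aligned}
\sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)}
&= \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^{-1} \right) \Big/ \left( \sum_{t=1}^{n} w_t \right) \\
&\ge \prod_{t=1}^{n} \left( \left(\mathrm{sim}(u_t, v_t)\right)^{-1} \right)^{w_t} \\
&= \prod_{t=1}^{n} \left( \frac{1}{\mathrm{sim}(u_t, v_t)} \right)^{w_t} \\
&= \left( \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \right)^{-1}.
\end{aligned}
\]
This proves $(\mathrm{SIM}_H(U, V))^{-1} \ge (\mathrm{SIM}_G(U, V))^{-1}$, or $\mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_H(U, V)$.
Theorem. $\mathrm{SIM}_Q$, $\mathrm{SIM}_A$, $\mathrm{SIM}_G$, and $\mathrm{SIM}_H$ are linguistic vector similarity measures.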
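Proof. Obviously, $\mathrm{SIM}_Q$, $\mathrm{SIM}_A$, $\mathrm{SIM}_G$, and $\mathrm{SIM}_H$ satisfy (B1).
(B2) Since $\mathrm{SIM}_Q \ge \mathrm{SIM}_A \ge \mathrm{SIM}_G \ge \mathrm{SIM}_H \ge 0$, it is sufficient to prove that $\mathrm{SIM}_Q(U, V) \le 1$ for all $U, V \in S$.
By the fact that
\[ \mathrm{sim}(u_t, v_t) \le 1, \quad \forall t = 1, \ldots, n, \]
we obtain
\[
\mathrm{SIM}_Q(U, V) = \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2} \le \left( \sum_{t=1}^{n} w_t \right)^{1/2} = 1.
\]
(B3) [...] $\mathrm{SIM}_Q(U, V)$, $\mathrm{SIM}_A(U, V)$, and $\mathrm{SIM}_G(U, V)$ equal 1. So, it is sufficient to prove (B3) for $\mathrm{SIM}_H$.
We have
\[
\begin{aligned}
\mathrm{SIM}_H(U, V) = 1
&\iff \frac{w_t}{\mathrm{sim}(u_t, v_t)} = w_t, \quad \forall t = 1, \ldots, n \\
&\iff \mathrm{sim}(u_t, v_t) = 1, \quad \forall t = 1, \ldots, n \\
&\iff u_t = v_t, \quad \forall t = 1, \ldots, n \\
&\iff U = V.
\end{aligned}
\]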
(B4) For $U = (u_1, \ldots, u_n) \le V = (v_1, \ldots, v_n) \le T = (\tau_1, \ldots, \tau_n)$, we have $u_t \le v_t \le \tau_t$ for all $t = 1, \ldots, n$. This implies
\[ \mathrm{sim}(u_t, v_t) \ge \mathrm{sim}(u_t, \tau_t), \quad \forall t = 1, \ldots, n. \tag{5} \]
Using (5), we will show that $\mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_Q(U, T)$, $\mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_A(U, T)$, $\mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_G(U, T)$, and $\mathrm{SIM}_H(U, V) \ge \mathrm{SIM}_H(U, T)$ (the remaining inequalities are proved in the same way). We have
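\[
\begin{aligned}
& w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \ge w_t \left(\mathrm{sim}(u_t, \tau_t)\right)^2, \quad \forall t = 1, \ldots, n \\
&\implies \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2} \ge \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, \tau_t)\right)^2 \right)^{1/2} \\
&\implies \mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_Q(U, T);
\end{aligned}
\]
\[
\begin{aligned}
& w_t\, \mathrm{sim}(u_t, v_t) \ge w_t\, \mathrm{sim}(u_t, \tau_t), \quad \forall t = 1, \ldots, n \\
&\implies \sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, v_t) \ge \sum_{t=1}^{n} w_t\, \mathrm{sim}(u_t, \tau_t) \\
&\implies \mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_A(U, T);
\end{aligned}
\]
\[
\begin{aligned}
& \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \ge \left(\mathrm{sim}(u_t, \tau_t)\right)^{w_t}, \quad \forall t = 1, \ldots, n \\
&\implies \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \ge \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, \tau_t)\right)^{w_t} \\
&\implies \mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_G(U, T);
\end{aligned}
\]
\[
\begin{aligned}
& \frac{1}{\mathrm{sim}(u_t, v_t)} \le \frac{1}{\mathrm{sim}(u_t, \tau_t)}, \quad \forall t = 1, \ldots, n \\
&\implies \frac{w_t}{\mathrm{sim}(u_t, v_t)} \le \frac{w_t}{\mathrm{sim}(u_t, \tau_t)}, \quad \forall t = 1, \ldots, n \\
&\implies \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)} \right)^{-1} \ge \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, \tau_t)} \right)^{-1} \\
&\implies \mathrm{SIM}_H(U, V) \ge \mathrm{SIM}_H(U, T).
\end{aligned}
\]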
CLASSIFICATION
In this section, we use the linguistic similarity measure and the linguistic vector similarity measure to classify items with attributes given as linguistic labels.
Let D be a data set whose items are described by (n + 1) attributes. Each attribute $A_t$ takes values in a corresponding linguistic scale $S_t = \{s^{(t)}_0, \ldots, s^{(t)}_{g_t}\}$, $t = 1, \ldots, n + 1$. D is classified on the (n + 1)-th attribute, $A_{n+1}$.
Linguistic Classification Algorithm (LCA). Let $D_1 \subset D$ be the training set, with cardinality $n_1$. Consider an item $K \in D \setminus D_1$ whose associated attribute vector $V = (v_1, \ldots, v_n)$ contains the values of the first n attributes, where $v_t$ is the linguistic value of the t-th attribute, for all $t = 1, \ldots, n$.
(1) For each $I_j \in D_1$, let $V_j$ denote its attribute vector containing the values of the first n attributes of $I_j$ $(j = 1, \ldots, n_1)$. The similarity between the items K and $I_j$ is defined as the similarity between the two linguistic vectors V and $V_j$:
\[ \mathrm{SIM}(K, I_j) = \mathrm{SIM}(V, V_j). \]
(2) Aggregate the values of the (n + 1)-th attribute of all items $I_j$ $(j = 1, \ldots, n_1)$ using the LWA operator (Equation 4):
\[ u = s_{\bar{\alpha}} = \mathrm{LWA}(u_1, \ldots, u_{n_1}), \]
where $u_j$ is the value of the (n + 1)-th attribute of the item $I_j$ $(j = 1, \ldots, n_1)$, and the weighting vector is $w = (w_1, \ldots, w_{n_1})$ with
\[ w_j = \frac{\mathrm{SIM}(K, I_j)}{\sum_{j=1}^{n_1} \mathrm{SIM}(K, I_j)} \quad (j = 1, \ldots, n_1). \]
(3) Evaluate the value $u^*$ of the (n + 1)-th attribute of the item K: the predicted value of the (n + 1)-th attribute of K is $u^* = s^{(n+1)}_l$, where $l = \mathrm{round}(\bar{\alpha})$.
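The three steps of LCA can be sketched as follows, assuming that every label is represented by its integer subscript and that a vector similarity function is supplied by the caller. The names `lca_predict` and `example_vec_sim` are hypothetical; `example_vec_sim` is just an equal-weight arithmetic aggregation of a linear label similarity, chosen for illustration. Rounding ties upward is also an assumption, since the text does not say how round(ᾱ) breaks ties.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[int]  # label subscripts of the first n attributes


def lca_predict(
    v: Vector,
    train_vectors: Sequence[Vector],
    train_classes: Sequence[int],
    vec_sim: Callable[[Vector, Vector], float],
) -> int:
    """Predict the subscript l of the class label of an item with attributes v."""
    # Step 1: similarity between the new item and every training item.
    sims = [vec_sim(v, vj) for vj in train_vectors]
    total = sum(sims)
    if total == 0:
        raise ValueError("all similarities are zero; weights are undefined")
    # Step 2: LWA of the class subscripts with weights SIM(K, I_j) / sum of SIMs.
    alpha_bar = sum(s * u for s, u in zip(sims, train_classes)) / total
    # Step 3: round to the nearest original label (ties rounded up: assumption).
    return math.floor(alpha_bar + 0.5)


def example_vec_sim(u: Vector, v: Vector, g: int = 3) -> float:
    """Illustrative vector similarity: equal-weight mean of 1 - |a - b| / g."""
    return sum(1 - abs(a - b) / g for a, b in zip(u, v)) / len(u)


train_vectors = [(0, 0), (3, 3)]   # hypothetical training attribute vectors
train_classes = [0, 2]             # hypothetical class subscripts
predicted = lca_predict((3, 0), train_vectors, train_classes, example_vec_sim)
```

Every training item contributes to the prediction here, weighted by its similarity to the new item.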
Remark 1. Step 2 of algorithm LCA can be refined using the two substeps below. The modification is termed the modified linguistic classification algorithm (MLCA).
(2a) Specify the N nearest neighbors of K $(1 \le N \le n_1)$. That is, choose $I^*_j \in D_1$ $(j = 1, \ldots, N)$, the N items most similar to K according to the similarity computed in the previous step. N is an adjustable integer determined by experiments.
(2b) Aggregate the values of the (n + 1)-th attribute of the items $I^*_j$ $(j = 1, \ldots, N)$ using the LWA operator (Equation 4):
\[ u = s_{\bar{\alpha}} = \mathrm{LWA}(u^*_1, \ldots, u^*_N), \]
where $u^*_j$ is the value of the (n + 1)-th attribute of the item $I^*_j$ $(j = 1, \ldots, N)$, and the weighting vector is $w = (w_1, \ldots, w_N)$ with
\[ w_j = \frac{\mathrm{SIM}(K, I^*_j)}{\sum_{j=1}^{N} \mathrm{SIM}(K, I^*_j)} \quad (j = 1, \ldots, N). \]
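Substeps (2a) and (2b) can be sketched as a small variation of the same idea: keep only the N most similar training items before aggregating. As before, labels are assumed to be represented by integer subscripts; `mlca_predict` and `example_vec_sim` are hypothetical names, the example similarity is illustrative, and rounding ties upward is an assumption.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[int]  # label subscripts of the first n attributes


def mlca_predict(
    v: Vector,
    train_vectors: Sequence[Vector],
    train_classes: Sequence[int],
    vec_sim: Callable[[Vector, Vector], float],
    n_neighbors: int,
) -> int:
    """Predict the class subscript of v from its N most similar training items."""
    if not 1 <= n_neighbors <= len(train_vectors):
        raise ValueError("n_neighbors must satisfy 1 <= N <= n1")
    # (2a) Rank training items by similarity to v and keep the top N.
    scored = sorted(
        ((vec_sim(v, vj), uj) for vj, uj in zip(train_vectors, train_classes)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:n_neighbors]
    total = sum(s for s, _ in scored)
    if total == 0:
        raise ValueError("all neighbor similarities are zero")
    # (2b) LWA over the neighbors' class subscripts with normalized weights.
    alpha_bar = sum(s * u for s, u in scored) / total
    # Round to the nearest original label (ties rounded up: assumption).
    return math.floor(alpha_bar + 0.5)


def example_vec_sim(u: Vector, v: Vector, g: int = 3) -> float:
    """Illustrative vector similarity: equal-weight mean of 1 - |a - b| / g."""
    return sum(1 - abs(a - b) / g for a, b in zip(u, v)) / len(u)


train_vectors = [(0, 0), (3, 3), (3, 2)]   # hypothetical training vectors
train_classes = [0, 2, 2]                  # hypothetical class subscripts
predicted = mlca_predict((3, 3), train_vectors, train_classes, example_vec_sim, 2)
```

With N equal to the size of the training set, this sketch gives the same prediction as the unmodified algorithm.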
In this example, the Mushroom Database¹³ is used. Each item has seven attributes, with the corresponding linguistic scales listed in Table II.
In Table III, $u^*_1$ and $u^*_2$ are the values to be determined.
Taking $\mathrm{sim} = \mathrm{sim}_3$ and $\mathrm{SIM} = \mathrm{SIM}_A$, we have
SIM (K1 , I1)= 0.5471505;
SIM (K1 , I2)= 0.6109411;
SIM (K1 , I3)= 0.5832141;
SIM (K1 , I4)= 0.5549266;
SIM (K1, I5)= 0.8150827;
SIM (K1, I6)= 0.5558733;