Linguistic vector similarity measures and applications to linguistic information classification


Phong Pham Hong,1,∗ Son Le Hoang2

1 Faculty of Information Technology, National University of Civil Engineering, Hanoi, Vietnam

2 VNU University of Science, Vietnam National University, Hanoi, Vietnam

In this paper, we generalize the similarity degree for linguistic labels to the so-called linguistic similarity measure. The linguistic vector, which can be used to represent objects whose attributes are given in terms of linguistic labels, is defined. Some mathematical properties are stated and proved. The linguistic vector similarity measure is developed and applied to linguistic information classification. Experimental results on real data confirm the effectiveness of the proposed method.

© 2016 Wiley Periodicals, Inc.

In many situations, information is best represented by linguistic labels.¹ Transforming these labels into numbers is, in many cases, costly or even impossible.²

Example 1. We consider the Car Evaluation Database,³ where each item has seven attributes, each of which takes linguistic values in a corresponding label set (Table I).

The label set, $S$, can be described mathematically in several ways ($g$ is an even positive integer):

• Finite and totally ordered discrete set:⁴

$$S = \{s_0, s_1, \ldots, s_g\}; \tag{1}$$

• Subscript-symmetric linguistic evaluation scale:⁵

$$S = \left\{ s_\alpha \,\middle|\, \alpha = -\tfrac{g}{2}, \ldots, -1, 0, 1, \ldots, \tfrac{g}{2} \right\}; \tag{2}$$

∗Author to whom all correspondence should be addressed; e-mail: phphong84@yahoo.com.

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 00, 1–15 (2016) © 2016 Wiley Periodicals, Inc.

View this article online at wileyonlinelibrary.com • DOI 10.1002/int.21830


Table I Car evaluation database

Acceptable (22.222%), Good (3.993%), Very good (3.762%)

• Multiplicative linguistic evaluation scale:⁶

$$S = \left\{ s_\alpha \,\middle|\, \alpha = \tfrac{1}{g/2 + 1}, \ldots, \tfrac{1}{2}, 1, 2, \ldots, \tfrac{g}{2} + 1 \right\}. \tag{3}$$
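The three label-set descriptions can be made concrete by listing their subscripts. The following is a minimal sketch, not part of the original paper; the function names are hypothetical, and exact fractions are used for the multiplicative scale only to avoid rounding.

```python
from fractions import Fraction


def _check_g(g):
    # The text requires g to be an even positive integer.
    if g <= 0 or g % 2 != 0:
        raise ValueError("g must be an even positive integer")


def ordered_scale(g):
    """Subscripts 0, 1, ..., g of the totally ordered discrete set."""
    _check_g(g)
    return list(range(g + 1))


def symmetric_scale(g):
    """Subscripts -g/2, ..., -1, 0, 1, ..., g/2 of the symmetric scale."""
    _check_g(g)
    half = g // 2
    return list(range(-half, half + 1))


def multiplicative_scale(g):
    """Subscripts 1/(g/2+1), ..., 1/2, 1, 2, ..., g/2+1."""
    _check_g(g)
    top = g // 2 + 1
    below_one = [Fraction(1, k) for k in range(top, 1, -1)]
    one_and_above = [Fraction(k) for k in range(1, top + 1)]
    return below_one + one_and_above


if __name__ == "__main__":
    for build in (ordered_scale, symmetric_scale, multiplicative_scale):
        print(build.__name__, [str(a) for a in build(6)])
```

Under these definitions each scale built from the same $g$ contains $g + 1$ labels.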

Linguistic labels can be handled using linguistic aggregation operators, which are classified into four groups:⁷,⁸ the linguistic computational model based on membership functions, the linguistic computational model based on type-2 fuzzy sets, the linguistic symbolic computational models based on ordinal scales, and the linguistic symbolic computational models based on 2-tuple representation.

Xu⁹ introduced a computational model to improve the accuracy of linguistic aggregation by extending the subscript-symmetric linguistic evaluation scale (Equation 2) to a continuous one, $\bar{S} = \{s_\alpha \mid \alpha \in [-t, t]\}$, where $t$ ($t > g$) is a sufficiently large positive integer. If $s_\alpha \in S$, then $s_\alpha$ is called an original linguistic label; otherwise, it is called an extended (or virtual) linguistic label.

Example 2.¹⁰ The subscript-symmetric linguistic evaluation scale (original linguistic terms) $S = \{s_{-3}, s_{-2}, s_{-1}, s_0, s_1, s_2, s_3\}$ is extended to the continuous scale $\bar{S} = \{s_\alpha \mid \alpha \in [-3, 3]\}$. The label $s_{-0.3} \in \bar{S}$, for example, is a virtual linguistic label.

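The linguistic weighted averaging (LWA) operator, which aggregates weighted continuous labels, was developed:¹¹

$$\mathrm{LWA}: \bar{S}^n \longrightarrow \bar{S}, \qquad \mathrm{LWA}(s_{\alpha_1}, \ldots, s_{\alpha_n}) = s_{\bar{\alpha}}, \tag{4}$$

where $\bar{\alpha} = \sum_{j=1}^{n} w_j \alpha_j$ and $w = (w_1, \ldots, w_n)$ is the weight vector, with $w_j \ge 0$ for all $j = 1, \ldots, n$ and $\sum_{j=1}^{n} w_j = 1$.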
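As an illustration of Equation 4, the LWA operator reduces to a weighted average of label subscripts. The sketch below is not from the paper: the function name `lwa` and the representation of a label $s_\alpha$ by its subscript $\alpha$ are assumptions made for the example.

```python
def lwa(alphas, weights, tol=1e-9):
    """Return the subscript of LWA(s_a1, ..., s_an) for the given weights.

    A label s_alpha is represented only by its subscript alpha (an assumption
    of this sketch); the result may be a virtual label of the continuous scale.
    """
    if not alphas or len(alphas) != len(weights):
        raise ValueError("alphas and weights must be non-empty and equally long")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return sum(w * a for w, a in zip(weights, alphas))


if __name__ == "__main__":
    # Aggregating s_{-1}, s_0 and s_2 with weights 0.5, 0.3, 0.2 gives a
    # subscript of about -0.1, i.e. a virtual label between s_{-1} and s_0.
    print(lwa([-1, 0, 2], [0.5, 0.3, 0.2]))
```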

Xu¹² developed deviation measures and similarity degrees between two linguistic labels, $s_\alpha$ and $s_\beta$, in the subscript-symmetric linguistic evaluation scale $S$ given in Equation 2:


• the deviation degree: $d(s_\alpha, s_\beta) = |\alpha - \beta|$;

• the similarity degree: $\rho(s_\alpha, s_\beta) = 1 - d(s_\alpha, s_\beta)$.

This paper contains two main contributions:

(1) Some new concepts are defined: linguistic similarity measures, linguistic vectors, and linguistic vector similarity measures;

(2) The aforementioned concepts are used to construct an algorithm to classify linguistic information (linguistic classification algorithm, LCA).

The rest of the paper is organized as follows:

• In Section 2, we define the linguistic similarity measure and introduce some instances of this concept;

• In the next section, the linguistic vector is defined. Then, linguistic vector similarity measures are developed as combinations of the linguistic similarity measures;

• Section 4 proposes the linguistic classification algorithm and the modified linguistic classification algorithm (LCA and MLCA). The algorithms are applied to a real data set, and some statistical criteria are also assessed;

• The last section concludes the paper.

Consider a linguistic scale $S = \{s_0, s_1, \ldots, s_g\}$ such that⁴ $s_i \ge s_j$ iff $i \ge j$, for all $s_i, s_j \in S$.

A function $\mathrm{sim}: S \times S \to \mathbb{R}$, where $\mathbb{R}$ is the set of all real numbers, is called a linguistic similarity measure if it satisfies the following conditions:

(A1) $\mathrm{sim}(s_i, s_j) = \mathrm{sim}(s_j, s_i)$, for all $s_i, s_j \in S$;

(A2) $0 \le \mathrm{sim}(s_i, s_j) \le 1$, for all $s_i, s_j \in S$;

(A3) $\mathrm{sim}(s_i, s_j) = 1 \Leftrightarrow s_i = s_j$, for all $s_i, s_j \in S$;

(A4) If $s_i \le s_j \le s_k$, then $\mathrm{sim}(s_i, s_j) \ge \mathrm{sim}(s_i, s_k)$ and $\mathrm{sim}(s_j, s_k) \ge \mathrm{sim}(s_i, s_k)$, for all $s_i, s_j, s_k \in S$.

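(1) $\mathrm{sim}_1(s_i, s_j) = \dfrac{g - |i - j|}{g}$,

(2) $\mathrm{sim}_2(s_i, s_j) = \dfrac{\min\{i, j\}}{\max\{i, j\}}$,

(3) $\mathrm{sim}_3(s_i, s_j) = 1 - \dfrac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)}$, and

(4) $\mathrm{sim}_4(s_i, s_j) = 1 - \dfrac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)}$.

Notice that, to avoid a zero denominator, we set $\frac{0}{0} = 1$ in the definition of $\mathrm{sim}_2$.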
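The four measures can be evaluated directly from label indices, and conditions (A1)–(A4) can be checked numerically on a small scale. The code below is an illustrative sketch rather than the authors' implementation; the helper `satisfies_axioms`, the tolerance, and the choice $g = 6$ in the demo are assumptions.

```python
import math


def sim1(i, j, g):
    return (g - abs(i - j)) / g


def sim2(i, j, g=None):
    hi = max(i, j)
    # Convention from the text: 0/0 is taken to be 1 (only when i = j = 0).
    return 1.0 if hi == 0 else min(i, j) / hi


def _exp_form(x):
    # Shared shape of sim3 and sim4: 1 - (1 - exp(-x)) / (1 - exp(-1)).
    return 1.0 - (1.0 - math.exp(-x)) / (1.0 - math.exp(-1.0))


def sim3(i, j, g):
    return _exp_form(abs(i - j) / g)


def sim4(i, j, g):
    return _exp_form(abs(math.sqrt(i) - math.sqrt(j)) / math.sqrt(g))


def satisfies_axioms(sim, g, tol=1e-12):
    """Brute-force check of (A1)-(A4) over all label indices 0..g."""
    idx = range(g + 1)
    for i in idx:
        for j in idx:
            value = sim(i, j, g)
            if abs(value - sim(j, i, g)) > tol:            # (A1) symmetry
                return False
            if not -tol <= value <= 1.0 + tol:             # (A2) range
                return False
            if (abs(value - 1.0) <= tol) != (i == j):      # (A3) identity
                return False
    for i in idx:
        for j in range(i, g + 1):
            for k in range(j, g + 1):                      # (A4) monotonicity
                if sim(i, j, g) < sim(i, k, g) - tol:
                    return False
                if sim(j, k, g) < sim(i, k, g) - tol:
                    return False
    return True


if __name__ == "__main__":
    for measure in (sim1, sim2, sim3, sim4):
        print(measure.__name__, satisfies_axioms(measure, 6))
```

A numerical check of this kind covers only the chosen value of $g$ and does not replace the proof.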


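Proof. It is easily seen that $\mathrm{sim}_1$, $\mathrm{sim}_2$, $\mathrm{sim}_3$, and $\mathrm{sim}_4$ satisfy condition (A1).

(1) Consider $\mathrm{sim}_1$.

(A2) Since $0 \le |i - j| \le g$,

$$\begin{aligned}
& 0 \le g - |i - j| \le g \\
\Longrightarrow\ & 0 \le \frac{g - |i - j|}{g} \le 1 \\
\Longrightarrow\ & 0 \le \mathrm{sim}_1(s_i, s_j) \le 1.
\end{aligned}$$

(A3)

$$\mathrm{sim}_1(s_i, s_j) = 1 \iff |i - j| = 0 \iff i = j \iff s_i = s_j.$$

(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,

$$\begin{aligned}
& |i - j| = j - i \le k - i = |i - k| \\
\Longrightarrow\ & g - |i - j| \ge g - |i - k| \\
\Longrightarrow\ & \frac{g - |i - j|}{g} \ge \frac{g - |i - k|}{g}.
\end{aligned}$$

We thus obtain $\mathrm{sim}_1(s_i, s_j) \ge \mathrm{sim}_1(s_i, s_k)$, and similarly $\mathrm{sim}_1(s_j, s_k) \ge \mathrm{sim}_1(s_i, s_k)$.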

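(2) Consider $\mathrm{sim}_2$.

(A2)

$$\begin{aligned}
& 0 \le \min\{i, j\} \le \max\{i, j\} \\
\Longrightarrow\ & 0 \le \frac{\min\{i, j\}}{\max\{i, j\}} \le 1 \\
\Longrightarrow\ & 0 \le \mathrm{sim}_2(s_i, s_j) \le 1.
\end{aligned}$$

(A3)

$$\mathrm{sim}_2(s_i, s_j) = 1 \iff \min\{i, j\} = \max\{i, j\} \iff i = j \iff s_i = s_j.$$

(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,

$$\frac{\min\{i, j\}}{\max\{i, j\}} = \frac{i}{j} \ge \frac{i}{k} = \frac{\min\{i, k\}}{\max\{i, k\}}.$$

This shows that $\mathrm{sim}_2(s_i, s_j) \ge \mathrm{sim}_2(s_i, s_k)$. Similarly, $\mathrm{sim}_2(s_j, s_k) \ge \mathrm{sim}_2(s_i, s_k)$.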


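(3) Consider $\mathrm{sim}_3$.

(A2)

$$\begin{aligned}
& 0 \le |i - j| \le g \\
\Longrightarrow\ & 0 \le \frac{|i - j|}{g} \le 1 \\
\Longrightarrow\ & -1 \le -\frac{|i - j|}{g} \le 0 \\
\Longrightarrow\ & \exp(-1) \le \exp\left(-\frac{|i - j|}{g}\right) \le 1 \\
\Longrightarrow\ & 0 \le 1 - \exp\left(-\frac{|i - j|}{g}\right) \le 1 - \exp(-1) \\
\Longrightarrow\ & 0 \le \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le 1 \\
\Longrightarrow\ & 0 \le 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le 1 \\
\Longrightarrow\ & 0 \le \mathrm{sim}_3(s_i, s_j) \le 1.
\end{aligned}$$

(A3)

$$\begin{aligned}
\mathrm{sim}_3(s_i, s_j) = 1
& \iff 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} = 1 \\
& \iff \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} = 0 \\
& \iff 1 - \exp\left(-\frac{|i - j|}{g}\right) = 0 \\
& \iff \exp\left(-\frac{|i - j|}{g}\right) = 1 \\
& \iff -\frac{|i - j|}{g} = 0 \\
& \iff i = j \\
& \iff s_i = s_j.
\end{aligned}$$

(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,

$$\begin{aligned}
& \frac{|i - j|}{g} = \frac{j - i}{g} \le \frac{k - i}{g} = \frac{|i - k|}{g} \\
\Longrightarrow\ & -\frac{|i - j|}{g} \ge -\frac{|i - k|}{g} \\
\Longrightarrow\ & \exp\left(-\frac{|i - j|}{g}\right) \ge \exp\left(-\frac{|i - k|}{g}\right) \\
\Longrightarrow\ & 1 - \exp\left(-\frac{|i - j|}{g}\right) \le 1 - \exp\left(-\frac{|i - k|}{g}\right) \\
\Longrightarrow\ & \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \le \frac{1 - \exp\left(-\frac{|i - k|}{g}\right)}{1 - \exp(-1)} \\
\Longrightarrow\ & 1 - \frac{1 - \exp\left(-\frac{|i - j|}{g}\right)}{1 - \exp(-1)} \ge 1 - \frac{1 - \exp\left(-\frac{|i - k|}{g}\right)}{1 - \exp(-1)}.
\end{aligned}$$

This proves $\mathrm{sim}_3(s_i, s_j) \ge \mathrm{sim}_3(s_i, s_k)$. In exactly the same way, $\mathrm{sim}_3(s_j, s_k) \ge \mathrm{sim}_3(s_i, s_k)$.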


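(4) We finally consider $\mathrm{sim}_4$.

(A2)

$$\begin{aligned}
& 0 \le |\sqrt{i} - \sqrt{j}| \le \sqrt{g} \\
\Longrightarrow\ & 0 \le \frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \le 1 \\
\Longrightarrow\ & -1 \le -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \le 0 \\
\Longrightarrow\ & \exp(-1) \le \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 \\
\Longrightarrow\ & 0 \le 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 - \exp(-1) \\
\Longrightarrow\ & 0 \le \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le 1 \\
\Longrightarrow\ & 0 \le 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le 1 \\
\Longrightarrow\ & 0 \le \mathrm{sim}_4(s_i, s_j) \le 1.
\end{aligned}$$

(A3)

$$\begin{aligned}
\mathrm{sim}_4(s_i, s_j) = 1
& \iff 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} = 1 \\
& \iff \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} = 0 \\
& \iff 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) = 0 \\
& \iff \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) = 1 \\
& \iff -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} = 0 \\
& \iff i = j \iff s_i = s_j.
\end{aligned}$$

(A4) If $s_i \le s_j \le s_k$, i.e., $i \le j \le k$,

$$\begin{aligned}
& \frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} = \frac{\sqrt{j} - \sqrt{i}}{\sqrt{g}} \le \frac{\sqrt{k} - \sqrt{i}}{\sqrt{g}} = \frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}} \\
\Longrightarrow\ & -\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}} \ge -\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}} \\
\Longrightarrow\ & \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \ge \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right) \\
\Longrightarrow\ & 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right) \le 1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right) \\
\Longrightarrow\ & \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \le \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \\
\Longrightarrow\ & 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{j}|}{\sqrt{g}}\right)}{1 - \exp(-1)} \ge 1 - \frac{1 - \exp\left(-\frac{|\sqrt{i} - \sqrt{k}|}{\sqrt{g}}\right)}{1 - \exp(-1)}.
\end{aligned}$$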

This proves $\mathrm{sim}_4(s_i, s_j) \ge \mathrm{sim}_4(s_i, s_k)$. In exactly the same way, $\mathrm{sim}_4(s_j, s_k) \ge \mathrm{sim}_4(s_i, s_k)$.

A linguistic vector of length $n$ is defined as

$$V = (v_1, \ldots, v_n),$$

where $v_t$, which represents the linguistic value of the $t$-th attribute, is a linguistic label in the $t$-th linguistic scale $S_t = \{s^t_1, \ldots, s^t_{g_t}\}$ ($t = 1, \ldots, n$).

From now on, $S$ denotes the set of all linguistic vectors of length $n$.

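A function $\mathrm{SIM}: S \times S \to \mathbb{R}$ is called a linguistic vector similarity measure if it satisfies the following conditions:

(B1) $\mathrm{SIM}(U, V) = \mathrm{SIM}(V, U)$, for all $U, V \in S$;

(B2) $0 \le \mathrm{SIM}(U, V) \le 1$, for all $U, V \in S$;

(B3) $\mathrm{SIM}(U, V) = 1 \Leftrightarrow U = V$, for all $U, V \in S$;

(B4) If $U \le V \le T$, then $\mathrm{SIM}(U, V) \ge \mathrm{SIM}(U, T)$ and $\mathrm{SIM}(V, T) \ge \mathrm{SIM}(U, T)$, for all $U, V, T \in S$ (for $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_n)$, $U \le V$ means $u_t \le v_t$ for all $t = 1, \ldots, n$).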

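Let $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_n)$ be linguistic vectors in $S$, let $\mathrm{sim}$ be a linguistic similarity measure, and let $w = (w_1, \ldots, w_n)$ be a weighting vector satisfying $w_t \ge 0$ for all $t = 1, \ldots, n$ and $\sum_{t=1}^{n} w_t = 1$. We define: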

(1) The quadratic linguistic similarity measure between $U$ and $V$:

$$\mathrm{SIM}_Q(U, V) = \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2}.$$

(2) The arithmetic linguistic similarity measure between $U$ and $V$:

$$\mathrm{SIM}_A(U, V) = \sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, v_t).$$

(3) The geometric linguistic similarity measure between $U$ and $V$:

$$\mathrm{SIM}_G(U, V) = \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t}.$$


(4) The harmonic linguistic similarity measure between $U$ and $V$:

$$\mathrm{SIM}_H(U, V) = \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)} \right)^{-1}.$$

To avoid violating mathematical rules, we set $0^0 = 1$ in the definition of $\mathrm{SIM}_G$ and $\frac{w_t}{0} = 0$ in the definition of $\mathrm{SIM}_H$.
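The four vector measures differ only in how they aggregate the per-attribute similarities. The following sketch is illustrative and not the authors' code: labels are represented by their indices, `sim1` is used as the underlying label measure, and `sim_h` is restricted to strictly positive similarities so that the zero-division convention is not exercised.

```python
import math


def sim1(i, j, g):
    return (g - abs(i - j)) / g


def attribute_sims(u, v, scales, sim):
    """Per-attribute similarities sim(u_t, v_t); scales[t] is g_t."""
    return [sim(a, b, g) for a, b, g in zip(u, v, scales)]


def sim_q(sims, w):
    return math.sqrt(sum(wt * s * s for wt, s in zip(w, sims)))


def sim_a(sims, w):
    return sum(wt * s for wt, s in zip(w, sims))


def sim_g(sims, w):
    # Python evaluates 0.0 ** 0 as 1.0, which matches the 0^0 = 1 convention.
    return math.prod(s ** wt for wt, s in zip(w, sims))


def sim_h(sims, w):
    # Assumption of this sketch: every similarity is strictly positive.
    if any(s <= 0 for s in sims):
        raise ValueError("this sketch requires strictly positive similarities")
    return 1.0 / sum(wt / s for wt, s in zip(w, sims))


u, v = [3, 1, 0], [3, 3, 3]       # label indices of two hypothetical items
scales = [4, 4, 4]                # g_t for each attribute
w = [0.5, 0.25, 0.25]             # weights summing to 1
sims = attribute_sims(u, v, scales, sim1)
values = [f(sims, w) for f in (sim_q, sim_a, sim_g, sim_h)]

if __name__ == "__main__":
    for name, value in zip(("SIM_Q", "SIM_A", "SIM_G", "SIM_H"), values):
        print(name, round(value, 4))
```

For this example the values decrease from the quadratic to the harmonic measure.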

$$\mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_H(U, V).$$

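Proof.

• Using the Cauchy–Schwarz inequality, $\left(\sum_{t=1}^{n} x_t^2\right)\left(\sum_{t=1}^{n} y_t^2\right) \ge \left(\sum_{t=1}^{n} x_t y_t\right)^2$ for all $(x_1, \ldots, x_n), (y_1, \ldots, y_n) \in \mathbb{R}^n$, and noting that $\sum_{t=1}^{n} (w_t^{1/2})^2 = \sum_{t=1}^{n} w_t = 1$, we have

$$\begin{aligned}
\sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2
&= \left( \sum_{t=1}^{n} \left(w_t^{1/2}\right)^2 \right) \left( \sum_{t=1}^{n} \left(w_t^{1/2}\, \mathrm{sim}(u_t, v_t)\right)^2 \right) \\
&\ge \left( \sum_{t=1}^{n} w_t^{1/2}\, w_t^{1/2}\, \mathrm{sim}(u_t, v_t) \right)^2 \\
&= \left( \sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, v_t) \right)^2.
\end{aligned}$$

That means $(\mathrm{SIM}_Q(U, V))^2 \ge (\mathrm{SIM}_A(U, V))^2$, or $\mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_A(U, V)$.

• Using the inequality of weighted arithmetic and geometric means (weighted AM-GM inequality), $\left(\sum_{t=1}^{n} w_t x_t\right)/w \ge \left(\prod_{t=1}^{n} x_t^{w_t}\right)^{1/w}$ for all $x_t \ge 0$, $w_t \ge 0$ ($t = 1, \ldots, n$), $w = \sum_{t=1}^{n} w_t > 0$, we have

$$\begin{aligned}
\sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, v_t)
&= \left( \sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, v_t) \right) \Big/ \left( \sum_{t=1}^{n} w_t \right) \\
&\ge \left( \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \right)^{1 / \sum_{t=1}^{n} w_t} \\
&= \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t}.
\end{aligned}$$

So, $\mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_G(U, V)$.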


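• Using the weighted AM-GM inequality,

$$\begin{aligned}
\sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)}
&= \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^{-1} \right) \Big/ \left( \sum_{t=1}^{n} w_t \right) \\
&\ge \prod_{t=1}^{n} \left( \left(\mathrm{sim}(u_t, v_t)\right)^{-1} \right)^{w_t} \\
&= \prod_{t=1}^{n} \frac{1}{\left(\mathrm{sim}(u_t, v_t)\right)^{w_t}} \\
&= \left( \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \right)^{-1}.
\end{aligned}$$

This proves $(\mathrm{SIM}_H(U, V))^{-1} \ge (\mathrm{SIM}_G(U, V))^{-1}$, or $\mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_H(U, V)$.

$\mathrm{SIM}_Q$, $\mathrm{SIM}_A$, $\mathrm{SIM}_G$, and $\mathrm{SIM}_H$ are linguistic vector similarity measures.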

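Proof. Obviously, $\mathrm{SIM}_Q$, $\mathrm{SIM}_A$, $\mathrm{SIM}_G$, and $\mathrm{SIM}_H$ satisfy (B1).

(B2) It is sufficient to prove that $\mathrm{SIM}_Q(U, V) \le 1$ for all $U, V \in S$. By the fact that

$$\mathrm{sim}(u_t, v_t) \le 1, \quad \forall t = 1, \ldots, n,$$

we obtain

$$\mathrm{SIM}_Q(U, V) = \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2} \le \left( \sum_{t=1}^{n} w_t \right)^{1/2} = 1.$$

[...] $\mathrm{SIM}_Q(U, V)$, $\mathrm{SIM}_A(U, V)$, $\mathrm{SIM}_G(U, V)$ equals 1. So, it is sufficient to prove (B3) for $\mathrm{SIM}_H$. We have

$$\begin{aligned}
\mathrm{SIM}_H(U, V) = 1
& \iff \frac{w_t}{\mathrm{sim}(u_t, v_t)} = w_t, \quad \forall t = 1, \ldots, n \\
& \iff \mathrm{sim}(u_t, v_t) = 1, \quad \forall t = 1, \ldots, n \\
& \iff u_t = v_t, \quad \forall t = 1, \ldots, n \\
& \iff U = V.
\end{aligned}$$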

(B4) If $U = (u_1, \ldots, u_n) \le V = (v_1, \ldots, v_n) \le T = (\tau_1, \ldots, \tau_n)$, we have $u_t \le v_t \le \tau_t$, for all $t = 1, \ldots, n$. This


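implies

$$\mathrm{sim}(u_t, v_t) \ge \mathrm{sim}(u_t, \tau_t), \quad \forall t = 1, \ldots, n. \tag{5}$$

Using (5), we will show that $\mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_Q(U, T)$, $\mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_A(U, T)$, $\mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_G(U, T)$, and $\mathrm{SIM}_H(U, V) \ge \mathrm{SIM}_H(U, T)$ (the remaining inequalities are proved in the same way). We have

$$\begin{aligned}
& w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \ge w_t \left(\mathrm{sim}(u_t, \tau_t)\right)^2, \quad \forall t = 1, \ldots, n \\
\Longrightarrow\ & \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, v_t)\right)^2 \right)^{1/2} \ge \left( \sum_{t=1}^{n} w_t \left(\mathrm{sim}(u_t, \tau_t)\right)^2 \right)^{1/2} \\
\Longrightarrow\ & \mathrm{SIM}_Q(U, V) \ge \mathrm{SIM}_Q(U, T);
\end{aligned}$$

$$\begin{aligned}
& w_t \, \mathrm{sim}(u_t, v_t) \ge w_t \, \mathrm{sim}(u_t, \tau_t), \quad \forall t = 1, \ldots, n \\
\Longrightarrow\ & \sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, v_t) \ge \sum_{t=1}^{n} w_t \, \mathrm{sim}(u_t, \tau_t) \\
\Longrightarrow\ & \mathrm{SIM}_A(U, V) \ge \mathrm{SIM}_A(U, T);
\end{aligned}$$

$$\begin{aligned}
& \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \ge \left(\mathrm{sim}(u_t, \tau_t)\right)^{w_t}, \quad \forall t = 1, \ldots, n \\
\Longrightarrow\ & \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, v_t)\right)^{w_t} \ge \prod_{t=1}^{n} \left(\mathrm{sim}(u_t, \tau_t)\right)^{w_t} \\
\Longrightarrow\ & \mathrm{SIM}_G(U, V) \ge \mathrm{SIM}_G(U, T);
\end{aligned}$$

$$\begin{aligned}
& \frac{1}{\mathrm{sim}(u_t, v_t)} \le \frac{1}{\mathrm{sim}(u_t, \tau_t)}, \quad \forall t = 1, \ldots, n \\
\Longrightarrow\ & \frac{w_t}{\mathrm{sim}(u_t, v_t)} \le \frac{w_t}{\mathrm{sim}(u_t, \tau_t)}, \quad \forall t = 1, \ldots, n \\
\Longrightarrow\ & \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, v_t)} \right)^{-1} \ge \left( \sum_{t=1}^{n} \frac{w_t}{\mathrm{sim}(u_t, \tau_t)} \right)^{-1} \\
\Longrightarrow\ & \mathrm{SIM}_H(U, V) \ge \mathrm{SIM}_H(U, T).
\end{aligned}$$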

CLASSIFICATION

In this section, we use the linguistic similarity measure and the linguistic vector similarity measure to classify items with attributes given as linguistic labels.

Let $D$ be a data set whose items are described by $(n + 1)$ attributes. Each attribute $A_t$ takes values in the corresponding linguistic scale $S_t = \{s^{(t)}_0, \ldots, s^{(t)}_{g_t}\}$, $t = 1, \ldots, (n + 1)$. $D$ is classified on the $(n + 1)$-th attribute, $A_{n+1}$.


Linguistic Classification Algorithm (LCA). Let $D_1 \subset D$ be the training set, with cardinality $n_1$. Consider an item $K \in D \setminus D_1$ with associated attribute vector $V = (v_1, \ldots, v_n)$ containing the values of the first $n$ attributes, where $v_t$ is the linguistic value of the $t$-th attribute, for all $t = 1, \ldots, n$.

(1) For each $I_j \in D_1$, let $V_j$ denote the attribute vector containing the values of the first $n$ attributes of $I_j$ ($j = 1, \ldots, n_1$). The similarity between the items $K$ and $I_j$ is determined as the similarity between the two linguistic vectors $V$ and $V_j$:

$$\mathrm{SIM}(K, I_j) = \mathrm{SIM}(V, V_j).$$

(2) Aggregate the values of the $(n + 1)$-th attribute of all items $I_j$ ($j = 1, \ldots, n_1$) using the LWA operator (Equation 4):

$$u = s_{\bar{\alpha}} = \mathrm{LWA}(u_1, \ldots, u_{n_1}),$$

where $u_j$ is the value of the $(n + 1)$-th attribute of the item $I_j$ ($j = 1, \ldots, n_1$), and the weighting vector is $w = (w_1, \ldots, w_{n_1})$ with

$$w_j = \frac{\mathrm{SIM}(K, I_j)}{\sum_{k=1}^{n_1} \mathrm{SIM}(K, I_k)} \quad (j = 1, \ldots, n_1).$$

(3) Evaluate the value $u^*$ of the $(n + 1)$-th attribute of the item $K$: the predicted value of the $(n + 1)$-th attribute of $K$ is $u^* = s^{(n+1)}_l$, where $l = \mathrm{round}(\bar{\alpha})$.
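The three steps of LCA can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: labels are represented by their indices, $\mathrm{sim}_3$ and $\mathrm{SIM}_A$ are chosen as the measures, the function names and toy data are hypothetical, and ties in the rounding step are broken upward because the text does not specify a rule.

```python
import math


def sim3(i, j, g):
    x = abs(i - j) / g
    return 1.0 - (1.0 - math.exp(-x)) / (1.0 - math.exp(-1.0))


def sim_a(u, v, scales, w):
    """Arithmetic vector similarity built on sim3; scales[t] is g_t."""
    return sum(wt * sim3(a, b, g) for wt, a, b, g in zip(w, u, v, scales))


def lca_predict(query, train_vectors, train_labels, scales, w):
    """Return (predicted class index, alpha_bar) for one query item."""
    # Step 1: similarity between the query and every training item.
    sims = [sim_a(query, vec, scales, w) for vec in train_vectors]
    total = sum(sims)
    if total == 0:
        raise ValueError("all similarities are zero; weights are undefined")
    # Step 2: LWA of the class indices with weights sims[j] / total.
    alpha_bar = sum(s * label for s, label in zip(sims, train_labels)) / total
    # Step 3: round to the nearest label index (ties upward: an assumption).
    return math.floor(alpha_bar + 0.5), alpha_bar


# Toy data (hypothetical): two attributes on scales with g_t = 3,
# class labels indexed 0..3.
train_vectors = [[0, 0], [3, 3]]
train_labels = [0, 3]
scales = [3, 3]
w = [0.5, 0.5]

if __name__ == "__main__":
    print(lca_predict([2, 3], train_vectors, train_labels, scales, w))
```

Because every training item contributes to the weighted average, items that are very dissimilar to the query still pull the prediction toward their class.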

Remark 1. Step 2 of algorithm LCA can be refined into the following two substeps. The modification is termed the modified linguistic classification algorithm (MLCA).

(2a) Specify the $N$ nearest neighbors of $K$ ($1 \le N \le n_1$), that is, choose the items $I^*_j \in D_1$ ($j = 1, \ldots, N$) that are the $N$ most similar to $K$ according to the similarity computed in the previous step. $N$ is an adjustable integer determined by experiments.

(2b) Aggregate the values of the $(n + 1)$-th attribute of the items $I^*_j$ ($j = 1, \ldots, N$) using the LWA operator (Equation 4):

$$u = s_{\bar{\alpha}} = \mathrm{LWA}(u^*_1, \ldots, u^*_N),$$

where $u^*_j$ is the value of the $(n + 1)$-th attribute of the item $I^*_j$ ($j = 1, \ldots, N$), and the weighting vector is $w = (w_1, \ldots, w_N)$ with

$$w_j = \frac{\mathrm{SIM}(K, I^*_j)}{\sum_{k=1}^{N} \mathrm{SIM}(K, I^*_k)} \quad (j = 1, \ldots, N).$$
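Substeps (2a) and (2b) amount to restricting the weighted average to the $N$ most similar training items. The sketch below illustrates this under assumptions that are not in the paper: labels are label indices, the similarity is passed in as a callable, the demo similarity combines $\mathrm{sim}_1$ with equal weights, ties in the rounding step are broken upward, and all names and data are hypothetical.

```python
import math


def mlca_predict(query, train, similarity, n_neighbors):
    """Predict a class index from the n_neighbors most similar items.

    train is a list of (attribute_vector, class_index) pairs.
    """
    if not 1 <= n_neighbors <= len(train):
        raise ValueError("n_neighbors must satisfy 1 <= N <= n1")
    # Step 1: similarity of the query to each training item.
    scored = [(similarity(query, vec), label) for vec, label in train]
    # Step 2a: keep the N most similar items.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:n_neighbors]
    total = sum(s for s, _ in nearest)
    if total == 0:
        raise ValueError("all similarities are zero; weights are undefined")
    # Step 2b: LWA over the neighbors' class indices.
    alpha_bar = sum(s * label for s, label in nearest) / total
    # Step 3: round to the nearest index (ties upward: an assumption).
    return math.floor(alpha_bar + 0.5)


def demo_similarity(u, v, g=3):
    """Equal-weight arithmetic similarity using sim1 on scales with g_t = g."""
    return sum((g - abs(a - b)) / g for a, b in zip(u, v)) / len(u)


train = [([0, 0], 0), ([1, 1], 1), ([3, 3], 3)]

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, mlca_predict([2, 2], train, demo_similarity, n))
```

Setting $N = n_1$ in this sketch recovers the unmodified LCA, since all training items are then kept.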

In this example, the Mushroom Database¹³ is used. Each item has seven attributes, with the corresponding linguistic scales listed in Table II.

In Table III, $u^*_1$ and $u^*_2$ are the values to be determined.

Taking $\mathrm{sim} = \mathrm{sim}_3$ and $\mathrm{SIM} = \mathrm{SIM}_A$, we have

SIM (K1 , I1)= 0.5471505;

SIM (K1 , I2)= 0.6109411;

SIM (K1 , I3)= 0.5832141;

SIM (K1 , I4)= 0.5549266;

SIM (K1, I5)= 0.8150827;

SIM (K1, I6)= 0.5558733;
