Discovering Global Patterns in Linguistic Networks through Spectral Analysis: A Case Study of the Consonant Inventories

Animesh Mukherjee

Indian Institute of Technology, Kharagpur

animeshm@cse.iitkgp.ernet.in

Monojit Choudhury and Ravi Kannan

Microsoft Research India

{monojitc,kannan}@microsoft.com

Abstract

Recent research has shown that language and the socio-cognitive phenomena associated with it can be aptly modeled and visualized through networks of linguistic entities. However, most of the existing works on linguistic networks focus only on the local properties of the networks. This study is an attempt to analyze the structure of languages via a purely structural technique, namely spectral analysis, which is ideally suited for discovering the global correlations in a network. Application of this technique to PhoNet, the co-occurrence network of consonants, not only reveals several natural linguistic principles governing the structure of the consonant inventories, but is also able to quantify their relative importance. We believe that this powerful technique can be successfully applied, in general, to study the structure of natural languages.

1 Introduction

Language and the associated socio-cognitive phenomena can be modeled as networks, where the nodes correspond to linguistic entities and the edges denote the pairwise interaction or relationship between these entities. The study of linguistic networks has been quite popular in recent times and has provided us with several interesting insights into the nature of language (see Choudhury and Mukherjee (to appear) for an extensive survey). Examples include the study of WordNet (Sigman and Cecchi, 2002), the syntactic dependency network of words (Ferrer-i-Cancho, 2005) and the network of co-occurrence of consonants in sound inventories (Mukherjee et al., 2008; Mukherjee et al., 2007).

This research has been conducted during the author's internship at Microsoft Research India.

Most of the existing studies on linguistic networks, however, focus only on local structural properties such as the degree and clustering coefficient of the nodes, and shortest paths between pairs of nodes. On the other hand, although it is a well-known fact that the spectrum of a network can provide important information about its global structure, the use of this powerful mathematical machinery to infer global patterns in linguistic networks is rarely found in the literature. Note that spectral analysis, however, has been successfully employed in the domains of biological and social networks (Farkas et al., 2001; Gkantsidis et al., 2003; Banerjee and Jost, 2007). In the context of linguistic networks, Belkin and Goldsmith (2002) is the only work we are aware of that analyzes the eigenvectors to obtain a two-dimensional visualization of the network. Nevertheless, that work does not study the spectrum of the graph.

The aim of the present work is to demonstrate the use of spectral analysis for discovering the global patterns in linguistic networks. These patterns, in turn, are then interpreted in the light of existing linguistic theories to gather deeper insights into the nature of the underlying linguistic phenomena. We apply this rather generic technique to find the principles that are responsible for shaping the consonant inventories, which has been a well-researched problem in phonology since 1931 (Trubetzkoy, 1931; Lindblom and Maddieson, 1988; Boersma, 1998; Clements, 2008). The analysis is carried out on a network defined in Mukherjee et al. (2007), where the consonants are the nodes and there is an edge between two nodes $u$ and $v$ if the consonants corresponding to them co-occur in a language. The number of times they co-occur across languages defines the weight of the edge. We explain the results obtained from the spectral analysis of the network post facto using three linguistic principles. The method also automatically reveals the quantitative importance of each of these principles.


It is worth mentioning here that earlier researchers have also noted the importance of the aforementioned principles. However, what was not known was how much importance one should associate with each of these principles. We also note that the technique of spectral analysis neither explicitly nor implicitly assumes that these principles exist or are important, but deduces them automatically. Thus, we believe that spectral analysis is a promising approach that is well suited to the discovery of linguistic principles underlying a set of observations represented as a network of entities. The fact that the principles "discovered" in this study are already well-established results adds to the credibility of the method. Spectral analysis of large linguistic networks in the future can possibly reveal hitherto unknown universal principles.

The rest of the paper is organized as follows. Sec. 2 introduces the technique of spectral analysis of networks and illustrates some of its applications. The problem of consonant inventories and how it can be modeled and studied within the framework of linguistic networks are described in Sec. 3. Sec. 4 presents the spectral analysis of the consonant co-occurrence network, the observations and interpretations. Sec. 5 concludes by summarizing the work and the contributions and listing future research directions.

2 A Primer to Spectral Analysis

Spectral analysis¹ is a powerful tool capable of revealing the global structural patterns underlying an enormous and complicated environment of interacting entities. Essentially, it refers to the systematic study of the eigenvalues and the eigenvectors of the adjacency matrix of the network of these interacting entities. Here we shall briefly review the basic concepts involved in spectral analysis and describe some of its applications (see Chung (1994) and Kannan and Vempala (2008) for details).

¹ The term spectral analysis is also used in the context of signal processing, where it refers to the study of the frequency spectrum of a signal.

A network or a graph consisting of $n$ nodes (labeled 1 through $n$) can be represented by an $n \times n$ square matrix $A$, where the entry $a_{ij}$ represents the weight of the edge from node $i$ to node $j$. $A$, which is known as the adjacency matrix, is symmetric for an undirected graph and has binary entries for an unweighted graph. $\lambda$ is an eigenvalue of $A$ if there is a nonzero $n$-dimensional vector $x$ such that

$$Ax = \lambda x$$

Any real symmetric matrix $A$ has $n$ (possibly non-distinct) eigenvalues $\lambda_0 \le \lambda_1 \le \dots \le \lambda_{n-1}$, and $n$ corresponding eigenvectors that are mutually orthogonal. The spectrum of a graph is the set of the distinct eigenvalues of the graph and their corresponding multiplicities. It is usually represented as a plot with the eigenvalues on the x-axis and their multiplicities on the y-axis.
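As a minimal illustration of these definitions, the sketch below computes the spectrum of a small hypothetical graph (a 4-cycle, not taken from the paper) with NumPy. The graph, the rounding tolerance and the variable names are assumptions made for this example only.

```python
import numpy as np
from collections import Counter

# Hypothetical undirected, unweighted graph: the 4-cycle 0-1-2-3-0.
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

# eigh is intended for symmetric matrices; it returns the eigenvalues in
# ascending order and the eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Every eigenpair satisfies the defining relation A x = lambda x.
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)

# Spectrum: distinct eigenvalues with their multiplicities. Rounding merges
# values that differ only by floating-point noise.
spectrum = Counter(float(v) for v in np.round(eigenvalues, 8))
for value in sorted(spectrum):
    print(f"eigenvalue {value:+.2f}  multiplicity {spectrum[value]}")
```

Plotting the keys of `spectrum` against their counts gives the kind of spectral plot described above.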

The spectra of real and random graphs display several interesting properties. Banerjee and Jost (2007) report the spectra of several biological networks that are significantly different from the spectra of artificially generated graphs.²

Spectral analysis is also closely related to Principal Component Analysis and Multidimensional Scaling. If the first few (say $d$) eigenvalues of a matrix are much higher than the rest of the eigenvalues, then it can be concluded that the rows of the matrix can be approximately represented as linear combinations of $d$ orthogonal vectors. This further implies that the corresponding graph has a few motifs (subgraphs) that are repeated a large number of times to obtain the global structure of the graph (Banerjee and Jost, to appear).

Spectral properties are representative of an $n$-dimensional average behavior of the underlying system, thereby providing considerable insight into its global organization. For example, the principal eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue) is the direction in which the sum of the squares of the projections of the row vectors of the matrix is maximum. In fact, the principal eigenvector of a graph is used to compute the centrality of the nodes, which is also known as PageRank in the context of the WWW. Similarly, the second eigenvector component is used for graph clustering.
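The use of the principal eigenvector as a centrality score can be sketched with plain power iteration. The example below is an illustration on a hypothetical 4-node graph, not the paper's procedure; note also that PageRank as used in practice works on a normalized and damped link matrix, whereas this sketch computes simple eigenvector centrality on the raw adjacency matrix.

```python
import numpy as np

def principal_eigenvector(A, iterations=500):
    """Power iteration: repeatedly apply A and renormalize to unit length."""
    x = np.ones(A.shape[0])
    for _ in range(iterations):
        x = A @ x
        x /= np.linalg.norm(x)
    return x

# Hypothetical graph: a triangle 0-1-2 with an extra node 3 attached to node 0.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

centrality = principal_eigenvector(A)
for node in np.argsort(-centrality):
    print(f"node {node}: centrality {centrality[node]:.3f}")
```

Node 0, which touches every other node, receives the largest component, and the pendant node 3 the smallest. Power iteration converges here because the graph is connected and not bipartite; for a bipartite graph the largest and smallest eigenvalues have equal magnitude and a direct solver such as `numpy.linalg.eigh` is the safer choice.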

In the next two sections we describe how spectral analysis can be applied to discover the organizing principles underneath the structure of consonant inventories.

² Banerjee and Jost (2007) report the spectrum of the graph's Laplacian matrix rather than the adjacency matrix. It is increasingly popular these days to analyze the spectral properties of the graph's Laplacian matrix. However, for reasons explained later, here we conduct spectral analysis of the adjacency matrix rather than its Laplacian.


Figure 1: Illustration of the nodes and edges of PlaNet and PhoNet along with their respective adjacency matrix representations

3 Consonant Co-occurrence Network

The most basic units of human languages are the speech sounds. The repertoire of sounds that makes up the sound inventory of a language is not chosen arbitrarily, even though speakers are capable of producing and perceiving a plethora of them. Instead, these inventories show exceptionally regular patterns across the languages of the world, which is, in fact, a common point of consensus in phonology. Right from the beginning of the 20th century, there have been a large number of linguistically motivated attempts (Trubetzkoy, 1969; Lindblom and Maddieson, 1988; Boersma, 1998; Clements, 2008) to explain the formation of these patterns across the consonant inventories. More recently, Mukherjee and his colleagues (Choudhury et al., 2006; Mukherjee et al., 2007; Mukherjee et al., 2008) studied this problem in the framework of complex networks. Since here we shall conduct a spectral analysis of the network defined in Mukherjee et al. (2007), we briefly survey the models and the important results of their work.

Choudhury et al. (2006) introduced a bipartite network model for the consonant inventories. Formally, a set of consonant inventories is represented as a graph $G = \langle V_L, V_C, E_{lc} \rangle$, where the nodes in one partition correspond to the languages ($V_L$) and those in the other partition correspond to the consonants ($V_C$). There is an edge $(v_l, v_c)$ between a language node $v_l \in V_L$ (representing the language $l$) and a consonant node $v_c \in V_C$ (representing the consonant $c$) iff the consonant $c$ is present in the inventory of the language $l$. This network is called the Phoneme-Language Network or PlaNet, and it represents the connections between the language and the consonant nodes through a 0-1 matrix $A$, as shown by a hypothetical example in Fig. 1. Further, Mukherjee et al. (2007) define the Phoneme-Phoneme Network or PhoNet as the one-mode projection of PlaNet onto the consonant nodes, i.e., a network $G = \langle V_C, E_{cc'} \rangle$, where the nodes are the consonants and two nodes $v_c$ and $v_{c'}$ are linked by an edge with weight equal to the number of languages in which both $c$ and $c'$ occur together. In other words, PhoNet can be expressed as a matrix $B$ (see Fig. 1) such that $B = AA^T - D$, where $D$ is a diagonal matrix with its entries corresponding to the frequency of occurrence of the consonants. Similarly, the one-mode projection of PlaNet onto the language nodes (which we shall refer to as the Language-Language Graph or LangGraph) can be expressed as $B' = A^TA - D'$, where $D'$ is a diagonal matrix with its entries corresponding to the size of the consonant inventories for each language.
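The two one-mode projections can be sketched on a toy example. The matrix below is hypothetical (four consonants, three languages) and only illustrates the formulas $B = AA^T - D$ and $B' = A^TA - D'$; the orientation with consonants as rows is an assumption chosen to match the stated order of $A$.

```python
import numpy as np

# Hypothetical PlaNet as a 0-1 matrix: rows are consonants, columns are
# languages; A[c, l] = 1 iff consonant c is in the inventory of language l.
consonants = ["p", "t", "k", "m"]
languages = ["l1", "l2", "l3"]
A = np.array([
    [1, 1, 0],  # p
    [1, 1, 1],  # t
    [0, 1, 1],  # k
    [1, 0, 0],  # m
])

# D holds consonant frequencies (row sums), D_prime holds inventory sizes
# (column sums). For a 0-1 matrix these equal the diagonals of A A^T and
# A^T A, so subtracting them zeroes the diagonals of the projections.
D = np.diag(A.sum(axis=1))
D_prime = np.diag(A.sum(axis=0))

B = A @ A.T - D              # PhoNet: consonant-consonant co-occurrence counts
B_prime = A.T @ A - D_prime  # LangGraph: shared consonants per language pair

print(B)
print(B_prime)
```

For instance, `B[0, 1]` is 2 because "p" and "t" co-occur in two of the three hypothetical languages.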

The matrix $A$ and hence, $B$ and $B'$ have been constructed from the UCLA Phonological Segment Inventory Database (UPSID) (Maddieson, 1984), which hosts the consonant inventories of 317 languages with a total of 541 consonants found across them. Note that UPSID uses articulatory features to describe the consonants and assumes these features to be binary-valued, which in turn implies that every consonant can be represented by a binary vector. Later on, we shall use this representation for our experiments.

By construction, we have $|V_L| = 317$, $|V_C| = 541$, $|E_{lc}| = 7022$, and $|E_{cc'}| = 30412$. Consequently, the order of the matrix $A$ is $541 \times 317$ and that of the matrix $B$ is $541 \times 541$. It has been found that the degree distributions of both PlaNet and PhoNet roughly indicate a power-law behavior with exponential cut-offs towards the tail (Choudhury et al., 2006; Mukherjee et al., 2007). Furthermore, PhoNet is also characterized by a very high clustering coefficient. The topological properties of the two networks and the generative model explaining the emergence of these properties are summarized in Mukherjee et al. (2008). However, all the above properties are useful in characterizing the local patterns of the network and provide very little insight into its global structure.

4 Spectral Analysis of PhoNet

In this section we describe the procedure and results of the spectral analysis of PhoNet. We begin with the computation of the spectrum of PhoNet. After the analysis of the spectrum, we systematically investigate the top few eigenvectors of PhoNet and attempt to characterize their linguistic significance. In the process, we also analyze the corresponding eigenvectors of LangGraph, which helps us in characterizing the properties of languages.

4.1 Spectrum of PhoNet

Using a simple Matlab script we compute the spectrum (i.e., the list of eigenvalues along with their multiplicities) of the matrix $B$ corresponding to PhoNet. Fig. 2(a) shows the spectral plot, which has been obtained through binning³ with a fixed bin size of 20. In order to have a better visualization of the spectrum, in Figs. 2(b) and (c) we further plot the top 50 (absolute) eigenvalues from the two ends of the spectrum versus the index representing their sorted order on a doubly-logarithmic scale. Some of the important observations that one can make from these results are as follows.

First, the major bulk of the eigenvalues is concentrated around 0. This indicates that though the order of $B$ is $541 \times 541$, its numerical rank is quite low. Second, there are at least a few very large eigenvalues that dominate the entire spectrum. In fact, 89% of the spectrum, or the square of the Frobenius norm, is occupied by the principal (i.e., the topmost) eigenvalue, 92% is occupied by the first and the second eigenvalues taken together, while 93% is occupied by the first three taken together. The individual contribution of the other eigenvalues to the spectrum is significantly lower than that of the top three. Third, the eigenvalues on either end of the spectrum tend to decay gradually, mostly indicating a power-law behavior. The power-law exponents at the positive and the negative ends are -1.33 (the $R^2$ value of the fit is 0.98) and -0.88 ($R^2 \sim 0.92$) respectively.

³ Binning is the process of dividing the entire range of a variable into smaller intervals and counting the number of observations within each bin or interval. In fixed binning, all the intervals are of the same size.
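The share of the squared Frobenius norm carried by each eigenvalue can be computed directly, because for a real symmetric matrix the squared Frobenius norm equals the sum of the squared eigenvalues. The sketch below uses Python with NumPy rather than the Matlab script mentioned in the text, and a hypothetical 3-node matrix in place of PhoNet; the function name `spectral_shares` is introduced here for illustration.

```python
import numpy as np

def spectral_shares(B):
    """Fraction of the squared Frobenius norm carried by each eigenvalue,
    ordered from the largest squared eigenvalue to the smallest."""
    eigenvalues = np.linalg.eigvalsh(B)  # B is assumed to be symmetric
    squared = np.sort(eigenvalues ** 2)[::-1]
    return squared / squared.sum()

# Hypothetical stand-in for PhoNet: a triangle with unit edge weights.
B = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
], dtype=float)

shares = spectral_shares(B)
cumulative = np.cumsum(shares)
for rank, (share, total) in enumerate(zip(shares, cumulative), start=1):
    print(f"top {rank}: share {share:.3f}, cumulative {total:.3f}")
```

The function orders eigenvalues by squared magnitude, so a large negative eigenvalue would also rank highly. For a matrix with non-negative entries the eigenvalue of largest magnitude is the positive principal one, so the first share corresponds to the 89% figure quoted in the text.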

The numerically low rank of PhoNet suggests that there are certain prototypical structures that frequently repeat themselves across the consonant inventories, thereby increasing the number of 0 eigenvalues to a large extent. In other words, all the rows of the matrix $B$ (i.e., the inventories) can be expressed as linear combinations of a few independent row vectors, also known as factors.

Furthermore, the fact that the principal eigenvalue constitutes 89% of the Frobenius norm of the spectrum implies that there exists one very strong organizing principle which should be able to explain the basic structure of the inventories to a very good extent. Since the second and third eigenvalues are also significantly larger than the rest of the eigenvalues, one should expect two other organizing principles, which along with the basic principle, should be able to explain (almost) completely the structure of the inventories. In order to "discover" these principles, we now focus our attention on the first three eigenvectors of PhoNet.

4.2 The First Eigenvector of PhoNet

Fig. 2(d) shows the first eigenvector component for each consonant node versus its frequency of occurrence across the language inventories (i.e., its degree in PlaNet). The figure clearly indicates that the two are highly correlated ($r = 0.99$), which in turn means that 89% of the spectrum and hence, the organization of the consonant inventories, can be explained to a large extent by the occurrence frequency of the consonants. The question arises: does this tell us something special about the structure of PhoNet, or is it always the case for any symmetric matrix that the principal eigenvector will be highly correlated with the frequency? We assert that the former is true, and indeed, the high correlation between the principal eigenvector and the frequency indicates high "proportionate co-occurrence", a term which we explain below.

Figure 2: Eigenvalues and eigenvectors of $B$. (a) Binned distribution of the eigenvalues (bin size = 20) versus their multiplicities. (b) The top 50 (absolute) eigenvalues from the positive end of the spectrum and their ranks. (c) Same as (b) for the negative end of the spectrum. (d), (e) and (f) respectively represent the first, second and third eigenvector components versus the occurrence frequency of the consonants.

To see this, consider the following $2n \times 2n$ matrix $X$:

$$X = \begin{pmatrix} 0 & M_1 & 0 & 0 & \cdots \\ M_1 & 0 & 0 & 0 & \cdots \\ 0 & 0 & 0 & M_2 & \cdots \\ 0 & 0 & M_2 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

where $X_{i,i+1} = X_{i+1,i} = M_{(i+1)/2}$ for all odd $i$ and 0 elsewhere. Also, $M_1 > M_2 > \dots > M_n \ge 1$. Essentially, this matrix represents a graph which is a collection of $n$ disconnected edges having weights $M_1$, $M_2$, and so on. It is easy to see that the principal eigenvector of this matrix is $(1/\sqrt{2}, 1/\sqrt{2}, 0, 0, \dots, 0)^T$, which of course is very different from the frequency vector $(M_1, M_1, M_2, M_2, \dots, M_n, M_n)^T$.
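This counterexample is easy to check numerically. The sketch below builds the matrix for three hypothetical weights (the values 5, 3 and 2 are chosen for illustration) and compares its principal eigenvector with the frequency vector, taken here to be the weighted degree of each node.

```python
import numpy as np

def disconnected_edges(weights):
    """Adjacency matrix of n disjoint edges; edge k joins nodes 2k and 2k+1
    (0-based) and carries weight weights[k]."""
    n = len(weights)
    X = np.zeros((2 * n, 2 * n))
    for k, weight in enumerate(weights):
        X[2 * k, 2 * k + 1] = X[2 * k + 1, 2 * k] = weight
    return X

weights = [5.0, 3.0, 2.0]  # hypothetical values with M1 > M2 > M3 >= 1
X = disconnected_edges(weights)

eigenvalues, eigenvectors = np.linalg.eigh(X)
# The last column belongs to the largest eigenvalue; fix its overall sign.
principal = eigenvectors[:, -1] * np.sign(eigenvectors[0, -1])
frequency = X.sum(axis=1)  # weighted degree of each node

print(np.round(principal, 3))
print(frequency)
```

The principal eigenvector puts all of its weight on the two endpoints of the heaviest edge and is zero on every other node, whereas the frequency vector is non-zero everywhere.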

At the other extreme, consider an $n \times n$ matrix $X$ with $X_{i,j} = C f_i f_j$ for some vector $f = (f_1, f_2, \dots, f_n)^T$ that represents the frequency of the nodes and a normalization constant $C$. This is what we refer to as "proportionate co-occurrence", because the extent of co-occurrence between the nodes $i$ and $j$ (which is $X_{i,j}$, or the weight of the edge between $i$ and $j$) is exactly proportionate to the frequencies of the two nodes. The principal eigenvector in this case is $f$ itself, and thus correlates perfectly with the frequencies. Unlike this hypothetical matrix $X$, PhoNet has all 0 entries on the diagonal. Nevertheless, this perturbation, which is equivalent to subtracting $f_i^2$ from the $i$th diagonal entry, seems to be sufficiently small to preserve the "proportionate co-occurrence" behavior of the adjacency matrix, thereby resulting in a high correlation between the principal eigenvector component and the frequencies.
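The effect of zeroing the diagonal can be sketched on a small hypothetical frequency vector. The values of `f` and the choice $C = 1/\sum_i f_i$ below are assumptions made for illustration; on a matrix this small the perturbation is relatively larger than it would be for a 541-node network, so the correlation is high but not perfect.

```python
import numpy as np

def principal_eigenvector(M):
    """Eigenvector of the largest eigenvalue of a symmetric matrix,
    with its sign fixed so that the components sum to a positive number."""
    _, vectors = np.linalg.eigh(M)
    v = vectors[:, -1]
    return v * np.sign(v.sum())

f = np.array([9.0, 7.0, 5.0, 3.0, 2.0, 1.0])  # hypothetical frequencies
C = 1.0 / f.sum()                             # assumed normalization constant
X = C * np.outer(f, f)                        # X[i, j] = C * f_i * f_j

# Exact proportionate co-occurrence: the principal eigenvector is parallel to f.
v_full = principal_eigenvector(X)

# PhoNet-like version: same off-diagonal entries, zero diagonal.
X_zero_diag = X.copy()
np.fill_diagonal(X_zero_diag, 0.0)
v_zero = principal_eigenvector(X_zero_diag)

r = np.corrcoef(v_zero, f)[0, 1]
print(f"correlation with f after zeroing the diagonal: {r:.3f}")
```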

On the other hand, to construct the Laplacian matrix, we would have subtracted $f_i \sum_{j=1}^{n} f_j$ from the $i$th diagonal entry, which is a much larger quantity than $f_i^2$. In fact, this operation would have completely destroyed the correlation between the frequency and the principal eigenvector component, because the eigenvector corresponding to the smallest⁴ eigenvalue of the Laplacian matrix is $[1, 1, \dots, 1]^T$.

⁴ The role played by the top eigenvalues and eigenvectors in the spectral analysis of the adjacency matrix is comparable to that of the smallest eigenvalues and the corresponding eigenvectors of the Laplacian matrix (Chung, 1994).

Since the first eigenvector of $B$ is almost perfectly correlated ($r = 0.99$) with the frequency of occurrence of the consonants across languages, it is reasonable to argue that there is a universally observed innate preference towards certain consonants. This preference is often described through the linguistic concept of markedness, which in the context of phonology tells us that the substantive conditions that underlie the human capacity of speech production and perception render certain consonants more favorable to be included in the inventory than some other consonants (Clements, 2008). We observe that markedness plays a very important role in shaping the global structure of the consonant inventories. In fact, if we arrange the consonants in non-increasing order of the first eigenvector components (which is equivalent to increasing order of statistical markedness), and compare the set of consonants present in an inventory of size $s$ with the first $s$ entries of this hierarchy, we find that the two are, on average, more than 50% similar. This figure is surprisingly high because, even though $\forall s,\ s \ll \frac{541}{2}$, on average $\frac{s}{2}$ consonants in an inventory are drawn from the first $s$ entries of the markedness hierarchy (a small set), whereas the remaining $\frac{s}{2}$ are drawn from the other $(541 - s)$ entries (a much larger set).
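The comparison between an inventory and the top of the markedness hierarchy can be sketched as follows. The text does not spell out the similarity measure, so the overlap fraction used here is an assumption, and the hierarchy and inventories are made-up examples.

```python
def top_s_overlap(inventory, hierarchy):
    """Fraction of an inventory of size s that lies within the first s
    entries of the hierarchy (assumed similarity measure)."""
    s = len(inventory)
    top = set(hierarchy[:s])
    return len(top & set(inventory)) / s

def mean_overlap(inventories, hierarchy):
    """Average of top_s_overlap over a collection of inventories."""
    return sum(top_s_overlap(inv, hierarchy) for inv in inventories) / len(inventories)

# Hypothetical hierarchy: consonants sorted by non-increasing first
# eigenvector component, i.e. least marked first.
hierarchy = ["m", "k", "p", "t", "n", "s", "l"]
inventories = [
    {"m", "k", "p"},       # exactly the top 3 entries
    {"m", "t", "s", "l"},  # 2 of its 4 consonants are in the top 4
]

average = mean_overlap(inventories, hierarchy)
print(f"average overlap: {average:.2f}")
```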

The high degree of proportionate co-occurrence in PhoNet implied by this high correlation between the principal eigenvector and frequency further indicates that the innate preference towards certain phonemes is independent of the presence of other phonemes in the inventory of a language.

4.3 The Second Eigenvector of PhoNet

Fig. 2(e) shows the second eigenvector component for each node versus its occurrence frequency. It is evident from the figure that the consonants have been clustered into three groups. Those that have a very low or a very high frequency cluster around 0, whereas the medium frequency zone has clearly split into two parts. In order to investigate the basis for this split, we carry out the following experiment.

Experiment I

(i) Remove all consonants whose frequency of occurrence across the inventories is very low (< 5).

(ii) Denote the absolute maximum value of the positive components of the second eigenvector as $MAX^+$ and the absolute maximum value of the negative components as $MAX^-$. If the absolute value of a positive component is less than 15% of $MAX^+$, then assign a neutral class to the corresponding consonant; else assign it a positive class. Denote the set of consonants in the positive class by $C^+$. Similarly, if the absolute value of a negative component is less than 15% of $MAX^-$, then assign a neutral class to the corresponding consonant; else assign it a negative class. Denote the set of consonants in the negative class by $C^-$.

(iii) Using the above training set of the classified consonants (represented as boolean feature vectors), learn a decision tree (C4.5 algorithm (Quinlan, 1993)) to determine the features that are responsible for the split of the medium frequency zone into the negative and the positive classes.
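Steps (i) and (ii) of this experiment can be sketched as a small labeling function. The function name, the dictionary-based inputs and the toy values are assumptions for illustration; the sketch also assumes that $MAX^+$ and $MAX^-$ are computed after the low-frequency consonants have been removed, which the text does not state explicitly. The decision-tree step (iii) is not shown.

```python
def assign_classes(components, frequencies, min_frequency=5, threshold=0.15):
    """Label each retained consonant as positive, negative or neutral.

    components:  consonant -> its second eigenvector component
    frequencies: consonant -> number of inventories containing it
    """
    # Step (i): drop consonants with very low frequency (< min_frequency).
    kept = {c: v for c, v in components.items() if frequencies[c] >= min_frequency}

    # Step (ii): largest magnitude on the positive and on the negative side.
    max_pos = max((v for v in kept.values() if v > 0), default=0.0)
    max_neg = max((-v for v in kept.values() if v < 0), default=0.0)

    classes = {}
    for c, v in kept.items():
        if v > 0 and v >= threshold * max_pos:
            classes[c] = "positive"
        elif v < 0 and -v >= threshold * max_neg:
            classes[c] = "negative"
        else:
            classes[c] = "neutral"
    return classes

# Hypothetical eigenvector components and frequencies for five consonants.
components = {"a": 0.50, "b": 0.05, "c": -0.40, "d": -0.02, "e": 0.90}
frequencies = {"a": 100, "b": 50, "c": 80, "d": 60, "e": 2}

classes = assign_classes(components, frequencies)
print(classes)
```

In this toy run consonant "e" is discarded by step (i), so it does not influence $MAX^+$ even though it has the largest component.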

Fig. 3(a) shows the decision rules learnt from the above training set. It is clear from these rules that the split into $C^-$ and $C^+$ has taken place mainly based on whether the consonants have the combined "dental alveolar" feature (negative class) or the "dental" and the "alveolar" features separately (positive class). Such a combined feature is often termed ambiguous, and its presence in a particular consonant $c$ of a language $l$ indicates that the speakers of $l$ are unable to make a distinction as to whether $c$ is articulated with the tongue against the upper teeth or the alveolar ridge. In contrast, if the features are present separately, then the speakers are capable of making this distinction. In fact, through the following experiment, we find that the consonant inventories of almost all the languages in UPSID get classified based on whether they preserve this distinction or not.

Experiment II

(i) Construct $B' = A^TA - D'$ (i.e., the adjacency matrix of LangGraph).

(ii) Compute the second eigenvector of $B'$. Once again, the positive and the negative components split the languages into two distinct groups $L^+$ and $L^-$ respectively.

(iii) For each language $l \in L^+$, count the number of consonants in $C^+$ that occur in $l$. Sum up the counts for all the languages in $L^+$ and normalize this sum by $|L^+||C^+|$. Similarly, perform the same step for the pairs $(L^+, C^-)$, $(L^-, C^+)$ and $(L^-, C^-)$.
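Step (iii) amounts to computing, for each pair of groups, the fraction of possible (language, consonant) incidences that actually occur. A minimal sketch, with hypothetical inventories and group memberships standing in for the UPSID data:

```python
def normalized_cooccurrence(language_group, consonant_group, inventories):
    """Sum over the languages of the number of group consonants they contain,
    normalized by |language_group| * |consonant_group|."""
    total = sum(len(inventories[l] & consonant_group) for l in language_group)
    return total / (len(language_group) * len(consonant_group))

# Hypothetical inventories; "t_d" and "n_d" stand for consonants carrying
# the combined dental-alveolar feature.
inventories = {
    "lang1": {"t", "d", "n"},
    "lang2": {"t", "n"},
    "lang3": {"t_d", "n_d"},
}
L_pos, L_neg = {"lang1", "lang2"}, {"lang3"}
C_pos, C_neg = {"t", "d", "n"}, {"t_d", "n_d"}

scores = {
    ("L+", "C+"): normalized_cooccurrence(L_pos, C_pos, inventories),
    ("L+", "C-"): normalized_cooccurrence(L_pos, C_neg, inventories),
    ("L-", "C+"): normalized_cooccurrence(L_neg, C_pos, inventories),
    ("L-", "C-"): normalized_cooccurrence(L_neg, C_neg, inventories),
}
for pair, value in scores.items():
    print(pair, round(value, 2))
```

A value close to 1 means that nearly every language in the group contains nearly every consonant in the group, and a value close to 0 means the consonants are almost absent from those languages.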

From the above experiment, the values obtained for the pairs (i) $(L^+, C^+)$, $(L^+, C^-)$ are 0.35, 0.08 respectively, and (ii) $(L^-, C^+)$, $(L^-, C^-)$ are 0.07, 0.32 respectively. This immediately implies that almost all the languages in $L^+$ preserve the dental/alveolar distinction while those in $L^-$ do not.


Figure 3: Decision rules obtained from the study of (a) the second, and (b) the third eigenvectors The classification errors for both (a) and (b) are less than 15%

4.4 The Third Eigenvector of PhoNet

We next investigate the relationship between the third eigenvector components of $B$ and the occurrence frequency of the consonants (Fig. 2(f)). The consonants are once again found to get clustered into three groups, though not as clearly as in the previous case. Therefore, in order to determine the basis of the split, we repeat experiments I and II. Fig. 3(b) clearly indicates that in this case the consonants in $C^+$ lack the complex features that are considered difficult for articulation. On the other hand, the consonants in $C^-$ are mostly composed of such complex features. The values obtained for the pairs (i) $(L^+, C^+)$, $(L^+, C^-)$ are 0.34, 0.06 respectively, and (ii) $(L^-, C^+)$, $(L^-, C^-)$ are 0.19, 0.18 respectively. This implies that while there is a prevalence of the consonants from $C^+$ in the languages of $L^+$, the consonants from $C^-$ are almost absent. However, there is an equal prevalence of the consonants from $C^+$ and $C^-$ in the languages of $L^-$. Therefore, it can be argued that the presence of the consonants from $C^-$ in a language can (phonologically) imply the presence of the consonants from $C^+$, but not vice versa. We do not find any such pattern for the fourth and the higher eigenvector components.

4.5 Control Experiment

As a control experiment we generated a set of random inventories and carried out experiments I and II on the adjacency matrix, $B_R$, of the random version of PhoNet. We construct these inventories as follows. Let the frequency of occurrence for each consonant $c$ in UPSID be denoted by $f_c$. Let there be 317 bins, each corresponding to a language in UPSID. $f_c$ bins are then chosen uniformly at random and the consonant $c$ is packed into these bins. Thus the consonant inventories of the 317 languages corresponding to the bins are generated. Note that this method of inventory construction leads to proportionate co-occurrence. Consequently, the first eigenvector components of $B_R$ are highly correlated with the occurrence frequency of the consonants. However, the plots of the second and the third eigenvector components versus the occurrence frequency of the consonants indicate absolutely no pattern, thereby resulting in a large number of decision rules and very high classification errors (up to 50%).
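The random inventory construction can be sketched in a few lines. The sketch assumes that the $f_c$ bins for a consonant are distinct (sampling without replacement), so that each consonant keeps its original frequency; the function name, seed and toy frequencies are illustrative.

```python
import random

def random_inventories(frequencies, n_languages, seed=0):
    """Place each consonant c into f_c distinct bins chosen uniformly at
    random; each bin is the inventory of one random language."""
    rng = random.Random(seed)
    inventories = [set() for _ in range(n_languages)]
    for consonant, f_c in frequencies.items():
        # sample() draws without replacement, so f_c must not exceed n_languages
        for bin_index in rng.sample(range(n_languages), f_c):
            inventories[bin_index].add(consonant)
    return inventories

# Hypothetical frequencies over 4 languages (the paper uses 317 bins).
frequencies = {"p": 3, "t": 4, "k": 1}
inventories = random_inventories(frequencies, n_languages=4, seed=1)
print(inventories)
```

Because the bins for different consonants are chosen independently, two consonants $c$ and $c'$ are expected to share $f_c f_{c'} / N$ of the $N$ bins, which is the proportionate co-occurrence form $C f_c f_{c'}$ with $C = 1/N$.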


5 Discussion and Conclusion

Are there any linguistic inferences that can be drawn from the results obtained through the study of the spectral plot and the eigenvectors of PhoNet? In fact, one can correlate several phonological theories to the aforementioned observations, which have been construed by past researchers through very specific studies.

One of the most important problems in defining a feature-based classificatory system is to decide when a sound in one language is different from a similar sound in another language. According to Ladefoged (2005), "two sounds in different languages should be considered as distinct if we can point to a third language in which the same two sounds distinguish words". The dental versus alveolar distinction that we find to be highly instrumental in splitting the world's languages into two different groups (i.e., $L^+$ and $L^-$ obtained from the analysis of the second eigenvectors of $B$ and $B'$) also has a strong classificatory basis. It may well be the case that certain categories of sounds like the dental and the alveolar sibilants are not sufficiently distinct to constitute a reliable linguistic contrast (see Ladefoged (2005) for reference). Nevertheless, by allowing the possibility for the dental versus alveolar distinction, one does not increase the complexity or introduce any redundancy in the classificatory system. This is because such a distinction is prevalent in many other sounds, some of which are (a) nasals in Tamil (Shanmugam, 1972) and Malayalam (Shanmugam, 1972; Ladefoged and Maddieson, 1996), (b) laterals in Albanian (Ladefoged and Maddieson, 1996), and (c) stops in certain dialectal variations of Swahili (Hayward et al., 1989). Therefore, it is sensible to conclude that the two distinct groups $L^+$ and $L^-$ induced by our algorithm are true representatives of two important linguistic typologies.

The results obtained from the analysis of the third eigenvectors of $B$ and $B'$ indicate that implicational universals also play a crucial role in determining linguistic typologies. The two typologies that are predominant in this case consist of (a) languages using only those sounds that have simple features (e.g., plosives), and (b) languages using sounds with complex features (e.g., laterals, ejectives, and fricatives) that automatically imply the presence of the sounds having simple features. The distinction between the simple and complex phonological features is a very common hypothesis underlying the implicational hierarchy and the corresponding typological classification (Clements, 2008). In this context, Locke and Pearson (1992) remark that "Infants heavily favor stop consonants over fricatives, and there are languages that have stops and no fricatives but no languages that exemplify the reverse pattern. [Such] 'phonologically universal' patterns, which cut across languages and speakers are, in fact, the phonetic properties of Homo sapiens." (as quoted in Vallee et al. (2002)).

Therefore, it turns out that the methodology presented here essentially facilitates the induction of linguistic typologies. Indeed, spectral analysis derives, in a unified way, the importance of these principles and at the same time quantifies their applicability in explaining the structural patterns observed across the inventories. In this context, there are at least two other novelties of this work. The first novelty is in the systematic study of the spectral plots (i.e., the distribution of the eigenvalues), which is in general rare for linguistic networks, although there have been quite a number of such studies in the domain of biological and social networks (Farkas et al., 2001; Gkantsidis et al., 2003; Banerjee and Jost, 2007). The second novelty is in the fact that there is not much work in the complex network literature that investigates the nature of the eigenvectors and their interactions to infer the organizing principles of the system represented through the network.

To summarize, spectral analysis of the complex network of speech sounds is able to provide a holistic as well as quantitative explanation of the organizing principles of the sound inventories. This scheme for typology induction is not dependent on the specific data set used, as long as it is representative of the real world. Thus, we believe that the scheme introduced here can be applied as a generic technique for typological classifications of phonological, syntactic and semantic networks; each of these is equally interesting from the perspective of understanding the structure and evolution of human language, and all are topics of future research.

Acknowledgement

We would like to thank Kalika Bali for her valuable inputs towards the linguistic analysis.

References

A. Banerjee and J. Jost. 2007. Spectral plots and the representation and interpretation of biological data. Theory in Biosciences, 126(1):15–21.

A. Banerjee and J. Jost. To appear. Graph spectra as a systematic tool in computational biology. Discrete Applied Mathematics.

M. Belkin and J. Goldsmith. 2002. Using eigenvectors of the bigram graph to infer morpheme identity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 41–47. Association for Computational Linguistics.

P. Boersma. 1998. Functional Phonology. The Hague: Holland Academic Graphics.

M. Choudhury and A. Mukherjee. To appear. The structure and dynamics of linguistic networks. In N. Ganguly, A. Deutsch, and A. Mukherjee, editors, Dynamics on and of Complex Networks: Applications to Biology, Computer Science, Economics, and the Social Sciences. Birkhauser.

M. Choudhury, A. Mukherjee, A. Basu, and N. Ganguly. 2006. Analysis and synthesis of the distribution of consonants over languages: A complex network approach. In COLING-ACL’06, pages 128–135.

F. R. K. Chung. 1994. Spectral Graph Theory. Number 2 in CBMS Regional Conference Series in Mathematics. American Mathematical Society.

G. N. Clements. 2008. The role of features in speech sound inventories. In E. Raimy and C. Cairns, editors, Contemporary Views on Architecture and Representations in Phonological Theory. Cambridge, MA: MIT Press.

E. J. Farkas, I. Derenyi, A.-L. Barabási, and T. Vicsek. 2001. Real-world graphs: Beyond the semi-circle law. Phy. Rev. E, 64:026704.

R. Ferrer-i-Cancho. 2005. The structure of syntactic dependency networks: Insights from recent advances in network theory. In Levickij V. and Altmann G., editors, Problems of quantitative linguistics, pages 60–75.

C. Gkantsidis, M. Mihail, and E. Zegura. 2003. Spectral analysis of internet topologies. In INFOCOM’03, pages 364–374.

K. M. Hayward, Y. A. Omar, and M. Goesche. 1989. Dental and alveolar stops in Kimvita Swahili: An electropalatographic study. African Languages and Cultures, 2(1):51–72.

http://www.cc.gatech.edu/~vempala/spectral/spectral.pdf.

P. Ladefoged and I. Maddieson. 1996. Sounds of the World’s Languages. Oxford: Blackwell.

different purposes. In Working Papers in Phonetics, volume 104, pages 1–13. Dept. of Linguistics, UCLA.

B. Lindblom and I. Maddieson. 1988. Phonetic universals in consonant systems. In M. Hyman and C. N. Li, editors, Language, Speech, and Mind, pages 62–78.

J. L. Locke and D. M. Pearson. 1992. Vocal learning and the emergence of phonological capacity. A neurobiological approach. In Phonological development. Models, Research, Implications, pages 91–129. York Press.

I. Maddieson. 1984. Patterns of Sounds. Cambridge University Press.

A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. 2007. Modeling the co-occurrence principles of the consonant inventories: A complex network approach. Int. Jour. of Mod. Phys. C, 18(2):281–295.

A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. 2008. Modeling the structure and dynamics of the consonant inventories: A complex network approach. In COLING-08, pages 601–608.

J. R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

S. V. Shanmugam. 1972. Dental and alveolar nasals in Dravidian. In Bulletin of the School of Oriental and African Studies, volume 35, pages 74–84. University of London.

M. Sigman and G. A. Cecchi. 2002. Global organization of the wordnet lexicon. Proceedings of the National Academy of Science, 99(3):1742–1747.

N. Trubetzkoy. 1931. Die phonologischen systeme. TCLP, 4:96–116.

N. Trubetzkoy. 1969. Principles of Phonology. University of California Press, Berkeley.

N. Vallee, L. J. Boe, J. L. Schwartz, P. Badin, and C. Abry. 2002. The weight of phonetic substance in the structure of sound inventories. ZASPiL, 28:145–168.
