A Frequency Dictionary of Mandarin Chinese


A Frequency Dictionary

of Mandarin Chinese

A Frequency Dictionary of Mandarin Chinese is an invaluable tool for all learners of Mandarin Chinese, providing a list of the 5,000 words and the 2,000 Chinese characters most commonly used in the language. Based on a 50-million-word corpus composed of spoken, fiction, non-fiction and news texts in current use, the dictionary provides the user with a detailed frequency-based list, as well as alphabetical and part-of-speech indexes. All entries in the frequency list feature the English equivalent and a sample sentence with English translation. The dictionary also contains 30 thematically organised lists of frequently used words on a variety of topics such as food, weather, travel and time expressions.

A Frequency Dictionary of Mandarin Chinese enables students of all levels to maximise their study of Mandarin vocabulary in an efficient and engaging way. It also represents an excellent resource for teachers of the language.

Richard Xiao is Senior Lecturer and Programme Leader in Chinese Studies at Edge Hill University. Paul Rayson is Director of the University Centre for Computer Corpus Research on Language and a teaching fellow at Lancaster University. Tony McEnery is Professor of English Language and Linguistics at Lancaster University.


Routledge Frequency Dictionaries

General Editors:

Paul Rayson, Lancaster University, UK

Mark Davies, Brigham Young University, USA

Editorial Board:

Michael Barlow, University of Auckland, New Zealand

Geoffrey Leech, Lancaster University, UK

Barbara Lewandowska-Tomaszczyk, University of Lodz, Poland

Josef Schmied, Chemnitz University of Technology, Germany

Andrew Wilson, Lancaster University, UK

Adam Kilgarriff, Lexicography MasterClass Ltd and University of Sussex, UK

Hongying Tao, University of California at Los Angeles

Chris Tribble, King's College London, UK

Other books in the series:

A Frequency Dictionary of German

A Frequency Dictionary of Portuguese

A Frequency Dictionary of Spanish

A Frequency Dictionary of French (forthcoming)

A Frequency Dictionary of Arabic (forthcoming)


A Frequency Dictionary

of Mandarin Chinese

Core vocabulary for learners

Richard Xiao, Paul Rayson and Tony McEnery

Routledge

Taylor & Francis Group

LONDON AND NEW YORK


First published 2009

by Routledge

2 Park Square, Milton Park, Abingdon, OX14 4RN

Simultaneously published in the USA and Canada

by Routledge

711 Third Ave, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

Typeset in Parisine by Graphicraft Limited, Hong Kong

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data

ISBN13: 978-0-203-88307-5 (ebk)


Contents

Thematic vocabulary lists

Series preface

Part of speech index

Character frequency index


Thematic vocabulary lists

6 Weather and equipment

7 City facilities and stores

23 Kinship and family relations

24 Moods and emotions


Series preface

Frequency information has a central role to play in learning a language. Nation (1990) showed that the 4,000-5,000 most frequent words account for up to 95 per cent of a written text and the 1,000 most frequent words account for 85 per cent of speech.

Although Nation's results were only for English, they do provide clear evidence that, when employing frequency as a general guide for vocabulary learning, it is possible to acquire a lexicon which will serve a learner well most of the time. There are two caveats to bear in mind here. First, counting words is not as straightforward as it might seem. Gardner (2007) highlights the problems that multiple word meanings, the presence of multiword items, and grouping words into families or lemmas, have on counting and analysing words. Second, frequency data contained in frequency dictionaries should never act as the only information source to guide a learner. Frequency information is nonetheless a very good starting point, and one which may produce rapid benefits. It therefore seems rational to prioritise learning the words that you are likely to hear and read most often. That is the philosophy behind this series of dictionaries.

Lists of words and their frequencies have long been available for teachers and learners of language. For example, Thorndike (1921, 1932) and Thorndike and Lorge (1944) produced word frequency books with counts of word occurrences in texts used in the education of American children. Michael West's General Service List of English Words (1953) was primarily aimed at foreign learners of English. More recently, with the aid of efficient computer software and very large bodies of language data (called corpora), researchers have been able to provide more sophisticated frequency counts from both written text and transcribed speech. One important feature of the resulting frequencies presented in this series is that they are derived from recently collected language data. The earlier lists for English included samples from, for example, Austen's Pride and Prejudice and Defoe's Robinson Crusoe, thus they could no longer represent present-day language in any sense.

Frequency data derived from a large representative corpus of a language brings students closer to language as it is used in real life as opposed to textbook language (which often distorts the frequencies of features in a language, see Ljung, 1990). The information in these dictionaries is presented in a number of formats to allow users to access the data in different ways. So, for example, if you would prefer not to simply drill down through the word frequency list, but would rather focus on verbs, the part of speech index will allow you to focus on just the most frequent verbs. Given that verbs typically account for 20 per cent of all words in a language, this may be a good strategy. Also, a focus on function words may be equally rewarding: 60 per cent of speech in English is composed of a mere 50 function words.

The series also provides information of use to the language teacher. The idea that frequency information may have a role to play in syllabus design is not new (see, for example, Sinclair and Renouf, 1988). However, to date it has been difficult for those teaching languages other than English to use frequency information in syllabus design because of a lack of data.


Frequency information should not be studied to the exclusion of other contextual and situational knowledge about language use, and we may even doubt the validity of frequency information derived from large corpora. It is interesting to note that Alderson (2007) found that corpus frequencies may not match a native speaker's intuition about estimates of word frequency and that a set of estimates of word frequencies collected from language experts varied widely. Thus corpus-derived frequencies are still the best current estimate of a word's importance that a learner will come across. Around the time of the construction of the first machine-readable corpora, Halliday (1971: 344) stated that "a rough indication of frequencies is often just what is needed". Our aim in this series is to provide as accurate as possible estimates of word frequencies.

Paul Rayson and Mark Davies
Lancaster and Provo, 2008

References

Alderson, J.C. (2007) Judging the frequency of English words. Applied Linguistics, 28(3): 383–409.

Gardner, D. (2007) Validating the construct of Word in applied corpus-based vocabulary research: a critical survey. Applied Linguistics, 28: 241–265.

Halliday, M.A.K. (1971) Linguistic functions and literary style. In S. Chatman (ed.) Style: A Symposium. Oxford University Press, pp. 330–365.

Ljung, M. (1990) A Study of TEFL Vocabulary. Almqvist & Wiksell International, Stockholm.

Nation, I.S.P. (1990) Teaching and Learning Vocabulary. Heinle & Heinle, Boston.

Sinclair, J.M. and Renouf, A. (1988) A lexical syllabus for language learning. In R. Carter and M. McCarthy (eds) Vocabulary and Language Teaching. Longman, London, pp. 140–158.

Thorndike, E. (1921) Teacher’s Word Book. Columbia Teachers College, New York.

Thorndike, E. (1932) A Teacher’s Word Book of 20,000 Words. Columbia University Press, New York.

Thorndike, E. and Lorge, I. (1944) The Teacher’s Word Book of 30,000 Words. Columbia University Press, New York.

West, M. (1953) A General Service List of English Words. Longman, London.


0005 了 [了] /le/ (1) aux [aspect marker indicating realisation of a situation]
她从马上摔了下来。 She fell

吗？ She's a regular beauty, isn't she? 181 | 0.62 | 112

0820 嗯 [嗯] /en, ńg, ňg, ǹg/ (1) interj [interjection used for questioning, surprise, disapproval, or agreement] well, eh, hey, m-hm, uh-huh
嗯，他们是天生一对嘛。 Well, they seem to make a good couple. 2525 | 0.18 | 449 | S

0015 上 [上] /shàng/ (1) loc up, on, in
孩子的新毛衣上勾了一个洞。 The child has picked a hole in his new jumper. 22041 | 0.92 | 20201

0014 人 [人] /rén/ (1) n person, human, man
我尊重他这位作家，也尊重他这个人。 I respect him as a writer and as a man. 24724 | 0.86 | 21225

0003 一 [一] /yī/ (1) num one, a, an
市政大厅前有一群人。 There is a crowd of people in front of the town hall. 69925 | 0.89 | 62263

3135 呀 [呀] /yā/ ono the sound of a creak
门呀的一声开了。 The door opened with a creak. 223 | 0.41 | 90 | S


This is an excerpt from a talk by the former President Nixon. 434 | 0.75 | 326

0004 在 [在] /zài/ (1) prep [indicating location or time, etc.] at, in
她坐在窗旁。 She sat at the window. 52774 | 0.94 | 49460

0010 他 [他] /tā/ (1) pron he, him
目前，他正在度假。 At present, he is on holiday. 36234 | 0.76 | 27619

0111 们 [們] /men/ (1) suf [plural marker for pronouns and some animate nouns]
朋友们舞会后就分开了。 The friends separated after the party. 3199 | 0.92 | 2930


Introduction

The need for a frequency dictionary

A good dictionary is indispensable for language learning. This is particularly true when learning a second or foreign language. There are many kinds of dictionaries, designed for different purposes and having different values for different readers. For instance, a dictionary in multiple volumes that incorporates encyclopaedic or etymological knowledge may not be terribly relevant for language learners unless they are at very advanced levels. Different types of dictionaries can benefit different types of learners. Of those dictionaries that are specifically created for language learners, the focus is usually on providing basic information such as definitions, glosses and word classes illustrated by suitable examples. Such dictionaries typically follow the lexicographic convention of arranging words in alphabetic order so that, while providing an effective and convenient way of looking up a specific word, they do not tell learners which words are more commonly used – that is, which words learners are more likely to encounter at different stages of learning. Neither can a conventional dictionary tell learners which words they are more likely to encounter in different registers such as speech, news, imaginative or informative writing.

It is in these regards that a frequency dictionary as an innovative type of dictionary proves valuable. While it is clearly naïve to state simply and boldly that the most frequent words are the most important to learn, frequency ranking is nonetheless “a parameter for sequencing and grading learning materials” because frequency is “a measure of probability of usefulness” and “high-frequency words constitute a core vocabulary that is useful above the incidental choice of text of one teacher or textbook author” (Goethals 2003: 424). As Leech (1997: 16) argues: “Whatever the imperfections of the simple equation ‘most frequent’ = ‘most important to learn’, it is difficult to deny that frequency information derived from large collections of text (corpora) has an important empirical input to language learning materials.”

With that said, one should not assume that there is an irreconcilable conflict between the frequency dictionary and the more conventional dictionary. Rather, they have different focuses and are complementary to each other. In this dictionary, for example, we have provided just one illustrative example for each word included, even though the word can have different meanings. Readers seeking a broader range of illustrative examples may refer to a different type of dictionary. In addition, for a small number of words included in this dictionary where the word can have different pronunciations for its different senses, we have not indicated which pronunciation corresponds to which word sense. Again, a different dictionary can help with that issue. These decisions were motivated largely by considerations such as the levels of intended readership, the size of the dictionary (see section 5 for further discussion) and, crucially, what was readily available elsewhere. We did not want to unhelpfully replicate what other dictionaries could provide. This dictionary should be used with a conventional learner dictionary – not instead of one.

Like other volumes in the Routledge Frequency Dictionary series, A Frequency Dictionary of Mandarin Chinese furnishes a list of core vocabulary for learners of Mandarin Chinese as a second or foreign language, especially learners whose first or second language is English. This dictionary will prove useful for such learners, whether they are instructed in the classroom or are independent learners. It should also prove of benefit to teachers.

How might classroom learners use this dictionary? Classroom learners normally rely on a selected textbook, which is typically organised around a variety of themes (e.g. shopping, eating out). Thematically related words are surely of benefit in vocabulary acquisition, but a textbook rarely if ever tells learners which of these words are more likely to appear in their actual reading or conversation. In fact, it is very likely that some of those words, which they have learnt with painstaking effort, never occur beyond the classroom situations invented specifically for the sake of language learning. In other words, learners will not find a chance to employ those words in real communication contexts, unless they are talking about a specialised topic. As you will see in the callout boxes for themed vocabulary embedded in the main frequency index of common words, some words are infrequent in the language, but can be potentially useful when discussing specific topics. That's why we have decided to include 30 thematically related lists (see section 5). These thematically related lists are one resource that classroom learners should want to draw upon.

What of the independent learner? They tend to have needs somewhat distinct from those of the classroom learner. They may pick up a piece of text, e.g. a work of fiction or a newspaper, and step through it word by word, consulting a dictionary to check on words which are new to them. While such independent learners work on authentic texts, they may often suspect that their learning could be more effective if they knew at the outset that they were learning the most common words in general Mandarin, as these would be the words which they are most likely to encounter in a wide variety of contexts. They would then work towards specialised and infrequent vocabulary. A frequency dictionary should make an ideal companion for such learners.

Finally, how might teachers use a dictionary such as this? Language teachers should find a frequency dictionary a very helpful source of information. On the one hand, the frequency dictionary provides a graded list of vocabulary with authentic examples, which is valuable supplementary material to complement a textbook. On the other hand, the teacher may find it frustrating that some students entering intermediate level are deficient in vocabulary. In such cases, a frequency dictionary will be an advantage as it affords a structured remedy in this regard. Last but not least, the frequency information contained in this dictionary, which is based on a large balanced collection of data (see section 2), will prove a valuable resource in guiding and informing the development of a language teaching curriculum.

The corpus

Given that the frequencies presented in this dictionary are derived from a large collection of Chinese language, a so-called corpus, it is clearly important to present the corpus data used in this dictionary. For a dictionary that aims to provide a frequency-based core vocabulary for learners, a well-composed corpus is essential. We think that such a corpus must satisfy four requirements for the intended purpose. First of all, it must be large enough to yield a basis for reliable quantification; second, it must achieve a reasonably wide coverage of registers so that learners are exposed to commonly used words in different communication contexts. Third, the language contained in the corpus must be current. Finally, in addition to the quality of data per se, corpus processing must be sufficiently reliable, and this is particularly important for a Chinese frequency dictionary because running texts in Chinese must first of all be segmented into legitimate tokens (a computational process known as segmentation or tokenisation, see below) before they can be annotated with word class information.

The corpus in this dictionary is composed of written and spoken texts from four broad categories as shown in Table 1, totalling roughly 50 million word tokens (or 73 million Chinese characters). The spoken component contains 3.4 million words, covering face-to-face conversations, telephone calls, cross-talks, movie and play scripts, interviews, storytelling, public lectures, radio broadcasts, and public debates, which were mostly produced in the 1990s and 2000-2006. The news component comprises 16 million words of newswire texts released in 1995 by the Xinhua News Agency and newspaper texts published by the People's Daily in 1998 and 2000, in addition to the news categories in the Lancaster Corpus of Mandarin Chinese (LCMC) and the UCLA Written Chinese Corpus. The fiction component amounts to 15 million words, including all fiction categories in LCMC and UCLA Chinese corpora in addition to novels and short stories sampled from various periods in the twentieth century, with the majority published in the 1980s-1990s. The non-fiction component is composed of all informative categories in LCMC and UCLA corpora, together with various non-literary texts of different genres such as official documents, academic prose, applied writing and popular lore, which were sampled from different periods in the second half of the twentieth century, totalling 15 million words.

Table 1 Structure of the corpus

Register       Chinese characters
Spoken          4,679,991
News           26,277,906
Fiction        19,962,277
Non-fiction    22,158,904
Total          73,079,078

While the majority of our corpus data introduced above are monolingual Chinese texts, we have also used a parallel corpus composed of Chinese fictional and non-fictional texts with their English versions, which has allowed us to extract, for each of the selected words included in this dictionary, an illustrative example with English translation.

Once the texts (including transcripts of spoken data) were collected, the next step was to segment the running strings of characters in these texts into word tokens. For alphabetical languages like English, word tokens in a written text are normally separated by white spaces so that the one-to-one correspondence between orthographic and morpho-syntactic word tokens can be considered as a default with a few exceptions: multiwords (e.g. so that and in spite of), mergers (e.g. can't and gonna) and variably spelt compounds (e.g. noticeboard, notice-board, notice board). In Chinese, however, since a written text contains running strings of characters with no delimiting spaces, one has to determine where the words are in the data. More specifically, as it is the computer that is analysing the data, a process must be run which allows the computer to determine where the words are. This process is called word segmentation. Word segmentation requires complex computer processing, which generally involves lexicon matching and the use of a statistical model (cf. McEnery, Xiao and Tono 2006: 35).
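The lexicon-matching half of word segmentation can be illustrated with a minimal sketch. The code below uses greedy forward maximum matching, a simple textbook technique chosen here for illustration; it is not the method used by the segmenter that processed the corpus, and the function name, the `max_len` parameter and the toy lexicon are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation by lexicon matching.

    At each position, take the longest string (up to max_len characters)
    that is in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"朋友", "们", "舞会", "后", "就", "分开", "了"}
print(forward_max_match("朋友们舞会后就分开了", toy_lexicon))
# ['朋友', '们', '舞会', '后', '就', '分开', '了']
```

Pure lexicon matching of this kind cannot resolve ambiguous strings or recognise words missing from the lexicon, which is why practical systems combine it with a statistical model.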

The segmentation tool we used to process our Chinese corpus is ICTCLAS, an acronym for the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. The core system lexicon contains 80,000 words with part of speech information. The system is based on a multi-layer hidden Markov model and integrates modules for word segmentation, part of speech analysis (so-called part of speech tagging), and unknown word recognition (cf. Zhang, Liu, Zhang and Cheng 2002). The rough segmentation module of the system is based on the n-shortest paths method (Zhang and Liu 2002). The model, based on 2-shortest-paths, achieves a precision rate of 97.58 per cent, with a recall rate as high as 99.94 per cent (ibid.). In addition, the average number of segmentation candidates is reduced 64-fold compared with the full segmentation method. The unknown word recognition module of the system is based on role tagging. The module applies the Viterbi algorithm to determine the sequence of roles (e.g. internal constituents and context) with the greatest probability in a sentence, on the basis of which template matching is carried out. The integrated ICTCLAS system is reported to achieve a precision rate of 97.16 per cent for tagging, with a recall rate of over 90 per cent for unknown words and 98 per cent for Chinese person names (ibid.).
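For readers unfamiliar with the precision and recall rates quoted above, the following sketch shows how such figures are commonly computed for segmentation: each token is converted to a character span, and the predicted spans are compared with a hand-checked gold standard. The exact evaluation procedure behind the reported figures is not described here, so this is only the standard definition; the function names and the sample data are hypothetical.

```python
def to_spans(tokens):
    """Convert a token sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def precision_recall(predicted, gold):
    """Precision = correct / predicted tokens; recall = correct / gold tokens."""
    predicted_spans = to_spans(predicted)
    gold_spans = to_spans(gold)
    correct = len(predicted_spans & gold_spans)
    return correct / len(predicted_spans), correct / len(gold_spans)


# Hypothetical example: the segmenter wrongly merges 坐 and 在.
gold = ["她", "坐", "在", "窗", "旁"]
predicted = ["她", "坐在", "窗", "旁"]
precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 0.75 0.6
```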

ICTCLAS applies a very fine-grained part of speech annotation scheme, or tagset (see Appendix), to corpus data. A tagset is the list of part of speech distinctions made by the programme, while tagging is the process whereby the machine decides which part of speech applies to each word, and a tag is the individual mnemonic assigned to each word in the corpus by the computer when the tagging has determined which tag each word should have.

For the purpose of a frequency dictionary, we think that a less fine-grained tagset than that provided by ICTCLAS is more helpful as this can have a boosting effect on words of the same form but with minor differences in usage. Hence we decided to merge subcategories and similar part of speech categories. Our decision to combine those subcategories was also motivated by the fact that Chinese does not have a very strong link between word classes and grammatical functions. For example, adjectives can be used directly as adverbials (tagged as ad by ICTCLAS) while a verb can behave syntactically like a noun (tagged as vn) or adverb (tagged as vd). In addition, non-predicate adjectives (tagged as b) and descriptive adjectives (tagged as z) are also merged into the broad category of adjectives. As it is not always possible to differentiate between idioms (tagged as i) and fixed expressions (tagged as l), the two categories are combined into the category for idiomatic and formulaic expressions. These manipulations resulted in a tagset for use in this dictionary, which consists of 20 part of speech tags as shown in Table 2.

When the texts were tokenised and annotated with part of speech information, the corpus was converted from the local character encoding GB2312 into Unicode (UTF-8), with register information and linguistic annotation marked up in the Extensible Markup Language (XML) so that the corpus could be used with our Perl (Practical Extraction and Report Language) scripts to build frequency indexes for use in this dictionary. This will be discussed in section 4. But before that, let us have a look at the previous frequency dictionaries and lists of words and characters in Chinese.
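The effect of merging fine-grained tags can be sketched as follows. The mapping covers only the tags named in the text (ad, b, z, vn, vd, i, l); the target labels "a" and "v" are assumed to be the broad adjective and verb tags, and "idiom" is a hypothetical label for the combined idiomatic and formulaic category. The tagged sample is invented for illustration.

```python
from collections import Counter

# Fine-grained tag -> broad category. Target labels are assumptions.
MERGE = {
    "ad": "a",      # adjective used as adverbial
    "b": "a",       # non-predicate adjective
    "z": "a",       # descriptive adjective
    "vn": "v",      # verb behaving like a noun
    "vd": "v",      # verb behaving like an adverb
    "i": "idiom",   # idiom
    "l": "idiom",   # fixed expression
}


def merge_tag(tag):
    """Return the broad category for a tag, or the tag itself if unmapped."""
    return MERGE.get(tag, tag)


def count_merged(tagged_tokens):
    """Count (word, broad tag) pairs after merging subcategories."""
    return Counter((word, merge_tag(tag)) for word, tag in tagged_tokens)


# Hypothetical tagged tokens: the same word form under different fine tags.
sample = [("认真", "a"), ("认真", "ad"), ("研究", "v"), ("研究", "vn"), ("研究", "vn")]
counts = count_merged(sample)
print(counts[("认真", "a")], counts[("研究", "v")])  # 2 3
```

Without merging, 研究 would be split into two entries with counts of 1 and 2; merging pools them into a single entry with a count of 3, which is the boosting effect mentioned above.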

Table 2 Part of speech tags annotated in our corpus


Previous frequency dictionaries of Chinese

Ours is not the first frequency dictionary of Chinese. It is, however, quite distinctive, as can be demonstrated by considering the other frequency dictionaries of Chinese that have been created. Because of the large inventory of characters in Chinese, there has been a long tradition of teaching Chinese characters on the basis of frequency, though the research of word frequency on the basis of large collections of text only became possible in the 1990s, with the advent of more powerful computers and specialised computer software for word segmentation. There are at least a dozen frequency lists or dictionaries of Chinese characters and words including, for example:

• Chen's (1928) Yutiwen Yingyong Zi Hui (The Applied Glossary of Modern Chinese): listing 4,261 distinct Chinese characters on the basis of six corpora (children's books, newspapers, women's magazines, after-class work of schoolchildren, classic and modern fiction, and miscellaneous) totalling 554,478 Chinese character tokens;

• Liu's (1973) Frequency Dictionary of Chinese Words: giving statistics such as frequency, dispersion index and usage rate for the 3,059 most frequently used words in Chinese on the basis of a 0.25-million-word corpus covering five registers (fiction, drama, essays, newspapers and periodicals, technical writing);

• Xiandai Hanzi Zonghe Shiyong Pindu Biao (A Comprehensive Frequency Table of Character Usage in Modern Chinese), established on Project Code 748 (1976): listing 4,152 frequently used characters on the basis of a corpus of 21 million characters;

• Beijing Aeronautical University (1985) Xiandai Hanyu Yong Zi Pindu Biao (A Frequency Table of Character Usage in Modern Chinese): listing frequently used characters for ten genres and technical domains on the basis of samples totalling 11.08 million characters;

• Beijing Language and Culture University (1986) Xiandai Hanyu Pinlu Cidian (A Frequency Dictionary of Modern Chinese): listing 16,593 commonly used words extracted from 1,315,752 word tokens (or 1.82 million characters);

• National Language Committee (1988) Xiandai Hanyu Changyong Zi Biao (Commonly Used Characters in Modern Chinese): listing the most commonly used 2,500 characters and 1,000 commonly used characters on the basis of data collected by Beijing Aeronautical University covering the period 1928-1986;

• Hong Kong Polytechnic University (1991-1997) Zhongguo Dalu, Taiwan, Xianggang Hanyu Ciku (A Chinese Word Bank from Mainland China, Taiwan, and Hong Kong): listing 68,011 entries based on a 6-million-character corpus of news texts published during 1990-1992 in the three Chinese speech communities.

With the exception of Liu (1973), all other character and word frequency lists and dictionaries are published in Chinese. All of them are targeted either at native speakers of Mandarin learning their mother tongue (e.g. Chen 1928; National Language Committee 1988), or at language engineers (e.g. the frequency list by Project Code 748) and expert Chinese linguists (e.g. the word bank by Hong Kong Polytechnic University for studying language variation). And with the exception of Liu (1973), all of the existing frequency dictionaries of Chinese characters and words were published and distributed in China, which makes it difficult for learners of Chinese as a second or foreign language outside China to get access to them.

Liu (1973) was published by Mouton and released worldwide, but it also suffers from a number of drawbacks. Like nearly all existing Chinese frequency dictionaries, it is based exclusively on written Mandarin; the data on which the dictionary is based are quite outdated, with texts published during the period 1910-1960; with a total of 0.25 million word tokens, the corpus is also rather small by today's standards; the word class categories featuring in the book are quite obsolete nowadays; no actual Chinese characters are used in the dictionary, these being replaced by a kind of Romanisation system which is no longer widely used; and most importantly, no English gloss or translation, no illustrative example, and no information related to usage are given, making the dictionary almost useless for today's learners of Chinese as a second or foreign language.


Table 3 HSK graded lists of words and characters in Chinese

HSK level     Words   Characters
Level 1        1033     800
Level 2        2019     803
Level 3        2205     591
Level 4        3583     671
Levels 1-3     5257    2194
Levels 1-4     8840    2865

Last but not least, we should not fail to mention the Syllabus of Graded Words and Characters for Chinese Proficiency compiled by the Chinese government's Hanyu Shuiping Kaoshi (the Chinese Proficiency Test, HSK) Committee, which was published in 1992 and revised in 2001. The HSK lexical syllabus lists the words and characters required of learners of Mandarin Chinese as a second or foreign language to pass the Chinese proficiency test HSK, as indicated in Table 3. While the lexical syllabus is undoubtedly instructive for learners of Mandarin – we have made a special effort to include as many words as possible from the syllabus, especially Level 1 and 2 items (see section 4 for further discussion) – it serves a different purpose from a frequency list. The words in the syllabus are arranged conventionally in alphabetical order for each level rather than in the order of frequency, and no actual frequency or frequency ranking is given.

The compilation of the lexical syllabus, which was corpus-based, started in 1988 and the latest texts covered were produced in 1991. Unsurprisingly, most "new words" included in the syllabus are from the early 1980s while some words that were common in the 1970s-1980s, e.g. 少先队 "young pioneer", will not be common enough to merit a place on the list nowadays. On the other hand, many well-established vocabulary items which are commonly used today as a result of technological and social development are not covered in the

A comparison of the HSK graded vocabulary and the frequency index in our dictionary appears to suggest that the corpus on which the HSK vocabulary is based relies too heavily on the Beijing dialect, as evidenced by dialectal usage like 半拉 "half" (Level 2) and words ending with the retroflex suffix 儿, including Level 1 words such as 小孩儿 "child" and 面条儿 "noodle, pasta", Level 2 words such as 聊天儿 "chat" and 墨水儿 "ink", and Level 3 words such as 拐弯儿 "turn a corner, make a turn" and 药水儿 "liquid medicine". Words like these are normally listed in a dictionary without the retroflex 儿, which is tagged in our corpus as a suffix.

Selection of words and characters

According to the HSK lexical syllabus, learners of Chinese as a foreign language who have learnt about 5,000 words will be able to express their ideas on general issues in Chinese. As can be seen in Table 3, this figure approximates the total number of HSK Levels 1-3 words. The number of words we have decided to include in this dictionary is roughly comparable. While it is certainly true that the larger your vocabulary the better, it nonetheless becomes increasingly difficult to learn new words as your vocabulary grows. This is because, according to Zipf's law, the frequency of a word is inversely proportional to its rank in the frequency table. As such, there is a 9.27 per cent increase in coverage from top 1,000 to top 2,000 words, whereas the increase in coverage drops to 0.94 per cent from top 8,000 to top 9,000 words (see Table 4). In addition to the reference to the HSK syllabus, our decision to include 5,000 words was also motivated empirically by the sharp drop in the coverage gain (from 3.07 per cent to 1.77 per cent) from top 5,000 to top 6,000 words.

Table 4 Coverage of top N words

Figure 1 Coverage of top N characters (horizontal axis: top N characters; series: increase, drop)

As can be seen in Figure 1, which shows the increase and drop in coverage resulting from each additional block of 200 characters, Zipf's law also applies to characters. Coverage grows very slowly after the top 1,200 characters. The top 2,000 characters cover nearly 98 per cent of our whole corpus, with a further 4,839 characters accounting for the remaining 2 per cent of coverage.
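The diminishing returns described above can be illustrated with a short Python sketch. It is not the authors' procedure: the function name `coverage_by_block` is hypothetical, and the counts are idealised Zipfian values (frequency proportional to 1/rank), not figures from our corpus. It shows how the coverage gained from each additional block of ranks can be tabulated from any frequency list.

```python
def coverage_by_block(frequencies, block_size):
    """Return (top_n, gain_percent, cumulative_percent) for each block of ranks."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    rows = []
    covered = 0
    for start in range(0, len(ranked), block_size):
        # Tokens contributed by the items ranked start+1 .. start+block_size.
        gain = sum(ranked[start:start + block_size])
        covered += gain
        top_n = min(start + block_size, len(ranked))
        rows.append((top_n, 100 * gain / total, 100 * covered / total))
    return rows


# Idealised Zipfian counts (hypothetical data): frequency proportional to 1/rank.
toy_counts = [1_000_000 // rank for rank in range(1, 10_001)]
for top_n, gain, cumulative in coverage_by_block(toy_counts, 1_000):
    print(f"top {top_n:>6}: +{gain:5.2f}% -> {cumulative:6.2f}%")
```

With real corpus counts in place of `toy_counts`, the second column corresponds to the per-block increase reported in Table 4 and plotted in Figure 1.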

We would like to point out that the above distribution statistics are based on a valid lexicon that we created from the corpus. By "valid lexicon", we mean the frequency lists of words and characters that exclude items that are "uninteresting" from the perspective of vocabulary acquisition, e.g. symbols and punctuation marks, Arabic numerals (written in either full- or half-width form), and non-Chinese character strings. We have also excluded abbreviations, numeral characters indicating years, person names, place names, organisation names, and other proper nouns such as names of countries, nationalities and languages, as well as brand names. Table 5 indicates the size of our valid lexicon.

Table 5 The valid lexicon

Register       Word tokens    Chinese characters
Spoken           2,692,315             3,824,579
News            12,147,572            20,185,322
Fiction         11,973,365            16,424,649
Non-fiction     11,900,160            17,954,729
Total           38,713,412            58,389,279

While a frequency dictionary could be arranged simply in the order of raw frequencies, i.e. actual occurrences of words and characters, we have used a more principled way to decide which words and characters to include, which takes account of their frequencies as well as their distribution in different registers. Words and characters which are frequently used in more registers are clearly more useful than those that are frequent in fewer registers. In this dictionary, we have adopted the same hierarchy composed of three coefficients as established in Juilland and Chang-Rodriguez (1964), namely frequency, dispersion index and usage rate, which are explained as follows.

There are two types of frequency data. Raw frequency refers to the actual occurrence of a word or character in a corpus, while normalised frequency means the frequency that has been adjusted to a common base, in this case the occurrences per million tokens, so that the four registers covered in our corpus can be compared even if they are of different sizes. We have used normalised frequencies in different registers to compute dispersion index and usage rate, while the overall normalised frequency is also given in the entry for a headword so that the reader can easily compare the frequencies of different words on a common basis of per million words. The dispersion coefficient (D) is computed according to Juilland and Chang-Rodriguez's formula:

$$ D = 1 - \frac{\sqrt{n \sum x^{2} - T^{2}}}{2T} $$

In this formula, n stands for the number of registers (four in our corpus), x for the normalised frequency of the item in each register, and T for the sum of these frequencies. This formula reduces dispersion to a coefficient ranging from 0 to 1, regardless of frequency. Words with a higher dispersion coefficient are more evenly distributed in different registers. The usage rate (U) takes account of both frequency and dispersion, which can be taken as a dispersion (D) percentage of frequency (F) or vice versa according to the following formula:

$$ U = \frac{F \times D}{100} $$

Here D is taken as a percentage; with D expressed on the 0 to 1 scale used in this dictionary, the usage rate is simply F × D. This means that when D = 1 the usage rate equals frequency, and when D = 0.5 the usage rate is half of frequency. Hence, a more frequent word with a lower dispersion index can have a lower usage rate. For example, as the word 说 "say" is distributed fairly evenly in the four registers (9,383 instances per million words in spoken, 9,998 in fiction, 4,658 in non-fiction, 3,753 in news), it has a large dispersion index (0.80). With an overall frequency of 27,792 (the sum over the four registers), its usage rate is 22,252. In contrast, the interjection 哎 has an overall frequency of 1,821 instances in our corpus, but it is distributed unevenly in the four registers (1,697 in spoken, 119 in fiction, 5 in non-fiction and 0 in news). Its dispersion index is much smaller (0.21), and its usage rate is 383 in spite of its high overall frequency.
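The two worked examples can be checked with a few lines of Python. This is a minimal sketch, not the authors' Perl code. It assumes that n is the number of registers, that x is the normalised frequency in each register, and that T (used here as F) is their sum, which is the reading that reproduces the figures quoted above. The function names are hypothetical.

```python
import math


def dispersion(register_freqs):
    """Dispersion coefficient D from per-register normalised frequencies."""
    n = len(register_freqs)
    total = sum(register_freqs)
    if total == 0:
        return 0.0
    sum_sq = sum(x * x for x in register_freqs)
    # max() guards against a tiny negative value caused by float rounding.
    spread = math.sqrt(max(0.0, n * sum_sq - total * total))
    return 1.0 - spread / (2.0 * total)


def usage_rate(register_freqs):
    """Usage rate U = F x D, with D on the 0-1 scale."""
    return sum(register_freqs) * dispersion(register_freqs)


# Register order: spoken, fiction, non-fiction, news (figures from the text).
shuo = [9383, 9998, 4658, 3753]
ai = [1697, 119, 5, 0]
for label, freqs in (("说", shuo), ("哎", ai)):
    print(label, round(dispersion(freqs), 2), int(usage_rate(freqs)))
```

Truncating the usage rate to an integer, as `int()` does here, gives 22,252 and 383, matching the values in the text.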

We wrote Perl scripts that automatically computed the overall normalised frequency, the normalised frequency in each of the four registers (i.e. spoken, news, fiction and non-fiction), the dispersion index, and the usage rate for each word and character in our valid lexicon. We have used a combination of these statistics, while also taking account of basic vocabulary in the HSK syllabus as well as our intuitive knowledge of the Chinese language, to decide which words and characters to include in this dictionary:

• All words with a dispersion index below 0.25 are excluded unless they have a usage rate above 100 or a normalised frequency above 1,000 in any of the four registers.

• All items with a usage rate below 45 are excluded unless they are on the Levels 1 and 2 lists in the HSK syllabus.

• All words with an overall normalised frequency below 55 are excluded unless they are on the Levels 1 and 2 lists in the HSK syllabus.

• All words with a normalised frequency below three per million words in the register of fiction are excluded unless they have a usage rate above 100 or are on the Levels 1 and 2 lists in the HSK syllabus.
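The four exclusion rules above can be expressed as a simple filter. The sketch below is illustrative only: the `WordStats` record and the `passes_cutoffs` function are hypothetical names, and the rules were applied in the dictionary together with the authors' own judgement, which no script captures. The first usage line takes its figures for 哎 from the text; the other two records are invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WordStats:
    word: str
    freq: float                      # overall normalised frequency
    register_freq: Dict[str, float]  # normalised frequency per register
    dispersion: float                # dispersion index (0-1)
    usage: float                     # usage rate
    hsk_level: Optional[int] = None  # None if not in the HSK syllabus


def passes_cutoffs(w: WordStats) -> bool:
    """Apply the four exclusion rules; True means the word is kept."""
    basic = w.hsk_level in (1, 2)
    high_in_some_register = max(w.register_freq.values()) > 1000
    if w.dispersion < 0.25 and not (w.usage > 100 or high_in_some_register):
        return False
    if w.usage < 45 and not basic:
        return False
    if w.freq < 55 and not basic:
        return False
    if w.register_freq["fiction"] < 3 and not (w.usage > 100 or basic):
        return False
    return True


# 哎 has a low dispersion index but clears rule 1 through its usage rate.
ai = WordStats("哎", 1821,
               {"spoken": 1697, "fiction": 119, "non-fiction": 5, "news": 0},
               0.21, 383)
# Two invented records: a rare, unevenly spread word and a basic HSK word.
rare = WordStats("rare", 60,
                 {"spoken": 50, "fiction": 5, "non-fiction": 3, "news": 2},
                 0.20, 12)
basic = WordStats("basic", 50,
                  {"spoken": 20, "fiction": 5, "non-fiction": 15, "news": 10},
                  0.50, 40, hsk_level=1)
print(passes_cutoffs(ai), passes_cutoffs(rare), passes_cutoffs(basic))
```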

These operations helped to establish a core list of the top 5,004 words from a total of 30,922 words from our valid lexicon. Our list covers 95.61 per cent of Level 1 words, 80.43 per cent of Level 2 words, 44.22 per cent of Level 3 words, and 18.12 per cent of Level 4 words in the HSK syllabus. In addition to many advancement-related new words like those mentioned earlier, our word list includes many commonly used compound words which are missing in the HSK syllabus, for example: 看到 "see", 很多 "many", 吃饭 "eat a meal, eat", 见到 "see", 不再 "no longer", 很快 "fast; soon", and 听到 "hear", among many others. Such new additions are obviously more helpful to learners than some dialectal or outdated items in the HSK syllabus such as 半拉 "half", 反动 "reactionary" and 少先队 "young pioneer".

We have followed a similar procedure to establish a core list of commonly used Chinese characters. The following cutoff points are used:

• an overall normalised frequency of 70 instances per million tokens;

• a usage rate of 50 instances per million tokens;

• a dispersion index of 0.35;

• a minimal frequency of ten in each of the four registers.

These operations produced a list of the 2,015 most commonly used characters. In order to include as many as possible of the basic characters required in the HSK syllabus, the final criterion above was not strictly applied to Level 1 and 2 characters in the syllabus. As a result, 14 additional Level 1 characters and 83 additional Level 2 characters are included in our character list, thus pushing the total number of characters included in this dictionary to 2,112, which covers 99.3 per cent of Level 1 characters and 96.64 per cent of Level 2 characters in the HSK syllabus. Our character index also covers 81.56 per cent of the 2,500 common characters published in China by the Ministry of Education for native speakers of Mandarin.

Organisation of the dictionary

Following this introductory chapter are a number of indexes of the 5,004 most commonly used words, which are arranged in frequency rank order, in alphabetical order, and in frequency rank order by word class, as well as a list of the 2,112 most commonly used Chinese characters mapping each character onto the top 5,004 words. The frequency index of the book also features a series of embedded callout boxes that show thematically related vocabulary. The remainder of this chapter will give more details of each of these indexes.

Frequency index

This section lists the 5,004 most commonly used words in order of frequency rank. The following information is given for each of the listed headwords:

• frequency rank (in descending order of usage rate);

• headword in Simplified Chinese;

• headword in Traditional Chinese;

• Pinyin gloss of the headword;

• HSK Level (if the word is listed in the HSK syllabus);

• an illustrative example in Simplified Chinese (authentic example cited from our Chinese-English parallel corpus);

• English translation of the example (from our Chinese-English parallel corpus);

• normalised frequency per million words;

• dispersion index;

• usage rate;

• register code (i.e. S or W, indicating whether the word is exceptionally common in speech or writing).

A typical entry looks like the following:

Frequency rank  Headword in Simplified Chinese [Headword in Traditional Chinese] /Pinyin/ (Optional HSK Level) Part of speech English gloss
Illustrative example in Simplified Chinese and English translation of the example
Normalised frequency | Dispersion index | Usage rate | Optional register code

Here we show a concrete example of an entry:

0263 然后 [然後] /ránhòu/ (1) adv afterwards, then
你走到第二个十字路口，然后向左拐。 You go ahead to the second crossing and then turn left.
1,887 | 0.66 | 1,241 | S

In this example, the headword with the frequency rank of 0263 (i.e. the 263rd most commonly used word in our corpus) is 然后, which is written as 然後 in Traditional Chinese. The Pinyin gloss of the word is ránhòu. It is listed as a Level 1 word in the HSK graded vocabulary. This is an adverb, meaning "afterwards, then", as exemplified in 你走到第二个十字路口，然后向左拐。 "You go ahead to the second crossing and then turn left." This headword has an overall normalised frequency of 1,887 instances per million tokens and a dispersion index of 0.66 in our corpus, and thus a usage rate of 1,241 instances per million tokens. It is exceptionally common in spoken Mandarin.

At this point, we would like to remind the reader that when a headword has more than one sense, these senses are separated by a comma or a semi-colon, with the former for similar senses and the latter for different sense groups. However, when an orthographic word has different senses for different word classes, they are listed separately. For example, the orthographic word 会 can function as a verb (0035) meaning "can, know how to do; meet; be likely to, be sure to", or as a noun (0864) meaning "meeting, conference; moment". The two homonymous words are kept separate. Nevertheless, as our corpus is only tagged with part of speech information but not annotated with word senses, it is impossible to find out the frequencies of different word senses of a homonymous or polysemous word if these senses belong to the same part of speech. Consequently, similar and different word senses of the same word class are simply grouped together with the appropriate punctuation mark indicated above. It is also important to note that different senses of a word may have different Pinyin glosses and thus be pronounced differently. In such cases we simply insert commas to separate different Pinyin glosses without mapping Pinyin glosses to word senses (see note 7), and would like to advise the reader to consult a traditional dictionary to ascertain how a word is pronounced for a particular word sense.

Of the information given for each headword, HSK Level and register code are optional. A label for the HSK Level (i.e. 1-3) is only shown if the word is listed in the HSK graded vocabulary, while a register code (S or W) is available only if the word is exceptionally common in speech or writing. Please note that some words may have more than one meaning belonging to different HSK Levels, which are separated by a comma. For the register code, two criteria were employed to determine whether a word is exceptionally common in speech or writing. If a chi-square test indicates that the difference in frequencies of the word in speech (i.e. the spoken register) and writing (i.e. fiction, non-fiction and news) is significant at the probability level p < 0.0000001 while at the same time the S/W ratio (or the W/S ratio) is greater than 3, the word carries the register code S (or W). Of the 5,004 most common words covered in this dictionary, 103 words are exceptionally common in speech while 203 are exceptionally common in writing.

This frequency index constitutes the core of the book, giving you all the essential information about each of the listed words. In addition, 30 callout boxes featuring thematically related vocabulary are embedded in this main frequency index. They are organised along themes closely related to people's lives, e.g. fruits, drinks and beverages, food (flavours, main food, meat, vegetables, food preparation, seasoning), clothing, weather and equipment, city facilities and stores, travel, directions and locations, cities, house and room, home electronics, computers and the Internet, school life and subjects, professions, sports, animals, and human body (physical appearance, body parts, parts on the head, senses). In addition, we have included a number of lists that help readers understand the Chinese language and culture, including the number system, time expressions, colours, Chinese festivals, Chinese zodiac signs, kinship and family relations, English loanwords in Mandarin, special vocabulary in language learning (terms for sentence analysis and punctuation marks), and commonly used words in various registers covered in our corpus (spoken, fiction, non-fiction and news). For most of the lists, frequency ranks are included, with frequent items on top of the lists unless stated otherwise in our comments. For example, the four lists of common words across registers are arranged by statistical salience. There are also a number of lists where no frequency ranks are given, which happens when a list shows an almost closed set of vocabulary items, or when few of the items on a list are covered in our valid lexicon (see section 4). When a themed list includes both items with a frequency rank and those without one, the list is arranged with the most frequent words on top, followed by words without a frequency rank in alphabetical order. These thematic vocabulary lists are an important complement to the frequency indexes in this dictionary because, as you will see, some words are important when you talk about a particular topic, yet they would only be included in a frequency list that covers the top 20,000 common words. In other words, those words are infrequent in the language as a whole.
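The register code rule described earlier in this section (a chi-square test significant at p < 0.0000001 together with an S/W or W/S ratio above 3) can be sketched in Python. This is a minimal sketch rather than the authors' implementation. It assumes a Pearson chi-square on a 2×2 table (occurrences of the word versus all other tokens, in speech versus writing) without continuity correction, and it compares relative frequencies for the ratio. The function names and the word counts in the usage lines are hypothetical; the corpus sizes are the spoken total and the sum of the three written registers in Table 5.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def register_code(spoken_count, spoken_size, written_count, written_size,
                  alpha=1e-7, min_ratio=3.0):
    """Return 'S', 'W' or None for a word, given raw counts and corpus sizes."""
    chi2 = chi_square_2x2(spoken_count, spoken_size - spoken_count,
                          written_count, written_size - written_count)
    if p_value_df1(chi2) >= alpha:
        return None
    spoken_rate = spoken_count / spoken_size
    written_rate = written_count / written_size
    # Comparing products avoids dividing by a zero rate.
    if spoken_rate > min_ratio * written_rate:
        return "S"
    if written_rate > min_ratio * spoken_rate:
        return "W"
    return None


SPOKEN_TOKENS = 2_692_315     # Table 5, spoken
WRITTEN_TOKENS = 36_021_097   # Table 5, news + fiction + non-fiction

# Hypothetical raw counts for three words.
print(register_code(4569, SPOKEN_TOKENS, 1484, WRITTEN_TOKENS))
print(register_code(10, SPOKEN_TOKENS, 5000, WRITTEN_TOKENS))
print(register_code(269, SPOKEN_TOKENS, 3602, WRITTEN_TOKENS))
```

Requiring both conditions means that a statistically significant but small difference, which is easy to obtain in a corpus of this size, does not by itself earn a word a register code.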

Alphabetical index

This section lists the 5,004 most common words in alphabetical order of the Pinyin glosses of headwords. A typical entry for the alphabetical index looks like:

Headword in Simplified Chinese /Pinyin/ Part of speech code English gloss Frequency rank

Here is a concrete example:

已经 /yǐjīng/ adv already 0101

In addition to providing the reader with a quick view of an entry, this alphabetical index also helps the reader to locate the entry in the main frequency index quickly and easily.

Part of speech index

This section gives a frequency index which shows, for each part of speech category, commonly used members in that group, covering all of the top 5,004 words in the main frequency and alphabetical indexes. For each entry in this part of speech index, the following information is included: frequency rank, headword in Simplified Chinese, Pinyin gloss, part of speech code, and English gloss. This index helps readers to build up vocabulary while studying Chinese grammar. It also allows them to refer back quickly to the related items in the main frequency index.

Character frequency index

This frequency index lists the 2,112 most commonly used Chinese characters. We decided not to include part of speech information or English gloss, nor to give illustrative examples in this chapter. There are a number of reasons for this decision. First, as many characters in Chinese are meaningless unless they combine to form a word, it is not always possible to give an English gloss; second, as many of the common monosyllabic words have already been included in the three earlier indexes, it would be redundant to repeat their details here; third, as a focus on characters is different from the notion of words and parts of speech, English glosses and examples for headwords would be less instructive; finally, since the meaning of a Chinese word is not necessarily the aggregation of its constituent characters, learners are encouraged to build vocabulary from "words" instead of characters. For each of the 2,112 commonly used characters included in this index, the following information is given:

Frequency rank Simplified Chinese [Traditional Chinese] /Pinyin/ (Optional HSK Level) List of headwords in word frequency index containing the character and word frequency ranks

An example entry is as follows:

0017 们 [們] /men/ (1) 我们 0026 他们 0041 们 0111 你们 0179 人们 0237 它们 0535 咱们 0540 她们 0567

Each Chinese character in this index is linked to the words in the main frequency index. The frequency ranks of the linked words are given to enable readers to make cross-references easily. If no headword is included in the word frequency index (i.e. the headword is not in the top 5,004 list) for a certain character included in this index, only the character in Simplified Chinese, its Traditional Chinese version, and its Pinyin gloss (and the HSK Level if available) are given. The index of the top 2,112 characters is of help to readers when they decide which characters to learn first; it can also enable learners to switch smoothly between Pinyin and characters, and between Simplified Chinese and Traditional Chinese. In addition, the commonly used words containing the same characters which are mapped from the main frequency index are particularly useful in vocabulary building.
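The mapping from characters to ranked words described above is an inverted index. As an illustration only (the function name `build_character_index` is hypothetical and this is not the authors' code), it could be built as follows from a list of (rank, word) pairs; the sample pairs are taken from the example entry for 们.

```python
from collections import defaultdict


def build_character_index(ranked_words):
    """Map each character to the (word, rank) pairs of the words containing it."""
    index = defaultdict(list)
    for rank, word in ranked_words:
        # dict.fromkeys keeps each character once, so 姐姐 is listed once under 姐.
        for char in dict.fromkeys(word):
            index[char].append((word, rank))
    for entries in index.values():
        entries.sort(key=lambda entry: entry[1])
    return dict(index)


sample = [(179, "你们"), (26, "我们"), (111, "们"), (41, "他们"), (237, "人们")]
index = build_character_index(sample)
print(" ".join(f"{word} {rank:04d}" for word, rank in index["们"]))
```

A character that occurs in none of the ranked words simply has no key in the index, which corresponds to the entries that list only the character, its Traditional form and its Pinyin gloss.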

One must remember, however, that a dictionary is not simply a repository of information of use to learners of a language - it is also an embodiment of the language to some degree. As the language encodes in part a worldview and history, the dictionary also stands as a cultural artefact. The background of the language represented here thus deserves some consideration.

A brief introduction to Mandarin Chinese

Chinese belongs to the Sino-Tibetan language family. It is spoken by a total of 1.3 billion speakers. Of these, the majority are native speakers of Mandarin (i.e. Standard Chinese based on the Beijing dialect) as opposed to another variety of Chinese such as Cantonese. Mandarin Chinese has a total of 1,052 million speakers, more than twice as many as speak English, the language with the second highest total of speakers, at 508 million (see Ostler 2005). Mandarin Chinese is the official language of Mainland China and Taiwan; it is also one of the official languages of Singapore and the United Nations.

There are currently more than 30 million people in the world who are learning Mandarin Chinese as a foreign language (see note 8). The popularity of Chinese as a second language is growing. For example, in the United States, the number of Chinese learners is growing fastest in comparison to learners of other foreign languages. In Britain, Mandarin is studied by more children than German and Russian (only French and Spanish are presently more popular); and Mandarin is expected to overtake Spanish in three years if the rate of growth continues (see note 9).

Probably the most striking difference between Chinese and most other languages is purely visual - its written form. English, and many other languages, employ an alphabetical system. Chinese uses a logographic system, i.e. roughly speaking, the symbols of English encode sounds, whereas those in Chinese either singly or in combination encode words, with each character being a syllable. As a result the Chinese writing system is relatively complex - whereas English has only 26 alphabetical characters (i.e. letters) that can be arranged in different combinations to form tens of thousands of different words, Chinese has tens of thousands of individual characters that represent words. What makes it even more difficult for learners to build up their Chinese vocabulary is that, while some Chinese characters represent single words, it is more common for characters to function in combination to form words, many of which have a different meaning than the simple aggregation of the meanings of constituent characters (see below for further discussion). To make things yet more complex, since Chinese does not use white spaces to delimit words in writing, learners have to decide for themselves which characters in the running text form a word when they read, though word boundaries can be inferred in spoken Chinese on the basis of spoken features such as pauses and repetitions. Given the huge number of Chinese characters, and of words, it is not only quite impossible, but totally unnecessary as well, for learners (or average native Chinese speakers) to know tens of thousands of Chinese characters and words (see section 4). This explains why a frequency dictionary of Chinese, which provides core lists of characters and words in this language, can be of particular advantage as a guide to vocabulary teaching and learning.

As noted earlier, some Chinese characters can serve as words; they can also combine with other characters to form new words. In terms of word types, the overwhelming majority of Chinese words are disyllabic, as illustrated in Figure 2, which shows the proportions of words at different levels as required in the HSK syllabus. As can be seen, disyllabic words account for 72 per cent on average while the proportion of monosyllabic words is roughly 22 per cent. Words composed of three characters or more are relatively infrequent. It is also interesting to note that, as the HSK Level increases, disyllabic words increase in number while the proportion of monosyllabic words drops. This is probably because many high-frequency function words, which are more likely to be monosyllabic words, are typically required at HSK Level 1. For the same reason, monosyllabic words are expected to make up a large proportion of Chinese texts in terms of word tokens.

Figure 2 Words of varying lengths in the HSK syllabus

This expectation is in fact borne out in our corpus data. The "valid lexicon" (see section 4) which furnishes a quantitative basis for this dictionary comprises 38,713,412 word tokens (running words in the text) in 84,883 word types (different words). Of these, there are 6,413 monosyllabic words (7.56 per cent of the total), yet they account for 54.08 per cent of total tokens. In contrast, while 46,670 disyllabic words take up the largest proportion in terms of word types (54.98 per cent), they account for 42.33 per cent of total word tokens. Although three-character (22.35 per cent) and four-character words (13.24 per cent) are also numerous as word types, they do not contribute much in terms of word tokens, as shown in Figure 3.
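The contrast between word types and word tokens can be made concrete with a small Python sketch. The function name `length_profile` and the toy counts are hypothetical and are not corpus data; the point is only that the same frequency list yields two different percentages per word length, depending on whether each word is counted once or once per occurrence.

```python
from collections import Counter


def length_profile(token_counts):
    """Return {length: (percent_of_types, percent_of_tokens)} by word length."""
    type_totals = Counter()
    token_totals = Counter()
    for word, count in token_counts.items():
        type_totals[len(word)] += 1       # each distinct word counts once
        token_totals[len(word)] += count  # each occurrence counts
    n_types = sum(type_totals.values())
    n_tokens = sum(token_totals.values())
    return {
        length: (100 * type_totals[length] / n_types,
                 100 * token_totals[length] / n_tokens)
        for length in sorted(type_totals)
    }


# Toy counts (hypothetical): two very frequent monosyllabic words and
# four less frequent disyllabic words.
toy = {"的": 50, "是": 30, "我们": 10, "东西": 6, "冰箱": 3, "葡萄": 1}
for length, (types, tokens) in length_profile(toy).items():
    print(f"{length} character(s): {types:.1f}% of types, {tokens:.1f}% of tokens")
```

In the toy data the monosyllabic words are a minority of types but the large majority of tokens, the same pattern as reported for the valid lexicon.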

A character corresponds to a syllable in spoken Chinese (cf. Li and Thompson 1980: 13). While classical Chinese can be classified as a monosyllabic language, this is no longer true of modern Chinese. As we have noted, monosyllabic words only account for 22 per cent of the graded vocabulary in the HSK syllabus, while the proportion of monosyllabic words in Mandarin as a whole is much lower (see Figure 3). Some characters can be used directly as words, e.g. 有 "have", 来 "come", and 新 "new"; some characters can serve directly as words or as parts of a word, the meaning of which may or may not be related to the meanings of individual constituent characters (e.g. 我 "I, me", 们 "plural suffix" and 我们 "we, us" versus 东 "east", 西 "west" and 东西 "stuff"); some characters cannot stand alone as words (e.g. 蝴蝶 "butterfly" and 葡萄 "grape").

Words in Chinese can be simplex or compound. A simplex word has one morpheme, which can be monosyllabic (e.g. 天 "sky", 去 "go", 他 "he, him")

Figure 3 Words of varying lengths in our corpus

There are different types of compound words in Chinese according to their internal structures, namely coordinate (e.g. 寒冷 "cold"), endocentric (e.g. 冰箱 "ice-box, refrigerator"), verb-complement (e.g. 提高 "raise, improve"), verb-object (e.g. 洗澡 "take a bath or shower"), subject-predicate (e.g. 地震 "earthquake"), affixed (e.g. 刀子 "knife"), and reduplication (e.g. 姐姐 "elder sister").

While the discussion above appears to suggest that "wordhood" is easy to define in Chinese, it is nevertheless not always easy or even possible to make a distinction between morphemes and words on the one hand and between words and phrases on the other hand (cf. Wu 2003: 3). In fact, a whole range of criteria have been proposed to define wordhood of various types, e.g. orthographical word, morphological word, lexical word, syntactic word, grammatical word, semantic word, sociological word, psychological word, phonological word, prosodic word (see Di Sciullo and Williams 1987; Dai 1998; Duanmu 1998; Packard 2000; Feng 2001a). On the basis of a review of such criteria, Dixon and Aikhenvald (2002) propose to maintain a distinction between phonological and grammatical words, which may or may not coincide. While the authors concede that the criteria that they use to define a phonological word do not apply in every language, they offer three "universal criteria" that define a grammatical word: "A grammatical word consists of a number of grammatical elements which (a) always occur together, rather than scattered through the clause (the criterion of cohesiveness); (b) occur in a fixed order; (c) have a conventionalised coherence and meaning" (ibid.: 19).

Unfortunately, the so-called "split words" in Chinese satisfy none of these criteria. The morphemes that make up a split word can not only scatter through the clause instead of occurring together (e.g. 睡觉 "sleep": 睡了一天觉 "slept for a whole day"), they can even occur in a reversed rather than fixed order (e.g. 上学 "go to school": 学，我爱上就上，不爱上就不上，谁管得着。 "School, I'll go if I like and won't if I don't. Who has the right to interfere?"). Dixon and Aikhenvald (ibid.: 20) comment that criterion (c) means that as a word has its own coherence and meaning, speakers of the language "may talk about a word (but are unlikely to talk about a morpheme)." This criterion does not apply in Chinese either, not only because morphemes in Chinese can be coherent and meaningful (a monosyllabic word consists of one morpheme) but also because even bound morphemes in split words such as 请客 "invite/treat guest(s)" can be "talked about", e.g. 请了 "invited/treated three tablefuls of guests". According to Dixon and Aikhenvald (ibid.: 6), "the (grammatical) word forms the interface between morphology and syntax. Morphology deals with the composition of words while syntax deals with the combination of words". In Chinese, however, word-internal structures are similar to syntactic structures (Dai 1998; Wu 2003: 3), as can be seen in the types of Chinese compound words discussed above.


As far as split words are concerned, the difficulty lies in the fuzzy distinction between words and phrases. Many competing criteria have been proposed to differentiate (compound) words from phrases, e.g. conjunction reduction, freedom of parts, semantic composition, syllable count, insertion, exocentric structure, adverbial modification, XP substitution, productivity, and intuition (see Duanmu 1998 for a review). However, while each of these tests "may work in certain areas for certain cases, there is no overall generalisation and constraint on what is a compound and what must be a phrase" (Feng 2001b). By grammatical criteria such as splitability and insertion, disyllabic split words such as 吃饭 "eat", 睡觉 "sleep", 走路 "walk", 跑步 "jog", 关心 "care for", and 担心 "worry about" are phrases, but they are generally accepted as words by native speakers, whereas those that can be judged as words by grammatical criteria (e.g. 多弹头分导重返大气层运载工具 "multiple independently targeted reentry vehicle, MIRV") may not be accepted as words. This is not only because "the morphological system of Chinese is strongly sensitive to prosodic foot" (Feng 1997: 135), but also because the "word sense" of a native speaker of Chinese is based on prosody (cf. Feng 2001a). Duanmu (1998) also observes that "there is a rich body of phonological evidence, especially metrical and tonal evidence, for the distinction between words and phrases in both Mandarin and other Chinese dialects". This view of wordhood is in line with Matthews's (1991: 209) statement that "the word tends to be a unit of phonology as well as grammar".

Because of the multidimensional properties of wordhood in Chinese (cf. Feng 2001a), we decided to take a less rigid approach to what counts as a word. As noted in section 2, some of the "words" included in this dictionary can be more appropriately called formulaic expressions, e.g. 不得不 "have to", 也就是说 "that is to say", 与此同时 "meanwhile", 忍不住 "cannot help but", and 有意思 "interesting". We think that such commonly occurring formulaic expressions acting as larger "building blocks" of language are equally useful for learners, if not more so.

There are currently two sets of characters for the Chinese writing system: Simplified Chinese and Traditional Chinese. The former is officially used in Mainland China, Singapore and Malaysia while the latter is officially used in Taiwan, Hong Kong and Macau. Overseas Chinese-speaking communities generally opt for Traditional Chinese characters, but Simplified Chinese characters are gradually becoming popular. In this dictionary, we have given both Simplified and Traditional Chinese versions for each entry of word and character, but illustrative examples are shown only in Simplified Chinese as they appear in our corpus.

In addition to Chinese characters for the writing system, a commonly used alphabetical system known as Pinyin has been employed to Romanise the Chinese script. Pinyin uses the Latin alphabet to represent sounds in Mandarin. It is not only beneficial in helping native children and foreign learners to learn spoken Chinese before they start to learn Chinese characters, but it is also a popular method of inputting Chinese characters into a computer. There are four tones in the Pinyin system, with each syllable of every word characterised by one of them, except for a few syllables which are considered toneless. The tones, which are marked on one of the vowels (a, e, i, o, u) in a syllable, are first or "high" (ā, ē, ī, ō, ū), second or "rising" (á, é, í, ó, ú), third or "falling-rising" (ǎ, ě, ǐ, ǒ, ǔ), and fourth or "falling" (à, è, ì, ò, ù). In this dictionary, toned Pinyin glosses are given for all entries of words and characters. It is important to note, however, that the same words can have different Pinyin glosses for different word senses (see section 5). Also, the tones for the same characters in the word index and the character index may be different because of "tone sandhi", i.e. the change of tone that occurs when different tones come together in a word.
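For readers who handle Pinyin on a computer, the four tone marks described above correspond to four Unicode combining diacritics (macron, acute accent, caron and grave accent). The sketch below is an illustration and is not part of the dictionary's own tooling; the function name `tone_marked` is hypothetical. It composes a plain vowel and a tone number into the marked vowel using Python's standard `unicodedata` module.

```python
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def tone_marked(vowel, tone):
    """Return the vowel carrying the mark for the given tone (1-4)."""
    if tone not in TONE_MARKS:
        return vowel  # toneless syllables carry no mark
    # NFC normalisation merges the vowel and the mark into one character.
    return unicodedata.normalize("NFC", vowel + TONE_MARKS[tone])


for tone in (1, 2, 3, 4):
    print(tone, " ".join(tone_marked(v, tone) for v in "aeiou"))
```

This only shows how the marks are encoded; deciding which vowel of a syllable carries the mark follows separate Pinyin spelling rules that are not covered here.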


Notes

1 We thank Professor Hongyin Tao of the University of California, Los Angeles for permitting us to use part of the data he collected in the Lancaster Los Angeles Spoken Chinese Corpus (LLSCC, http://www.ling.lancs.ac.uk/corplang/llscc/). We are also grateful to Dr Jiajin Xu of Beijing Foreign Studies University for allowing us to use his corpus of Spoken Chinese of Urban Teenagers (SCOUT).

2 See http://www.elda.org/catalogue/en/text/W0039.html for more information about the LCMC corpus.

3 See http://www.ling.lancs.ac.uk/corplang/ucla/ for more


5 The current national standard in China for Chinese Information Processing, GB2312-80 (Character Set for Chinese Character Encoding in Information Exchange – Basic Set), is based on the frequency list established on Project Code 748.

6 The HSK Level 2 words 反动 "reactionary" and 红旗 "red flag" are excluded from our word list as they are not as common nowadays (ranking 5781 and 6004 respectively in our corpus) as they were during the so-called "Cultural Revolution".

7 This decision was motivated by the fact that in computer programming we had to keep the same kind of information (e.g. Pinyin gloss, or word sense) together in one field instead of mixing information of different kinds (e.g. Pinyin glosses and word senses).

8 See BBC News on 9 January 2007 (http://news.bbc.co.uk/2/hi/asia-pacific/6244763.stm).

9 See mandarin.htm.


The ICTCLAS part of speech annotation scheme


Frequency index

Frequency rank Headword in Simplified Chinese [Headword in Traditional Chinese] /Pinyin/ (Optional HSK Level) Part of speech English gloss
Illustrative example in Simplified Chinese and English translation of the example
Normalised frequency | Dispersion index | Usage rate | Optional register code
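For readers who want to process the entries by program, the sketch below parses the final line of an entry, the one laid out as "Normalised frequency | Dispersion index | Usage rate | Optional register code". The class name `EntryStats`, the function `parse_stats` and the field names are hypothetical, chosen for this example only. It also assumes the dispersion index is written as a decimal (for instance 0.62) and the other two figures as whole numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryStats:
    """The figures on the last line of a frequency-index entry."""
    normalised_frequency: int
    dispersion_index: float
    usage_rate: int
    register_code: Optional[str] = None  # absent for most entries


def parse_stats(line: str) -> EntryStats:
    """Parse a line such as '13126 | 0.62 | 8152 | S'."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    frequency, dispersion, usage = fields[:3]
    # The register code is optional; an empty fourth field counts as absent.
    register = fields[3] if len(fields) == 4 and fields[3] else None
    return EntryStats(int(frequency), float(dispersion), int(usage), register)


if __name__ == "__main__":
    print(parse_stats("13126 | 0.62 | 8152 | S"))
    print(parse_stats("11795 | 0.98 | 11525"))
```

Keeping each figure in its own typed field makes it straightforward to sort or filter entries by any one of them.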

0001 的 [的] /de/ (1) aux [structural particle used

0006 不 [不] /bù/ (1) adv no, not
这条街上不准停车。 You can't park in this street
50589 | 0.8 | 40245

0007 我 [我] /wǒ/ (1) pron I, me
咖啡和茶使我感到兴奋。 Coffee and tea stimulate me
51365 | 0.71 | 36653

0008 个 [個] /gè/ (1) clas [generalised measure word used for nouns without a specific measure term]
山那边有一个村庄。 There is a village beyond the hill

Pinyin  Gloss  Frequency rank
apple
watermelon
grape
peach
lí  pear
xiāngjiāo  banana
yīngtáo  cherry
lìzhī  litchi, lychee
shìzi  persimmon
shíliu  pomegranate
mángguǒ  mango
lǐzi  plum
júzi  tangerine
cǎoméi  strawberry
júzi  orange
yēzi  coconut
lóngyǎn  longan fruit
bōluó  pineapple
shānzhā  hawthorn
míhóutáo  Chinese gooseberry
níngméng  lemon


0019 了 [了] /le/ (1) part [sentence final particle indicating change of state or current

0022 对 [對] /duì/ (1) prep for, to, with regard to
她必须对她的行为负责。 She must answer for her actions

0023 还 [還] /hái/ (1) adv still, yet
后i 还有座位。 There are still vacant seats at the back

0025 大 [大] /dà/ (1) adj big, large
一个大浪把小船卷走了。 A huge wave swept the boat away
11795 | 0.98 | 11525

0026 我们 [我們] /wǒmen/ pron we, us
我们早上很早就出发了。 We made an early start in the morning
15065 | 0.74 | 11115

0027 着 [著] /zhe/ (1) aux [aspect marker indicating a durative or ongoing situation]
花园里的花正开着。 The flower in the garden

0030 中 [中] /zhōng/ (1) loc in, within
数字写在背后的表中。 The figures are set out in the table at the back of the book
9906 | 0.86 | 8485

0035 会 [會] /huì/ (1) v can, know how to do
谁说我不会做饭？ Who said that I cannot cook?
10178 | 0.83 | 8476

0036 地 [地] /de/ (1) aux [structural particle introducing an adverbial modifier]
你得认真地打每一个球。 You should be serious about each stroke


0037 那 [那] /nà/ (1) pron that
那是一派胡言！ That's total nonsense!
13126 | 0.62 | 8152 | S

0038 很 [很]
火车很可能要晚点。 It's very likely that the train will be delayed

0041 他们 [他們] /tāmen/ (1) pron they, them
他们刚刚到达。 They've just arrived

0044 得 [得] /de/ (1) aux [structural particle used after a verb that introduces a complement

0046 又 [又] /yòu/ (1) adv again, once again
电梯又坏了。 The elevator's out again

0050 出 [出] /chū/ (1) v out; produce; happen
这儿出了什么事？ What's happened here?

0053 来 [來] /lái/ (2) aux [preceding a verb to indicate the intended or suggested action]
我找到这个小玩意儿来做开瓶器。 I found this gadget that will serve as a bottle opener
6516 | 0.86 | 5604

0054 次 [次] /cì/ (1) clas [measure word indicating number of repetitions or count of actions or events] times
这种药每天喝三次。 Take this medicine three times a day
6398 | 0.86 | 5516

0055 多 [多] /duō/ (1) adj many, much, plentiful
他的钱多得足够买下那个岛。 He had enough money to buy out the island
5823 | 0.94 | 5485

0056 想 [想] /xiǎng/ (1) v think
对不起，我想她出去购物了。 Sorry, I think she's gone shopping

0062 几 [幾] /jǐ/ (1) num several; how much
我i 离开几天。 I am going away for a few

0064 下 [下] /xià/ (1) loc under, below
它们在我床下。 They're under my bed
5505 | 0.89 | 4916

0065 为 [為] /wèi/ (1) prep [introducing purpose, reason, beneficiary, etc.] for, because of
你能为我弄到两张音乐会的好票吗？ Can you secure me two good tickets for the concert?


0067 后 [後] /hòu/ (1) loc behind; after, later
三天后您可以来试衣。 You can come for a fitting three days later

0069 多 [多] /duō/ (1) num many, much, numerous
我们已试过多次了。 We have tried many

0077 以 [以] /yǐ/ (2) prep by means of, with, in (some way); according to; because of

0080 二 [二] /èr/ (1) num two
双方踢成二平。 The two teams tied
4663 | 0.86 | 4003

0083 更 [更] /gèng/ (1) adv [comparative degree] more
我无法跑得更快了。 I can't run any faster
4509 | 0.88 | 3988

0084 之 [之] /zhī/ (3) aux [archaic equivalent of structural particle 的]
他们做得对还是错，这是争议之处。 It is a matter of dispute whether they did the right thing
5125 | 0.77 | 3951

0085 走 [走] /zǒu/ (1) v walk; leave
他一句话都没说就走了。 He left without a word

0089 呢 [呢] /ne/ (1) part [particle used at the end of a question or declarative sentence to indicate mood]
你为什么不回家去呢？ Why didn't you go home?
6522 | 0.56 | 3677 | S

0090 知道 [知道] /zhīdào/ (1) v know
照理他应该知道她的地址。 He ought to know her address


0095 于 [於] /yú/ (2) prep [indicating time, location, direction, etc.] in, at
他们将于下周出发到香港去。 They will set out for Hong Kong next week

Please put your articles in this envelope and seal it

0106 高 [高] /gāo/ (1) adj tall, high
多高的大楼啊！ What a tall building it is!

0111 们 [們] /men/ (1) suf [plural marker for pronouns and some animate nouns]
朋友们舞会后就分开了。 The friends separated after the party
3199 | 0.92 | 2930

0112 新 [新] /xīn/ (1) adj new
新货很快就到。 The new order is coming soon
4702 | 0.62 | 2925

0113 所 [所] /suǒ/ (2) aux [particle preceding a verb to form a nominal structure]
这本词典正是我所需要的。 This dictionary is exactly what I need
3646 | 0.8 | 2906

0114 社会 [社會] /shèhuì/ (1) n society
很多艺术家都感到与社会脱节。 Many artists feel alienated from society

0119 吧 [吧] /ba/ (1) part [modal particle indicating a suggestion or request; marking a question requesting confirmation, or a pause after alternatives]
这次我们各付各的吧。 Let's go Dutch this time


0124 使 [使] /shǐ/ (2) v [often used in serial verb constructions] make, cause,

房间里有四张单人床。 There are four single beds in the room

0131 将 [將] /jiāng/ (2) adv [indicating a future happening] will, be going to

She dressed up like a princess for the party

0134 叫 [叫] /jiào/ (1, 1) v name, call; shout; order (somebody to do something); order (meal, taxi, etc.)
你的父母叫你什么？ What do your parents call you?
3944 | 0.63 | 2501

0135 国家 [國家] /guójiā/ (1) n country
儿童是国家和社会的未来。 Children are the future of the country and society
4533 | 0.55 | 2497

0136 起 [起] /qǐ/ (1) v get up, rise; up
我通常起得很早。 I usually get up early
2614 | 0.94 | 2465

0140 全 [全] /quán/ (1) adj whole, entire, complete
最为重要的一点是，全家人团聚在一起。 Most important, the entire family was together
4569 | 0.53 | 2415

0141 完 [完] /wán/ (1) v finish, be over; (use) up
我刚看完第 3 章。 I've just finished reading Chapter 3
3434 | 0.7 | 2388

0142 时间 [時間] /shíjiān/ (1) n time
剩下时间不多了。 There is little time left
2634 | 0.9 | 2375

0143 起来 [起來] /qǐlái/ (1) v get up, rise; [following a verb to indicate the beginning of a situation] start to; [following a verb to indicate completedness or effectiveness]
她哭起来了。 She started crying

0147 老 [老] /lǎo/ (1) adj old, veteran
房子前面的那棵树很老了。 The tree in front of the house is very old
2817 | 0.81 | 2274

0148 可能 [可能] /kěnéng/ (1) v might (happen)
他可能明天来。 He may come tomorrow


I'm afraid the colour is a bit too bright for me

0157 条 [條] /tiáo/ (1) clas [measure word for things of a long and thin shape (e.g. string and river), pieces of writing (e.g. news, suggestions and regulations), or human

2924 | 0.69 | 2010

0163 文化 [文化] /wénhuà/ (1) n culture
美国是个多文化的国家。 The US is a country with many different cultures
2726 | 0.74 | 2007

0164 问 [問] /wèn/ (1) v ask
如果他来这儿，我将问他几个问题。 If he comes here, I shall ask him some questions

0167 点 [點] /diǎn/ (1) clas [measure word for point, item, etc.]; [measure word for small quantities]
我只想简单讲两点。 I shall only mention two things in brief
2651 | 0.73 | 1944

0170 进行 [進行] /jìnxíng/ (1) v go on, last; be under way; carry on, carry out, perform
比赛进行了两个小时。 The game lasted two hours
2239 | 0.86 | 1923

0173 长 [長] /cháng/ (1) adj long
到城里要走很长的路程。 It is a long walk to the town


1888 | 0.97 | 1834

0179 你们 [你們] /nǐmen/ (1) pron [plural] you
对不起，给你们找麻烦了。 I'm sorry to have caused you so much trouble

2 Drinks and beverages

kuàngquán shuǐ  mineral water  8642
báijiǔ  white spirit
báilándì  brandy
bái pútaojiǔ  white wine
bǎishì kělè  Pepsi, Pepsi-Cola
bīng kāfēi  ice coffee
bīng shuǐ  ice water
chéngzhī  orange juice
chúnjìng shuǐ  purified water
dùsōngzǐjiǔ  gin
hóngjiǔ  red wine
hóng pútaojiǔ  red wine
huāchá  scented tea
jīwěijiǔ  cocktail
júzi zhī  orange juice
kěkǒu kělè  Coca-Cola, Coke
máotái jiǔ  maotai
níngméng shuǐ  lemonade
píngguǒ zhī  apple juice
rè qiǎokèlì  hot chocolate


0180 找 [找] /zhǎo/ (1) v look for
你在找什么？ What are you looking for?
2560 | 0.7 | 1795

0181 跟 [跟] /gēn/ (1) prep [indicating relationship, involvement, or comparison] with
我有点要紧的事跟他商量。 I have something urgent to discuss with him
3142 | 0.57 | 1786 | S

0182 儿 [兒] /er, r/ suf [nonsyllabic suffix for retroflection, especially in the Beijing dialect]
这两张画儿不 i 一样。 The two pictures are not quite the same

0184 女 [女] /nǚ/ (1) adj female, woman
门外是两个女鬼！ Outside were two female ghosts
2007 | 0.88 | 1761

0185 而且 [而且] /érqiě/ (1) conj (not only) but also
这不仅省事而且省钱。 This will save not only labour but also money
2043 | 0.86 | 1761

0186 开 [開] /kāi/ (1) v open; operate, drive (car), turn on (light); start (business); hold (meeting, party, etc.); (water) boil; write (cheque); (flower)

0188 其 [其] /qí/ (3) pron [third person singular or plural] his, her, its, their; that, such

0191 没 [沒] /méi/ (1) v have not, there be not
我根本没时间给你写信。 I have no time at all
2341 | 0.7 | 1630

0200 住 [住] /zhù/ (1) v live, stay
你住在几号房间呢？先生。 What room are you staying in, sir?
1795 | 0.9 | 1607

0204 带 [帶] /dài/ (1) v carry, bring, take
请务必随时随身带着钥匙。 Please make sure that you take the key with you at all times


我们可以请您一起进餐吗？ May we have the pleasure of your company at dinner?
2762 | 0.57 | 1576 | S

0210 地 [地] /dì/ (1) n earth, ground, field
地上有很多树叶。 There are a lot of leaves on

0212 同 [同] /tóng/ (2) prep [indicating relationship, involvement, comparison] with

0216 张 [張] /zhāng/ (1) clas [measure word for flat objects and things with a flat surface, and also for bows and mouths]

0219 件 [件] /jiàn/ (1) clas [measure word for clothes, furniture, affairs, etc.] item,

2682 | 0.57 | 1523 | S

0221 真 [真] /zhēn/ (1) adv really
真漂亮。 It's really beautiful
2256 | 0.67 | 1521

0222 啊 [啊] /a/ (1) part [modal particle showing affirmation, approval, or consent]
多么美丽的地方啊。 What a beautiful place it is!
2034 | 0.72 | 1456

0227 最后 [最後] /zuìhòu/ (1) adj final, last
谁笑到最后谁笑得最好。 Who laughs last laughs longest
1609 | 0.9 | 1452

0228 人民 [人民] /rénmín/ (1) n (the) people
人民是文艺工作者的母亲。 It is the people who nurture our writers and artists
2754 | 0.53 | 1447

0229 手 [手] /shǒu/ (1) n hand
握住瓶子，用另一只手拔瓶塞。 Hold the bottle and pull the cork out with the other hand
1958 | 0.72 | 1413

0233 政府 [政府] /zhèngfǔ/ (1) n government
这篇文章展示了政府的政策。 The article revealed the policies of the government
