A Frequency Dictionary of Mandarin Chinese
A Frequency Dictionary of Mandarin Chinese is an invaluable tool for all learners of Mandarin Chinese, providing a list of the 5,000 words and the 2,000 Chinese characters most commonly used in the language.
Based on a 50-million-word corpus composed of spoken, fiction, non-fiction and news texts in current use, the dictionary provides the user with a detailed frequency-based list, as well as alphabetical and part-of-speech indexes.
All entries in the frequency list feature the English equivalent and a sample sentence with English translation. The dictionary also contains 30 thematically organised lists of frequently used words on a variety of topics such as food, weather, travel and time expressions.
A Frequency Dictionary of Mandarin Chinese enables students of all levels to maximise their study of Mandarin vocabulary in an efficient and engaging way. It also represents an excellent resource for teachers of the language.
Richard Xiao is Senior Lecturer and Programme Leader in Chinese Studies at Edge Hill University. Paul Rayson is Director of the University Centre for Computer Corpus Research on Language and a teaching fellow at Lancaster University. Tony McEnery is Professor of English Language and Linguistics at Lancaster University.
Routledge Frequency Dictionaries
General Editors:
Paul Rayson, Lancaster University, UK
Mark Davies, Brigham Young University, USA
Editorial Board:
Michael Barlow, University of Auckland, New Zealand
Geoffrey Leech, Lancaster University, UK
Barbara Lewandowska-Tomaszczyk, University of Lodz, Poland
Josef Schmied, Chemnitz University of Technology, Germany
Andrew Wilson, Lancaster University, UK
Adam Kilgarriff, Lexicography MasterClass Ltd and University of Sussex, UK
Hongying Tao, University of California at Los Angeles
Chris Tribble, King's College London, UK
Other books in the series:
A Frequency Dictionary of German
A Frequency Dictionary of Portuguese
A Frequency Dictionary of Spanish
A Frequency Dictionary of French (forthcoming)
A Frequency Dictionary of Arabic (forthcoming)
A Frequency Dictionary of Mandarin Chinese
Core vocabulary for learners
Richard Xiao, Paul Rayson and Tony McEnery
Routledge
Taylor & Francis Group
LONDON AND NEW YORK
First published 2009
by Routledge
2 Park Square, Milton Park, Abingdon, OX14 4RN
Simultaneously published in the USA and Canada
by Routledge
711 Third Ave, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2009 Richard Xiao, Paul Rayson and Tony McEnery
Typeset in Parisine by Graphicraft Limited, Hong Kong
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
ISBN13: 978-0-203-88307-5 (ebk)
Contents
Thematic vocabulary lists
Series preface
Part of speech index
Character frequency index
Thematic vocabulary lists
6 Weather and equipment
7 City facilities and stores
23 Kinship and family relations
24 Moods and emotions
Series preface
Frequency information has a central role to play in learning a language. Nation (1990) showed that the 4,000-5,000 most frequent words account for up to 95 per cent of a written text and the 1,000 most frequent words account for 85 per cent of speech.
Although Nation's results were only for English, they do provide clear evidence that, when employing frequency as a general guide for vocabulary learning, it is possible to acquire a lexicon which will serve a learner well most of the time. There are two caveats to bear in mind here. First, counting words is not as straightforward as it might seem. Gardner (2007) highlights the problems that multiple word meanings, the presence of multiword items, and grouping words into families or lemmas, have on counting and analysing words. Second, frequency data contained in frequency dictionaries should never act as the only information source to guide a learner. Frequency information is nonetheless a very good starting point, and one which may produce rapid benefits. It therefore seems rational to prioritise learning the words that you are likely to hear and read most often. That is the philosophy behind this series of dictionaries.
Lists of words and their frequencies have long been available for teachers and learners of language. For example, Thorndike (1921, 1932) and Thorndike and Lorge (1944) produced word frequency books with counts of word occurrences in texts used in the education of American children. Michael West's General Service List of English Words (1953) was primarily aimed at foreign learners of English. More recently, with the aid of efficient computer software and very large bodies of language data (called corpora), researchers have been able to provide more sophisticated frequency counts from both written text and transcribed speech. One important feature of the resulting frequencies presented in this series is that they are derived from recently collected language data. The earlier lists for English included samples from, for example, Austen's Pride and Prejudice and Defoe's Robinson Crusoe; thus they could no longer represent present-day language in any sense. Frequency data derived from a large representative corpus of a language brings students closer to language as it is used in real life as opposed to textbook language (which often distorts the frequencies of features in a language, see Ljung, 1990).
The information in these dictionaries is presented in a number of formats to allow users to access the data in different ways. So, for example, if you would prefer not to simply drill down through the word frequency list, but would rather focus on verbs, the part of speech index will allow you to focus on just the most frequent verbs. Given that verbs typically account for 20 per cent of all words in a language, this may be a good strategy. Also, a focus on function words may be equally rewarding: 60 per cent of speech in English is composed of a mere 50 function words.
The series also provides information of use to the language teacher. The idea that frequency information may have a role to play in syllabus design is not new (see, for example, Sinclair and Renouf, 1988). However, to date it has been difficult for those teaching languages other than English to use frequency information in syllabus design because of a lack of data.
Frequency information should not be studied to the exclusion of other contextual and situational knowledge about language use, and we may even doubt the validity of frequency information derived from large corpora. It is interesting to note that Alderson (2007) found that corpus frequencies may not match a native speaker's intuition about estimates of word frequency and that a set of estimates of word frequencies collected from language experts varied widely. Thus corpus-derived frequencies are still the best current estimate of a word's importance that a learner will come across. Around the time of the construction of the first machine-readable corpora, Halliday (1971: 344) stated that "a rough indication of frequencies is often just what is needed". Our aim in this series is to provide as accurate as possible estimates of word frequencies.
Paul Rayson and Mark Davies
Lancaster and Provo, 2008
References
Alderson, J.C. (2007) Judging the frequency of English words. Applied Linguistics, 28(3): 383–409.
Gardner, D. (2007) Validating the construct of Word in applied corpus-based vocabulary research: a critical survey. Applied Linguistics, 28: 241–265.
Halliday, M.A.K. (1971) Linguistic functions and literary style. In S. Chatman (ed.) Style: A Symposium. Oxford University Press, pp. 330–365.
Ljung, M. (1990) A Study of TEFL Vocabulary. Almqvist & Wiksell International, Stockholm.
Nation, I.S.P. (1990) Teaching and Learning Vocabulary. Heinle & Heinle, Boston.
Sinclair, J.M. and Renouf, A. (1988) A lexical syllabus for language learning. In R. Carter and M. McCarthy (eds) Vocabulary and Language Teaching. Longman, London, pp. 140–158.
Thorndike, E. (1921) Teacher’s Word Book. Columbia Teachers College, New York.
Thorndike, E. (1932) A Teacher’s Word Book of 20,000 Words. Columbia University Press, New York.
Thorndike, E. and Lorge, I. (1944) The Teacher’s Word Book of 30,000 Words. Columbia University Press, New York.
West, M. (1953) A General Service List of English Words. Longman, London.
0005 了 [了] /le/ (1) aux [aspect marker indicating realisation of a situation] 她从马上摔了下来。She fell
吗? She's a regular beauty, isn't she? 181 | 0.62 | 112
0820 嗯 [嗯] /en, ng/ (1) interj [interjection used for questioning, surprise, disapproval, or agreement] well, eh, hey, m-hm, uh-huh 嗯，他们是天生一对嘛。Well, they seem to make a good couple. 2525 | 0.18 | 449 | S
0015 上 [上] /shàng/ (1) loc up, on, in 孩子的新毛衣上勾了一个洞。The child has picked a hole in his new jumper. 22041 | 0.92 | 20201
0014 人 [人] /rén/ (1) n person, human, man 我尊重他这位作家，也尊重他这个人。I respect him as a writer and as a man. 24724 | 0.86 | 21225
0003 一 [一] /yī/ (1) num one, a, an 市政大厅前有一群人。There is a crowd of people in front of the town hall. 69925 | 0.89 | 62263
3135 呀 [呀] /yā/ ono the sound of a creak 门呀的一声开了。The door opened with a creak. 223 | 0.41 | 90 | S
This is an excerpt from a talk by the former President Nixon. 434 | 0.75 | 326
0004 在 [在] /zài/ (1) prep [indicating location or time, etc.] at, in 她坐在窗旁。She sat at the window. 52774 | 0.94 | 49460
0010 他 [他] /tā/ (1) pron he, him 目前，他正在度假。At present, he is on holiday. 36234 | 0.76 | 27619
0111 们 [們] /men/ (1) suf [plural marker for pronouns and some animate nouns] 朋友们舞会后就分开了。The friends separated after the party. 3199 | 0.92 | 2930
Introduction
The need for a frequency dictionary
A good dictionary is indispensable for language learning. This is particularly true when learning a second or foreign language. There are many kinds of dictionaries, designed for different purposes and having different values for different readers. For instance, a dictionary in multiple volumes that incorporates encyclopaedic or etymological knowledge may not be terribly relevant for language learners unless they are at very advanced levels. Different types of dictionaries can benefit different types of learners. Of those dictionaries that are specifically created for language learners, the focus is usually on providing basic information such as definitions, glosses and word classes illustrated by suitable examples. Such dictionaries typically follow the lexicographic convention of arranging words in alphabetic order so that, while providing an effective and convenient way of looking up a specific word, they do not tell learners which words are more commonly used, that is, which words learners are more likely to encounter at different stages of learning. Neither can a conventional dictionary tell learners which words they are more likely to encounter in different registers such as speech, news, imaginative or informative writing.
It is in these regards that a frequency dictionary as an innovative type of dictionary proves valuable. While it is clearly naive to state simply and boldly that the most frequent words are the most important to learn, frequency ranking is nonetheless “a parameter for sequencing and grading learning materials” because frequency is “a measure of probability of usefulness” and “high-frequency words constitute a core vocabulary that is useful above the incidental choice of text of one teacher or textbook author” (Goethals 2003: 424). As Leech (1997: 16) argues: “Whatever the imperfections of the simple equation ‘most frequent’ = ‘most important to learn’, it is difficult to deny that frequency information derived from large collections of text (corpora) has an important empirical input to language learning materials.”
With that said, one should not assume that there is an irreconcilable conflict between a frequency dictionary and a more conventional dictionary. Rather, they have different focuses and are complementary to each other. In this dictionary, for example, we have provided just one illustrative example for each word included, even though the word can have different meanings. Readers seeking a broader range of illustrative examples may refer to a different type of dictionary. In addition, for a small number of words included in this dictionary where the word can have different pronunciations for its different senses, we have not indicated which pronunciation corresponds to which word sense. Again, a different dictionary can help with that issue. These decisions were motivated largely by considerations such as the levels of intended readership, the size of the dictionary (see section 5 for further discussion) and, crucially, what was readily available elsewhere. We did not want to unhelpfully replicate what other dictionaries could provide. This dictionary should be used with a conventional learner dictionary, not instead of one.
Like other volumes in the Routledge Frequency Dictionary series, A Frequency Dictionary of Mandarin Chinese furnishes a list of core vocabulary for learners of Mandarin Chinese as a second or foreign language, especially learners whose first or second language is English. This dictionary will prove useful for such learners, whether they are instructed in the classroom or are independent learners. It should also prove of benefit to teachers.
How might classroom learners use this dictionary? Classroom learners normally rely on a selected textbook, which is typically organised around a variety of themes (e.g. shopping, eating out). Thematically related words are surely of benefit in vocabulary acquisition, but a textbook rarely if ever tells learners which of these words are more likely to appear in their actual reading or conversation. In fact, it is very likely that some of those words, which they have learnt with painstaking effort, never occur beyond the classroom situations invented specifically for the sake of language learning. In other words, learners will not find a chance to employ those words in real communication contexts, unless they are talking about a specialised topic. As you will see in the callout boxes for themed vocabulary embedded in the main frequency index of common words, some words are infrequent in the language, but can be potentially useful when discussing specific topics. That is why we have decided to include 30 thematically related lists (see section 5). These thematically related lists are one resource that classroom learners should want to draw upon.
What of the independent learner? They tend to have needs somewhat distinct from those of the classroom learner. They may pick up a piece of text, e.g. a work of fiction or a newspaper, and step through it word by word, consulting a dictionary to check on words which are new to them. While such independent learners work on authentic texts, they may often suspect that their learning could be more effective if they knew at the outset that they were learning the most common words in general Mandarin, as these would be the words which they are most likely to encounter in a wide variety of contexts. They would then work towards specialised and infrequent vocabulary. A frequency dictionary should make an ideal companion for such learners.
Finally, how might teachers use a dictionary such as this? Language teachers should find a frequency dictionary a very helpful source of information. On the one hand, the frequency dictionary provides a graded list of vocabulary with authentic examples, which is valuable supplementary material to complement a textbook. On the other hand, the teacher may find it frustrating that some students entering intermediate level are deficient in vocabulary. In such cases, a frequency dictionary will be an advantage as it affords a structured remedy in this regard. Last but not least, the frequency information contained in this dictionary, which is based on a large balanced collection of data (see section 2), will prove a valuable resource in guiding and informing the development of a language teaching curriculum.
The corpus
Given that the frequencies presented in this dictionary are derived from a large collection of Chinese language texts, a so-called corpus, it is clearly important to present the corpus data used in this dictionary. For a dictionary that aims to provide a frequency-based core vocabulary for learners, a well-composed corpus is essential. We think that such a corpus must satisfy four requirements for the intended purpose. First, it must be large enough to yield a basis for reliable quantification. Second, it must achieve a reasonably wide coverage of registers so that learners are exposed to commonly used words in different communication contexts. Third, the language contained in the corpus must be current. Finally, in addition to the quality of data per se, corpus processing must be sufficiently reliable. This is particularly important for a Chinese frequency dictionary because running texts in Chinese must first be segmented into legitimate tokens (a computational process known as segmentation or tokenisation, see below) before they can be annotated with word class information.
The corpus used for this dictionary is composed of written and spoken texts from four broad categories as shown in Table 1, totalling roughly 50 million word tokens (or 73 million Chinese characters). The spoken component contains 3.4 million words, covering face-to-face conversations, telephone calls, cross-talks, movie and play scripts, interviews, storytelling, public lectures, radio broadcasts, and public debates, which were mostly produced in the 1990s and 2000-2006.¹ The news component comprises 16 million words of newswire texts released in 1995 by the Xinhua News Agency and newspaper texts published by the People's Daily in 1998 and 2000, in addition to the news categories in the Lancaster Corpus of Mandarin Chinese (LCMC)² and the UCLA Written Chinese Corpus.³ The fiction component amounts to 15 million words, including all fiction categories in the LCMC and UCLA Chinese corpora in addition to novels and short stories sampled from various periods in the twentieth century, with the majority published in the 1980s-1990s. The non-fiction component is composed of all informative categories in the LCMC and UCLA corpora, together with various non-literary texts of different genres such as official documents, academic prose, applied writing and popular lore, which were sampled from different periods in the second half of the twentieth century, totalling 15 million words.
Table 1 Structure of the corpus
Register      Chinese characters
Spoken        4,679,991
News          26,277,906
Fiction       19,962,277
Non-fiction   22,158,904
Total         73,079,078
While the majority of our corpus data introduced above are monolingual Chinese texts, we have also used a parallel corpus composed of Chinese fictional and non-fictional texts with their English versions, which has allowed us to extract, for each of the selected words included in this dictionary, an illustrative example with English translation.⁴
Once the texts (including transcripts of spoken data) were collected, the next step was to segment the running strings of characters in these texts into word tokens. For alphabetical languages like English, word tokens in a written text are normally separated by white spaces so that the one-to-one correspondence between orthographic and morpho-syntactic word tokens can be considered as a default with a few exceptions: multiwords (e.g. so that and in spite of), mergers (e.g. can't and gonna) and variably spelt compounds (e.g. noticeboard, notice-board, notice board). In Chinese, however, since a written text contains running strings of characters with no delimiting spaces, one has to determine where the words are in the data. More specifically, as it is the computer that is analysing the data, a process must be run which allows the computer to determine where the words are. This process is called word segmentation. Word segmentation requires complex computer processing, which generally involves lexicon matching and the use of a statistical model (cf. McEnery, Xiao and Tono 2006: 35).
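The lexicon-matching half of word segmentation can be illustrated with a minimal sketch of forward maximum matching. This is a much simpler technique than the one used for the dictionary's corpus and has no statistical model behind it. The function name, the three-word toy lexicon and the `max_len` parameter are assumptions made for illustration only; the sentence is one of the sample sentences shown earlier.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation by longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon has it;
        # a single character is always accepted as a fallback token.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would load tens of thousands of words.
toy_lexicon = {"朋友", "舞会", "分开"}
print(forward_max_match("朋友们舞会后就分开了", toy_lexicon))
```

Greedy matching of this kind cannot resolve genuine ambiguities or recognise unknown words, which is why practical segmenters combine a lexicon with a statistical model.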
The segmentation tool we used to process our Chinese corpus is ICTCLAS, an acronym for the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. The core of the system incorporates a lexicon of 80,000 words with part of speech information. The system is based on a multi-layer hidden Markov model and integrates modules for word segmentation, part of speech analysis (so-called part of speech tagging), and unknown word recognition (cf. Zhang, Liu, Zhang and Cheng 2002). The rough segmentation module of the system is based on the n-shortest paths method (Zhang and Liu 2002). The model, based on 2-shortest-paths, achieves a precision rate of 97.58 per cent, with a recall rate as high as 99.94 per cent (ibid.). In addition, the average number of segmentation candidates is reduced by a factor of 64 compared to the full segmentation method. The unknown word recognition module of the system is based on role tagging. The module applies the Viterbi algorithm to determine the sequence of roles (e.g. internal constituents and context) with the greatest probability in a sentence, on the basis of which template matching is carried out. The integrated ICTCLAS system is reported to achieve a precision rate of 97.16 per cent for tagging, with a recall rate of over 90 per cent for unknown words and 98 per cent for Chinese person names (ibid.).
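As a rough illustration of the Viterbi algorithm mentioned above, the sketch below finds the most probable tag sequence under a single-layer hidden Markov model. It is not the ICTCLAS implementation, which is multi-layer and works with roles as well as tags. The tag names, the three-word sentence and every probability are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Invented toy model: 研究 is ambiguous between verb and noun.
states = ["pron", "v", "n"]
start_p = {"pron": 0.6, "v": 0.1, "n": 0.3}
trans_p = {
    "pron": {"pron": 0.1, "v": 0.7, "n": 0.2},
    "v": {"pron": 0.2, "v": 0.1, "n": 0.7},
    "n": {"pron": 0.1, "v": 0.5, "n": 0.4},
}
emit_p = {
    "pron": {"我": 1.0},
    "v": {"研究": 1.0},
    "n": {"研究": 0.3, "语言": 0.7},
}
print(viterbi(["我", "研究", "语言"], states, start_p, trans_p, emit_p))
```

With these toy numbers the verb reading of 研究 wins because a pronoun is far more likely to be followed by a verb than by a noun; a production tagger would work with log probabilities to avoid underflow on long sentences.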
ICTCLAS applies a very fine-grained part of speech annotation scheme, or tagset (see Appendix), to corpus data. A tagset is the list of part of speech distinctions made by the programme, while tagging is the process whereby the machine decides which part of speech applies to each word, and a tag is the individual mnemonic assigned to each word in the corpus by the computer when the tagging has determined which tag each word should have.
For the purpose of a frequency dictionary, however, we think that a less fine-grained tagset than that provided by ICTCLAS is more helpful, as this can have a boosting effect on words of the same form but with minor differences in usage. Hence we decided to merge subcategories and similar part of speech categories. Our decision to combine those subcategories was also motivated by the fact that Chinese does not have a very strong link between word classes and grammatical functions. For example, adjectives can be used directly as adverbials (tagged as ad by ICTCLAS), while a verb can behave syntactically like a noun (tagged as vn) or adverb (tagged as vd). In addition, non-predicate adjectives (tagged as b) and descriptive adjectives (tagged as z) are also merged into the broad category of adjectives. As it is not always possible to differentiate between idioms (tagged as i) and fixed expressions (tagged as l), the two categories are combined into the category for idiomatic and formulaic expressions. These manipulations resulted in a tagset for use in this dictionary, which consists of 20 part of speech tags as shown in Table 2.
When the texts were tokenised and annotated with part of speech information, the corpus was converted from the local character encoding GB2312 into Unicode (UTF-8), with register information and linguistic annotation marked up in the extensible mark-up language (XML) so that the corpus could be used with our Perl (Practical Extraction and Report Language) scripts to build frequency indexes for use in this dictionary. This will be discussed in section 4. But before that, let us have a look at the previous frequency dictionaries and lists of words and characters in Chinese.
[Table 2: Part of speech tags annotated in our corpus]
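The "boosting effect" of merging fine-grained tags can be sketched as follows. The source tags ad, vn, vd, b, z, i and l are those named in the text; the base adjective tag `a` and the merged labels `adj`, `v` and `idiom` are assumptions, as is the reading that ad folds into adjectives and vn and vd into verbs. The tagged tokens are invented.

```python
from collections import Counter

# Assumed mapping from fine-grained tags to merged categories.
MERGE = {
    "a": "adj", "ad": "adj", "b": "adj", "z": "adj",
    "vn": "v", "vd": "v",
    "i": "idiom", "l": "idiom",
}


def merge_tag(tag):
    """Map a fine-grained tag to its merged category; other tags pass through."""
    return MERGE.get(tag, tag)


def merged_counts(tagged_tokens):
    """Count (word, merged tag) pairs over an iterable of (word, tag) pairs."""
    return Counter((word, merge_tag(tag)) for word, tag in tagged_tokens)


# Invented sample: 研究 appears once as a verb and twice as a nominalised verb.
sample = [("研究", "v"), ("研究", "vn"), ("研究", "vn"), ("认真", "a"), ("认真", "ad")]
print(merged_counts(sample))
```

Under the fine-grained tags 研究 would be split into two entries with counts of 1 and 2; after merging it is a single verb entry with a count of 3, which is what lifts such words up the frequency ranking.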
Previous frequency dictionaries of Chinese
Ours is not the first frequency dictionary of Chinese. It is, however, quite distinctive, as can be demonstrated by considering the other frequency dictionaries of Chinese that have been created. Because of the large inventory of characters in Chinese, there has been a long tradition of teaching Chinese characters on the basis of frequency, though research on word frequency on the basis of large collections of text only became possible in the 1990s, with the advent of more powerful computers and specialised computer software for word segmentation. There are at least a dozen frequency lists or dictionaries of Chinese characters and words including, for example:
• Chen's (1928) Yutiwen Yingyong Zi Hui (The Applied Glossary of Modern Chinese): listing 4,261 distinct Chinese characters on the basis of six corpora (children's books, newspapers, women's magazines, after-class work of schoolchildren, classic and modern fiction, and miscellaneous) totalling 554,478 Chinese character tokens;
• Liu's (1973) Frequency Dictionary of Chinese Words: giving statistics such as frequency, dispersion index and usage rate for the 3,059 most frequently used words in Chinese on the basis of a 0.25-million-word corpus covering five registers (fiction, drama, essays, newspapers and periodicals, technical writing);
• Xiandai Hanzi Zonghe Shiyong Pindu Biao (A Comprehensive Frequency Table of Character Usage in Modern Chinese), established on Project Code 748 (1976): listing 4,152 frequently used characters on the basis of a corpus of 21 million characters;
• Beijing Aeronautical University (1985) Xiandai Hanyu Yong Zi Pindu Biao (A Frequency Table of Character Usage in Modern Chinese): listing frequently used characters for ten genres and technical domains on the basis of samples totalling 11.08 million characters;
• Beijing Language and Culture University (1986) Xiandai Hanyu Pinlu Cidian (A Frequency Dictionary of Modern Chinese): listing 16,593 commonly used words extracted from 1,315,752 word tokens (or 1.82 million characters);
• National Language Committee (1988) Xiandai Hanyu Changyong Zi Biao (Commonly Used Characters in Modern Chinese): listing the most commonly used 2,500 characters and 1,000 commonly used characters on the basis of data collected by Beijing Aeronautical University covering the period 1928-1986;
• Hong Kong Polytechnic University (1991-1997) Zhongguo Dalu, Taiwan, Xianggang Hanyu Ciku (A Chinese Word Bank from Mainland China, Taiwan, and Hong Kong): listing 68,011 entries based on a 6-million-character corpus of news texts published during 1990-1992 in the three Chinese speech communities.
With the exception of Liu (1973), all other character and word frequency lists and dictionaries are published in Chinese. All of them are targeted either at native speakers of Mandarin learning their mother tongue (e.g. Chen 1928; National Language Committee 1988), or at language engineers (e.g. the frequency list by Project Code 748) and expert Chinese linguists (e.g. the word bank by Hong Kong Polytechnic University for studying language variation).⁵ And with the exception of Liu (1973), all of the existing frequency dictionaries of Chinese characters and words were published and distributed in China, which makes it difficult for learners of Chinese as a second or foreign language outside China to get access to them.
Liu (1973) was published by Mouton and released worldwide, but it also suffers from a number of drawbacks. Like nearly all existing Chinese frequency dictionaries, it is based exclusively on written Mandarin; the data on which the dictionary is based are quite outdated, with texts published during the period 1910-1960; with a total of 0.25 million word tokens, the corpus is also rather small by today's standards; the word class categories featuring in the book are quite obsolete nowadays; no actual Chinese characters are used in the dictionary, these being replaced by a kind of Romanisation system which is no longer widely used; and most importantly, no English gloss or translation, no illustrative example, and no information related to usage are given, making the dictionary almost useless for today's learners of Chinese as a second or foreign language.
Table 3 HSK graded lists of words and characters in Chinese
HSK level     Words   Characters
Level 1       1033    800
Level 2       2019    803
Level 3       2205    591
Level 4       3583    671
Levels 1-3    5257    2194
Levels 1-4    8840    2865
Last but not least, we should not fail to mention the Syllabus of Graded Words and Characters for Chinese Proficiency compiled by the Chinese government's Hanyu Shuiping Kaoshi (the Chinese Proficiency Test, HSK) Committee, which was published in 1992 and revised in 2001. The HSK lexical syllabus lists the words and characters required of learners of Mandarin Chinese as a second or foreign language to pass the Chinese proficiency test HSK, as indicated in Table 3. While the lexical syllabus is undoubtedly instructive for learners of Mandarin (we have made special effort to include as many words as possible from the syllabus, especially Level 1 and 2 items; see section 4 for further discussion), it serves a different purpose from a frequency list. The words in the syllabus are arranged conventionally in alphabetical order for each level rather than in the order of frequency, and no actual frequency or frequency ranking is given.
The compilation of the lexical syllabus, which was corpus-based, started in 1988 and the latest texts covered were produced in 1991. Unsurprisingly, most "new words" included in the syllabus are from the early 1980s, while some words that were common in the 1970s-1980s, e.g. 少先队 "young pioneer", will not be common enough to merit a place on the list nowadays. On the other hand, many well-established vocabulary items which are commonly used today as a result of technological and social development are not covered in the syllabus.
A comparison of the HSK graded vocabulary and the frequency index in our dictionary appears to suggest that the corpus on which the HSK vocabulary is based relies too heavily on the Beijing dialect, as evidenced by dialectal usage like 半拉 "half" (Level 2) and words ending with the retroflective suffix 儿, including Level 1 words such as 小孩儿 "child" and 面条儿 "noodle, pasta", Level 2 words such as 聊天儿 "chat" and 墨水儿 "ink", and Level 3 words such as 拐弯儿 "turn a corner, make a turn" and 药水儿 "liquid medicine". Words like these are normally listed in a dictionary without the retroflective 儿, which is tagged in our corpus as a suffix.
Selection of words and characters
According to the HSK lexical syllabus, learners of Chinese as a foreign language who have learnt about 5,000 words will be able to express their ideas on general issues in Chinese. As can be seen in Table 3, this vocabulary is approximately equal to the total of HSK Levels 1-3 words. The number of words we have decided to include in this dictionary is roughly comparable. While it is certainly true that the larger your vocabulary the better, it becomes increasingly difficult to learn new words as your vocabulary grows. This is because, according to Zipf's law, the frequency of a word is inversely proportional to its rank in the frequency table. As such, there is a 9.27 per cent increase in coverage from top 1,000 to top 2,000 words, whereas the increase in coverage drops to 0.94 per cent from top 8,000 to top 9,000 words (see Table 4). In addition to the reference to the HSK syllabus, our decision to include 5,000 words was also empirically motivated by the sharp drop in the coverage increase (from 3.07 per cent to 1.77 per cent) from top 5,000 to top 6,000 words.

As can be seen in Figure 1, which shows the increase and drop in coverage resulting from each additional block of 200 characters, Zipf's law also applies to characters. Coverage grows very slowly after the top 1,200 characters. The top 2,000 characters cover nearly 98 per cent of our whole corpus, with 4,839 characters accounting for the remaining 2 per cent of coverage.

Table 4 Coverage of top N words
Figure 1 Coverage of top N characters
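The diminishing returns described above can be illustrated with a short Python sketch. It is not the procedure used for this dictionary: the frequency list is a hypothetical one that follows Zipf's law exactly, and the function name coverage_by_block is an assumption made for the example. For each additional block of items, the sketch reports the cumulative coverage and the gain contributed by that block.

```python
def coverage_by_block(frequencies, block_size):
    """Return (top_n, cumulative coverage %, gain %) for each block of ranked items."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    rows = []
    covered = 0
    for start in range(0, len(ranked), block_size):
        block_sum = sum(ranked[start:start + block_size])
        covered += block_sum
        top_n = min(start + block_size, len(ranked))
        rows.append((top_n, 100 * covered / total, 100 * block_sum / total))
    return rows


# Hypothetical Zipfian list: the item at rank r occurs about 1,000,000 / r times.
toy_frequencies = [1_000_000 // rank for rank in range(1, 10_001)]

for top_n, cumulative, gain in coverage_by_block(toy_frequencies, 1000):
    print(f"top {top_n:>6}: coverage {cumulative:6.2f}%  gain {gain:5.2f}%")
```

With real data, the gain column is where a cut-off such as the one between the top 5,000 and top 6,000 words would show up as a marked drop.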
We would like to point out that the above distribution statistics are based on a valid lexicon that we created from the corpus. By "valid lexicon", we mean the frequency lists of words and characters that exclude items that are "uninteresting" from the perspective of vocabulary acquisition, e.g. symbols and punctuation marks, Arabic numerals (written in either full- or half-width form), and non-Chinese character strings. We have also excluded abbreviations, numeral characters indicating years, person names, place names, organisation names, and other proper nouns such as names of countries, nationalities and languages, as well as brand names. Table 5 indicates the size of our valid lexicon.

Table 5 The valid lexicon
Register       Word tokens   Chinese characters
Spoken           2,692,315            3,824,579
News            12,147,572           20,185,322
Fiction         11,973,365           16,424,649
Non-fiction     11,900,160           17,954,729
Total           38,713,412           58,389,279

While a frequency dictionary could be arranged simply in the order of raw frequencies, i.e. actual occurrences of words and characters, we have used a more scientific way to decide which words and characters to include, which takes account of their frequencies as well as their distribution in different registers. Words and characters which are frequently used in more registers are clearly more useful than those that are frequent in fewer registers. In this dictionary, we have adopted the same hierarchy composed of three coefficients as established in Juilland and Chang-Rodriguez (1964), namely frequency, dispersion index and usage rate, which are explained as follows.
There are two types of frequency data. Raw frequency refers to the actual occurrence of a word or character in a corpus, while normalised frequency means the frequency that has been adjusted to a common base, in this case occurrences per million tokens, so that the four registers covered in our corpus can be compared even if they are of different sizes. We have used normalised frequencies in different registers to compute dispersion index and usage rate, while the overall normalised frequency is also given in the entry for a headword so that the reader can easily compare the frequencies of different words on a common basis of per million words. Dispersion coefficient (D) is computed according to Juilland and Chang-Rodriguez's formula:
$$D = 1 - \frac{\sqrt{n \sum x^{2} - T^{2}}}{2T}$$
In this formula, n stands for the number of word types and T for the number of word tokens. This formula reduces dispersion to a coefficient ranging from 0 to 1, regardless of frequency. Words with a higher dispersion coefficient are more evenly distributed in different registers. The usage rate (U) takes account of both frequency and dispersion, which can be taken as a dispersion (D) percentage of frequency (F) or vice versa according to the following formula:

$$U = \frac{F \times D}{100}$$

Here D is expressed as a percentage. On the 0-1 scale used for the dispersion index in this dictionary, this means that when D = 1 the usage rate equals frequency, and when D = 0.5 the usage rate is half of frequency. Hence, a more frequent word with a lower dispersion index can have a lower usage rate than a less frequent word that is more evenly distributed. For example, as the word 说 "say" is distributed fairly evenly in the four registers (9,383 instances per million words in spoken, 9,998 in fiction, 4,658 in non-fiction, 3,753 in news), it has a large dispersion index (0.80). With an overall frequency of 27,792, its usage rate is 22,252. In contrast, the interjection 哎 has an overall frequency of 1,821 instances in our corpus, but it is distributed unevenly in the four registers (1,697 in spoken, 119 in fiction, 5 in non-fiction and 0 in news). Its dispersion index is much smaller (0.21), and its usage rate is 383 in spite of its high overall frequency.
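The two worked examples can be checked with a minimal Python sketch. It rests on assumptions that are not stated in the text: n is read as the number of registers (four), x as the normalised frequency in each register, and T (and F) as the sum of those frequencies, with D kept on the 0-1 scale. This reading is chosen only because it reproduces the figures quoted for 说 and 哎; the function names are illustrative.

```python
import math


def dispersion(per_register):
    """Dispersion coefficient from per-register normalised frequencies.

    Assumption: n = number of registers, T = sum of the per-register values.
    """
    n = len(per_register)
    total = sum(per_register)
    if total == 0:
        return 0.0
    sum_of_squares = sum(x * x for x in per_register)
    return 1 - math.sqrt(n * sum_of_squares - total ** 2) / (2 * total)


def usage_rate(per_register):
    """Usage rate as frequency times dispersion (D on the 0-1 scale)."""
    return sum(per_register) * dispersion(per_register)


shuo = [9383, 9998, 4658, 3753]   # 说: spoken, fiction, non-fiction, news
ai = [1697, 119, 5, 0]            # 哎: spoken, fiction, non-fiction, news

for label, freqs in (("说", shuo), ("哎", ai)):
    print(label, round(dispersion(freqs), 2), int(usage_rate(freqs)))
```

Under these assumptions the sketch gives a dispersion of about 0.80 and 0.21 respectively, in line with the values quoted above.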
We wrote Perl scripts that automatically computed the overall normalised frequency, normalised frequency in each of the four registers (i.e. spoken, news, fiction and non-fiction), the dispersion index, and usage rate for each word and character in our valid lexicon. We have used a combination of these statistics, while also taking account of basic vocabulary in the HSK syllabus as well as our intuitive knowledge of the Chinese language, to decide which words and characters to include in this dictionary.
• All words with a dispersion index below 0.25 are excluded unless they have a usage rate above 100 or a normalised frequency above 1,000 in any of the four registers.
• All items with a usage rate below 45 are excluded unless they are on the Levels 1 and 2 lists in the HSK syllabus.
• All words with an overall normalised frequency below 55 are excluded unless they are on the Levels 1 and 2 lists in the HSK syllabus.
• All words with a normalised frequency below three per million words in the register of fiction are excluded unless they have a usage rate above 100 or are on the Levels 1 and 2 lists in the HSK syllabus.
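The four thresholds above can be expressed as a simple filter. The Python sketch below is only an illustration of the stated cut-offs: the WordStats record and the function passes_thresholds are hypothetical names, and the sketch cannot reflect the intuitive judgement that also informed the final selection.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WordStats:
    """Hypothetical record holding the statistics for one word."""
    overall: float                    # overall normalised frequency (per million)
    per_register: Dict[str, float]    # normalised frequency in each register
    dispersion: float                 # dispersion index on the 0-1 scale
    usage: float                      # usage rate
    hsk_level: Optional[int] = None   # None if the word is not in the HSK syllabus


def passes_thresholds(word: WordStats) -> bool:
    """Apply the four exclusion rules; True means the word is not excluded."""
    basic = word.hsk_level in (1, 2)
    frequent_somewhere = max(word.per_register.values()) > 1000
    if word.dispersion < 0.25 and not (word.usage > 100 or frequent_somewhere):
        return False
    if word.usage < 45 and not basic:
        return False
    if word.overall < 55 and not basic:
        return False
    if word.per_register["fiction"] < 3 and not (word.usage > 100 or basic):
        return False
    return True


# Figures quoted earlier for the interjection 哎.
ai = WordStats(
    overall=1821,
    per_register={"spoken": 1697, "fiction": 119, "non-fiction": 5, "news": 0},
    dispersion=0.21,
    usage=383,
)
print(passes_thresholds(ai))
```

On the figures quoted for 哎, the low dispersion index alone does not trigger exclusion, because the usage rate is above 100.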
These operations helped to establish a core list of the top 5,004 words from a total of 30,922 words from our valid lexicon. Our list covers 95.61 per cent of Level 1 words, 80.43 per cent of Level 2 words, 44.22 per cent of Level 3 words, and 18.12 per cent of Level 4 words in the HSK syllabus. In addition to many advancement-related new words like those mentioned earlier, our word list includes many commonly used compound words which are missing in the HSK syllabus, for example: 看到 "see", 很多 "eat a meal, eat", 见到 "see", 不再 "no longer", 很快 "fast; soon", and 听到 "hear" among many others. Such new additions are obviously more helpful to learners than some dialectal or outdated items in the HSK syllabus such as 半拉 "half", 反动 "reactionary" and 少先队 "young pioneer".
We have followed a similar procedure to establish a core list of commonly used Chinese characters. The following cutoff points are used:
• an overall normalised frequency of 70 instances
per million tokens;
• a usage of 50 instances per million tokens;
• a dispersion index of 0.35;
• a minimal frequency of ten in each of the four
registers
These operations produced a list of the 2,015 most commonly used characters. In order to include as many as possible of the basic characters required in the HSK syllabus, the final criterion above was not strictly applied to Level 1 and 2 characters in the syllabus. As a result, 14 additional Level 1 characters and 83 additional Level 2 characters are included in our character list, thus pushing the total number of characters included in this dictionary to 2,112, which covers 99.3 per cent of Level 1 characters and 96.64 per cent of Level 2 characters in the HSK syllabus. Our character index also covers 81.56 per cent of the 2,500 common characters published in China by the Ministry of Education for native speakers of Mandarin.
Organisation of the dictionary
Following this introductory chapter are a number of indexes of the 5,004 most commonly used words, which are arranged in frequency rank order, in alphabetical order, and in frequency rank order by word class, as well as a list of the 2,112 most commonly used Chinese characters mapping each character onto the top 5,004 words. The frequency index of the book also features a series of embedded callout boxes that show thematically related vocabulary. The remainder of this chapter will give more details of each of these indexes.
Frequency index
This section lists the 5,004 most commonly used words in the descending order of frequency rank. The following information is given for each of the listed headwords:
• frequency rank (in descending order of usage rate);
• headword in Simplified Chinese;
• headword in Traditional Chinese;
• Pinyin gloss of the headword;
• HSK Level (if the word is listed in the HSK syllabus);
• an illustrative example in Simplified Chinese (authentic example cited from our Chinese-English parallel corpus);
• English translation of the example (from our Chinese-English parallel corpus);
• normalised frequency per million words;
• dispersion index;
• usage rate;
• register code (i.e. S or W indicating whether the word is exceptionally common in speech or writing).
A typical entry looks like the following:

Frequency rank Headword in Simplified Chinese [Headword in Traditional Chinese] /Pinyin/ (Optional HSK Level) Part of speech English gloss
Illustrative example in Simplified Chinese and English translation of the example
Normalised frequency | Dispersion index | Usage rate | Optional register code

Here we show a concrete example of an entry:

0263 然后 [然後] /ránhòu/ (1) adv afterwards, then
你走到第二个十字路口,然后向左拐。 You go ahead to the second crossing and then turn left.
1887 | 0.66 | 1241 | S

In this example, the headword with the frequency rank of 0263 (i.e. the 263rd most commonly used word in our corpus) is 然后, which is written as 然後 in Traditional Chinese. The Pinyin gloss of the word is ránhòu. It is listed as a Level 1 word in the HSK graded vocabulary. This is an adverb, meaning "afterwards, then", as exemplified in 你走到第二个十字路口,然后向左拐。 "You go ahead to the second crossing and then turn left." This headword has an overall normalised frequency of 1,887 instances per million tokens and a dispersion index of 0.66 in our corpus, and thus a usage rate of 1,241 instances per million tokens. It is exceptionally common in spoken Mandarin.

At this point, we would like to remind the reader that when a headword has more than one sense, these senses are separated by a comma or a semi-colon, with the former for similar senses and the latter for different sense groups. However, when an orthographic word has different senses for different word classes, they are listed separately. For example, the orthographic word 会 can function as a verb (0035) meaning "can, know how to do; meet; be likely to, be sure to", or as a noun (0864) meaning "meeting, conference; moment". The two homonymous words are kept separate. Nevertheless, as our corpus is only tagged with part of speech information but not annotated with word senses, it is impossible to find out the frequencies of different word senses of a homonymous or polysemous word if these senses belong to the same part of speech. Consequently, similar and different word senses of the same word class are simply grouped together with the appropriate punctuation mark indicated above. It is also important to note that different senses of a word may have different Pinyin glosses and thus be pronounced differently. In such cases we simply insert commas to separate different Pinyin glosses without mapping Pinyin glosses to word senses,⁷ and would like to advise the reader to consult a traditional dictionary to ascertain how a word is pronounced for a particular word sense.

Of the information given for each headword, HSK Level and register code are optional. A label for the HSK Level (i.e. 1-3) is only shown if the word is listed in the HSK graded vocabulary, while a register code (S or W) is available only if the word is exceptionally common in speech or writing. Please note that some words may have more than one meaning that belong to different HSK Levels, which are separated by a comma. For register code, two criteria were employed to determine whether a word is exceptionally common in speech or writing. If a chi-square test indicates that the difference in frequencies of the word in speech (i.e. the spoken register) and writing (i.e. fiction, non-fiction and news) is significant at the probability level p < 0.0000001 while at the same time the S/W ratio (or the W/S ratio) is greater than 3, the word carries the register code S (or W). Of the 5,004 most common words covered in this dictionary, 103 words are exceptionally common in speech while 203 are exceptionally common in writing.

This frequency index constitutes the core of the book, which gives you all the essential information about each of the listed words. In addition, 30 callout boxes are embedded in this main frequency index, which feature thematically related vocabulary. They are organised along themes closely related to people's lives, e.g. fruits, drinks and beverages, food (flavours, main food, meat, vegetables, food preparation, seasoning), clothing, weather and equipment, city facilities and stores, travel, directions and locations, cities, house and room, home electronics, computers and the Internet, school life and subjects, professions, sports, animals, and human body (physical appearance, body parts, parts on the head, senses). In addition,
we have included a number of lists that help readers understand the Chinese language and culture, including the number system, time expressions, colours, Chinese festivals, Chinese zodiac signs, kinship and family relations, English loanwords in Mandarin, special vocabulary in language learning (terms for sentence analysis and punctuation marks), and commonly used words in various registers covered in our corpus (spoken, fiction, non-fiction and news). For most of the lists, frequency ranks are included, with frequent items on top of the lists unless stated otherwise in our comments. For example, the four lists of common words across registers are arranged by statistical salience. There are also a number of lists where no frequency ranks are given, which happens when a list shows an almost closed set of vocabulary items, or when few of the items on a list are covered in our valid lexicon (see section 4). When a themed list includes both items with a frequency rank and those without one, the list is arranged with most frequent words on the top, and then in the alphabetical order of words without a frequency rank. These thematic vocabulary lists are an important complement to frequency indexes in this dictionary because, as you will see, some words are important when you talk about a particular topic, yet they would only be included in a frequency list that covers the top 20,000 common words. In other words, those words are infrequent in the language as a whole.
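To make the two register-code criteria given earlier in this section concrete (a chi-square test at p < 0.0000001 together with an S/W or W/S ratio above 3), here is a minimal Python sketch. Several details are assumptions, since the text does not spell them out: the test is taken to be a 2 x 2 chi-square with one degree of freedom on raw counts, the ratio is taken between per-million frequencies, and the function names and example counts are hypothetical. The sub-corpus sizes come from Table 5.

```python
import math

SPOKEN_TOKENS = 2_692_315
WRITTEN_TOKENS = 12_147_572 + 11_973_365 + 11_900_160  # news + fiction + non-fiction


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def p_value_one_df(chi2):
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def register_code(spoken_hits, written_hits, alpha=1e-7, min_ratio=3.0):
    """Return "S", "W" or None for a word, given its raw counts in speech and writing."""
    chi2 = chi_square_2x2(
        spoken_hits, SPOKEN_TOKENS - spoken_hits,
        written_hits, WRITTEN_TOKENS - written_hits,
    )
    if p_value_one_df(chi2) >= alpha:
        return None
    spoken_rate = spoken_hits / SPOKEN_TOKENS * 1_000_000
    written_rate = written_hits / WRITTEN_TOKENS * 1_000_000
    if spoken_rate > min_ratio * written_rate:
        return "S"
    if written_rate > min_ratio * spoken_rate:
        return "W"
    return None


# Hypothetical raw counts, for illustration only.
print(register_code(4000, 1000))
print(register_code(10, 5000))
print(register_code(269, 3602))
```

Requiring both conditions means that a difference must be large as well as statistically reliable before a word is labelled.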
Alphabetical index
This section lists the 5,004 most common words in alphabetical order of the Pinyin glosses of headwords. A typical entry for the alphabetical index looks like:
Headword in Simplified Chinese /Pinyin/ Part of speech code English gloss Frequency rank
Here is a concrete example:
已经 /yǐjīng/ adv already 0101
In addition to providing the reader with a quick view of an entry, this alphabetical index also helps the reader to locate the entry in the main frequency index quickly and easily.
Part of speech index
This section gives a frequency index which shows, for each part of speech category, commonly used members in that group, covering all of the top 5,004 words in the main frequency and alphabetical indexes. For each entry in this part of speech index, the following information is included: frequency rank, headword in Simplified Chinese, Pinyin gloss, part of speech code, and English gloss. This index helps readers to build up vocabulary while studying Chinese grammar. It also allows them to refer back quickly to the related items in the main frequency index.
Character frequency index
This frequency index lists the 2,112 most commonly used Chinese characters. We decided not to include part of speech information or English gloss, nor to give illustrative examples in this chapter. There are a number of reasons for this decision. First, as many characters in Chinese are meaningless unless they combine to form a word, it is not always possible to give an English gloss; second, as many of the common monosyllabic words have already been included in the three earlier indexes, it would be redundant to repeat their details here; third, as a focus on characters is different from the notion of words and parts of speech, English glosses and examples for headwords would be less instructive; finally, since the meaning of a Chinese word is not necessarily the aggregation of its constituent characters, learners are encouraged to build a vocabulary of "words" rather than of characters. For each of the 2,112 commonly used characters included in this index, the following information is given:
Frequency rank Simplified Chinese [Traditional Chinese] /Pinyin/ (Optional HSK Level) List of headwords in word frequency index containing the character and word frequency ranks
An example entry is as follows:
0017 们 [們] /men/ (1) 我们 0026 他们 0041 们 0111 你们 0179 人们 0237 它们 0535 咱们 0540 她们 0567
Each Chinese character in this index is linked to the words in the main frequency index. The frequency ranks of the linked words are given to enable readers to make cross-references easily. If no headword is included in the word frequency index (i.e. the headword is not in the top 5,004 list) for a certain character included in this index, only the character in Simplified Chinese, its Traditional Chinese version, and its Pinyin gloss (and the HSK Level if available) are given. The index of the top 2,112 characters is of help to readers when they decide which characters to learn first; it can also enable learners to switch smoothly between Pinyin and characters, and between Simplified Chinese and Traditional Chinese. In addition, the commonly used words containing the same characters which are mapped from the main frequency index are particularly useful in vocabulary building.
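The mapping from characters to headwords can be pictured as an inverted index. The Python sketch below is an assumption about how such a cross-reference could be built, not a description of the scripts used for this dictionary; the function name build_character_index is hypothetical, and the ranks are those shown in the example entry for 们.

```python
def build_character_index(word_ranks):
    """Map each character to the (word, rank) pairs that contain it, by ascending rank."""
    index = {}
    for word, rank in sorted(word_ranks.items(), key=lambda item: item[1]):
        # dict.fromkeys removes repeated characters while keeping their order,
        # so a reduplicated word is listed only once under its character.
        for character in dict.fromkeys(word):
            index.setdefault(character, []).append((word, rank))
    return index


word_ranks = {"我们": 26, "他们": 41, "们": 111, "你们": 179, "人们": 237}
character_index = build_character_index(word_ranks)

for word, rank in character_index["们"]:
    print(f"{word} {rank:04d}")
```

A character that occurs in no listed headword simply has no entry in the mapping, which corresponds to the case where only the character itself is given.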
One must remember, however, that a dictionary is not simply a repository of information of use to learners of a language - it is also an embodiment of the language to some degree. As the language encodes in part a worldview and history, the dictionary also stands as a cultural artefact. The background of the language represented here thus deserves some consideration.
A brief introduction to Mandarin Chinese
Chinese belongs to the Sino-Tibetan language family. It is spoken by a total of 1.3 billion speakers. Of these, the majority are native speakers of Mandarin (i.e. Standard Chinese based on the Beijing dialect) as opposed to another variety of Chinese such as Cantonese. Mandarin Chinese has a total of 1,052 million speakers, more than twice as many as speak English, the language with the second highest total of speakers, at 508 million (see Ostler 2005). Mandarin Chinese is the official language of Mainland China and Taiwan; it is also one of the official languages of Singapore and the United Nations.

There are currently more than 30 million people in the world who are learning Mandarin Chinese as a foreign language.⁸ The popularity of Chinese as a second language is growing. For example, in the United States, the number of Chinese learners is growing faster than the number of learners of any other foreign language. In Britain, Mandarin is studied by more children than German and Russian (only French and Spanish are presently more popular); and Mandarin is expected to overtake Spanish in three years if the rate of growth continues.⁹
Probably the most striking difference between Chinese and most other languages is purely visual - its written form. English, and many other languages, employ an alphabetical system. Chinese uses a logographic system, i.e. roughly speaking, the symbols of English encode sounds, whereas those in Chinese either singly or in combination encode words, with each character being a syllable. As a result the Chinese writing system is relatively complex - whereas English has only 26 alphabetical characters (i.e. letters) that can be arranged in different combinations to form tens of thousands of different words, Chinese has tens of thousands of individual characters that represent words. What makes it even more difficult for learners to build up their Chinese vocabulary is that, while some Chinese characters represent single words, it is more common for characters to function in combination to form words, many of which have a different meaning than the simple aggregation of the meanings of constituent characters (see below for further discussion). To make things yet more complex, since Chinese does not use white spaces to delimit words in writing, learners have to decide for themselves which characters in the running text form a word when they read, though word boundaries can be inferred in spoken Chinese on the basis of spoken features such as pauses and repetitions. Given the huge number of Chinese characters, and of words, it is not only quite impossible, but totally unnecessary as well, for learners (or average native Chinese speakers) to know tens of thousands of Chinese characters and words (see section 4). This explains why a frequency dictionary of Chinese, which provides core lists of characters and words in this language, can be of particular advantage as a guide to vocabulary teaching and learning.
As noted earlier, some Chinese characters can serve as words; they can also combine with other characters to form new words. In terms of word types, the overwhelming majority of Chinese words are disyllabic, as illustrated in Figure 2, which shows the proportions of words at different levels as required in the HSK syllabus. As can be seen, disyllabic words account for 72 per cent on average while the proportion of monosyllabic words is roughly 22 per cent. Words composed of three characters or more are relatively infrequent. It is also interesting to note that, as the HSK Level increases, disyllabic words increase in number while the proportion of monosyllabic words drops. This is probably because many high-frequency function words, which are more likely to be monosyllabic words, are typically required at HSK Level 1. For the same reason, monosyllabic words are expected to make up a large proportion of Chinese texts in terms of word tokens.

Figure 2 Words of varying lengths in the HSK syllabus
This expectation is in fact borne out in our corpus data. The "valid lexicon" (see section 4) which furnishes a quantitative basis for this dictionary comprises 38,713,412 word tokens (running words in the text) in 84,883 word types (different words). Of these, there are 6,413 monosyllabic words (7.56 per cent of the total), yet they account for 54.08 per cent of total tokens. In contrast, while 46,670 disyllabic words take up the largest proportion in terms of word types (54.98 per cent), they account for 42.33 per cent of total word tokens. Although three-character (22.35 per cent) and four-character words (13.24 per cent) are also very frequent as word types, they do not contribute much in terms of word tokens, as shown in Figure 3.
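The contrast between type share and token share can be reproduced on any frequency list. The following Python sketch uses a small hypothetical list (the counts are invented for illustration and are not taken from our corpus) and a hypothetical function name; word length is measured in characters.

```python
from collections import Counter


def shares_by_length(frequencies):
    """Return {length: (type share %, token share %)} for a word -> count mapping."""
    type_counts = Counter()
    token_counts = Counter()
    for word, count in frequencies.items():
        type_counts[len(word)] += 1
        token_counts[len(word)] += count
    total_types = sum(type_counts.values())
    total_tokens = sum(token_counts.values())
    return {
        length: (100 * type_counts[length] / total_types,
                 100 * token_counts[length] / total_tokens)
        for length in sorted(type_counts)
    }


# Hypothetical counts for illustration only.
toy_counts = {"的": 500, "是": 300, "我们": 80, "已经": 60, "然后": 40, "也就是说": 20}

for length, (type_share, token_share) in shares_by_length(toy_counts).items():
    print(f"{length}-character words: {type_share:.1f}% of types, {token_share:.1f}% of tokens")
```

Even in this toy list, a small number of one-character words supplies most of the running text, while two-character words dominate the list of distinct words.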
A character corresponds to a syllable in spoken Chinese (cf. Li and Thompson 1980: 13). While classical Chinese can be classified as a monosyllabic language, this is no longer true of modern Chinese. As we have noted, monosyllabic words only account for 22 per cent of the graded vocabulary in the HSK syllabus, while the proportion of monosyllabic words in Mandarin as a whole is much lower (see Figure 3). Some characters can be used directly as words, e.g. 有 "have", 来 "come", and 新 "new"; some characters can serve directly as words or as parts of a word, the meaning of which may or may not be related to the meanings of individual constituent characters (e.g. 我 "I, me", 们 "plural suffix" and 我们 "we, us" versus 东 "east", 西 "west" and 东西 "stuff"); some characters cannot stand alone as words (e.g. 蝴蝶 "butterfly" and 葡萄 "grape").

Words in Chinese can be simplex or compound. A simplex word has one morpheme, which can be monosyllabic (e.g. 天 "sky", 去 "go", 他 "he, him")

Figure 3 Words of varying lengths in our corpus
There are different types of compound words in Chinese according to their internal structures, namely, coordinate (e.g. 寒冷 "cold"), endocentric (e.g. 冰箱 "ice-box, refrigerator"), verb-complement (e.g. 提高 "raise, improve"), verb-object (e.g. 洗澡 "take a bath or shower"), subject-predicate (e.g. 地震 "earthquake"), affixed (e.g. 刀子 "knife"), and reduplication (e.g. 姐姐 "elder sister").
While the discussion above appears to suggest that "wordhood" is easy to define in Chinese, it is nevertheless not always easy or even possible to make a distinction between morphemes and words on the one hand and between words and phrases on the other hand (cf. Wu 2003: 3). In fact, a whole range of criteria have been proposed to define wordhood of various types, e.g. orthographical word, morphological word, lexical word, syntactic word, grammatical word, semantic word, sociological word, psychological word, phonological word, prosodic word (see Di Sciullo and Williams 1987; Dai 1998; Duanmu 1998; Packard 2000; Feng 2001a). On the basis of a review of such criteria, Dixon and Aikhenvald (2002) propose to maintain a distinction between phonological and grammatical words, which may or may not coincide. While the authors concede that the criteria that they use to define a phonological word do not apply in every language, they offer three "universal criteria" that define a grammatical word: "A grammatical word consists of a number of grammatical elements which (a) always occur together, rather than scattered through the clause (the criterion of cohesiveness); (b) occur in a fixed order; (c) have a conventionalised coherence and meaning" (ibid.: 19).
Unfortunately, the so-called "split words" in Chinese satisfy none of these criteria. The morphemes that make up a split word can not only scatter through the clause instead of occurring together (e.g. 睡觉 "sleep": 睡了一天觉 "slept for a whole day"), they can even occur in a reversed rather than fixed order (e.g. 上学 "go to school": 学,我爱上就上,不爱上就不上,谁管得着。 "School, I'll go if I like and won't if I don't. Who has the right to interfere?"). Dixon and Aikhenvald (ibid.: 20) comment that criterion (c) means that as a word has its own coherence and meaning, speakers of the language "may talk about a word (but are unlikely to talk about a morpheme)." This criterion does not apply in Chinese either, not only because morphemes in Chinese can be coherent and meaningful (a monosyllabic word consists of one morpheme) but also because even bound morphemes in split words such as 请客 "invite/treat guest(s)" can be "talked about", e.g. 请了 "invited/treated three tablefuls of guests". According to Dixon and Aikhenvald (ibid.: 6), "the (grammatical) word forms the interface between morphology and syntax. Morphology deals with the composition of words while syntax deals with the combination of words". In Chinese, however, word-internal structures are similar to syntactic structures (Dai 1998; Wu 2003: 3), as can be seen in the types of Chinese compound words discussed above.
As far as split words are concerned, the difficulty lies in the fuzzy distinction between words and phrases. Many competing criteria have been proposed to differentiate (compound) words from phrases, e.g. conjunction reduction, freedom of parts, semantic composition, syllable count, insertion, exocentric structure, adverbial modification, XP substitution, productivity, and intuition (see Duanmu 1998 for a review). However, while each of these tests "may work in certain areas for certain cases, there is no overall generalisation and constraint on what is a compound and what must be a phrase" (Feng 2001b). By grammatical criteria such as splitability and insertion, disyllabic split words such as 吃饭 "eat", 睡觉 "sleep", 走路 "walk", 跑步 "jog", 关心 "care for", and 担心 "worry about" are phrases, but they are generally accepted as words by native speakers, whereas those that can be judged as words by grammatical criteria (e.g. 多弹头分导重返大气层运载工具 "multiple independently targeted reentry vehicle, MIRV") may not be accepted as words. This is not only because "the morphological system of Chinese is strongly sensitive to prosodic foot" (Feng 1997: 135), but also because the "word sense" of a native speaker of Chinese is based on prosody (cf. Feng 2001a). Duanmu (1998) also observes that "there is a rich body of phonological evidence, especially metrical and tonal evidence, for the distinction between words and phrases in both Mandarin and other Chinese dialects". This view of wordhood is in line with Matthews's (1991: 209) statement that "the word tends to be a unit of phonology as well as grammar".
Because of the multidimensional properties of wordhood in Chinese (cf. Feng 2001a), we decided to take a less rigid approach to what counts as a word. As noted in section 2, some of the "words" included in this dictionary can be more appropriately called formulaic expressions, e.g. 不得不 "have to", 也就是说 "that is to say", 与此同时 "meanwhile", 忍不住 "cannot help but", and 有意思 "interesting". We think that such commonly occurring formulaic expressions acting as larger "building blocks" of language are equally useful for learners, if not more so.
There are currently two sets of characters for the Chinese writing system: Simplified Chinese and Traditional Chinese. The former is officially used in Mainland China, Singapore and Malaysia while the latter is officially used in Taiwan, Hong Kong and Macau. Overseas Chinese-speaking communities generally opt for Traditional Chinese characters, but Simplified Chinese characters are gradually becoming popular. In this dictionary, we have given both Simplified and Traditional Chinese versions for each entry of word and character, but illustrative examples are shown only in Simplified Chinese as they appear in our corpus.

In addition to Chinese characters for the writing system, a commonly used alphabetical system known as Pinyin has been employed to Romanise the Chinese script. Pinyin uses the Latin alphabet to represent sounds in Mandarin. It is not only beneficial in helping native children and foreign learners to learn spoken Chinese before they start to learn Chinese characters, but it is also a popular method of inputting Chinese characters into a computer. There are four tones in the Pinyin system, with each syllable of every word characterised by one of them, except for a few syllables which are considered toneless. The tones, which are marked on one of the vowels (a, e, i, o, u) in a syllable, are first or "high" (ā, ē, ī, ō, ū), second or "rising" (á, é, í, ó, ú), third or "falling-rising" (ǎ, ě, ǐ, ǒ, ǔ), and fourth or "falling" (à, è, ì, ò, ù). In this dictionary, toned Pinyin glosses are given for all entries of words and characters. It is important to note, however, that the same words can have different Pinyin glosses for different word senses (see section 5). Also, the tones for the same characters in the word index and the character index may be different because of "tone sandhi", i.e. the change of tone that occurs when different tones come together in a word.
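For readers who meet Pinyin written with tone numbers (for instance when typing on a computer), the following Python sketch converts a numbered syllable into the tone-marked form used in this dictionary. The placement rule it applies (mark a or e if present, otherwise the o of ou, otherwise the last vowel) is the usual convention and is an assumption here, as the text above does not state it; the names TONE_MARKS and mark_syllable are hypothetical.

```python
# Tone-marked forms of each vowel, in the order first, second, third, fourth tone.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable):
    """Convert a numbered syllable such as "ran2" into its tone-marked form."""
    if not syllable or not syllable[-1].isdigit():
        return syllable
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # toneless syllable: no mark
    if "a" in base:
        position = base.index("a")
    elif "e" in base:
        position = base.index("e")
    elif "ou" in base:
        position = base.index("o")
    else:
        vowel_positions = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowel_positions:
            return base
        position = vowel_positions[-1]
    marked = TONE_MARKS[base[position]][tone - 1]
    return base[:position] + marked + base[position + 1:]


print("".join(mark_syllable(s) for s in ["ran2", "hou4"]))
print("".join(mark_syllable(s) for s in ["yi3", "jing1"]))
```

The sketch handles one syllable at a time and does not model tone sandhi.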
Notes
1 We thank Professor Hongyin Tao of the University of California, Los Angeles for permitting us to use part of the data he collected in the Lancaster Los Angeles Spoken Chinese Corpus (LLSCC,
http://www.ling.lancs.ac.uk/corplang/llscc/) We are also grateful to Dr Jiajin Xu of Beijing Foreign Studies University for allowing us to use his corpus of Spoken Chinese of Urban Teenagers (SCOUT).
2 See http://www.elda.org/catalogue/en/text/W0039.html for more information about the LCMC corpus.
3 See http://www.ling.lancs.ac.uk/corplang/ucla/ for more
5 The current national standard in China for Chinese Information Processing GB2312-80 (Character Set for Chinese Character Encoding in Information Exchange – Basic Set) is based on the frequency list established on Project Code 748.
6 The HSK Level 2 words 反动 "reactionary" and 红旗 "red flag" are excluded from our word list as they are not so common nowadays (ranking 5781 and 6004 respectively in our corpus) as they were during the so-called “Cultural Revolution”.
7 This decision was motivated by the fact that in computer programming we had to keep the same kind of information (e.g. Pinyin gloss, or word sense) together in one field instead of mixing information of different kinds (e.g. Pinyin glosses and word senses).
8 See BBC News on 9 January 2007 (http://news.bbc.co.uk/2/hi/asia-pacific/6244763.stm).
9 See mandarin.htm.
The ICTCLAS part of speech annotation scheme
Frequency index
Frequency rank Headword in Simplified Chinese [Headword in Traditional Chinese] /Pinyin/ (Optional HSK Level) Part of speech English gloss
Illustrative example in Simplified Chinese and English translation of the example
Normalised frequency | Dispersion index | Usage rate | Optional register code
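The entry layout above lends itself to mechanical parsing. The following Python sketch shows one possible way to read the headword line and the statistics line of an entry into a record. The class name, function name and regular expression are hypothetical, and the sketch assumes cleanly typeset input in which the dispersion index is written as a decimal fraction; it is not the authors' own processing code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Headword line: rank, simplified form, [traditional form], /pinyin/,
# optional (HSK level), part of speech, English gloss.
HEAD_RE = re.compile(
    r"^(?P<rank>\d{4})\s+"
    r"(?P<simplified>\S+)\s+"
    r"\[(?P<traditional>[^\]]+)\]\s+"
    r"/(?P<pinyin>[^/]+)/\s+"
    r"(?:\((?P<hsk>[^)]*)\)\s+)?"
    r"(?P<pos>\S+)\s+"
    r"(?P<gloss>.+)$"
)


@dataclass
class Entry:
    rank: int
    simplified: str
    traditional: str
    pinyin: str
    hsk: Optional[str]       # None when no HSK level is given
    pos: str
    gloss: str
    frequency: int           # normalised frequency
    dispersion: float        # dispersion index
    usage_rate: int
    register: Optional[str]  # optional register code, e.g. "S"


def parse_entry(head_line, stats_line):
    """Parse the headword line and the statistics line of one entry."""
    match = HEAD_RE.match(head_line.strip())
    if match is None:
        raise ValueError(f"unrecognised headword line: {head_line!r}")

    fields = [field.strip() for field in stats_line.split("|")]
    if len(fields) not in (3, 4):
        raise ValueError(f"unrecognised statistics line: {stats_line!r}")

    return Entry(
        rank=int(match["rank"]),
        simplified=match["simplified"],
        traditional=match["traditional"].strip(),
        pinyin=match["pinyin"].strip(),
        hsk=match["hsk"],
        pos=match["pos"],
        gloss=match["gloss"].strip(),
        frequency=int(fields[0]),
        dispersion=float(fields[1]),
        usage_rate=int(fields[2]),
        register=fields[3] if len(fields) == 4 else None,
    )


entry = parse_entry("0037 那 [那] /nà/ (1) pron that", "13126 | 0.62 | 8152 | S")
print(entry.simplified, entry.pos, entry.dispersion, entry.register)
```

The illustrative example line is left out of the record for brevity; it could be passed through as a third argument without changing the parsing of the other two lines.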
0001 的 [的] /de/ (1) aux [structural particle used
0006 不 [不] /bù/ (1) adv no, not
这条街不准停车。 You can't park in this street
50589 | 0.8 | 40245
0007 我 [我] /wǒ/ (1) pron I, me
咖啡和茶使我感到兴奋。 Coffee and tea stimulate me
51365 | 0.71 | 36653
0008 个 [個] /gè/ (1) clas [generalised measure word used for nouns without a specific measure term]
山那边有一个村庄。 There is a village beyond the hill
Pinyin | Gloss | Frequency rank
… | apple | …
… | watermelon | …
… | grape | …
… | peach | …
lí | pear | …
xiāngjiāo | banana | …
yīngtáo | cherry | …
lìzhī | litchi, lychee | …
shìzi | persimmon | …
shíliu | pomegranate | …
mángguǒ | mango | …
lǐzi | plum | …
júzi | tangerine | …
cǎoméi | strawberry | …
júzi | orange | …
yēzi | coconut | …
lóngyǎn | longan fruit | …
bōluó | pineapple | …
shānzhā | hawthorn | …
míhóutáo | Chinese gooseberry | …
níngméng | lemon | …
0019 了 [了] /le/ (1) part [sentence final particle indicating change of state or current
0022 对 [對] /duì/ (1) prep for, to, with regard to
她必须对她的行为负责。 She must answer for her actions
0023 还 [還] /hái/ (1) adv still, yet
后面还有座位。 There are still vacant seats at the back
0025 大 [大] /dà/ (1) adj big, large
一个大浪把小船卷走了。 A huge wave swept the boat away
11795 | 0.98 | 11525
0026 我们 [我們] /wǒmen/ pron we, us
我们早上很早就出发了。 We made an early start in the morning
15065 | 0.74 | 11115
0027 着 [著] /zhe/ (1) aux [aspect marker indicating a durative or ongoing situation]
花园里的花正开着。 The flower in the garden
0030 中 [中] /zhōng/ (1) loc in, within
数字写在背后的表中。 The figures are set out in the table at the back of the book
9906 | 0.86 | 8485
0035 会 [會] /huì/ (1) v can, know how to do
谁说我不会做饭? Who said that I cannot cook?
10178 | 0.83 | 8476
0036 地 [地] /de/ (1) aux [structural particle introducing an adverbial modifier]
你得认真地打每一个球。 You should be serious about each stroke
0037 那 [那] /nà/ (1) pron that
那是一派胡言! That's total nonsense!
13126 | 0.62 | 8152 | S
0038 很 [很]
火车很可能要晚点。 It's very likely that the train will be delayed
0041 他们 [他們] /tāmen/ (1) pron they, them
他们刚刚到达。 They've just arrived
0044 得 [得] /de/ (1) aux [structural particle used after a verb that introduces a complement
0046 又 [又] /yòu/ (1) adv again, once again
电梯又坏了。 The elevator's out again
0050 出 [出] /chū/ (1) v out; produce; happen
这儿出了什么事? What's happened here?
0053 来 [來] /lái/ (2) aux [preceding a verb to indicate the intended or suggested action]
我找到这个小玩意儿来做开瓶器。 I found this gadget that will serve as a bottle opener
6516 | 0.86 | 5604
0054 次 [次] /cì/ (1) clas [measure word indicating number of repetitions or count of actions or events] times
这种药每天喝三次。 Take this medicine three times a day
6398 | 0.86 | 5516
0055 多 [多] /duō/ (1) adj many, much, plentiful
他的钱多得足够买下那个岛。 He had enough money to buy out the island
5823 | 0.94 | 5485
0056 想 [想] /xiǎng/ (1) v think
对不起,我想她出去购物了。 Sorry, I think she's gone shopping
0062 几 [幾] /jǐ/ (1) num several; how much
我要离开几天。 I am going away for a few
0064 下 [下] /xià/ (1) loc under, below
它们在我床下。 They're under my bed
5505 | 0.89 | 4916
0065 为 [為] /wèi/ (1) prep [introducing purpose, reason, beneficiary, etc.] for, because of
你能为我弄到两张音乐会的好票吗? Can you secure me two good tickets for the concert?
0067 后 [後] /hòu/ (1) loc behind; after, later
三天后您可以来试衣。 You can come for a fitting three days later
0069 多 [多] /duō/ (1) num many, much, numerous
我们已试过多次了。 We have tried many
0077 以 [以] /yǐ/ (2) prep by means of, with, in (some way); according to; because of
0080 二 [二] /èr/ (1) num two
双方踢成二平。 The two teams tied
4663 | 0.86 | 4003
0083 更 [更] /gèng/ (1) adv [comparative degree] more
我无法跑得更快了。 I can't run any faster
4509 | 0.88 | 3988
0084 之 [之] /zhī/ (3) aux [archaic equivalent of structural particle 的]
他们做得对还是错,这是争议之处。 It is a matter of dispute whether they did the right thing
5125 | 0.77 | 3951
0085 走 [走] /zǒu/ (1) v walk; leave
他一句话都没说就走了。 He left without a word
0089 呢 [呢] /ne/ (1) part [particle used at the end of a question or declarative sentence to indicate mood]
你为什么不回家去呢? Why didn't you go home?
6522 | 0.56 | 3677 | S
0090 知道 [知道] /zhīdào/ (1) v know
照理他应该知道她的地址。 He ought to know her address
0095 于 [於] /yú/ (2) prep [indicating time, location, direction, etc.] in, at
他们将于下周出发到香港去。 They will set out for Hong Kong next week
Please put your articles in this envelope
and seal it
0106 高 [高] /gāo/ (1) adj tall, high
多高的大楼啊! What a tall building it is!
0111 们 [們] /men/ (1) suf [plural marker for pronouns and some animate nouns]
朋友们舞会后就分开了。 The friends separated after the party
3199 | 0.92 | 2930
0112 新 [新] /xīn/ (1) adj new
新货很快就到。 The new order is coming soon
4702 | 0.62 | 2925
0113 所 [所] /suǒ/ (2) aux [particle preceding a verb to form a nominal structure]
这本词典正是我所需要的。 This dictionary is exactly what I need
3646 | 0.8 | 2906
0114 社会 [社會] /shèhuì/ (1) n society
很多艺术家都感到与社会脱节。 Many artists feel alienated from society
0119 吧 [吧] /ba/ (1) part [modal particle indicating a suggestion or request; marking a question requesting confirmation, or a pause after alternatives]
这次我们各付各的吧。 Let's go Dutch this time
0124 使 [使] /shǐ/ (2) v [often used in serial verb constructions] make, cause,
房间里有四张单人床。 There are four single beds in the room
0131 将 [將] /jiāng/ (2) adv [indicating a future happening] will, be going to
She dressed up like a princess for the party
0134 叫 [叫] /jiào/ (1, 1) v name, call; shout; order (somebody to do something); order (meal, taxi, etc.)
你的父母叫你什么? What do your parents call you?
3944 | 0.63 | 2501
0135 国家 [國家] /guójiā/ (1) n country
儿童是国家和社会的未来。 Children are the future of the country and society
4533 | 0.55 | 2497
0136 起 [起] /qǐ/ (1) v get up, rise; up
我通常起得很早。 I usually get up early
2614 | 0.94 | 2465
0140 全 [全] /quán/ (1) adj whole, entire, complete
最为重要的一点是,全家人团聚在一起。 Most important, the entire family was together
4569 | 0.53 | 2415
0141 完 [完] /wán/ (1) v finish, be over; (use) up
我刚看完第3章。 I've just finished reading Chapter 3
3434 | 0.7 | 2388
0142 时间 [時間] /shíjiān/ (1) n time
剩下时间不多了。 There is little time left
2634 | 0.9 | 2375
0143 起来 [起來] /qǐlái/ (1) v get up, rise; [following a verb to indicate the beginning of a situation] start to; [following a verb to indicate completedness or effectiveness]
她哭起来了。 She started crying
0147 老 [老] /lǎo/ (1) adj old, veteran
房子前面的那棵树很老了。 The tree in front of the house is very old
2817 | 0.81 | 2274
0148 可能 [可能] /kěnéng/ (1) v might (happen)
他可能明天来。 He may come tomorrow
I'm afraid the colour is a bit too
bright for me
0157 条 [條] /tiáo/ (1) clas [measure word for things of a long and thin shape (e.g. string and river), pieces of writing (e.g. news, suggestions and regulations), or human
2924 | 0.69 | 2010
0163 文化 [文化] /wénhuà/ (1) n culture
美国是个多文化的国家。 The US is a country with many different cultures
2726 | 0.74 | 2007
0164 问 [問] /wèn/ (1) v ask
如果他来这儿,我将问他几个问题。 If he comes here, I shall ask him some questions
0167 点 [點] /diǎn/ (1) clas [measure word for point, item, etc.]; [measure word for small quantities]
我只想简单讲两点。 I shall only mention two things in brief
2651 | 0.73 | 1944
0170 进行 [進行] /jìnxíng/ (1) v go on, last; be under way; carry on, carry out, perform
比赛进行了两个小时。 The game lasted two hours
2239 | 0.86 | 1923
0173 长 [長] /cháng/ (1) adj long
到城里要走很长的路程。 It is a long walk to the town
2 Drinks and beverages
Pinyin | Gloss | Frequency rank
kuàngquán shuǐ | mineral water | 8642
báijiǔ | white spirit | …
báilándì | brandy | …
bái pútaojiǔ | white wine | …
bǎishì kělè | Pepsi, Pepsi-Cola | …
bīng kāfēi | ice coffee | …
bīng shuǐ | ice water | …
chéngzhī | orange juice | …
chúnjìng shuǐ | purified water | …
dùsōngzǐjiǔ | gin | …
hóngjiǔ | red wine | …
hóng pútaojiǔ | red wine | …
huāchá | scented tea | …
jīwěijiǔ | cocktail | …
júzi zhī | orange juice | …
kěkǒu kělè | Coca-Cola, Coke | …
máotái jiǔ | maotai | …
níngméng shuǐ | lemonade | …
píngguǒ zhī | apple juice | …
rè qiǎokèlì | hot chocolate | …
1888 | 0.97 | 1834
0179 你们 [你們] /nǐmen/ (1) pron [plural] you
对不起,给你们找麻烦了。 I'm sorry to have caused you so much trouble
0180 找 [找] /zhǎo/ (1) v look for
你在找什么? What are you looking for?
2560 | 0.7 | 1795
0181 跟 [跟] /gēn/ (1) prep [indicating relationship, involvement, or comparison] with
我有点要紧的事跟他商量。 I have something urgent to discuss with him
3142 | 0.57 | 1786 | S
0182 儿 [兒] /er, r/ suf [nonsyllabic suffix for retroflection, especially in the Beijing dialect]
这两张画儿不太一样。 The two pictures are not quite the same
0184 女 [女] /nǚ/ (1) adj female, woman
门外是两个女鬼! Outside were two female ghosts
2007 | 0.88 | 1761
0185 而且 [而且] /érqiě/ (1) conj (not only) but also
这不仅省事而且省钱。 This will save not only labour but also money
2043 | 0.86 | 1761
0186 开 [開] /kāi/ (1) v open; operate, drive (car), turn on (light); start (business); hold (meeting, party, etc.); (water) boil; write (cheque); (flower)
0188 其 [其] /qí/ (3) pron [third person singular or plural] his, her, its, their; that, such
0191 没 [沒] /méi/ (1) v have not, there be not
我根本没时间给你写信。 I have no time at all
2341 | 0.7 | 1630
0200 住 [住] /zhù/ (1) v live, stay
你住在几号房间呢?先生。 What room are you staying in, sir?
1795 | 0.9 | 1607
0204 带 [帶] /dài/ (1) v carry, bring, take
请务必随时随身带着钥匙。 Please make sure that you take the key with you at all times
我们可以请您一起进餐吗? May we have the pleasure of your company at dinner?
2762 | 0.57 | 1576 | S
0210 地 [地] /dì/ (1) n earth, ground, field
地上有很多树叶。 There are a lot of leaves on
0212 同 [同] /tóng/ (2) prep [indicating relationship, involvement, comparison] with
0216 张 [張] /zhāng/ (1) clas [measure word for flat objects and things with a flat surface, and also for bows and mouths]
0219 件 [件] /jiàn/ (1) clas [measure word for clothes, furniture, affairs, etc.] item,
2682 | 0.57 | 1523 | S
0221 真 [真] /zhēn/ (1) adv really
真漂亮。 It's really beautiful
2256 | 0.67 | 1521
0222 啊 [啊] /a/ (1) part [modal particle showing affirmation, approval, or consent]
多么美丽的地方啊。 What a beautiful place it is!
2034 | 0.72 | 1456
0227 最后 [最後] /zuìhòu/ (1) adj final, last
谁笑到最后谁笑得最好。 Who laughs last laughs longest
1609 | 0.9 | 1452
0228 人民 [人民] /rénmín/ (1) n (the) people
人民是文艺工作者的母亲。 It is the people who nurture our writers and artists
2754 | 0.53 | 1447
0229 手 [手] /shǒu/ (1) n hand
握住瓶子,用另一只手拔瓶塞。 Hold the bottle and pull the cork out with the other hand
1958 | 0.72 | 1413
0233 政府 [政府] /zhèngfǔ/ (1) n government
这篇文章展示了政府的政策。 The article revealed the policies of the government