After the code point value and the name, the next most important property that a Unicode character has is its general category. Seven primary categories exist: letter, number, punctuation, symbol, mark, separator, and miscellaneous. Each is subdivided into additional categories.
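As a rough illustration, Python's standard `unicodedata` module reports the general category as a two-letter code whose first letter names the primary category. The sample characters and the helper name `primary_category` are chosen here purely for illustration, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# First letter of the two-letter code -> primary category name.
PRIMARY = {
    "L": "letter", "M": "mark", "N": "number", "P": "punctuation",
    "S": "symbol", "Z": "separator", "C": "miscellaneous",
}

def primary_category(ch):
    """Return (two-letter code, primary category name) for one character.

    Hypothetical helper, not part of unicodedata itself.
    """
    code = unicodedata.category(ch)
    return code, PRIMARY[code[0]]

for ch in ["A", "7", ",", "+", "\u0308", " ", "\n"]:
    code, name = primary_category(ch)
    print(f"U+{ord(ch):04X} {code} {name}")
```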
Letters
The Unicode standard uses the term "letter" rather loosely in assigning things to this general category. Whatever counts as the basic unit of meaning in a particular writing system, whether it represents a phoneme, a syllable, or a whole word or idea, is assigned to the "letter" category. The major exception to this rule comprises marks that combine typographically with other characters, which are categorized as "marks" instead of "letters." They include not only diacritical marks and tone marks, but also vowel signs in those consonantal writing systems where the vowels are written as marks applied to the consonants.
Some writing systems, such as the Latin, Greek, and Cyrillic alphabets, also have the concept of "case." That is, two series of letterforms are used together, with one series, the "uppercase," used for the first letter of a sentence or a proper name, or for emphasis, and the other series, the "lowercase," used for most other letters.
Uppercase Letter (Lu)
In cased writing systems, the uppercase letters are placed in this category.
Lowercase Letter (Ll)
In cased writing systems, the lowercase letters are placed in this category.
Titlecase Letter (Lt)
Titlecase is reserved for a few special characters in Unicode. These characters are basically examples of compatibility characters, that is, characters that were included for round-trip compatibility with some other standard. Every titlecase letter is actually a glyph representing two letters, the first of which is uppercase and the second of which is lowercase. For example, the Serbian letter nje (њ) can be thought of as a ligature of the Cyrillic letter n (н) and the Cyrillic soft sign (ь). When Serbian is written using the Latin alphabet (as is done in Croatian, which is almost the same language), this letter is written using the letters nj. Existing Serbian and Croatian standards were designed to provide a one-to-one mapping between every Cyrillic character used in Serbian and the corresponding Latin character used in Croatian. This approach required using a single character code to represent the nj digraph in Croatian, and Unicode carries that character forward. Capital Nje in Cyrillic (Њ) thus can convert to either NJ or Nj in Latin, depending on the context. The fully uppercase form, NJ, is U+01CA LATIN CAPITAL LETTER NJ, and the combined upper-lower form, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, is considered a "titlecase" letter. Three Serbian characters have a titlecase Latin form: Љ (lje, which converts to lj), Њ (nje, which converts to nj), and Џ (dzhe, which converts to dž). These characters were the only three titlecase letters in Unicode 2.x.
Unicode 3.0 added several Greek letters to this category. Some early Greek texts represented certain diphthongs by writing a small letter iota underneath the other vowel rather than after it. For example, you'd see "ai" written as ᾳ. If you just capitalized the alpha ("Ai"), you'd get the titlecase version: ᾼ. In the fully uppercase version ("AI"), the small iota becomes a regular iota again: ΑΙ. These characters are all in the Extended Greek section of the standard and are used only in writing ancient Greek texts. In modern Greek, these diphthongs are written using a regular iota; for example, "ai" is written as αι.
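The three case forms of the nj digraph can be inspected with a short Python sketch. It assumes the interpreter's built-in case mappings follow the Unicode Character Database, and the names `NJ_FORMS` and `lower_nj` are illustrative only.

```python
import unicodedata

# The three case forms of the Latin nj digraph.
NJ_FORMS = {
    "upper": "\u01CA",  # LATIN CAPITAL LETTER NJ
    "title": "\u01CB",  # LATIN CAPITAL LETTER N WITH SMALL LETTER J
    "lower": "\u01CC",  # LATIN SMALL LETTER NJ
}

nj_categories = {form: unicodedata.category(ch) for form, ch in NJ_FORMS.items()}
print(nj_categories)

# str.upper() and str.title() pick different characters for the same input.
lower_nj = NJ_FORMS["lower"]
print(lower_nj.upper() == NJ_FORMS["upper"])
print(lower_nj.title() == NJ_FORMS["title"])

# One of the Greek titlecase letters added later:
# GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI.
print(unicodedata.category("\u1FBC"))
```

Note that titlecasing is a separate mapping from uppercasing: code that capitalizes a word by uppercasing its first character would turn the lowercase digraph into the fully uppercase NJ rather than Nj.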
Modifier Letter (Lm)
Just as some things you might conceptually think of as "letters" (vowel signs in various languages) are classified as "marks" in Unicode, the opposite also occurs. The modifier letters are independent forms that don't combine typographically with the characters around them, which is why Unicode doesn't classify them as "marks" (Unicode marks, by definition, combine typographically with their neighbors). Instead of carrying their own sounds, the modifier letters generally modify the sounds of their neighbors. In other words, conceptually they're diacritical marks. Because they occur in the middle of words, most text-analysis processes treat them as letters, so they're classified as letters.
The Unicode modifier letters are generally either International Phonetic Alphabet characters or characters that are used to transliterate certain "real" letters in non-Latin writing systems that don't seem to correspond to a regular Latin letter. For example, U+02BC MODIFIER LETTER APOSTROPHE is typically used to represent the glottal stop, the sound made by (or, more accurately, the absence of sound represented by) the Arabic letter alef, so the Arabic letter is often transliterated as this character. Likewise, U+02B2 MODIFIER LETTER SMALL J is used to represent palatalization, and thus is sometimes used in transliteration as the counterpart of the Cyrillic soft sign.
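The practical effect of classifying these characters as letters can be seen in a small Python sketch. The transliterated word is just a made-up test string for illustration; the point is that a letter test accepts the modifier apostrophe but rejects the ASCII apostrophe.

```python
import unicodedata

with_modifier = "Qur\u02BCan"  # uses U+02BC MODIFIER LETTER APOSTROPHE
with_ascii = "Qur'an"          # uses U+0027 APOSTROPHE

print(unicodedata.category("\u02BC"))  # modifier letter apostrophe
print(unicodedata.category("\u02B2"))  # modifier letter small j
print(unicodedata.category("'"))       # ASCII apostrophe

# str.isalpha() is true only when every character is in a letter category,
# so the two spellings behave differently in word-oriented processing.
print(with_modifier.isalpha(), with_ascii.isalpha())
```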
Other Letter (Lo)
This catch-all category includes everything that's conceptually a "letter," but that doesn't fit into one of the other "letter" categories. Letters from uncased alphabets such as Arabic and Hebrew fall into this category, as do syllables from syllabic writing systems like Kana and Hangul and the Han ideographs.
Marks
Like letters, marks are part of words and carry linguistic information. Unlike letters, marks combine typographically with other characters. For example, U+0308 COMBINING DIAERESIS may look like ¨ when shown alone, but is usually drawn on top of the letter that precedes it. That is, U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS isn't drawn as "a¨", but rather as "ä". All of the Unicode combining marks do this kind of thing.
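A minimal Python sketch of the a-plus-diaeresis example follows. It uses NFC normalization, a mechanism this section doesn't itself discuss, only as a convenient way to show that the two-character sequence and the single precomposed character represent the same text.

```python
import unicodedata

decomposed = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

# Two code points in memory, but one visible unit on screen.
print(len(decomposed), len(composed))

# The mark is a non-spacing mark with a non-zero combining class.
diaeresis_category = unicodedata.category("\u0308")
diaeresis_class = unicodedata.combining("\u0308")
print(diaeresis_category, diaeresis_class)
```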
Non-spacing Mark (Mn)
Most of the Unicode combining marks fall into this category. Non-spacing marks don't take up any horizontal space along a line of text; they combine completely with the character that precedes them and fit entirely into that character's space. The various diacritical marks used in European languages, such as the acute and grave accents, the circumflex, the diaeresis, and the cedilla, fall into this category.
Combining Spacing Mark (Mc)
Spacing combining marks interact typographically with their neighbors, but still take up horizontal space along a line of text. They occur in the various Indian and Southeast Asian writing systems. For example, U+093F DEVANAGARI VOWEL SIGN I (ि) is a spacing combining mark. Thus U+0915 DEVANAGARI LETTER KA followed by U+093F DEVANAGARI VOWEL SIGN I is drawn as कि: the vowel sign attaches to the left-hand side of the consonant.
Not all spacing combining marks reorder, however: U+0940 DEVANAGARI VOWEL SIGN II (ी) is also a combining spacing mark. When it follows U+0915 DEVANAGARI LETTER KA, you get की: the vowel attaches to the right-hand side of the consonant, but the two combine typographically.
Enclosing Mark (Me)
Enclosing marks completely surround the characters they modify. For example, U+20DD COMBINING ENCLOSING CIRCLE is drawn as a ring around the character that precedes it. These ten characters are generally used to create symbols.
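The three mark subcategories can be told apart programmatically. The following Python sketch looks up one or two examples of each; the choice of sample characters is an assumption made for illustration.

```python
import unicodedata

MARK_SAMPLES = ["\u0301", "\u093F", "\u0940", "\u20DD"]

mark_categories = {}
for ch in MARK_SAMPLES:
    mark_categories[ch] = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {mark_categories[ch]} {unicodedata.name(ch)}")

# Text is stored in logical order: the consonant KA comes first in memory,
# even though VOWEL SIGN I is drawn on its left-hand side.
ki = "\u0915\u093F"
print([f"U+{ord(c):04X}" for c in ki])
```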
Numbers
The Unicode characters that represent numeric quantities are given the "number" property (technically, it should be called the "numeral" property, but that's life). The characters in these categories have additional properties that govern their interpretation as numerals. This category is subdivided as follows.
Decimal-Digit Number (Nd)
The characters in this category can be used as decimal digits. This category includes not only the digits with which we're all familiar ("0123456789"), but similar sets of digits used with other writing systems, such as the Thai digits ("๐๑๒๓๔๕๖๗๘๙").
Letter Number (Nl)
The characters in this category can be either letters or numerals. Many are compatibility composites whose decompositions consist of letters. The Roman numerals and the Hangzhou numerals are the only characters in this category.
Other Number (No)
All of the characters that belong in the "number" category, but not in one of the other subcategories, fall into this one. This category includes various numeric presentation forms, such as superscripts, subscripts, and circled numbers; various fractions; and numerals used in various numeration systems other than the Arabic positional notation used in the West.
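The additional numeric properties mentioned above are exposed in Python through `unicodedata.decimal()` and `unicodedata.numeric()`. This sketch, with sample characters chosen for illustration, shows one character from each number subcategory and how only decimal digits have a decimal value.

```python
import unicodedata

NUMBER_SAMPLES = [
    "7",       # DIGIT SEVEN
    "\u0E54",  # THAI DIGIT FOUR
    "\u2167",  # ROMAN NUMERAL EIGHT
    "\u00BD",  # VULGAR FRACTION ONE HALF
    "\u00B2",  # SUPERSCRIPT TWO
]

for ch in NUMBER_SAMPLES:
    category = unicodedata.category(ch)
    # decimal() has a value only for decimal digits; None is our fallback.
    decimal_value = unicodedata.decimal(ch, None)
    numeric_value = unicodedata.numeric(ch)
    print(f"U+{ord(ch):04X} {category} decimal={decimal_value} numeric={numeric_value}")

# int() accepts decimal digits from any script, not just ASCII.
thai_forty_two = int("\u0E54\u0E52")
print(thai_forty_two)
```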
Punctuation
This category attempts to make sense of the various punctuation characters in Unicode. It breaks down as follows.
Opening Punctuation (Ps)
For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "opening" characters in these pairs are assigned to this category.
Closing Punctuation (Pe)
For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "closing" characters in these pairs are assigned to this category.
Initial-Quote Punctuation (Pi)
Quotation marks occur in opening-closing pairs, just like parentheses do. The problem is that which is which depends on the language. For example, both French and German use quotation marks that look like this: «», but they use them differently.
«In French, a quotation is set off like this.»
»But in German, a quotation is set off like this.«
This category is equivalent to either Ps or Pe, depending on the language.
Final-Quote Punctuation (Pf)
The counterpart to the Pi category, Pf is also used with quotation marks whose usage varies depending on language. It's equivalent to either Ps or Pe depending on language. It's always the opposite of Pi.
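The four paired-punctuation categories can be compared side by side. This Python sketch uses a handful of pairs chosen for illustration; note that the category only records a default role, so a process that needs to know which mark actually opens a quotation still needs the language.

```python
import unicodedata

PAIRS = [
    ("(", ")"),
    ("[", "]"),
    ("\u00AB", "\u00BB"),  # guillemets
    ("\u201C", "\u201D"),  # curly double quotes
]

pair_categories = [
    (unicodedata.category(first), unicodedata.category(second))
    for first, second in PAIRS
]

for (first, second), cats in zip(PAIRS, pair_categories):
    print(f"U+{ord(first):04X}/U+{ord(second):04X}", cats)
```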
Dash Punctuation (Pd)
This category is self-explanatory. It includes all hyphens and dashes.
Connector Punctuation (Pc)
Characters in this category, such as the middle dot and the underscore, get treated as part of the word in which they appear. That is, they "connect" series of letters together into single words: This_is_all_one_word. An important example is U+30FB KATAKANA MIDDLE DOT, which is used like a hyphen in Japanese.
Other Punctuation (Po)
Punctuation marks that don't fit into any of the other subcategories, including obvious things like the period, comma, and question mark, fall into this category.
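The remaining punctuation subcategories can be checked the same way. In this Python sketch the sample sentence is made up, and the category of U+30FB is printed rather than assumed, because the value reported for individual characters can differ between Unicode versions.

```python
import re
import unicodedata

dash_category = unicodedata.category("-")         # HYPHEN-MINUS
em_dash_category = unicodedata.category("\u2014") # EM DASH
underscore_category = unicodedata.category("_")   # LOW LINE
period_category = unicodedata.category(".")       # FULL STOP
print(dash_category, em_dash_category, underscore_category, period_category)

# Depends on the Unicode version bundled with the interpreter.
print("U+30FB:", unicodedata.category("\u30FB"))

# Python's \w treats the underscore as a word character, which matches the
# "connector" behavior described for the underscore.
words = re.findall(r"\w+", "This_is_all_one_word, right?")
print(words)
```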
Symbols
This group of categories contains various symbols.
Currency Symbol (Sc)
Self-explanatory.
Mathematical Symbol (Sm)
Mathematical operators.
Modifier Symbol (Sk)
This category contains two main types of characters: the "spacing" versions of the combining marks and a few other symbols that modify the preceding character in some way. Unlike modifier letters, modifier symbols don't necessarily modify the meanings of letters, and they don't necessarily get counted as parts of words.
Other Symbol (So)
This category contains all symbols that didn't fit into one of the other categories.
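One or two examples of each symbol subcategory, looked up in Python, give a feel for how the group divides. The sample characters are chosen here for illustration.

```python
import unicodedata

SYMBOL_SAMPLES = [
    "$",       # DOLLAR SIGN
    "\u20AC",  # EURO SIGN
    "+",       # PLUS SIGN
    "\u221E",  # INFINITY
    "^",       # CIRCUMFLEX ACCENT (spacing form)
    "\u00A8",  # DIAERESIS (spacing form)
    "\u00A9",  # COPYRIGHT SIGN
]

symbol_categories = {ch: unicodedata.category(ch) for ch in SYMBOL_SAMPLES}
for ch, category in symbol_categories.items():
    print(f"U+{ord(ch):04X} {category} {unicodedata.name(ch)}")
```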
Separators
These characters mark the boundaries between units of text.
Space Separator (Zs)
This category includes all of the space characters (yes, there's more than one space character).
Paragraph Separator (Zp)
There is exactly one character in this category: the Unicode paragraph separator (U+2029). As its name suggests, it marks the boundary between paragraphs.
Line Separator (Zl)
There's also only one character in this category: the Unicode line separator (U+2028). As its name suggests, it forces a line break without ending a paragraph.
Even though the ASCII carriage-return and line-feed characters are often used as line and paragraph separators, they're not placed in either of these categories. Likewise, the ASCII tab character isn't considered a Unicode space character, even though it probably should be. They're all put in the "Cc" category.
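The split between true separators and the ASCII control characters shows up directly in the category codes. The Python sketch below also uses `str.splitlines()`, which is just one example of a process that treats U+2028 and U+2029 as line boundaries; the sample text is made up.

```python
import unicodedata

SEPARATOR_SAMPLES = {
    "SPACE": " ",
    "NO-BREAK SPACE": "\u00A0",
    "IDEOGRAPHIC SPACE": "\u3000",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "TAB": "\t",
    "LINE FEED": "\n",
    "CARRIAGE RETURN": "\r",
}

separator_categories = {
    label: unicodedata.category(ch) for label, ch in SEPARATOR_SAMPLES.items()
}
for label, category in separator_categories.items():
    print(f"{category} {label}")

# Both Unicode separators are honored as line boundaries here.
pieces = "one\u2028two\u2029three".splitlines()
print(pieces)
```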
Miscellaneous
A number of special character categories don't really fit in with the others.
Control Characters (Cc)
The codes corresponding to the C0 and C1 control characters from the ISO 2022 standard appear in this category. The Unicode standard doesn't officially assign any semantics to these characters (which include the ASCII control characters), but most systems that use Unicode text treat these characters the same way as they treat their counterparts in the source standards. For example, most processes treat the ASCII line-feed character as a line or paragraph separator.
The original idea was to leave the definitions of these code points open, as ISO 2022 does. Over time, however, various Unicode processes and algorithms have attached semantics to these code points, effectively nailing the ISO 6429 semantics to many of them.
Formatting Characters (Cf)
Unicode includes some "control" characters of its own: characters with no visual representation of their own that are used to control how the characters around them are drawn or handled by various processes. These characters are assigned to this category.
Surrogates (Cs)
The code points in the UTF-16 surrogate range belong to this category. Technically, the code points in the surrogate range are treated as unassigned and reserved, but Unicode implementations based on UTF-16 often treat them as characters, handling surrogate pairs the same way as combining character sequences are handled.
Private-Use Characters (Co)
The code points in the private-use ranges are assigned to this category.
Unassigned Code Points (Cn)
All unassigned and noncharacter code points, other than those in the surrogate range, are given this category. These code points aren't listed in the Unicode Character Database (their omission gives them this category) but are listed explicitly in DerivedGeneralCategory.txt.
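A short Python sketch can confirm one representative code point from each miscellaneous category. The sample code points and their labels are chosen here for illustration; only the hexadecimal value and the category are printed, because a lone surrogate can't be written to most output streams.

```python
import unicodedata

MISC_SAMPLES = {
    0x000A: "ASCII line feed (C0 control)",
    0x0085: "NEXT LINE (C1 control)",
    0x200D: "ZERO WIDTH JOINER",
    0xD800: "first code point of the surrogate range",
    0xE000: "first code point of the main private-use range",
    0xFFFF: "noncharacter",
}

misc_categories = {
    cp: unicodedata.category(chr(cp)) for cp in MISC_SAMPLES
}
for cp, label in MISC_SAMPLES.items():
    print(f"U+{cp:04X} {misc_categories[cp]} {label}")
```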