
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Addison-Wesley, Sep 2002, ISBN 0201700522)




After the code point value and the name, the next most important property that a Unicode character has is its general category. Seven primary categories exist: letter, number, punctuation, symbol, mark, separator, and miscellaneous. Each is subdivided into additional categories.
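As a minimal sketch of how these categories surface in practice, Python's standard `unicodedata` module reports the two-letter general category of any character. The `PRIMARY` table and the `describe` helper below are illustrative names, not part of the standard; the first letter of each code identifies the primary category.

```python
import unicodedata

# Illustrative mapping (an assumption of this sketch): the first letter of the
# two-letter code identifies one of the seven primary categories.
PRIMARY = {
    "L": "letter", "M": "mark", "N": "number", "P": "punctuation",
    "S": "symbol", "Z": "separator", "C": "miscellaneous",
}

def describe(ch):
    """Return (two-letter general category, primary category name) for ch."""
    code = unicodedata.category(ch)
    return code, PRIMARY[code[0]]

for ch in ("A", "7", "!", "+", " "):
    # Print the code point rather than the character itself.
    print(f"U+{ord(ch):04X}", *describe(ch))
```

The category values come from the version of the Unicode Character Database bundled with the Python interpreter, so a few characters may be classified differently from what this text describes.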

Letters

The Unicode standard uses the term "letter" rather loosely in assigning things to this general category. Whatever counts as the basic unit of meaning in a particular writing system, whether it represents a phoneme, a syllable, or a whole word or idea, is assigned to the "letter" category. The major exception to this rule comprises marks that combine typographically with other characters, which are categorized as "marks" instead of "letters." They include not only diacritical marks and tone marks, but also vowel signs in those consonantal writing systems where the vowels are written as marks applied to the consonants.

Some writing systems, such as the Latin, Greek, and Cyrillic alphabets, also have the concept of "case." That is, two series of letterforms are used together, with one series, the "uppercase," used for the first letter of a sentence or a proper name, or for emphasis, and the other series, the "lowercase," used for most other letters.

Uppercase Letter (Lu)

In cased writing systems, the uppercase letters are placed in this category.

Lowercase Letter (Ll)

In cased writing systems, the lowercase letters are placed in this category.

Titlecase Letter (Lt)

Titlecase is reserved for a few special characters in Unicode. These characters are basically examples of compatibility characters: characters that were included for round-trip compatibility with some other standard. Every titlecase letter is actually a glyph representing two letters, the first of which is uppercase and the second of which is lowercase. For example, the Serbian letter nje (њ) can be thought of as a ligature of the Cyrillic letter n (н) and the Cyrillic soft sign (ь). When Serbian is written using the Latin alphabet (as is done in Croatian, which is almost the same language), this letter is written using the letters nj.

Existing Serbian and Croatian standards were designed to provide a one-to-one mapping between every Cyrillic character used in Serbian and the corresponding Latin character used in Croatian. This approach required using a single character code to represent the nj digraph in Croatian, and Unicode carries that character forward. Capital Nje in Cyrillic (Њ) thus can convert to either NJ or Nj in Latin depending on the context. The fully uppercase form, NJ, is U+01CA LATIN CAPITAL LETTER NJ, and the combined upper-lower form, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, is considered a "titlecase" letter. Three Serbian characters have a titlecase Latin form: љ (lje, which converts to lj), њ (nje, which converts to nj), and џ (dzhe, which converts to dž). These characters were the only three titlecase letters in Unicode 2.x.

Unicode 3.0 added several Greek letters to this category. Some early Greek texts represented certain diphthongs by writing a small letter iota underneath the other vowel rather than after it. For example, you'd see "ai" written as ᾳ. If you just capitalized the alpha ("Ai"), you'd get the titlecase version: ᾼ. In the fully uppercase version ("AI"), the small iota becomes a regular iota again: ΑΙ. These characters are all in the Extended Greek section of the standard and are used only in writing ancient Greek texts. In modern Greek, these diphthongs are written using a regular iota; for example, "ai" is written as αι.
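The three case forms described above can be observed with Python's built-in string methods, which apply the Unicode case mappings. This is only a sketch: `case_forms` and `code_points` are hypothetical helper names, and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

def case_forms(ch):
    """Return the lowercase, titlecase, and uppercase forms of ch."""
    return {"lower": ch.lower(), "title": ch.title(), "upper": ch.upper()}

def code_points(s):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

# U+01CC LATIN SMALL LETTER NJ: each case form is a single character.
nj = case_forms("\u01cc")
for form, text in nj.items():
    print(form, code_points(text), [unicodedata.category(c) for c in text])

# U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI: the uppercase form
# expands to two characters (capital alpha followed by capital iota).
greek = case_forms("\u1fb3")
for form, text in greek.items():
    print(form, code_points(text))
```

Note that the uppercase form of the Greek letter is a two-character string, so code that assumes case conversion preserves length will mishandle it.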

Modifier Letter (Lm)

Just as some things you might conceptually think of as "letters" (vowel signs in various languages) are classified as "marks" in Unicode, the opposite also occurs. The modifier letters are independent forms that don't combine typographically with the characters around them, which is why Unicode doesn't classify them as "marks" (Unicode marks, by definition, combine typographically with their neighbors). Instead of carrying their own sounds, the modifier letters generally modify the sounds of their neighbors. In other words, conceptually they're diacritical marks. Because they occur in the middle of words, most text-analysis processes treat them as letters, so they're classified as letters.

The Unicode modifier letters are generally either International Phonetic Alphabet characters or characters that are used to transliterate certain "real" letters in non-Latin writing systems that don't seem to correspond to a regular Latin letter. For example, U+02BC MODIFIER LETTER APOSTROPHE is typically used to represent the glottal stop, the sound made by (or, more accurately, the absence of sound represented by) the Arabic letter alef, so the Arabic letter is often transliterated as this character. Likewise, U+02B2 MODIFIER LETTER SMALL J is used to represent palatalization, and thus is sometimes used in transliteration as the counterpart of the Cyrillic soft sign.

Other Letter (Lo)

This catch-all category includes everything that's conceptually a "letter," but that doesn't fit into one of the other "letter" categories. Letters from uncased alphabets such as Arabic and Hebrew fall into this category, as do syllables from syllabic writing systems like Kana and Hangul and the Han ideographs.
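To make the five letter subcategories concrete, the sketch below looks up one sample character for each and checks it with `str.isalpha()`, which Python documents as true for characters whose general category is Lu, Ll, Lt, Lm, or Lo. The sample characters and the `LETTER_SAMPLES` name are choices made for this illustration.

```python
import unicodedata

# One sample per letter subcategory (chosen for illustration).
LETTER_SAMPLES = {
    "A": "Lu",        # LATIN CAPITAL LETTER A
    "a": "Ll",        # LATIN SMALL LETTER A
    "\u01cb": "Lt",   # LATIN CAPITAL LETTER N WITH SMALL LETTER J
    "\u02bc": "Lm",   # MODIFIER LETTER APOSTROPHE
    "\u05d0": "Lo",   # HEBREW LETTER ALEF
    "\u304b": "Lo",   # HIRAGANA LETTER KA
    "\u6f22": "Lo",   # a Han ideograph
}

for ch, expected in LETTER_SAMPLES.items():
    actual = unicodedata.category(ch)
    print(f"U+{ord(ch):04X}", actual, "isalpha:", ch.isalpha())

# A combining mark is not a letter, even though it is part of a word.
mark_is_alpha = "\u0308".isalpha()
```

The modifier letter apostrophe counts as alphabetic, which is exactly the behavior the text describes: a text-analysis process that splits words on non-letters will keep it inside the word.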

Marks

Like letters, marks are part of words and carry linguistic information. Unlike letters, marks combine typographically with other characters. For example, U+0308 COMBINING DIAERESIS may look like ¨ when shown alone, but is usually drawn on top of the letter that precedes it. That is, U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS isn't drawn as "a¨", but rather as "ä". All of the Unicode combining marks do this kind of thing.
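The a-plus-diaeresis sequence above can be examined directly. In this sketch, which goes slightly beyond the passage by using normalization, the two-character sequence is composed into the single precomposed character U+00E4 with `unicodedata.normalize`, and decomposed back again.

```python
import unicodedata

decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)
round_trip = unicodedata.normalize("NFD", composed)

# The stored text is two code points, even though it is drawn as one glyph.
print(len(decomposed), [unicodedata.category(c) for c in decomposed])
print(len(composed), f"U+{ord(composed):04X}")

# A nonzero combining class marks a character that attaches to its base.
diaeresis_class = unicodedata.combining("\u0308")
base_class = unicodedata.combining("a")
```

Comparing strings without normalizing them first would treat the composed and decomposed spellings as different text, even though a reader sees the same letter.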

Non-spacing Mark (Mn)

Most of the Unicode combining marks fall into this category. Non-spacing marks don't take up any horizontal space along a line of text: they combine completely with the character that precedes them and fit entirely into that character's space. The various diacritical marks used in European languages, such as the acute and grave accents, the circumflex, the diaeresis, and the cedilla, fall into this category.

Combining Spacing Mark (Mc)

Spacing combining marks interact typographically with their neighbors, but still take up horizontal space along a line of text. They occur mainly in the various Indian and Southeast Asian writing systems. For example, U+093F DEVANAGARI VOWEL SIGN I (ि) is a spacing combining mark. Thus U+0915 DEVANAGARI LETTER KA followed by U+093F DEVANAGARI VOWEL SIGN I is drawn as कि: the vowel sign attaches to the left-hand side of the consonant.

Not all spacing combining marks reorder, however: U+0940 DEVANAGARI VOWEL SIGN II (ी) is also a combining spacing mark. When it follows U+0915 DEVANAGARI LETTER KA, you get की: the vowel attaches to the right-hand side of the consonant, but the two combine typographically.

Enclosing Mark (Me)

Enclosing marks completely surround the characters they modify. For example, U+20DD COMBINING ENCLOSING CIRCLE is drawn as a ring around the character that precedes it. These ten characters are generally used to create symbols.
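The three mark subcategories can be told apart with a lookup on the general category. The sketch below uses hypothetical names (`MARK_KINDS`, `mark_kind`) and also shows that a Devanagari vowel sign drawn to the left of its consonant is still stored after it.

```python
import unicodedata

# Illustrative labels for the three mark subcategories.
MARK_KINDS = {"Mn": "non-spacing", "Mc": "spacing combining", "Me": "enclosing"}

def mark_kind(ch):
    """Return the kind of combining mark, or None if ch is not a mark."""
    return MARK_KINDS.get(unicodedata.category(ch))

for ch in ("\u0301", "\u093f", "\u0940", "\u20dd", "a"):
    print(f"U+{ord(ch):04X}", mark_kind(ch))

# KA followed by VOWEL SIGN I: the sign is drawn on the left of the
# consonant, but the stored (logical) order keeps the consonant first.
ki = "\u0915\u093f"
stored_order = [unicodedata.name(ch) for ch in ki]
print(stored_order)
```

Because the reordering happens only at rendering time, code that processes the text (searching, sorting, truncating) always sees the consonant before its vowel sign.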

Numbers

The Unicode characters that represent numeric quantities are given the "number" property (technically, it should be called the "numeral" property, but that's life). The characters in these categories have additional properties that govern their interpretation as numerals. This category is subdivided as follows.

Decimal-Digit Number (Nd)

The characters in this category can be used as decimal digits. This category includes not only the digits with which we're all familiar ("0123456789"), but similar sets of digits used with other writing systems, such as the Thai digits ("๐๑๒๓๔๕๖๗๘๙").

Letter Number (Nl)

The characters in this category can be either letters or numerals. Many are compatibility composites whose decompositions consist of letters. The Roman numerals and the Hangzhou numerals are the only characters in this category.

Other Number (No)

All of the characters that belong in the "number" category, but not in one of the other subcategories, fall into this one. This category includes various numeric presentation forms, such as superscripts, subscripts, and circled numbers; various fractions; and numerals used in various numeration systems other than the Arabic positional notation used in the West.
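The "additional properties" that govern numeric interpretation are exposed in Python as `unicodedata.decimal`, `unicodedata.digit`, and `unicodedata.numeric`. The sketch below, with a hypothetical `numeric_profile` helper, shows how the three number subcategories differ in which of those values they carry.

```python
import unicodedata

def numeric_profile(ch):
    """Return (category, decimal value, digit value, numeric value).

    Values that a character does not have are reported as None.
    """
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

samples = {
    "THAI DIGIT ONE": "\u0e51",
    "ROMAN NUMERAL EIGHT": "\u2167",
    "VULGAR FRACTION ONE HALF": "\u00bd",
    "SUPERSCRIPT TWO": "\u00b2",
}
for name, ch in samples.items():
    print(name, numeric_profile(ch))

# Decimal digits from any script can be parsed positionally by int().
thai_number = int("\u0e51\u0e52\u0e53")  # Thai digits one, two, three
```

Only Nd characters have a decimal value, which is why `int()` accepts the Thai digits but would reject a Roman numeral or a superscript.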

Punctuation

This category attempts to make sense of the various punctuation characters in Unicode. It breaks down as follows.

Opening Punctuation (Ps)

For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "opening" characters in these pairs are assigned to this category.

Closing Punctuation (Pe)

For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "closing" characters in these pairs are assigned to this category.

Initial-Quote Punctuation (Pi)

Quotation marks occur in opening-closing pairs, just like parentheses do. The problem is that which is which depends on the language. For example, both French and German use quotation marks that look like this: «», but they use them differently.

«In French, a quotation is set off like this.»

»But in German, a quotation is set off like this.«

This category is equivalent to either Ps or Pe, depending on the language.

Final-Quote Punctuation (Pf)

The counterpart to the Pi category, Pf is also used with quotation marks whose usage varies depending on language. It's equivalent to either Ps or Pe depending on language, and it's always the opposite of Pi.

Dash Punctuation (Pd)

This category is self-explanatory. It includes all hyphens and dashes.

Connector Punctuation (Pc)

Characters in this category, such as the middle dot and the underscore, get treated as part of the word in which they appear. That is, they "connect" series of letters together into single words: This_is_all_one_word. An important example is U+30FB KATAKANA MIDDLE DOT, which is used like a hyphen in Japanese.

Other Punctuation (Po)

Punctuation marks that don't fit into any of the other subcategories, including obvious things like the period, comma, and question mark, fall into this category.
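The punctuation subcategories above can be checked for a handful of common characters. This is a sketch with illustrative sample choices; the `PUNCTUATION_SAMPLES` name is hypothetical, and the categories reported are those of the Unicode version bundled with the interpreter, which may differ from the version this text describes for some characters.

```python
import unicodedata

# Sample characters and the subcategory each is expected to report.
PUNCTUATION_SAMPLES = {
    "(": "Ps",        # LEFT PARENTHESIS
    ")": "Pe",        # RIGHT PARENTHESIS
    "\u00ab": "Pi",   # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb": "Pf",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "-": "Pd",        # HYPHEN-MINUS
    "\u2014": "Pd",   # EM DASH
    "_": "Pc",        # LOW LINE (underscore)
    ".": "Po",        # FULL STOP
    "?": "Po",        # QUESTION MARK
}

reported = {ch: unicodedata.category(ch) for ch in PUNCTUATION_SAMPLES}
for ch, cat in reported.items():
    print(f"U+{ord(ch):04X}", cat)

# The category of a quotation mark is fixed per character; deciding whether
# it opens or closes a quotation in a given language is left to the caller.
```

Since the Pi and Pf assignments do not change with language, a quote-matching routine needs the text's language as a separate input rather than relying on the category alone.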

Symbols

This group of categories contains various symbols.

Currency Symbol (Sc)

Self-explanatory.

Mathematical Symbol (Sm)

Mathematical operators.

Modifier Symbol (Sk)

This category contains two main types of characters: the "spacing" versions of the combining marks and a few other symbols that modify the preceding character in some way. Unlike modifier letters, modifier symbols don't necessarily modify the meanings of letters, and they don't necessarily get counted as parts of words.

Other Symbol (So)

This category contains all symbols that didn't fit into one of the other categories.
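A short sketch of the four symbol subcategories follows. It groups a few sample characters by their reported category; `SYMBOL_SAMPLES` and `by_category` are names chosen for this illustration.

```python
import unicodedata
from collections import defaultdict

SYMBOL_SAMPLES = [
    "$",        # DOLLAR SIGN
    "\u20ac",   # EURO SIGN
    "+",        # PLUS SIGN
    "\u2211",   # N-ARY SUMMATION
    "^",        # CIRCUMFLEX ACCENT (spacing version of a combining mark)
    "\u00a8",   # DIAERESIS (spacing version of a combining mark)
    "\u00a9",   # COPYRIGHT SIGN
]

# Group code points by general category.
by_category = defaultdict(list)
for ch in SYMBOL_SAMPLES:
    by_category[unicodedata.category(ch)].append(f"U+{ord(ch):04X}")

for cat in sorted(by_category):
    print(cat, by_category[cat])
```

The spacing circumflex and diaeresis land in Sk, while their combining counterparts (U+0302 and U+0308) are non-spacing marks.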

Separators

These characters mark the boundaries between units of text.

Space Separator (Zs)

This category includes all of the space characters (yes, there's more than one space character).

Paragraph Separator (Zp)

There is exactly one character in this category: the Unicode paragraph separator (U+2029). As its name suggests, it marks the boundary between paragraphs.

Line Separator (Zl)

There's also only one character in this category: the Unicode line separator (U+2028). As its name suggests, it forces a line break without ending a paragraph.

Even though the ASCII carriage-return and line-feed characters are often used as line and paragraph separators, they're not placed in either of these categories. Likewise, the ASCII tab character isn't considered a Unicode space character, even though it probably should be. They're all put in the "Cc" category.
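The split between the true separator categories and the ASCII control characters is easy to see in code. The sketch below (sample choices and the `SEPARATOR_SAMPLES` name are illustrative) also shows that Python's own text-handling methods treat both groups as whitespace or line breaks, regardless of general category.

```python
import unicodedata

SEPARATOR_SAMPLES = {
    " ": "Zs",        # SPACE
    "\u00a0": "Zs",   # NO-BREAK SPACE
    "\u3000": "Zs",   # IDEOGRAPHIC SPACE
    "\u2028": "Zl",   # LINE SEPARATOR
    "\u2029": "Zp",   # PARAGRAPH SEPARATOR
    "\t": "Cc",       # tab
    "\n": "Cc",       # line feed
    "\r": "Cc",       # carriage return
}

reported = {ch: unicodedata.category(ch) for ch in SEPARATOR_SAMPLES}
for ch, cat in reported.items():
    print(f"U+{ord(ch):04X}", cat, "isspace:", ch.isspace())

# str.splitlines() breaks on the Unicode separators and on the ASCII
# line-ending controls alike.
pieces = "a\u2028b\u2029c\nd".splitlines()
```

In other words, the general category alone is not enough to find every whitespace or line-breaking character; a process that filters on Z* categories will miss the tab, line feed, and carriage return.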

Miscellaneous

A number of special character categories don't really fit in with the others.

Control Characters (Cc)

The codes corresponding to the C0 and C1 control characters from the ISO 2022 standard appear in this category. The Unicode standard doesn't officially assign any semantics to these characters (which include the ASCII control characters), but most systems that use Unicode text treat these characters the same way as they treat their counterparts in the source standards. For example, most processes treat the ASCII line-feed character as a line or paragraph separator.

The original idea was to leave the definitions of these code points open, as ISO 2022 does. Over time, however, various Unicode processes and algorithms have attached semantics to these code points, effectively nailing the ISO 6429 semantics to many of them.

Formatting Characters (Cf)

Unicode includes some "control" characters of its own: characters with no visual representation of their own that are used to control how the characters around them are drawn or handled by various processes. These characters are assigned to this category.

Surrogates (Cs)

The code points in the UTF-16 surrogate range belong to this category. Technically, the code points in the surrogate range are treated as unassigned and reserved, but Unicode implementations based on UTF-16 often treat them as characters, handling surrogate pairs the same way as combining character sequences are handled.

Private-Use Characters (Co)

The code points in the private-use ranges are assigned to this category.

Unassigned Code Points (Cn)

All unassigned and noncharacter code points, other than those in the surrogate range, are given this category. These code points aren't listed in the Unicode Character Database (their omission gives them this category) but are listed explicitly in DerivedGeneralCategory.txt.
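As a closing sketch, the five miscellaneous categories can each be demonstrated with one code point. The samples and the `MISC_SAMPLES` name are chosen for illustration; the code prints code point labels only, because a lone surrogate cannot be written to a UTF-8 stream.

```python
import unicodedata

MISC_SAMPLES = {
    0x0007: "Cc",   # BEL, a C0 control
    0x0085: "Cc",   # NEL, a C1 control
    0x200D: "Cf",   # ZERO WIDTH JOINER
    0xD800: "Cs",   # first code point of the surrogate range
    0xE000: "Co",   # first code point of the BMP private-use area
    0xFFFF: "Cn",   # a noncharacter
}

reported = {cp: unicodedata.category(chr(cp)) for cp in MISC_SAMPLES}
for cp, cat in reported.items():
    # Code points without a character name report None here.
    print(f"U+{cp:04X}", cat, unicodedata.name(chr(cp), None))
```

Python strings are sequences of code points rather than UTF-16 units, so it reports a lone surrogate as Cs instead of pairing it with a neighbor.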

Date posted: 26/03/2019, 17:13
