| Origin and development | |
| 2>
Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world, but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).
In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
The first 256 code points were made identical to the content of ISO 8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese, and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs, rather than at half the width. For other examples, see Duplicate characters in Unicode.
| History | |
| 3>
The origins of Unicode date back to 1987, when Joe Becker from Xerox and Lee Collins and Mark Davis from Apple started investigating the practicalities of creating a universal character set.[2] In August 1988, Joe Becker published a draft proposal for an "international/multilingual text character encoding system, tentatively called Unicode". Although the term "Unicode" had previously been used for other purposes, such as the name of a programming language developed for the UNIVAC in the late 1950s,[3] and most notably a universal telegraphic phrase-book that was first published in 1889,[4] Becker may not have been aware of these earlier usages, and he explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".[5]
In this document, entitled Unicode 88, Becker outlined a 16-bit character model:[5]
Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:[5]
Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.
In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun Microsystems, and in 1990 Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan of NeXT joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready. The Unicode Consortium was incorporated on January 3, 1991, in the state of California, and in October 1991, the first volume of the Unicode Standard was published. The second volume, covering Han ideographs, was published in June 1992.
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g. Egyptian Hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding.
| Architecture and terminology | |
| 3>
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal).[6] Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g. U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD). Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits to indicate a code point, and allowed "U+" to be used only with exactly four digits to indicate a code unit, such as a single byte of a multibyte UTF-8 encoding of a code point.[7]
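The notation can be sketched in a few lines of Python. This is only an illustration: the helper name `format_code_point` and the constant `CODESPACE_SIZE` are assumptions made for the example, not names taken from the standard.

```python
# Code points run from 0 to 10FFFF (hexadecimal), inclusive.
CODESPACE_SIZE = 0x10FFFF + 1


def format_code_point(cp: int) -> str:
    """Return the "U+" notation: four hex digits, or more only when required."""
    if not 0 <= cp < CODESPACE_SIZE:
        raise ValueError(f"{cp:#x} is outside the Unicode codespace")
    return f"U+{cp:04X}"


print(CODESPACE_SIZE)               # 1114112
print(format_code_point(0x58))      # U+0058
print(format_code_point(0xE0001))   # U+E0001
print(format_code_point(0x10FFFD))  # U+10FFFD
```

Padding to a minimum of four digits is enough to reproduce all three examples from the text, because five- and six-digit values are simply printed at their natural width.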
| Code point planes and blocks | |
| 4>
Main article: Unicode plane
The Unicode codespace is divided into seventeen planes, numbered 0 to 16:
Unicode planes and code point (character) ranges
| Group | Plane | Name | Abbreviation | Code point range | Sub-ranges listed |
| Basic | Plane 0 | Basic Multilingual Plane | BMP | 0000–FFFF | 0000–0FFF, 1000–1FFF, 2000–2FFF, 3000–3FFF, 4000–4FFF, 5000–5FFF, 6000–6FFF, 7000–7FFF, 8000–8FFF, 9000–9FFF, A000–AFFF, B000–BFFF, C000–CFFF, D000–DFFF, E000–EFFF, F000–FFFF |
| Supplementary | Plane 1 | Supplementary Multilingual Plane | SMP | 10000–1FFFF | 10000–10FFF, 11000–11FFF, 12000–12FFF, 13000–13FFF, 16000–16FFF, 1B000–1BFFF, 1D000–1DFFF, 1F000–1FFFF |
| Supplementary | Plane 2 | Supplementary Ideographic Plane | SIP | 20000–2FFFF | 20000–20FFF, 21000–21FFF, 22000–22FFF, 23000–23FFF, 24000–24FFF, 25000–25FFF, 26000–26FFF, 27000–27FFF, 28000–28FFF, 29000–29FFF, 2A000–2AFFF, 2B000–2BFFF, 2F000–2FFFF |
| Supplementary | Planes 3–13 | Unassigned | - | 30000–DFFFF | |
| Supplementary | Plane 14 | Supplementary Special-purpose Plane | SSP | E0000–EFFFF | E0000–E0FFF |
| Supplementary | Planes 15–16 | Supplementary Private Use Area | S PUA A/B | F0000–10FFFF | 15: PUA-A, F0000–FFFFF; 16: PUA-B, 100000–10FFFF |
All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in Planes 1 through 16 (supplementary planes, or, informally, astral planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.
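The relationship between planes and encoded length can be checked with a short Python sketch. The helper names `plane_of` and `encoded_sizes` are made up for this example, and the sample code points are arbitrary choices.

```python
def plane_of(cp: int) -> int:
    """Plane number (0 to 16): each plane holds 0x10000 code points."""
    return cp >> 16


def encoded_sizes(cp: int) -> tuple:
    """Return (UTF-8 bytes, UTF-16 code units) for one non-surrogate code point."""
    ch = chr(cp)
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2  # two bytes per code unit
    return utf8_bytes, utf16_units


for cp in (0x58, 0xE9, 0x20AC, 0x1F600, 0x10FFFD):
    print(f"U+{cp:04X}", "plane", plane_of(cp), encoded_sizes(cp))
# U+0058 plane 0 (1, 1)
# U+00E9 plane 0 (2, 1)
# U+20AC plane 0 (3, 1)
# U+1F600 plane 1 (4, 2)
# U+10FFFD plane 16 (4, 2)
```

The `utf-16-be` codec is used rather than `utf-16` so that no byte order mark is counted among the code units.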
Within each plane, characters are allocated within named blocks of related characters. Although blocks are an arbitrary size, they are always a multiple of 16 code points, and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks.
| Character General Category | |
| 4>
Each code point has a single General Category property. The major categories are: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Within these categories, there are subdivisions. The General Category is not sufficient for every purpose, since legacy encodings have assigned multiple characteristics to a single code point. For example, U+000A <control-000A> Line feed (LF) in ASCII is both a control and a formatting separator; in Unicode its General Category is "Other, Control". Often, other properties must be used to specify the characteristics and behaviour of a code point. The possible General Categories are:
General Category (Unicode Character Property)[a]
| Value | Category (major, minor) | Basic type[b] | Character assigned[b] | Fixed[c] | Remarks |
| Letter | | | | | |
| Lu | Letter, uppercase | Graphic | Character | | |
| Ll | Letter, lowercase | Graphic | Character | | |
| Lt | Letter, titlecase | Graphic | Character | | |
| Lm | Letter, modifier | Graphic | Character | | |
| Lo | Letter, other | Graphic | Character | | |
| Mark | | | | | |
| Mn | Mark, nonspacing | Graphic | Character | | |
| Mc | Mark, spacing combining | Graphic | Character | | |
| Me | Mark, enclosing | Graphic | Character | | |
| Number | | | | | |
| Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De[c] |
| Nl | Number, letter | Graphic | Character | | |
| No | Number, other | Graphic | Character | | |
| Punctuation | | | | | |
| Pc | Punctuation, connector | Graphic | Character | | |
| Pd | Punctuation, dash | Graphic | Character | | |
| Ps | Punctuation, open | Graphic | Character | | |
| Pe | Punctuation, close | Graphic | Character | | |
| Pi | Punctuation, initial quote | Graphic | Character | | May behave like Ps or Pe depending on usage |
| Pf | Punctuation, final quote | Graphic | Character | | May behave like Ps or Pe depending on usage |
| Po | Punctuation, other | Graphic | Character | | |
| Symbol | | | | | |
| Sm | Symbol, math | Graphic | Character | | |
| Sc | Symbol, currency | Graphic | Character | | |
| Sk | Symbol, modifier | Graphic | Character | | |
| So | Symbol, other | Graphic | Character | | |
| Separator | | | | | |
| Zs | Separator, space | Graphic | Character | | |
| Zl | Separator, line | Format | Character | | Only U+2028 line separator (LSEP) |
| Zp | Separator, paragraph | Format | Character | | Only U+2029 paragraph separator (PSEP) |
| Other | | | | | |
| Cc | Other, control | Control | Character | Fixed 65 | No name[d], <control> |
| Cf | Other, format | Format | Character | | |
| Cs | Other, surrogate | Surrogate | Not | Fixed 2048 | No name[d], <surrogate> |
| Co | Other, private use | Private-use | Not | Fixed 6400 in BMP, 131,068 in Planes 15–16 | No name[d], <private-use> |
| Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name[d], <noncharacter> |
| Cn | Other, not assigned | Reserved | Not | Not fixed | No name[d], <reserved> |
[a] Unicode 6.0, Chapter 4, table 4-9
[b] Unicode 6.0, Chapter 2, table 2-3: Types of code points
[c] Stability policy: Property Value Stability and table. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
[d] Unicode 6.0, Chapter 4, table 4-12 Name=""; a Code Point Label may be used to identify a nameless code point. E.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.
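The two-letter values in the table can be looked up programmatically. The sketch below uses Python's standard `unicodedata` module, which ships the character database of a newer Unicode version than the one described here, so it is an illustration rather than a check against Unicode 6.x. The sample characters are arbitrary choices.

```python
import unicodedata

# Sample characters mapped to the General Category value expected from the table.
samples = {
    "A": "Lu", "a": "Ll", "5": "Nd", "(": "Ps", ")": "Pe", "+": "Sm",
    " ": "Zs", "\u2028": "Zl", "\u2029": "Zp", "\n": "Cc",
    "\u200d": "Cf", "\ud800": "Cs", "\ue000": "Co", "\uffff": "Cn",
}

# Query the database for each sample character.
observed = {ch: unicodedata.category(ch) for ch in samples}

for ch, value in observed.items():
    print(f"U+{ord(ch):04X}", value)
```

Surrogates (Cs), private-use code points (Co) and noncharacters (Cn) are included to show that every code point has a General Category value, even those to which no named character is assigned.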
Code points in the range U+D800..U+DBFF (1,024 code points) are known as high-surrogate code points, and code points in the range U+DC00..U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate) followed by a low-surrogate code point (also known as a trailing surrogate) together form a surrogate pair, used in UTF-16 to represent the 1,048,576 code points outside the BMP. High and low surrogate code points are not valid by themselves. Thus the range of code points that are available for use as characters is U+0000..U+D7FF and U+E000..U+10FFFF (1,112,064 code points). The value of these code points (i.e. excluding surrogates) is sometimes referred to as the character's scalar value.
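The arithmetic behind surrogate pairs can be sketched as follows. The function names are illustrative assumptions. The mapping itself is the usual UTF-16 one: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def encode_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogate pairs")
    offset = cp - 0x10000             # 20-bit value, 0 to 0xFFFFF
    high = 0xD800 + (offset >> 10)    # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def decode_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = encode_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                     # D83D DE00
print(f"{decode_surrogate_pair(high, low):04X}")   # 1F600
print(1024 * 1024)                                 # 1048576 representable pairs
print(0x110000 - 2048)                             # 1112064 scalar values
```

The last two lines reproduce the counts given above: 1,024 high surrogates times 1,024 low surrogates, and the full codespace minus the 2,048 surrogate code points.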
Certain noncharacter code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six noncharacters: U+FDD0..U+FDEF and any code point ending in the value FFFE or FFFF (i.e. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.[12]
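A small Python sketch can confirm the count of sixty-six. The function name `is_noncharacter` is an assumption made for the example.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    in_fdd0_range = 0xFDD0 <= cp <= 0xFDEF
    # Masking with 0xFFFE keeps bits 1 to 15, so FFFE and FFFF both match.
    ends_in_fffe_or_ffff = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_range or ends_in_fffe_or_ffff


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```

The total breaks down as 32 code points in U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.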
Reserved code points are those code points which are available for use as encoded characters, but are not yet defined as characters by Unicode.
Private-use code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard[13] so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace:
Private Use Area: U+E000..U+F8FF (6,400 characters)
Supplementary Private Use Area-A: U+F0000..U+FFFFD (65,534 characters)
Supplementary Private Use Area-B: U+100000..U+10FFFD (65,534 characters).
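The three ranges and their sizes can be written down directly. In the sketch below the dictionary and the helper `private_use_area` are hypothetical names used only for illustration.

```python
# First and last code point of each private-use area, as listed above.
PRIVATE_USE_AREAS = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}


def private_use_area(cp: int):
    """Return the name of the private-use area containing cp, or None."""
    for name, (first, last) in PRIVATE_USE_AREAS.items():
        if first <= cp <= last:
            return name
    return None


sizes = {name: last - first + 1 for name, (first, last) in PRIVATE_USE_AREAS.items()}
print(sizes)                      # 6400, 65534 and 65534 code points
print(private_use_area(0xE000))   # Private Use Area
print(private_use_area(0xFFFFE))  # None (a noncharacter, not private use)
```

Each supplementary area stops at ...FFFD because the last two code points of every plane are noncharacters.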
Graphic characters are characters defined by Unicode to have a particular semantic, and either have a visible glyph shape or represent a visible space. As of Unicode 6.1 there are 109,975 graphic characters.
Format characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER may be used to change the default shaping behavior of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 141 format characters in Unicode 6.1.
Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are reserved as control codes, and correspond to the C0 and C1 control codes defined in ISO/IEC 6429. Of these, U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts.
Graphic characters, format characters, control code characters, and private use characters are known collectively as assigned characters.
| Abstract characters | |
| 4>
The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of abstract characters that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.[14] However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.[15]
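The Lithuanian example can be inspected with Python's standard `unicodedata` module. This is a sketch for illustration; the variable names are arbitrary.

```python
import unicodedata

# One abstract character, represented as a sequence of three code points.
sequence = "\u012f\u0307\u0301"

names = [unicodedata.name(ch) for ch in sequence]
for ch, name in zip(sequence, names):
    print(f"U+{ord(ch):04X}", name)
# U+012F LATIN SMALL LETTER I WITH OGONEK
# U+0307 COMBINING DOT ABOVE
# U+0301 COMBINING ACUTE ACCENT

print(len(sequence))  # 3
```

A reader would perceive the rendered result as a single letter, yet the string has length three, which is why code that counts or slices "characters" by code point can split such a sequence apart.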
All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.[12] In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. For example, U+A015 ꀕ YI SYLLABLE WU has the formal alias YI SYLLABLE ITERATION MARK, and U+FE18 ︘ PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (sic) has the formal alias PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.[16]
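Both the immutable names and their formal aliases can be seen from Python, assuming version 3.3 or later, where `unicodedata.lookup` also accepts name aliases. The sketch is illustrative only.

```python
import unicodedata

# The official names never change, even when they are known to be wrong.
official_yi = unicodedata.name("\ua015")
official_bracket = unicodedata.name("\ufe18")
print(official_yi)       # YI SYLLABLE WU
print(official_bracket)  # ... LENTICULAR BRAKCET (misspelling preserved)

# The formal aliases resolve to the same code points.
alias_yi = unicodedata.lookup("YI SYLLABLE ITERATION MARK")
alias_bracket = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
)
print(f"U+{ord(alias_yi):04X}", f"U+{ord(alias_bracket):04X}")  # U+A015 U+FE18
```

`unicodedata.name` always reports the official name, so an application that wants to display the corrected alias has to obtain it from another source.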
| Standard | |
| 3>
The Unicode Consortium, based in California, is a nonprofit organization that coordinates Unicode's development. There are various levels of membership, and any company or individual willing to pay the membership dues may join this organization. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including Adobe Systems, Apple, Google, IBM, Microsoft, Oracle Corporation, Sun Microsystems, and Yahoo.[17]
The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
| Versions | |
| 3>
Unicode is developed in conjunction with the International Organization for Standardization and shares the character repertoire with ISO/IEC 10646: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but The Unicode Standard contains much more information for implementers, covering—in depth—topics such as bitwise encoding, collation and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting bidirectional text. The two standards do use slightly different terminology.
The Consortium first published The Unicode Standard in 1991 (Volume 1, ISBN 0-201-56788-1), and continues to develop standards based on that original work. The latest major version of the standard, Unicode 6.1, was published in 2012 and is available from the Consortium's web site. The last version to be published in book form was Unicode 5.0 (ISBN 0-321-48091-0); since Unicode 6.0 the standard has no longer been published in book form.
Thus far the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g. "version 4.0.1"), and are omitted in the table below.[18]
Unicode versions
| Version | Date | Book | Corresponding ISO/IEC 10646 Edition | Scripts | Characters (#) | Notable additions |
| 1.0.0 | October 1991 | ISBN 0-201-56788-1 (Vol.1) | | 24 | 7,161 | Initial repertoire covers these scripts: Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek and Coptic, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Tamil, Telugu, Thai, and Tibetan.[19] |
| 1.0.1 | June 1992 | ISBN 0-201-60845-6 (Vol.2) | | 25 | 28,359 | The initial set of 20,902 CJK Unified Ideographs is defined.[20] |
| 1.1 | June 1993 | | ISO/IEC 10646-1:1993 | 24 | 34,233 | 4,306 more Hangul syllables added to original set of 2,350 characters. Tibetan removed.[21] |
| 2.0 | July 1996 | ISBN 0-201-48345-9 | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 | 25 | 38,950 | Original set of Hangul syllables removed, and a new set of 11,172 Hangul syllables added at a new location. Tibetan added back in a new location and with a different character repertoire. Surrogate character mechanism defined, and Plane 15 and Plane 16 Private Use Areas allocated.[22] |
| 2.1 | May 1998 | | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, and two characters from Amendment 18 | 25 | 38,952 | Euro sign added.[23] |
| 3.0 | September 1999 | ISBN 0-201-61633-5 | ISO/IEC 10646-1:2000 | 38 | 49,259 | Cherokee, Ethiopic, Khmer, Mongolian, Burmese, Ogham, Runic, Sinhala, Syriac, Thaana, Unified Canadian Aboriginal Syllabics, and Yi Syllables added, as well as a set of Braille patterns.[24] |
| 3.1 | March 2001 | | ISO/IEC 10646-1:2000; ISO/IEC 10646-2:2001 | 41 | 94,205 | Deseret, Gothic and Old Italic added, as well as sets of symbols for Western music and Byzantine music, and 42,711 additional CJK Unified Ideographs.[25] |
| 3.2 | March 2002 | | ISO/IEC 10646-1:2000 plus Amendment 1; ISO/IEC 10646-2:2001 | 45 | 95,221 | Philippine scripts Buhid, Hanunó'o, Tagalog, and Tagbanwa added.[26] |
| 4.0 | April 2003 | ISBN 0-321-18578-1 | ISO/IEC 10646:2003 | 52 | 96,447 | Cypriot syllabary, Limbu, Linear B, Osmanya, Shavian, Tai Le, and Ugaritic added, as well as Hexagram symbols.[27] |
| 4.1 | March 2005 | | ISO/IEC 10646:2003 plus Amendment 1 | 59 | 97,720 | Buginese, Glagolitic, Kharoshthi, New Tai Lue, Old Persian, Syloti Nagri, and Tifinagh added, and Coptic was disunified from Greek. Ancient Greek numbers and musical symbols were also added.[28] |
| 5.0 | July 2006 | ISBN 0-321-48091-0 | ISO/IEC 10646:2003 plus Amendments 1 and 2, and four characters from Amendment 3 | 64 | 99,089 | Balinese, Cuneiform, N'Ko, Phags-pa, and Phoenician added.[29] |
| 5.1 | April 2008 | | ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 | 75 | 100,713 | Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai added, as well as sets of symbols for the Phaistos Disc, Mahjong tiles, and Domino tiles. There were also important additions for Burmese, additions of letters and Scribal abbreviations used in medieval manuscripts, and the addition of capital ß.[30] |
| 5.2 | October 2009 | | ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6 | 90 | 107,361 | Avestan, Bamum, Egyptian hieroglyphs (the Gardiner Set, comprising 1,071 characters), Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham and Tai Viet added. 4,149 additional CJK Unified Ideographs (CJK-C), as well as extended Jamo for Old Hangul, and characters for Vedic Sanskrit.[31] |
| 6.0 | October 2010 | | ISO/IEC 10646:2010 plus the Indian rupee sign | 93 | 109,449 | Batak, Brahmi, Mandaic, playing card symbols, transport and map symbols, alchemical symbols, emoticons and emoji.[32] |
| 6.1 | January 2012 | | ISO/IEC 10646:2012 | 100 | 110,181 | Chakma, … |