Unicode

  • Origin and development
  • Mapping and encodings
  • Adoption
  • Issues
  • See also
  • Notes
  • References
  • Further reading
  • External links

Unicode
Alias(es): Universal Coded Character Set (UCS)
Language(s): International
Standard: Unicode Standard
Encoding formats: UTF-8, UTF-16, GB18030 (less common: UTF-32, BOCU, SCSU, UTF-7)
Preceded by: ISO 8859, various others

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of May 2019 the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters (consisting of 137,766 graphic characters, 163 format characters and 65 control characters) covering 150 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical.
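
As a rough illustration of the character classes mentioned above, the following sketch counts control characters using Python's standard `unicodedata` module. The helper name `count_category` is an assumption made for this example. The count of 65 control characters is stable, but counts for other classes depend on the Unicode version bundled with the interpreter, which may differ from 12.1.

```python
import unicodedata

# Total number of code points: 17 planes of 65,536 each.
TOTAL_CODE_POINTS = 0x110000


def count_category(category: str) -> int:
    """Count code points whose General_Category equals `category` (hypothetical helper)."""
    return sum(
        1
        for cp in range(TOTAL_CODE_POINTS)
        if unicodedata.category(chr(cp)) == category
    )


# "Cc" is the General_Category value for control characters.
control_count = count_category("Cc")

print("Unicode data version:", unicodedata.unidata_version)
print("Control characters:", control_count)
```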

The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).[1]
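
Two of these related items, normalization and the bidirectional character property, can be explored with Python's standard `unicodedata` module. This is a minimal sketch; the sample characters are chosen here for illustration only.

```python
import unicodedata

# Canonical decomposition (NFD) splits "é" into a base letter plus a combining accent.
composed = "\u00e9"  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)
# Canonical composition (NFC) joins them back into one code point.
recomposed = unicodedata.normalize("NFC", decomposed)

# Bidirectional class: one of the character properties used to decide display order.
bidi_latin = unicodedata.bidirectional("a")        # Latin letter
bidi_hebrew = unicodedata.bidirectional("\u05d0")  # Hebrew letter alef
bidi_arabic = unicodedata.bidirectional("\u0627")  # Arabic letter alef

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
print(recomposed == composed)                   # True
print(bidi_latin, bidi_hebrew, bidi_arabic)     # L R AL
```

The classes `R` and `AL` mark right-to-left characters, while `L` marks left-to-right ones.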

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework.

Unicode can be implemented by different character encodings. The Unicode Standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and UCS-2, a precursor of UTF-16 that lacks full support for Unicode. GB18030, which is standardized in China, implements Unicode fully but is not an official Unicode standard.
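
The idea that one repertoire can be carried by several encodings can be checked with a short sketch. It relies on Python's built-in codecs. The sample string and the helper name `round_trips` are assumptions made for this example.

```python
# Sample text mixing ASCII, a Latin-1 letter, a CJK ideograph, and an emoji outside the BMP.
text = "Unicode: \u00e9, \u4e2d, \U0001F600"

# Encodings that cover the full Unicode repertoire.
ENCODINGS = ("utf-8", "utf-16", "utf-32", "gb18030")


def round_trips(value: str, encoding: str) -> bool:
    """Return True if encoding and then decoding gives back the same text."""
    return value.encode(encoding).decode(encoding) == value


results = {name: round_trips(text, name) for name in ENCODINGS}

for name, ok in results.items():
    print(f"{name}: round trip {'ok' if ok else 'failed'}")
```

The byte sequences differ between encodings, but each one decodes back to the same sequence of code points.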

UTF-8, the dominant encoding on the World Wide Web (used in over 94% of websites as of November 2019),[2] uses one byte for the first 128 code points, and up to 4 bytes for other characters.[3] The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text.
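
The variable width of UTF-8 and its ASCII compatibility can be observed directly. The following is a minimal sketch in Python; the helper name `utf8_length` and the sample characters are illustrative assumptions.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One sample per byte length: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# Any ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```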

UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). With 1,114,112 code points possible on 17 planes, and with over 137,000 characters defined as of version 12.1, UCS-2 can represent fewer than half of all encoded Unicode characters. UCS-2 is therefore outdated, though it is still widely used in software. UTF-16 extends UCS-2 by using the same 16-bit encoding for the Basic Multilingual Plane and a 4-byte encoding (a pair of 16-bit surrogate code units) for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is a valid UTF-16 text.
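
The 4-byte form that UTF-16 uses beyond the BMP is a pair of 16-bit code units taken from the reserved range U+D800–U+DFFF. The sketch below shows the arithmetic and compares it with Python's own UTF-16 codec. The function name `to_surrogate_pair` is a hypothetical helper, not part of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above the BMP into high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00

# Python's big-endian UTF-16 codec yields the same two code units.
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())  # d83dde00

# A BMP character needs only one 16-bit unit, as in UCS-2.
bmp_encoded = "A".encode("utf-16-be")
print(len(bmp_encoded))  # 2
```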

UTF-32 (also referred to as UCS-4) uses four bytes for each character. As in UCS-2, the number of bytes per character is fixed, which facilitates character indexing; unlike UCS-2, however, UTF-32 can encode all Unicode code points. Because each character uses four bytes, UTF-32 takes significantly more space than other encodings and is not widely used.
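
The trade-off between fixed-width indexing and storage size can be sketched as follows. The sample string and the helper name `code_point_at` are assumptions made for this example. The little-endian codec variants are used so that no byte order mark is added to the output.

```python
text = "Hello, \U0001F600"  # 7 ASCII characters plus one emoji outside the BMP

# Encoded size in bytes for each encoding form.
sizes = {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 11, 'utf-16-le': 18, 'utf-32-le': 32}


def code_point_at(utf32_le: bytes, index: int) -> int:
    """Read the code point at `index` from little-endian UTF-32 data by offset alone."""
    start = 4 * index
    return int.from_bytes(utf32_le[start:start + 4], "little")


data = text.encode("utf-32-le")
last = code_point_at(data, 7)
print(f"U+{last:04X}")  # U+1F600
```

Because every code point occupies exactly four bytes, the lookup is a single multiplication, at the cost of roughly three times the size of UTF-8 for this mostly ASCII sample.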

Other Languages
Afrikaans: Unicode
Alemannisch: Unicode
አማርኛ: ዩኒኮድ
العربية: يونيكود
অসমীয়া: ইউনিক’ড
asturianu: Unicode
azərbaycanca: Unicode
বাংলা: ইউনিকোড
Bân-lâm-gú: Unicode
беларуская: Унікод
беларуская (тарашкевіца)‎: Юнікод
български: Уникод
Boarisch: Unicode
bosanski: Unicode
brezhoneg: Unicode
català: Unicode
Чӑвашла: Юникод
čeština: Unicode
Cymraeg: Unicode
dansk: Unicode
Deutsch: Unicode
eesti: Unicode
Ελληνικά: Unicode
español: Unicode
Esperanto: Unikodo
euskara: Unicode
فارسی: یونی‌کد
français: Unicode
Gaeilge: Unicode
galego: Unicode
ગુજરાતી: યુનિકોડ
客家語/Hak-kâ-ngî: Unicode
한국어: 유니코드
հայերեն: Յունիկոդ
हिन्दी: यूनिकोड
hrvatski: Unikod
Ilokano: Unicode
Bahasa Indonesia: Unicode
interlingua: Unicode
íslenska: Unicode
italiano: Unicode
עברית: יוניקוד
Jawa: Unicode
ಕನ್ನಡ: ಯುನಿಕೋಡ್
ქართული: უნიკოდი
कॉशुर / کٲشُر: यूनिकोड
қазақша: Юникод
kurdî: Unicode
Кыргызча: Юникод
Latina: Unicodex
latviešu: Unikods
lietuvių: Unikodas
Lingua Franca Nova: Unicode
magyar: Unicode
मैथिली: युनिकोड
македонски: Уникод
മലയാളം: യൂണികോഡ്
मराठी: युनिकोड
მარგალური: იუნიკოდი
Bahasa Melayu: Unicode
Mìng-dĕ̤ng-ngṳ̄: Unicode
монгол: Юникод
မြန်မာဘာသာ: ယူနီကုဒ်
Nederlands: Unicode
नेपाली: युनिकोड
नेपाल भाषा: युनिकोड
日本語: Unicode
norsk: Unicode
norsk nynorsk: Unicode
occitan: Unicode
олык марий: Unicode
ਪੰਜਾਬੀ: ਯੂਨੀਕੋਡ
Plattdüütsch: Unicode
polski: Unikod
português: Unicode
română: Unicode
русский: Юникод
саха тыла: Юникод
संस्कृतम्: युनिकोड
Scots: Unicode
shqip: Unicode
සිංහල: යුනිකෝඩ්
Simple English: Unicode
slovenčina: Unicode
slovenščina: Unicode
کوردی: یوونیکۆد
српски / srpski: Junikod
srpskohrvatski / српскохрватски: Unikod
Sunda: Unicode
suomi: Unicode
svenska: Unicode
Tagalog: Unikodigo
తెలుగు: యూనికోడ్
тоҷикӣ: Юникод
ᏣᎳᎩ: ᏳᏂᎪᏛ
Türkçe: Unicode
українська: Юнікод
اردو: یونیکوڈ
ئۇيغۇرچە / Uyghurche: Unicode
Tiếng Việt: Unicode
walon: Unicôde
文言: 萬國碼
吴语: Unicode
ייִדיש: יוניקאד
Yorùbá: Unicode
粵語: 統一碼
中文: Unicode