knowledge/technology/files/Unicode.md
2024-03-07 00:06:37 +01:00

3.6 KiB

obj website aliases
concept https://unicode.org
utf8
UTF-8

Unicode

Unicode is a standardized character encoding system that aims to represent text in most of the world's writing systems consistently. It provides a unique code point for every character, regardless of platform, program, or language. Unicode allows for the consistent representation of text across different devices and applications, fostering global communication and interoperability.

Code Points:

Unicode assigns a unique numerical value to each character, symbol, or glyph, known as a code point. These code points are typically represented in hexadecimal.

Character Sets:

Unicode encompasses a vast range of character sets, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many more. This inclusivity allows Unicode to support a diverse array of languages and scripts.

Multilingual Support:

Unicode is designed to be multilingual, providing a single character encoding standard that can represent text in multiple languages simultaneously.

UTF Encoding Schemes:

Unicode Transformation Format (UTF) is the encoding scheme used to serialize Unicode code points into binary data. Common UTF variants include UTF-8, UTF-16, and UTF-32.

Compatibility and Normalization:

Unicode addresses compatibility issues, offering compatibility equivalence for characters that look similar but have different underlying code points. Unicode normalization ensures consistent representation.

Character Representation

1. Code Point Representation:

  • Represent a Unicode code point using the following syntax: U+XXXX, where XXXX is the hexadecimal code point.

Example: The code point for the letter 'A' is U+0041.

2. Escape Sequences:

In programming languages and markup languages, Unicode characters can be represented using escape sequences. For example, \uXXXX in JavaScript or \u{XXXX} in languages like Rust and JavaScript ES6.

Example: The escape sequence for the heart symbol (❤) is \u2764.

3. UTF-8 Encoding:

UTF-8 is a variable-width encoding scheme that represents Unicode characters using 8-bit code units. It is widely used for its compact representation of ASCII characters and compatibility with existing systems.

4. UTF-16 Encoding:

UTF-16 uses 16-bit code units and is common in systems that work with surrogate pairs for characters outside the Basic Multilingual Plane (BMP).

5. UTF-32 Encoding:

UTF-32 uses 32-bit code units for each character, providing a fixed-width encoding. It simplifies random access to characters but may consume more memory.

Usage and Applications

1. Programming and Software Development:

  • Unicode is crucial in programming for supporting a diverse set of characters in strings and text processing.

2. Web and Document Standards:

  • HTML, XML, and other web standards rely on Unicode for consistent representation of text across different platforms and devices.

3. Localization and Internationalization:

  • Unicode facilitates the localization of software and content for different languages and regions, enabling a global audience.

4. Operating Systems:

  • Modern operating systems, such as Windows, macOS, and Linux, use Unicode for character encoding, ensuring compatibility and interoperability.

5. Communication and Social Media:

  • Unicode is fundamental in digital communication, ensuring that users can express themselves using a wide range of symbols, emojis, and scripts.