knowledge/technology/files/Unicode.md
2023-12-04 11:02:23 +01:00

58 lines
3.6 KiB
Markdown

---
obj: concept
website: https://unicode.org
---
# Unicode
Unicode is a standardized character encoding system that aims to represent text in most of the world's writing systems consistently. It provides a unique code point for every character, regardless of platform, program, or language. Unicode allows for the consistent representation of text across different devices and applications, fostering global communication and interoperability.
## **Code Points:**
Unicode assigns a unique numerical value to each character, symbol, or glyph, known as a code point. These code points are typically represented in hexadecimal.
## **Character Sets:**
Unicode encompasses a vast range of character sets, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many more. This inclusivity allows Unicode to support a diverse array of languages and scripts.
## **Multilingual Support:**
Unicode is designed to be multilingual, providing a single character encoding standard that can represent text in multiple languages simultaneously.
## **UTF Encoding Schemes:**
Unicode Transformation Format (UTF) is the encoding scheme used to serialize Unicode code points into binary data. Common UTF variants include UTF-8, UTF-16, and UTF-32.
## **Compatibility and Normalization:**
Unicode addresses compatibility issues, offering compatibility equivalence for characters that look similar but have different underlying code points. Unicode normalization ensures consistent representation.
## Character Representation
### 1. **Code Point Representation:**
- Represent a Unicode code point using the following syntax: U+XXXX, where XXXX is the hexadecimal code point.
Example: The code point for the letter 'A' is `U+0041`.
### 2. **Escape Sequences:**
In programming languages and markup languages, Unicode characters can be represented using escape sequences. For example, `\uXXXX` in JavaScript or `\u{XXXX}` in languages like [Rust](../programming/languages/Rust.md) and JavaScript ES6.
Example: The escape sequence for the heart symbol (❤) is `\u2764`.
### 3. **UTF-8 Encoding:**
UTF-8 is a variable-width encoding scheme that represents Unicode characters using 8-bit code units. It is widely used for its compact representation of [ASCII](ASCII.md) characters and compatibility with existing systems.
### 4. **UTF-16 Encoding:**
UTF-16 uses 16-bit code units and is common in systems that work with surrogate pairs for characters outside the Basic Multilingual Plane (BMP).
### 5. **UTF-32 Encoding:**
UTF-32 uses 32-bit code units for each character, providing a fixed-width encoding. It simplifies random access to characters but may consume more memory.
## Usage and Applications
### 1. **Programming and Software Development:**
- Unicode is crucial in programming for supporting a diverse set of characters in strings and text processing.
### 2. **Web and Document Standards:**
- [HTML](../internet/HTML.md), [XML](XML.md), and other web standards rely on Unicode for consistent representation of text across different platforms and devices.
### 3. **Localization and Internationalization:**
- Unicode facilitates the localization of software and content for different languages and regions, enabling a global audience.
### 4. **Operating Systems:**
- Modern operating systems, such as [Windows](../windows/Windows.md), [macOS](../macos/macOS.md), and [Linux](../linux/Linux.md), use Unicode for character encoding, ensuring compatibility and interoperability.
### 5. **Communication and Social Media:**
- Unicode is fundamental in digital communication, ensuring that users can express themselves using a wide range of symbols, emojis, and scripts.