---
obj: concept
website: https://unicode.org
aliases: ["utf8", "UTF-8"]
---

# Unicode
Unicode is a character encoding standard that aims to represent text in most of the world's writing systems consistently. It assigns a unique code point to every character, regardless of platform, program, or language, so that the same text is represented the same way across devices and applications.
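
## **Code Points:**
Unicode assigns each character a unique numerical value, known as a code point. Code points range from `U+0000` to `U+10FFFF` and are conventionally written in hexadecimal. A code point identifies an abstract character, not a glyph: the visual shape is determined by the font.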
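
As a minimal sketch of the mapping between characters and code points, the following uses Python's built-in `ord` and `chr`. Python is chosen here only for illustration, and the helper name `describe` is hypothetical.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character in decimal and hexadecimal."""
    cp = ord(ch)  # character -> integer code point
    return f"{ch!r}: decimal {cp}, hex {cp:X}"


print(describe("A"))       # 'A': decimal 65, hex 41
print(describe("\u20ac"))  # the euro sign, code point 0x20AC
print(chr(0x41))           # integer code point -> character: A
```

`ord` and `chr` are inverses of each other over the whole code point range, which is all a code point is: an integer label for a character.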

## **Character Sets:**
Unicode defines a single character set that covers a vast range of scripts, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many more. This inclusivity allows Unicode to support a diverse array of languages and writing systems.
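
To see this coverage in practice, the sketch below looks up the official character names with Python's standard `unicodedata` module. The helper `script_hint` is hypothetical and only a heuristic: it returns the first word of the character name, which is not the same as the formal Unicode Script property.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Return the first word of the character's Unicode name as a rough script hint."""
    return unicodedata.name(ch).split()[0]


# One sample character each from Latin, Greek, Cyrillic, Arabic, Chinese, Japanese.
samples = ["A", "\u03b1", "\u042f", "\u0628", "\u4e2d", "\u3042"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), sep="\t")
```

All six characters live in one string type and one code space, with no per-language encoding involved.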

## **Multilingual Support:**
Unicode is designed to be multilingual: a single character encoding standard can represent text in many languages within the same document or string, without switching between language-specific encodings.

## **UTF Encoding Schemes:**
A Unicode Transformation Format (UTF) is an encoding scheme that serializes Unicode code points into binary data. The common variants are UTF-8, UTF-16, and UTF-32, which use 8-bit, 16-bit, and 32-bit code units respectively.
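
The difference between the variants is easiest to see by encoding the same text three ways. This is a small sketch using Python's built-in codecs; the function name `encoded_sizes` is hypothetical, and the little-endian codec names are chosen so that no byte order mark is added to the output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store text in each UTF variant."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


# U+0041, U+00E9, U+20AC, U+1F600
for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

For these four code points the UTF-8 sizes are 1, 2, 3, and 4 bytes, the UTF-16 sizes are 2, 2, 2, and 4 bytes, and UTF-32 always uses 4 bytes.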
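
## **Compatibility and Normalization:**
Some text can be encoded in more than one way. For example, `é` is either the single code point `U+00E9` or the letter `e` (`U+0065`) followed by a combining acute accent (`U+0301`). Unicode defines canonical equivalence for such sequences, and compatibility equivalence for characters that are variants of one another, such as the ligature `ﬁ` (`U+FB01`) and the letters `fi`. Normalization (forms NFC, NFD, NFKC, and NFKD) converts equivalent sequences to one consistent representation.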
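
The following sketch shows why normalization matters when comparing strings. It uses `unicodedata.normalize` from the Python standard library; the helper `equivalent` is a hypothetical name, and NFC as the default form is an assumption made for the example.

```python
import unicodedata

composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)            # False: different code point sequences
print(equivalent(composed, decomposed))  # True: canonically equivalent

# Compatibility forms also fold variants such as the "fi" ligature (U+FB01).
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi
```

The canonical forms (NFC, NFD) leave the ligature untouched; only the compatibility forms (NFKC, NFKD) replace it, which is why the choice of form depends on whether such distinctions should survive.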

## Character Representation
### 1. **Code Point Representation:**
- A Unicode code point is written as `U+XXXX`, where `XXXX` is its value in hexadecimal, using at least four digits (and up to six for code points above `U+FFFF`).

Example: The code point for the letter 'A' is `U+0041`.
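
A minimal sketch of converting between characters and this notation, in Python. Both function names are hypothetical, and the parser is deliberately simple: it only checks for the `U+` prefix.

```python
def to_u_notation(ch: str) -> str:
    """Format a character's code point as U+XXXX with at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+0041' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+XXXX form: {text!r}")
    return chr(int(text[2:], 16))


print(to_u_notation("A"))            # U+0041
print(to_u_notation("\U0001F600"))   # U+1F600
print(from_u_notation("U+2764"))     # the heart symbol
```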

### 2. **Escape Sequences:**
In programming and markup languages, Unicode characters can be written as escape sequences, for example `\uXXXX` in JavaScript or `\u{XXXX}` in languages like [Rust](../dev/programming/languages/Rust.md) and JavaScript ES6. The `\uXXXX` form takes exactly four hexadecimal digits, so it can only express code points up to `U+FFFF` directly; the braced form accepts longer values.

Example: The escape sequence for the heart symbol (❤) is `\u2764`.
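
The text names JavaScript and Rust; as an illustration in a third language, Python string literals offer analogous forms. This sketch assumes Python 3, where `\u` takes four hex digits, `\U` takes eight, and `\N{...}` takes the official character name.

```python
heart = "\u2764"                  # four hex digits, enough for the BMP
grinning = "\U0001F600"           # eight hex digits, needed above U+FFFF
named = "\N{HEAVY BLACK HEART}"   # lookup by Unicode character name

print(heart == named)             # True: both literals denote U+2764
print(hex(ord(grinning)))         # 0x1f600

# The reverse direction: turn a character back into an ASCII-only escape.
escaped = heart.encode("unicode_escape").decode("ascii")
print(escaped)                    # \u2764
```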

### 3. **UTF-8 Encoding:**
UTF-8 is a variable-width encoding that represents each code point using one to four 8-bit code units (bytes). [ASCII](ASCII.md) characters are encoded as a single byte with the same value as in ASCII, which makes UTF-8 compact for ASCII-heavy text and backward compatible with existing ASCII-based systems.
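
To make the variable width concrete, here is a hand-written sketch of the UTF-8 bit layout in Python. The function `utf8_encode` is hypothetical and exists only to show the byte patterns; in real code the built-in `str.encode("utf-8")` does this.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand, to show the byte layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:      # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x20AC).hex())  # e282ac
```

The leading byte announces the sequence length and every continuation byte starts with the bits `10`, so a decoder can resynchronize from any position in a stream.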
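
### 4. **UTF-16 Encoding:**
UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (BMP, `U+0000` to `U+FFFF`) take a single code unit, while characters outside the BMP are encoded as a surrogate pair of two code units. UTF-16 is common as an internal string format, for example in the Windows API, Java, and JavaScript.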
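
The surrogate pair arithmetic can be sketched in a few lines of Python. The function names are hypothetical; the constants `0xD800`, `0xDC00`, and `0x10000` are the ones UTF-16 defines for high surrogates, low surrogates, and the start of the supplementary planes.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")          # D83D DE00
print(hex(from_surrogates(high, low)))  # 0x1f600
```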

### 5. **UTF-32 Encoding:**
UTF-32 uses a single 32-bit code unit for every code point, making it a fixed-width encoding. This simplifies random access to code points, at the cost of four bytes per code point even for ASCII text.
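
Because every code point occupies exactly four bytes, the position of the n-th code point is simple arithmetic. A minimal sketch in Python, assuming little-endian UTF-32 without a byte order mark; the function name `nth_code_point` is hypothetical.

```python
def nth_code_point(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE data by direct offset calculation."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError("code point index out of range")
    chunk = data[4 * n: 4 * n + 4]      # no scanning of earlier bytes needed
    return chr(int.from_bytes(chunk, "little"))


text = "a\u00e9\U0001F600"              # 1-, 2-, and 4-byte characters in UTF-8
data = text.encode("utf-32-le")
print(len(data))                         # 12: three code points, four bytes each
print(nth_code_point(data, 2) == text[2])  # True
```

The same lookup in UTF-8 or UTF-16 requires walking the data from the start, since earlier characters may have different widths.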

## Usage and Applications
### 1. **Programming and Software Development:**
- Unicode is crucial in programming for supporting a diverse set of characters in strings and text processing.

### 2. **Web and Document Standards:**
- [HTML](../internet/HTML.md), [XML](XML.md), and other web standards rely on Unicode for consistent representation of text across different platforms and devices.

### 3. **Localization and Internationalization:**
- Unicode facilitates the localization of software and content for different languages and regions, enabling a global audience.

### 4. **Operating Systems:**
- Modern operating systems, such as [Windows](../windows/Windows.md), [macOS](../macos/macOS.md), and [Linux](../linux/Linux.md), use Unicode for character encoding, ensuring compatibility and interoperability.

### 5. **Communication and Social Media:**
- Unicode is fundamental in digital communication, ensuring that users can express themselves using a wide range of symbols, emojis, and scripts.