knowledge/technology/files/Unicode.md

---
obj: concept
website: https://unicode.org
aliases: ["utf8", "UTF-8"]
---

# Unicode
Unicode is a standardized character encoding system that aims to represent text in most of the world's writing systems consistently. It provides a unique code point for every character, regardless of platform, program, or language. Unicode allows for the consistent representation of text across different devices and applications, fostering global communication and interoperability.

## **Code Points:**
Unicode assigns a unique numerical value to each character, symbol, or glyph, known as a code point. These code points are typically represented in hexadecimal.

## **Character Sets:**
Unicode encompasses a vast range of character sets, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many more. This inclusivity allows Unicode to support a diverse array of languages and scripts.

## **Multilingual Support:**
Unicode is designed to be multilingual, providing a single character encoding standard that can represent text in multiple languages simultaneously.

## **UTF Encoding Schemes:**
Unicode Transformation Format (UTF) is the encoding scheme used to serialize Unicode code points into binary data. Common UTF variants include UTF-8, UTF-16, and UTF-32.

## **Compatibility and Normalization:**
Unicode addresses compatibility issues, offering compatibility equivalence for characters that look similar but have different underlying code points. Unicode normalization ensures consistent representation.

## Character Representation
### 1. **Code Point Representation:**
- Represent a Unicode code point using the following syntax: U+XXXX, where XXXX is the hexadecimal code point.

Example: The code point for the letter 'A' is `U+0041`.

### 2. **Escape Sequences:**
In programming languages and markup languages, Unicode characters can be represented using escape sequences. For example, `\uXXXX` in JavaScript or `\u{XXXX}` in languages like [Rust](../dev/programming/languages/Rust.md) and JavaScript ES6.

Example: The escape sequence for the heart symbol (❤) is `\u2764`.

### 3. **UTF-8 Encoding:**
UTF-8 is a variable-width encoding scheme that represents Unicode characters using 8-bit code units. It is widely used for its compact representation of [ASCII](ASCII.md) characters and compatibility with existing systems.

### 4. **UTF-16 Encoding:**
UTF-16 uses 16-bit code units and is common in systems that work with surrogate pairs for characters outside the Basic Multilingual Plane (BMP).

### 5. **UTF-32 Encoding:**
UTF-32 uses 32-bit code units for each character, providing a fixed-width encoding. It simplifies random access to characters but may consume more memory.

## Usage and Applications
### 1. **Programming and Software Development:**
- Unicode is crucial in programming for supporting a diverse set of characters in strings and text processing.

### 2. **Web and Document Standards:**
- [HTML](../internet/HTML.md), [XML](XML.md), and other web standards rely on Unicode for consistent representation of text across different platforms and devices.

### 3. **Localization and Internationalization:**
- Unicode facilitates the localization of software and content for different languages and regions, enabling a global audience.

### 4. **Operating Systems:**
- Modern operating systems, such as [Windows](../windows/Windows.md), [macOS](../macos/macOS.md), and [Linux](../linux/Linux.md), use Unicode for character encoding, ensuring compatibility and interoperability.

### 5. **Communication and Social Media:**
- Unicode is fundamental in digital communication, ensuring that users can express themselves using a wide range of symbols, emojis, and scripts.