--- obj: concept website: https://unicode.org aliases: ["utf8", "UTF-8"] --- # Unicode Unicode is a standardized character encoding system that aims to represent text in most of the world's writing systems consistently. It provides a unique code point for every character, regardless of platform, program, or language. Unicode allows for the consistent representation of text across different devices and applications, fostering global communication and interoperability. ## **Code Points:** Unicode assigns a unique numerical value to each character, symbol, or glyph, known as a code point. These code points are typically represented in hexadecimal. ## **Character Sets:** Unicode encompasses a vast range of character sets, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many more. This inclusivity allows Unicode to support a diverse array of languages and scripts. ## **Multilingual Support:** Unicode is designed to be multilingual, providing a single character encoding standard that can represent text in multiple languages simultaneously. ## **UTF Encoding Schemes:** Unicode Transformation Format (UTF) is the encoding scheme used to serialize Unicode code points into binary data. Common UTF variants include UTF-8, UTF-16, and UTF-32. ## **Compatibility and Normalization:** Unicode addresses compatibility issues, offering compatibility equivalence for characters that look similar but have different underlying code points. Unicode normalization ensures consistent representation. ## Character Representation ### 1. **Code Point Representation:** - Represent a Unicode code point using the following syntax: U+XXXX, where XXXX is the hexadecimal code point. Example: The code point for the letter 'A' is `U+0041`. ### 2. **Escape Sequences:** In programming languages and markup languages, Unicode characters can be represented using escape sequences. For example, `\uXXXX` in JavaScript or `\u{XXXX}` in languages like [Rust](../dev/programming/languages/Rust.md) and JavaScript ES6. Example: The escape sequence for the heart symbol (❤) is `\u2764`. ### 3. **UTF-8 Encoding:** UTF-8 is a variable-width encoding scheme that represents Unicode characters using 8-bit code units. It is widely used for its compact representation of [ASCII](ASCII.md) characters and compatibility with existing systems. ### 4. **UTF-16 Encoding:** UTF-16 uses 16-bit code units and is common in systems that work with surrogate pairs for characters outside the Basic Multilingual Plane (BMP). ### 5. **UTF-32 Encoding:** UTF-32 uses 32-bit code units for each character, providing a fixed-width encoding. It simplifies random access to characters but may consume more memory. ## Usage and Applications ### 1. **Programming and Software Development:** - Unicode is crucial in programming for supporting a diverse set of characters in strings and text processing. ### 2. **Web and Document Standards:** - [HTML](../internet/HTML.md), [XML](XML.md), and other web standards rely on Unicode for consistent representation of text across different platforms and devices. ### 3. **Localization and Internationalization:** - Unicode facilitates the localization of software and content for different languages and regions, enabling a global audience. ### 4. **Operating Systems:** - Modern operating systems, such as [Windows](../windows/Windows.md), [macOS](../macos/macOS.md), and [Linux](../linux/Linux.md), use Unicode for character encoding, ensuring compatibility and interoperability. ### 5. **Communication and Social Media:** - Unicode is fundamental in digital communication, ensuring that users can express themselves using a wide range of symbols, emojis, and scripts.