What is Unicode?
Unicode is a universal character encoding standard that assigns a unique code point to every character across all writing systems. Maintained by the Unicode Consortium, the standard (as of version 15.1) defines over 149,000 characters covering 161 modern and historic scripts, as well as symbols, emoji, and control characters. Unicode replaced the fragmented set of legacy and regional character encodings (like ASCII, Latin-1, Shift_JIS) with a single, unified system.
Unicode code points are written in the format U+XXXX, where XXXX is four to six hexadecimal digits. For example, the Latin letter “A” is U+0041, the Greek letter alpha is U+03B1, and the emoji grinning face is U+1F600. The full Unicode range spans from U+0000 to U+10FFFF.
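As a minimal sketch of the U+XXXX notation described above, the following Python snippet formats a character's code point. The helper name `code_point_label` is made up for this illustration; `ord` is Python's built-in for reading a character's code point.

```python
def code_point_label(ch: str) -> str:
    # Format one character as U+XXXX, padded to at least four hex digits.
    return f"U+{ord(ch):04X}"


# The three examples from the text: Latin A, Greek alpha, grinning face.
for ch in ["A", "\u03b1", "\U0001F600"]:
    print(code_point_label(ch))
# U+0041
# U+03B1
# U+1F600
```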
What is Unicode Escaping?
Unicode escaping is the process of converting characters into a text-based representation using their hexadecimal code points. The most common format is \uXXXX, used in JavaScript, Java, C#, and JSON. For characters outside the Basic Multilingual Plane (above U+FFFF), a single \uXXXX cannot hold the value, so the character is written either as a surrogate pair (two \uXXXX escapes) or, where the language supports it (for example JavaScript since ES2015), with extended syntax like \u{1F600}.
Unicode escaping is needed when you want to include non-ASCII characters in source code files that use ASCII encoding, represent special characters in JSON strings, transmit Unicode data through systems that only support ASCII, or debug character encoding issues.
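The escaping step can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the function name `escape_unicode` is hypothetical, only non-ASCII characters are escaped, and characters above U+FFFF are written as surrogate pairs.

```python
def escape_unicode(text: str) -> str:
    # Hypothetical helper: turn every non-ASCII character into \uXXXX escapes,
    # one escape per UTF-16 code unit (so emoji become surrogate pairs).
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # plain ASCII passes through unchanged
            continue
        # Encode to big-endian UTF-16: 2 bytes per code unit, 1 or 2 units.
        units = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(units), 2):
            unit = int.from_bytes(units[i:i + 2], "big")
            out.append(f"\\u{unit:04X}")
    return "".join(out)


print(escape_unicode("caf\u00e9"))     # caf\u00E9
print(escape_unicode("\U0001F600"))    # \uD83D\uDE00
```

The output is pure ASCII, which is what makes it safe to pass through ASCII-only files and channels.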
How to Use the Unicode Escape/Unescape Tool
- Paste your text or escaped sequences into the input area
- Click “Escape” to convert text to \uXXXX sequences, “Unescape” to convert back, or “Code Points” to view the Unicode code point for each character
- Copy the result with the “Copy” button or Ctrl+Shift+C
Unicode Escape Formats Across Languages
| Language | Escape Format | Example (for “A”) |
|---|---|---|
| JavaScript/JSON | \u0041 | \u0041 |
| Python | \u0041 or \U00000041 | \u0041 |
| Java | \u0041 | \u0041 |
| C# | \u0041 | \u0041 |
| HTML | &#65; or &#x41; | &#x41; |
| CSS | \0041 | \0041 |
| Ruby | \u0041 or \u{41} | \u0041 |
Common Use Cases for Unicode Escaping
Internationalization (i18n): When building multilingual applications, Unicode escaping ensures that non-Latin characters in translation files and resource bundles are correctly preserved regardless of the file encoding.
JSON Data: The JSON specification requires that certain characters be escaped (the double quote, the backslash, and the control characters U+0000 through U+001F), and Unicode escaping is the standard way to include non-ASCII characters in JSON payloads when UTF-8 encoding isn’t available.
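For illustration, Python's standard `json` module shows both behaviors: by default `json.dumps` emits ASCII-only output with \uXXXX escapes, and `ensure_ascii=False` keeps the characters as-is. The `payload` dictionary is sample data invented for this sketch.

```python
import json

# Sample data (made up for this example): one BMP character, one emoji.
payload = {"name": "Zo\u00eb", "mood": "\U0001F600"}

# Default: non-ASCII characters become \uXXXX escapes (surrogate pair for emoji).
ascii_json = json.dumps(payload)
print(ascii_json)  # {"name": "Zo\u00eb", "mood": "\ud83d\ude00"}

# ensure_ascii=False keeps the characters literal; the result needs UTF-8 transport.
utf8_json = json.dumps(payload, ensure_ascii=False)

# Both forms parse back to the same data.
assert json.loads(ascii_json) == json.loads(utf8_json) == payload
```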
Debugging Encoding Issues: When text appears garbled or contains unexpected characters, viewing the Unicode code points helps identify whether the issue is a wrong encoding, a missing font, or corrupted data.
Source Code Portability: Escaping non-ASCII characters in source code ensures that the code works correctly even if the file is opened in an editor or system that doesn’t support UTF-8.
Common Unicode Escape Characters — Quick Reference
Here are frequently escaped characters developers encounter in everyday work:
| Character | Name | Code Point | Escape |
|---|---|---|---|
| © | Copyright sign | U+00A9 | \u00A9 |
| ® | Registered sign | U+00AE | \u00AE |
| ™ | Trademark | U+2122 | \u2122 |
| € | Euro sign | U+20AC | \u20AC |
| £ | Pound sign | U+00A3 | \u00A3 |
| ¥ | Yen sign | U+00A5 | \u00A5 |
| ° | Degree sign | U+00B0 | \u00B0 |
| — | Em dash | U+2014 | \u2014 |
| ’ | Right single quote | U+2019 | \u2019 |
| “ ” | Smart quotes | U+201C/U+201D | \u201C / \u201D |
| … | Ellipsis | U+2026 | \u2026 |
| • | Bullet | U+2022 | \u2022 |
| → | Right arrow | U+2192 | \u2192 |
| ≠ | Not equal | U+2260 | \u2260 |
| ≤ ≥ | Less/greater-equal | U+2264/U+2265 | \u2264 / \u2265 |
These characters frequently cause issues when copy-pasted from word processors, PDFs, or web pages into source code or configuration files. Escaping them prevents encoding mismatches across different systems and editors.
Understanding UTF-8, UTF-16, and Code Points
Unicode defines code points, but the actual byte representation depends on the encoding:
- UTF-8 uses 1 to 4 bytes per character and is the dominant encoding on the web
- UTF-16 uses 2 or 4 bytes per character and is used internally by JavaScript and Java
- UTF-32 uses exactly 4 bytes per character, providing direct code point mapping
The \uXXXX escape format corresponds to UTF-16 code units. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF) use a single \uXXXX escape, while characters above U+FFFF (like emoji) require a surrogate pair of two \uXXXX escapes.
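A short Python sketch can make the size differences concrete. The helper names `encoding_sizes` and `escapes_needed` are assumptions made for this example; the `-le` codec variants are used so that no byte order mark is counted.

```python
def encoding_sizes(ch: str) -> dict:
    # Byte length of one character in each encoding (no BOM included).
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


def escapes_needed(ch: str) -> int:
    # Number of \uXXXX escapes: one per 2-byte UTF-16 code unit.
    return len(ch.encode("utf-16-le")) // 2


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoding_sizes(ch), escapes_needed(ch))
# U+0041 {'utf-8': 1, 'utf-16': 2, 'utf-32': 4} 1
# U+00E9 {'utf-8': 2, 'utf-16': 2, 'utf-32': 4} 1
# U+20AC {'utf-8': 3, 'utf-16': 2, 'utf-32': 4} 1
# U+1F600 {'utf-8': 4, 'utf-16': 4, 'utf-32': 4} 2
```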
Troubleshooting Unicode Escape Issues
Surrogate pair errors: If you see \uD83D\uDE00 instead of a readable emoji, you are looking at a UTF-16 surrogate pair. The pair \uD83D\uDE00 decodes to the grinning face emoji (U+1F600). Modern JavaScript engines handle this automatically, but older tools may require manual pairing. This tool correctly decodes surrogate pairs back to their original characters.
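If manual pairing is needed, the standard UTF-16 arithmetic can be sketched as below. The function name `combine_surrogates` is hypothetical; the constants 0xD800, 0xDC00, and 0x10000 come from the UTF-16 definition.

```python
def combine_surrogates(high: int, low: int) -> str:
    # Combine a UTF-16 high/low surrogate pair into a single character.
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the offset above U+10000.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(code_point)


emoji = combine_surrogates(0xD83D, 0xDE00)
print(f"U+{ord(emoji):04X}")  # U+1F600
```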
Mojibake (garbled text): Text like Ã© instead of é, or â€™ instead of ’, means UTF-8 bytes were interpreted as Latin-1 or Windows-1252. The fix is to ensure every layer in your stack (file encoding, database charset, HTTP Content-Type header, and HTML <meta charset>) consistently uses UTF-8.
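The mis-decoding can be reproduced, and in simple cases reversed, with a few lines of Python. This is only a sketch: it assumes the wrong encoding was Windows-1252 (`cp1252`), and the reverse step can fail for text containing bytes that Windows-1252 leaves undefined, so fixing the encoding at the source remains the real remedy.

```python
# "café ’" written with escapes so the source file stays ASCII.
original = "caf\u00e9 \u2019"

# Reproduce the bug: UTF-8 bytes read as if they were Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reverse it: recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original
```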
Mixed escaped and plain text: It’s valid to have text such as Hello \u0057orld, where only some characters are escaped. The unescape operation in this tool handles mixed content correctly, converting only the \uXXXX sequences while leaving plain text untouched.
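One possible way to unescape mixed content is sketched below in Python. It is not the tool's code: the name `unescape_unicode` is hypothetical, the regular expression only recognizes the four-digit \uXXXX form, and it does not treat a doubled backslash specially.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    # Step 1: replace each \uXXXX with its UTF-16 code unit; plain text is untouched.
    with_units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so adjacent surrogates merge into one
    # character. Unpaired surrogates are replaced rather than raising an error.
    raw = with_units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "replace")


print(unescape_unicode("Hello \\u0057orld"))  # Hello World
```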
Escape format mismatch: Different languages use different escape syntax. If \u00E9 doesn’t work in your context, check whether your language expects \x{E9} (Perl/PHP regex), \U000000E9 (Python's 8-digit form), &#xE9; (HTML), or %C3%A9 (URL encoding). Use the format table above to match the correct syntax.
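To compare the formats side by side, a small Python sketch can generate each variant for one character. The function `escape_variants` and its dictionary keys are invented for this example; `urllib.parse.quote` is the standard-library call for percent-encoding and uses UTF-8 by default.

```python
from urllib.parse import quote


def escape_variants(ch: str) -> dict:
    # Build several escape spellings for a single BMP character.
    cp = ord(ch)
    return {
        "js_json": f"\\u{cp:04X}",        # \u00E9
        "python_long": f"\\U{cp:08X}",    # \U000000E9
        "perl_php_regex": f"\\x{{{cp:X}}}",  # \x{E9}
        "html": f"&#x{cp:X};",            # &#xE9;
        "url": quote(ch),                 # %C3%A9 (percent-encoded UTF-8 bytes)
    }


for name, value in escape_variants("\u00e9").items():
    print(f"{name}: {value}")
```

Note that the URL form encodes the UTF-8 bytes rather than the code point, which is why it shows C3 A9 instead of E9.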