Every character you're reading right now is a number. The letter "A" is 65. A space is 32. The emoji "🎉" is 127,881. Character encoding is the system that maps between the symbols humans read and the numbers computers store.
This might sound academic, but encoding issues cause real bugs. Garbled text in emails, broken characters on websites, mysterious question marks in database exports — all encoding problems. Understanding the basics saves you from debugging sessions that feel like deciphering ancient runes.
ASCII: Where It Started
First published in 1963, ASCII (American Standard Code for Information Interchange) defines 128 characters: the English alphabet in upper and lower case (lowercase arrived with the 1967 revision), digits 0-9, punctuation, and 33 control characters such as newline and tab.
A = 65    a = 97     0 = 48
B = 66    b = 98     1 = 49
Z = 90    z = 122    9 = 57
Each character fits in 7 bits. Simple, efficient, and completely useless for anyone not writing in English. French accents? No. Chinese characters? Absolutely not. Emoji? Decades away from existing.
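If you want to check these values yourself, Python's built-in ord() and chr() expose the mapping directly. This is a minimal sketch; the sample characters are picked only for illustration.

```python
# ord() maps a character to its number; chr() goes the other way.
samples = ["A", "a", "0", " "]
codes = {ch: ord(ch) for ch in samples}
print(codes)  # {'A': 65, 'a': 97, '0': 48, ' ': 32}
print(chr(90), chr(122))  # Z z

# Every ASCII value fits in 7 bits, i.e. it is below 2**7 = 128.
fits_in_7_bits = all(code < 2**7 for code in codes.values())
print(fits_in_7_bits)  # True

# str.isascii() (Python 3.7+) reports whether text stays inside that range.
print("cafe".isascii(), "caf\u00e9".isascii())  # True False
```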
This limitation led to a proliferation of competing encoding schemes — ISO-8859-1 for Western European languages, Shift_JIS for Japanese, Windows-1252 for... Windows being Windows. Every system assumed its own encoding, and when data crossed boundaries, characters broke.
Unicode: One Table to Rule Them All
Unicode solved the fragmentation by assigning a unique number (called a "code point") to every character in every writing system. As of Unicode 15, that is over 149,000 characters, covering 161 scripts from Latin to Linear B.
Code points are written as U+ followed by a hex number:
A = U+0041
é = U+00E9
中 = U+4E2D
🎉 = U+1F389
Unicode is just the mapping — it tells you that "A" is code point 65. It doesn't specify how to store that number in bytes. That's where encodings come in.
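The split between code point and stored bytes is easy to see in Python. The sketch below assumes a small helper, code_point() (a name made up for this example), and shows one code point producing different bytes under three encodings.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# \u00e9 is é, \u4e2d is 中, \U0001F389 is 🎉
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F389"]:
    print(ch, code_point(ch), ord(ch))

# One code point, three different byte sequences depending on the encoding.
e_acute = "\u00e9"
as_latin1 = e_acute.encode("latin-1")    # b'\xe9'
as_utf8 = e_acute.encode("utf-8")        # b'\xc3\xa9'
as_utf16 = e_acute.encode("utf-16-le")   # b'\xe9\x00'
print(as_latin1, as_utf8, as_utf16)
```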
UTF-8: The Encoding That Won
UTF-8 is a variable-width encoding that represents each Unicode code point using 1 to 4 bytes:
| Byte count | Code point range | Example |
|-----------|-----------------|---------|
| 1 byte | U+0000 to U+007F | A, z, 5, ! |
| 2 bytes | U+0080 to U+07FF | é, ñ, ü |
| 3 bytes | U+0800 to U+FFFF | 中, 日, 한 |
| 4 bytes | U+10000 to U+10FFFF | 🎉, 🚀, 😊 |
The genius of UTF-8: ASCII characters use exactly 1 byte, identical to their ASCII encoding. This means any valid ASCII text is also valid UTF-8. Backward compatibility at its finest.
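A short Python sketch makes the variable width visible. The helper name utf8_hex() is an assumption for this example, and bytes.hex() with a separator needs Python 3.8 or newer.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a string as spaced, uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()

# \u00e9 is é, \u4e2d is 中, \U0001F389 is 🎉
chars = ["A", "\u00e9", "\u4e2d", "\U0001F389"]
widths = {ch: len(ch.encode("utf-8")) for ch in chars}

for ch in chars:
    print(ch, widths[ch], "byte(s):", utf8_hex(ch))
# A 1 byte(s): 41
# é 2 byte(s): C3 A9
# 中 3 byte(s): E4 B8 AD
# 🎉 4 byte(s): F0 9F 8E 89

# Pure ASCII text produces identical bytes under both encodings.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)  # True
```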
UTF-8 dominates the web. Over 98% of websites use it, and it's the default encoding for HTML5, JSON, YAML, TOML, and most modern programming languages. If you're choosing an encoding today, choose UTF-8.
See how your text looks in UTF-8 byte sequences with the UTF-8 Encoder.
Hexadecimal: Human-Friendly Bytes
Binary is how computers think, but it's awful for humans. Hexadecimal (base 16) is the compromise — compact enough to be readable, directly mappable to binary (each hex digit = 4 bits).
Text: Hello
ASCII: 72 101 108 108 111
Hex: 48 65 6C 6C 6F
Binary: 01001000 01100101 01101100 01101100 01101111
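The same three views of "Hello" can be produced with format specifiers in Python. This is a rough sketch rather than a full converter; it assumes the input is plain ASCII.

```python
text = "Hello"
data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input

decimal = " ".join(str(b) for b in data)
hex_str = " ".join(f"{b:02X}" for b in data)   # two hex digits per byte
binary = " ".join(f"{b:08b}" for b in data)    # eight bits per byte

print("Decimal:", decimal)  # 72 101 108 108 111
print("Hex:    ", hex_str)  # 48 65 6C 6C 6F
print("Binary: ", binary)

# Going back: bytes.fromhex() skips the spaces between byte pairs.
round_trip = bytes.fromhex(hex_str).decode("ascii")
print(round_trip)  # Hello
```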
You'll encounter hex everywhere in programming:
- Color codes: #FF5733
- Memory addresses: 0x7FFE4B2C
- Byte sequences: \x48\x65\x6C\x6C\x6F
- Unicode escapes: \u00E9
- Hash values: d41d8cd98f00b204e9800998ecf8427e
The Hex-Text converter translates between human-readable text and its hexadecimal representation, which is useful for inspecting byte-level content or debugging encoding issues.
Binary: The Foundation
At the lowest level, everything is binary — ones and zeros. Each digit is a bit, eight bits make a byte, and bytes represent everything from text to images to executables.
Text: A
Decimal: 65
Binary: 01000001
Hex: 41
You rarely work with raw binary in application development, but understanding it helps when:
- Debugging bitwise operations
- Working with network protocols
- Reading file format specifications
- Understanding why some characters take more bytes than others
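To make the last point concrete, here is a hand-rolled UTF-8 encoder built from shifts and masks. It is an illustrative sketch only: encode_utf8() is a name invented for this example, and in real code you would call str.encode("utf-8") instead.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point < 0x80:       # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against Python's built-in encoder (é, 中, 🎉 written as escapes).
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F389"]:
    manual = encode_utf8(ord(ch))
    print(ch, manual.hex(" "), manual == ch.encode("utf-8"))
```

Each continuation byte carries only 6 payload bits, which is why larger code points need more bytes.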
Convert text to binary and back with the Binary-Text converter.
HTML Entities: Encoding for the Web
HTML entities are a different kind of encoding — they represent characters that have special meaning in HTML or that might not display correctly in all browsers.
< → &lt;
> → &gt;
& → &amp;
" → &quot;
© → &copy;
€ → &euro;
🎉 → &#127881;
The first four are essential: if you write <div> in HTML content without encoding the angle brackets, the browser interprets it as an actual div element. Using &lt;div&gt; displays the literal text.
Named entities (&copy;, &euro;) cover common symbols. Numeric entities (&#8364;, &#127881;) can represent any Unicode code point, which is useful when you need a specific character but aren't sure about the page encoding.
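Numeric entities are just the code point wrapped in ampersand-hash and a semicolon, so they are easy to generate. The sketch below uses a hypothetical helper, numeric_entity(), and decodes with html.unescape() from Python's standard library.

```python
import html

def numeric_entity(ch: str) -> str:
    """Build the decimal numeric entity for a single character."""
    return f"&#{ord(ch)};"

euro = numeric_entity("\u20ac")       # \u20ac is the euro sign
party = numeric_entity("\U0001F389")  # \U0001F389 is the party popper emoji
print(euro, party)  # &#8364; &#127881;

# html.unescape() accepts named, decimal, and hexadecimal forms.
decoded_named = html.unescape("&copy;")
decoded_decimal = html.unescape("&#8364;")
decoded_hex = html.unescape("&#x1F389;")
print(decoded_named, decoded_decimal, decoded_hex)
```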
Encode and decode HTML entities with the HTML Entities tool.
Failing to encode user input as HTML entities is a common source of XSS (Cross-Site Scripting) vulnerabilities. If a user submits <script>alert('hack')</script> and you render it unescaped, the script executes. Always encode untrusted content.
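In Python, html.escape() from the standard library handles the essential characters. The snippet below is a minimal sketch of escaping one untrusted string for HTML text content; the variable names are illustrative, and a real application would normally rely on its template engine's auto-escaping.

```python
import html

untrusted = "<script>alert('hack')</script>"

# html.escape() converts &, <, > and, by default, both quote characters.
safe = html.escape(untrusted)
print(safe)
# &lt;script&gt;alert(&#x27;hack&#x27;)&lt;/script&gt;

page = f"<p>Comment: {safe}</p>"
print(page)

# Decoding gives back the original text, so nothing is lost.
print(html.unescape(safe) == untrusted)  # True
```

Escaping for text content is not enough for every context: values placed inside script blocks, URLs, or CSS need their own encoding rules.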
Common Encoding Problems and Fixes
Mojibake (Garbled Text)
You see something like Ã© instead of é, or ä¸­æ–‡ instead of 中文. This happens when text encoded in UTF-8 is read as if it were ISO-8859-1 or the closely related Windows-1252 (or vice versa).
The fix: ensure every layer of your pipeline (database, application, HTTP headers, HTML meta tags) agrees on UTF-8. A single layer using a different encoding breaks the chain.
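You can reproduce mojibake on purpose in Python, which helps when you need to recognize it in the wild. This sketch decodes UTF-8 bytes with the wrong codec and then reverses the mistake; the reversal only works when the wrong decode lost no bytes.

```python
original = "\u00e9"  # é

# UTF-8 bytes read as Latin-1: one character turns into two.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # Ã©

# Reversing the mistake: back to the raw bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# The same accident with Chinese text read as Windows-1252.
chinese = "\u4e2d\u6587"  # 中文
garbled_cn = chinese.encode("utf-8").decode("cp1252")
print(len(chinese), len(garbled_cn))  # 2 6
repaired_cn = garbled_cn.encode("cp1252").decode("utf-8")
print(repaired_cn == chinese)  # True
```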
Replacement Characters (�)
The U+FFFD replacement character appears when a decoder encounters bytes that don't form a valid sequence in the expected encoding. It usually means data was corrupted or truncated mid-character.
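Truncation is easy to simulate: cut a multi-byte character in half and decode what is left. A small Python sketch, assuming the three-byte character 中 as the sample:

```python
data = "\u4e2d".encode("utf-8")  # 中 is three bytes: e4 b8 ad
truncated = data[:2]             # simulate a cut in the middle of the character

# Strict decoding (the default) refuses the incomplete sequence.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)

# errors="replace" substitutes U+FFFD instead of raising.
lenient = truncated.decode("utf-8", errors="replace")
print(repr(lenient))
print("\ufffd" in lenient)  # True
```

Seeing U+FFFD in stored data means the original bytes are already gone, so the repair has to happen upstream of the decoder.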
Double Encoding
Double encoding happens when text is encoded to UTF-8 and the resulting bytes are then treated as Latin-1 characters and encoded again. You get things like Ã© stored in place of é: the UTF-8 bytes C3 A9 are read as the two Latin-1 characters Ã and ©, which are then re-encoded as C3 83 C2 A9. The fix: encode once, decode once.
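The byte-level picture of double encoding, as a Python sketch. The repair step at the end assumes exactly one extra round of Latin-1 misinterpretation; other combinations need a different reversal.

```python
once = "\u00e9".encode("utf-8")  # é encoded correctly
print(once.hex(" "))  # c3 a9

# The bug: the bytes are mistaken for Latin-1 text and encoded a second time.
twice = once.decode("latin-1").encode("utf-8")
print(twice.hex(" "))  # c3 83 c2 a9

# Read back as UTF-8, the stored value is now two wrong characters.
stored = twice.decode("utf-8")
print(stored)  # Ã©

# Undo the extra layer: back to the Latin-1 bytes, then decode as UTF-8.
repaired = stored.encode("latin-1").decode("utf-8")
print(repaired == "\u00e9")  # True
```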
BOM (Byte Order Mark)
A byte order mark is the byte sequence EF BB BF at the start of a UTF-8 file. Some Windows editors add it; many Unix tools don't expect it. It's technically valid but causes issues with certain parsers and scripts that don't handle it. If a CSV import fails on the first column name, check for a BOM.
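Python ships a codec for exactly this case: "utf-8-sig" strips a leading BOM if one is present and otherwise behaves like "utf-8". The sketch below builds a small in-memory CSV (the column names and values are made up) to show the first-column symptom.

```python
import codecs
import csv
import io

# A tiny CSV file as an editor that writes a BOM would save it.
raw = codecs.BOM_UTF8 + "name,age\nAda,36\n".encode("utf-8")
print(raw[:3].hex(" "))  # ef bb bf

# Plain "utf-8" keeps the BOM as an invisible U+FEFF in the first header.
header_plain = next(csv.reader(io.StringIO(raw.decode("utf-8"))))
print(header_plain)  # ['\ufeffname', 'age']

# "utf-8-sig" drops the BOM while decoding.
header_clean = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
print(header_clean)  # ['name', 'age']
```

When reading from disk, passing encoding="utf-8-sig" to open() has the same effect.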
Practical Guide: When to Use What
| Format | Use case |
|--------|----------|
| UTF-8 text | Default for everything. Files, APIs, databases, HTML |
| Hex encoding | Debugging, inspecting bytes, color codes, crypto |
| Binary | Low-level protocols, bitwise operations, education |
| HTML entities | Displaying special chars in HTML, preventing XSS |
| Base64 | Embedding binary data in text (covered in our Base64 guide) |
Try It Yourself
Understanding encoding is easier with tools that show you what's happening at the byte level:
- UTF-8 Encoder — see UTF-8 byte representations of your text
- Hex-Text Converter — translate between text and hexadecimal
- Binary-Text Converter — convert text to binary and back
- HTML Entities — encode and decode HTML special characters
All processing happens in your browser. Your text stays on your machine.