Understanding Character Encoding: UTF-8, Hex, and Binary

Every character you're reading right now is a number. The letter "A" is 65. A space is 32. The emoji "🎉" is 127,881. Character encoding is the system that maps between the symbols humans read and the numbers computers store.

This might sound academic, but encoding issues cause real bugs. Garbled text in emails, broken characters on websites, mysterious question marks in database exports — all encoding problems. Understanding the basics saves you from debugging sessions that feel like deciphering ancient runes.

ASCII: Where It Started

ASCII (American Standard Code for Information Interchange) was first published in 1963 and defines 128 characters: the English alphabet (uppercase from the start, lowercase added in the 1967 revision), digits 0-9, punctuation, and 33 control characters such as newline and tab.

A = 65    a = 97    0 = 48
B = 66    b = 98    1 = 49
Z = 90    z = 122   9 = 57

Each character fits in 7 bits. Simple, efficient, and completely useless for anyone not writing in English. French accents? No. Chinese characters? Absolutely not. Emoji? Decades away from existing.
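As a quick way to check these values yourself, here is a minimal Python sketch. It uses only the built-in `ord()` and `chr()` functions; the variable names are illustrative.

```python
# ord() maps a character to its numeric code; chr() goes the other way.
for ch in ["A", "a", "0", "Z", "z", "9"]:
    print(ch, ord(ch))

# Every ASCII code is below 128, so it fits in 7 bits.
ascii_codes = [ord(ch) for ch in "Az09 \n\t"]
fits_in_7_bits = all(code < 2**7 for code in ascii_codes)

# Characters outside ASCII cannot be encoded with the ascii codec.
try:
    "\u00e9".encode("ascii")  # é
    accent_is_ascii = True
except UnicodeEncodeError:
    accent_is_ascii = False
```

Here `fits_in_7_bits` ends up true and `accent_is_ascii` ends up false, which is the limitation described above.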

This limitation led to a proliferation of competing encoding schemes — ISO-8859-1 for Western European languages, Shift_JIS for Japanese, Windows-1252 for... Windows being Windows. Every system assumed its own encoding, and when data crossed boundaries, characters broke.

Unicode: One Table to Rule Them All

Unicode solved the fragmentation by assigning a unique number (called a "code point") to every character in every writing system. As of Unicode 15, that is over 149,000 characters covering 161 scripts, from Latin to Linear B.

Code points are written as U+ followed by a hex number:

A      = U+0041
é      = U+00E9
中     = U+4E2D
🎉     = U+1F389

Unicode is just the mapping — it tells you that "A" is code point 65. It doesn't specify how to store that number in bytes. That's where encodings come in.
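The U+ notation is simply the code point written in hexadecimal, padded to at least four digits. A small sketch, assuming Python and a hypothetical helper named `code_point`:

```python
def code_point(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

# Print the code point of each sample character in both notations.
for ch in "Aé中🎉":
    print(ch, code_point(ch), ord(ch))
```

Note that nothing here says how many bytes each character occupies. `ord()` returns the abstract number only.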

UTF-8: The Encoding That Won

UTF-8 is a variable-width encoding that represents each Unicode code point using 1 to 4 bytes:

| Byte count | Code point range | Example |
|-----------|-----------------|---------|
| 1 byte | U+0000 to U+007F | A, z, 5, ! |
| 2 bytes | U+0080 to U+07FF | é, ñ, ü |
| 3 bytes | U+0800 to U+FFFF | 中, 日, 한 |
| 4 bytes | U+10000 to U+10FFFF | 🎉, 🚀, 😊 |

The genius of UTF-8: ASCII characters use exactly 1 byte, identical to their ASCII encoding. This means any valid ASCII text is also valid UTF-8. Backward compatibility at its finest.
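To see the byte counts from the table in practice, the following Python sketch encodes one character from each range. The helper `utf8_length` is a hypothetical name that just restates the table; `str.encode` and `bytes.hex` are standard (the separator argument to `hex` needs Python 3.8 or later).

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for a code point, following the ranges in the table."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

for ch in ["A", "é", "中", "🎉"]:
    encoded = ch.encode("utf-8")
    # Show the code point, the real encoded length, and the bytes in hex.
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {encoded.hex(' ').upper()}")

# ASCII text produces the same bytes under both encodings.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
```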

UTF-8 dominates the web. Over 98% of websites use it, and it's the default encoding for HTML5, JSON, YAML, TOML, and most modern programming languages. If you're choosing an encoding today, choose UTF-8.

See how your text looks in UTF-8 byte sequences with the UTF-8 Encoder.

Hexadecimal: Human-Friendly Bytes

Binary is how computers think, but it's awful for humans. Hexadecimal (base 16) is the compromise — compact enough to be readable, directly mappable to binary (each hex digit = 4 bits).

Text:    Hello
ASCII:   72 101 108 108 111
Hex:     48 65 6C 6C 6F
Binary:  01001000 01100101 01101100 01101100 01101111
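The same conversion can be reproduced in a few lines of Python. This is only a sketch using the standard `bytes.hex` and `bytes.fromhex` methods (the separator argument needs Python 3.8 or later); the variable names are illustrative.

```python
text = "Hello"
data = text.encode("utf-8")

decimal_view = list(data)            # one integer per byte
hex_view = data.hex(" ").upper()     # two hex digits per byte

# Going back: fromhex() ignores the spaces between byte pairs.
round_trip = bytes.fromhex(hex_view).decode("utf-8")

# Each hex digit stands for exactly four bits.
nibbles = {digit: f"{int(digit, 16):04b}" for digit in "6C"}

print(decimal_view, hex_view, round_trip, nibbles)
```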

You'll encounter hex everywhere in programming:

Color codes: #FF5733
Memory addresses: 0x7FFE4B2C
Byte sequences: \x48\x65\x6C\x6C\x6F
Unicode escapes: \u00E9
Hash values: d41d8cd98f00b204e9800998ecf8427e

The Hex-Text converter translates between human-readable text and its hexadecimal representation, which is useful for inspecting byte-level content or debugging encoding issues.

Binary: The Foundation

At the lowest level, everything is binary — ones and zeros. Each digit is a bit, eight bits make a byte, and bytes represent everything from text to images to executables.

Text:     A
Decimal:  65
Binary:   01000001
Hex:      41

You rarely work with raw binary in application development, but understanding it helps when:

Debugging bitwise operations
Working with network protocols
Reading file format specifications
Understanding why some characters take more bytes than others
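For experimenting with the bit patterns, here is a minimal Python sketch. The function names `to_bits` and `from_bits` are hypothetical, and the text is assumed to be UTF-8.

```python
def to_bits(text: str) -> str:
    """Render the UTF-8 bytes of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def from_bits(bits: str) -> str:
    """Inverse of to_bits: parse 8-bit groups back into text."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")

print(to_bits("A"))   # a single byte
print(to_bits("é"))   # two bytes: a lead byte starting 110, then a continuation byte starting 10
```

The leading bits of each byte are how a UTF-8 decoder tells a single-byte character from the start or continuation of a multi-byte one.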

Convert text to binary and back with the Binary-Text converter.

HTML Entities: Encoding for the Web

HTML entities are a different kind of encoding — they represent characters that have special meaning in HTML or that might not display correctly in all browsers.

&lt;     → <
&gt;     → >
&amp;    → &
&quot;   → "
&copy;   → ©
&#8364;  → €
&#x1F389; → 🎉

The first four are essential: if you write <div> in HTML content without encoding the angle brackets, the browser interprets it as an actual div element. Using &lt;div&gt; displays the literal text.

Named entities (&copy;, &euro;) cover common symbols. Numeric entities (&#8364;, &#x1F389;) can represent any Unicode code point, which is useful when you need a specific character but aren't sure about the page encoding.
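Python's standard `html` module can decode both kinds of entity, which makes it easy to check what a given entity stands for. In this sketch, `numeric_entity` is a hypothetical helper that builds the hexadecimal numeric form for any character.

```python
import html

# Named and numeric entities decode to the same kind of result: plain characters.
decoded = html.unescape("&copy; &euro; &#8364; &#x1F389; &lt;div&gt;")
print(decoded)

def numeric_entity(ch: str) -> str:
    """Build the hexadecimal numeric entity for a single character."""
    return f"&#x{ord(ch):X};"

print(numeric_entity("🎉"))
```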

Encode and decode HTML entities with the HTML Entities tool.

⚠️

Failing to encode user input as HTML entities is a common source of XSS (Cross-Site Scripting) vulnerabilities. If a user submits <script>alert('hack')</script> and you render it unescaped, the script executes. Always encode untrusted content.
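As an illustration of the escaping step, here is a sketch using `html.escape` from Python's standard library. The page template is a made-up example; real applications usually rely on a templating engine that escapes by default.

```python
import html

user_input = "<script>alert('hack')</script>"

# escape() replaces &, <, > and (by default) both quote characters.
safe = html.escape(user_input)

# Hypothetical template: the escaped text is shown, not executed.
page = f"<p>{safe}</p>"
print(page)
```

The browser renders the escaped string as visible text, so the submitted markup never becomes a script element.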

Common Encoding Problems and Fixes

Mojibake (Garbled Text)

You see something like Ã© instead of é, or ä¸­æ–‡ instead of 中文. This happens when text encoded in UTF-8 is read as if it were ISO-8859-1 or Windows-1252 (or vice versa).

The fix: ensure your entire pipeline — database, application, HTTP headers, HTML meta tags — all agree on UTF-8. A single layer using a different encoding breaks the chain.
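Mojibake is easy to reproduce on purpose, which helps when you need to recognize it. The sketch below, in Python, decodes UTF-8 bytes with the wrong codec and then reverses the mistake. The repair only works when the wrong decoding lost no bytes, and choosing `cp1252` for the second sample is an assumption about which legacy codec was involved.

```python
original = "é"

# Wrong: UTF-8 bytes interpreted as Latin-1 give two characters instead of one.
garbled = original.encode("utf-8").decode("latin-1")

# Repair: undo the wrong decoding, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

# The same mistake with Chinese text, assuming Windows-1252 as the wrong codec.
garbled_cjk = "中文".encode("utf-8").decode("cp1252")
repaired_cjk = garbled_cjk.encode("cp1252").decode("utf-8")

print(garbled, repaired, garbled_cjk, repaired_cjk)
```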

Replacement Characters (�)

The U+FFFD replacement character appears when a decoder encounters bytes that don't form a valid sequence in the expected encoding. It usually means data was corrupted or truncated mid-character.
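A truncated multi-byte sequence is one simple way to trigger this. The following Python sketch cuts a three-byte character short and decodes the result both leniently and strictly; the variable names are illustrative.

```python
complete = "中".encode("utf-8")       # three bytes
truncated = b"ab" + complete[:2]      # last character cut off mid-sequence

# Lenient decoding substitutes U+FFFD for the bytes it cannot interpret.
lenient = truncated.decode("utf-8", errors="replace")

# Strict decoding (the default) raises instead of silently substituting.
try:
    truncated.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

print(repr(lenient), strict_error)
```

Strict decoding is often the better default while debugging, because it reports the position of the bad bytes instead of hiding them.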

Double Encoding

Double encoding happens when text is encoded to UTF-8 and the resulting bytes are then treated as Latin-1 characters and encoded to UTF-8 a second time. You get things like Ã© where é should be, even though every layer claims to use UTF-8. The fix: encode once, decode once.
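The byte-level effect can be sketched in Python as follows. This assumes Latin-1 was the codec used in the mistaken middle step; with a different legacy codec the repair would need to use that codec instead.

```python
original = "é"

once = original.encode("utf-8")                   # correct: 2 bytes
twice = once.decode("latin-1").encode("utf-8")    # double-encoded: 4 bytes

# A correct UTF-8 decode of double-encoded data shows the garbled form.
shown = twice.decode("utf-8")

# Repair: peel off the extra layer, then decode once more.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")

print(once.hex(" "), twice.hex(" "), shown, repaired)
```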

BOM (Byte Order Mark)

The BOM is the byte sequence EF BB BF at the start of a UTF-8 file. Some Windows editors add it, and most Unix tools don't expect it. It's technically valid but causes issues with parsers and scripts that don't handle it. If a CSV import fails on the first column name, check for a BOM.
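The CSV symptom can be reproduced with a short Python sketch. It uses the standard `utf-8-sig` codec, which strips a leading BOM if one is present; the sample data is made up.

```python
import codecs
import csv
import io

# Made-up CSV content saved with a leading BOM.
raw = codecs.BOM_UTF8 + "name,age\nAda,36\n".encode("utf-8")

# Plain utf-8 keeps the BOM as an invisible U+FEFF in the first column name.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# utf-8-sig drops the BOM, and also works on files that have none.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(naive_header[0]), repr(clean_header[0]))
```

A lookup such as `row["name"]` fails against the first header because its key is really `"\ufeffname"`.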

Practical Guide: When to Use What

| Format | Use case |
|--------|----------|
| UTF-8 text | Default for everything. Files, APIs, databases, HTML |
| Hex encoding | Debugging, inspecting bytes, color codes, crypto |
| Binary | Low-level protocols, bitwise operations, education |
| HTML entities | Displaying special chars in HTML, preventing XSS |
| Base64 | Embedding binary data in text (covered in our Base64 guide) |

Try It Yourself

Understanding encoding is easier with tools that show you what's happening at the byte level:

UTF-8 Encoder — see UTF-8 byte representations of your text
Hex-Text Converter — translate between text and hexadecimal
Binary-Text Converter — convert text to binary and back
HTML Entities — encode and decode HTML special characters

All processing happens in your browser. Your text stays on your machine.
