Why Your MySQL Database Thinks 'café' 'café'
Source: Dev.to
The Problem
It’s 2:47 AM. You’re debugging in production (yes, we’ve all been there). A customer in Paris reports they can’t find their favorite restaurant, “Café Rouge,” even though they just created it.
SELECT * FROM restaurants WHERE name = 'Café Rouge';
Result: Empty set.
But the data is there:
| id | name |
|---|---|
| 42 | Café Rouge |
The issue? The database can’t match the string you typed.
Why Encodings Matter
There are multiple ways to encode the letter “é”:
| Form | Unicode code points |
|---|---|
| Precomposed | é (U+00E9) |
| Decomposed | e + ́ (U+0065 + U+0301) |
They look identical to humans but are completely different to computers. If a user types one form while the database stores the other, a direct comparison fails.
Character Sets
A character set maps symbols to numbers (encodings). Common MySQL character sets:
- utf8mb4 – Full UTF‑8 support (4‑byte characters, emojis, etc.).
- latin1 – Western European languages.
- ascii – 7‑bit English only.
Note: MySQL’s older
utf8charset only supports up to 3‑byte characters, so it cannot store many emojis or rare symbols.
-- This fails with an emoji
CREATE TABLE old_table (
message VARCHAR(100) CHARACTER SET utf8
);
INSERT INTO old_table VALUES ('I love coding! 😍');
-- ERROR 1366: Incorrect string value
Use utf8mb4 instead:
CREATE TABLE modern_table (
message VARCHAR(100) CHARACTER SET utf8mb4
);
INSERT INTO modern_table VALUES ('I love coding! 😍');
-- Success
Collations
If a character set is the dictionary, a collation defines the grammar for comparison:
- Case sensitivity (
Avsa) - Accent sensitivity (
cafévscafe) - Multi‑character mappings (
ßvsss) - Sorting rules (natural ordering)
MySQL collation naming pattern: charset_language_sensitivity
| Collation | Meaning |
|---|---|
utf8mb4_general_ci | General, case‑insensitive |
utf8mb4_0900_ai_ci | Unicode 9.0, accent‑insensitive, case‑insensitive |
utf8mb4_bin | Binary (byte‑by‑byte) |
Comparison Examples
SET @name1 = 'José';
SET @name2 = 'jose';
SET @name3 = 'José'; -- Different é encoding
-- Binary collation (byte comparison)
SELECT @name1 = @name2 COLLATE utf8mb4_bin; -- 0 (false)
-- Case‑insensitive collation
SELECT @name1 = @name2 COLLATE utf8mb4_general_ci; -- 1 (true)
-- Accent‑insensitive collation
SELECT 'José' = 'Jose' COLLATE utf8mb4_0900_ai_ci; -- 1 (true)
Practical Impact
Unique Usernames
CREATE TABLE users (
username VARCHAR(50) COLLATE utf8mb4_0900_as_cs
-- as = accent sensitive, cs = case sensitive
);
INSERT INTO users VALUES ('José'), ('jose'), ('Jose');
-- Three distinct rows
Flexible Search
CREATE TABLE search_terms (
query VARCHAR(100) COLLATE utf8mb4_0900_ai_ci
);
SELECT * FROM products
WHERE name LIKE '%café%' COLLATE utf8mb4_0900_ai_ci;
-- Matches café, Café, CAFÉ, cafe, CAFE
Real‑World Issue: German “Müller”
A site stored names in latin1 with a binary collation. Users typed “Müller” using a decomposed ü (u + ¨). The query returned no rows.
Fix:
ALTER TABLE users
MODIFY name VARCHAR(100)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Emoji Errors
INSERT INTO posts (content) VALUES ('This is fire! 🔥');
-- ERROR 1366: Incorrect string value
Prevention: Ensure the table uses utf8mb4.
Performance Considerations
Binary collations are faster because they compare raw bytes.
-- Binary collation (≈0.05 s)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_bin;
-- Unicode collation (≈0.12 s)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_unicode_ci;
When to Use Which
| Use Case | Recommended Collation |
|---|---|
| Exact matching (emails, API keys) | utf8mb4_bin |
| Case‑sensitive identifiers | utf8mb4_bin |
| Performance‑critical queries | utf8mb4_bin |
| User‑facing search / international | utf8mb4_unicode_ci |
| Name matching, multilingual content | utf8mb4_unicode_ci |
Mixing Character Sets & Collations
MySQL allows different settings at various levels (server, database, table, column, query). Mixing them can cause errors:
CREATE DATABASE app
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
CREATE TABLE app.users (
id INT PRIMARY KEY,
email VARCHAR(255) CHARACTER SET latin1, -- ⚠️
username VARCHAR(100) COLLATE utf8mb4_bin -- ⚠️
);
SELECT * FROM users u1
JOIN users u2 ON u1.username = u2.email;
-- ERROR 1267: Illegal mix of collations
Golden Rule: Choose a single character set and collation for the entire application (e.g., utf8mb4 + utf8mb4_unicode_ci) and stick with it.
Recommended Default
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
Forensic Queries
Inspect a String’s Encoding
SELECT HEX('café'); -- Returns 636166C3A9 (UTF‑8)
Check Column Character Set & Collation
SELECT
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
AND TABLE_NAME = 'your_table';
Compare Strings with Specific Collation
SELECT 'Müller' = 'Mueller' COLLATE utf8mb4_unicode_ci;
Find Collation Mismatches
SELECT DISTINCT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database';
-- Ideally returns only one or two collations
Migrating an Existing Database
If your database already uses the wrong character set/collation:
-
Backup (always):
mysqldump your_database > backup.sql -
Convert tables/columns (example for a single table):
ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; -
Update server defaults (in
my.cnf):[mysqld] character-set-server = utf8mb4 collation-server = utf8mb4_unicode_ci -
Apply per‑connection setting (if needed):
SET NAMES utf8mb4;
After migration, re‑run your application tests to ensure all queries behave as expected.