Why Your MySQL Database Thinks 'café' 'café'

Published: 1 month ago (December 7, 2025 at 05:48 PM EST)

4 min read

Source: Dev.to

The Problem

It’s 2:47 AM. You’re debugging in production (yes, we’ve all been there). A customer in Paris reports they can’t find their favorite restaurant, “Café Rouge,” even though they just created it.

SELECT * FROM restaurants WHERE name = 'Café Rouge';

Result: Empty set.

But the data is there:

id	name
42	Café Rouge

The issue? The database can’t match the string you typed.

Why Encodings Matter

There are multiple ways to encode the letter “é”:

Form	Unicode code points
Precomposed	`é` (U+00E9)
Decomposed	`e` + `́` (U+0065 + U+0301)

They look identical to humans but are completely different to computers. If a user types one form while the database stores the other, a direct comparison fails.

Character Sets

A character set maps symbols to numbers (encodings). Common MySQL character sets:

utf8mb4 – Full UTF‑8 support (4‑byte characters, emojis, etc.).
latin1 – Western European languages.
ascii – 7‑bit English only.

Note: MySQL’s older utf8 charset only supports up to 3‑byte characters, so it cannot store many emojis or rare symbols.

-- This fails with an emoji
CREATE TABLE old_table (
    message VARCHAR(100) CHARACTER SET utf8
);

INSERT INTO old_table VALUES ('I love coding! 😍');
-- ERROR 1366: Incorrect string value

Use utf8mb4 instead:

CREATE TABLE modern_table (
    message VARCHAR(100) CHARACTER SET utf8mb4
);

INSERT INTO modern_table VALUES ('I love coding! 😍');
-- Success

Collations

If a character set is the dictionary, a collation defines the grammar for comparison:

Case sensitivity (A vs a)
Accent sensitivity (café vs cafe)
Multi‑character mappings (ß vs ss)
Sorting rules (natural ordering)

MySQL collation naming pattern: charset_language_sensitivity

Collation	Meaning
`utf8mb4_general_ci`	General, case‑insensitive
`utf8mb4_0900_ai_ci`	Unicode 9.0, accent‑insensitive, case‑insensitive
`utf8mb4_bin`	Binary (byte‑by‑byte)

Comparison Examples

SET @name1 = 'José';
SET @name2 = 'jose';
SET @name3 = 'José'; -- Different é encoding

-- Binary collation (byte comparison)
SELECT @name1 = @name2 COLLATE utf8mb4_bin;          -- 0 (false)

-- Case‑insensitive collation
SELECT @name1 = @name2 COLLATE utf8mb4_general_ci; -- 1 (true)

-- Accent‑insensitive collation
SELECT 'José' = 'Jose' COLLATE utf8mb4_0900_ai_ci; -- 1 (true)

Practical Impact

Unique Usernames

CREATE TABLE users (
    username VARCHAR(50) COLLATE utf8mb4_0900_as_cs
    -- as = accent sensitive, cs = case sensitive
);

INSERT INTO users VALUES ('José'), ('jose'), ('Jose');
-- Three distinct rows

Flexible Search

CREATE TABLE search_terms (
    query VARCHAR(100) COLLATE utf8mb4_0900_ai_ci
);

SELECT * FROM products
WHERE name LIKE '%café%' COLLATE utf8mb4_0900_ai_ci;
-- Matches café, Café, CAFÉ, cafe, CAFE

Real‑World Issue: German “Müller”

A site stored names in latin1 with a binary collation. Users typed “Müller” using a decomposed ü (u + ¨). The query returned no rows.

Fix:

ALTER TABLE users
MODIFY name VARCHAR(100)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

Emoji Errors

INSERT INTO posts (content) VALUES ('This is fire! 🔥');
-- ERROR 1366: Incorrect string value

Prevention: Ensure the table uses utf8mb4.

Performance Considerations

Binary collations are faster because they compare raw bytes.

-- Binary collation (≈0.05 s)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_bin;

-- Unicode collation (≈0.12 s)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_unicode_ci;

When to Use Which

Use Case	Recommended Collation
Exact matching (emails, API keys)	`utf8mb4_bin`
Case‑sensitive identifiers	`utf8mb4_bin`
Performance‑critical queries	`utf8mb4_bin`
User‑facing search / international	`utf8mb4_unicode_ci`
Name matching, multilingual content	`utf8mb4_unicode_ci`

Mixing Character Sets & Collations

MySQL allows different settings at various levels (server, database, table, column, query). Mixing them can cause errors:

CREATE DATABASE app
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;

CREATE TABLE app.users (
    id INT PRIMARY KEY,
    email VARCHAR(255) CHARACTER SET latin1,   -- ⚠️
    username VARCHAR(100) COLLATE utf8mb4_bin -- ⚠️
);

SELECT * FROM users u1
JOIN users u2 ON u1.username = u2.email;
-- ERROR 1267: Illegal mix of collations

Golden Rule: Choose a single character set and collation for the entire application (e.g., utf8mb4 + utf8mb4_unicode_ci) and stick with it.

Recommended Default

CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci

Forensic Queries

Inspect a String’s Encoding

SELECT HEX('café');  -- Returns 636166C3A9 (UTF‑8)

Check Column Character Set & Collation

SELECT
    TABLE_NAME,
    COLUMN_NAME,
    CHARACTER_SET_NAME,
    COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_NAME = 'your_table';

Compare Strings with Specific Collation

SELECT 'Müller' = 'Mueller' COLLATE utf8mb4_unicode_ci;

Find Collation Mismatches

SELECT DISTINCT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database';
-- Ideally returns only one or two collations

Migrating an Existing Database

If your database already uses the wrong character set/collation:

Backup (always):
```
mysqldump your_database > backup.sql
```

Convert tables/columns (example for a single table):

ALTER TABLE your_table
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

Update server defaults (in my.cnf):

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

Apply per‑connection setting (if needed):
```
SET NAMES utf8mb4;
```

After migration, re‑run your application tests to ensure all queries behave as expected.