I made search engines understand emojis (and it's weirdly useful)

Published: 4 days ago (December 10, 2025 at 07:28 PM EST)

Source: Dev.to

Demo URLs

Key emoji → get actual keys:
Bike emoji → get bicycles and accessories:
Printer + paper → get printer supplies:
Cute domestic pet earrings (jewelry store) → finds cat and dog earrings even though product titles are in a different language:
Measure 🔥 (technical documentation) → recommends a device for measuring temperature/fire:

Pipeline

Crawl website → extract text with Trafilatura.
Generate embeddings: 1024‑dimensional vectors via BGE‑M3 (BAAI).
Store in Solr with both raw text and vectors.
Query time: run lexical search + KNN vector search and combine scores (hybrid approach).
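The indexing half of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the author's code: the `embed` callable stands in for whatever wraps BGE‑M3, the helper names are hypothetical, and the field names `uri`, `title`, `text`, and `embeddings` are borrowed from the Solr query shown later in the post. The real schema may well contain more fields (for example a unique key).

```python
import json
from typing import Callable, Sequence

EMBEDDING_DIM = 1024  # BGE-M3 dense vector size, as stated in the post


def build_solr_doc(
    uri: str,
    title: str,
    text: str,
    embed: Callable[[str], Sequence[float]],
) -> dict:
    """Bundle extracted text and its embedding into one Solr document.

    `embed` is a hypothetical stand-in for the embedding model call.
    """
    vector = [float(x) for x in embed(text)]
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "uri": uri,
        "title": title,
        "text": text,          # raw text for the lexical side
        "embeddings": vector,  # dense vector for the KNN side
    }


def to_update_payload(docs: list[dict]) -> str:
    """Serialize documents as a JSON array, the shape Solr's JSON update handler accepts."""
    return json.dumps(docs, ensure_ascii=False)
```

Passing the embedder in as a parameter keeps the sketch runnable without a GPU: a stub such as `lambda t: [0.0] * 1024` is enough to exercise the document shape before wiring in the real model.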

Why Emojis Work

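BGE‑M3 is a multilingual text embedding model (the “M3” stands for multi‑linguality, multi‑functionality, and multi‑granularity, not multimodality). Emojis appear alongside ordinary words in its multilingual training text, so the model has most likely learned that an emoji (e.g., 🔑) is semantically close to its textual equivalents in many languages (“key”, “Schlüssel”, “cheie”, etc.).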
Consequently, searching with 🚲 returns results for “bicycle”, “bike”, “Fahrrad”, “bicicletă”, etc., without any explicit translation layer.
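To make "semantically close" concrete, here is a small sketch of the cosine similarity ranking that a vector search performs conceptually. The three‑dimensional vectors are invented purely for illustration; they are not real BGE‑M3 outputs, which have 1024 dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (label, similarity) pairs sorted by descending similarity."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional stand-ins, NOT real model output.
toy_query = [0.9, 0.1, 0.0]  # pretend embedding of the bicycle emoji
toy_docs = {
    "Fahrrad": [0.8, 0.2, 0.1],
    "bicicletă": [0.85, 0.15, 0.05],
    "printer paper": [0.0, 0.2, 0.9],
}
ranking = rank_by_similarity(toy_query, toy_docs)
```

With these toy numbers the two bicycle terms score far above "printer paper", which is the behavior the post describes: no translation step is involved, only distance in a shared vector space.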

Embeddings & Infrastructure

Component	Details
Embedding model	BGE‑M3 (BAAI), 1024 dimensions
Inference hardware	RTX 4000 Ada, ~2–5 ms per query
Search engine	Solr 9.6 with dense vector support
Crawling stack	Custom PHP + Python (Playwright for JS‑heavy sites, Trafilatura for extraction)
Extra features	VADER (sentiment), langid (language detection), custom price extraction
Query latency	~40–50 ms total (including embedding generation)

Hybrid vs. Pure Vector Search

Pure vector search can surface semantically similar items but may rank exact matches lower, mishandle product codes/SKUs, and conflict with user expectations (e.g., “nike shoes” should prioritize Nike products).
Hybrid approach: lexical component guarantees exact matches; vector component handles “I don’t know the exact word but I know what I want” queries.

Solr Query Example

# Vector part
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]

# Lexical part
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
                         pf="title^1100 description^900" ...}

# Combined score: KNN similarity plus the lexical score squashed into [0, 1) via s / (s + 6)
q = {!func}sum(
      product(1, query($vectorQuery)),
      product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
    )
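The arithmetic of the combined score is easier to see outside Solr's function syntax. The sketch below mirrors the formula above in plain Python; the function names are hypothetical, and it assumes lexical scores are non‑negative and that a document with no lexical match contributes a lexical score of 0.

```python
def normalized_lexical(score: float, k: float = 6.0) -> float:
    """Squash a non-negative lexical score into [0, 1) via score / (score + k)."""
    return score / (score + k)


def hybrid_score(vector_score: float, lexical_score: float, k: float = 6.0) -> float:
    """Mirror sum(product(1, vector), product(1, lexical / (lexical + k)))."""
    vector_weight = 1.0   # the first product(1, ...) in the Solr query
    lexical_weight = 1.0  # the second product(1, ...) in the Solr query
    return vector_weight * vector_score + lexical_weight * normalized_lexical(lexical_score, k)
```

For example, `hybrid_score(0.8, 6.0)` gives 0.8 + 6 / (6 + 6) = 1.3. The constant 6 is the lexical score at which the lexical term contributes exactly half of its maximum, so however large the raw edismax score gets with boosts like `title^550`, it saturates below 1 and stays on a scale comparable to the vector similarity. The two weights of 1 look like tuning knobs left at their defaults.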

In the demos, the Solr debug view (bottom‑right button) shows the actual vector query functions.
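For readers who want to reproduce such a request outside the demo UI, the three parameters could be assembled along these lines. This is a sketch under assumptions: the function name and the `userText` parameter are invented here, the `pf` boosts and any other elided edismax options are left out because the post does not show them in full, and passing the user's text via `v=$userText` is one common Solr pattern rather than necessarily what the author does.

```python
import json
from urllib.parse import urlencode


def build_hybrid_params(user_text, query_vector, top_k=250):
    """Assemble Solr request parameters for the hybrid query (hypothetical helper)."""
    # KNN part: the vector is written as a JSON-style float list after the local params.
    vector_query = "{!knn f=embeddings topK=%d}%s" % (top_k, json.dumps(query_vector))
    # Lexical part: only the qf boosts shown in the post; user text is dereferenced
    # from a separate parameter (assumed name: userText).
    lexical_query = '{!edismax qf="title^550 description^450 uri^1 text^0.1" v=$userText}'
    # Combined score, same shape as the function query in the post.
    combined = (
        "{!func}sum("
        "product(1,query($vectorQuery)),"
        "product(1,div(query($lexicalQuery),sum(query($lexicalQuery),6)))"
        ")"
    )
    return {
        "q": combined,
        "vectorQuery": vector_query,
        "lexicalQuery": lexical_query,
        "userText": user_text,
    }
```

`urlencode(build_hybrid_params("🚲", vector))` then yields a query string that can be appended to a collection's `/select` endpoint. With 1024 floats in the vector the URL gets long, so sending the parameters in a POST body is probably the safer choice.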

Experimental Feature: Result Explanation

A local LLM (running on the same GPU) can generate explanations for results. Example: searching “measure 🔥” on a technical documentation site returns a specific device recommendation, pulling context from indexed PDFs.

Closing Thoughts

The emoji capability emerged naturally from using multilingual embeddings and proved surprisingly useful for cross‑language and conceptual searches. Feel free to ask questions about the setup or hybrid search in general.
