I made search engines understand emojis (and it's weirdly useful)
Source: Dev.to
Demo URLs
- Key emoji → get actual keys:
- Bike emoji → get bicycles and accessories:
- Printer + paper → get printer supplies:
- Cute domestic pet earrings (jewelry store) → finds cat and dog earrings even though product titles are in a different language:
- Measure 🔥 (technical documentation) → recommends a device for measuring temperature/fire:
Pipeline
- Crawl website → extract text with Trafilatura.
- Generate embeddings: 1024‑dimensional vectors via BGE‑M3 (BAAI).
- Store in Solr with both raw text and vectors.
- Query time: run lexical search + KNN vector search and combine scores (hybrid approach).
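The indexing half of the pipeline can be sketched as below. This is a minimal illustration, not the author's code: the field names `uri`, `title`, `text`, and `embeddings` are taken from the query example later in the post, while the `id` field, the helper names, and the injected `embed` callable are assumptions made for the sketch.

```python
import json
from typing import Callable, Dict, List, Sequence

EMBEDDING_DIM = 1024  # dense vector size reported for BGE-M3 in this post


def build_solr_doc(url: str, title: str, text: str,
                   embed: Callable[[str], Sequence[float]]) -> Dict:
    """Build one Solr document that carries both the raw text and its vector.

    `embed` is a hypothetical callable wrapping whatever embedding model is used.
    """
    vector = [float(x) for x in embed(text)]
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(vector)}")
    return {
        "id": url,             # assumed unique key; not named in the post
        "uri": url,
        "title": title,
        "text": text,          # raw text for the lexical side
        "embeddings": vector,  # dense vector for the KNN side
    }


def to_update_payload(docs: List[Dict]) -> str:
    """Serialize documents as a JSON array for Solr's JSON update handler."""
    return json.dumps(docs, ensure_ascii=False)
```

In a real run, `text` would come from the extraction step (for example `trafilatura.extract(html)`) and `embed` would call the BGE‑M3 model; both are left out so the sketch runs with the standard library alone.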
Why Emojis Work
- BGE‑M3 is a multilingual text embedding model (it is not multimodal). Presumably because emojis appear alongside ordinary words in its training text, it places an emoji (e.g., 🔑) close to its textual equivalents in many languages (“key”, “Schlüssel”, “cheie”, etc.).
- Consequently, searching with 🚲 returns results for “bicycle”, “bike”, “Fahrrad”, “bicicletă”, etc., without any explicit translation layer.
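One way to check this behaviour yourself is to compare embeddings directly. The sketch below is an assumption-laden illustration: `embed` is a hypothetical callable that returns a vector for a string (for instance a thin wrapper around BGE‑M3), and cosine similarity is used as the closeness measure, which may differ from the similarity function configured in the actual index.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: str, candidates: List[str],
                       embed: Callable[[str], Sequence[float]]
                       ) -> List[Tuple[str, float]]:
    """Return candidates sorted by similarity to the query, most similar first."""
    query_vec = embed(query)
    scored = [(c, cosine_similarity(query_vec, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a real embedder, calling `rank_by_similarity("🔑", ["key", "Schlüssel", "bicycle"], embed)` should, if the observation above holds, rank the two words for "key" above "bicycle".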
Embeddings & Infrastructure
| Component | Details |
|---|---|
| Embedding model | BGE‑M3 (BAAI), 1024 dimensions |
| Inference hardware | RTX 4000 Ada, ~2–5 ms per query |
| Search engine | Solr 9.6 with dense vector support |
| Crawling stack | Custom PHP + Python (Playwright for JS‑heavy sites, Trafilatura for extraction) |
| Extra features | VADER (sentiment), langid (language detection), custom price extraction |
| Query latency | ~40–50 ms total (including embedding generation) |
Hybrid vs. Pure Vector Search
- Pure vector search can surface semantically similar items but may rank exact matches lower, mishandle product codes/SKUs, and conflict with user expectations (e.g., “nike shoes” should prioritize Nike products).
- Hybrid approach: lexical component guarantees exact matches; vector component handles “I don’t know the exact word but I know what I want” queries.
Solr Query Example
    # Vector part
    vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]

    # Lexical part
    lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
                    pf="title^1100 description^900" ...}

    # Combined score
    q = {!func}sum(
          product(1, query($vectorQuery)),
          product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
        )
In the demo, the Solr debug view (bottom‑right button) shows the actual vector query functions.
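The arithmetic of the combined score is easier to see outside Solr syntax. The function below mirrors the `{!func}` expression above as a sketch: the two weights correspond to the `product(1, ...)` factors and `k` to the constant 6, while the function and parameter names are invented for illustration.

```python
def hybrid_score(vector_score: float, lexical_score: float,
                 vector_weight: float = 1.0, lexical_weight: float = 1.0,
                 k: float = 6.0) -> float:
    """Combine a KNN score with a saturated lexical score.

    The lexical score is mapped through s / (s + k), which stays below 1
    however large the boosted edismax score gets, so it cannot swamp the
    vector score.
    """
    normalized_lexical = lexical_score / (lexical_score + k)
    return vector_weight * vector_score + lexical_weight * normalized_lexical
```

A document with no lexical match contributes 0 from the lexical side and is ranked on the vector score alone, and a lexical score of 6 maps to exactly 0.5. Since field boosts such as `title^550` can push raw lexical scores into the hundreds, a strong exact match ends up close to 1 on the lexical side, which is how exact matches stay near the top without an unbounded score.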
Experimental Feature: Result Explanation
A local LLM (running on the same GPU) can generate explanations for results. Example: searching “measure 🔥” on a technical documentation site returns a specific device recommendation, pulling context from indexed PDFs.
Closing Thoughts
The emoji capability emerged naturally from using multilingual embeddings and proved surprisingly useful for cross‑language and conceptual searches. Feel free to ask questions about the setup or hybrid search in general.