I made search engines understand emojis (and it's weirdly useful)

Published: December 10, 2025 at 07:28 PM EST
2 min read
Source: Dev.to

Demo URLs

  • Key emoji → get actual keys:
  • Bike emoji → get bicycles and accessories:
  • Printer + paper → get printer supplies:
  • Cute domestic pet earrings (jewelry store) → finds cat and dog earrings even though product titles are in a different language:
  • Measure 🔥 (technical documentation) → recommends a device for measuring temperature/fire:

Pipeline

  1. Crawl website → extract text with Trafilatura.
  2. Generate embeddings: 1024‑dimensional vectors via BGE‑M3 (BAAI).
  3. Store in Solr with both raw text and vectors.
  4. Query time: run lexical search + KNN vector search and combine scores (hybrid approach).
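
Steps 2 and 3 can be sketched as a small helper that assembles the document sent to Solr. This is a minimal sketch, not the author's code: the field names are borrowed from the Solr query shown later in the post, while `build_solr_doc` and the `embed` callable are hypothetical stand-ins (in the real pipeline `embed` would wrap BGE‑M3).

```python
from typing import Callable, Dict, Sequence

EMBEDDING_DIM = 1024  # dense vector size of BGE-M3, as stated in the post


def build_solr_doc(
    uri: str,
    title: str,
    description: str,
    text: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical: wraps the embedding model
) -> Dict[str, object]:
    """Return one Solr document holding both the raw text and its vector."""
    vector = [float(x) for x in embed(text)]
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}"
        )
    return {
        "uri": uri,
        "title": title,
        "description": description,
        "text": text,            # used by the lexical part of the query
        "embeddings": vector,    # used by the KNN part of the query
    }
```

Checking the vector length before indexing is a cheap guard: a dense vector field in Solr is declared with a fixed dimension, so a mismatched vector would otherwise only fail at index time.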

Why Emojis Work

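  • BGE‑M3 is a multilingual text embedding model. The likely explanation is that its training text contains emojis alongside ordinary words, so it learned that an emoji (e.g., 🔑) is semantically close to its textual equivalents in many languages (“key”, “Schlüssel”, “cheie”, etc.). No image understanding is involved; the emoji is just another piece of Unicode text.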
  • Consequently, searching with 🚲 returns results for “bicycle”, “bike”, “Fahrrad”, “bicicletă”, etc., without any explicit translation layer.
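
The "semantically close" idea comes down to cosine similarity between vectors. The sketch below shows only the ranking step; the 3‑dimensional vectors are invented for illustration and are not real BGE‑M3 output (real vectors have 1024 dimensions), and `cosine` and `rank` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document names ordered from most to least similar."""
    return sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)


# Toy vectors, made up for this example only.
toy_docs = {
    "Schlüssel": [0.9, 0.1, 0.0],
    "Fahrrad": [0.1, 0.9, 0.1],
    "printer paper": [0.0, 0.1, 0.9],
}
toy_key_emoji = [0.8, 0.2, 0.1]  # pretend embedding of the key emoji

print(rank(toy_key_emoji, toy_docs))
```

With these toy numbers the key emoji lands nearest to “Schlüssel”, which mirrors what the model does across languages without a translation layer.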

Embeddings & Infrastructure

Component | Details
Embedding model | BGE‑M3 (BAAI), 1024 dimensions
Inference hardware | RTX 4000 Ada, ~2–5 ms per query
Search engine | Solr 9.6 with dense vector support
Crawling stack | Custom PHP + Python (Playwright for JS‑heavy sites, Trafilatura for extraction)
Extra features | VADER (sentiment), langid (language detection), custom price extraction
Query latency | ~40–50 ms total (including embedding generation)

Why Hybrid Instead of Pure Vector Search

  • Pure vector search surfaces semantically similar items, but it can rank exact matches lower, mishandle product codes and SKUs, and violate user expectations (e.g., “nike shoes” should prioritize Nike products).
  • The hybrid approach covers both cases: the lexical component keeps exact matches at the top, and the vector component handles “I don’t know the exact word but I know what I want” queries.
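
The balance between the two components can be checked with a few lines of arithmetic. The sketch below mirrors the combined score from the Solr query example later in this post, where the unbounded lexical score is squashed with score / (score + 6) before being added to the vector score. The function names are hypothetical; the constant 6 and the weights of 1 come from that query.

```python
def normalized_lexical(score: float, k: float = 6.0) -> float:
    """Map a non-negative lexical score into [0, 1) via score / (score + k)."""
    if score < 0:
        raise ValueError("lexical scores are expected to be non-negative")
    return score / (score + k)


def hybrid_score(
    vector_score: float,
    lexical_score: float,
    vector_weight: float = 1.0,
    lexical_weight: float = 1.0,
    k: float = 6.0,
) -> float:
    """Weighted sum of the vector score and the normalized lexical score."""
    return (
        vector_weight * vector_score
        + lexical_weight * normalized_lexical(lexical_score, k)
    )


# A lexical score of 0 contributes nothing, a score equal to k contributes 0.5,
# and very large scores approach (but never reach) 1.
for raw in (0.0, 6.0, 60.0, 600.0):
    print(raw, normalized_lexical(raw))
```

The point of the squashing step is that a strong keyword match cannot drown out the vector score, since its contribution stays below 1, yet a document with any lexical match still gets a boost over one with none.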

Solr Query Example

# Vector part
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]

# Lexical part
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
                         pf="title^1100 description^900" ...}

# Combined score
q = {!func}sum(
      product(1, query($vectorQuery)),
      product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
    )

The Solr debug view (bottom‑right button) shows the actual vector query functions.
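
For readers who want to reproduce the query above from code, here is a rough sketch of how the request parameters might be assembled. It is an illustration under assumptions, not the author's implementation: `knn_query` and `hybrid_params` are hypothetical helpers, the part of the edismax parameters elided with "..." in the original is simply left out, and passing the user's text through `v=$userQuery` is my own choice for wiring the query string in.

```python
from typing import Dict, Sequence


def knn_query(vector: Sequence[float], field: str = "embeddings", top_k: int = 250) -> str:
    """Format a Solr {!knn} query string for the given dense vector."""
    values = ", ".join(repr(float(x)) for x in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


def hybrid_params(user_query: str, vector: Sequence[float]) -> Dict[str, str]:
    """Build request parameters that combine lexical and vector scores."""
    return {
        # Same shape as the combined score in the post: vector score plus
        # the lexical score squashed into [0, 1).
        "q": (
            "{!func}sum("
            "product(1,query($vectorQuery)),"
            "product(1,div(query($lexicalQuery),sum(query($lexicalQuery),6)))"
            ")"
        ),
        "vectorQuery": knn_query(vector),
        # Assumption: v=$userQuery feeds the user's text to edismax.
        "lexicalQuery": (
            '{!edismax qf="title^550 description^450 uri^1 text^0.1" '
            'pf="title^1100 description^900" v=$userQuery}'
        ),
        "userQuery": user_query,
    }


params = hybrid_params("bike", [-0.032, 0.009, -0.049])
print(params["vectorQuery"])
```

The resulting dictionary can be sent to a collection's `/select` handler with any HTTP client. Keeping the vector and lexical queries in their own parameters and referencing them with `$name` keeps the scoring function readable and lets each part be debugged separately.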

Experimental Feature: Result Explanation

A local LLM (running on the same GPU) can generate explanations for results. Example: searching “measure 🔥” on a technical documentation site returns a specific device recommendation, pulling context from indexed PDFs.

Closing Thoughts

The emoji capability emerged naturally from using multilingual embeddings and proved surprisingly useful for cross‑language and conceptual searches. Feel free to ask questions about the setup or hybrid search in general.
