How I Built a Semantic Search Engine with CocoIndex

Published: December 2, 2025 at 05:04 PM EST
3 min read
Source: Dev.to

Introduction

In this tutorial, I’ll walk you through how I built a semantic search engine using CocoIndex, an open-source Python library for creating powerful search experiences. If you’ve ever wanted to build a search feature that understands context and meaning (not just exact keyword matches), this post is for you!

What is CocoIndex?

CocoIndex is a lightweight semantic search library that makes it easy to index and search through documents using vector embeddings. Unlike traditional keyword‑based search, semantic search understands the meaning behind queries, allowing users to find relevant results even when they use different words.

Why I Chose CocoIndex

I needed a search solution that was:

  • Easy to integrate – No complex setup or infrastructure required
  • Fast – Quick indexing and search performance
  • Semantic – Understanding context, not just keywords
  • Open source – Free to use and modify

CocoIndex checked all these boxes!

Getting Started

First, install CocoIndex:

pip install cocoindex
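
Note that the flow below exports embeddings to Postgres with the pgvector extension, and CocoIndex reads its connection string from the COCOINDEX_DATABASE_URL environment variable. A placeholder example for a local .env file (the credentials here are assumptions; use your own):

COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex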

Building the Search Engine

1. Initialize CocoIndex

import cocoindex
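
If you run the flow from your own script rather than through the cocoindex CLI, the library also needs to be initialized first. A minimal sketch, assuming the documented cocoindex.init() entry point and the .env file from above:

from dotenv import load_dotenv

load_dotenv()      # pick up COCOINDEX_DATABASE_URL from .env
cocoindex.init()   # initialize the CocoIndex runtime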

2. Add Documents

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds text into a vector database.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files")
    )

    doc_embeddings = data_scope.add_collector()

3. Chunk the Documents

Still inside text_embedding_flow, iterate over each document and split its content into overlapping chunks:

    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown",
            chunk_size=2000,
            chunk_overlap=500,
        )

4. Embed the Chunks

Nested inside the document loop, embed each chunk and collect one row per chunk for export:

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

5. Export to Postgres

Back at the top level of the flow, export the collected rows to a Postgres table with a cosine-similarity vector index:

    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )
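
6. Search the Index

The search function below calls text_to_embedding.eval(...), which the snippets above never define. In CocoIndex this is typically a shared transform flow, so the exact same embedding logic runs at indexing time and at query time. A minimal sketch following that pattern (the 384 in the type annotation matches all-MiniLM-L6-v2's output dimension):

import typing

@cocoindex.transform_flow()
def text_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[cocoindex.Vector[cocoindex.Float32, typing.Literal[384]]]:
    # Shared embedding logic, reused for indexed chunks and ad-hoc queries.
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

With that in place, the query side is a plain pgvector query against the exported table:
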
from pgvector.psycopg import register_vector
from psycopg_pool import ConnectionPool


def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Look up the Postgres table name backing the flow's export target.
    table_name = cocoindex.utils.get_target_storage_default_name(
        text_embedding_flow, "doc_embeddings"
    )
    # Embed the query with the same model used at indexing time.
    query_vector = text_to_embedding.eval(query)

    with pool.connection() as conn:
        register_vector(conn)  # enable pgvector type handling on this connection
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name}
                ORDER BY distance
                LIMIT %s
                """,
                (query_vector, top_k),
            )
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
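
Calling it just needs a psycopg connection pool pointed at the same database. A hypothetical usage sketch (the query string is my own example; it reuses the COCOINDEX_DATABASE_URL from your .env):

import os

pool = ConnectionPool(os.environ["COCOINDEX_DATABASE_URL"])

for hit in search(pool, "how do I split markdown into chunks?"):
    print(f"{hit['score']:.3f}  {hit['filename']}")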

Key Features I Implemented

Fast Indexing

CocoIndex uses efficient vector storage, making indexing thousands of documents quick and painless.

Semantic Understanding

The search understands that “teaching computers” relates to “machine learning” even without exact keyword matches.
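
You can see this effect directly with the underlying embedding model. A standalone illustration using sentence-transformers, independent of CocoIndex:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(
    ["teaching computers to recognize patterns", "machine learning"],
    normalize_embeddings=True,
)
# With normalized embeddings, the dot product equals cosine similarity.
print(float(emb[0] @ emb[1]))  # noticeably higher than for unrelated phrases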

Customizable Embeddings

You can use different embedding models depending on your use case and accuracy requirements.
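
For example, swapping the model in the embed step is a one-line change. The snippet below is my illustration, not part of the original flow; all-mpnet-base-v2 is a common higher-accuracy (and slower, 768-dimension) alternative:

            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-mpnet-base-v2"
                )
            )

If you do change models, remember to update the 384-dimension annotation in text_to_embedding to match.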

Real-World Example

I built a documentation search for my project with 500+ markdown files. With CocoIndex:

  • Indexing took less than 30 seconds
  • Search response time averaged 50 ms
  • Users found relevant docs even with vague queries

Performance Tips

  • Batch indexing – Add multiple documents at once for better performance
  • Choose the right embedding model – Balance between accuracy and speed
  • Cache frequently accessed results – Store common queries for instant responses (see the sketch after this list)
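
A minimal caching sketch, assuming pool and search are defined as above:

from functools import lru_cache

@lru_cache(maxsize=256)
def cached_search(query: str, top_k: int = 5):
    # lru_cache only needs the arguments to be hashable; each (query, top_k)
    # pair is computed once and then served from memory.
    return search(pool, query, top_k)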

Challenges I Faced

Challenge 1: Choosing Embedding Dimensions

Higher dimensions give better accuracy but slower performance. I settled on 384 dimensions as a sweet spot.
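
384 is the native output size of all-MiniLM-L6-v2, which you can verify directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384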

Challenge 2: Handling Large Document Collections

For collections over 10,000 documents, I implemented pagination and lazy loading.
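
A hypothetical pagination wrapper over the search function above (search_page and its slicing strategy are my illustration, not from the original post):

def search_page(pool: ConnectionPool, query: str, page: int, page_size: int = 10):
    # Fetch enough rows to cover the requested page, then slice it out.
    # For very deep pages, pushing OFFSET into the SQL would be cheaper.
    results = search(pool, query, top_k=(page + 1) * page_size)
    return results[page * page_size : (page + 1) * page_size]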

Results

After implementing CocoIndex:

  • User satisfaction increased significantly
  • Implementation took only 2 days vs. weeks for alternatives

Conclusion

CocoIndex made building a semantic search engine surprisingly simple. Whether you’re building a documentation site, blog search, or product catalog, it’s a fantastic tool that punches above its weight. The library is actively maintained, well‑documented, and the community is helpful. I highly recommend giving it a try for your next search implementation!
