I built two high-performance Python libraries for production AI: LLM log analytics and vector similarity search

Published: December 3, 2025 at 05:45 AM EST
4 min read
Source: Dev.to

What My Projects Do

llmlog_engine: Columnar Analytics for LLM Logs

A specialized embedded database for analyzing LLM application logs stored as JSONL.

Core capabilities

  • Fast JSONL ingestion into columnar storage format
  • Efficient filtering on numeric and string columns
  • Group‑by aggregations (COUNT, SUM, AVG, MIN, MAX)
  • Dictionary encoding for low‑cardinality strings (model names, routes)
  • SIMD‑friendly memory layout for performance
  • pandas DataFrame integration

Performance

  • 6.8× faster than pure Python on 100 k rows
    • Benchmark: filter by model + latency, group by route, compute 6 metrics
    • Pure Python: 0.82 s
    • C++ Engine: 0.12 s
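
For reference, a pure-Python baseline for this query shape looks roughly like the sketch below (simplified; field names follow the JSONL format shown later, and analyze_slow is an illustrative helper, not part of the library):

import json
from collections import defaultdict

# Illustrative pure-Python baseline: filter by model + latency,
# group by route, then compute per-group metrics.
def analyze_slow(path, model="gpt-4.1", min_latency_ms=500):
    groups = defaultdict(list)
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if row["model"] == model and row["latency_ms"] >= min_latency_ms:
                groups[row["route"]].append(row["latency_ms"])
    return {
        route: {
            "count": len(vals),
            "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for route, vals in groups.items()
    }

Every row pays for JSON parsing, dict lookups, and Python-level branching here, which is exactly the overhead a columnar engine avoids.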

mini_faiss: Vector Similarity Search for Dense Embeddings

A focused, high‑performance library for similarity search over dense embeddings.

Core capabilities

  • SIMD‑accelerated distance computation (L2 and inner product)
  • NumPy‑friendly API with clean type signatures
  • ~1500 lines of readable C++ code
  • Support for both Euclidean and cosine similarity
  • Heap‑based top‑k selection

Performance

  • ≈ 7× faster than pure NumPy on typical workloads
    • Benchmark: 100 k vectors, 768 dimensions
    • mini_faiss: 0.067 s
    • NumPy: 0.48 s
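
For reference, a typical brute-force NumPy baseline for this workload expands the squared-distance identity and uses argpartition for top-k selection; the exact baseline used in the benchmark may differ, so treat this as an illustrative sketch:

import numpy as np

def numpy_l2_search(xb, xq, k):
    # Brute-force L2 k-NN using ||q - b||^2 = ||q||^2 - 2*q.b + ||b||^2
    sq_b = (xb ** 2).sum(axis=1)                  # (N,) database norms
    sq_q = (xq ** 2).sum(axis=1, keepdims=True)   # (M, 1) query norms
    d2 = sq_q - 2.0 * (xq @ xb.T) + sq_b          # (M, N) squared distances
    idx = np.argpartition(d2, k, axis=1)[:, :k]   # k smallest, unordered
    order = np.take_along_axis(d2, idx, axis=1).argsort(axis=1)
    idx = np.take_along_axis(idx, order, axis=1)  # sort the k hits
    return np.take_along_axis(d2, idx, axis=1), idx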

Architecture Philosophy

Both libraries follow the same design pattern:

  • Core logic in C++17 – performance‑critical operations using modern C++
  • Python bindings via pybind11 – zero‑copy data transfer with NumPy
  • Minimal dependencies – no heavy frameworks or complex build chains
  • Columnar / SIMD‑friendly layouts – data structures optimized for CPU cache
  • Type safety – strict validation at the Python/C++ boundary

This approach delivers near‑native performance while preserving Python’s developer experience.
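
As a concrete illustration of the last two points, a Python-side wrapper can enforce the invariants that zero-copy transfer relies on before calling into C++ (an illustrative helper, not the actual library code):

import numpy as np

def _as_f32_matrix(x, d):
    # Zero-copy transfer to C++ requires float32 dtype, 2-D shape (n, d),
    # and C-contiguous (row-major) memory; anything else would force a copy.
    x = np.asarray(x)
    if x.dtype != np.float32:
        raise TypeError(f"expected float32, got {x.dtype}")
    if x.ndim != 2 or x.shape[1] != d:
        raise ValueError(f"expected shape (n, {d}), got {x.shape}")
    if not x.flags["C_CONTIGUOUS"]:
        raise ValueError("array must be C-contiguous for zero-copy transfer")
    return x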

Syntax Examples

llmlog_engine

Load and analyze logs

from llmlog_engine import LogStore

# Load JSONL logs
store = LogStore.from_jsonl("production_logs.jsonl")

# Analyze slow responses by model
slow_by_model = (
    store.query()
    .filter(min_latency_ms=500)
    .aggregate(
        by=["model"],
        metrics={
            "count": "count",
            "avg_latency": "avg(latency_ms)",
            "max_latency": "max(latency_ms)",
        },
    )
)

print(slow_by_model)  # slow_by_model is a pandas DataFrame

Error analysis

# Analyze error rates by model and route
errors = (
    store.query()
    .filter(status="error")
    .aggregate(
        by=["model", "route"],
        metrics={"count": "count"},
    )
)

Combined filters

# Filter by multiple conditions (AND logic)
result = (
    store.query()
    .filter(
        model="gpt-4.1",
        min_latency_ms=1000,
        route="chat",
    )
    .aggregate(
        by=["model"],
        metrics={"avg_tokens": "avg(tokens_output)"},
    )
)

Expected JSONL format

{"ts": "2024-01-01T12:00:00Z", "model": "gpt-4.1", "latency_ms": 423, "tokens_input": 100, "tokens_output": 921, "route": "chat", "status": "ok"}
{"ts": "2024-01-01T12:00:15Z", "model": "gpt-4.1-mini", "latency_ms": 152, "tokens_input": 50, "tokens_output": 214, "route": "rag", "status": "ok"}

mini_faiss

Build and search an L2 index

import numpy as np
from mini_faiss import IndexFlatL2

# Create index for 768‑dimensional vectors
d = 768
index = IndexFlatL2(d)

# Add vectors to index
xb = np.random.randn(10000, d).astype("float32")
index.add(xb)

# Search for nearest neighbors
xq = np.random.randn(5, d).astype("float32")
distances, indices = index.search(xq, k=10)

print(distances.shape)  # (5, 10) - 5 queries, 10 neighbors each
print(indices.shape)    # (5, 10)

Inner product and cosine similarity

from mini_faiss import IndexFlatIP

# Create inner product index
index = IndexFlatIP(d=768)

# Normalize vectors for cosine similarity
xb = np.random.randn(10000, 768).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)

index.add(xb)

# Queries must be normalized the same way
xq = np.random.randn(5, 768).astype("float32")
xq /= np.linalg.norm(xq, axis=1, keepdims=True)

distances, indices = index.search(xq, k=10)
# Higher scores = more similar (inner product, not a distance)

Implementation Highlights

llmlog_engine

Columnar storage with dictionary encoding

  • String columns (model, route, status) mapped to int32 IDs
  • Numeric columns stored as contiguous arrays
  • Filtering operates on compact integer representations

Query execution

  • Build boolean mask from filter predicates (AND logic)
  • Group matching rows by specified columns
  • Compute aggregations only on filtered rows
  • Return a pandas DataFrame (the full pipeline is sketched in NumPy terms below)

Example internal representation

Column: model       [0, 1, 0, 2, 0, ...] (int32 IDs)
Column: latency_ms  [423, 1203, 512, ...] (int32)
Dictionary: model   {0: "gpt-4.1-mini", 1: "gpt-4.1", 2: "gpt-4-turbo"}
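
Putting dictionary encoding and mask-based execution together, the query path can be sketched in NumPy terms (a simplification of the C++ engine; the column values are illustrative):

import numpy as np

# Dictionary-encoded columns, mirroring the representation above
model_dict = {"gpt-4.1-mini": 0, "gpt-4.1": 1, "gpt-4-turbo": 2}
model = np.array([0, 1, 0, 2, 0], dtype=np.int32)
route = np.array([0, 0, 1, 1, 0], dtype=np.int32)   # e.g. {0: "chat", 1: "rag"}
latency_ms = np.array([423, 1203, 512, 98, 731], dtype=np.int32)

# 1. Build a boolean mask from the filter predicates (AND logic);
#    string comparisons become cheap integer comparisons.
mask = (model == model_dict["gpt-4.1-mini"]) & (latency_ms >= 500)

# 2.-3. Group matching rows by route, aggregating only the filtered rows
for route_id in np.unique(route[mask]):
    vals = latency_ms[mask & (route == route_id)]
    print(route_id, vals.size, vals.mean(), vals.min(), vals.max())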

mini_faiss

Distance computation (L2)

||q - db||² = ||q||² - 2·q·db + ||db||²

  • Precomputes database norms for efficiency
  • Vectorizable loops enable SIMD auto‑vectorization
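
The expansion is easy to check numerically, and the point of it is that the ||db||² terms can be computed once when vectors are added instead of once per query:

import numpy as np

q = np.random.randn(768).astype("float32")
db = np.random.randn(768).astype("float32")

lhs = ((q - db) ** 2).sum()                              # direct squared L2
rhs = (q ** 2).sum() - 2.0 * (q @ db) + (db ** 2).sum()  # expanded form
assert np.allclose(lhs, rhs, rtol=1e-3)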

Top‑k selection

  • Heap‑based algorithm: O(N log k) per query
  • Efficient when k << N
  • Separate implementations for min (L2) and max (inner product)
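
In Python terms the selection looks like the sketch below (topk_smallest is an illustrative stand-in; the C++ version differs in detail but has the same O(N log k) structure):

import heapq

def topk_smallest(distances, k):
    # Keep a max-heap of the k smallest distances seen so far.
    # heapq is a min-heap, so negate distances to simulate a max-heap.
    heap = []  # entries are (-distance, index)
    for i, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:  # d beats the current worst of the best k
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap)

For example, topk_smallest([5.0, 1.0, 3.0, 0.5], k=2) returns [(0.5, 3), (1.0, 1)]. The max variant for inner product flips the comparison.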

Row‑major storage

data = [v_0[0], v_0[1], ..., v_0[d-1],
        v_1[0], v_1[1], ..., v_1[d-1],
        ...]

Cache‑friendly for batch distance computation.

Installation

Both libraries use standard Python packaging:

# llmlog_engine
git clone https://github.com/yuuichieguchi/llmlog_engine.git
cd llmlog_engine
pip install -e .

# mini_faiss
git clone https://github.com/yuuichieguchi/mini_faiss.git
cd mini_faiss
pip install .

Requirements

  • Python 3.8+
  • C++17 compiler (GCC, Clang, MSVC)
  • CMake 3.15+
  • pybind11 (installed via pip)

Use Cases

llmlog_engine

  • Monitor LLM application health in production
  • Analyze latency patterns by model and endpoint
  • Track error rates and failure modes
  • Debug performance regressions
  • Generate usage reports for cost analysis

mini_faiss

  • Dense retrieval for RAG systems
  • Document similarity search
  • Image search using vision model embeddings
  • Recommendation systems (nearest‑neighbor recommendations)
  • Prototyping before scaling to full FAISS

Known Limitations

llmlog_engine

  • In‑memory only (no persistence yet)
  • Single‑threaded query execution
  • No complex expressions or advanced query features at this time

mini_faiss

  • Limited to flat indexes (no IVF, HNSW, etc.)
  • No built‑in persistence; index must be rebuilt or serialized manually
  • Primarily optimized for CPU; GPU acceleration not provided