Building an AI Matching Engine Without Big Tech Resources

Published: January 9, 2026 at 05:32 PM EST
5 min read
Source: Dev.to

Why Matching Is Harder Than It Looks

Matching looks trivial from the outside, but production‑grade matching is an outcome‑driven system.

LinkedIn – a data‑rich example

LinkedIn’s matching works because it learns from:

| Signal | Description |
|---|---|
| Applications | Who applies to which jobs |
| Acceptance rates | Which offers are accepted |
| Recruiter behavior | How recruiters interact with candidates |
| Network overlap | Shared connections |
| Engagement signals | Clicks, messages, etc. |
| Retention data | Long‑term success of hires |

In other words, LinkedIn doesn’t “guess relevance”. It learns relevance from outcomes.

Seed‑stage reality

Pairfect started with:

  • no labeled data
  • no behavioral data
  • no interactions
  • no click‑through signals
  • no embeddings graph
  • no GPUs
  • PostgreSQL as the only accepted infra

A completely different world. Yet many early teams try to copy Big‑Tech architecture without the data – it simply doesn’t work.

The Real Beginning: Constraints, Not Models

Most teams begin matching by asking:

“Which ML model should we use?”

We started by asking a different question:

“What constraints make certain architectures impossible?”

Below is a simplified version of our constraint table; in practice, the constraints were the architecture.

| Constraint | Impact |
|---|---|
| Self‑funded | No GPUs, no distributed systems |
| Must run on Postgres | Matching logic must be SQL‑native |
| No labels | No LTR, no two‑tower training |
| CPU‑only | Lightweight embeddings only |
| MVP in 3 months | Simple > complex |
| Need explainability | No black‑box ranking |
| Sparse metadata | Must extract from text |
| Minimal DevOps | No vector‑DB clusters |

Before we wrote a single line of code, we knew what we couldn’t build. Ironically, that is exactly what saved a self‑funded startup like Pairfect.

Defining What “Good Match” Means (Critical & Often Missed)

You cannot architect matching until you define what a good match means in your domain.

  • LinkedIn: hired + retained
  • Pairfect:
    • semantic fit between campaign & influencer
    • audience expectations align
    • tone compatibility
    • price compatibility
    • content‑format alignment
    • worldview alignment (yes, that matters for creators)

If your team cannot answer “What constitutes a good match here?”, any discussion of embeddings vs. rules vs. transformers is premature.

Why We Didn’t Go Straight for SOTA Models

We evaluated the standard architectural options. Most didn’t survive the constraint filter:

| Option | Why not (at MVP stage) |
|---|---|
| Rules‑only | Too rigid |
| Pure embeddings | Too noisy without deterministic anchors |
| LLM ranking | Too slow and expensive on CPU |
| Learning‑to‑Rank | Needs labeled data |
| Two‑tower | Needs training data and GPUs |
| Collaborative filtering | Needs behavior data |
| Graph models | Needs graph maturity |

That left one viable category:

Hybrid Matching

Not because it’s “cool”, but because it’s appropriate for the stage.

The Architecture: Hybrid Matching in Practice

Our hybrid pipeline looked like this:

Hard Filters → One‑Hot Features → Embeddings → Fusion → Top‑K

1. Hard Filters

Eliminate impossible cases upfront:

  • price
  • language
  • content format
  • region
  • campaign type

```sql
SELECT *
FROM influencers
WHERE price BETWEEN 500 AND 1500
  AND language = 'en'
  AND region   = 'eu'
  AND format @> ARRAY['video']::text[];
```

2. One‑Hot Signals

Encode domain knowledge explicitly (tone, niche, vertical, channel, creative style). This prevents “semantic nonsense” (e.g., matching a financial brand with a prank channel).

```sql
-- Join against the campaign being matched so its tone/vertical are in scope
-- (:campaign_id is a bind parameter supplied by the application).
SELECT i.influencer_id,
       (CASE WHEN i.tone     = c.tone     THEN 1 ELSE 0 END) AS tone_match,
       (CASE WHEN i.vertical = c.vertical THEN 1 ELSE 0 END) AS vertical_match
FROM influencers i
CROSS JOIN campaigns c
WHERE c.id = :campaign_id;
```

3. Embeddings

We generated embeddings for:

  • bios
  • captions
  • descriptions
  • LLM summaries

Stored in pgvector, similarity via cosine.

```sql
-- pgvector's <=> operator is cosine distance, so 1 - distance is similarity.
-- Join in the campaign row so its embedding is in scope.
SELECT i.influencer_id,
       1 - (i.bio_embedding <=> c.bio_embedding) AS semantic_score
FROM influencers i
CROSS JOIN campaigns c
WHERE c.id = :campaign_id
ORDER BY semantic_score DESC
LIMIT 50;
```
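As a sanity check on the scoring: pgvector’s `<=>` operator returns cosine distance, so `1 - distance` recovers cosine similarity. The relationship in plain Python (a standalone sketch for intuition; no pgvector needed):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def cosine_distance(a, b):
    # This is the quantity pgvector's <=> operator computes.
    return 1.0 - cosine_similarity(a, b)

# Parallel vectors: distance ~ 0, so semantic_score = 1 - distance ~ 1.
print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```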

4. Rank Fusion (RRF)

RRF (Reciprocal Rank Fusion) let us merge multiple ranking signals into one stable ranking without training.

Formula

\[
\text{Score} = \sum_i \frac{1}{k + \text{rank}_i}
\]

where rank_i is the item’s 1‑based position in ranking i, and k (60 in our SQL) dampens the dominance of top‑ranked items.
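The same formula in a few lines of standalone Python (a sketch for intuition, not our production code):

```python
def rrf(rank_lists, k=60):
    """Reciprocal Rank Fusion: merge several rankings of the same items.

    rank_lists: a list of rankings, each a list of item ids, best first.
    Returns item ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rank_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# An item ranked well by several signals beats one ranked first by only one.
semantic = ["b", "a", "c"]
tone     = ["a", "b", "c"]
vertical = ["a", "c", "b"]
print(rrf([semantic, tone, vertical]))  # "a" wins: two firsts and a second
```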

SQL/CTE implementation (simplified)

```sql
WITH ranked AS (
  SELECT influencer_id,
         ROW_NUMBER() OVER (ORDER BY semantic_score DESC) AS r1,
         ROW_NUMBER() OVER (ORDER BY tone_match DESC)     AS r2,
         ROW_NUMBER() OVER (ORDER BY vertical_match DESC) AS r3
  FROM candidates
)
SELECT influencer_id,
       (1.0 / (60 + r1)) +
       (1.0 / (60 + r2)) +
       (1.0 / (60 + r3)) AS final_score
FROM ranked
ORDER BY final_score DESC
LIMIT 10;
```

Benefits of RRF‑based fusion

  • No ML pipeline to maintain
  • Consistent, deterministic behavior
  • Fully explainable scoring (each component is visible)
  • Cheap to compute on CPU
  • Resistant to noisy embeddings
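
For intuition, the whole pipeline can be sketched end to end in plain Python. Everything here (function names, dict fields, the toy cosine helper) is illustrative; the actual Pairfect version runs this logic in SQL:

```python
# End-to-end sketch of the hybrid pipeline:
# hard filters -> one-hot -> embeddings -> RRF fusion -> top-K.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def hard_filters(candidates, campaign):
    # Stage 1: eliminate impossible cases upfront (price, language, ...).
    return [c for c in candidates
            if campaign["price_min"] <= c["price"] <= campaign["price_max"]
            and c["language"] == campaign["language"]]

def one_hot_features(candidates, campaign):
    # Stage 2: explicit, explainable domain signals.
    for c in candidates:
        c["tone_match"] = 1 if c["tone"] == campaign["tone"] else 0
    return candidates

def semantic_scores(candidates, campaign):
    # Stage 3: embedding similarity (pgvector's cosine in production).
    for c in candidates:
        c["semantic_score"] = cosine(c["embedding"], campaign["embedding"])
    return candidates

def fuse(candidates, k=60):
    # Stage 4: Reciprocal Rank Fusion over the per-signal rankings.
    by_sem  = sorted(candidates, key=lambda c: -c["semantic_score"])
    by_tone = sorted(candidates, key=lambda c: -c["tone_match"])
    for c in candidates:
        # index() is O(n) -- fine for a sketch, not for production.
        c["final_score"] = (1.0 / (k + by_sem.index(c) + 1) +
                            1.0 / (k + by_tone.index(c) + 1))
    return sorted(candidates, key=lambda c: -c["final_score"])

def match(candidates, campaign, top_k=10):
    pool = hard_filters(candidates, campaign)
    pool = one_hot_features(pool, campaign)
    pool = semantic_scores(pool, campaign)
    return fuse(pool)[:top_k]   # Stage 5: a shortlist, not an infinite scroll
```

The point of the sketch: every stage is independently inspectable, which is exactly what makes the final ranking explainable.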

5. Top‑K Output

Return a shortlist, not an infinite scroll.

Top 10 most compatible influencers
+ explanation layer

This is not personalization; it is decision support.

Takeaways

| ✅ What worked | ❌ What didn’t |
|---|---|
| Start from constraints → architecture | Copy‑pasting Big‑Tech stacks without data |
| Keep everything SQL‑native (Postgres + pgvector) | Relying on GPU‑heavy, black‑box models |
| Use hard filters to prune early | Pure embeddings without anchors |
| Encode domain knowledge with one‑hot features | Blindly trusting LLM ranking on CPU |
| Merge signals with RRF → deterministic, explainable | Learning‑to‑Rank without labels |

If you’re at the seed stage and need a matching engine today, start with the hybrid pipeline above. It scales from a few hundred to tens of thousands of candidates, stays cheap, and gives you the explainability investors love.

Happy matching!

Why Everything Ran on PostgreSQL

Our entire matching system ran on:

PostgreSQL + pgvector + CPU

Reasons

  • Infra should reduce risk, not increase it
  • One system > five micro‑services
  • Fewer moving parts = fewer failures
  • Debugging in SQL is fast & deterministic
  • Product iteration > infra optimization

Hot take: infra is not tooling; infra is liability—especially at the MVP stage.

Explainability Was a Feature, Not a Nice‑to‑Have

We built full explainability into the matching layer:

  • Why this recommendation?
  • Which signals contributed?
  • How did fusion score them?
  • What would disqualify it?
  • How can it be overridden?

Trust matters in early marketplaces. LinkedIn can hide behind a black box. Startups cannot.
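
Concretely, an explanation layer can be as thin as carrying each signal’s rank and its RRF contribution alongside the final score. A sketch under that assumption (field names are illustrative, not our actual schema):

```python
def explain(item_id, ranks, k=60):
    """Build a per-candidate explanation from its rank under each signal.

    ranks: dict of signal name -> this item's 1-based rank for that signal.
    """
    contributions = {signal: 1.0 / (k + r) for signal, r in ranks.items()}
    return {
        "influencer_id": item_id,
        "final_score": sum(contributions.values()),
        # Which signals contributed, and how much -- sorted for display.
        "contributions": dict(sorted(contributions.items(),
                                     key=lambda kv: -kv[1])),
    }

report = explain("inf_42", {"semantic": 3, "tone": 1, "vertical": 12})
print(report["contributions"])  # tone first: rank 1 contributes the most
```

Because every number in the report is a visible term of the RRF sum, answering “why this recommendation?” is a lookup, not a model interpretation exercise.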

The Evolution Path (Critical CTO Work)

Founders often ask:

“Will hybrid scale forever?”

No. And it doesn’t need to.

Our planned evolution path:

Hybrid → Behavioral Signals → LTR → Two‑Tower → Graph → RL → Agents

Each step unlocks the next:

| Step | What it unlocks |
|---|---|
| Hybrid | Usable matching Day 1 |
| Behavioral Signals | Labels |
| LTR (Learning‑to‑Rank) | Label‑driven ranking |
| Two‑Tower | Scalable encoders |
| Graph | Multi‑objective optimization |
| RL (Reinforcement Learning) | Personalization |
| Agents | Reasoning |

This is how marketplace intelligence actually grows in the real world.

Final Lessons

Three lessons emerged from building Pairfect:

  1. Matching is not a model problem; it’s a business‑constraint problem.
  2. Appropriate complexity wins at the MVP stage. Over‑engineering extends time‑to‑market.
  3. You don’t need Big‑Tech architecture without Big‑Tech data.

The goal is not to replicate LinkedIn.
The goal is to build a system honest about your stage and prepared to evolve.

If You’re Building Something Similar

Happy to discuss:

  • Marketplace matching
  • Ranking architectures
  • Hybrid systems
  • pgvector setups
  • Evolution paths

DMs open.
