Training GitHub Repository Embeddings using Stars
Source: Dev.to
TL;DR
The Idea – People use GitHub Stars as bookmarks. This is an excellent signal for understanding which repositories are semantically similar.
The Data – Processed ~1 TB of raw data from the GitHub Archive (BigQuery) to build an interest matrix of 4 M developers.
The ML – Trained embeddings for 300 k+ repositories using metric learning (EmbeddingBag + MultiSimilarityLoss).
The Frontend – Built a client‑only demo that runs vector search (KNN) directly in the browser via WASM, with no backend involved.
The Result – The system finds non‑obvious library alternatives and allows for semantic comparison of developer profiles.
Personal Motivation
Finishing ideas is usually harder than it looks. It’s easy to build a prototype, but the real struggle begins afterwards: polishing the rough edges, writing the text, and setting up the demo. This project is my attempt to go the full distance and tie up one of those “loose ends.”
I also started thinking about the nature of our GitHub Stars. We treat them simply as bookmarks “for later,” but in reality they are a valuable digital asset—a snapshot of our professional interests and skills. I wondered:
- Can we put this passive asset to work?
- Can we use the accumulated knowledge of millions of developers to build a repository‑recommendation system and let people compare their technical interests?
The Concept
Cluster Hypothesis
The people reading this article are far more similar to you in interests than a randomly selected person off the street. In our universe, similar things often appear in similar contexts.
We intuitively sense these “hidden” preferences. If you see a new colleague typing in Vim and exiting it without help, you’ve probably already built a mental vector of their interests: you can likely discuss patching KDE2 under FreeBSD with them, but asking for advice on an RGB gaming mouse might be a miss.
Repo Representation
We want a space where semantically similar repositories are located close to each other.
To illustrate, imagine a 2‑D space whose axes are:
- Axis X: Data (Preparation & Analysis) ↔ Models (Training & Inference)
- Axis Y: Local / Single‑node ↔ Big Data / Cluster

In reality the neural network learns these (often non‑interpretable) features itself. The mathematical essence remains: similar repos are pulled together into clusters based on learned features.
With these vectors we can:
- Search for similar repositories using cosine similarity (angle between vectors).
- Obtain a user‑interest vector by averaging the vectors of their starred repositories.
- Compare user profiles with each other.
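To make these operations concrete, here is a tiny numpy sketch; the repo vectors below are random placeholders, while in the real system they come from the trained model:

```python
import numpy as np

# Placeholder embeddings: repo_id -> 128-d vector (random here purely for illustration).
rng = np.random.default_rng(0)
repo_names = ["pandas", "numpy", "dask", "vue", "react"]
repo_vectors = rng.normal(size=(len(repo_names), 128)).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A user's interest vector is the mean of the vectors of their starred repos.
starred = [0, 1, 2]                                   # pandas, numpy, dask
user_vector = repo_vectors[starred].mean(axis=0)

# Ranking all repos by similarity to the user vector = "find similar repositories".
ranking = sorted(
    zip(repo_names, (cosine_similarity(user_vector, v) for v in repo_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Comparing two user profiles is then just the cosine similarity between their mean vectors.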
Signal Source
To get quality vectors, I used a hybrid approach.
- Text (README.md) – Initialization
  Many repositories have a README, which is a great “cold‑start” source. I used the Qwen3‑Embedding‑0.6B model (it supports Matryoshka Representation Learning) and kept only the first 128 dimensions – the most important components. These vectors served as the weight initialization for the trainable model (a minimal sketch of this step follows the list).
  Note: this step adds ~10 % to the final quality. To keep the public repo lightweight I omitted it; the model learns fine from random initialization, just a bit more slowly.
- Stars Matrix – Main Training
  Text alone doesn’t show how tools are used together. Collaborative filtering captures this:

  | User | Starred repositories |
  |---|---|
  | A | Pandas, Dask, scikit‑learn, NumPy |
  | B | Vue, React, TypeScript, Vite |

  Approaches include graph algorithms (LightGCN) or matrix factorization. I chose metric learning because it needs fewer GPU resources (≈1 GPU with ~8 GB) and offers flexibility in managing the vector space.
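For the README initialization (the first item above), a minimal sketch might look like the following. It assumes the Qwen3‑Embedding‑0.6B checkpoint is loaded via sentence-transformers and that the Matryoshka truncation is done by simply keeping the first 128 dimensions and re‑normalizing; the project’s exact pipeline may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Toy READMEs; in practice these are the real README.md texts per repository.
readmes = {
    "pandas-dev/pandas": "Flexible and powerful data analysis / manipulation library for Python...",
    "pola-rs/polars": "Dataframes powered by a multithreaded, vectorized query engine...",
}

full = model.encode(list(readmes.values()))                    # full-size text embeddings
truncated = full[:, :128]                                      # Matryoshka: keep the first 128 dims
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)  # re-normalize after truncation

# These 128-d vectors can be copied into the EmbeddingBag weight matrix as a warm start.
init_vectors = dict(zip(readmes.keys(), truncated))
```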
Data Preparation
Data were sourced from the public GitHub Archive dataset in BigQuery.
Two queries were required:
| Query | Purpose |
|---|---|
| Stars (WatchEvent) | Collect users with 10 – 800 stars (filtering bots & inactive users) while preserving star order. |
| Meta (PushEvent) | Collect repository names, commit dates, and descriptions. |
The queries processed ~1 TB of data and almost fit within the BigQuery Free Tier. The output was a Parquet file containing ~4 M users and ~2.5 M unique repositories.
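Roughly, turning that Parquet file into training inputs looks like the sketch below; file and column names are placeholders, not the project’s actual schema:

```python
import pandas as pd

stars = pd.read_parquet("stars.parquet")   # one row per (user, starred repo), star order preserved

# Mirror the query-side filter: keep users with 10-800 stars (drops bots and inactive users).
star_counts = stars.groupby("user")["repo"].transform("size")
stars = stars[star_counts.between(10, 800)]

# Build the vocabulary for the EmbeddingBag lookup table. In practice a popularity cutoff
# (not detailed here) shrinks the ~2.5 M unique repos to the ~300 k that receive embeddings.
repo_to_id = {repo: i for i, repo in enumerate(stars["repo"].unique())}
stars["repo_id"] = stars["repo"].map(repo_to_id)

# One list of starred repo ids per user, ready for the bucket-splitting step below.
user_stars = stars.groupby("user")["repo_id"].apply(list)
print(f"{len(repo_to_id)} repositories, {len(user_stars)} users")
```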
Training Vectors
Model Choice
To keep the solution lightweight for the browser, I ruled out Transformers.
The model is a classic torch.nn.EmbeddingBag – essentially a large lookup table:
repo_id → vector[128]
It can efficiently aggregate (average) vectors.
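In code, the whole model fits in a couple of lines; the sizes below match the article, everything else is a sketch:

```python
import torch
import torch.nn as nn

NUM_REPOS, DIM = 300_000, 128
model = nn.EmbeddingBag(NUM_REPOS, DIM, mode="mean")   # repo_id -> vector[128], averaged per "bag"

# One bag of three repo ids is looked up and averaged into a single 128-d vector.
repo_ids = torch.tensor([17, 42, 1234])
offsets = torch.tensor([0])                # the bag starts at position 0 of repo_ids
bag_vector = model(repo_ids, offsets)      # shape: (1, 128)
```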
Sampling & Loss Function
How do we tell the network that Pandas and NumPy are “close”?
For each user, I split their list of starred repositories into two random, non‑overlapping buckets and used them as positive pairs. Negative samples were drawn from other users’ buckets. The loss function was MultiSimilarityLoss, which encourages:
- Pulling positive pairs together.
- Pushing negative pairs apart.
This simple scheme captures the collaborative‑filtering signal without expensive graph computations.
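A sketch of this sampling scheme (toy repo ids, not the real dataset): each user’s stars are shuffled and cut in half, both halves get the same label, and buckets from other users in the batch become the negatives.

```python
import random
import torch

def make_batch(user_stars):
    """user_stars: {user: [repo_id, ...]}. Returns flat ids, bucket offsets, bucket labels."""
    flat_ids, offsets, labels = [], [], []
    for label, repos in enumerate(user_stars.values()):
        repos = list(repos)
        random.shuffle(repos)
        half = len(repos) // 2
        for bucket in (repos[:half], repos[half:]):     # two random, non-overlapping buckets
            offsets.append(len(flat_ids))
            flat_ids.extend(bucket)
            labels.append(label)                        # same label for both buckets of one user
    return torch.tensor(flat_ids), torch.tensor(offsets), torch.tensor(labels)

ids, offsets, labels = make_batch({
    "A": [0, 1, 2, 3],        # e.g. NumPy, Dask, SciPy, Pandas
    "B": [4, 5, 6, 7],        # e.g. Vue, Vite, React, TypeScript
})
print(offsets.tolist(), labels.tolist())   # [0, 2, 4, 6] and [0, 0, 1, 1]
```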
Inference & Front‑end
The trained embeddings (128‑dim vectors) are exported as a static binary file (~150 MB). In the browser:
- The file is loaded via fetch and decoded into a Float32Array.
- A WebAssembly (WASM) module runs the cosine‑similarity (KNN) search (see “Low‑level Magic” below).
- The UI (pure HTML + CSS + Vanilla JS) lets users:
- Authenticate with GitHub → fetch starred repos → compute their interest vector.
- Visualise a radar chart of skill distribution.
- Search for similar repositories or compare profiles.
All of this runs client‑side; no server or API keys are required.
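Conceptually, the search the WASM module performs is nothing more than the following numpy sketch (file name and shapes are assumptions): convert FP16 to FP32, normalize once, and take the top‑k dot products.

```python
import numpy as np

# Assumed export format: a dense (num_repos, 128) FP16 matrix.
embeddings = np.load("repo_embeddings_fp16.npy").astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact cosine search: indices of the k repos most similar to the query vector."""
    q = query.astype(np.float32)
    q /= np.linalg.norm(q)
    scores = embeddings @ q                  # cosine similarity against every repo
    return np.argsort(-scores)[:k]

print(top_k(embeddings[0]))                  # the first repo should be its own nearest neighbour
```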
Results
- The system surfaces non‑obvious library alternatives (e.g., recommending Polars to a Pandas fan).
- User‑profile similarity heatmaps reveal clusters of developers with shared interests.
Feel free to explore the demo, star the repository, or open an issue if you have suggestions!
EmbeddingBag for Bucket‑wise Aggregation
We used EmbeddingBag to compute the aggregated embedding of each bucket.
| User | Bucket | Repos in bucket | Mean of bucket embeddings |
|---|---|---|---|
| A | A1 | NumPy, Dask, SciPy | [0.2, -1.1, 0.9, …] |
| A | A2 | Pandas, SK‑Learn | [0.1, -1.3, 0.6, …] |
| B | B1 | Vue, Vite | [-0.4, 0.6, 0.2, …] |
| B | B2 | React, TypeScript | [-0.3, 0.7, 0.1, …] |
Training Objective
We train embeddings so that both of the following conditions hold simultaneously:
- Intra‑user cohesion – buckets belonging to the same user should be as close as possible (e.g., A1 ↔ A2).
- Inter‑user separation – buckets from different users should be pushed far apart (e.g., B1 ↔ A2).
Gradient descent balances these opposing forces to minimise the overall error.
Loss Function
The loss that performed best is MultiSimilarityLoss from the pytorch-metric-learning library.
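Wired together with the bucket sampling above, a single training step looks roughly like this; hyper‑parameters and batch contents are illustrative, not the project’s exact values:

```python
import torch
from pytorch_metric_learning import losses

model = torch.nn.EmbeddingBag(300_000, 128, mode="mean")
loss_fn = losses.MultiSimilarityLoss()                     # default alpha/beta/base
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ids / offsets / labels as produced by the bucket-splitting sketch earlier.
ids = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7])
offsets = torch.tensor([0, 2, 4, 6])                       # four buckets of two repos each
labels = torch.tensor([0, 0, 1, 1])                        # two buckets per user

optimizer.zero_grad()
bucket_embeddings = model(ids, offsets)                    # (4, 128): A1, A2, B1, B2
loss = loss_fn(bucket_embeddings, labels)                  # pull A1<->A2 together, push A*<->B* apart
loss.backward()
optimizer.step()
```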
Note: We owe a debt of gratitude to StarSpace, which introduced this idea eight years ago.
Advanced Methods vs. Simplicity
It seemed natural to assume that the order in which a user stars repositories (the “sequence of stars”) would carry a strong signal, so we experimented with Word2Vec‑style sliding‑window models.
Surprisingly, the simplest random split outperformed all the more complex approaches.
Possible reasons:
- Timing data is too noisy.
- We failed to extract useful information from it.
We also tried:
- Hard Negative Miners.
- Alternative losses such as NTXentLoss (which uses ~4× more memory than MultiSimilarityLoss).
- Cross‑Batch Memory (no noticeable benefit).
None of these beat the original baseline. Sometimes Occam’s razor truly wins; other times, the razor is just dull.
Quality Evaluation
Having vectors is one thing—are they any good?
Instead of synthetic data from an LLM, we used a more elegant ground truth: Awesome Lists (e.g., “Awesome Python”, “Awesome React”). These are human‑curated clusters of similar libraries.
- Downloaded the READMEs of the lists.
- Extracted collocations (which repos appear together).
- Applied heuristic weighting.
- Evaluated ranking with the NDCG metric.
This pipeline let us fairly compare loss functions, hyper‑parameters, and sampling strategies.
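As a sketch of the evaluation, imagine the query repo is Pandas: candidates that co‑occur with it in Awesome Lists get a heuristic relevance weight, everything else gets zero, and the model’s cosine scores are ranked and scored with NDCG. All numbers below are made up.

```python
import numpy as np
from sklearn.metrics import ndcg_score

candidates = ["numpy", "dask", "polars", "vue", "react"]

relevance = np.array([[3, 2, 2, 0, 0]])                      # Awesome-List co-occurrence weights (toy values)
model_scores = np.array([[0.91, 0.74, 0.69, 0.12, 0.08]])    # cosine similarities to the query repo

print(ndcg_score(relevance, model_scores, k=5))              # 1.0 would mean a perfect ranking
```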
Frontend: Showcase & AI‑Assisted Development
Even with a decade of data‑science experience, I’m not a frontend expert. The challenge was to build sophisticated client‑side logic without a backend and without being a JS developer.
All frontend and “glue” code was written with the help of an AI Coding Agent.
Architecture
- Data – The client downloads compressed embeddings (FP16, ~80 MB) and metadata, then caches them in IndexedDB.
- Search (WASM) – Uses the core of the USearch library, compiled to WebAssembly.
Low‑level Magic
Initially we tried a pre‑computed HNSW index, but it consumed more memory than the raw embeddings.
The AI agent implemented Exact Search (still in WASM) by:
- Exposing the low‑level _usearch_exact_search methods.
- Generating a worker (coreWorker.js) that manually manages memory, allocates buffers via _malloc, and manipulates pointers.
- Adding an on‑the‑fly FP16 → FP32 converter because browsers don’t handle native FP16 well.
The result is fast exact search on ~300 k vectors without any HNSW index.
User Profile & Skill Radar
User Embedding
- The client queries the GitHub API for the user’s starred repositories.
- Retrieves the embeddings of those repositories.
- Averages them to obtain a mean vector – a digital fingerprint of the user’s interests.
Because this vector lives in the same metric space as the repository embeddings, we can search for the “nearest” libraries.
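The demo does this in browser JavaScript; in Python the same logic is a few lines. The lookup structures are assumed to be loaded already, and the real client paginates the GitHub API rather than reading only the first page.

```python
import numpy as np
import requests

def user_vector(username: str, repo_to_id: dict, embeddings: np.ndarray) -> np.ndarray:
    """Average the embeddings of the repositories a user has starred (first API page only)."""
    resp = requests.get(f"https://api.github.com/users/{username}/starred", timeout=10)
    resp.raise_for_status()
    starred = [repo["full_name"] for repo in resp.json()]
    known = [repo_to_id[name] for name in starred if name in repo_to_id]
    return embeddings[known].mean(axis=0)
```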
Skill Radar (Interpreting the Vector)
- Prompt an LLM to generate 20 reference repositories for each of 10 categories (e.g., “GenAI”, “Web3”, “System Programming”).
- Train simple Logistic Regression (Linear Probes) to distinguish the vectors of these categories.
- In the browser, pass the user vector through these probes to obtain probability scores for the radar chart.
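A sketch of those probes; the reference embeddings and category list are placeholders, while in the real pipeline they come from the LLM‑suggested repos and the trained embedding table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
categories = ["GenAI", "Web3", "System Programming"]

probes = {}
for i, category in enumerate(categories):
    # 20 reference-repo embeddings per category; the other categories act as negatives.
    X = rng.normal(size=(20 * len(categories), 128))          # placeholder embeddings
    y = (np.arange(len(X)) // 20 == i).astype(int)
    probes[category] = LogisticRegression(max_iter=1000).fit(X, y)

user_vec = rng.normal(size=(1, 128))                          # placeholder user vector
radar = {c: float(p.predict_proba(user_vec)[0, 1]) for c, p in probes.items()}
print(radar)                                                  # per-category scores for the radar chart
```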
Serverless Sharing
To add a social element we pre‑computed vectors for famous developers.
- Similarity metric: Cosine similarity, transformed via a Quantile Transformation so that scores are shown as percentiles (e.g., “95 % Match” means the user is more similar to that developer than 95 % of random pairs).
- Sharing mechanism: The user vector is compressed, Base64‑encoded, and embedded directly into the URL fragment identifier (#…). No database, no backend – pure client‑side math.
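A sketch of both tricks in Python (the browser does the equivalent in JS; the exact compression format is an assumption):

```python
import base64
import numpy as np

def vector_to_fragment(vec: np.ndarray) -> str:
    """Pack a 128-d vector as FP16 and put it in the URL fragment."""
    return "#" + base64.urlsafe_b64encode(vec.astype(np.float16).tobytes()).decode()

def fragment_to_vector(fragment: str) -> np.ndarray:
    raw = base64.urlsafe_b64decode(fragment.lstrip("#"))
    return np.frombuffer(raw, dtype=np.float16).astype(np.float32)

def percentile_match(similarity: float, random_pair_sims: np.ndarray) -> float:
    """The '95 % Match' number: share of random user pairs that are less similar than this pair."""
    return float((random_pair_sims < similarity).mean())
```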
Results: Expectations vs. Reality
Beyond quantitative metrics, we performed an “eyeball test”.
What didn’t work
- Vector arithmetic akin to NLP (King – Man + Woman = Queen).
  Hypothesis: Pandas – Python + TypeScript = Danfo.js.
  Reality: The repository vector space is far more complex; simple linear operations don’t yield interpretable results.
- Distinct clustering – The embeddings do not form clearly separated visual clusters.
What did work
- The primary goal was achieved: search finds relevant alternatives for a user’s starred repositories.
Overall, the system demonstrates that a simple, serverless, client‑only architecture can deliver useful, personalized recommendations from a large‑scale embedding space.
Niche Tools & Fresh Solutions
Unlike LLMs, which often have a bias toward the most popular solutions, this approach—based on the behavior of IT professionals—uncovers:
- Niche Tools: Libraries used by pros but rarely written about in blogs.
- Fresh Solutions: Repositories that gained popularity recently and share a similar “starring pattern.”
- Local‑first: Everything runs locally on client devices.
Future Vision
The current demo shows what is possible without a backend, but many other use‑cases are imaginable.
Semantic Text Search
A text encoder could be trained with a projection layer into the repository‑embedding space, enabling search for tools or people by abstract description.
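One possible shape for that idea (purely a sketch, nothing implemented in the project): a small projection head on top of a frozen text encoder, trained against the repository vectors with the same metric‑learning loss used above.

```python
import torch
import torch.nn as nn

class TextToRepoProjection(nn.Module):
    """Maps text-encoder embeddings into the 128-d repository-embedding space."""
    def __init__(self, text_dim: int = 1024, repo_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(text_dim, repo_dim)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        # Normalize so the output is directly comparable to repo vectors via cosine similarity.
        return nn.functional.normalize(self.proj(text_embedding), dim=-1)
```

Training would pull projected descriptions toward the vectors of matching repositories, so an abstract query could land near the right tools or people.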
GitHub Tinder (Networking)
With user vectors, we can match people:
- Mentor or co‑founder search – Find a person with a complementary stack.
- Contributor discovery – Identify developers who star similar projects but haven’t seen yours yet.
- HR‑Tech – Match candidates to positions based on technical interests.
Trend Analytics
Adding a time dimension would allow tracking emerging technologies and shifting developer interests over months or years.