Dataset Generator v1.0.3-beta ships local LLM support — fine-tune your model without paying a cent for API calls

Published: May 3, 2026 at 01:06 PM EDT
7 min read
Source: Dev.to

Overview

A while back I shipped a desktop app that generates LLM fine‑tuning datasets.
It worked: my Qwen2.5‑Coder‑7B fine‑tune jumped from 55.5 % → 72.3 % on HumanEval.
The whole pipeline ran on OpenRouter – pick a model, click Generate, get a JSONL file.

v1.0.3‑beta now ships multi‑provider LLM support – Ollama, LM Studio, llama.cpp, or any custom OpenAI‑compatible endpoint, plus the original OpenRouter.
Mix and match: generate on your local Qwen3‑14B, judge on a cheap cloud model, or stay fully offline.

Below is a quick rundown of what shipped, what turned out harder than expected, and the lessons learned.

New Features

1️⃣ One‑click local LLM detection

  • Path: Settings → Providers → "Auto‑detect local"
  • The app probes the following ports:
| Provider | Port |
| --- | --- |
| Ollama | 11434 |
| LM Studio | 1234 |
| llama.cpp | 8080 |
  • Any endpoint that answers gets a one‑click “Add” button.
  • On‑boarding for an offline‑first user now takes ≈ 30 s.
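
For the curious, the probe is conceptually just a handful of concurrent HTTP checks. Here's a minimal sketch (not the app's actual detection code) that assumes all three servers answer an OpenAI‑compatible /v1/models request on their default ports:

```python
import asyncio
import httpx

# Default ports the auto-detect probe checks (same as the table above).
CANDIDATES = {
    "ollama": "http://localhost:11434",
    "lm_studio": "http://localhost:1234",
    "llama_cpp": "http://localhost:8080",
}

async def probe(base_url: str) -> bool:
    """Return True if something OpenAI-compatible answers on this port."""
    try:
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(f"{base_url}/v1/models")
            return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def autodetect() -> list[str]:
    results = await asyncio.gather(*(probe(url) for url in CANDIDATES.values()))
    return [name for name, ok in zip(CANDIDATES, results) if ok]

if __name__ == "__main__":
    print(asyncio.run(autodetect()))  # e.g. ['ollama']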

2️⃣ Mixed‑mode pipelines

  • Each category can use its own provider.
    • Example: generate on a local Qwen2.5‑Coder‑14B, judge on a cheap cloud model (e.g., GPT‑4o mini).
    • Or different generators per category – e.g., algorithm category on a code‑specialised local model.
  • The pipeline automatically routes each call to the correct backend.
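
Conceptually, mixed mode is just a per‑category lookup before each call. The config shape below is illustrative, not the app's real job schema:

```python
# Hypothetical mixed-mode job config: field names are for illustration only.
job_config = {
    "categories": {
        "algorithms": {"generator": {"provider": "ollama", "model": "qwen2.5-coder:14b"}},
        "refactoring": {"generator": {"provider": "openrouter", "model": "openai/gpt-4o-mini"}},
    },
    "judge": {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}

def resolve_backend(config: dict, category: str, role: str) -> dict:
    """Pick the provider/model configured for this category and role."""
    if role == "judge":
        return config["judge"]                 # one judge for the whole job
    return config["categories"][category][role]  # per-category generator
```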

3️⃣ Custom endpoints

  • Any OpenAI‑compatible URL works (vLLM, TGI, self‑hosted gateways, etc.).
  • Just paste the base URL + optional bearer token → done.
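
Under the hood this is the standard OpenAI client pointed at a different base URL. A minimal sketch, assuming a vLLM‑style server on localhost:8000; the model name and token are placeholders:

```python
from openai import OpenAI

# Any OpenAI-compatible server works the same way: set base_url and,
# if the gateway requires it, a bearer token as the api_key.
client = OpenAI(
    base_url="http://localhost:8000/v1",      # e.g. a vLLM or TGI deployment
    api_key="my-gateway-token",               # placeholder; some servers accept any string
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-14B-Instruct",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```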

4️⃣ Instant cancel for local jobs

  • Cloud APIs finish in seconds, so cooperative cancel is trivial.
  • A local 14B model can sit on a single chat completion for minutes.
  • v1.0.3‑beta wires asyncio.Task.cancel() straight into the in‑flight HTTP request, making cancel feel instant (~1 s) instead of waiting for a timeout (≈ 8 min).
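
Stripped down, the trick looks like this (a sketch, not the actual job‑runner code): the task awaiting the HTTP call gets cancelled, which tears the request down instead of letting it run to completion:

```python
import asyncio
import httpx

async def generate_example(client: httpx.AsyncClient, payload: dict) -> dict:
    # httpx's async request is cancellation-aware: cancelling the task that
    # awaits it raises CancelledError here and the request is torn down.
    resp = await client.post(
        "http://localhost:11434/v1/chat/completions", json=payload, timeout=None
    )
    return resp.json()

async def run_with_cancel(payload: dict) -> None:
    async with httpx.AsyncClient() as client:
        task = asyncio.create_task(generate_example(client, payload))
        await asyncio.sleep(1.0)   # pretend the user clicks "Cancel" one second in
        task.cancel()              # propagates into the in-flight await
        try:
            await task
        except asyncio.CancelledError:
            print("cancelled in ~1 s instead of waiting out the request")
```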

5️⃣ Auto‑handling for reasoning models

  • Models like Qwen3, DeepSeek‑R1, etc., emit long reasoning (`<think>`) blocks that can gobble the whole token budget before any real output appears.
  • The pipeline detects “reasoning starvation” (empty content + finish=length + reasoning present) and automatically retries with a 4× larger budget.
  • No manual fiddling required.
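
The detection plus retry is only a few lines. A sketch assuming an OpenAI‑style response dict; `call` stands in for whichever provider method issues the chat completion:

```python
def is_reasoning_starved(choice: dict) -> bool:
    """Empty content + finish_reason == 'length' + reasoning present."""
    msg = choice.get("message", {})
    return (
        not (msg.get("content") or "").strip()
        and choice.get("finish_reason") == "length"
        and bool(msg.get("reasoning") or msg.get("reasoning_content"))
    )

async def chat_with_retry(call, params: dict) -> dict:
    resp = await call(**params)
    if is_reasoning_starved(resp["choices"][0]):
        # Retry once with a 4x larger completion budget.
        params = {**params, "max_tokens": params.get("max_tokens", 2048) * 4}
        resp = await call(**params)
    return resp
```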

6️⃣ Token accounting across providers

| Provider | Issue | Fix |
| --- | --- | --- |
| OpenRouter | None – cleanly separates reasoning_tokens in the usage payload. | – |
| Ollama | completion_tokens includes think + content (e.g., 800 + 80 = 880). | Detect `<think>` blocks (Format A) or message.reasoning (Format B), strip the reasoning, recount with tiktoken, and write the corrected number back to usage.completion_tokens. |
| LM Studio | Uses message.reasoning_content. | Same stripping logic; LM Studio also surfaces reasoning_tokens in completion_tokens_details, so the “subtract path” catches it. |

Result: Quality Report and per‑example token counts now agree.
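
The correction itself is roughly this (a sketch, not the shipped code): strip the reasoning and recount what's left. cl100k_base is only an approximation of each local model's tokenizer, but the point is a consistent count across providers:

```python
import re
import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def visible_completion_tokens(message: dict) -> int:
    """Token count of the answer that actually ships, reasoning stripped."""
    content = message.get("content") or ""
    # Format A: inline <think>...</think> blocks embedded in content.
    content = _THINK_RE.sub("", content)
    # Format B (message.reasoning / reasoning_content) keeps content clean,
    # so counting the remaining content is already correct.
    return len(_ENC.encode(content))
```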

7️⃣ Capability‑driven provider abstraction

  • Early version scattered if provider.kind == "ollama" checks throughout the code.
  • Refactored to ProviderCapabilities flags:
    supports_provider_routing
    supports_reasoning
    requires_api_key
    has_pricing
    supports_embeddings
  • Adding a new backend now requires one class + one registry entry, with zero changes to job_runner.py.
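
In sketch form it looks like the snippet below; the class and flag names follow what's described above, while the registry decorator is my own illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_provider_routing: bool = False
    supports_reasoning: bool = False
    requires_api_key: bool = False
    has_pricing: bool = False
    supports_embeddings: bool = False

class LLMProvider(ABC):
    capabilities: ProviderCapabilities

    @abstractmethod
    async def chat(self, model: str, messages: list[dict], **kwargs) -> dict: ...

REGISTRY: dict[str, type[LLMProvider]] = {}

def register(name: str):
    """Register a provider class so the job runner can resolve it by name."""
    def wrap(cls: type[LLMProvider]) -> type[LLMProvider]:
        REGISTRY[name] = cls
        return cls
    return wrap

@register("ollama")
class OllamaProvider(LLMProvider):
    capabilities = ProviderCapabilities(supports_reasoning=True)

    async def chat(self, model, messages, **kwargs):
        ...  # POST to the local OpenAI-compatible endpoint
```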

8️⃣ Default provider reassignment UX

  • Old behaviour: disabling the default (e.g., OpenRouter) left the system in a silent orphan state; next job failed with “Provider ‘openrouter‑default’ is disabled” (422).
  • New behaviour: the backend auto‑promotes the next enabled provider to default and the frontend shows a 4‑second toast – “Default switched to Ollama (local)”.

An easy bug to miss, and a trivial one to fix once seen.
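
The promotion logic itself is tiny. This sketch uses plain dicts instead of the real SQLite‑backed provider rows:

```python
def disable_provider(providers: list[dict], provider_id: str) -> str | None:
    """Disable a provider; if it was the default, promote the next enabled one.
    Returns the name of the newly promoted default, if any."""
    for p in providers:
        if p["id"] == provider_id:
            p["enabled"] = False
            if p.get("is_default"):
                p["is_default"] = False
                fallback = next((q for q in providers if q["enabled"]), None)
                if fallback:
                    fallback["is_default"] = True
                    return fallback["name"]   # frontend toast: "Default switched to ..."
    return None
```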

User Personas Unlocked

| Persona | Pain point | How v1.0.3‑beta helps |
| --- | --- | --- |
| Privacy‑conscious | Corporate/NDA‑bound code can’t leave the laptop. | All processing can stay offline on local hardware. |
| Cost‑conscious | Generating 5 000 multi‑turn examples on cloud GPT‑4 costs a fortune. | Use a cheap local generator (e.g., Qwen3‑14B) + cloud judge → ≈ 1/10 the bill. |
| No cloud account | Regulations, no credit card, or unsupported country. | Entire pipeline runs without a single external API call. |

Lessons Learned

1️⃣ 14B local models are the practical floor

  • 7B/9B variants produce technically valid output but drift off‑topic, repeat patterns, and misunderstand categories.
  • Whatever you save by skipping the cloud, you spend regenerating rejected examples.
  • 14B is the minimum; 32B feels comfortable if you have the VRAM.

2️⃣ The judge model matters more than the generator

  • Small local judges (≈ 8B) tend to rubber‑stamp scores (95‑100) regardless of quality.
  • Larger judges (≥ 14B) can miss good examples because they don’t grasp the category.
  • Spend cloud money on the judge or use a 32B+ local judge if hardware permits.

3️⃣ Mixed mode is the killer feature

  • Expected “fully offline” to be the main win, but most users want:
    • a cheap local model → generate volume (≈ 7 000 examples)
    • a strong cloud model → judge (quality control)
  • v1.0.3‑beta makes this a one‑line config – pick generator from one provider, judge from another, ship it.

4️⃣ Per‑provider concurrency limits add complexity for little gain

  • Prototype: configure “Ollama: 1, OpenRouter: 10” so a global semaphore doesn’t drown a local GPU.
  • In practice, single‑user, single‑GPU setups dominate (≈ 99 % of users).
  • Feature was cut from v1.0.3‑beta and parked for future enterprise use.

What’s Next?

  • Enterprise‑grade concurrency controls (re‑introduce when demand appears).
  • Better token‑budget introspection for providers that hide reasoning tokens.
  • More provider capability flags as the ecosystem expands (e.g., streaming, function‑calling).

Feel free to try the beta, report bugs, and suggest improvements! 🚀

Multi‑GPU vLLM – Who Actually Needs It?

Provider badge in the model picker
When two providers serve the same model name (e.g., llama‑3.1‑8b on both Ollama and OpenRouter), the picker shows two identical‑looking entries. I sketched a small badge UI to differentiate them, then realized typical setups don’t have name collisions (you know which models you put where). Punted to a future polish pass.

Same Foundations, New Layer

| Layer | Tech Stack | New Additions |
| --- | --- | --- |
| Frontend | Next.js 16 (static export) + Tailwind + base‑ui | ProvidersSection for CRUD, auto‑detect, per‑row connection test |
| Backend | FastAPI + SQLite (WAL) + Pydantic | app/services/llm/ – provider abstraction (LLMProvider ABC + ProviderCapabilities); app/routers/providers.py |

Schema Migration

  • providers table added in v6 with back‑fill of the legacy single OpenRouter key.
  • Existing setups migrate silently on first launch.
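
Roughly what the v6 migration has to do; the column names here are assumptions, not the real schema:

```python
import sqlite3

def migrate_v6(conn: sqlite3.Connection, legacy_openrouter_key: str | None) -> None:
    """Create the providers table and back-fill the legacy OpenRouter key as default."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS providers (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            kind TEXT NOT NULL,          -- 'openrouter', 'ollama', ...
            base_url TEXT,
            api_key TEXT,
            enabled INTEGER DEFAULT 1,
            is_default INTEGER DEFAULT 0
        )
    """)
    if legacy_openrouter_key:
        conn.execute(
            "INSERT INTO providers (name, kind, api_key, enabled, is_default) "
            "VALUES ('OpenRouter', 'openrouter', ?, 1, 1)",
            (legacy_openrouter_key,),
        )
    conn.commit()
```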

Tests

  • 460 passing (up from 329 in the previous release)
  • Full coverage for the four backends, registry resolution, auto‑detect, mixed‑mode jobs.

License & Distribution

  • AGPL‑3.0 (same as before)
  • One‑binary distribution (Linux AppImage, Windows .exe).

Open‑Source Repositories

  • App repo (v1.0.3‑beta):
  • Original release post (with HumanEval + 16pp benchmark): previous dev.to post
  • Dataset (2,248 examples):
  • Fine‑tuned model:

What’s Next

  1. System‑tray version

    • Long generation runs (5,000+ examples on local hardware = hours) deserve a quieter UX than a permanent open window.
    • Tray icon, “next job ready” notification, click to bring back the dashboard.
  2. Embedding provider picker

    • Deduplication works multi‑provider on the backend, but the UI only exposes OpenRouter embedding models.
    • Add a small dropdown so local users can run dedup on nomic‑embed‑text via Ollama too.
  3. Two new categories targeting LiveCodeBench and BigCodeBench

    • The previous post explained why those benchmarks barely moved (format mismatch on LCB, too‑generic library category for BCB).
    • Both fixes are in progress:
      • Algorithmic drill with edge‑case coverage for LCB.
      • Library‑API‑precise taxonomy for BCB.
  4. Community feedback

    • If you generate datasets locally, what model size are you using and what’s your acceptance rate?
    • Especially curious if anyone got real value out of smaller local models.

Disclosure: I drafted this post with AI help – the same way I built the app.