Dataset Generator v1.0.3-beta ships local LLM support — fine-tune your model without paying a cent for API calls

Published: May 3, 2026 at 01:06 PM EDT
7 min read
Source: Dev.to

Overview

A while back I shipped a desktop app that generates LLM fine‑tuning datasets.
It worked: my Qwen2.5‑Coder‑7B fine‑tune jumped from 55.5 % → 72.3 % on HumanEval.
The whole pipeline ran on OpenRouter – pick a model, click Generate, get a JSONL file.

v1.0.3‑beta now ships multi‑provider LLM support – Ollama, LM Studio, llama.cpp, or any custom OpenAI‑compatible endpoint, plus the original OpenRouter.
Mix and match: generate on your local Qwen3‑14B, judge on a cheap cloud model, or stay fully offline.

Below is a quick rundown of what shipped, what turned out harder than expected, and the lessons learned.

New Features

1️⃣ One‑click local LLM detection

  • Path: Settings → Providers → "Auto‑detect local"
  • The app probes the following ports:
| Provider | Port |
| --- | --- |
| Ollama | 11434 |
| LM Studio | 1234 |
| llama.cpp | 8080 |
  • Any endpoint that answers gets a one‑click “Add” button.
  • On‑boarding for an offline‑first user now takes ≈ 30 s.
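
For the curious, the probe is conceptually just a handful of concurrent HTTP checks. Here's a minimal sketch (not the app's actual detection code) that assumes all three servers answer an OpenAI‑compatible /v1/models request on their default ports:

```python
import asyncio
import httpx

# Default ports the auto-detect probe checks (same as the table above).
CANDIDATES = {
    "ollama": "http://localhost:11434",
    "lm_studio": "http://localhost:1234",
    "llama_cpp": "http://localhost:8080",
}

async def probe(base_url: str) -> bool:
    """Return True if something OpenAI-compatible answers on this port."""
    try:
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(f"{base_url}/v1/models")
            return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def autodetect() -> list[str]:
    results = await asyncio.gather(*(probe(url) for url in CANDIDATES.values()))
    return [name for name, ok in zip(CANDIDATES, results) if ok]

if __name__ == "__main__":
    print(asyncio.run(autodetect()))  # e.g. ['ollama']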

2️⃣ Mixed‑mode pipelines

  • Each category can use its own provider.
    • Example: generate on a local Qwen2.5‑Coder‑14B, judge on a cheap cloud model (e.g., GPT‑4o mini).
    • Or different generators per category – e.g., algorithm category on a code‑specialised local model.
  • The pipeline automatically routes each call to the correct backend.
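
Conceptually, mixed mode is just a per‑category lookup before each call. The config shape below is illustrative, not the app's real job schema:

```python
# Hypothetical mixed-mode job config: field names are for illustration only.
job_config = {
    "categories": {
        "algorithms": {"generator": {"provider": "ollama", "model": "qwen2.5-coder:14b"}},
        "refactoring": {"generator": {"provider": "openrouter", "model": "openai/gpt-4o-mini"}},
    },
    "judge": {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}

def resolve_backend(config: dict, category: str, role: str) -> dict:
    """Pick the provider/model configured for this category and role."""
    if role == "judge":
        return config["judge"]                 # one judge for the whole job
    return config["categories"][category][role]  # per-category generator
```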

3️⃣ Custom endpoints

  • Any OpenAI‑compatible URL works (vLLM, TGI, self‑hosted gateways, etc.).
  • Just paste the base URL + optional bearer token → done.
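
Under the hood this is the standard OpenAI client pointed at a different base URL. A minimal sketch, assuming a vLLM‑style server on localhost:8000; the model name and token are placeholders:

```python
from openai import OpenAI

# Any OpenAI-compatible server works the same way: set base_url and,
# if the gateway requires it, a bearer token as the api_key.
client = OpenAI(
    base_url="http://localhost:8000/v1",      # e.g. a vLLM or TGI deployment
    api_key="my-gateway-token",               # placeholder; some servers accept any string
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-14B-Instruct",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```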

4️⃣ Instant cancel for local jobs

  • Cloud APIs finish in seconds, so cooperative cancel is trivial.
  • A local 14B model can sit on a single chat completion for minutes.
  • v1.0.3‑beta wires asyncio.Task.cancel() straight into the in‑flight HTTP request, making cancel feel instant (~1 s) instead of waiting for a timeout (≈ 8 min).
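
Stripped down, the trick looks like this (a sketch, not the actual job‑runner code): the task awaiting the HTTP call gets cancelled, which tears the request down instead of letting it run to completion:

```python
import asyncio
import httpx

async def generate_example(client: httpx.AsyncClient, payload: dict) -> dict:
    # httpx's async request is cancellation-aware: cancelling the task that
    # awaits it raises CancelledError here and the request is torn down.
    resp = await client.post(
        "http://localhost:11434/v1/chat/completions", json=payload, timeout=None
    )
    return resp.json()

async def run_with_cancel(payload: dict) -> None:
    async with httpx.AsyncClient() as client:
        task = asyncio.create_task(generate_example(client, payload))
        await asyncio.sleep(1.0)   # pretend the user clicks "Cancel" one second in
        task.cancel()              # propagates into the in-flight await
        try:
            await task
        except asyncio.CancelledError:
            print("cancelled in ~1 s instead of waiting out the request")
```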

5️⃣ Auto‑handling for reasoning models

  • Models like Qwen3, DeepSeek‑R1, etc., emit long reasoning (`<think>`) blocks that can gobble the whole token budget before any real output appears.
  • The pipeline detects “reasoning starvation” (empty content + finish=length + reasoning present) and automatically retries with a 4× larger budget.
  • No manual fiddling required.
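
The detection plus retry is only a few lines. A sketch assuming an OpenAI‑style response dict; `call` stands in for whichever provider method issues the chat completion:

```python
def is_reasoning_starved(choice: dict) -> bool:
    """Empty content + finish_reason == 'length' + reasoning present."""
    msg = choice.get("message", {})
    return (
        not (msg.get("content") or "").strip()
        and choice.get("finish_reason") == "length"
        and bool(msg.get("reasoning") or msg.get("reasoning_content"))
    )

async def chat_with_retry(call, params: dict) -> dict:
    resp = await call(**params)
    if is_reasoning_starved(resp["choices"][0]):
        # Retry once with a 4x larger completion budget.
        params = {**params, "max_tokens": params.get("max_tokens", 2048) * 4}
        resp = await call(**params)
    return resp
```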

6️⃣ Token accounting across providers

| Provider | Issue | Fix |
| --- | --- | --- |
| OpenRouter | None – cleanly separates reasoning_tokens in the usage payload. | – |
| Ollama | completion_tokens includes think + content (e.g., 800 + 80 = 880). | Detect `<think>` blocks (Format A) or message.reasoning (Format B), strip the reasoning, recount with tiktoken, and write the corrected number back to usage.completion_tokens. |
| LM Studio | Uses message.reasoning_content. | Same stripping logic; LM Studio also surfaces reasoning_tokens in completion_tokens_details, so the “subtract path” catches it. |

Result: Quality Report and per‑example token counts now agree.
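
The correction itself is roughly this (a sketch, not the shipped code): strip the reasoning and recount what's left. cl100k_base is only an approximation of each local model's tokenizer, but the point is a consistent count across providers:

```python
import re
import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def visible_completion_tokens(message: dict) -> int:
    """Token count of the answer that actually ships, reasoning stripped."""
    content = message.get("content") or ""
    # Format A: inline <think>...</think> blocks embedded in content.
    content = _THINK_RE.sub("", content)
    # Format B (message.reasoning / reasoning_content) keeps content clean,
    # so counting the remaining content is already correct.
    return len(_ENC.encode(content))
```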

7️⃣ Capability‑driven provider abstraction

  • Early version scattered if provider.kind == "ollama" checks throughout the code.
  • Refactored to ProviderCapabilities flags:
    supports_provider_routing
    supports_reasoning
    requires_api_key
    has_pricing
    supports_embeddings
  • Adding a new backend now requires one class + one registry entry, with zero changes to job_runner.py.
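
In sketch form it looks like the snippet below; the class and flag names follow what's described above, while the registry decorator is my own illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_provider_routing: bool = False
    supports_reasoning: bool = False
    requires_api_key: bool = False
    has_pricing: bool = False
    supports_embeddings: bool = False

class LLMProvider(ABC):
    capabilities: ProviderCapabilities

    @abstractmethod
    async def chat(self, model: str, messages: list[dict], **kwargs) -> dict: ...

REGISTRY: dict[str, type[LLMProvider]] = {}

def register(name: str):
    """Register a provider class so the job runner can resolve it by name."""
    def wrap(cls: type[LLMProvider]) -> type[LLMProvider]:
        REGISTRY[name] = cls
        return cls
    return wrap

@register("ollama")
class OllamaProvider(LLMProvider):
    capabilities = ProviderCapabilities(supports_reasoning=True)

    async def chat(self, model, messages, **kwargs):
        ...  # POST to the local OpenAI-compatible endpoint
```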

8️⃣ Default provider reassignment UX

  • Old behaviour: disabling the default (e.g., OpenRouter) left the system in a silent orphan state; next job failed with “Provider ‘openrouter‑default’ is disabled” (422).
  • New behaviour: the backend auto‑promotes the next enabled provider to default and the frontend shows a 4‑second toast – “Default switched to Ollama (local)”.

An easy bug to miss, and a trivial one to fix once seen.
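
The promotion logic itself is tiny. This sketch uses plain dicts instead of the real SQLite‑backed provider rows:

```python
def disable_provider(providers: list[dict], provider_id: str) -> str | None:
    """Disable a provider; if it was the default, promote the next enabled one.
    Returns the name of the newly promoted default, if any."""
    for p in providers:
        if p["id"] == provider_id:
            p["enabled"] = False
            if p.get("is_default"):
                p["is_default"] = False
                fallback = next((q for q in providers if q["enabled"]), None)
                if fallback:
                    fallback["is_default"] = True
                    return fallback["name"]   # frontend toast: "Default switched to ..."
    return None
```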

User Personas Unlocked

| Persona | Pain point | How v1.0.3‑beta helps |
| --- | --- | --- |
| Privacy‑conscious | Corporate/NDA‑bound code can’t leave the laptop. | All processing can stay offline on local hardware. |
| Cost‑conscious | Generating 5 000 multi‑turn examples on cloud GPT‑4 costs a fortune. | Use a cheap local generator (e.g., Qwen3‑14B) + cloud judge → ≈ 1/10 the bill. |
| No cloud account | Regulations, no credit card, or unsupported country. | Entire pipeline runs without a single external API call. |

Lessons Learned

1️⃣ 14B local models are the practical floor

  • 7B/9B variants produce technically valid output but drift off‑topic, repeat patterns, and misunderstand categories.
  • Whatever you save by skipping the cloud, you spend regenerating rejected examples.
  • 14B is the minimum; 32B feels comfortable if you have the VRAM.

2️⃣ The judge model matters more than the generator

  • Small local judges (≈ 8B) tend to rubber‑stamp scores (95‑100) regardless of quality.
  • Larger judges (≥ 14B) can miss good examples because they don’t grasp the category.
  • Spend cloud money on the judge or use a 32B+ local judge if hardware permits.

3️⃣ Mixed mode is the killer feature

  • Expected “fully offline” to be the main win, but most users want:
    • a cheap local model → generate volume (≈ 7 000 examples)
    • a strong cloud model → judge (quality control)
  • v1.0.3‑beta makes this a one‑line config – pick generator from one provider, judge from another, ship it.

4️⃣ Per‑provider concurrency limits add complexity for little gain

  • Prototype: configure “Ollama: 1, OpenRouter: 10” so a global semaphore doesn’t drown a local GPU.
  • In practice, single‑user, single‑GPU setups dominate (≈ 99 % of users).
  • Feature was cut from v1.0.3‑beta and parked for future enterprise use.

What’s Next?

  • Enterprise‑grade concurrency controls (re‑introduce when demand appears).
  • Better token‑budget introspection for providers that hide reasoning tokens.
  • More provider capability flags as the ecosystem expands (e.g., streaming, function‑calling).

Feel free to try the beta, report bugs, and suggest improvements! 🚀

Multi‑GPU vLLM – Who Actually Needs It?

Provider badge in the model picker
When two providers serve the same model name (e.g., llama‑3.1‑8b on both Ollama and OpenRouter), the picker shows two identical‑looking entries. I sketched a small badge UI to differentiate them, then realized typical setups don’t have name collisions (you know which models you put where). Punted to a future polish pass.

Same Foundations, New Layer

| Layer | Tech Stack | New Additions |
| --- | --- | --- |
| Frontend | Next.js 16 (static export) + Tailwind + base‑ui | ProvidersSection for CRUD, auto‑detect, per‑row connection test |
| Backend | FastAPI + SQLite (WAL) + Pydantic | app/services/llm/ – provider abstraction (LLMProvider ABC + ProviderCapabilities); app/routers/providers.py |

Schema Migration

  • providers table added in v6 with back‑fill of the legacy single OpenRouter key.
  • Existing setups migrate silently on first launch.
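
Roughly what the v6 migration has to do; the column names here are assumptions, not the real schema:

```python
import sqlite3

def migrate_v6(conn: sqlite3.Connection, legacy_openrouter_key: str | None) -> None:
    """Create the providers table and back-fill the legacy OpenRouter key as default."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS providers (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            kind TEXT NOT NULL,          -- 'openrouter', 'ollama', ...
            base_url TEXT,
            api_key TEXT,
            enabled INTEGER DEFAULT 1,
            is_default INTEGER DEFAULT 0
        )
    """)
    if legacy_openrouter_key:
        conn.execute(
            "INSERT INTO providers (name, kind, api_key, enabled, is_default) "
            "VALUES ('OpenRouter', 'openrouter', ?, 1, 1)",
            (legacy_openrouter_key,),
        )
    conn.commit()
```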

Tests

  • 460 passing (up from 329 in the previous release)
  • Full coverage for the four backends, registry resolution, auto‑detect, mixed‑mode jobs.

License & Distribution

  • AGPL‑3.0 (same as before)
  • One‑binary distribution (Linux AppImage, Windows .exe).

Open‑Source Repositories

  • App repo (v1.0.3‑beta):
  • Original release post (with HumanEval + 16pp benchmark): previous dev.to post
  • Dataset (2,248 examples):
  • Fine‑tuned model:

What’s Next

  1. System‑tray version

    • Long generation runs (5,000+ examples on local hardware = hours) deserve a quieter UX than a permanent open window.
    • Tray icon, “next job ready” notification, click to bring back the dashboard.
  2. Embedding provider picker

    • Deduplication works multi‑provider on the backend, but the UI only exposes OpenRouter embedding models.
    • Add a small dropdown so local users can run dedup on nomic‑embed‑text via Ollama too.
  3. Two new categories targeting LiveCodeBench and BigCodeBench

    • The previous post explained why those benchmarks barely moved (format mismatch on LCB, too‑generic library category for BCB).
    • Both fixes are in progress:
      • Algorithmic drill with edge‑case coverage for LCB.
      • Library‑API‑precise taxonomy for BCB.
  4. Community feedback

    • If you generate datasets locally, what model size are you using and what’s your acceptance rate?
    • Especially curious if anyone got real value out of smaller local models.

Disclosure: I drafted this post with AI help – the same way I built the app.