Colored Petri Nets, LLMs, and distributed applications

Published: February 14, 2026 at 04:08 PM EST
8 min read

Source: Hacker News

Overview

A big theme in LLM‑enabled software development is that verifiable correctness makes it much easier to take bigger leaps with LLMs (e.g., tests, compilers, state machines, etc.). While researching for databuild, I came across colored Petri nets (CPNs) and instantly saw an opportunity.

What are Colored Petri Nets?

  • Petri nets are directed bipartite graphs consisting of places (which hold tokens) and transitions (where side‑effects happen).
  • In classic Petri nets a token carries no data; it is identity‑less, and only its location in the net carries meaning.
  • If every transition has exactly one input, one output, and there is only one token globally, the net is equivalent to a finite‑state machine.

Colored Petri nets extend this model by allowing each token to carry data (its “color”). This brings CPNs close to the Rust typestate pattern and suggests that Rust could implement CPN semantics cleanly.
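As a rough sketch of that connection (the types and names here are hypothetical, not from any real crate): each place becomes a distinct type, the token's color is the data it wraps, and firing a transition consumes the token via move semantics.

```rust
// Hypothetical sketch: a token's place encoded as a typestate, its color as data.
pub struct Url(pub String);

pub struct Queued(pub Url);      // token sitting in the "queued" place
pub struct InProgress(pub Url);  // token sitting in the "in_progress" place

impl Queued {
    // Firing the transition consumes the token; move semantics guarantee at
    // compile time that the same token cannot occupy two places at once.
    pub fn start(self) -> InProgress {
        InProgress(self.0)
    }
}
```

After `let t2 = t.start();`, any further use of `t` is a compile error — the old place no longer holds the token.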

Why CPNs Matter for Concurrency

  1. Concurrent programming is still hard.
  2. CPNs give us a path to formal verification of concurrent programs at build time.

If we pair a high‑performance datastore with a CPN framework, we can offload many of the hard parts of concurrent applications:

| Feature | How CPNs help |
| --- | --- |
| State synchronization | Tokens move atomically between places. |
| Conflict detection | Guards prevent illegal transitions. |
| Deadlock avoidance | Multi‑token joins make required preconditions explicit. |
| Coordinating shared resources | Multi‑token consumption/production models forks and joins. |

Guards

A guard is a list of boolean conditions that must hold for a transition to fire (e.g., “there must be at least one free connection in the pool”).
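A minimal sketch of a guard, assuming a hypothetical binding with a free-connection count and a retry counter (the field names and `MAX_RETRIES` cap are illustrative):

```rust
// Hypothetical binding the guard inspects before the transition fires.
pub struct Binding {
    pub pool_free: usize,
    pub retries: u32,
}

pub const MAX_RETRIES: u32 = 5; // assumed cap

// The transition may fire only if every condition in the guard holds.
pub fn guard(b: &Binding) -> bool {
    let conditions = [
        b.pool_free >= 1,        // at least one free connection in the pool
        b.retries < MAX_RETRIES, // retry budget not exhausted
    ];
    conditions.iter().all(|&c| c)
}
```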

Multi‑token consumption / production

  • Fork: P1 → T1 → (P2, P3) – one token from P1 produces tokens in both P2 and P3.
  • Join: (P1, P2) → T1 → P3 – tokens must be present in both P1 and P2 before a token can appear in P3.
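The fork and join above can be sketched as pure token moves, using a `Vec<u32>` as a stand-in for a place (real tokens would carry richer colors):

```rust
// Fork: one token from p1 produces tokens in both p2 and p3
// (here the color is simply duplicated into both output places).
pub fn fork(p1: &mut Vec<u32>, p2: &mut Vec<u32>, p3: &mut Vec<u32>) -> bool {
    match p1.pop() {
        Some(t) => {
            p2.push(t);
            p3.push(t);
            true
        }
        None => false, // no token in p1: transition cannot fire
    }
}

// Join: fires only when both p1 and p2 hold a token; produces one in p3.
pub fn join(p1: &mut Vec<u32>, p2: &mut Vec<u32>, p3: &mut Vec<u32>) -> bool {
    if p1.is_empty() || p2.is_empty() {
        return false; // precondition not met: transition cannot fire
    }
    let a = p1.pop().unwrap();
    let _b = p2.pop().unwrap(); // consumed; a real net would combine the colors
    p3.push(a);
    true
}
```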

Example Application: Web Scraping with Leased Proxies

Problem statement

  • Limited pool of proxies.
  • Need to rate‑limit usage per proxy and per target domain.
  • Must avoid concurrent duplicate requests to the same target.
  • Want to enforce responsible crawling (e.g., domain‑level cooldowns).

Classic solution

Use a central database with SELECT … FOR UPDATE to lease resources.

CPN‑based solution

| Place | Meaning |
| --- | --- |
| available_proxies | Free proxy tokens |
| prioritized_targets | Targets ready to be scraped |
| domains | Domain‑level tokens for rate‑limiting |
| cool_down | Targets waiting before they can be retried |
| failed_log | Records of failed scrapes |
| raw_html, parsed, validated, stored | Stages of the result pipeline (each with its own concurrency limits) |

Core transition

scrape_target:  available_proxies × prioritized_targets → in_progress

The transition fires only when both a proxy and a target token are present.
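A sketch of that join (the `Proxy` and `Target` types are illustrative, not taken from spider‑rs): the `in_progress` token pairs the two consumed colors.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Proxy(pub String);
#[derive(Clone, Debug, PartialEq)]
pub struct Target(pub String);

// scrape_target: available_proxies × prioritized_targets → in_progress
pub fn scrape_target(
    available_proxies: &mut Vec<Proxy>,
    prioritized_targets: &mut Vec<Target>,
    in_progress: &mut Vec<(Proxy, Target)>,
) -> bool {
    // Join precondition: one token in each input place.
    if available_proxies.is_empty() || prioritized_targets.is_empty() {
        return false;
    }
    let proxy = available_proxies.pop().unwrap();
    let target = prioritized_targets.pop().unwrap();
    in_progress.push((proxy, target)); // the leased pair moves together
    true
}
```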

Additional features (implemented as extra places / guards)

  • Scrape target cooldowns – after a scrape, the target token moves to cool_down. A timed Petri net delays its return to prioritized_targets.
  • Domain‑level rate limiting – transition becomes a 3‑way join: domains × available_proxies × prioritized_targets.
  • Retry with backoff – on failure, the token forks: one copy to failed_log, another back to prioritized_targets with an incremented retry count and a longer cooldown. A guard caps the maximum attempts.
  • Result pipeline – tokens flow through raw_html → parsed → validated → stored. Each stage can have its own token limits, naturally providing back‑pressure.
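The retry-with-backoff fork can be sketched as follows; the token fields, doubling schedule, and `MAX_ATTEMPTS` cap are assumptions for illustration.

```rust
#[derive(Clone)]
pub struct TargetToken {
    pub url: String,
    pub attempts: u32,
    pub cooldown_secs: u64,
}

pub const MAX_ATTEMPTS: u32 = 3; // assumed cap enforced by the guard

// On failure the token forks: one copy to failed_log, one retry copy back
// to prioritized_targets with an incremented count and a longer cooldown.
pub fn on_failure(
    failed: TargetToken,
    failed_log: &mut Vec<TargetToken>,
    prioritized_targets: &mut Vec<TargetToken>,
) {
    failed_log.push(failed.clone()); // one fork branch records the failure
    if failed.attempts + 1 < MAX_ATTEMPTS {
        // Guard passes: re-enqueue with exponential backoff.
        prioritized_targets.push(TargetToken {
            attempts: failed.attempts + 1,
            cooldown_secs: failed.cooldown_secs * 2,
            ..failed
        });
    }
    // Guard fails: the token leaves the retry loop; only the log keeps it.
}
```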

Example Application: Databuild

The network consists of:

  • Partitions (leased to job runs)
  • Wants (user‑declared data dependencies)
  • Job runs (executors)

A self‑organizing process could propagate data dependencies and resolve user‑declared partition wants safely and efficiently.

(Details omitted for brevity; see “partitioning problem” below.)


Implementation Strategies

| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Traditional app + Postgres | Use transactions to move tokens atomically; SELECT … FOR UPDATE claims tokens for a transition. | Mature persistence, ACID guarantees, easy scaling with DB sharding. | Latency of round‑trips; DB becomes a bottleneck under heavy contention. |
| Single‑process Rust binary | Keep all CPN state in memory; use Rust's typestate pattern for compile‑time guarantees. Persistence via snapshotting or an event log. | Extremely fast, zero‑copy moves, strong compile‑time safety. | Needs a persistence strategy; limited by a single machine's memory. |

The Partitioning Problem

If the CPN state grows beyond a single server’s memory, we need automatic partitioning to enable fearless concurrency and horizontal scalability.

Possible approaches:

  1. Application‑level archiving – add an archive place/transition that moves “cold” tokens out of the active net.
  2. Database‑level sharding – store different sub‑nets in separate DB shards and coordinate via cross‑shard joins.
  3. Distributed CPN apps – run multiple CPN instances that expose a query/consume API; the global net is a composition of these sub‑nets.

Each approach trades off latency, complexity, and consistency guarantees.
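For the distributed approach, one simple ownership scheme is to route each token to the sub‑net that owns its partition key. A sketch, assuming domain is the key and hash-based routing (both are illustrative choices):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route a token to one of `num_partitions` sub-net instances by hashing its
// partition key; the same key always lands on the same owner.
pub fn partition_for(key: &str, num_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_partitions
}
```

Because ownership is disjoint, a transition whose input tokens all share one partition key never needs cross‑instance coordination; only cross‑key joins do.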


Vision

If we can build a CPN framework with sensible persistence (e.g., event‑sourced logs, snapshots, or a backing store), we can:

  • Impose compile‑time constraints on agent‑authored code (boosting developer velocity).
  • Provide correctness guarantees and simulation capabilities out of the box—features that apps built on unconstrained computation typically lack.

The result: a powerful, formally‑verifiable foundation for building concurrent, distributed systems.


Appendix

Random Thoughts

  • Open‑source benchmark:
    Is there an existing open‑source service whose test suite I can use—or benchmark against—to validate this? Ideally, I could just point Claude’s code at it.

  • Timed Petri nets:
    Could we emit metrics by default and simulate the system‑wide impact of a proposed change to the net?

  • Database joins as transitions:
    Treat join transitions as literal joins in the database, with transition conditions implying indexes. Token transitions would then be implemented as inserts and deletes inside a transaction.

  • Scope of the project:
    Are we aiming to build merely a datastore, or a full‑blown CPN (Colored Petri Net) database?

  • Distributed architecture:
    Should we envision a distributed network of CPN applications, or an application that is itself a distributed network of CPN services? In either case, the focus is on partitioning.

  • Re‑entrancy in CPNs:
    CPNs don’t need explicit re‑entrance. Every “step” is simply a movement of tokens, which can be derived entirely from the state of all involved tokens.

Potential Bets (LLM‑Written)

To validate the idea we need a concrete target: re‑implement a component using Colored Petri Net (CPN) semantics and compare it against the existing implementation.
The core question isn’t whether CPNs can be fast or correct—obviously they can. The question is:

Does the CPN paradigm make writing simple, correct, and fast concurrent programs easier?
In other words, does the formalism pay for itself in reduced bugs and less bespoke coordination code?


The Bet: Scraper Scheduler (spider‑rs)

Goal: Implement the decision‑making core of a web scraper—URL prioritization, domain‑level rate limiting, proxy assignment, and retry scheduling—using CPNs.

  • Mock the HTTP layer.
  • Benchmark decisions per second.
  • Compare the amount and complexity of coordination code against the original Rust implementation.

Why Scrapers?

  • Traditional scrapers are riddled with ad‑hoc coordination (rate‑limit maps, cooldown trackers, retry counters, domain queues).
  • These concerns map directly onto CPN concepts (places, timed transitions, token ownership).
  • Correctness is critical: politeness violations can lead to bans.
  • Timed transitions are a first‑class feature of CPNs, not an afterthought.
  • spider‑rs is already a Rust project, making for a clean fork and direct performance comparison.

Why Not the Other Candidates First?

| Option | Reason for Deferring |
| --- | --- |
| Job queue (Faktory) | Coordination is incidental; the pattern is well‑understood and CPN overhead may not provide a benefit. |
| Connection pooler (pgcat) | Too small a workload – it may not exercise enough CPN capabilities to be compelling. |
| CI runner (Buildkite) | Too large and complex (artifacts, logs, shell execution) – the coordination story gets obscured. |

Implementation Approach

  • In‑memory Rust + SQLite snapshots – single‑threaded, move semantics apply, extremely fast.
  • Partitioning – separate binaries with disjoint token ownership (enforced at net‑definition time via partition‑key requirements).
  • Simulation layer – run the same CPN execution with mocked side effects, enabling:
    • Correctness testing,
    • Failure injection,
    • Time fast‑forwarding.

This setup gives us a reproducible benchmark and a clear lens on how much coordination code the CPN model eliminates.
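The simulation layer's time fast‑forwarding can be sketched by putting the clock behind a trait, so the same net logic runs against a real clock in production and a mock in simulation (trait and names here are illustrative):

```rust
pub trait Clock {
    fn now_secs(&self) -> u64;
}

pub struct SimClock {
    pub now: u64,
}

impl SimClock {
    // Time fast-forwarding: advance simulated time instantly, no sleeping.
    pub fn fast_forward(&mut self, secs: u64) {
        self.now += secs;
    }
}

impl Clock for SimClock {
    fn now_secs(&self) -> u64 {
        self.now
    }
}

// A timed-transition guard: a cooled-down token may return only after `ready_at`.
pub fn may_fire(clock: &dyn Clock, ready_at: u64) -> bool {
    clock.now_secs() >= ready_at
}
```

A production build would supply a `Clock` impl backed by the system clock; the net itself never notices the difference.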
