How We Built a Distributed Work Scheduling System for Pulumi Cloud

Published: February 25, 2026, 7:00 PM EST
13 min read

Source: Pulumi Blog

Pulumi Cloud Background Activity System

Pulumi Cloud orchestrates a growing number of workflow types: Deployments, Insights discovery scans, and policy evaluations. Some of that work runs on Pulumi’s infrastructure, and some runs on yours via customer‑managed workflow runners. We needed a scheduling system that could handle all of these workflow types reliably across both environments. In this post, we’ll take a look at the system we built.


Where we started

For our first workflow integration—Deployments—scheduling wasn’t too complicated:

  1. A deployment was queued.
  2. A worker picked it up.
  3. It ran.

The queue was purpose‑built for deployments, and it worked well for that single use case. Over time we added more sophisticated logic to handle retries, ordering, rate limiting, observability, and more.

When Insights launched, the number of workflow types grew. Pulumi Cloud now manages:

  • Discovery scans to catalog cloud resources.
  • Audit policy evaluations to continuously verify compliance.

While these workflows share similarities, each type needed its own scheduling, retry logic, and failure handling.

Later we added the option for customers to run workflows on their own infrastructure using customer‑managed workflow runners. As the requirements grew, we realized that our initial Deployments‑only approach wouldn’t scale. We needed a single system that could:

  • Schedule any type of work.
  • Route it to the right place.
  • Handle the messy reality of distributed execution (crashes, network failures, rate limits, retries).

We call this the background activity system.


Why not use an off‑the‑shelf queue?

We considered Amazon SQS, RabbitMQ, and other queue libraries, but chose to build our own for several reasons.

1. Self‑hosted & air‑gapped environments

Pulumi Cloud supports self‑hosted installations, including air‑gapped environments. We intentionally minimize external dependencies so that self‑hosted customers don’t have to stand up additional infrastructure. A system built on an external queue works fine for our hosted service, but it would force self‑hosted customers to provide a compatible backend. By building on top of the database we already require, we avoid adding another system to maintain.

2. Scheduling + durability, not just queuing

What we actually need is scheduling with durability:

  • Remote workers must not lose activities on restart.
  • Priority handling so urgent work gets compute resources first.
  • Constraints such as “only n scans per org at a time.”
  • Structured logging for observability.
  • Checkpointing so long‑running operations can resume after a failure.

These features can be layered onto a generic queue library, but doing so often requires more code than implementing them directly.

Example: Priority queues are frequently built with multiple ranked queues, which breaks single‑activity‑at‑a‑time constraints. A second queue wouldn’t see a job already running in the first one, and producers in a distributed system can’t coordinate across queues without native support.

3. Capacity management

Generic queues fall short on dynamic capacity management. Distributed systems need to respond to:

  • Slowdowns.
  • Network interruptions.
  • Rate limits from downstream services.

These low‑level details are common to every workflow type, so embedding them in the scheduling layer prevents individual handlers from re‑solving the same problems.

4. Structured logging everywhere

We need logging that works even on customer‑managed runners behind firewalls where centralized logging services aren’t accessible.

Building this ourselves gave us a system that works with existing infrastructure and handles these requirements natively.


Design constraints

With that context, here are the constraints that shaped the design:

  • Pull‑only agents – Customer‑managed workflow runners live behind NATs, corporate proxies, and air‑gapped networks. They can’t accept inbound connections, so all communication must be agent‑initiated.
  • Mixed execution environments – The same system must serve both Pulumi‑hosted workers (direct internal access) and customer‑managed runners (communicating entirely over REST). We didn’t want to maintain two separate code paths.
  • Different workflow types – Deployments, Insights scans, and audit policy evaluations have distinct payloads and execution semantics, yet all require the same scheduling guarantees: exactly‑once execution, automatic retries, failure recovery, and observability.
  • Automatic fault tolerance – Agents crash, networks drop, and machines are recycled by autoscalers. The system must detect these failures and recover without human intervention.
  • Extensibility – New workflow types will keep being added. Adding one should mean writing a handler and registering it, not building new infrastructure.

The background activity

At the center of the system is the background activity, a persistent, typed work unit. Each activity includes:

  • type discriminator – Identifies the kind of work (e.g., insights-discovery or policy-evaluation).
  • payload – Type‑specific data the handler needs.
  • routing context – Determines which runner pool should execute the activity.
  • scheduling metadata – Priority, activation time, retry configuration, etc.
  • status – Tracks where the activity is in its lifecycle.

The type discriminator makes the system polymorphic. The scheduling engine doesn’t need to understand the payload; it simply moves activities through their lifecycle and delegates the actual work to a type‑specific handler.


The state machine

Every activity follows the same lifecycle regardless of type. The states fall into two groups:

Running states (work is in flight or can be resumed)

  • Ready – Queued and eligible to be claimed by a worker.
  • Pending – Claimed by a worker; execution about to start.
  • Executing – Actively running on a worker.

Waiting states (work is parked or recovering)

  • Waiting – blocked on one or more dependency activities.
  • Restarting – recovered after a worker failure, ready to be re‑claimed.

Terminal states (work is done)

  • Completed, Failed, Canceled.

Note: New workflow types automatically get scheduling, retries, dependency management, and observability.
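One way to enforce a lifecycle like this is a transition table the engine checks on every state change. This is a simplified sketch (the exact set of legal transitions is an assumption):

```python
# Allowed lifecycle transitions; any move not listed here is rejected.
TRANSITIONS: dict[str, set[str]] = {
    "waiting":    {"ready", "canceled"},        # dependencies satisfied, or canceled
    "ready":      {"pending", "canceled"},      # claimed by a worker
    "pending":    {"executing", "restarting"},  # started, or lease expired first
    "executing":  {"completed", "failed", "restarting", "ready"},  # "ready" = rescheduled
    "restarting": {"ready", "canceled"},        # re-queued after worker failure
    "completed":  set(),  # terminal
    "failed":     set(),  # terminal
    "canceled":   set(),  # terminal
}

def transition(current: str, target: str) -> str:
    """Validate and apply a single lifecycle step."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Centralizing the table means every workflow type gets the same lifecycle guarantees without per-handler state logic.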


Leases: Distributed Execution Without Coordination

A central challenge of any distributed work queue is preventing double‑execution.
If two agents try to execute the same activity simultaneously, you get duplicate work and data corruption.

A central coordinator can solve this, but it becomes a single point of failure. Instead, we use lease‑based optimistic concurrency, a well‑known pattern adapted for long‑running, stateful workflows.

How it works

  1. Agent requests work – it asks the service to lease an activity.
  2. Service selects work – atomically picks the highest‑priority Ready activity, assigns a lease token with an expiration time, and transitions the activity to Pending. No other agent can claim the same activity.
  3. Agent executes – while running, the agent periodically renews its lease to signal that it’s still working.
  4. Failure detection – if the agent crashes, loses network connectivity, or is terminated, it stops renewing. When the lease expires, the service moves the activity to Restarting, making it available for another agent.

The service never needs explicit coordination between workers; leases are acquired using atomic database operations. The lease expiration itself acts as the failure detector.
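The claim, renew, and expire steps above can be sketched as follows. In the real system the claim is a single atomic database update; here an in-process lock stands in for that atomicity, and all names are illustrative:

```python
import threading
import time
import uuid

class Activity:
    def __init__(self, id: str, priority: int = 0):
        self.id = id
        self.priority = priority
        self.status = "ready"
        self.lease_token: str | None = None
        self.lease_expires = 0.0

_lock = threading.Lock()  # stands in for the database's atomic update

def lease_activity(activities, lease_seconds: float = 30.0):
    """Atomically claim the highest-priority Ready activity."""
    with _lock:
        ready = [a for a in activities if a.status == "ready"]
        if not ready:
            return None, None
        best = max(ready, key=lambda a: a.priority)
        best.status = "pending"
        best.lease_token = uuid.uuid4().hex
        best.lease_expires = time.monotonic() + lease_seconds
        return best, best.lease_token

def renew_lease(activity, token: str, lease_seconds: float = 30.0) -> bool:
    """Heartbeat: extend the lease if we still hold it."""
    with _lock:
        if activity.lease_token != token:
            return False  # someone else holds the lease now
        activity.lease_expires = time.monotonic() + lease_seconds
        return True

def reap_expired(activities):
    """Failure detector: expired leases move activities to Restarting."""
    with _lock:
        now = time.monotonic()
        for a in activities:
            if a.status in ("pending", "executing") and a.lease_expires < now:
                a.status = "restarting"
                a.lease_token = None
```

Note that priority selection happens inside the same atomic claim, which is what lets a single logical queue honor both priority and single-claim semantics.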


Routing Work to the Right Runner Pool

Pulumi Cloud supports multiple workflow runner pools (e.g., production in us‑east‑1, staging in eu‑west‑1, Pulumi‑hosted runners for development).

  • Routing context – each activity carries a context that identifies the target runner pool.
  • Pool‑based filtering – when a runner polls for work, it filters by its own pool identifier, seeing only activities meant for it.

Prefix matching

  • Runners match activities whose context starts with their pool’s identifier.
  • Example hierarchy: pool‑abc/insights/scan‑123
  • Deleting a pool is simple: bulk‑cancel all activities whose context starts with that pool’s prefix.

This routing works the same for every workflow type; adding a new type requires no changes to the routing layer.
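Both the polling filter and the bulk cancel reduce to a string prefix check. A sketch, with illustrative pool identifiers (the separator guard against `pool-abc` matching `pool-abcd` is an assumption about the identifier format):

```python
def visible_to(pool_id: str, activities: list[dict]) -> list[dict]:
    """Activities a runner in `pool_id` may lease."""
    prefix = pool_id + "/"
    return [
        a for a in activities
        if a["context"] == pool_id or a["context"].startswith(prefix)
    ]

def cancel_pool(pool_id: str, activities: list[dict]) -> None:
    """Deleting a pool bulk-cancels everything routed to it."""
    for a in visible_to(pool_id, activities):
        a["status"] = "canceled"
```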


Dependencies and Multi‑Step Workflows

Many workflows consist of multiple steps (e.g., an Insights discovery scan followed by policy evaluation).

  • Dependency set – an activity can declare a list of other activities that must complete before it can run.
  • Waiting state – a dependent activity starts in Waiting.
  • Ready transition – as each dependency finishes, the system checks whether all prerequisites are satisfied. When the last one completes, the activity moves to Ready and enters the scheduling queue.

This creates a lightweight DAG of work without a separate orchestration engine. Dependent activities receive the same guarantees as any other activity: lease‑based execution, automatic recovery, and observability.
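The Ready transition is essentially a subset check run whenever a dependency finishes. A sketch, with illustrative field names:

```python
def on_dependency_completed(activity: dict, completed_ids: set[str]) -> None:
    """Move a Waiting activity to Ready once all prerequisites are done."""
    if activity["status"] != "waiting":
        return
    if set(activity["depends_on"]) <= completed_ids:
        activity["status"] = "ready"
```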


Two Execution Modes, One Interface

  • Direct – Runs in‑process alongside the Pulumi Cloud service. Workers have low‑latency access to internal systems and process activities with minimal overhead. Typical use: Pulumi‑hosted runners.
  • Remote – Communicates over REST APIs. The runner polls for activities, leases them, executes work locally, and reports results back via HTTP. No database access, no internal network access, no inbound connectivity required. Typical use: customer‑managed runners.
  • Both modes share the same handler interface, so a workflow handler does not need to know where it’s running.
  • Whether the handler runs on Pulumi’s infrastructure or on a customer’s Kubernetes cluster, it simply processes the payload and reports a result.

Putting It All Together – Example Flow

  1. User configuration – a user configures an AWS account for Insights scanning in Pulumi Cloud and assigns it to a workflow runner pool.
  2. Activity creation – Pulumi Cloud creates a background activity:
    • type = insights-discovery
    • routing context =
    • payload = account configuration
  3. Runner polls – a customer‑managed workflow runner polling that pool detects the new work.
  4. Lease acquisition – the runner leases the activity, receiving an exclusive lock via the lease token.
  5. Workflow initialization – the runner receives any required cloud‑provider credentials (e.g., resolved from Pulumi ESC, short for Environments, Secrets, and Configuration) and a job token from Pulumi Cloud.
  6. Execution – the runner executes the scan locally on the customer’s infrastructure, talking directly to the cloud‑provider APIs.
  7. Lease renewal – during execution, the runner periodically renews its lease to signal liveness.
  8. Completion – the scan finishes; the runner reports the result back to Pulumi Cloud.
  9. Activity finalization – the service marks the activity as Completed and archives it.
  10. Dependency handling – if a policy‑evaluation activity was waiting on this scan, it automatically transitions to Ready and enters the scheduling queue, where another runner in the pool can pick it up.

The flow is identical for Pulumi‑hosted runners (direct mode) and customer‑managed runners (remote mode); only the execution mode differs.


Retries and Scheduling

Failures are expected in distributed systems. The background activity system handles them at several levels:

  • Lease expiration – covers hard failures (crash, network loss). The activity is moved to Restarting and becomes eligible for another lease.
  • Automatic retries – configurable retry policies (exponential back‑off, max attempts) are applied when an activity transitions to Failed.
  • Prioritization & back‑pressure – the scheduler respects activity priority and pool capacity, ensuring that high‑priority work is processed first while preventing overload.

These mechanisms together provide robust, self‑healing execution of complex, multi‑step workflows across both direct and remote runner environments.

Failure Modes & Lease Handling

  • Agent crashes, network partitions, and machine terminations can cause a lease to expire.
  • When a lease expires, the activity moves to Restarting and becomes available for another agent to pick up.

Handler‑Controlled Retries

  • Designed for soft failures such as transient API errors and rate‑limit responses.
  • A handler can request a reschedule with a delay, which puts the activity back into Ready with a future activation time.

Automatic Retries

  • Each activity can define a retry budget:
    • Number of retry attempts.
    • Delay between attempts.
  • This prevents runaway retry loops while still giving flaky work a chance to succeed.
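A retry budget of this shape might compute delays with capped exponential back‑off. A sketch; the base delay and cap below are assumptions, not Pulumi's actual defaults:

```python
def next_retry_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential back-off in seconds: 2, 4, 8, ... capped at 5 minutes."""
    return min(base * (2 ** attempt), cap)

def should_retry(attempts: int, max_attempts: int) -> bool:
    """Stop once the retry budget is exhausted."""
    return attempts < max_attempts
```

The cap keeps a long-lived transient failure (say, a sustained downstream rate limit) from pushing activation times out indefinitely.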

Priority Scheduling

  • Urgent work is processed first.
  • Higher‑priority activities are leased before lower‑priority ones, even if the lower‑priority activity has been waiting longer.

Lease Renewal During Slowdowns

  • The lease can be renewed while waiting on a downstream service, keeping the activity alive without blocking other work.
  • The agent continues renewing its lease, and the scheduler remains free to assign other activities to other agents.

Observability

  • Every activity generates a structured log that includes:
    • Timestamps.
    • Severity levels.
    • Code context.
  • Logs are stored with the activity record and are accessible via an API and admin tooling.

Benefits for Customer‑Managed Runners

  • The service cannot directly observe the execution environment, so the structured log provides visibility even when the runner is behind a firewall.
  • Handlers can use these logs as a progress journal, encoding checkpoints that allow a restarted activity to resume where it left off rather than starting from scratch.
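One way a handler might use the log as a progress journal: emit checkpoint entries alongside ordinary log lines, then scan for the latest one on restart. The entry shape here is an assumption for illustration:

```python
import json

def latest_checkpoint(log_lines: list[str]):
    """Scan a structured log for the most recent checkpoint entry."""
    checkpoint = None
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("kind") == "checkpoint":
            checkpoint = entry["data"]
    return checkpoint  # None means start from scratch
```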

Retention Policies

  • Configurable per organization and per workflow type.
  • Completed activities can be retained for auditing or purged to manage storage.
  • Failed activities are typically retained longer for debugging.

What We Learned

  1. A generic system pays off quickly

    • Initial instinct: build targeted solutions for each workflow type.
    • Investing in a generic activity system required more upfront design work, but now adding a new workflow type requires a fraction of the effort.
    • New workflows ship with full scheduling, retry, and observability support from day one.
  2. Leases handle many failure modes

    • Evaluated several approaches for distributed work coordination, including message queues with explicit acknowledgment and coordinator‑based assignment.
    • The lease model works well because all failure modes are handled through timeouts:
      • If an agent is running as expected, it renews.
      • If it isn’t, the lease expires.
  3. Keeping execution paths symmetric requires discipline

    • Making the hosted and self‑hosted paths share the same handler interface was a deliberate choice.
    • It would be easy to add shortcuts for the hosted path that bypass the remote API, but resisting that temptation means features work for both cases automatically.
  4. The hard part isn’t running the work

    • Running a scan or a deployment is straightforward once you have the right credentials.
    • The real complexity lies in everything around the execution: scheduling, routing, leasing, retrying, resolving dependencies, and cleaning up.
    • These operational concerns aren’t visible to users, but they are essential to providing a reliable experience.

Wrapping It Up

  • Today this system powers deployments, Insights discovery scans, and policy evaluations across both Pulumi Cloud and customer‑managed infrastructure.
  • The architecture is general enough that every new workflow type we add inherits the full scheduling, routing, retry, and observability stack without additional plumbing.

Next Steps

  • Run workflows on your own infrastructure – check out customer‑managed workflow runners.
  • Explore Pulumi Insights – see how it can help you understand and manage your cloud infrastructure.