Durable Workflow Engines for Security Automation

Most security automation works in seconds. A finding is created, a webhook fires, a tag is applied, a ticket opens. The whole thing starts and finishes inside a single function call, and nobody thinks about what happens if the process dies halfway through, because it almost never does.

Then you write the automation that has to wait. Wait three days for a scan to finish. Wait for a human to approve a risky change. Wait for an external signal that may never arrive. Suddenly the comfortable model breaks. A process restart, a deployment, or a crashed worker erases everything the automation was holding in memory, and your carefully designed playbook silently disappears.

This is the problem durable workflow engines solve. They let an automation pause for days, survive a restart, and pick up exactly where it left off. This article explains what “durable” actually means, walks through the execution model PMAP uses for its runbooks, and shows where durability matters and where it does not. If you want the broader picture of automating a vulnerability program, start with the vulnerability management automation pillar. This piece goes one layer down, into the engine that keeps long-running playbooks alive.

When Automation Has to Wait

The defining feature of a durable workflow is patience. Two automation primitives create the need for it, and both show up constantly in real security playbooks.

The first is sleep. You want a runbook to pause for a fixed duration before continuing. Open a ticket, then wait 24 hours, then check whether anyone acted on it. Trigger a re-scan, then wait an hour for results to land. In PMAP the sleep action accepts durations such as 30s, 5m, 2h, or 7d. A seven-day pause is not unusual in a remediation playbook, and a seven-day pause is exactly the kind of thing an in-memory timer cannot guarantee.
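A duration string such as 7d needs a little care in Go, because the standard library's time.ParseDuration has no day unit. The sketch below shows one way such strings could be parsed. The parseWait helper is a hypothetical name used for illustration, and treating a day as exactly 24 hours is an assumption rather than PMAP's documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseWait converts strings such as "30s", "5m", "2h", or "7d" into a
// time.Duration. It is a hypothetical helper, not PMAP's actual parser.
func parseWait(s string) (time.Duration, error) {
	// time.ParseDuration does not accept "d", so days are handled here
	// and assumed to be exactly 24 hours long.
	if strings.HasSuffix(s, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid day count in %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid duration %q: %w", s, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("negative duration %q", s)
	}
	return d, nil
}

func main() {
	for _, in := range []string{"30s", "5m", "2h", "7d"} {
		d, err := parseWait(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %v\n", in, d)
	}
}
```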

The second is await_signal. Here the automation waits not for a clock but for an event that originates outside the workflow. A human approves a change. A remote scan finishes and reports back. Some external system posts a signal the runbook was listening for. In PMAP the await_signal action waits for a named external signal with a configurable timeout, and the default timeout is 72 hours. Three days is a sensible window for a human-in-the-loop gate, because people take weekends off and approvals are not instant.

Both sleep and await_signal belong to the same category in the runbook action catalog: flow control. They are the two actions that fundamentally change the lifetime of an automation. Every other action runs and returns. These two suspend. The moment an automation suspends, you have to answer a hard question: where does its state live while it waits, and what happens to that state if the process restarts in the middle?

Inline Execution and Its Limits

To understand why durability matters, it helps to see the default model it replaces.

PMAP runs runbooks through three engine modes, controlled by the PMAP_RUNBOOK_ENGINE environment variable. The default mode is inline, and inline execution is exactly what it sounds like. When a platform event matches a runbook, the action list runs synchronously, in-process, in the same goroutine that handled the event. There is no queue, no separate worker, no persistence of intermediate state. The actions fire one after another and the run finishes.

Inline execution is the right default for the overwhelming majority of runbooks. Most automations are short sequences of mutations and notifications. A finding is created, the runbook changes its severity, adds a tag, sends a Slack message, and creates a Jira issue. All of that completes in well under a second. Persisting the state of each step to a database between actions would add latency and complexity for no benefit, because the run never needs to survive anything. It either completes immediately or it fails immediately.

The limit is precise. Inline mode skips the durable-wait primitives. When the engine is set to inline, a sleep action does not durably persist a timer, and an await_signal action cannot wait for an external signal across a restart, because there is nowhere durable to hold that suspended state. The await_signal action in particular is durable only via the workflow engine. Inline execution simply has no place to put a workflow that needs to live for three days while everyone else moves on.

So inline is fast, simple, and stateless by default, and that is its strength and its boundary at the same time. The moment your automation needs to wait, you need a different execution model underneath it.

What “Durable” Actually Means

Durability is a precise property, not a marketing adjective. A workflow engine is durable when its timers and signals are persisted and survive worker restarts. Strip away everything else and that is the whole idea. The state of a paused automation lives in a database, not in process memory, so a restart cannot lose it.

Concretely, this is what durability buys you in PMAP’s workflow mode. Each runbook execution is dispatched as a durable workflow instance backed by a PostgreSQL store. The actions run as registered activities. The timers behind sleep and the signal channels behind await_signal are written to Postgres tables rather than held in RAM. If the worker process is restarted by a deployment, a crash, or a routine scaling event, the suspended workflows are still sitting in the database, waiting. When the worker comes back, the engine reconstructs their state and they continue as if nothing happened.

This is the difference between an automation that can wait minutes and one that can genuinely wait days. An in-memory timer is only as durable as the process holding it. A persisted timer is as durable as your database. For long-running security playbooks, that distinction is the entire reason the durable engine exists. A standards-aligned way to think about this is resilience as a system property, in the sense described by NIST SP 800-160 on systems security engineering, where the goal is for a system to keep performing its function despite disruption rather than failing cleanly and forgetting its work.

The Three Engine Modes

PMAP does not force a single execution model on every runbook deployment. The PMAP_RUNBOOK_ENGINE variable selects one of three modes, and each one exists for a specific operational reason.
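As a rough illustration, reading the variable could look like the sketch below. Only the PMAP_RUNBOOK_ENGINE name and the inline default come from the description above. The literal values "inline", "workflow", and "shadow", the EngineMode type, and the choice to treat an unrecognized value as inline are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// EngineMode is a hypothetical type naming the three engine modes.
type EngineMode string

const (
	ModeInline   EngineMode = "inline"
	ModeWorkflow EngineMode = "workflow"
	ModeShadow   EngineMode = "shadow"
)

// parseEngineMode maps the raw variable value to a mode. An empty value
// selects the inline default. An unrecognized value also yields inline,
// with ok set to false so the caller can log it (an assumed policy).
func parseEngineMode(raw string) (mode EngineMode, ok bool) {
	switch EngineMode(strings.ToLower(strings.TrimSpace(raw))) {
	case "", ModeInline:
		return ModeInline, true
	case ModeWorkflow:
		return ModeWorkflow, true
	case ModeShadow:
		return ModeShadow, true
	default:
		return ModeInline, false
	}
}

func main() {
	mode, ok := parseEngineMode(os.Getenv("PMAP_RUNBOOK_ENGINE"))
	if !ok {
		fmt.Println("unrecognized engine mode, using inline")
	}
	fmt.Println("runbook engine:", mode)
}
```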

Inline, the Default Synchronous Path

Inline is the default. Actions run synchronously in-process, and the durable workflow backend is never even constructed. In code terms, the NewBackend function that provisions the Postgres-backed engine is never called when the mode is inline. The package that owns the durable backend sits unused at runtime. This matters operationally, because it means a fresh PMAP install runs all of its runbooks without any extra database schema, any extra worker, or any extra moving parts. You opt into durability deliberately; you do not pay for it by default.

Workflow, the Durable Path

Workflow mode is the durable path. When the engine is set to workflow, each runbook execution becomes a durable workflow instance dispatched against the Postgres backend. Non-durable actions run as activities. Durable timers and signals are persisted. Worker restarts are survivable. This is the mode you enable when your runbooks include sleep or await_signal and you need those waits to be real rather than best-effort.

Shadow, Validating Before Cut-Over

Shadow mode is the careful operator’s bridge between the two. In shadow mode the inline path runs for real and finalizes the execution, while a durable workflow instance is started in parallel without committing any side-effects. The point is validation. You can run the durable engine alongside the proven inline engine, compare their behavior on real production traffic, and confirm that the workflow path produces the results you expect, all before you flip the switch and let it own execution. Shadow mode is how you de-risk the cut-over to durable execution instead of discovering a behavioral difference in production.
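The idea of running one path for real and another without side-effects can be sketched in a few lines. This is a deliberately simplified, sequential model. In the description above the shadow side is a durable workflow instance running in parallel, whereas here it is just a second call with commit set to false. The Action type, the commit flag, and the string comparison of results are all hypothetical.

```go
package main

import "fmt"

// Action is a hypothetical runbook step. Apply returns the outcome it
// computed and performs its side effect only when commit is true.
type Action struct {
	Name  string
	Apply func(commit bool) string
}

// runShadow executes every action twice: once for real (the authoritative
// inline result) and once as a dry run (the shadow result). It returns the
// authoritative results and the names of actions whose outcomes differed.
func runShadow(actions []Action) (results []string, mismatches []string) {
	for _, a := range actions {
		authoritative := a.Apply(true)
		shadow := a.Apply(false)
		results = append(results, authoritative)
		if shadow != authoritative {
			mismatches = append(mismatches, a.Name)
		}
	}
	return results, mismatches
}

func main() {
	tagsWritten := 0
	actions := []Action{{
		Name: "add_tag",
		Apply: func(commit bool) string {
			if commit {
				tagsWritten++ // the side effect happens once, on the real path
			}
			return "tag=needs-review"
		},
	}}
	results, mismatches := runShadow(actions)
	fmt.Println(results, mismatches, tagsWritten)
}
```

An empty mismatch list over a period of real traffic is the kind of evidence that would justify promoting a deployment to workflow mode.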

The three modes form a natural adoption path. Start inline. Run shadow to validate. Promote to workflow when you trust it. Nothing about the runbooks themselves changes between modes, only the engine underneath.

How Durable Sleep and Await Signal Work

Once the durable engine is active, the two flow-control primitives behave very differently from their inline counterparts. This is the heart of durable execution, so it is worth being concrete.

A durable sleep is implemented as a workflow timer inside the workflow context, not as a thread that blocks for the duration. When a workflow reaches a sleep, the engine schedules a timer in the persistent store and the workflow yields. No process is sitting and counting down. The timer lives in the database. When it fires, the engine wakes the workflow and execution continues. A seven-day sleep consumes no live process for those seven days, and it survives every restart in between, because nothing about it ever depended on a process staying alive.
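The difference between blocking and scheduling can be made concrete with a small model. In the sketch below a map stands in for the database table of pending timers, and a polling call stands in for the engine noticing that a timer is due. TimerStore, Schedule, and Due are hypothetical names, and this is not how PMAP's backend is actually implemented. The point is only that Schedule records a wake-up time and returns immediately.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// TimerStore stands in for a persistent table of pending timers, keyed by
// execution ID. In a real engine this state would live in the database.
type TimerStore struct {
	fireAt map[string]time.Time
}

func NewTimerStore() *TimerStore {
	return &TimerStore{fireAt: make(map[string]time.Time)}
}

// Schedule records when the execution should wake up. Nothing blocks.
func (s *TimerStore) Schedule(id string, now time.Time, d time.Duration) {
	s.fireAt[id] = now.Add(d)
}

// Due returns, in sorted order, the executions whose wake-up time has
// arrived, and removes their timers from the store.
func (s *TimerStore) Due(now time.Time) []string {
	var ids []string
	for id, at := range s.fireAt {
		if !at.After(now) {
			ids = append(ids, id)
			delete(s.fireAt, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// Values used only by the demonstration.
const week = 7 * 24 * time.Hour

var demoStart = time.Date(2024, time.January, 1, 9, 0, 0, 0, time.UTC)

func main() {
	store := NewTimerStore()
	store.Schedule("exec-42", demoStart, week)

	// Any number of restarts could happen here; only the stored time matters.
	fmt.Println("day 3:", store.Due(demoStart.Add(3*24*time.Hour)))
	fmt.Println("day 7:", store.Due(demoStart.Add(week)))
}
```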

A durable await_signal is built from a persistent signal channel paired with a scheduled timer. The signal channel is what the workflow listens on, and the timer enforces the timeout. When an external signal with the matching name arrives, it is delivered to the waiting workflow instance and execution resumes. If no signal arrives before the timeout expires, the timer fires instead and the wait ends. The default timeout is 72 hours, which gives a real human or a real external system three days to respond before the automation gives up gracefully.
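The pairing of a signal channel with a timeout timer maps naturally onto a Go select statement. The sketch below is an in-memory illustration of those semantics only. It is not durable, and the awaitSignal function and its return values are hypothetical. The 72-hour constant mirrors the default described above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultSignalTimeout mirrors the 72-hour default described in the text.
const defaultSignalTimeout = 72 * time.Hour

// awaitSignal waits for a payload on signals or for the timeout to elapse,
// whichever happens first. ok is false when the wait timed out.
func awaitSignal(signals <-chan string, timeout time.Duration) (payload string, ok bool) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case p := <-signals:
		return p, true // the external signal arrived in time
	case <-timer.C:
		return "", false // the timer fired first, so the wait ends
	}
}

func main() {
	approvals := make(chan string, 1)
	approvals <- "approved"

	payload, ok := awaitSignal(approvals, defaultSignalTimeout)
	fmt.Println(payload, ok)

	// No further signal is pending, so a short timeout expires instead.
	payload, ok = awaitSignal(approvals, 10*time.Millisecond)
	fmt.Println(payload, ok)
}
```

In a durable engine both halves of the select are backed by persisted records, which is what lets the wait outlive the process that started it.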

Human-in-the-Loop Gates

The 72-hour default is what makes human-in-the-loop automation practical. A runbook can pause at a sensitive step, notify a person that a decision is needed, and wait. The approver has three days by default to respond, the runbook holds its place durably the entire time, and if nobody responds the timeout closes the gate rather than leaving the automation stuck forever. This is exactly the pattern described in incident handling guidance such as NIST SP 800-61 on computer security incident handling, where playbooks routinely pause for human judgment before taking a consequential action. A durable engine is what lets that pause be a real, recoverable state instead of a fragile in-memory wait.

Replay From History

The mechanism that makes all of this possible is replay. The engine keeps an append-only event history for each workflow execution, and that history is the source of truth for deterministic replay. When a worker resumes a suspended workflow, it does not restore a frozen snapshot of memory. Instead it replays the recorded history of everything that has happened, step by step, to reconstruct the workflow’s state up to the point where it paused. The activities that already ran are not re-executed; their recorded results are replayed from history. Then the workflow continues from where it left off.

Replay is why a durable workflow can survive an arbitrary restart at an arbitrary moment. The truth of what happened is in the history table, the timers and pending signals are in their own tables, and the engine can always rebuild the live picture from those persisted records. The cost of this design is that workflow logic must be deterministic, because non-deterministic code would replay differently than it originally ran. The benefit is that an automation can pause for a week and resume correctly no matter what happened to the infrastructure underneath it.
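A toy model makes the replay idea easier to see. In the sketch below, History is an in-memory stand-in for the persisted event history, and Context.Activity either returns a recorded result or runs the activity and records it. All of the types and names are hypothetical, and a real engine records much more than a list of results.

```go
package main

import "fmt"

// History is an append-only list of recorded activity results. It stands
// in for the persisted event history of one workflow execution.
type History struct {
	results []string
}

// Context tracks how far a single pass through the workflow has advanced.
type Context struct {
	history  *History
	cursor   int
	Executed int // activities actually run (not replayed) during this pass
}

// Activity returns the recorded result if this step already ran in an
// earlier pass. Otherwise it runs the activity and appends its result.
func (c *Context) Activity(run func() string) string {
	if c.cursor < len(c.history.results) {
		r := c.history.results[c.cursor]
		c.cursor++
		return r
	}
	r := run()
	c.history.results = append(c.history.results, r)
	c.cursor++
	c.Executed++
	return r
}

// workflow is deterministic: given the same history it always issues the
// same activities in the same order. calls counts real executions.
func workflow(c *Context, calls *int) string {
	ticket := c.Activity(func() string { *calls++; return "ticket-1" })
	return c.Activity(func() string { *calls++; return "notified:" + ticket })
}

func main() {
	h := &History{}
	calls := 0
	fmt.Println(workflow(&Context{history: h}, &calls), "activity calls:", calls)

	// A restarted worker builds a fresh context over the same history.
	fmt.Println(workflow(&Context{history: h}, &calls), "activity calls:", calls)
}
```

The second pass produces the same result without running either activity again, which is the property that makes a restart invisible to the workflow.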

Reliability Around the Engine

Durability handles the waiting. It does not, on its own, handle the failure of the work itself. A robust automation platform needs guardrails around the engine, and PMAP’s runbook domain wraps several of them around every execution regardless of which engine mode is active.

The first is per-action retry. Each action carries an optional retry policy with a configurable maximum number of attempts and exponential backoff. When a transient failure hits an action, such as a momentarily unreachable ITSM endpoint, the action can retry with growing delays rather than failing the entire run on the first hiccup. This keeps automations resilient to the ordinary flakiness of the external systems they talk to.
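A retry loop with exponential backoff can be sketched as follows. The RetryPolicy fields, the doubling schedule, and the injected sleep function are assumptions made for the example. The text above only establishes that a policy has a maximum number of attempts and exponential backoff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryPolicy is a hypothetical per-action retry configuration.
type RetryPolicy struct {
	MaxAttempts int
	BaseDelay   time.Duration
}

// backoff returns the delay before the given retry (1-based), doubling
// each time: BaseDelay, 2*BaseDelay, 4*BaseDelay, and so on.
func (p RetryPolicy) backoff(retry int) time.Duration {
	return p.BaseDelay << (retry - 1)
}

// runWithRetry calls action until it succeeds or MaxAttempts is reached.
// The sleep function is injected so callers can pass time.Sleep in
// production or a recorder when demonstrating the schedule.
func runWithRetry(p RetryPolicy, sleep func(time.Duration), action func() error) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = action(); err == nil || attempts >= p.MaxAttempts {
			return attempts, err
		}
		sleep(p.backoff(attempts))
	}
}

// delayRecorder captures requested delays instead of actually sleeping.
type delayRecorder struct{ Delays []time.Duration }

func (r *delayRecorder) Sleep(d time.Duration) { r.Delays = append(r.Delays, d) }

var errTransient = errors.New("transient failure")

func main() {
	rec := &delayRecorder{}
	calls := 0
	policy := RetryPolicy{MaxAttempts: 5, BaseDelay: time.Second}

	attempts, err := runWithRetry(policy, rec.Sleep, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail twice, then succeed
		}
		return nil
	})
	fmt.Println(attempts, err, rec.Delays)
}
```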

The second is the circuit breaker. PMAP counts consecutive failures per runbook, and after 10 consecutive failures the runbook is automatically deactivated and stamped with the time the breaker tripped. The reasoning is operational hygiene. A runbook that fails ten times in a row is almost certainly misconfigured or pointed at a broken dependency, and continuing to fire it just generates noise and side-effects. Auto-deactivation stops the bleeding, and an explicit reset re-enables the runbook once the underlying problem is fixed. Notably, skipped runs do not count toward the breaker, because a skip is a deliberate non-action, not a health signal.
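The breaker logic described above is small enough to sketch directly. The threshold of 10 and the rule that skipped runs are ignored come from the text. The Runbook struct, its field names, and the Outcome type are hypothetical, and resetting the counter on success follows from the word "consecutive".

```go
package main

import (
	"fmt"
	"time"
)

// Outcome is a hypothetical result classification for one run.
type Outcome int

const (
	Success Outcome = iota
	Failed
	Skipped
)

// breakerThreshold is the number of consecutive failures that trips the breaker.
const breakerThreshold = 10

// Runbook holds only the fields the breaker needs in this sketch.
type Runbook struct {
	Active              bool
	ConsecutiveFailures int
	TrippedAt           *time.Time
}

// Record updates the breaker state after a run finishes.
func (r *Runbook) Record(o Outcome) {
	switch o {
	case Skipped:
		// A skip is a deliberate non-action and is not a health signal.
	case Success:
		r.ConsecutiveFailures = 0
	case Failed:
		r.ConsecutiveFailures++
		if r.Active && r.ConsecutiveFailures >= breakerThreshold {
			now := time.Now()
			r.Active = false
			r.TrippedAt = &now
		}
	}
}

// Reset re-enables the runbook once the underlying problem is fixed.
func (r *Runbook) Reset() {
	r.Active = true
	r.ConsecutiveFailures = 0
	r.TrippedAt = nil
}

func main() {
	rb := &Runbook{Active: true}
	for i := 0; i < breakerThreshold; i++ {
		rb.Record(Failed)
	}
	fmt.Println("active:", rb.Active, "tripped:", rb.TrippedAt != nil)
	rb.Reset()
	fmt.Println("active after reset:", rb.Active)
}
```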

The third is the throttle and concurrency gate. A throttle window prevents a runbook from re-firing within a configured number of seconds, which protects against event storms triggering the same automation hundreds of times. A concurrency key caps how many instances of a runbook can run simultaneously, which protects shared downstream resources from being overwhelmed. Together they keep automation volume sane even when the triggering events are bursty.
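One way to combine the two checks is a small gate consulted before each run starts. In the sketch below the Gate type and its methods are hypothetical, timestamps are plain Unix seconds to match the "configured number of seconds" wording, and the assumption that runs sharing a concurrency key share one cap is an interpretation rather than documented behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// Gate is a hypothetical admission check applied before a run starts.
type Gate struct {
	mu              sync.Mutex
	throttleSeconds int64
	maxConcurrent   int
	lastFired       map[string]int64 // runbook ID -> Unix time of last start
	running         map[string]int   // concurrency key -> in-flight runs
}

func NewGate(throttleSeconds int64, maxConcurrent int) *Gate {
	return &Gate{
		throttleSeconds: throttleSeconds,
		maxConcurrent:   maxConcurrent,
		lastFired:       make(map[string]int64),
		running:         make(map[string]int),
	}
}

// TryStart reports whether a run may begin now. It refuses when the
// runbook fired inside the throttle window or the key is at its cap.
func (g *Gate) TryStart(runbookID, key string, nowUnix int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()

	if last, ok := g.lastFired[runbookID]; ok && nowUnix-last < g.throttleSeconds {
		return false // still inside the throttle window
	}
	if g.running[key] >= g.maxConcurrent {
		return false // concurrency cap reached for this key
	}
	g.lastFired[runbookID] = nowUnix
	g.running[key]++
	return true
}

// Done releases the concurrency slot held by a finished run.
func (g *Gate) Done(key string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running[key] > 0 {
		g.running[key]--
	}
}

func main() {
	gate := NewGate(60, 1)
	fmt.Println(gate.TryStart("rb-1", "jira", 1000)) // admitted
	fmt.Println(gate.TryStart("rb-1", "jira", 1010)) // throttled
	gate.Done("jira")
	fmt.Println(gate.TryStart("rb-1", "jira", 1060)) // window elapsed
}
```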

These guardrails are not durability features, but they are what make durable automation safe to run. Durability ensures a long-running workflow survives. Retry, the circuit breaker, and the throttle gate ensure that survival is worth having, because the work being preserved is itself protected from transient errors, runaway failure loops, and overload.

Graceful Degradation When the Engine Cannot Start

A subtle but important design choice sits at the boundary between the durable engine and the rest of PMAP. The durable backend depends on PostgreSQL, and it is constructed during server startup. A fair question, then: what happens if the durable engine cannot initialize, for example because the database is briefly unreachable at the moment the server boots?

The answer is graceful degradation. If the durable backend fails to construct, PMAP logs a warning and falls back to inline execution rather than crashing the server. The rest of the platform stays operational. Runbooks that do not need durability keep working exactly as before, and the only thing lost is the durable-wait capability until the engine can be brought up.
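The startup decision can be pictured as a small function that takes the requested mode and a constructor for the backend. This is a sketch under stated assumptions: the Backend type and the selectEngine function are hypothetical, the mode strings are assumed values, and the injected newBackend parameter merely stands in for the real NewBackend call.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Backend is a placeholder for the durable workflow backend.
type Backend struct{}

var errUnreachable = errors.New("database unreachable")

// selectEngine returns the mode that will actually be used and the backend
// to use with it. In inline mode the constructor is never called. If the
// constructor fails, the server logs a warning and continues inline.
func selectEngine(requested string, newBackend func() (*Backend, error)) (string, *Backend) {
	if requested == "inline" {
		return "inline", nil
	}
	backend, err := newBackend()
	if err != nil {
		log.Printf("warning: durable backend unavailable, falling back to inline: %v", err)
		return "inline", nil
	}
	return requested, backend
}

func main() {
	mode, _ := selectEngine("workflow", func() (*Backend, error) {
		return nil, errUnreachable // simulate a database that is down at boot
	})
	fmt.Println("effective mode:", mode)
}
```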

This is a deliberate reliability stance. The durable subsystem is an enhancement layered on top of a platform that must stay up, and a problem in the enhancement is never allowed to take down the platform. The engine also isolates its own database migrations under a separate tracking table, so its schema management never collides with PMAP’s core migrations and never races for a shared migration lock. The durable engine, in other words, is built to fail safe. It either improves the platform or quietly steps aside, and it never makes things worse.

There is one more piece of failure handling worth naming. The platform runs a stale-running watchdog that sweeps execution records periodically. If a workflow-driven execution is left stuck in a running state for an extended window, for example because a worker crashed before it could finalize the execution record, the watchdog marks that record as failed rather than leaving it hanging forever. Durability keeps live workflows alive across restarts, and the watchdog cleans up the rare execution that genuinely fell through, so the system does not accumulate ghost runs.
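A watchdog sweep of this kind reduces to a filter over execution records. In the sketch below the Execution struct, the status strings, and the one-hour staleAfter window are all assumptions made for illustration. The text above does not state how long the real window is.

```go
package main

import (
	"fmt"
	"time"
)

const (
	StatusRunning = "running"
	StatusSuccess = "success"
	StatusFailed  = "failed"
)

// staleAfter is a hypothetical threshold chosen for the example.
const staleAfter = time.Hour

// Execution is a minimal stand-in for an execution record.
type Execution struct {
	ID        string
	Status    string
	StartedAt time.Time
}

// sweepStale marks every execution that has been running for longer than
// maxAge as failed and returns the IDs it changed.
func sweepStale(execs []*Execution, now time.Time, maxAge time.Duration) []string {
	var swept []string
	for _, e := range execs {
		if e.Status == StatusRunning && now.Sub(e.StartedAt) > maxAge {
			e.Status = StatusFailed
			swept = append(swept, e.ID)
		}
	}
	return swept
}

// demoNow is a fixed clock value used only by the demonstration.
var demoNow = time.Date(2024, time.January, 1, 12, 0, 0, 0, time.UTC)

func main() {
	execs := []*Execution{
		{ID: "ghost", Status: StatusRunning, StartedAt: demoNow.Add(-2 * staleAfter)},
		{ID: "fresh", Status: StatusRunning, StartedAt: demoNow.Add(-time.Minute)},
		{ID: "done", Status: StatusSuccess, StartedAt: demoNow.Add(-3 * staleAfter)},
	}
	fmt.Println("marked failed:", sweepStale(execs, demoNow, staleAfter))
}
```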

How PMAP Wires Durable Runbooks

It helps to see how the pieces connect, because the architecture explains why the engine behaves the way it does.

The durable engine is a thin, isolated infrastructure layer. Its single responsibility is to provision a correctly migrated PostgreSQL-backed workflow backend and hand it to the runbook system. It owns no business logic, exposes no HTTP endpoints, and knows nothing about findings, assets, or scans. It is pure plumbing.

At startup, when the engine mode calls for it, PMAP constructs the durable backend, builds a worker on top of it, and registers exactly one workflow definition and one activity set against that worker. The single generic workflow is the durable wrapper that every runbook execution runs inside. The activity set is how individual runbook actions get executed durably and how the final execution record gets written. The worker is then started, a client is created for dispatching workflows and delivering signals, and the runbook service is told which engine mode to use and handed the client. From that point on, runbook executions that need durability flow through the workflow engine, and everything else continues to run inline.

A few details fall out of this wiring and matter in practice. Activity results are chained back into the workflow’s payload between steps, so state that one action produces is available to the actions that follow it. For example, an action that resolves an asset owner writes the resolved owner into the payload, and a later action can route a ticket to that owner. A finalization activity updates the execution record from running to its terminal status (success, failed, or partial) at the end of every run, so the run history stays accurate. And on graceful shutdown, the worker drains its in-flight tasks before the process exits, so a planned restart does not abandon work that was mid-flight.
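Payload chaining and finalization can be illustrated with a short loop. Everything named here is hypothetical: the Payload and Action types, the runChain function, and in particular the rule used to pick a terminal status. The sketch treats "partial" as meaning that some but not all actions failed, which is an assumption and not a documented definition.

```go
package main

import (
	"errors"
	"fmt"
)

// Payload is the state carried from one action to the next.
type Payload map[string]string

// Action reads the current payload and returns the keys it produced.
type Action func(Payload) (Payload, error)

var errAction = errors.New("action failed")

// runChain runs the actions in order, merging each result into the payload
// so later actions can see what earlier ones produced. It returns the final
// payload and an assumed terminal status.
func runChain(initial Payload, actions ...Action) (Payload, string) {
	payload := Payload{}
	for k, v := range initial {
		payload[k] = v
	}
	failures := 0
	for _, act := range actions {
		out, err := act(payload)
		if err != nil {
			failures++
			continue
		}
		for k, v := range out {
			payload[k] = v
		}
	}
	switch {
	case failures == 0:
		return payload, "success"
	case failures == len(actions):
		return payload, "failed"
	default:
		return payload, "partial"
	}
}

func main() {
	resolveOwner := func(p Payload) (Payload, error) {
		return Payload{"owner": "team-web"}, nil
	}
	routeTicket := func(p Payload) (Payload, error) {
		return Payload{"ticket_assignee": p["owner"]}, nil
	}
	payload, status := runChain(Payload{"asset": "web-01"}, resolveOwner, routeTicket)
	fmt.Println(payload["ticket_assignee"], status)
}
```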

The result is an automation platform where the common case stays simple and fast, and the long-running case becomes genuinely durable, with a clean fallback if the durable layer is unavailable. If you want to go from understanding the engine to actually building automations on it, the next step is designing the runbooks themselves, the triggers that start them and the actions they run. That is the subject of designing runbooks with triggers and actions. For a structured way to think about building automation into a security program more broadly, the OWASP DevSecOps Guideline is a useful companion.

Frequently Asked Questions

What makes a workflow engine durable?

A workflow engine is durable when its timers and signals are persisted and survive worker restarts. In PMAP’s workflow mode, durable sleep timers and await_signal channels are written to PostgreSQL tables rather than held in process memory. If the worker restarts, the engine replays the persisted event history to reconstruct each suspended workflow and continues from exactly where it paused. The state of a paused automation lives in the database, so a restart cannot lose it.

What is the difference between inline and durable execution?

Inline execution runs a runbook’s actions synchronously in-process and skips the durable-wait primitives. A sleep does not persist a durable timer, and await_signal cannot wait for an external signal across a restart. Durable execution, available in PMAP’s workflow mode, dispatches each run as a persistent workflow instance, so sleep and await_signal become durable timers and signals that survive restarts. Inline is the fast default for short automations. Durable is for automations that need to wait minutes, hours, or days.

How long can an automation wait for a signal?

The await_signal action waits for a named external signal with a configurable timeout, and the default timeout is 72 hours. That three-day window is designed for human-in-the-loop gates, where a person needs time to review and approve a sensitive action. If the signal arrives before the timeout, execution resumes immediately. If the timeout expires first, the wait ends gracefully rather than leaving the automation stuck forever.

What happens if the durable engine cannot start?

PMAP degrades gracefully. If the durable backend fails to initialize at startup, for example because the database is briefly unreachable, the platform logs a warning and falls back to inline execution rather than crashing. The rest of the platform stays operational, and runbooks that do not require durability keep working normally. The durable subsystem is an enhancement that is never allowed to take down the platform, so a problem in the engine quietly steps aside instead of causing an outage.

Does durable execution re-run actions that already completed after a restart?

No. The engine reconstructs a suspended workflow by replaying its append-only event history, and activities that already ran have their recorded results replayed from history rather than being executed again. The workflow rebuilds its state up to the pause point and then continues forward. This is why durable workflows require deterministic logic. The replay must follow the same path the original execution took.

When should I use the shadow engine mode?

Use shadow mode to validate the durable engine before you cut over to it. In shadow mode the proven inline path runs for real and finalizes the execution, while a durable workflow instance runs in parallel without committing side-effects. This lets you compare the durable engine’s behavior against the inline engine on real production traffic and confirm it produces the results you expect. Shadow mode is the low-risk way to build confidence before promoting a deployment to full durable execution.

Do I need the durable engine for every runbook?

No, and most runbooks do not need it. The majority of automations are short sequences of mutations and notifications that complete in under a second, and inline execution handles them well with less overhead. You only need the durable engine for runbooks that include sleep or await_signal and require those waits to survive restarts. A common approach is to run most automation inline and enable the durable engine specifically for the long-running, human-in-the-loop, or wait-for-scan playbooks that genuinely need it.

PMAP Security Team
