# Production Deployment Checklist
A pragmatic checklist for shipping FlowOrchestrator to production. Walk through every section before go-live; bookmark this page for the first incident review.
To change a flow once it is in production, see Versioning Flows.
## 1. Storage and Persistence
FlowOrchestrator writes to twelve tables. The auto-migrator (`FlowOrchestratorSqlMigrator`) creates them idempotently on startup.
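The migrator runs as part of normal host startup once SQL storage is configured. A minimal registration sketch, assuming the service-collection entry point is named `AddFlowOrchestrator` (an assumption; the `UseSqlServer(connectionString)` option appears later in this checklist):

```csharp
var builder = WebApplication.CreateBuilder(args);

// Hypothetical entry-point name. autoMigrate defaults to true, so pending
// DDL is applied idempotently on every startup.
builder.Services.AddFlowOrchestrator(options =>
{
    options.UseSqlServer(
        builder.Configuration.GetConnectionString("FlowOrchestrator")!);
});
```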
| Table | Holds |
|---|---|
| `FlowDefinitions` | One row per registered flow (id, name, version, manifest JSON). |
| `FlowRuns` | One row per `TriggerAsync` invocation (status, trigger data, timestamps). |
| `FlowSteps` | One row per step instance per run (status, inputs, output). |
| `FlowStepAttempts` | Per-attempt history for retried steps (attempt number, error, durations). |
| `FlowOutputs` | Step outputs keyed by (RunId, StepKey). |
| `FlowStepDispatches` | Idempotent dispatch ledger — supports the *Dispatch many, Execute once* invariant. |
| `FlowStepClaims` | Exclusive execution claim per step (the "Execute once" half). |
| `FlowRunControls` | Cancellation, idempotency key, run timeout. |
| `FlowIdempotencyKeys` | Idempotency dedupe at trigger time. |
| `FlowEvents` | State-transition audit log (when event persistence is enabled). |
| `FlowSignalWaiters` | Parked `WaitForSignal` steps. |
| `FlowScheduleStates` | Recurring-trigger sync state. |
> **Warning**
> `UseInMemory()` is not for production. All run data is stored in-process and lost on restart. The runtime is supported only for development and tests; the manifest's Getting Started page already calls this out, but it bears repeating here.
### SQL Server

- Use a case-insensitive collation (the default `SQL_Latin1_General_CP1_CI_AS` is fine). Step keys are matched as-is, and a case-sensitive collation will surprise developers.
- Indexes are created by the auto-migrator. If you sharpen them, do it under a separate review — the engine relies on them for the dispatch-ledger fast path.
- Enable point-in-time recovery (PITR) — every table is part of the operational record, and a 24 h gap is hard to reconstruct.
### PostgreSQL

- Tune `max_connections` against the Hangfire worker count (default 20) plus dashboard / API traffic.
- Leave autovacuum on. The dispatch ledger is hot under high throughput; bloat shows up as `EXPLAIN` plans falling off index scans.
- Use logical replication or `pg_basebackup` for backups — both work; just have a tested restore drill.
### Backup recipe

Full backup plus transaction-log / WAL replay covers every table above. Step inputs and outputs are persisted (in `FlowSteps` and `FlowOutputs`), so restoring is restoring state, not just configuration.
## 2. Multi-Instance Deployment
FlowOrchestrator's core invariant — *Dispatch many, Execute once* — is what makes horizontal scale safe.

- `IFlowRunStore.TryRecordDispatchAsync` is an idempotent INSERT into `FlowStepDispatches`. Multiple workers may try; only one row lands. Backed by `SqlFlowRunStore` (source).
- `TryClaimStepAsync` (the claim guard) ensures only one runtime instance executes a given step attempt; the others exit silently.
- Cron triggers under the Hangfire runtime use `IRecurringJobManager`, which fires each schedule on exactly one server. The ServiceBus runtime fires cron via self-perpetuating scheduled messages on a queue — Service Bus's exactly-once-per-tick delivery handles cross-replica coordination without leader election. No extra coordination is needed in either case.
> **Warning**
> InMemory runtime + multi-instance is not supported. The dispatcher is a `Channel<T>` inside one process — instance B has no way to observe a job that instance A enqueued. Always combine `UseInMemoryRuntime()` with a single instance. For multi-replica deployments use `UseHangfire()` (with shared Hangfire storage) or `UseAzureServiceBusRuntime()` (with a shared Service Bus namespace).
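For reference, a multi-replica setup under the Hangfire runtime might look like the sketch below. `UseHangfire()` and `UseSqlServer()` are the option names this page uses; the `AddFlowOrchestrator` entry point and the exact option shape are assumptions:

```csharp
var connectionString =
    builder.Configuration.GetConnectionString("FlowOrchestrator")!;

// Sketch only: every replica points at the same SQL run store and the same
// Hangfire storage, so the dispatch ledger and claim guard coordinate them.
builder.Services.AddFlowOrchestrator(options =>
{
    options.UseSqlServer(connectionString); // shared run store for every replica
    options.UseHangfire();                  // dispatch via shared Hangfire storage
});
```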
> **Important**
> **Disabled flows (v1.22+).** Toggling `IsEnabled = false` on a flow record (via dashboard or API) silently rejects ALL trigger paths at the engine layer — manual, cron, webhook, re-trigger. Cron jobs are additionally pulled from the scheduler. In-flight runs are not cancelled; use the dashboard's Cancel run action for live work. Watch `EventId 1010 TriggerRejectedDisabledFlow` to track how often disabled-flow triggers are still being attempted (a webhook producer or external cron not yet aware of the disable).
### Blue-green deployment

1. Pause every cron trigger (dashboard → trigger pause toggle).
2. Wait for `GET /flows/api/runs/active` to drain to `[]` (a polling sketch follows below).
3. Cut over traffic to the new version.
4. Resume cron triggers.
The dispatch ledger means an interrupted run will be re-driven by `FlowRunRecoveryHostedService` on the new instance — no manual replay step required.
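Step 2 is easy to automate in a deploy pipeline. A minimal drain poll, assuming only that the active-runs endpoint above returns a JSON array and that the dashboard host is reachable (the host name is illustrative):

```csharp
using var http = new HttpClient { BaseAddress = new Uri("https://app.example.com") };

// Poll the active-runs endpoint until it drains to an empty JSON array.
while (true)
{
    var body = await http.GetStringAsync("/flows/api/runs/active");
    if (body.Trim() == "[]") break;        // drained: safe to cut over
    await Task.Delay(TimeSpan.FromSeconds(5));
}
```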
## 3. Monitoring

### Logs
Every log line emitted by the engine includes the RunId and step key (when applicable). Forward the following correlation fields if your aggregator strips unknown ones:

- `RunId` (Guid)
- `FlowId` (Guid)
- `StepKey` (string)
- `Attempt` (int)
### OpenTelemetry traces and metrics

Enable via `options.Observability.EnableOpenTelemetry = true` and wire up via `AddFlowOrchestratorInstrumentation()` — see Observability.
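For context, the host-side wiring might look like the sketch below. `EnableOpenTelemetry` and `AddFlowOrchestratorInstrumentation()` are the names from this page; where the latter hangs off the OpenTelemetry builder (tracing vs. metrics) is an assumption, so check Observability for the authoritative wiring:

```csharp
builder.Services.AddFlowOrchestrator(options =>
{
    // Emits the flow.* spans and floworch.* metrics listed below.
    options.Observability.EnableOpenTelemetry = true;
});

// Assumption: the instrumentation extension plugs into the OTel builder.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing.AddFlowOrchestratorInstrumentation())
    .WithMetrics(metrics => metrics.AddFlowOrchestratorInstrumentation());
```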
| Span | Covers |
|---|---|
| `flow.trigger` | `TriggerAsync` from receipt to first step dispatch. |
| `flow.step` | One step end-to-end (expression resolution + handler). |
| `flow.step.poll` | One poll attempt of a `PollableStepHandler<T>`. |
| Metric | Type | Tags |
|---|---|---|
| `floworch.runs.started` | counter | `flow_id`, `trigger_type` |
| `floworch.runs.completed` | counter | `flow_id`, `status` |
| `floworch.steps.duration` | histogram (ms) | `flow_id`, `step_type` |
| `floworch.steps.poll_attempts` | histogram | `flow_id`, `step_key` |
### Suggested alerts

- **Stuck runs.** `count(active_runs WHERE status='Running' AND age > 1h) > 0` — usually a missing handler or an external dependency timeout.
- **Failed-run rate.** `rate(floworch.runs.completed{status="failed"}[5m]) > 1` — tune to your volume.
- **Cron lag.** `now() - last_recurring_fire > 2 × schedule_interval` — Hangfire server unhealthy or paused inadvertently.
- **Queue depth.** Hangfire's own `enqueued` counter; FlowOrchestrator inherits it.
### Health-check endpoint

Wire up the bundled health check so the load balancer can drop traffic when storage is unreachable:
```csharp
builder.Services
    .AddHealthChecks()
    .AddFlowOrchestratorHealthChecks(); // probes IFlowStore reachability

var app = builder.Build();
app.MapHealthChecks("/health");
```
The check resolves `IFlowStore` from DI on every probe, so SQL Server, PostgreSQL, and in-memory all work without re-registration. The probe budget defaults to 5 seconds and is configurable:

```csharp
.AddFlowOrchestratorHealthChecks(timeout: TimeSpan.FromSeconds(2));
```
Verify it:

```bash
curl -f https://app.example.com/health
# Healthy
```

```http
GET /health HTTP/1.1

HTTP/1.1 200 OK
Content-Type: text/plain

Healthy
```
The response degrades to `503 Service Unavailable` with body `Unhealthy` when the flow store throws or the probe budget elapses — point your load balancer at it.
## 4. Secrets and Credentials

- **Connection strings.** Never in source. Use environment variables, `appsettings.{Environment}.json` outside of `main`, or a managed secret store.
- **Webhook secrets.** Each webhook trigger carries a `webhookSecret` input. Rotate it like any shared secret: supply a new value in the manifest and pause-resume the trigger so the routing table refreshes.
- **Webhook hardening pipeline (v1.25).** Treat the dashboard webhook endpoint as a public attack surface. Recommended rollout (a configuration sketch follows this list):
  - Set `WebhookSecurityOptions.EnforcementMode = Audit` for one release. The pipeline fires every gate (HMAC signature, replay, rate-limit, IP allow/deny, body cap), logs `EventId 4001-4006` for failures, and writes the DLQ — but the endpoint still returns 202. Use the dashboard "Webhooks" tab to confirm legitimate traffic still validates.
  - Flip to `Enforce` once `webhook_rejected_total` is dominated by traffic you actually want to reject.
  - Set `webhookHmacKey` on every webhook trigger; `webhookHmacKeyPrevious` is your zero-downtime rotation slot. Successful matches against the previous key emit `EventId 4010` so you know when to revoke it.
  - Lock down source IPs. The matcher accepts CIDR (`10.0.0.0/8`), inclusive ranges (`10.0.0.10-10.0.0.42`), wildcards (`10.0.*.*`), single addresses, and curated presets — mix and match in `webhookIpAllowList` (array OR comma-delimited string). Bundled presets: `github`, `stripe`, `shopify`, `twilio`, `square`, `atlassian`/`bitbucket`, `slack`, `mailgun`, `zoom`, plus `local` for RFC 1918 + loopback. Combine multiple presets via `webhookIpAllowListPresets` (plural). Set `WebhookSecurityOptions.ForwardedHeaderDepth` to your actual reverse-proxy depth — never higher.
  - For multi-replica deployments, swap the in-memory replay/DLQ stores for SQL Server (`AddSqlServerWebhookHardening`) or PostgreSQL (`AddPostgreSqlWebhookHardening`). The in-memory stores are single-process and will let a replay through on the second replica. See Webhook hardening for the full per-publisher cookbook.
- **Dashboard Basic Auth.** The dashboard ships with optional Basic Auth — sufficient for an internal-only surface behind a VPN, insufficient for anything fronting the public internet. Put a real auth proxy (OIDC, Azure AD, Cloudflare Access) in front.
- **PII in step inputs/outputs.** Step inputs and outputs are persisted in `FlowSteps` and `FlowOutputs` and are visible in the dashboard's run detail view. Treat them as logs. If you need to ingest PII, redact before the step boundary and pass an opaque token instead.
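The audit-first rollout above, expressed as configuration. `WebhookSecurityOptions.EnforcementMode`, `ForwardedHeaderDepth`, and `AddSqlServerWebhookHardening` are the names this page uses; how they attach to the service collection (standard options pattern, enum type name) is an assumption:

```csharp
// Sketch, assuming WebhookSecurityOptions binds via the options pattern
// and EnforcementMode is an enum of the same name.
builder.Services.Configure<WebhookSecurityOptions>(o =>
{
    o.EnforcementMode = EnforcementMode.Audit; // flip to Enforce after the audit release
    o.ForwardedHeaderDepth = 1;                // exactly one reverse proxy in front
});

// Multi-replica: SQL-backed replay/DLQ stores instead of the in-memory defaults.
builder.Services.AddSqlServerWebhookHardening(connectionString);
```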
## 5. Capacity Planning

### Storage cost per run
A single run typically writes:

- 1 row in `FlowRuns`
- N rows in `FlowSteps` (one per step)
- N rows in `FlowOutputs` (one per step that produced output)
- M rows in `FlowStepAttempts` (one per attempt — usually 1, more with retries)
- K rows in `FlowEvents` (when event persistence is on; ~5 per step)
For a 5-step flow with retries off and event persistence on, that is on the order of 30-40 rows per run (1 run row + 5 step rows + up to 5 output rows + ~25 event rows). At one million runs per month, expect tens of GB per year before retention.
### Retention

Turn it on:
```csharp
options.Retention.Enabled = true;
options.Retention.DataTtl = TimeSpan.FromDays(30);
options.Retention.SweepInterval = TimeSpan.FromHours(1);
```
Pick `DataTtl` against your audit requirements, not your storage budget — storage is cheap, and a forensics request always wants a longer window than you set.
### Hangfire worker tuning

The default `WorkerCount = Environment.ProcessorCount * 5` is fine for most FlowOrchestrator workloads (steps are typically I/O-bound). Increase it only if metrics show queue depth growing while CPU stays low. For CPU-bound steps, lower it to avoid thrashing.
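`WorkerCount` is a standard Hangfire server option, so if FlowOrchestrator does not expose its own knob, you can tune it at the Hangfire layer (the value shown is illustrative):

```csharp
// Standard Hangfire.AspNetCore API; lower WorkerCount for CPU-bound steps.
builder.Services.AddHangfireServer(options =>
{
    options.WorkerCount = Environment.ProcessorCount * 2;
});
```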
## 6. Upgrade Path

### Within a major version (1.x → 1.y)

`FlowOrchestratorSqlMigrator` runs on every startup and applies pending DDL idempotently. Releases note any table or column additions; rolling deploys are safe.
### Across major versions (1.x → 2.x)

A major bump may rename or split tables. The release notes include the migration path; run it in a non-production environment first and verify with a smoke run (see below).
### Disabling the auto-migrator

If your operations team owns DDL changes:

```csharp
options.UseSqlServer(connectionString, autoMigrate: false);
```
…then run the migration script (published in each release) under your own deployment pipeline.
### Smoke-run verification

After any deploy:

1. Trigger a manual run of a no-op flow (the simplest one in your manifest).
2. Assert the run reaches `Status = Succeeded` within the SLA you expect.
3. Hit `/health` and check the response is `Healthy`.
If any of those fail, roll back before letting business traffic resume.
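Steps 2 and 3 are easy to script. A sketch, assuming the health endpoint from section 3; the run-status endpoint and its JSON shape are hypothetical, so adapt both to your actual API:

```csharp
using var http = new HttpClient { BaseAddress = new Uri("https://app.example.com") };

// Step 3: the health endpoint from section 3 must answer "Healthy".
var health = await http.GetStringAsync("/health");
if (health != "Healthy") throw new Exception($"Health probe failed: {health}");

// Step 2: poll until the smoke run succeeds or the SLA elapses.
// NOTE: /flows/api/runs/{id} and the status casing are hypothetical.
var runId = "<run id from the manual trigger in step 1>";
var deadline = DateTime.UtcNow.AddMinutes(2);
while (DateTime.UtcNow < deadline)
{
    var run = await http.GetStringAsync($"/flows/api/runs/{runId}");
    if (run.Contains("\"status\":\"Succeeded\"")) return; // smoke run passed
    await Task.Delay(TimeSpan.FromSeconds(5));
}
throw new Exception("Smoke run did not reach Succeeded within the SLA");
```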
> **Tip**
> Running FlowOrchestrator in production? Tell us about your deployment in the GitHub Discussions — patterns and pain points feed directly into the roadmap.