Observability
FlowOrchestrator exposes run events, OpenTelemetry traces/metrics, and a retention system to give you visibility into what your flows are doing in production.
Run Events
When event persistence is enabled, FlowOrchestrator writes structured FlowEvent records for every state transition: run started, step queued, step started, step completed/failed, run completed. Built-in step types — including WaitForSignal (step.pending while waiting, step.completed once the signal lands) and ForEach (per child-iteration events) — emit through the same channel as user-written handlers.
options.Observability.EnableEventPersistence = true;
Events are queryable via the REST API:
GET /flows/api/runs/{runId}/events
Returns a time-ordered list of events with timestamps and payload. These power the step timeline view in the dashboard.
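A minimal sketch of calling this endpoint from C#. The class and method names are hypothetical, and the sketch assumes an HttpClient whose BaseAddress points at the host serving the dashboard API. The response schema is not documented in this section, so the payload is returned as raw JSON rather than mapped to a type.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RunEventsClient
{
    // Builds the relative path shown above; the run ID is escaped defensively.
    public static string EventsPath(string runId) =>
        $"/flows/api/runs/{Uri.EscapeDataString(runId)}/events";

    // Fetches the raw JSON payload for one run. The HttpClient (and its
    // BaseAddress) is supplied by the caller; that wiring is an assumption.
    public static Task<string> GetEventsJsonAsync(HttpClient http, string runId) =>
        http.GetStringAsync(EventsPath(runId));
}
```

A caller would typically write `var json = await RunEventsClient.GetEventsJsonAsync(http, runId);` and parse the result with System.Text.Json once the event shape is known.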
Without event persistence
The dashboard can still show step status and I/O from FlowSteps and FlowOutputs, but the precise timing of each state transition is unavailable.
OpenTelemetry
When enabled, FlowOrchestrator registers an ActivitySource and an IMeterFactory-backed Meter that emit spans and metrics any OTLP-compatible backend can ingest (Jaeger, Grafana Tempo, Azure Monitor, Aspire Dashboard).
options.Observability.EnableOpenTelemetry = true;
Wire up the instrumentation in your OTel pipeline:
using FlowOrchestrator.Core.Observability;
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("MyApp"))
    .WithTracing(t => t
        .AddFlowOrchestratorInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddFlowOrchestratorInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
AddFlowOrchestratorInstrumentation() is an extension method on both TracerProviderBuilder and MeterProviderBuilder. It subscribes to the FlowOrchestrator activity source and meter, and now lives in FlowOrchestrator.Core.Observability (moved from FlowOrchestrator.Hangfire in v1.19). The Hangfire namespace still exposes [Obsolete] shims for one release so existing code keeps compiling.
What is emitted
Traces (every span is on the FlowOrchestrator activity source):
| Span | Kind | When | Notable tags |
|---|---|---|---|
| flow.trigger | Internal | One per TriggerAsync call | flow.id, run.id, trigger.key, trigger.type, duplicate (set when idempotency dedupe fires) |
| flow.step | Internal | One per RunStepAsync call | flow.id, run.id, step.key, step.type |
| flow.step.retry | Internal | One per RetryStepAsync call | flow.id, run.id, step.key |
| flow.step.when | Internal | When a step's When clause is evaluated | flow.id, run.id, step.key, flow.when.expression, flow.when.resolved, flow.when.result |
| flow.step.poll | Internal | One per polling iteration in PollableStepHandler | flow.id, run.id, step.key, flow.poll.attempt, flow.poll.condition_met |
| flow.runtime.execute | Consumer | Wraps each Hangfire job. Restores the parent traceparent captured at enqueue, so step spans become children of the original caller. | messaging.system=hangfire, messaging.message.id |
| flow.webhook.receive | Server | One per inbound webhook hit, parented onto the caller's traceparent header | flow.webhook.slug_or_id |
| flow.signal.deliver | Server | One per inbound signal HTTP call, parented onto the caller's traceparent | flow.run_id, flow.signal_name |
Failures set Status = Error on the activity and add an exception event with the standard
OTel tags (exception.type, exception.message, exception.stacktrace). APMs treat the span as
red without any extra configuration.
Metrics (every instrument is on the FlowOrchestrator meter):
| Metric | Type | Unit | Tags |
|---|---|---|---|
| flow_runs_started | counter | runs | flow_id, trigger_key |
| flow_runs_completed | counter | runs | status |
| flow_steps_completed | counter | steps | flow_id, status |
| flow_step_duration_ms | histogram | ms | flow_id, step_key, status |
| flow_step_queue_delay_ms | histogram | ms | flow_id, step_key |
| flow_step_retries | counter | retries | flow_id, step_key |
| flow_step_skipped | counter | steps | flow_id, step_key, reason (when_false / prerequisites_unmet) |
| flow_step_poll_attempts | counter | attempts | flow_id, step_key |
| flow_signal_wait_ms | histogram | ms | flow_id, step_key, signal_name; recorded by FlowSignalDispatcher on delivery |
| flow_cron_lag_ms | histogram | ms | flow_id, trigger_key, runtime (hangfire / in_memory / service_bus); gap between scheduled fire and actual dispatch |
| webhook_received_total | counter | webhooks | flow, result (accepted / rejected / off), scheme; emitted on every webhook receive (v1.25) |
| webhook_rejected_total | counter | webhooks | flow, reason (signature_invalid / replay / rate_limited / payload_too_large / ip_denied / secret_invalid) (v1.25) |
| webhook_body_bytes | histogram | bytes | flow; body size of every webhook receive (v1.25) |
| webhook_processing_ms | histogram | ms | flow, result; wall-clock pipeline processing time (v1.25) |
Distributed tracing across the runtime
A single traceId connects everything from the inbound HTTP request to the last step's exit:
caller traceparent
└── flow.webhook.receive (Dashboard, Server)
    └── flow.trigger (engine)
        └── flow.runtime.execute (Hangfire, Consumer)
            └── flow.step (engine, per dispatched step)
                └── flow.step.poll (handler, per poll attempt)
The flow.runtime.execute wrapper is opened by TraceContextHangfireFilter (registered automatically when options.UseHangfire() is called). It captures Activity.Current.Context on enqueue, persists the W3C identifiers as Hangfire job parameters, and restores them as the parent context when the worker picks the job up. Inbound webhook and signal endpoints in the Dashboard read the traceparent / tracestate headers via InboundTraceContext and start their entry-point activity as a child of the parsed context.
Without this plumbing, a Hangfire-backed run would appear as a forest of disconnected root spans — one per step. With it, an APM shows a single connected tree from the original caller down to every step.
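The capture-and-restore idea can be illustrated with the System.Diagnostics APIs alone. The sketch below is not the TraceContextHangfireFilter source: the class name and the "Sketch.TraceContext" source name are hypothetical, and a plain string stands in for the Hangfire job parameter. It only shows how a W3C traceparent captured on one side can become the parent of a Consumer span on the other.

```csharp
using System;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Hypothetical source name for this sketch, not FlowOrchestrator's own.
    private static readonly ActivitySource Source = new("Sketch.TraceContext");

    // Enqueue side: with the W3C id format, Activity.Id is the traceparent value.
    public static string? Capture() => Activity.Current?.Id;

    // Worker side: parse the stored value and start a Consumer span under it.
    // If nothing usable was stored, fall back to a new root span.
    public static Activity? Restore(string? traceParent) =>
        ActivityContext.TryParse(traceParent, null, out var parent)
            ? Source.StartActivity("flow.runtime.execute", ActivityKind.Consumer, parent)
            : Source.StartActivity("flow.runtime.execute", ActivityKind.Consumer);

    // Returns true when the restored span shares the caller's trace ID.
    public static bool Demo()
    {
        Activity.DefaultIdFormat = ActivityIdFormat.W3C;

        // Without a listener, StartActivity returns null and nothing is traced.
        using var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);

        string? stored;
        ActivityTraceId callerTraceId;
        using (var caller = Source.StartActivity("flow.trigger"))
        {
            stored = Capture();              // what would be persisted with the job
            callerTraceId = caller!.TraceId;
        }

        // The caller span has ended, as it would have by the time a worker runs.
        using var worker = Restore(stored);
        return worker is not null && worker.TraceId == callerTraceId;
    }
}
```

Because the worker span is parented explicitly rather than through Activity.Current, the link survives a thread or process hop, which is the property the filter relies on.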
Sampling
OTel sampling for traces should be configured at the SDK, not at FlowOrchestrator. For low-volume systems start with AlwaysOnSampler; for high-volume production prefer parent-based sampling so trace continuity is preserved across the runtime boundary:
.WithTracing(t => t
    .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05))) // 5% root sampling
    .AddFlowOrchestratorInstrumentation()
    .AddOtlpExporter());
Metrics are aggregated, not sampled — leave the default cardinality limits in place and only override if you see exporter overflow warnings.
Logger scopes and EventIds
The engine wraps every public entry point (TriggerAsync, RunStepAsync, RetryStepAsync) in _logger.BeginScope(...) carrying RunId, FlowId, StepKey, and, where applicable, Attempt. Every nested log line, including logs from your own step handlers, carries those properties automatically. Logging providers that honour scopes (Serilog, NLog, OpenTelemetry Logs, Application Insights, Datadog, …) surface them as searchable fields. Engine hot-path log calls go through source-generated [LoggerMessage] partial methods (EngineLog.cs) for zero-allocation, AOT-friendly emission. See Logging integrations below for concrete provider wiring.
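To make the scope mechanism concrete, here is a small sketch using only Microsoft.Extensions.Logging. The ScopeSketch class and RunStep method are hypothetical, and the dictionary passed to BeginScope is an assumption about shape; the engine's actual scope state object is not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

public static class ScopeSketch
{
    // Opens a scope carrying correlation properties, then runs the handler
    // inside it. The property names match those listed above.
    public static void RunStep(ILogger logger, Guid runId, Guid flowId, string stepKey, Action handler)
    {
        var state = new Dictionary<string, object>
        {
            ["RunId"] = runId,
            ["FlowId"] = flowId,
            ["StepKey"] = stepKey,
        };

        using (logger.BeginScope(state))
        {
            // Anything logged here, or by code the handler calls through the
            // same logging pipeline, inherits the scope's properties.
            logger.LogInformation("Step started");
            handler();
        }
    }
}
```

Whether the properties actually appear in the output depends on the provider having scopes enabled, as described in the integrations section.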
Stable EventId constants are defined in FlowOrchestrator.Core.Observability.LogEvents so production users can filter or alert on a specific log event without parsing the message template:
LogEvents.RunStarted = 1000
LogEvents.RunCompleted = 1001
LogEvents.TriggerRejectedDisabledFlow = 1010 // v1.22+: TriggerAsync skipped because IFlowStore.IsEnabled = false
LogEvents.StepStarted = 2000
LogEvents.StepCompleted = 2001
LogEvents.StepFailed = 2002
LogEvents.StepSkipped = 2003
LogEvents.WhenEvaluationFailed = 2005
LogEvents.DispatchEnqueued = 3000
// Webhook hardening pipeline (4000–4099 reserved, v1.25+)
WebhookLog.WebhookReceived = 4000 // every receive — info
WebhookLog.SignatureRejected = 4001 // HMAC mismatch / malformed header — warning
WebhookLog.ReplayRejected = 4002 // skew or nonce reused — warning
WebhookLog.RateLimited = 4003 // token bucket empty — warning
WebhookLog.PayloadTooLarge = 4004 // body cap exceeded — warning
WebhookLog.IpDenied = 4005 // CIDR allow/deny miss — warning
WebhookLog.SecretInvalid = 4006 // bearer-secret mismatch — warning
WebhookLog.DeliveryAccepted = 4007 // run dispatched — info
WebhookLog.RejectionStoreFailed = 4008 // DLQ write failed — warning
WebhookLog.ReplayStoreFailed = 4009 // replay store error — warning
WebhookLog.RotationUsedPreviousKey = 4010 // accepted with rotated-out key — info
// …see the source for the full list.
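As an illustration of alerting on IDs rather than message text, the sketch below classifies an EventId by number. The EventIdAlerts class and the choice of which IDs are worth alerting on are assumptions for this example; the numeric values are copied from the list above, and application code would normally reference the LogEvents and WebhookLog constants instead.

```csharp
using System.Collections.Generic;

public static class EventIdAlerts
{
    // Hypothetical alert policy: two step failures plus the two webhook
    // store failures. Adjust to whatever your operations team cares about.
    private static readonly HashSet<int> AlertIds = new()
    {
        2002, // StepFailed
        2005, // WhenEvaluationFailed
        4008, // RejectionStoreFailed
        4009, // ReplayStoreFailed
    };

    // True for any ID in the reserved webhook range 4000-4099.
    public static bool IsWebhookEvent(int eventId) => eventId is >= 4000 and <= 4099;

    // True when the ID is part of the alert policy above.
    public static bool ShouldAlert(int eventId) => AlertIds.Contains(eventId);
}
```

A check like `ShouldAlert(eventId.Id)` could sit in a custom logger provider or a sink-side filter; where it runs depends on the logging stack in use.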
The flow.trigger activity also receives a flow.disabled = true tag when the engine silently skips a trigger for a disabled flow (v1.22+). Filter your APM by this tag to count how often triggers are still being attempted against disabled flows, which is a useful signal that a webhook producer or cron job hasn't picked up the disable yet.
Logging integrations
The library is logging-framework-agnostic — it only uses Microsoft.Extensions.Logging.ILogger<T>. Plug in any provider that honours ILogger.BeginScope and you get the engine's correlation properties (RunId, FlowId, StepKey, Attempt) on every nested log line, including logs emitted by your own step handlers.
Microsoft.Extensions.Logging (Console)
Built into the framework. Scopes are off by default — opt in via the formatter options:
builder.Logging.AddJsonConsole(o => o.IncludeScopes = true);
// or
builder.Logging.AddSimpleConsole(o => o.IncludeScopes = true);
Output:
{ "Timestamp":"…", "EventId":2002, "LogLevel":"Error",
"Category":"FlowOrchestrator.Core.Execution.FlowOrchestratorEngine",
"Message":"Step execution failed for submit_to_wms",
"Scopes":[{"RunId":"3fa85f64-…","FlowId":"a1b2c3d4-…","StepKey":"submit_to_wms"}] }
Serilog
dotnet add package Serilog.AspNetCore
dotnet add package Serilog.Sinks.Seq # or Console / File / Datadog / Splunk / …
builder.Host.UseSerilog((ctx, cfg) => cfg
    .ReadFrom.Configuration(ctx.Configuration)
    .Enrich.FromLogContext() // <-- required so scope props become structured fields
    .Enrich.WithProperty("Service", "OrderHub")
    .WriteTo.Seq("http://seq:5341"));
Query in Seq / any structured sink:
EventId.Id = 2002 and StepKey = 'submit_to_wms'
NLog
dotnet add package NLog.Web.AspNetCore
builder.Host.UseNLog();
nlog.config — use ${scopeproperty} to render scope keys:
<targets>
<target xsi:type="Console" name="console"
layout="${longdate} ${level} ${event-properties:item=EventId_Id} run=${scopeproperty:item=RunId} step=${scopeproperty:item=StepKey} - ${message} ${exception:format=tostring}" />
</targets>
OpenTelemetry Logs (auto trace correlation)
When you already use OTel for traces, exporting logs through the same pipeline gives automatic TraceId / SpanId correlation in every log line — clicking a log in your APM jumps straight to the trace:
using FlowOrchestrator.Core.Observability;
builder.Logging.AddOpenTelemetry(o =>
{
    o.IncludeFormattedMessage = true;
    o.IncludeScopes = true; // <-- emits RunId/FlowId/StepKey as log attributes
    o.ParseStateValues = true;
    o.AddOtlpExporter();
});
builder.Services.AddOpenTelemetry()
    .WithTracing(t => t.AddFlowOrchestratorInstrumentation().AddOtlpExporter())
    .WithMetrics(m => m.AddFlowOrchestratorInstrumentation().AddOtlpExporter());
Application Insights / Datadog / Splunk / Seq / Loki
All honour BeginScope — scope properties surface as customDimensions (App Insights), structured tags (Datadog/Splunk/Loki), or top-level fields (Seq) automatically. No FlowOrchestrator-specific configuration required beyond enabling scopes on the provider.
Tip
If your scope properties are missing in the sink output, the provider almost certainly has scopes disabled by default. Search its docs for IncludeScopes (Microsoft.Extensions.Logging.Console, OpenTelemetry), Enrich.FromLogContext (Serilog), or ${scopeproperty} (NLog).
Health checks
Wire the bundled storage probe so a load balancer can drop traffic when the flow store is unreachable:
builder.Services.AddHealthChecks().AddFlowOrchestratorHealthChecks();
app.MapHealthChecks("/health");
The check resolves whichever IFlowStore you registered (SQL Server, PostgreSQL, in-memory). Probe budget defaults to 5 s and is configurable. See Production Checklist for the full operational story.
Running with .NET Aspire
When running under Aspire, OTEL_EXPORTER_OTLP_ENDPOINT is injected automatically. Spans and metrics appear in the Aspire Dashboard with no extra configuration beyond AddFlowOrchestratorInstrumentation().
Important
For the engine's structured logs to show up in Aspire's Logs tab (with RunId / StepKey / EventId as searchable attributes), wire up builder.Logging.AddOpenTelemetry(...) with IncludeScopes = true and AddOtlpExporter() — see the OpenTelemetry Logs example above. OTel's tracing and metrics pipelines do not automatically wire the logging pipeline; without this snippet the Logs tab will be empty even though traces and metrics flow through.
Run Control State
GET /flows/api/runs/{runId}/control
Returns the current control record for a run:
{
  "runId": "...",
  "cancellationRequested": false,
  "timedOutAt": null,
  "idempotencyKey": "batch-2026-04-20-001",
  "timeoutAt": "2026-04-20T12:10:00Z"
}
This is useful for diagnosing why a run stopped or was cancelled.
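A hedged sketch of turning that record into a short diagnosis. The RunControl record mirrors the JSON fields shown above; the RunControlDiagnosis helper and its wording are hypothetical, and the reading of timeoutAt as a deadline and timedOutAt as the moment the timeout fired is an inference from the field names.

```csharp
using System;
using System.Text.Json;

// Property names map to the camelCase JSON fields via JsonSerializerDefaults.Web.
public sealed record RunControl(
    string RunId,
    bool CancellationRequested,
    DateTimeOffset? TimedOutAt,
    string? IdempotencyKey,
    DateTimeOffset? TimeoutAt);

public static class RunControlDiagnosis
{
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    public static RunControl Parse(string json) =>
        JsonSerializer.Deserialize<RunControl>(json, Options)
        ?? throw new JsonException("Empty control record.");

    // Checks the stop conditions in order of certainty.
    public static string Describe(RunControl control, DateTimeOffset now)
    {
        if (control.TimedOutAt is not null) return "timed out";
        if (control.CancellationRequested) return "cancellation requested";
        if (control.TimeoutAt is { } deadline && deadline <= now)
            return "past deadline, not yet marked timed out";
        return "no stop condition recorded";
    }
}
```

Feeding the sample response above into Parse and then Describe with a time before 12:10 UTC on 2026-04-20 falls through to the last branch, since neither flag is set and the deadline has not passed.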
Active Runs
GET /flows/api/runs/active
Returns all runs currently in Running status. Use this to build operational monitors or alert on stuck runs.
Dashboard Statistics
GET /flows/api/runs/stats
{
  "totalFlows": 6,
  "activeRuns": 2,
  "succeededToday": 47,
  "failedToday": 1,
  "cancelledToday": 0
}
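One way to use these numbers is a simple failure-rate check for an external monitor. The RunStats record mirrors the fields in the response above; the RunStatsMath helper and the definition of failure rate (failed runs over all runs that finished today) are assumptions for this sketch.

```csharp
using System;
using System.Text.Json;

public sealed record RunStats(
    int TotalFlows,
    int ActiveRuns,
    int SucceededToday,
    int FailedToday,
    int CancelledToday);

public static class RunStatsMath
{
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    public static RunStats Parse(string json) =>
        JsonSerializer.Deserialize<RunStats>(json, Options)
        ?? throw new JsonException("Empty stats payload.");

    // Fraction of today's finished runs that failed; 0 when nothing has finished.
    public static double FailureRate(RunStats stats)
    {
        var finished = stats.SucceededToday + stats.FailedToday + stats.CancelledToday;
        return finished == 0 ? 0.0 : (double)stats.FailedToday / finished;
    }
}
```

For the sample payload this works out to 1 failed run out of 48 finished, roughly 2%.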
Data Retention
FlowOrchestrator can automatically delete old run data to prevent unbounded database growth.
options.Retention.Enabled = true;
options.Retention.DataTtl = TimeSpan.FromDays(30); // delete runs older than 30 days
options.Retention.SweepInterval = TimeSpan.FromHours(1); // run the sweep every hour
When enabled, FlowRetentionHostedService runs on the configured interval and calls IFlowRetentionStore.DeleteOldRunsAsync(cutoff). The SQL Server and PostgreSQL backends cascade-delete all related records (steps, outputs, events, control) when a run is deleted.
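The sweep described above can be pictured as a timer loop that computes a cutoff and hands it to the store. The sketch below is one plausible shape, not the FlowRetentionHostedService source: the RetentionSweepSketch class is hypothetical, and the deleteOldRuns delegate stands in for IFlowRetentionStore.DeleteOldRunsAsync, whose exact signature is not shown in this document.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetentionSweepSketch
{
    // Runs that finished before (now - dataTtl) are eligible for deletion.
    public static DateTimeOffset Cutoff(DateTimeOffset now, TimeSpan dataTtl) => now - dataTtl;

    // Waits one interval, sweeps, and repeats until the token is cancelled.
    // Cancellation surfaces as an OperationCanceledException from the timer.
    public static async Task RunAsync(
        TimeSpan sweepInterval,
        TimeSpan dataTtl,
        Func<DateTimeOffset, CancellationToken, Task> deleteOldRuns,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(sweepInterval);
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            await deleteOldRuns(Cutoff(DateTimeOffset.UtcNow, dataTtl), stoppingToken);
        }
    }
}
```

With DataTtl set to 30 days, a sweep that runs on 20 April uses a cutoff of 21 March.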
Tip
SweepInterval defaults to 1 hour and DataTtl defaults to 30 days. Retention is disabled by default — opt in explicitly.
| Option | Default | Description |
|---|---|---|
| Retention.Enabled | false | Enable the background sweep |
| Retention.DataTtl | 30 days | Runs older than this threshold are deleted |
| Retention.SweepInterval | 1 hour | How often the sweep runs |