Observability

FlowOrchestrator exposes run events, OpenTelemetry traces/metrics, and a retention system to give you visibility into what your flows are doing in production.

Run Events

When event persistence is enabled, FlowOrchestrator writes structured FlowEvent records for every state transition: run started, step queued, step started, step completed/failed, run completed. Built-in step types — including WaitForSignal (step.pending while waiting, step.completed once the signal lands) and ForEach (per child-iteration events) — emit through the same channel as user-written handlers.

options.Observability.EnableEventPersistence = true;

Events are queryable via the REST API:

GET /flows/api/runs/{runId}/events

Returns a time-ordered list of events, each with a timestamp and payload. These events power the step timeline view in the dashboard.
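A response body might look like the following. The field names here are illustrative, not a documented contract — check your deployment's actual payload:

```json
[
  { "eventType": "run.started",    "timestamp": "2026-04-20T12:00:00Z" },
  { "eventType": "step.queued",    "stepKey": "submit_to_wms", "timestamp": "2026-04-20T12:00:00Z" },
  { "eventType": "step.started",   "stepKey": "submit_to_wms", "timestamp": "2026-04-20T12:00:01Z" },
  { "eventType": "step.completed", "stepKey": "submit_to_wms", "timestamp": "2026-04-20T12:00:03Z" }
]
```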

Without event persistence

The dashboard can still show step status and I/O from FlowSteps and FlowOutputs, but the precise timing of each state transition is unavailable.


OpenTelemetry

When enabled, FlowOrchestrator registers an ActivitySource and an IMeterFactory-backed Meter that emit spans and metrics consumable by any OTLP-compatible backend (Jaeger, Grafana Tempo, Azure Monitor, Aspire Dashboard).

options.Observability.EnableOpenTelemetry = true;

Wire up the instrumentation in your OTel pipeline:

using FlowOrchestrator.Core.Observability;

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("MyApp"))
    .WithTracing(t => t
        .AddFlowOrchestratorInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddFlowOrchestratorInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());

AddFlowOrchestratorInstrumentation() is an extension method on both TracerProviderBuilder and MeterProviderBuilder. It subscribes to the FlowOrchestrator activity source and meter, and now lives in FlowOrchestrator.Core.Observability (moved from FlowOrchestrator.Hangfire in v1.19). The Hangfire namespace still exposes [Obsolete] shims for one release so existing code keeps compiling.

What is emitted

Traces (every span is on the FlowOrchestrator activity source):

| Span | Kind | When | Notable tags |
| --- | --- | --- | --- |
| flow.trigger | Internal | One per TriggerAsync call | flow.id, run.id, trigger.key, trigger.type, duplicate (set when idempotency dedupe fires) |
| flow.step | Internal | One per RunStepAsync call | flow.id, run.id, step.key, step.type |
| flow.step.retry | Internal | One per RetryStepAsync call | flow.id, run.id, step.key |
| flow.step.when | Internal | When a step's When clause is evaluated | flow.id, run.id, step.key, flow.when.expression, flow.when.resolved, flow.when.result |
| flow.step.poll | Internal | One per polling iteration in PollableStepHandler | flow.id, run.id, step.key, flow.poll.attempt, flow.poll.condition_met |
| flow.runtime.execute | Consumer | Wraps each Hangfire job; restores the parent traceparent captured at enqueue, so step spans become children of the original caller | messaging.system=hangfire, messaging.message.id |
| flow.webhook.receive | Server | One per inbound webhook hit, parented onto the caller's traceparent header | flow.webhook.slug_or_id |
| flow.signal.deliver | Server | One per inbound signal HTTP call, parented onto the caller's traceparent | flow.run_id, flow.signal_name |

Failures set Status = Error on the activity and add an exception event with the standard OTel tags (exception.type, exception.message, exception.stacktrace). APMs treat the span as red without any extra configuration.

Metrics (every instrument is on the FlowOrchestrator meter):

| Metric | Type | Unit | Tags |
| --- | --- | --- | --- |
| flow_runs_started | counter | runs | flow_id, trigger_key |
| flow_runs_completed | counter | runs | status |
| flow_steps_completed | counter | steps | flow_id, status |
| flow_step_duration_ms | histogram | ms | flow_id, step_key, status |
| flow_step_queue_delay_ms | histogram | ms | flow_id, step_key |
| flow_step_retries | counter | retries | flow_id, step_key |
| flow_step_skipped | counter | steps | flow_id, step_key, reason (when_false / prerequisites_unmet) |
| flow_step_poll_attempts | counter | attempts | flow_id, step_key |
| flow_signal_wait_ms | histogram | ms | flow_id, step_key, signal_name — recorded by FlowSignalDispatcher on delivery |
| flow_cron_lag_ms | histogram | ms | flow_id, trigger_key, runtime (hangfire / in_memory / service_bus) — gap between scheduled fire and actual dispatch |
| webhook_received_total | counter | webhooks | flow, result (accepted / rejected / off), scheme — emitted on every webhook receive (v1.25) |
| webhook_rejected_total | counter | webhooks | flow, reason (signature_invalid / replay / rate_limited / payload_too_large / ip_denied / secret_invalid) (v1.25) |
| webhook_body_bytes | histogram | bytes | flow — body size of every webhook receive (v1.25) |
| webhook_processing_ms | histogram | ms | flow, result — wall-clock pipeline processing time (v1.25) |
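To sanity-check these instruments locally without standing up a collector, you can attach a MeterListener from System.Diagnostics.Metrics and print measurements to the console. The meter name prefix below is an assumption, not a documented contract — verify it against the library before relying on it:

```csharp
using System.Diagnostics.Metrics;

// Debug sketch: print every measurement from meters whose name starts
// with "FlowOrchestrator" (the prefix is assumed, not documented).
var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name.StartsWith("FlowOrchestrator"))
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name} = {value} {instrument.Unit}"));
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name} = {value} {instrument.Unit}"));
listener.Start();
```

This is only a local debugging aid; in production the OTel metrics pipeline shown above is the supported path.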

Distributed tracing across the runtime

A single traceId connects everything from the inbound HTTP request to the last step's exit:

caller traceparent
   └── flow.webhook.receive          (Dashboard, Server)
         └── flow.trigger             (engine)
               └── flow.runtime.execute  (Hangfire, Consumer)
                     └── flow.step       (engine, per dispatched step)
                           └── flow.step.poll  (handler, per poll attempt)

The flow.runtime.execute wrapper is opened by TraceContextHangfireFilter (registered automatically when options.UseHangfire() is set). It captures Activity.Current.Context on enqueue, persists the W3C identifiers as Hangfire job parameters, and restores them as the parent context when the worker picks the job up. Inbound webhook and signal endpoints in the Dashboard read the traceparent / tracestate headers via InboundTraceContext and start their entry-point activity as a child of the parsed context.

Without this plumbing, a Hangfire-backed run would appear as a forest of disconnected root spans — one per step. With it, an APM shows a single connected tree from the original caller down to every step.
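The capture/restore pattern itself is plain System.Diagnostics: serialize the current context to W3C traceparent/tracestate strings at enqueue time, then parse them back into an ActivityContext on the worker. A minimal sketch of the idea — not TraceContextHangfireFilter's actual code:

```csharp
using System.Diagnostics;

// Illustrative only; the source name "Example.Runtime" is made up.
static class TraceContextRelay
{
    static readonly ActivitySource Source = new("Example.Runtime");

    // At enqueue: capture the caller's context as W3C strings
    // (the real filter persists these as Hangfire job parameters).
    public static (string? TraceParent, string? TraceState) Capture() =>
        (Activity.Current?.Id, Activity.Current?.TraceStateString);

    // On the worker: parse them back and start the job span as a child.
    public static Activity? StartJobSpan(string? traceParent, string? traceState)
    {
        ActivityContext parent = default;
        if (traceParent is not null)
            ActivityContext.TryParse(traceParent, traceState, out parent);
        return Source.StartActivity("flow.runtime.execute", ActivityKind.Consumer, parent);
    }
}
```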

Sampling

OTel sampling for traces should be configured at the SDK level, not inside FlowOrchestrator. For low-volume systems start with AlwaysOnSampler; for high-volume production prefer parent-based sampling so trace continuity is preserved across the runtime boundary:

.WithTracing(t => t
    .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05))) // 5% root sampling
    .AddFlowOrchestratorInstrumentation()
    .AddOtlpExporter());

Metrics are aggregated, not sampled — leave the default cardinality limits in place and only override if you see exporter overflow warnings.
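If you do see overflow, OpenTelemetry .NET lets you raise the limit per instrument with a view. The snippet below assumes a recent SDK version where MetricStreamConfiguration.CardinalityLimit is available:

```csharp
.WithMetrics(m => m
    .AddFlowOrchestratorInstrumentation()
    // Raise the per-instrument cardinality cap only for the noisy histogram.
    .AddView("flow_step_duration_ms",
        new MetricStreamConfiguration { CardinalityLimit = 5000 })
    .AddOtlpExporter());
```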

Logger scopes and EventIds

The engine wraps every public entry point (TriggerAsync, RunStepAsync, RetryStepAsync) in _logger.BeginScope(...) carrying RunId, FlowId, StepKey, and Attempt (when applicable). Every nested log line — including logs from your own step handlers — carries those properties automatically. Logging providers that honour scopes (Serilog, NLog, OpenTelemetry Logs, Application Insights, Datadog, …) surface them as searchable fields. Engine hot-path log calls go through source-generated [LoggerMessage] partial methods (EngineLog.cs) for zero-allocation, AOT-friendly emission. See Logging integrations below for concrete provider wireup.
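Conceptually, the engine's wrapping is equivalent to the standard BeginScope pattern — this is an illustrative sketch, not the engine's actual code:

```csharp
// Roughly what the engine does around RunStepAsync (illustrative only).
using (_logger.BeginScope(new Dictionary<string, object>
{
    ["RunId"] = runId,
    ["FlowId"] = flowId,
    ["StepKey"] = stepKey,
    ["Attempt"] = attempt,
}))
{
    // Every log emitted in here — including from the user's own step
    // handler — carries the scope keys as structured properties.
    await handler.ExecuteAsync(context, cancellationToken);
}
```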

Stable EventId constants are defined in FlowOrchestrator.Core.Observability.LogEvents so production users can filter or alert on a specific log event without parsing the message template:

LogEvents.RunStarted                 = 1000
LogEvents.RunCompleted               = 1001
LogEvents.TriggerRejectedDisabledFlow = 1010   // v1.22+: TriggerAsync skipped because IFlowStore.IsEnabled = false
LogEvents.StepStarted                = 2000
LogEvents.StepCompleted              = 2001
LogEvents.StepFailed                 = 2002
LogEvents.StepSkipped                = 2003
LogEvents.WhenEvaluationFailed       = 2005
LogEvents.DispatchEnqueued           = 3000

// Webhook hardening pipeline (4000–4099 reserved, v1.25+)
WebhookLog.WebhookReceived               = 4000   // every receive — info
WebhookLog.SignatureRejected             = 4001   // HMAC mismatch / malformed header — warning
WebhookLog.ReplayRejected                = 4002   // skew or nonce reused — warning
WebhookLog.RateLimited                   = 4003   // token bucket empty — warning
WebhookLog.PayloadTooLarge               = 4004   // body cap exceeded — warning
WebhookLog.IpDenied                      = 4005   // CIDR allow/deny miss — warning
WebhookLog.SecretInvalid                 = 4006   // bearer-secret mismatch — warning
WebhookLog.DeliveryAccepted              = 4007   // run dispatched — info
WebhookLog.RejectionStoreFailed          = 4008   // DLQ write failed — warning
WebhookLog.ReplayStoreFailed             = 4009   // replay store error — warning
WebhookLog.RotationUsedPreviousKey       = 4010   // accepted with rotated-out key — info
// …see the source for the full list.

The flow.trigger activity also receives a flow.disabled = true tag when the engine silent-skips a trigger for a disabled flow (v1.22+). Filter your APM by this tag to count how often disabled-flow triggers are still being attempted (a useful signal that a webhook producer or cron job hasn't picked up the disable yet).

Logging integrations

The library is logging-framework-agnostic — it only uses Microsoft.Extensions.Logging.ILogger<T>. Plug in any provider that honours ILogger.BeginScope and you get the engine's correlation properties (RunId, FlowId, StepKey, Attempt) on every nested log line, including logs emitted by your own step handlers.

Microsoft.Extensions.Logging (Console)

Built into the framework. Scopes are off by default — opt in via the formatter options:

builder.Logging.AddJsonConsole(o => o.IncludeScopes = true);
// or
builder.Logging.AddSimpleConsole(o => o.IncludeScopes = true);

Output:

{ "Timestamp":"…", "EventId":2002, "LogLevel":"Error",
  "Category":"FlowOrchestrator.Core.Execution.FlowOrchestratorEngine",
  "Message":"Step execution failed for submit_to_wms",
  "Scopes":[{"RunId":"3fa85f64-…","FlowId":"a1b2c3d4-…","StepKey":"submit_to_wms"}] }

Serilog

dotnet add package Serilog.AspNetCore
dotnet add package Serilog.Sinks.Seq      # or Console / File / Datadog / Splunk / …
builder.Host.UseSerilog((ctx, cfg) => cfg
    .ReadFrom.Configuration(ctx.Configuration)
    .Enrich.FromLogContext()                        // <-- required so scope props become structured fields
    .Enrich.WithProperty("Service", "OrderHub")
    .WriteTo.Seq("http://seq:5341"));

Query in Seq / any structured sink:

EventId.Id = 2002 and StepKey = 'submit_to_wms'

NLog

dotnet add package NLog.Web.AspNetCore
builder.Host.UseNLog();

nlog.config — use ${scopeproperty} to render scope keys:

<targets>
  <target xsi:type="Console" name="console"
          layout="${longdate} ${level} ${event-properties:item=EventId_Id} run=${scopeproperty:item=RunId} step=${scopeproperty:item=StepKey} - ${message} ${exception:format=tostring}" />
</targets>

OpenTelemetry Logs (auto trace correlation)

When you already use OTel for traces, exporting logs through the same pipeline gives automatic TraceId / SpanId correlation in every log line — clicking a log in your APM jumps straight to the trace:

using FlowOrchestrator.Core.Observability;

builder.Logging.AddOpenTelemetry(o =>
{
    o.IncludeFormattedMessage = true;
    o.IncludeScopes = true;                         // <-- emits RunId/FlowId/StepKey as log attributes
    o.ParseStateValues = true;
    o.AddOtlpExporter();
});

builder.Services.AddOpenTelemetry()
    .WithTracing(t => t.AddFlowOrchestratorInstrumentation().AddOtlpExporter())
    .WithMetrics(m => m.AddFlowOrchestratorInstrumentation().AddOtlpExporter());

Application Insights / Datadog / Splunk / Seq / Loki

All honour BeginScope — scope properties surface as customDimensions (App Insights), structured tags (Datadog/Splunk/Loki), or top-level fields (Seq) automatically. No FlowOrchestrator-specific configuration required beyond enabling scopes on the provider.

Tip

If your scope properties are missing in the sink output, the provider almost certainly has scopes disabled by default. Search its docs for IncludeScopes (Microsoft.Extensions.Logging.Console, OpenTelemetry), Enrich.FromLogContext (Serilog), or ${scopeproperty} (NLog).

Health checks

Wire the bundled storage probe so a load balancer can drop traffic when the flow store is unreachable:

builder.Services.AddHealthChecks().AddFlowOrchestratorHealthChecks();
app.MapHealthChecks("/health");

The check resolves whichever IFlowStore you registered (SQL Server, PostgreSQL, in-memory). Probe budget defaults to 5 s and is configurable. See Production Checklist for the full operational story.
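A common refinement is to keep a cheap liveness probe separate from the store-backed readiness probe, using standard ASP.NET Core HealthCheckOptions:

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

builder.Services.AddHealthChecks().AddFlowOrchestratorHealthChecks();

// Liveness: the process is up — run no registered checks at all.
app.MapHealthChecks("/health/live", new HealthCheckOptions { Predicate = _ => false });

// Readiness: run every check, including the flow-store probe.
app.MapHealthChecks("/health/ready");
```

This keeps container restarts (driven by liveness) independent of transient database outages, which only pull the instance out of rotation via readiness.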

Running with .NET Aspire

When running under Aspire, OTEL_EXPORTER_OTLP_ENDPOINT is injected automatically. Spans and metrics appear in the Aspire Dashboard with no extra configuration beyond AddFlowOrchestratorInstrumentation().

Important

For the engine's structured logs to show up in Aspire's Logs tab (with RunId / StepKey / EventId as searchable attributes), wire up builder.Logging.AddOpenTelemetry(...) with IncludeScopes = true and AddOtlpExporter() — see the OpenTelemetry Logs example above. OTel's tracing and metrics pipelines do not automatically wire the logging pipeline; without this snippet the Logs tab will be empty even though traces and metrics flow through.


Run Control State

GET /flows/api/runs/{runId}/control

Returns the current control record for a run:

{
  "runId": "...",
  "cancellationRequested": false,
  "timedOutAt": null,
  "idempotencyKey": "batch-2026-04-20-001",
  "timeoutAt": "2026-04-20T12:10:00Z"
}

This is useful for diagnosing why a run stopped or was cancelled.


Active Runs

GET /flows/api/runs/active

Returns all runs currently in Running status. Use this to build operational monitors or alert on stuck runs.
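The response is a list of run records; the shape below is illustrative rather than a contract:

```json
[
  {
    "runId": "3fa85f64-…",
    "flowId": "a1b2c3d4-…",
    "status": "Running",
    "startedAt": "2026-04-20T12:00:00Z"
  }
]
```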


Dashboard Statistics

GET /flows/api/runs/stats
{
  "totalFlows": 6,
  "activeRuns": 2,
  "succeededToday": 47,
  "failedToday": 1,
  "cancelledToday": 0
}

Data Retention

FlowOrchestrator can automatically delete old run data to prevent unbounded database growth.

options.Retention.Enabled = true;
options.Retention.DataTtl = TimeSpan.FromDays(30);     // delete runs older than 30 days
options.Retention.SweepInterval = TimeSpan.FromHours(1); // run the sweep every hour

When enabled, FlowRetentionHostedService runs on the configured interval and calls IFlowRetentionStore.DeleteOldRunsAsync(cutoff). The SQL Server and PostgreSQL backends cascade-delete all related records (steps, outputs, events, control) when a run is deleted.
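In spirit, the hosted service is a periodic cutoff computation plus a delete call. The sketch below is a simplified stand-in for FlowRetentionHostedService, not the shipped implementation — the constructor shape and the exact DeleteOldRunsAsync signature are assumptions:

```csharp
using Microsoft.Extensions.Hosting;

// Hypothetical simplification of the retention sweep loop.
sealed class RetentionSweep(IFlowRetentionStore store, FlowOrchestratorOptions options)
    : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(options.Retention.SweepInterval);
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            // Everything older than the TTL is eligible for deletion;
            // the store cascades to steps, outputs, events, and control.
            var cutoff = DateTimeOffset.UtcNow - options.Retention.DataTtl;
            await store.DeleteOldRunsAsync(cutoff);
        }
    }
}
```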

Tip

SweepInterval defaults to 1 hour and DataTtl defaults to 30 days. Retention is disabled by default — opt in explicitly.

| Option | Default | Description |
| --- | --- | --- |
| Retention.Enabled | false | Enable the background sweep |
| Retention.DataTtl | 30 days | Runs older than this threshold are deleted |
| Retention.SweepInterval | 1 hour | How often the sweep runs |