Case Study — Incident Response

Pipeline Fault Recovery: Two Failure Domains, One Session

A sequential poller timeout and a reconciliation scoping defect, diagnosed under production load. Both root causes isolated and the pipeline restored to SUCCESS.

2026-03-13 · Dual-fault incident · Poller retry defect · Reconciliation scoping · MISMATCH_COUNT=0

Problem / Hypothesis

On March 13, 2026, SignalFoundry — a 7-stage automated SOC triage pipeline processing live Wazuh alerts — began failing on every scheduled run. The pipeline heartbeat reported repeated failures at the poll_alerts stage at roughly 5-minute intervals. No unit test regression. No missing configuration. No credential rotation.

The hypothesis: something between the runner and the Wazuh Indexer endpoint had broken. But the failure mode masked a second, independent defect that only surfaced after the first was resolved.

Environment

Pre-failure validated state (baseline)
total_cases:       25,167
escalated_cases:   2,478
detection_rules:   210 (CI-verified)
host_coverage:     8/8
mismatch_count:    0

Stack: SignalFoundry (Python + PowerShell), Wazuh Manager + Indexer (OpenSearch), Windows Task Scheduler (5-min interval), heartbeat.json telemetry.

Methodology

Step 1 — Locate the failure

Read heartbeat.json. The fail_stage field pointed directly to poll_alerts. No ambiguity. The pipeline was dying before triage, before case processing, before reconciliation.
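This first step reduces to a few lines. The heartbeat.json filename and the fail_stage field come from the incident itself; the helper name is illustrative:

```python
import json

def locate_failure(path="heartbeat.json"):
    """Return the stage the pipeline telemetry recorded as failed.

    Assumes a flat JSON telemetry file carrying a fail_stage field,
    as the SignalFoundry heartbeat does in this incident; any other
    structure would need its own lookup.
    """
    with open(path) as f:
        heartbeat = json.load(f)
    return heartbeat.get("fail_stage")  # e.g. "poll_alerts"
```

Reading the failure stage before touching anything else keeps the investigation ordered: everything downstream of that stage is unreached, not innocent.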

Step 2 — Rule out configuration drift

Confirmed the poller had a configured endpoint, valid user value, and readable password-file source. No credential rotation. No config modifications since last successful run.
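A hypothetical sketch of those drift checks. The cfg keys endpoint, user, and password_file are assumptions; the real poller's config layout is not shown in this case study:

```python
import os

def preflight(cfg):
    """Cheap config-drift checks to run before blaming the network.

    All key names here are illustrative placeholders, not the
    production schema. Returns a list of problems; empty means
    configuration is not the culprit.
    """
    problems = []
    if not cfg.get("endpoint"):
        problems.append("missing endpoint")
    if not cfg.get("user"):
        problems.append("missing user")
    pw = cfg.get("password_file", "")
    if not (pw and os.path.isfile(pw) and os.access(pw, os.R_OK)):
        problems.append("password file unreadable")
    return problems
```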

Step 3 — Separate local from remote

The runner had general network connectivity. Direct connection to the Wazuh Indexer REST API was timing out. Failure isolated to the network path — not the poller logic itself.
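That separation can be probed with a plain TCP connect test. Host and port here are placeholders (the Wazuh Indexer's REST API commonly listens on 9200):

```python
import socket

def tcp_reachable(host, port, timeout=5):
    """Probe one TCP endpoint with a hard timeout.

    Distinguishes 'the runner has no network at all' from 'this
    specific endpoint is unreachable or timing out'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Probing a known-good endpoint and the Indexer endpoint side by side isolates the failure to one network path rather than to the poller code.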

Step 4 — Trace the retry defect

fetch_with_retry handled HTTPError and URLError differently. On URLError (a connection timeout), control returned to the top of the loop without incrementing the retry counter or reaching the sleep call, so delayed connection resets were swallowed without ever entering a meaningful retry path.

Retry defect before the fix (pseudocode)
def fetch_with_retry(url, retries=3, backoff=2):
    attempt = 0
    while attempt < retries:
        try:
            return urlopen(Request(url, ...))
        except HTTPError as e:
            if e.code == 401:
                raise  # auth failure: never retryable
        except URLError:
            # BUG: continue jumps back to the top of the loop before
            # the counter increment and the sleep below, so connection
            # timeouts spin without advancing attempt or backing off
            continue
        attempt += 1
        time.sleep(backoff ** attempt)
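The corrected loop counts and backs off every failed attempt, for both error classes. A minimal sketch of that logic follows; the injectable opener and sleep parameters are an assumption added here purely to make the loop testable, not the production signature:

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def fetch_with_retry(url, retries=3, backoff=2,
                     opener=urlopen, sleep=time.sleep):
    """Retry sketch: every failed attempt is counted and backed off
    uniformly, whether it raised HTTPError or URLError."""
    last_err = None
    for attempt in range(retries):
        try:
            return opener(Request(url))
        except HTTPError as e:
            if e.code == 401:
                raise                 # auth failures are never retried
            last_err = e
        except URLError as e:
            last_err = e              # connection timeouts now retry too
        sleep(backoff ** attempt)     # 1s, 2s, 4s, ... grows each attempt
    raise last_err
```

Note HTTPError must be caught before URLError, since it subclasses it; reversing the order would silently swallow the 401 fast-fail.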

Step 5 — Restore polling and re-run

After upstream connectivity was corrected and the retry logic fixed, the poller completed successfully. Full pipeline triggered. It failed again — this time at reconcile.

Step 6 — Diagnose the reconciliation defect

reconcile-state.py was computing mismatches against unscoped repo_ids instead of repo_ids_autosoc. Non-AutoSOC directories were counted as phantom mismatches, inflating mismatch_count and triggering FAIL even though the actual ledger/content state was clean.
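A sketch of the scoping fix. The prefix filter and AUTOSOC- naming are assumptions for illustration; the real reconcile-state.py draws on a repo_ids_autosoc list and applies the scope to all six mismatch computations:

```python
def count_mismatches(ledger_ids, repo_ids, autosoc_prefix="AUTOSOC-"):
    """Count ledger/repo disagreements over AutoSOC-scoped ids only.

    Comparing the ledger against unscoped repo_ids would count every
    non-AutoSOC directory as a phantom mismatch, which is exactly the
    defect described above.
    """
    repo_ids_autosoc = {r for r in repo_ids if r.startswith(autosoc_prefix)}
    # Symmetric difference: ids present in exactly one of the two views.
    return len(set(ledger_ids) ^ repo_ids_autosoc)
```

With a clean ledger the scoped count is 0, while the unscoped symmetric difference still reports every stray directory as a mismatch.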

Step 7 — Fix and validate

Scoped all six mismatch computations to repo_ids_autosoc. Re-ran strict reconciliation.

Evidence

Standalone poller validation after fix (PASS)
SECRET_SOURCE=PASS_FILE
MODE=realtime
POLLED=0
SAVED=0
NO_NEW_ALERTS=TRUE

Reconciliation validation after scoping fix (PASS)
ledger_total_cases=25167
ledger_escalated_metric=2478
repo_incident_dirs_autosoc_scoped=2478
content_incidents=2478
MISMATCH_COUNT=0

Full platform-health validation, same day (SUCCESS)
run_id: autosoc-20260313T215029Z-31020
status: SUCCESS
duration_seconds: 31.843
cases_scanned: 26032
cases_processed: 173
reconciliation.status: PASS
reconciliation.mismatch_count: 0
coverage.status: PASS
coverage.present_hosts: 8
coverage.missing_hosts: 0

Findings

Root Cause 1: Poller retry defect. On URLError (connection timeout), the retry path bypassed both the counter increment and the backoff sleep, so every run spun on the same failure without ever applying meaningful backoff.

Root Cause 2: Reconciliation scoping error. Mismatches were computed against unscoped repo_ids instead of repo_ids_autosoc. The defect was latent while the repo was small and manifested at scale as non-AutoSOC directories accumulated.

The second defect was invisible while the first was active. The poller failure prevented the pipeline from reaching reconciliation. Only after polling was restored did the reconciliation defect surface — a sequential failure domain that would have been missed by restarting the pipeline without a full end-to-end validation pass.

Operational Impact

Pipeline restored from repeated FAIL to sustained SUCCESS in one session. Strict reconciliation restored to zero hard mismatches. 173 new cases processed on the recovery run, confirming live alert ingestion was operational. Host telemetry coverage confirmed at 8/8 — no monitoring gaps during the outage window.

Verification

A reviewer can confirm these claims through:

  • Pipeline recovery documentation: docs/execution/AUTOSOC_PIPELINE_RECOVERY_CASE_STUDY_03-13-2026.md
  • Debug trace: docs/execution/AUTOSOC_PIPELINE_INCIDENT_DEBUG_03-13-2026.md
  • Authority snapshot (locked 03-25): PROOF_PACK/VERIFIED_COUNTS.md — 324,074 total cases post-recovery
  • Reconciliation fix: Commit history for reconcile-state.py (repo_ids_autosoc replacing repo_ids)