Pipeline Fault Recovery: Two Failure Domains, One Session
Sequential poller timeout and reconciliation scoping defect diagnosed under production load. Both root causes isolated; pipeline restored to SUCCESS.
Problem / Hypothesis
On March 13, 2026, SignalFoundry — a 7-stage automated SOC triage pipeline processing live Wazuh alerts — began failing on every scheduled run. The pipeline heartbeat reported repeated failures at the poll_alerts stage at roughly 5-minute intervals. No unit test regression. No missing configuration. No credential rotation.
The hypothesis: something between the runner and the Wazuh Indexer endpoint had broken. But the failure mode masked a second, independent defect that only surfaced after the first was resolved.
Environment
- total_cases: 25,167
- escalated_cases: 2,478
- detection_rules: 210 (CI-verified)
- host_coverage: 8/8
- mismatch_count: 0
Stack: SignalFoundry (Python + PowerShell), Wazuh Manager + Indexer (OpenSearch), Windows Task Scheduler (5-min interval), heartbeat.json telemetry.
Methodology
Step 1 — Locate the failure
Read heartbeat.json. The fail_stage field pointed directly to poll_alerts. No ambiguity. The pipeline was dying before triage, before case processing, before reconciliation.
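This triage step can be sketched in a few lines. The field name fail_stage comes from the incident; the surrounding file layout is an assumption for illustration:

```python
import json
from pathlib import Path

def last_failed_stage(heartbeat_path):
    """Return the stage recorded as failing in the most recent heartbeat,
    or None if the last run succeeded. Only the fail_stage key is taken
    from the incident; other fields are illustrative."""
    beat = json.loads(Path(heartbeat_path).read_text())
    return beat.get("fail_stage")
```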
Step 2 — Rule out configuration drift
Confirmed the poller had a configured endpoint, valid user value, and readable password-file source. No credential rotation. No config modifications since last successful run.
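A minimal preflight check along these lines rules out configuration drift before blaming the network. The key names endpoint, user, and password_file are assumptions, not the actual SignalFoundry config schema:

```python
import os

def preflight(config):
    """Return a list of configuration problems; an empty list means the
    poller config is sane and the fault likely lies elsewhere."""
    problems = []
    if not config.get("endpoint"):
        problems.append("missing endpoint")
    if not config.get("user"):
        problems.append("missing user")
    pw_file = config.get("password_file")
    if not pw_file or not os.access(pw_file, os.R_OK):
        problems.append("password file missing or unreadable")
    return problems
```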
Step 3 — Separate local from remote
The runner had general network connectivity. Direct connection to the Wazuh Indexer REST API was timing out. Failure isolated to the network path — not the poller logic itself.
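Separating "the host has network" from "this endpoint answers" comes down to a plain TCP handshake test. A sketch (the host and port in the usage note are placeholders, not the actual indexer address):

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """True if a TCP connection to host:port completes within timeout.
    A timeout or refusal here points at the network path or the remote
    service, not at the poller logic."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_reachable("wazuh-indexer.internal", 9200)` exercises the Indexer REST port independently of any credentials or query logic.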
Step 4 — Trace the retry defect
fetch_with_retry handled HTTPError and URLError differently. On URLError (connection timeout), the retry counter was not incremented before the sleep call. Under delayed connection reset conditions, the first attempt consumed an exception without entering the retry path.
```python
def fetch_with_retry(url, retries=3, backoff=2):
    for attempt in range(retries):
        try:
            return urlopen(Request(url, ...))
        except HTTPError as e:
            if e.code == 401:
                raise
        except URLError as e:
            pass  # BUG: counter not incremented, sleep skipped
        time.sleep(backoff ** attempt)
```
Step 5 — Restore polling and re-run
After upstream connectivity was corrected and the retry logic fixed, the poller completed successfully. Full pipeline triggered. It failed again — this time at reconcile.
Step 6 — Diagnose the reconciliation defect
reconcile-state.py was computing mismatches against unscoped repo_ids instead of repo_ids_autosoc. Non-AutoSOC directories were counted as phantom mismatches, inflating mismatch_count and triggering FAIL even though the actual ledger/content state was clean.
Step 7 — Fix and validate
Scoped all six mismatch computations to repo_ids_autosoc. Re-ran strict reconciliation.
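The scoping fix can be illustrated with set arithmetic. The names repo_ids and repo_ids_autosoc come from the incident; the prefix-based scoping rule and everything else here are assumptions for illustration, not the actual reconcile-state.py logic:

```python
def count_mismatches(ledger_ids, repo_ids, prefix="AUTOSOC-"):
    """Compare the ledger against repo directories, scoped to AutoSOC-owned
    IDs. Without the scoping step, unrelated directories in repo_ids show
    up as phantom mismatches and inflate the count."""
    repo_ids_autosoc = {rid for rid in repo_ids if rid.startswith(prefix)}
    ledger = set(ledger_ids)
    missing_from_repo = ledger - repo_ids_autosoc
    unknown_in_repo = repo_ids_autosoc - ledger
    return len(missing_from_repo) + len(unknown_in_repo)
```

With a ledger of {AUTOSOC-0001, AUTOSOC-0002} and repo directories {AUTOSOC-0001, AUTOSOC-0002, tools, docs}, the scoped count is 0, whereas an unscoped comparison would report two phantom mismatches.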
Evidence
SECRET_SOURCE=PASS_FILE MODE=realtime POLLED=0 SAVED=0 NO_NEW_ALERTS=TRUE
ledger_total_cases=25167 ledger_escalated_metric=2478 repo_incident_dirs_autosoc_scoped=2478 content_incidents=2478 MISMATCH_COUNT=0
- run_id: autosoc-20260313T215029Z-31020
- status: SUCCESS
- duration_seconds: 31.843
- cases_scanned: 26032
- cases_processed: 173
- reconciliation.status: PASS
- reconciliation.mismatch_count: 0
- coverage.status: PASS
- coverage.present_hosts: 8
- coverage.missing_hosts: 0
Findings
The second defect was invisible while the first was active. The poller failure prevented the pipeline from reaching reconciliation. Only after polling was restored did the reconciliation defect surface — a sequential failure domain that would have been missed by restarting the pipeline without a full end-to-end validation pass.
Operational Impact
Verification
A reviewer can confirm these claims through:
- Pipeline recovery documentation: docs/execution/AUTOSOC_PIPELINE_RECOVERY_CASE_STUDY_03-13-2026.md
- Debug trace: docs/execution/AUTOSOC_PIPELINE_INCIDENT_DEBUG_03-13-2026.md
- Authority snapshot (locked 03-25): PROOF_PACK/VERIFIED_COUNTS.md — 324,074 total cases post-recovery
- Reconciliation fix: Commit history for reconcile-state.py — repo_ids_autosoc replacing repo_ids