SignalFoundry Case Study

Live Recovery of a Production Race Condition at 505,000-File Scale

A production SOC automation pipeline hit a scaling threshold that turned a latent filesystem race condition into a guaranteed crash. The infrastructure held. The data held. The defect was diagnosed from production tracebacks, fixed with 20 lines across 4 code sites, and verified under live load—with the race actively firing and every instance caught.

System: SignalFoundry
Status: Resolved
Data Loss: Zero
Date: April 2026
01 — System Context

What SignalFoundry Does

SignalFoundry is a Python-based SOC automation pipeline that processes security alerts end-to-end without manual intervention. It runs on a scheduled cadence, executing a seven-stage workflow: poll alerts from a Wazuh SIEM indexer via HTTPS, enforce queue capacity limits, triage each alert against a policy engine with false-positive signatures and agent alias mappings, generate structured case directories, assemble escalation packs for high-severity detections, reconcile ledger totals against case counts on disk, and write pipeline health heartbeats with per-run metrics.
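The seven-stage cadence described above can be sketched as an ordered runner. This is a hypothetical illustration: the stage names below paraphrase the workflow description, and apart from poll-alerts.py and triage.py the real module names are not given in this case study.

```python
# Hypothetical sketch of the seven-stage run loop described above.
STAGES = [
    "poll_alerts",           # pull alerts from the Wazuh indexer over HTTPS
    "enforce_queue_cap",     # keep the queue under its capacity limit
    "triage",                # policy engine: FP signatures, agent aliases
    "generate_cases",        # structured case directories
    "assemble_escalations",  # packs for high-severity detections
    "reconcile_ledger",      # ledger totals vs. case counts on disk
    "write_heartbeat",       # per-run health metrics
]

def run_pipeline(registry):
    """Run each stage in order; a stage failure aborts the run (pre-fix behavior)."""
    for name in STAGES:
        registry[name]()
```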

The system handles alerts from a multi-host environment spanning Windows and Linux endpoints. Rule types range from file-integrity monitoring and rootcheck anomalies to Sysmon behavioral detections and authentication events.

Total Cases: 324,074
Pipeline Stages: 7
Escalated: 8,574
Ledger Mismatches: 0
Case Disposition Breakdown
Benign: 199,672
Known FP: 85,953
Review: 29,875
Escalated: 8,574

The dispositions sum to 324,074 total cases (199,672 + 85,953 + 29,875 + 8,574). Ledger reconciliation mismatches: zero before, during, and after the incident.

02 — The Incident

Three Crashes Before Breakfast

Three consecutive pipeline runs crashed at the alert-ingestion stage within 2.5 hours, each terminating the entire pipeline. Two code stages, three different queue files, one exception class: FileNotFoundError.

05:29:07 UTC
Crash #1: poll-alerts.py enforce_queue_cap() hit stat() on a vanished file. Exited in 5.4s.
05:39:02 UTC
Crash #2: triage.py main() hit read_text() on a different vanished file.
07:44:03 UTC
Crash #3: poll-alerts.py again, on a third file. Heartbeat: FAILED. All downstream stages blocked.
~10:00 UTC
Root cause identified: TOCTOU race at 4 code sites across 2 files.
~10:30 UTC
Hotfix applied: 4 guards, ~20 lines. 28/28 tests pass. Zero behavioral changes.
11:54 UTC
First post-hotfix run. Sustained 80+ minutes without crashing.
12:49 UTC
Live verification: 1,056 files vanished during the sort, all caught. VERDICT=PASS.

Three crashes, three different files, same exception class = timing-dependent, not data-dependent. Queue at incident start: 505,836 files. Intended cap: 2,000.

03 — Root Cause Analysis

TOCTOU: The Race Between Enumeration and Action

The defect was a TOCTOU (time-of-check-to-time-of-use) race condition: state observed at time T0 is no longer valid at time T1, when the code acts on it.

vulnerable code path (before fix)

# Step 1: glob() enumerates a snapshot of the queue
# Step 2: sorted() calls stat() on each path, seconds later
queue_files = sorted(
    QUEUE_ROOT.glob("*.json"),
    key=lambda p: p.stat().st_mtime,  # FileNotFoundError if the file has moved
)

With 505,836 files, glob() takes seconds. During that window, files move to the processed archive. Any vanished file throws FileNotFoundError, terminating the pipeline.
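The race can be reproduced in miniature with one thread sorting by mtime while another empties the directory. This is an illustrative sketch, not the production code; unlink() stands in for the archival shutil.move(), and because the outcome is timing-dependent, the race may or may not fire on any given run.

```python
# Minimal reproduction of the glob-to-stat TOCTOU race (illustrative only).
import tempfile
import threading
from pathlib import Path

queue = Path(tempfile.mkdtemp())
for i in range(200):
    (queue / f"alert_{i}.json").write_text("{}")

def archive():
    """Simulates overflow archival moving files out from under the sort."""
    for p in list(queue.glob("*.json")):
        p.unlink(missing_ok=True)

snapshot = list(queue.glob("*.json"))   # step 1: enumerate a snapshot
t = threading.Thread(target=archive)
t.start()                               # files begin vanishing
try:
    snapshot.sort(key=lambda p: p.stat().st_mtime)  # step 2: stat() each path
except FileNotFoundError as exc:
    print("race fired:", exc)           # exactly the production failure mode
t.join()
```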

Why It Surfaced at This Scale

The race existed from inception. At <10K files, the glob-to-stat window is milliseconds. At 505K, it stretches to seconds. Overflow archival moves hundreds of thousands of files per pass. The race goes from theoretical to near-certain.

The infrastructure was architecturally sound. Queue logic, triage, policy rules, ledger accounting, reconciliation—all correct. What failed was a single assumption: that glob() results remain valid across subsequent operations.

04 — The Failure Loop

Self-Reinforcing Failure Cycle

Pipeline crashes at enforce_queue_cap()
Cap enforcement never completes → queue grows
Larger queue = wider glob-to-stat window
Next crash guaranteed → cycle repeats

The symptom was "pipeline keeps crashing." The cause was a race condition. The amplifier was that crashing prevented cap enforcement from completing. Each failure guaranteed the next.

05 — The Fix

Four Guards, Two Files, Twenty Lines

Minimum viable change. No behavioral modifications. Defensive-only. Immediately reversible.

Site · File · Operation · Guard
1 · poll-alerts.py · stat() in sort_safe_mtime() · try/except → returns 0.0
2 · poll-alerts.py · shutil.move() · .exists() check + try/except
3 · triage.py · read_text() · try/except: continue
4 · triage.py · shutil.move() · try/except: pass

Unchanged: queue ordering, cursor state, credentials, config files, lock logic, heartbeat and reconciliation, processing order, triage rules, and ledger accounting. The only behavioral difference: vanished files are now skipped instead of crashing the pipeline.
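Guard site 1 can be sketched as follows. sort_safe_mtime() and QUEUE_ROOT are named in this case study, but the exact implementation and queue path are assumptions here.

```python
from pathlib import Path

QUEUE_ROOT = Path("queue")  # assumed location; the real path is not given

def sort_safe_mtime(p: Path) -> float:
    """Sort key that tolerates files vanishing between glob() and stat()."""
    try:
        return p.stat().st_mtime
    except FileNotFoundError:
        return 0.0  # guard site 1: a vanished file sorts first instead of crashing

queue_files = sorted(QUEUE_ROOT.glob("*.json"), key=sort_safe_mtime)
```

The 0.0 sentinel preserves a total ordering over the snapshot, so queue ordering semantics are untouched for every file that still exists.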

06 — Verification

Verified Under Production Load

Three verification levels—static, targeted live, full-scale live—all against the production queue while the pipeline was processing. No downtime. No flush. No restart.

py_compile clean · 28/28 tests in 0.029s · PASS

Targeted Race Exercise

3/3 vanished — all caught
QUEUE_SAMPLE_SIZE: 3
  STAT_VANISHED: file_1.json (guard → 0.0)
  STAT_VANISHED: file_2.json (guard → 0.0)
  STAT_VANISHED: file_3.json (guard → 0.0)

GUARD_CAUGHT: TRUE
FILENOTFOUNDERROR: FALSE
VERDICT: PASS

Full-Queue Sort

1,056 vanished — all caught
FILES_ENUMERATED: 35,162
SORT_DURATION: 307.3s
FILES_VANISHED: 1,056
UNHANDLED_EXCEPTIONS: 0

VERDICT: PASS

Before / After

Pipeline Run Comparison

Before (pre-hotfix):
Stability: <10 seconds to crash
Queue: growing (crashes block cleanup)
Processed: 0 files per run
Exceptions: 1 per run (fatal)
Stages: 0 of 7
Heartbeat: FAILED

After (post-hotfix):
Stability: 80+ minutes sustained
Queue: draining, 35,340 → 28,901
Drain rate: ~340 files/min
Exceptions: 0 unhandled
Stages: all 7 progressing
Heartbeat: active (no errors)
07 — Infrastructure Integrity

What Held

The pipeline architecture did not fail. What failed was a single assumption at four code sites.

Data Loss: 0
Reconciliation Mismatches: 0
Ledger: Verified
Case Directories: Intact

~7 hours of processing delay. Zero data loss. Zero credential exposure. Zero architectural compromise. Pipeline resumed from exactly where it left off, draining the backlog at 340 files per minute.

08 — Quantitative Analysis

The Math Behind the Race

glob() over the 505K-file queue takes ~4.4s; sorting by mtime then issues one stat() call per file at the observed ~114 calls/second, roughly 74 minutes in total. Files enumerated early in the snapshot have their stat() calls occur seconds to minutes later.
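The arithmetic above, worked through:

```python
queue_size = 505_836   # files in the queue at incident start
stat_rate = 114        # observed stat() calls per second during the live sort

sort_minutes = queue_size / stat_rate / 60
print(f"full mtime sort: ~{sort_minutes:.0f} minutes of stat() calls")  # ~74
```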

Observed Race Frequency
Targeted (3)
100%
Full sort (35K)
3.0%

A 3% vanish rate across 35K files means ~1,055 crash-triggering events per sort without the fix, and any single one terminates the pipeline. Failure was guaranteed.
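As a sanity check on that rate:

```python
files_enumerated = 35_162   # from the full-queue sort verification
vanish_rate = 0.030         # observed fraction vanishing mid-sort

potential_crashes = files_enumerated * vanish_rate
print(f"~{potential_crashes:.0f} crash-triggering events per sort")  # ~1055
```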

09 — What This Demonstrates

Incident Response on a Live System

Symptom (heartbeat FAILED) → evidence (exact tracebacks) → pattern (three crashes, three files = timing) → root cause (TOCTOU at scale) → fix (four guards) → verification (race firing, all caught).

The system could not be taken offline. Diagnosis, fix, and verification all occurred under active load.

The fix was 20 lines. The diagnosis was the hard part.

10 — Lessons Learned

What Broke, What Held, What Changed

Scale is a threat model

Code safe at 1,000 files is unsafe at 500,000. Performance characteristics become security characteristics when they widen race windows.

glob() returns a snapshot, not a contract

Every stat(), read(), and move() on a globbed path must handle FileNotFoundError.
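Applied to a full drain loop, the rule looks like this. It is a hedged sketch: drain() and its parameters are hypothetical names for illustration, not SignalFoundry's actual API.

```python
import shutil
from pathlib import Path

def drain(queue: Path, processed: Path) -> int:
    """Process queued alerts, skipping any file that vanishes mid-run."""
    handled = 0
    for path in sorted(queue.glob("*.json")):
        try:
            payload = path.read_text()   # may raise if another pass moved it
        except FileNotFoundError:
            continue                     # vanished between glob() and read: skip
        _ = payload                      # ... triage would happen here ...
        try:
            shutil.move(str(path), str(processed / path.name))
        except FileNotFoundError:
            pass                         # already archived by a concurrent pass
        else:
            handled += 1
    return handled
```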

Feedback loops hide root causes

Each crash grew the queue, which widened the window, which guaranteed the next crash. Breaking the cycle required fixing the race, not restarting the pipeline.

Defensive filesystem code is not optional

The cost of try/except FileNotFoundError is zero in the success path and prevents a pipeline-terminating crash in the failure path. On any live queue, this is not defensive programming. It is correct programming.
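The zero-cost claim can be checked with a quick micro-benchmark. This is illustrative: absolute numbers vary by machine and interpreter, and CPython 3.11+ makes a try block that never raises essentially free.

```python
import timeit

def bare(name: str) -> int:
    return len(name)

def guarded(name: str) -> int:
    try:
        return len(name)
    except FileNotFoundError:
        return 0  # never taken in the success path

t_bare = timeit.timeit(lambda: bare("alert.json"), number=200_000)
t_guarded = timeit.timeit(lambda: guarded("alert.json"), number=200_000)
print(f"bare: {t_bare:.3f}s  guarded: {t_guarded:.3f}s")
```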

hawkinsops.com · SignalFoundry

All evidence from production logs and live filesystem state—not synthetic tests or staging.