SignalFoundry Case Study

Live Recovery of a Production Race Condition at 505,000-File Scale

A production SOC automation pipeline hit a scaling threshold that turned a latent filesystem race condition into a guaranteed crash. The infrastructure held. The data held. The defect was diagnosed from production tracebacks, fixed with 20 lines across 4 code sites, and verified under live load—with the race actively firing and every instance caught.

System: SignalFoundry
Status: Resolved
Data Loss: Zero
Date: April 2026
01 — System Context

What SignalFoundry Does

SignalFoundry is a Python-based SOC automation pipeline that processes security alerts end-to-end without manual intervention. It runs on a scheduled cadence, executing a seven-stage workflow: poll alerts from a Wazuh SIEM indexer via HTTPS, enforce queue capacity limits, triage each alert against a policy engine with false-positive signatures and agent alias mappings, generate structured case directories, assemble escalation packs for high-severity detections, reconcile ledger totals against case counts on disk, and write pipeline health heartbeats with per-run metrics.
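The seven-stage cadence described above can be sketched as an ordered runner. This is a hypothetical illustration: the stage names below paraphrase the workflow description, and apart from poll-alerts.py and triage.py the real module names are not given in this case study.

```python
# Hypothetical sketch of the seven-stage run loop described above.
STAGES = [
    "poll_alerts",           # pull alerts from the Wazuh indexer over HTTPS
    "enforce_queue_cap",     # keep the queue under its capacity limit
    "triage",                # policy engine: FP signatures, agent aliases
    "generate_cases",        # structured case directories
    "assemble_escalations",  # packs for high-severity detections
    "reconcile_ledger",      # ledger totals vs. case counts on disk
    "write_heartbeat",       # per-run health metrics
]

def run_pipeline(registry):
    """Run each stage in order; a stage failure aborts the run (pre-fix behavior)."""
    for name in STAGES:
        registry[name]()
```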

The system handles alerts from a multi-host environment spanning Windows and Linux endpoints. Rule types range from file-integrity monitoring and rootcheck anomalies to Sysmon behavioral detections and authentication events.

Total Cases: 324,074
Pipeline Stages: 7
Escalated: 8,574
Ledger Mismatches: 0
Case Disposition Breakdown
Benign: 199,672
Known FP: 85,953
Review: 29,875
Escalated: 8,574

The dispositions sum to 324,074 total cases (199,672 + 85,953 + 29,875 + 8,574). Ledger reconciliation mismatches: zero before, during, and after the incident.

02 — The Incident

Three Crashes Before Breakfast

Three consecutive pipeline runs crashed at the alert-ingestion stage within 2.5 hours, each terminating the entire pipeline. Two code stages, three different queue files, one exception class: FileNotFoundError.

05:29:07 UTC
Crash #1: poll-alerts.py enforce_queue_cap() hit stat() on a vanished file. Exited in 5.4s.
05:39:02 UTC
Crash #2: triage.py main() hit read_text() on a different vanished file.
07:44:03 UTC
Crash #3: poll-alerts.py again, on a third file. Heartbeat: FAILED. All downstream stages blocked.
~10:00 UTC
Root cause identified: TOCTOU race at 4 code sites across 2 files.
~10:30 UTC
Hotfix applied: 4 guards, ~20 lines. 28/28 tests pass. Zero behavioral changes.
11:54 UTC
First post-hotfix run. Sustained 80+ minutes without crashing.
12:49 UTC
Live verification: 1,056 files vanished during the sort, all caught. VERDICT=PASS.

Three crashes, three different files, same exception class = timing-dependent, not data-dependent. Queue at incident start: 505,836 files. Intended cap: 2,000.

03 — Root Cause Analysis

TOCTOU: The Race Between Enumeration and Action

The defect was a TOCTOU (time-of-check-to-time-of-use) race condition: state observed at time T0 is no longer valid at time T1, when the code acts on it.

vulnerable code path (before fix)

# Step 1: glob() enumerates a snapshot of the queue
# Step 2: sorted() calls stat() on each path, seconds later
queue_files = sorted(
    QUEUE_ROOT.glob("*.json"),
    key=lambda p: p.stat().st_mtime,  # FileNotFoundError if the file has moved
)

With 505,836 files, glob() takes seconds. During that window, files move to the processed archive. Any vanished file throws FileNotFoundError, terminating the pipeline.
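The race can be reproduced in miniature with one thread sorting by mtime while another empties the directory. This is an illustrative sketch, not the production code; unlink() stands in for the archival shutil.move(), and because the outcome is timing-dependent, the race may or may not fire on any given run.

```python
# Minimal reproduction of the glob-to-stat TOCTOU race (illustrative only).
import tempfile
import threading
from pathlib import Path

queue = Path(tempfile.mkdtemp())
for i in range(200):
    (queue / f"alert_{i}.json").write_text("{}")

def archive():
    """Simulates overflow archival moving files out from under the sort."""
    for p in list(queue.glob("*.json")):
        p.unlink(missing_ok=True)

snapshot = list(queue.glob("*.json"))   # step 1: enumerate a snapshot
t = threading.Thread(target=archive)
t.start()                               # files begin vanishing
try:
    snapshot.sort(key=lambda p: p.stat().st_mtime)  # step 2: stat() each path
except FileNotFoundError as exc:
    print("race fired:", exc)           # exactly the production failure mode
t.join()
```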

Why It Surfaced at This Scale

The race existed from inception. At <10K files, the glob-to-stat window is milliseconds. At 505K, it stretches to seconds. Overflow archival moves hundreds of thousands of files per pass. The race goes from theoretical to near-certain.

The infrastructure was architecturally sound. Queue logic, triage, policy rules, ledger accounting, reconciliation—all correct. What failed was a single assumption: that glob() results remain valid across subsequent operations.

04 — The Failure Loop

Self-Reinforcing Failure Cycle

Pipeline crashes at enforce_queue_cap()
Cap enforcement never completes → queue grows
Larger queue = wider glob-to-stat window
Next crash guaranteed → cycle repeats

The symptom was "pipeline keeps crashing." The cause was a race condition. The amplifier was that crashing prevented cap enforcement from completing. Each failure guaranteed the next.

05 — The Fix

Four Guards, Two Files, Twenty Lines

Minimum viable change. No behavioral modifications. Defensive-only. Immediately reversible.

Site · File · Operation · Guard
1 · poll-alerts.py · stat() in sort_safe_mtime() · try/except → returns 0.0
2 · poll-alerts.py · shutil.move() · .exists() check + try/except
3 · triage.py · read_text() · try/except: continue
4 · triage.py · shutil.move() · try/except: pass

Unchanged: queue ordering, cursor state, credentials, config files, lock logic, heartbeat and reconciliation, processing order, triage rules, and ledger accounting. The only behavioral difference: vanished files are now skipped instead of crashing the pipeline.
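Guard site 1 can be sketched as follows. sort_safe_mtime() and QUEUE_ROOT are named in this case study, but the exact implementation and queue path are assumptions here.

```python
from pathlib import Path

QUEUE_ROOT = Path("queue")  # assumed location; the real path is not given

def sort_safe_mtime(p: Path) -> float:
    """Sort key that tolerates files vanishing between glob() and stat()."""
    try:
        return p.stat().st_mtime
    except FileNotFoundError:
        return 0.0  # guard site 1: a vanished file sorts first instead of crashing

queue_files = sorted(QUEUE_ROOT.glob("*.json"), key=sort_safe_mtime)
```

The 0.0 sentinel preserves a total ordering over the snapshot, so queue ordering semantics are untouched for every file that still exists.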

06 — Verification

Verified Under Production Load

Three verification levels—static, targeted live, full-scale live—all against the production queue while the pipeline was processing. No downtime. No flush. No restart.

py_compile clean · 28/28 tests in 0.029s · PASS

Targeted Race Exercise

3/3 vanished — all caught
QUEUE_SAMPLE_SIZE: 3
  STAT_VANISHED: file_1.json (guard → 0.0)
  STAT_VANISHED: file_2.json (guard → 0.0)
  STAT_VANISHED: file_3.json (guard → 0.0)

GUARD_CAUGHT: TRUE
FILENOTFOUNDERROR: FALSE
VERDICT: PASS

Full-Queue Sort

1,056 vanished — all caught
FILES_ENUMERATED: 35,162
SORT_DURATION: 307.3s
FILES_VANISHED: 1,056
UNHANDLED_EXCEPTIONS: 0

VERDICT: PASS

Before / After

Pipeline Run Comparison

Before (pre-hotfix):
Stability: <10 seconds to crash
Queue: growing (crashes block cleanup)
Processed: 0 files per run
Exceptions: 1 per run (fatal)
Stages: 0 of 7
Heartbeat: FAILED

After (post-hotfix):
Stability: 80+ minutes sustained
Queue: draining, 35,340 → 28,901
Drain rate: ~340 files/min
Exceptions: 0 unhandled
Stages: all 7 progressing
Heartbeat: active (no errors)
07 — Infrastructure Integrity

What Held

The pipeline architecture did not fail. What failed was a single assumption at four code sites.

Data Loss: 0
Reconciliation Mismatches: 0
Ledger: Verified
Case Directories: Intact

~7 hours of processing delay. Zero data loss. Zero credential exposure. Zero architectural compromise. Pipeline resumed from exactly where it left off, draining the backlog at 340 files per minute.

08 — Quantitative Analysis

The Math Behind the Race

glob() over the 505K-file queue takes ~4.4s; sorting by mtime then issues one stat() call per file at the observed ~114 calls/second, roughly 74 minutes in total. Files enumerated early in the snapshot have their stat() calls occur seconds to minutes later.
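The arithmetic above, worked through:

```python
queue_size = 505_836   # files in the queue at incident start
stat_rate = 114        # observed stat() calls per second during the live sort

sort_minutes = queue_size / stat_rate / 60
print(f"full mtime sort: ~{sort_minutes:.0f} minutes of stat() calls")  # ~74
```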

Observed Race Frequency
Targeted (3)
100%
Full sort (35K)
3.0%

A 3% vanish rate across 35K files means ~1,055 crash-triggering events per sort without the fix, and any single one terminates the pipeline. Failure was guaranteed.
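As a sanity check on that rate:

```python
files_enumerated = 35_162   # from the full-queue sort verification
vanish_rate = 0.030         # observed fraction vanishing mid-sort

potential_crashes = files_enumerated * vanish_rate
print(f"~{potential_crashes:.0f} crash-triggering events per sort")  # ~1055
```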

09 — What This Demonstrates

Incident Response on a Live System

Symptom (heartbeat FAILED) → evidence (exact tracebacks) → pattern (three crashes, three files = timing) → root cause (TOCTOU at scale) → fix (four guards) → verification (race firing, all caught).

The system could not be taken offline. Diagnosis, fix, and verification all occurred under active load.

The fix was 20 lines. The diagnosis was the hard part.

10 — Lessons Learned

What Broke, What Held, What Changed

Scale is a threat model

Code safe at 1,000 files is unsafe at 500,000. Performance characteristics become security characteristics when they widen race windows.

glob() returns a snapshot, not a contract

Every stat(), read(), and move() on a globbed path must handle FileNotFoundError.
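Applied to a full drain loop, the rule looks like this. It is a hedged sketch: drain() and its parameters are hypothetical names for illustration, not SignalFoundry's actual API.

```python
import shutil
from pathlib import Path

def drain(queue: Path, processed: Path) -> int:
    """Process queued alerts, skipping any file that vanishes mid-run."""
    handled = 0
    for path in sorted(queue.glob("*.json")):
        try:
            payload = path.read_text()   # may raise if another pass moved it
        except FileNotFoundError:
            continue                     # vanished between glob() and read: skip
        _ = payload                      # ... triage would happen here ...
        try:
            shutil.move(str(path), str(processed / path.name))
        except FileNotFoundError:
            pass                         # already archived by a concurrent pass
        else:
            handled += 1
    return handled
```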

Feedback loops hide root causes

Each crash grew the queue, which widened the window, which guaranteed the next crash. Breaking the cycle required fixing the race, not restarting the pipeline.

Defensive filesystem code is not optional

The cost of try/except FileNotFoundError is zero in the success path and prevents a pipeline-terminating crash in the failure path. On any live queue, this is not defensive programming. It is correct programming.
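The zero-cost claim can be checked with a quick micro-benchmark. This is illustrative: absolute numbers vary by machine and interpreter, and CPython 3.11+ makes a try block that never raises essentially free.

```python
import timeit

def bare(name: str) -> int:
    return len(name)

def guarded(name: str) -> int:
    try:
        return len(name)
    except FileNotFoundError:
        return 0  # never taken in the success path

t_bare = timeit.timeit(lambda: bare("alert.json"), number=200_000)
t_guarded = timeit.timeit(lambda: guarded("alert.json"), number=200_000)
print(f"bare: {t_bare:.3f}s  guarded: {t_guarded:.3f}s")
```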

hawkinsops.com · SignalFoundry

All evidence from production logs and live filesystem state—not synthetic tests or staging.