A production SOC automation pipeline hit a scaling threshold that turned a latent filesystem race condition into a guaranteed crash. The infrastructure held. The data held. The defect was diagnosed from production tracebacks, fixed with 20 lines across 4 code sites, and verified under live load—with the race actively firing and every instance caught.
SignalFoundry is a Python-based SOC automation pipeline that processes security alerts end-to-end without manual intervention. It runs on a scheduled cadence, executing a seven-stage workflow:

1. Poll alerts from a Wazuh SIEM indexer via HTTPS.
2. Enforce queue capacity limits.
3. Triage each alert against a policy engine with false-positive signatures and agent alias mappings.
4. Generate structured case directories.
5. Assemble escalation packs for high-severity detections.
6. Reconcile ledger totals against case counts on disk.
7. Write pipeline health heartbeats with per-run metrics.
The system handles alerts from a multi-host environment spanning Windows and Linux endpoints. Rule types range from file-integrity monitoring and rootcheck anomalies to Sysmon behavioral detections and authentication events.
The case totals reconcile exactly: 199,672 + 85,953 + 29,875 + 8,574 = 324,074 total cases. Reconciliation mismatches: zero before, during, and after the incident.
Three consecutive pipeline runs crashed at the alert-ingestion stage within 2.5 hours. Each terminated the entire pipeline. Two stages, three files, same exception: FileNotFoundError.
- poll-alerts.py enforce_queue_cap() — stat() on a vanished file. Exited in 5.4 s.
- triage.py main() — read_text() on a different vanished file.
- poll-alerts.py again — a third file.

Heartbeat: FAILED. All downstream stages blocked. Three crashes, three different files, same exception class = timing-dependent, not data-dependent. Queue at incident start: 505,836 files. Intended cap: 2,000.
A TOCTOU (time-of-check-to-time-of-use) race condition—state observed at T0 is invalid at T1 when code acts on it.
With 505,836 files, glob() takes seconds. During that window, files move to the processed archive. Any vanished file throws FileNotFoundError, terminating the pipeline.
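The failure mode is reproducible in miniature. This is a hypothetical sketch, not SignalFoundry code: a file observed by glob() is archived before the stat() call that sorts on it, and the sort dies with exactly the exception from the tracebacks.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical reproduction of the TOCTOU window: glob() observes queue
# state at T0, the file is archived, then stat() acts on the stale listing.
queue = Path(tempfile.mkdtemp())
archive = Path(tempfile.mkdtemp())
(queue / "alert-001.json").write_text("{}")

listing = list(queue.glob("*.json"))    # T0: observe
shutil.move(str(listing[0]), archive)   # overflow archival fires mid-pass
crashed = False
try:
    listing.sort(key=lambda p: p.stat().st_mtime)  # T1: act on stale state
except FileNotFoundError:
    crashed = True  # the exception class that killed the pipeline
```

At production scale the window between observe and act is seconds, not microseconds, which is what turned this from theoretical to near-certain.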
The race existed from inception. At <10K files, the glob-to-stat window is milliseconds. At 505K, it stretches to seconds. Overflow archival moves hundreds of thousands of files per pass. The race goes from theoretical to near-certain.
The infrastructure was architecturally sound. Queue logic, triage, policy rules, ledger accounting, reconciliation—all correct. What failed was a single assumption: that glob() results remain valid across subsequent operations.
The symptom was "pipeline keeps crashing." The cause was a race condition. The amplifier was that crashing prevented cap enforcement from completing. Each failure guaranteed the next.
Minimum viable change. No behavioral modifications. Defensive-only. Immediately reversible.
| Site | File | Operation | Guard |
|---|---|---|---|
| 1 | poll-alerts.py | stat() in sort | _safe_mtime() → returns 0.0 |
| 2 | poll-alerts.py | shutil.move() | .exists() + try/except |
| 3 | triage.py | read_text() | try/except: continue |
| 4 | triage.py | shutil.move() | try/except: pass |
Unchanged: queue ordering, cursor state, credentials, config files, lock logic, heartbeat and reconciliation, processing order, triage rules, ledger accounting. The only difference: vanished files are skipped instead of crashing the pipeline.
Three verification levels—static, targeted live, full-scale live—all against the production queue while the pipeline was processing. No downtime. No flush. No restart.
The pipeline architecture did not fail. What failed was a single assumption at four code sites.
~7 hours of processing delay. Zero data loss. Zero credential exposure. Zero architectural compromise. Pipeline resumed from exactly where it left off, draining the backlog at 340 files per minute.
Enumerating the 505K-file queue via glob() takes ~4.4 s. The mtime sort that follows issues one stat() per file at ~114 calls/second: ~74 minutes of stat calls in total. Files enumerated early have their stat calls occur seconds to minutes later.
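The window arithmetic, using the incident's measured figures (the rates are observations from the outage, not constants from the codebase):

```python
queue_size = 505_836   # files in queue at incident start
stat_rate = 114        # observed stat() calls per second during the sort
glob_seconds = 4.4     # observed enumeration time for the full queue

# Total stat time for one mtime sort of the whole queue.
sort_minutes = queue_size / stat_rate / 60  # ~74 minutes

# Any file enumerated in the first 4.4 s can sit up to sort_minutes
# before its stat() fires, while overflow archival moves files underneath.
```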
3% vanish rate across 35K files = ~1,055 crash-triggering events per sort without the fix. Failure was guaranteed.
Symptom (heartbeat FAILED) → evidence (exact tracebacks) → pattern (three crashes, three files = timing) → root cause (TOCTOU at scale) → fix (four guards) → verification (race firing, all caught).
The system could not be taken offline. Diagnosis, fix, and verification all occurred under active load.
The fix was 20 lines. The diagnosis was the hard part.
Code safe at 1,000 files is unsafe at 500,000. Performance characteristics become security characteristics when they widen race windows.
Every stat(), read(), and move() on a globbed path must handle FileNotFoundError.
Each crash grew the queue, which widened the window, which guaranteed the next crash. Breaking the cycle required fixing the race, not restarting the pipeline.
The cost of try/except FileNotFoundError is zero in the success path and prevents a pipeline-terminating crash in the failure path. On any live queue, this is not defensive programming. It is correct programming.
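The zero-cost claim is easy to check: in CPython, entering a try block that never raises adds no meaningful overhead, because the cost is paid only when the exception fires. A hypothetical micro-benchmark (names and file are illustrative):

```python
import tempfile
import timeit
from pathlib import Path

def mtime_bare(p: Path) -> float:
    return p.stat().st_mtime

def mtime_guarded(p: Path) -> float:
    # Same call, wrapped; the except branch costs nothing unless it fires.
    try:
        return p.stat().st_mtime
    except FileNotFoundError:
        return 0.0

f = Path(tempfile.mkdtemp()) / "case.json"
f.write_text("{}")
bare = timeit.timeit(lambda: mtime_bare(f), number=10_000)
guarded = timeit.timeit(lambda: mtime_guarded(f), number=10_000)
# Both timings are dominated by the stat() syscall itself; the guard
# only changes behavior when the file vanishes between glob() and stat().
```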
hawkinsops.com · SignalFoundry
All evidence from production logs and live filesystem state—not synthetic tests or staging.