SignalFoundry: March 2026 System Hardening & Pipeline Evolution
This case study documents how I built, broke, and recovered a SOC automation pipeline processing ~50,000 security alerts and what that revealed about detection quality, system trust, and operational reliability.
Author: Raylee Hawkins · Date: March 25, 2026 · Target reader: technical hiring manager, MSSP team lead, or senior detection engineer
The most important outcome was not a detection. It was identifying that critical telemetry required for triage did not exist.
- Command-line logging was disabled on key endpoints
- Failed logon events were absent from expected sources
- Key detection paths could not be validated regardless of rule quality
Detection quality is limited by telemetry quality.
1. Executive Summary
What SignalFoundry Is
SignalFoundry is the detection-automation and case-management engine at the core of the HawkinsOps home SOC. It is not a commercial product. It is a bespoke Python-and-PowerShell pipeline built to transform raw Wazuh SIEM alerts into structured, evidence-backed incident cases without human intervention on the majority of events. It runs continuously via Windows Task Scheduler and publishes vetted escalations as GitHub pull requests with full sanitization enforced at the pipeline level.
The system demonstrates a core engineering thesis: that a single operator with a manufacturing background, the right tooling judgment, and disciplined automation can run a SOC-quality detection loop at scale, producing artifacts that are reviewable, reproducible, and CI-verified.
Scale of Operation
| Metric | Value | Source |
|---|---|---|
| Total lifetime cases processed | 324,074 | Canonical snapshot, April 2026 |
| Auto-close rate | ~88% | Canonical snapshot |
| Escalated cases (published) | 8,574 | Reconciliation |
| Hosts monitored | 8 / 8 | Canonical snapshot |
| Ledger-to-repo mismatches | 0 | Canonical snapshot |
| Pipeline run duration | ~5.4s | Latest heartbeat |
| Detection inventory | 211 detections | VERIFIED_COUNTS.md |
What Changed in the Past 14 Days
The March 11-25 window captured the system recovering from a two-failure sequence (poller timeout + reconciliation serialization defect), completing a multi-day stress-test validation window, and advancing policy tuning on known-FP noise. The pipeline reached a verified SUCCESS state and held it. Every change was made under test coverage and without modifying the public portfolio count sources out-of-band.
2. System Architecture
Pipeline Stages
| Stage | Script | Purpose | Duration |
|---|---|---|---|
tests | pytest | Triage logic regression gate | 0.227s |
poll_alerts | poll-alerts.py | Query Wazuh Indexer, write queue | 0.397s |
triage | triage.py | Disposition all queued alerts | 0.790s |
triage_quality | triage-quality.py | Score and classify triaged cases | 1.765s |
cases | redact + pack + PR | Sanitize and publish escalations | 0.460s |
reconcile | reconcile-state.py | 4-way consistency check | 0.279s |
coverage | coverage-check.py | 168-hour host presence validation | 0.513s |
Infrastructure Stack
- Proxmox hypervisor (multi-VM)
- Wazuh Manager + Indexer (OpenSearch)
- pfSense network perimeter
- Windows Task Scheduler (contract host)
- Python 3.14 (core automation)
- PowerShell 7 (orchestration, reporting)
- GitHub Actions CI/CD
- Cloudflare Pages (static portfolio)
3. Recovery Event: March 13, 2026
The pipeline failed on March 13 due to two independent defects. Both were diagnosed from heartbeat telemetry and fixed in the same operational session.
The poll-alerts.py script was not correctly entering the retry path on the first connection timeout. URLError under delayed connection reset silently exhausted the timeout window without triggering retry logic.
Fix: Separated HTTPError (auth/server, some non-retriable) from URLError (connection, always retriable). Ensured retry counter increments before sleep.
The reconcile-state.py script was computing mismatch counts against the unscoped repo ID list, including non-SignalFoundry-format directories. This inflated mismatch counts and triggered false FAIL status.
Fix: Scoped all six mismatch category calculations to SignalFoundry-format case IDs only.
Result: Both fixes applied same session. Pipeline returned to SUCCESS. Reconciliation dropped to zero hard mismatches. 8/8 hosts confirmed post-recovery.
4. Policy Tuning Work
Persistent device enumeration noise (rule 60227: HP printer, Bluetooth audio, monitor attach/detach) consuming queue depth. Added rule_overrides in policy.yaml with contains_any fragments covering known device strings. Disposition: AUTO_CLOSE_KNOWN_FP.
Honeypot and file server generating rules 2902/2904 (dpkg status messages) during apt maintenance windows. Scoped overrides added for these agents matching on /var/log/dpkg.log location.
Network connection events matching high-risk binaries (rundll32, regsvr32, mshta, powershell, certutil, bitsadmin) now escalate instead of routing to REVIEW. Prevents living-off-the-land lateral movement indicators from being silently downgraded.
Older alerts used historical hostname tokens. Added LEGACY_TOKEN_HOST_MAP dict applied during normalization pass across five candidate fields. Required hosts no longer falsely reported as missing.
Metrics Evolution
| Metric | Before Hardening | After |
|---|---|---|
| FP queue depth | High (device / dpkg / Sysmon noise) | Suppressed by policy |
| Auto-close rate | ~87-88% | ~88% |
| Reconciliation hard mismatches | 1+ (serialization defect) | 0 |
| Sysmon Event 3 LOtL coverage | REVIEW only | ESCALATE on high-risk match |
| Pipeline run time | ~6-8s | 5.4s |
5. Stress-Test Window: March 2-4, 2026
Highest-volume burst the pipeline has processed in a single continuous window. Auto-close rate held above 90% throughout, demonstrating that triage policy scales without degradation under load. Escalated cases were processed through the full redact, assemble-pack, and create-PR pipeline.
6. What This Demonstrates
When the pipeline failed, I did not restart blindly. The heartbeat JSON provided the exact failure stage. The reconcile defect required reading the Python control flow, identifying the scoping error in a multi-list join, and understanding how the case ID predicate interacted with the mismatch calculation. The poller defect required distinguishing between three error types and understanding backoff semantics. Both fixes were surgical, each tested against existing behavior without modifying surrounding logic.
SignalFoundry was developed with AI code assistance, but all validation, policy decisions, and operational truth surfaces are human-owned. The CI pipeline enforces that counts cannot be inflated. The triage policy is a human-authored decision tree encoded in YAML. The known-FP library is built from observed signals, not vendor templates. AI tooling accelerated implementation; the operator owns the logic.
The architecture reflects a manufacturing-derived mental model: define the process, instrument every stage, make failure modes visible and recoverable, and validate outputs against known-good state. The queue cap, backoff retries, and freshness thresholds are deliberate operating parameters, not defaults left in place. On a manufacturing floor, a bad process kills throughput at scale; the same is true in a SOC at volume.
This project forced me to think less like someone writing detections, and more like someone responsible for whether a system can be trusted.