Engineering Case Study

SignalFoundry: March 2026 System Hardening & Pipeline Evolution

This case study documents how I built, broke, and recovered a SOC automation pipeline processing ~50,000 security alerts and what that revealed about detection quality, system trust, and operational reliability.

Author: Raylee Hawkins · Date: March 25, 2026 · Target reader: technical hiring manager, MSSP team lead, or senior detection engineer

Key Finding

The most important outcome was not a detection. It was identifying that critical telemetry required for triage did not exist.

Command-line logging was disabled on key endpoints
Failed logon events were absent from expected sources
Key detection paths could not be validated regardless of rule quality

Detection quality is limited by telemetry quality.

1. Executive Summary

What SignalFoundry Is

SignalFoundry is the detection-automation and case-management engine at the core of the HawkinsOps home SOC. It is not a commercial product. It is a bespoke Python-and-PowerShell pipeline built to transform raw Wazuh SIEM alerts into structured, evidence-backed incident cases without human intervention on the majority of events. It runs continuously via Windows Task Scheduler and publishes vetted escalations as GitHub pull requests with full sanitization enforced at the pipeline level.

The system demonstrates a core engineering thesis: that a single operator with a manufacturing background, the right tooling judgment, and disciplined automation can run a SOC-quality detection loop at scale, producing artifacts that are reviewable, reproducible, and CI-verified.

Scale of Operation

Metric	Value	Source
Total lifetime cases processed	324,074	Canonical snapshot, April 2026
Auto-close rate	~88%	Canonical snapshot
Escalated cases (published)	8,574	Reconciliation
Hosts monitored	8 / 8	Canonical snapshot
Ledger-to-repo mismatches	0	Canonical snapshot
Pipeline run duration	~5.4s	Latest heartbeat
Detection inventory	211 detections	VERIFIED_COUNTS.md

What Changed in the Past 14 Days

The March 11-25 window captured the system recovering from a two-failure sequence (poller timeout + reconciliation serialization defect), completing a multi-day stress-test validation window, and advancing policy tuning on known-FP noise. The pipeline reached a verified SUCCESS state and held it. Every change was made under test coverage and without modifying the public portfolio count sources out-of-band.

2. System Architecture

Pipeline Stages

Stage	Script	Purpose	Duration
`tests`	`pytest`	Triage logic regression gate	0.227s
`poll_alerts`	`poll-alerts.py`	Query Wazuh Indexer, write queue	0.397s
`triage`	`triage.py`	Disposition all queued alerts	0.790s
`triage_quality`	`triage-quality.py`	Score and classify triaged cases	1.765s
`cases`	`redact + pack + PR`	Sanitize and publish escalations	0.460s
`reconcile`	`reconcile-state.py`	4-way consistency check	0.279s
`coverage`	`coverage-check.py`	168-hour host presence validation	0.513s

Infrastructure Stack

Compute & Storage

Proxmox hypervisor (multi-VM)
Wazuh Manager + Indexer (OpenSearch)
pfSense network perimeter
Windows Task Scheduler (contract host)

Pipeline & Delivery

Python 3.14 (core automation)
PowerShell 7 (orchestration, reporting)
GitHub Actions CI/CD
Cloudflare Pages (static portfolio)

3. Recovery Event: March 13, 2026

The pipeline failed on March 13 due to two independent defects. Both were diagnosed from heartbeat telemetry and fixed in the same operational session.

Failure 1: Poller Timeout

The poll-alerts.py script was not correctly entering the retry path on the first connection timeout. URLError under delayed connection reset silently exhausted the timeout window without triggering retry logic.

Fix: Separated HTTPError (auth/server, some non-retriable) from URLError (connection, always retriable). Ensured retry counter increments before sleep.

Failure 2: Reconciliation Scoping

The reconcile-state.py script was computing mismatch counts against the unscoped repo ID list, including non-SignalFoundry-format directories. This inflated mismatch counts and triggered false FAIL status.

Fix: Scoped all six mismatch category calculations to SignalFoundry-format case IDs only.

Result: Both fixes applied same session. Pipeline returned to SUCCESS. Reconciliation dropped to zero hard mismatches. 8/8 hosts confirmed post-recovery.

4. Policy Tuning Work

Windows Workstation FP Suppression

Persistent device enumeration noise (rule 60227: HP printer, Bluetooth audio, monitor attach/detach) consuming queue depth. Added rule_overrides in policy.yaml with contains_any fragments covering known device strings. Disposition: AUTO_CLOSE_KNOWN_FP.

Linux dpkg Churn Suppression

Honeypot and file server generating rules 2902/2904 (dpkg status messages) during apt maintenance windows. Scoped overrides added for these agents matching on /var/log/dpkg.log location.

Sysmon Event 3 Escalation Hardening

Network connection events matching high-risk binaries (rundll32, regsvr32, mshta, powershell, certutil, bitsadmin) now escalate instead of routing to REVIEW. Prevents living-off-the-land lateral movement indicators from being silently downgraded.

Coverage-Check Host Alias Normalization

Older alerts used historical hostname tokens. Added LEGACY_TOKEN_HOST_MAP dict applied during normalization pass across five candidate fields. Required hosts no longer falsely reported as missing.

Metrics Evolution

Metric	Before Hardening	After
FP queue depth	High (device / dpkg / Sysmon noise)	Suppressed by policy
Auto-close rate	~87-88%	~88%
Reconciliation hard mismatches	1+ (serialization defect)	0
Sysmon Event 3 LOtL coverage	REVIEW only	ESCALATE on high-risk match
Pipeline run time	~6-8s	5.4s

5. Stress-Test Window: March 2-4, 2026

25,167

Cases processed

90.1%

Auto-close rate

3 days

Continuous window

Highest-volume burst the pipeline has processed in a single continuous window. Auto-close rate held above 90% throughout, demonstrating that triage policy scales without degradation under load. Escalated cases were processed through the full redact, assemble-pack, and create-PR pipeline.

6. What This Demonstrates

Engineering Judgment Under Failure

When the pipeline failed, I did not restart blindly. The heartbeat JSON provided the exact failure stage. The reconcile defect required reading the Python control flow, identifying the scoping error in a multi-list join, and understanding how the case ID predicate interacted with the mismatch calculation. The poller defect required distinguishing between three error types and understanding backoff semantics. Both fixes were surgical, each tested against existing behavior without modifying surrounding logic.

AI-Augmented Development, Human-Owned Validation

SignalFoundry was developed with AI code assistance, but all validation, policy decisions, and operational truth surfaces are human-owned. The CI pipeline enforces that counts cannot be inflated. The triage policy is a human-authored decision tree encoded in YAML. The known-FP library is built from observed signals, not vendor templates. AI tooling accelerated implementation; the operator owns the logic.

Manufacturing Systems Thinking Applied to Security

The architecture reflects a manufacturing-derived mental model: define the process, instrument every stage, make failure modes visible and recoverable, and validate outputs against known-good state. The queue cap, backoff retries, and freshness thresholds are deliberate operating parameters, not defaults left in place. On a manufacturing floor, a bad process kills throughput at scale; the same is true in a SOC at volume.

This project forced me to think less like someone writing detections, and more like someone responsible for whether a system can be trusted.

SignalFoundry Overview Proof & Metrics Home GitHub Repo