Engineering Case Study

SignalFoundry: March 2026 System Hardening & Pipeline Evolution

This case study documents how I built, broke, and recovered a SOC automation pipeline processing ~50,000 security alerts, and what that revealed about detection quality, system trust, and operational reliability.

Author: Raylee Hawkins · Date: March 25, 2026 · Target reader: technical hiring manager, MSSP team lead, or senior detection engineer

Key Finding

The most important outcome was not a detection. It was identifying that critical telemetry required for triage did not exist.

Detection quality is limited by telemetry quality.


1. Executive Summary

What SignalFoundry Is

SignalFoundry is the detection-automation and case-management engine at the core of the HawkinsOps home SOC. It is not a commercial product. It is a bespoke Python-and-PowerShell pipeline built to transform raw Wazuh SIEM alerts into structured, evidence-backed incident cases without human intervention on the majority of events. It runs continuously via Windows Task Scheduler and publishes vetted escalations as GitHub pull requests with full sanitization enforced at the pipeline level.

The system demonstrates a core engineering thesis: that a single operator with a manufacturing background, the right tooling judgment, and disciplined automation can run a SOC-quality detection loop at scale, producing artifacts that are reviewable, reproducible, and CI-verified.

Scale of Operation

Metric | Value | Source
Total lifetime cases processed | 324,074 | Canonical snapshot, April 2026
Auto-close rate | ~88% | Canonical snapshot
Escalated cases (published) | 8,574 | Reconciliation
Hosts monitored | 8 / 8 | Canonical snapshot
Ledger-to-repo mismatches | 0 | Canonical snapshot
Pipeline run duration | ~5.4s | Latest heartbeat
Detection inventory | 211 detections | VERIFIED_COUNTS.md

What Changed in the Past 14 Days

The March 11-25 window captured the system recovering from a two-failure sequence (poller timeout + reconciliation serialization defect), completing a multi-day stress-test validation window, and advancing policy tuning on known-FP noise. The pipeline reached a verified SUCCESS state and held it. Every change was made under test coverage and without modifying the public portfolio count sources out-of-band.


2. System Architecture

Pipeline Stages

Stage | Script | Purpose | Duration
tests | pytest | Triage logic regression gate | 0.227s
poll_alerts | poll-alerts.py | Query Wazuh Indexer, write queue | 0.397s
triage | triage.py | Disposition all queued alerts | 0.790s
triage_quality | triage-quality.py | Score and classify triaged cases | 1.765s
cases | redact + pack + PR | Sanitize and publish escalations | 0.460s
reconcile | reconcile-state.py | 4-way consistency check | 0.279s
coverage | coverage-check.py | 168-hour host presence validation | 0.513s

Infrastructure Stack

Compute & Storage
  • Proxmox hypervisor (multi-VM)
  • Wazuh Manager + Indexer (OpenSearch)
  • pfSense network perimeter
  • Windows Task Scheduler (contract host)
Pipeline & Delivery
  • Python 3.14 (core automation)
  • PowerShell 7 (orchestration, reporting)
  • GitHub Actions CI/CD
  • Cloudflare Pages (static portfolio)

3. Recovery Event: March 13, 2026

The pipeline failed on March 13 due to two independent defects. Both were diagnosed from heartbeat telemetry and fixed in the same operational session.

Failure 1: Poller Timeout

The poll-alerts.py script was not correctly entering the retry path on the first connection timeout. URLError under delayed connection reset silently exhausted the timeout window without triggering retry logic.

Fix: Separated HTTPError (auth/server, some non-retriable) from URLError (connection, always retriable). Ensured retry counter increments before sleep.
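
The split described in the fix can be sketched with the standard library. In `urllib`, `HTTPError` is a subclass of `URLError`, so it has to be caught first or it is swallowed by the broader handler. The function name `fetch_with_retry`, the set of retriable status codes, the delays, and the extra `TimeoutError` handling are all assumptions for illustration, not the code in poll-alerts.py.

```python
import time
import urllib.error
import urllib.request

# Assumed set of HTTP statuses worth retrying; auth errors such as 401/403 are not.
RETRIABLE_HTTP = {429, 500, 502, 503, 504}

def fetch_with_retry(url, attempts=3, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying connection errors and a subset of HTTP errors."""
    attempt = 0
    while True:
        try:
            with opener(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            # Must come before URLError: HTTPError is a subclass of it.
            if exc.code not in RETRIABLE_HTTP:
                raise  # non-retriable (for example auth failure)
            last_error = exc
        except (urllib.error.URLError, TimeoutError) as exc:
            # Connection-level failures are always retried.
            # Catching TimeoutError here is an assumption about read timeouts.
            last_error = exc
        attempt += 1  # increment before sleeping so the budget is always consumed
        if attempt >= attempts:
            raise last_error
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

The `opener` and `sleep` parameters exist only so the retry path can be exercised in a test without a network or real delays.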

Failure 2: Reconciliation Scoping

The reconcile-state.py script was computing mismatch counts against the unscoped repo ID list, including non-SignalFoundry-format directories. This inflated mismatch counts and triggered false FAIL status.

Fix: Scoped all six mismatch category calculations to SignalFoundry-format case IDs only.
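
The scoping fix amounts to filtering both ID lists through a case-ID predicate before any set arithmetic. The sketch below shows the idea with two mismatch categories rather than six; the `SF-YYYYMMDD-NNNN` pattern and the function names are hypothetical, since the real case-ID format is not given here.

```python
import re

# Hypothetical case-ID format, assumed for illustration only.
CASE_ID_RE = re.compile(r"^SF-\d{8}-\d{4}$")

def scoped(ids):
    """Keep only IDs that match the pipeline's case-ID format."""
    return {i for i in ids if CASE_ID_RE.match(i)}

def mismatches(ledger_ids, repo_ids):
    """Compare ledger and repo after scoping both sides to case IDs."""
    ledger, repo = scoped(ledger_ids), scoped(repo_ids)
    return {
        "in_ledger_not_repo": sorted(ledger - repo),
        "in_repo_not_ledger": sorted(repo - ledger),
    }

if __name__ == "__main__":
    ledger = ["SF-20260313-0001"]
    repo = ["SF-20260313-0001", "docs", ".github"]  # non-case directories in the repo
    print(mismatches(ledger, repo))          # scoped comparison
    print(sorted(set(repo) - set(ledger)))   # unscoped comparison counts the extra directories
```

Applying the predicate once, at the point where the ID sets are built, keeps every downstream category calculation consistent by construction.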

Result: Both fixes were applied in the same session. The pipeline returned to SUCCESS, reconciliation dropped to zero hard mismatches, and 8/8 hosts were confirmed post-recovery.


4. Policy Tuning Work

Windows Workstation FP Suppression

Persistent device enumeration noise (rule 60227: HP printer, Bluetooth audio, monitor attach/detach) was consuming queue depth. I added rule_overrides in policy.yaml with contains_any fragments covering the known device strings. Disposition: AUTO_CLOSE_KNOWN_FP.
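
A contains_any override of this kind can be evaluated with a case-insensitive substring check. The sketch below assumes a hypothetical in-memory shape for one rule_overrides entry after policy.yaml is loaded; the fragments, the `full_log` field name, and `apply_override` are illustrative and not copied from the real policy.

```python
# Hypothetical shape of rule_overrides after loading policy.yaml.
RULE_OVERRIDES = {
    "60227": {
        # Illustrative fragments, not the real known-FP library.
        "contains_any": ["hp laserjet", "bluetooth", "generic pnp monitor"],
        "disposition": "AUTO_CLOSE_KNOWN_FP",
    }
}

def apply_override(alert, overrides=RULE_OVERRIDES):
    """Return the override disposition, or None to fall through to normal triage."""
    rule = overrides.get(str(alert.get("rule_id")))
    if rule is None:
        return None
    text = alert.get("full_log", "").lower()  # field name is an assumption
    if any(fragment.lower() in text for fragment in rule["contains_any"]):
        return rule["disposition"]
    return None

if __name__ == "__main__":
    print(apply_override({"rule_id": 60227, "full_log": "Device HP LaserJet Pro installed"}))
    print(apply_override({"rule_id": 60227, "full_log": "Unknown USB mass storage attached"}))
```

Keying the override on the rule ID first means an unfamiliar device under the same rule still reaches normal triage instead of being closed.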

Linux dpkg Churn Suppression

The honeypot and file server agents were generating rule 2902/2904 alerts (dpkg status messages) during apt maintenance windows. I added overrides scoped to these agents that match on the /var/log/dpkg.log location.
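
Scoping an override to specific agents and a log location means all three conditions must hold together. A minimal sketch, assuming hypothetical agent names and alert field names (`agent_name`, `location`) that are not taken from the real policy:

```python
# Hypothetical scoped override; agent names are placeholders.
DPKG_OVERRIDE = {
    "rule_ids": {"2902", "2904"},
    "agents": {"honeypot", "fileserver"},
    "location": "/var/log/dpkg.log",
}

def matches_scoped_override(alert, override=DPKG_OVERRIDE):
    """True only when rule, agent, and log location all match the override."""
    return (
        str(alert.get("rule_id")) in override["rule_ids"]
        and alert.get("agent_name") in override["agents"]
        and alert.get("location") == override["location"]
    )

if __name__ == "__main__":
    alert = {"rule_id": 2902, "agent_name": "honeypot", "location": "/var/log/dpkg.log"}
    print(matches_scoped_override(alert))
```

The same rule firing on any other agent does not match, so package changes on unexpected hosts stay visible.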

Sysmon Event 3 Escalation Hardening

Network connection events from high-risk binaries (rundll32, regsvr32, mshta, powershell, certutil, bitsadmin) now escalate instead of routing to REVIEW. This prevents living-off-the-land lateral movement indicators from being silently downgraded.
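
The escalation rule reduces to a basename check on the process image. The sketch below uses the binary list from the text; the `image` field name, the `.exe` suffixes, and the function name are assumptions, and `None` stands for "fall through to the rest of the policy".

```python
import ntpath

# Binary names from the text, with .exe suffixes assumed.
HIGH_RISK_BINARIES = frozenset({
    "rundll32.exe", "regsvr32.exe", "mshta.exe",
    "powershell.exe", "certutil.exe", "bitsadmin.exe",
})

def route_sysmon_event3(event):
    """Return "ESCALATE" for a high-risk image, else None (no override)."""
    # ntpath parses Windows paths correctly even when this runs on Linux.
    image = ntpath.basename(event.get("image", "")).lower()
    return "ESCALATE" if image in HIGH_RISK_BINARIES else None

if __name__ == "__main__":
    print(route_sysmon_event3({"image": r"C:\Windows\System32\rundll32.exe"}))
    print(route_sysmon_event3({"image": r"C:\Program Files\Browser\browser.exe"}))
```

Matching on the lowercased basename rather than the full path avoids misses caused by path casing or a binary copied to a non-standard directory.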

Coverage-Check Host Alias Normalization

Older alerts used historical hostname tokens. I added a LEGACY_TOKEN_HOST_MAP dict that is applied during the normalization pass across five candidate fields, so required hosts are no longer falsely reported as missing.
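
The alias normalization can be sketched as a lookup applied to every candidate field before the required-host comparison. `LEGACY_TOKEN_HOST_MAP` is named in the text, but its entries, the five field names, and the helper functions below are hypothetical.

```python
# Illustrative entry; the real map contents are not given in this document.
LEGACY_TOKEN_HOST_MAP = {"old-ws01": "workstation-01"}

# Hypothetical names for the five candidate fields.
CANDIDATE_FIELDS = ("agent_name", "hostname", "host", "manager", "location")

def observed_hosts(alert, alias_map=LEGACY_TOKEN_HOST_MAP):
    """Collect normalized host names from every candidate field on an alert."""
    hosts = set()
    for field in CANDIDATE_FIELDS:
        value = alert.get(field)
        if value:
            token = str(value).strip().lower()
            hosts.add(alias_map.get(token, token))  # map legacy tokens to current names
    return hosts

def missing_hosts(required, alerts, alias_map=LEGACY_TOKEN_HOST_MAP):
    """Return required hosts that never appear in the alert window."""
    seen = set()
    for alert in alerts:
        seen |= observed_hosts(alert, alias_map)
    return sorted(set(required) - seen)

if __name__ == "__main__":
    alerts = [{"agent_name": "OLD-WS01"}, {"hostname": "fileserver"}]
    print(missing_hosts(["workstation-01", "fileserver"], alerts))
```

Passing an empty alias map to `missing_hosts` reproduces the original symptom: the host is present in the data but reported missing because only its legacy token appears.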

Metrics Evolution

Metric | Before Hardening | After
FP queue depth | High (device / dpkg / Sysmon noise) | Suppressed by policy
Auto-close rate | ~87-88% | ~88%
Reconciliation hard mismatches | 1+ (serialization defect) | 0
Sysmon Event 3 LOtL coverage | REVIEW only | ESCALATE on high-risk match
Pipeline run time | ~6-8s | 5.4s

5. Stress-Test Window: March 2-4, 2026

Cases processed: 25,167
Auto-close rate: 90.1%
Continuous window: 3 days

This was the highest-volume burst the pipeline has processed in a single continuous window. The auto-close rate held above 90% throughout, demonstrating that triage policy scales without degradation under load. Escalated cases were processed through the full redact, assemble-pack, and create-PR pipeline.


6. What This Demonstrates

Engineering Judgment Under Failure

When the pipeline failed, I did not restart blindly. The heartbeat JSON provided the exact failure stage. The reconcile defect required reading the Python control flow, identifying the scoping error in a multi-list join, and understanding how the case ID predicate interacted with the mismatch calculation. The poller defect required distinguishing between three error types and understanding backoff semantics. Both fixes were surgical, each tested against existing behavior without modifying surrounding logic.

AI-Augmented Development, Human-Owned Validation

SignalFoundry was developed with AI code assistance, but all validation, policy decisions, and operational truth surfaces are human-owned. The CI pipeline enforces that counts cannot be inflated. The triage policy is a human-authored decision tree encoded in YAML. The known-FP library is built from observed signals, not vendor templates. AI tooling accelerated implementation; the operator owns the logic.

Manufacturing Systems Thinking Applied to Security

The architecture reflects a manufacturing-derived mental model: define the process, instrument every stage, make failure modes visible and recoverable, and validate outputs against known-good state. The queue cap, backoff retries, and freshness thresholds are deliberate operating parameters, not defaults left in place. On a manufacturing floor, a bad process kills throughput at scale; the same is true in a SOC at volume.

This project forced me to think less like someone writing detections, and more like someone responsible for whether a system can be trusted.

