Detection Engineering Process


Key Terminology

  • A Detection (or Rule) is the search logic in your SIEM that identifies malicious activity
  • An Alert is the output generated when a detection fires
  • A Ticket is what gets created in your ticketing system (an alert may or may not become a ticket, depending on defined criteria)

These terms are often used interchangeably but represent distinct stages in the pipeline.

The Scientific Method as Detection Engineering

  • Detection engineering benefits from the same structured, rigorous approach used in science
  • The scientific method maps neatly onto the detection lifecycle:
Scientific Method          | Detection Engineering Equivalent
Observation                | Detection Story (initial input)
Research / Hypothesis      | Research phase
Experimentation            | Query building + back testing
Validation / Analysis      | Canary testing
Reporting / Conclusions    | Documentation
Theory (repeatedly tested) | Onboarded, continuously improved detection

A high-quality detection = scientific theory

  • well-researched, verifiable, reproducible, and continuously improved over time

Why Follow a Structured Process?

  • Better-defined scope
    • Know what you’re looking for from the start
  • Fewer missed steps
    • A framework prevents you from skipping critical phases
  • Higher overall detection quality
    • Research and documentation are built in, not bolted on

The Detection Engineering Process

  1. Detection Story
  2. Research
  3. Build the Query
  4. Back Test
  5. Build a Canary
  6. Documentation
  7. Onboard
  8. Continuous Improvement

Detection Story (Initial Input)

The detection story is the formalized input that kicks off the process.

  • Think of it as a structured intake ticket

A good detection story includes:

  • Reason for the detection
    • e.g., observed malicious IOC, customer request, blog/research
  • Data sources available
    • e.g., Windows event logs, EDR telemetry
  • Example Logic
  • Expected volume in your environment
  • Supporting Artifacts
    • links to IOCs, sample commands, related reports

Tip

  • Don’t accept a ticket that just says “make a detection for PSExec.”
  • Require structure upfront or you’ll waste time guessing intent or build something entirely unnecessary.
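
One way to enforce that structure is to check intake tickets for empty fields before work starts. The following is a minimal Python sketch, not a prescribed tool: the DetectionStory class, its field names, and the sample values are hypothetical and simply mirror the list of detection story contents above.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionStory:
    """Hypothetical intake record; fields mirror the detection story list."""
    reason: str = ""
    data_sources: list[str] = field(default_factory=list)
    example_logic: str = ""
    expected_volume: str = ""
    supporting_artifacts: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# A ticket that only states a vague request (sample values are invented).
vague = DetectionStory(reason="make a detection for PSExec")

# A ticket that fills in every section.
complete = DetectionStory(
    reason="PSExec with plaintext -p password observed in a real attack",
    data_sources=["Windows event logs", "EDR telemetry"],
    example_logic='process name is psexec.exe and arguments include "-p"',
    expected_volume="near zero",
    supporting_artifacts=["sample command line from the incident"],
)

print(vague.missing_fields())
print(complete.missing_fields())
```

A ticket with any missing fields can be sent back to the requester before any research time is spent on it.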

Example Scenario: Detection Story

  • IOCs observed as part of malicious traffic

A command was observed during a real attack where PSExec was used to authenticate with a plaintext password via the -p flag, then copy malware to a list of hosts using a batch script.

Research

  • Good detection requires full understanding of the idea
  • Research informs every decision downstream
  • Research and document findings related to the artifacts in question as you go
  • Document different avenues you go down, things you look at, etc.

Key principles:

  • Understand what you’re looking for fully before writing a query
  • Identify flags, behaviors, and patterns specific to the technique
  • Research whether the behavior should ever occur legitimately in your environment
  • Watch your scope — it’s easy to drift from “PSExec plaintext auth” into “all PSExec” or “PSExec + file share interaction,” which are separate detections

Example Scenario: Research

  • Research reveals the -p flag passes a password in plaintext on the command line
  • Best practice requires interactive password prompts
  • Legitimate use of -p in a production environment is essentially nonexistent
    • this is a strong signal

Build the Query

  • The query is the core of your detection
    • Without a good query, you don’t have a detection
  • It should be directly informed by your research: what to look for, possible exclusions, etc.
  • Prototype queries as you go and evaluate the results

Balance is key

Too Broad                 | Just Right                          | Too Narrow
All PSExec activity       | PSExec process name + -p argument   | PSExec with -p and a specific known password
High volume, burnout risk | Balance between fidelity and volume | Too specific; misses variants (quoted args, file-based passwords)

  • Practical tips:
    • Lowercase the process name field to avoid case-sensitivity mismatches
    • Consider PSExec clone/rename variants
    • Look at process arguments as an array, not just a string match
    • Think about corroborating signals (e.g., PSEXESVC service installation)

Example Scenario: KQL Query

process.name : "psexec.exe" AND process.args : "-p"

  • This catches any PSExec execution with the -p argument, regardless of other arguments
  • Not too specific, not too broad
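
To illustrate why the tips above suggest lowercasing the process name and treating arguments as an array, here is a rough Python sketch of the same matching logic. It is an illustration only, not how a SIEM evaluates the query: the event dictionaries, host names, and the program name some-program.exe are invented, and only the process.name and process.args field names come from the query.

```python
def matches_token(event: dict) -> bool:
    """Match PSExec by lowercased name and an exact "-p" element in the args array."""
    process = event.get("process", {})
    name = str(process.get("name", "")).lower()
    args = [str(a).lower() for a in process.get("args", [])]
    return name == "psexec.exe" and "-p" in args


def matches_substring(event: dict) -> bool:
    """Naive variant: search for "-p" anywhere in the joined command line."""
    process = event.get("process", {})
    name = str(process.get("name", "")).lower()
    command_line = " ".join(str(a) for a in process.get("args", [])).lower()
    return name == "psexec.exe" and "-p" in command_line


# Invented sample events for the sketch.
events = [
    # Mixed-case name with a plaintext password: should match.
    {"process": {"name": "PsExec.exe",
                 "args": ["PsExec.exe", "\\\\host1", "-u", "admin", "-p", "Secret1", "cmd"]}},
    # No -p flag, but the remote program name happens to contain "-p".
    {"process": {"name": "psexec.exe",
                 "args": ["psexec.exe", "\\\\host2", "-u", "admin", "some-program.exe"]}},
    # Unrelated process.
    {"process": {"name": "powershell.exe", "args": ["powershell.exe"]}},
]

token_hits = [e for e in events if matches_token(e)]
substring_hits = [e for e in events if matches_substring(e)]
print(len(token_hits), len(substring_hits))
```

In this sample, the array comparison flags only the first event, while the substring search also flags the second, which is the kind of noise a back test would surface.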

Back Test

  • Estimating volume should be done before the SOC brings you issues
  • Before going live, validate your query against historical data
    • Run the query against 90 days of data
      • 30 days is the absolute minimum
    • Identify noise, known-good activity, or filters you need to apply
    • If results are unexpectedly high, consider:
      • Dropping or deprioritizing the detection
      • Lowering its alert priority
      • Reformulating the query
        • stack the data
        • find common fields in legitimate events
        • filter accordingly
    • Document your results
      • Screenshot the query, time range, and result count
      • Use the results as evidence:
        • to propose a severity/priority
        • as proof of due diligence if the detection fires later
  • Zero results in a back test is either:
    • great news (high fidelity)
    • or a sign something is broken
  • Verify by confirming the query would have caught known-bad traffic
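
The "stack the data" step can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the result rows, host names, and user names are invented, and in practice the rows would come from an export of your back-test query.

```python
from collections import Counter


def stack(events: list[dict], fields: tuple[str, ...]) -> list[tuple[tuple, int]]:
    """Count events per unique combination of the given fields, most common first."""
    counts = Counter(tuple(e.get(f) for f in fields) for e in events)
    return counts.most_common()


# Invented back-test hits; real rows would be exported from the SIEM.
results = [
    {"host.name": "jump01", "user.name": "svc_deploy", "process.parent.name": "deploy.exe"},
    {"host.name": "jump01", "user.name": "svc_deploy", "process.parent.name": "deploy.exe"},
    {"host.name": "jump01", "user.name": "svc_deploy", "process.parent.name": "deploy.exe"},
    {"host.name": "ws-114", "user.name": "jdoe", "process.parent.name": "cmd.exe"},
]

stacked = stack(results, ("host.name", "user.name"))
for combo, count in stacked:
    print(count, combo)
```

Combinations that dominate the count are candidates for a filter once they are confirmed as legitimate. Rare combinations are the ones worth investigating by hand.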

Example Scenario: Back Test

  • A 90-day back test returned zero results once events from the known-malicious username were excluded
  • The only hits were the original attack traffic, which confirms the query catches it

Build a Canary

A canary is code that executes on a schedule to generate the exact traffic your detection is supposed to catch.

  • A canary validates that your detection continues to work after deployment
    • Without one, a broken detection could go unnoticed for months
  • Canary logic can be extremely complex or very simple
  • Run it on a scheduled interval
  • If the canary runs and no alert fires, you get notified of the failure

Canary tiers:

  • Best: Dedicated canary infrastructure with scheduled runs and failure alerting
    • removes environmental variables that could skew the result
  • Good: Regularly scheduled manual runs
    • still catches broken detections over time
  • Avoid: Testing once at deploy time and never again

Tip

  • Generate traffic as close to the real thing as possible
    • echoing a command to the CLI may not produce the same log artifacts as actually running the tool
  • Reference Atomic Red Team (ART) for a library of pre-built canary-style test scripts
  • If replaying captured malicious traffic, ensure nothing in your environment has changed that could make the test pass artificially every time

Example Scenario: Build a Canary

  • Simply run psexec.exe -p <password> ... in a controlled test environment on a schedule
    • e.g., psexec \\remote_computer -u username -p password command
  • Confirm it triggers the detection
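
The run, check, and notify loop behind a canary can be sketched in Python as follows. This is a hedged outline rather than a real integration: run_canary and its three callables are hypothetical names, and the stand-in lambdas replace what would be a real traffic generator, a SIEM alert lookup, and a notification channel.

```python
import time
from typing import Callable


def run_canary(
    generate_traffic: Callable[[], None],
    alert_fired: Callable[[], bool],
    notify_failure: Callable[[str], None],
    timeout_s: float = 300.0,
    poll_s: float = 15.0,
) -> bool:
    """Generate test traffic, then poll for the alert until the timeout expires."""
    generate_traffic()
    deadline = time.monotonic() + timeout_s
    while True:
        if alert_fired():
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    # The traffic was generated but no alert appeared in time.
    notify_failure("canary ran but the detection did not alert")
    return False


# Stand-ins so the sketch runs on its own (no real tool or SIEM is called).
failures: list[str] = []
ok = run_canary(lambda: None, lambda: True, failures.append, timeout_s=0, poll_s=0)
broken = run_canary(lambda: None, lambda: False, failures.append, timeout_s=0, poll_s=0)
print(ok, broken, failures)
```

Passing the three steps in as callables keeps the scheduling logic separate from the environment-specific parts, so the same loop can serve several detections.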

Documentation

  • Documentation is arguably the most important step
  • Good documentation should cover:
    • What the detection is looking for and why
    • Why specific fields were included or excluded
    • How a SOC analyst should investigate the alert
    • Blind spots: how could an attacker evade this detection?
    • MITRE ATT&CK technique mappings

Alerting & Detection Strategy (ADS)

  • The ADS Framework (palantir/alerting-detection-strategy-framework) is an open-source documentation framework
  • It covers everything from the high-level goal to specific technical logic, blind spots, and investigation steps
  • Start filling it out at the beginning and keep it updated as you move through the process
  • You can prompt an LLM with your detection logic and have it draft an ADS document for you
    • Review and refine the draft
    • It won’t be perfect, but it can save significant time on formatting and structure

Onboard

  • Onboarding is enabling the rule in the production side of the SIEM

Checklist:

  • Paste in the final, validated query
  • Configure suppression correctly
    • e.g., suppress by host.name + user.name to avoid duplicate alerts for the same actor
  • Configure throttling appropriately
  • Create a change ticket or pull request if your environment requires it
  • Run a burn-in period
    • detection is enabled in production
    • but doesn’t produce alerts yet
    • observe output in real conditions

Tip

  • Suppressing by host + user is about as specific as you want to go
  • Too broad (e.g., suppress by organization) and you might get one alert for 10 simultaneous threat actors
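
The effect of the suppression key can be shown with a small Python simulation. This is a simplified sketch: the suppress function, the alert fields, and the one-hour window are assumptions for illustration, and real SIEM suppression semantics vary by product.

```python
def suppress(alerts: list[dict], key_fields: tuple[str, ...], window_s: int) -> list[dict]:
    """Keep the first alert per key and drop repeats seen within window_s seconds."""
    last_emitted: dict[tuple, int] = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert[f] for f in key_fields)
        previous = last_emitted.get(key)
        if previous is None or alert["ts"] - previous >= window_s:
            kept.append(alert)
            last_emitted[key] = alert["ts"]
    return kept


# Ten invented actors in one organization, each firing twice one minute apart.
alerts = [
    {"ts": t, "org": "acme", "host.name": f"host{i}", "user.name": f"user{i}"}
    for i in range(10)
    for t in (0, 60)
]

by_actor = suppress(alerts, ("host.name", "user.name"), window_s=3600)
by_org = suppress(alerts, ("org",), window_s=3600)
print(len(by_actor), len(by_org))
```

Keyed on host and user, the twenty raw alerts collapse to one per actor. Keyed on organization, they collapse to a single alert that hides the other nine actors.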

Continuous Improvement

  • Detection engineering doesn’t end at deploy
    • it is a continuous process
  • Treat your detection like a living document
  • Sources of improvement:
    • SOC analyst feedback
      • they see the alerts daily and will identify noise or gaps
    • New research
      • techniques evolve, tools get renamed, evasions emerge
    • Structural changes
      • field mappings change, data sources shift
  • For large query changes, repeat the back test and burn-in period
  • Risk management:
    • Every filter or exception you add accepts some risk
    • Understand and document that tradeoff
    • Defense-in-depth means a single filter won’t make or break you
      • but be intentional about what you’re accepting

Example Scenario: Continuous Improvement

  • An admin needs to run PSExec in exactly the way the detection alerts on
  • We accept that risk and filter that admin’s activity out
  • Document the exclusion as known, risk-accepted activity
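
One way to keep the risk tradeoff visible is to store each exclusion together with its justification. The Python sketch below assumes a hypothetical Exclusion record and invented user names; it is an illustration of the idea, not a feature of any particular SIEM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exclusion:
    """A risk-accepted filter; every entry carries its own justification."""
    field: str
    value: str
    reason: str
    approved_by: str


# Hypothetical exclusion list; names and reasons are invented.
EXCLUSIONS = [
    Exclusion("user.name", "admin_jsmith", "needs PSExec -p for a legacy task", "SOC lead"),
]


def is_excluded(event: dict) -> bool:
    """Return True when the event matches any documented exclusion."""
    return any(
        str(event.get(x.field, "")).lower() == x.value.lower() for x in EXCLUSIONS
    )


events = [
    {"user.name": "admin_jsmith", "host.name": "jump01"},
    {"user.name": "unknown_user", "host.name": "ws-114"},
]
alerting = [e for e in events if not is_excluded(e)]
print(alerting)
```

Because the reason and approver live next to the filter, a later review can see what risk was accepted and by whom.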

Summary

A detection that completes this full lifecycle is the equivalent of a scientific theory:

✅ Carefully scoped
✅ Well-researched
✅ Experimentally validated
✅ Documented thoroughly
✅ Continuously tested and improved

“If you have a structured process for your detection engineering and you do it well and it’s well thought out, you will have a much better output at the end.”
— Hayden Covington


Resources

Reference

Presenter: Hayden Covington — SOC Analyst & Detection Engineer, Black Hills Information Security
Source: BHIS Webcast — The Detection Engineering Process