45% of SOC alerts are never investigated due to alert volume (Ponemon Institute)
3 tiers: an alert routing model of P1 (page now), P2 (investigate today), P3 (review weekly)
80/20 rule: typically 20% of alert sources generate 80% of noise -- find them first

Alert fatigue is not a technology problem -- it's a prioritization problem. Every SIEM ships with hundreds of detection rules enabled by default, tuned for generic environments that don't match yours. The path from 10,000 alerts per day to 50 actionable ones is a source analysis to find your top noise generators, a tier-based routing model that separates 'page now' from 'review later,' and systematic allowlisting for known-good behavior. This guide walks each step with Splunk and Microsoft Sentinel query examples.

Step 1: Source Analysis -- Find Your Top 5 Noise Generators

Before touching any rule, know where your volume is coming from.

Splunk:

index=notable earliest=-30d
| stats count by source
| sort -count
| head 20

Microsoft Sentinel:

SecurityAlert
| where TimeGenerated > ago(30d)
| summarize count() by AlertName, ProviderName
| order by count_ desc
| take 20

What to look for in the results:

  • Any single rule generating >1,000 alerts/day: almost certainly needs tuning or suppression
  • Antivirus/EDR detections from known-benign software being flagged repeatedly
  • Network IDS rules firing on internal scanning tools or monitoring agents
  • Authentication failure rules firing on service accounts with incorrect cached passwords
  • Scheduled task or backup software triggering process creation rules

For each top-5 source, answer:

  1. What percentage of these alerts result in a confirmed incident? (check your closure reason data)
  2. Is there a common pattern in the false positives? (same source IP, same process, same time window)
  3. Can the rule be scoped more narrowly, or is the entire rule generating noise?
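
The 80/20 check can also be run offline on the exported counts. The following is a minimal Python sketch, assuming the query results have been exported as source/count pairs; the function name `top_noise_sources` and the sample rule names and volumes are hypothetical, not taken from any real environment.

```python
from typing import Dict, List, Tuple


def top_noise_sources(counts: Dict[str, int], share: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the smallest set of sources (largest first) whose alerts
    add up to at least `share` of total volume.

    Each row is (source, count, cumulative_share_of_total).
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    result = []
    running = 0
    for source, count in ranked:
        running += count
        result.append((source, count, running / total))
        if running / total >= share:
            break
    return result


# Hypothetical 30-day export; replace with your own query results.
sample = {
    "Failed Login Threshold": 5200,
    "AV Detection - PUA": 2800,
    "Internal Port Scan": 900,
    "New Service Installed": 600,
    "Impossible Travel": 500,
}

for source, count, cumulative in top_noise_sources(sample):
    print(f"{source}: {count} alerts, cumulative share {cumulative:.0%}")
```

If the returned list is short relative to the total number of sources, those are the rules to examine against the three questions above before touching anything else.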

Step 2: Build a Tier-Based Alert Routing Model

Every alert being 'P1' means nothing is P1. Define tiers clearly and stick to them.

Tier definitions:

Tier | Label | Response SLA | Criteria | Examples
P1 | Page now | Immediate (24/7) | High-confidence, high-impact; automated response would cause harm | Ransomware beacon, admin credential use from new country, data exfil to known C2
P2 | Investigate today | Within 4 business hours | Medium-confidence or requires context to assess | Impossible travel, new service account created, first-time admin logon
P3 | Review weekly | Batch review, not real-time | Low-confidence, hunt hypothesis generation, compliance logging | Failed logins below threshold, SMB access to common shares, port scan from internal host
Info | Log only | Not worked | Telemetry with no direct action value; used for correlation lookups | DNS query logs, process creation (most), file access (most)

Tagging rules in Splunk ES:

| makeresults
| eval rule_name="Impossible Travel Login"
| eval tier="P2"
| eval sla_hours=4
| table rule_name, tier, sla_hours
| outputlookup append=true alert_tier_lookup.csv

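Routing in Sentinel (automation rules and playbooks): Sentinel routes incidents through automation rules that trigger playbooks (Logic Apps), not through Azure Monitor action groups. Create one automation rule per tier: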

  • P1: PagerDuty/OpsGenie webhook + SMS to on-call
  • P2: ServiceNow ticket creation + email to SOC queue
  • P3: Log to SharePoint/Teams channel for batch review

Analytics rule alert severity maps to tier: High = P1, Medium = P2, Low = P3.
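
The severity-to-tier mapping is simple enough to express as a lookup. This is an illustrative Python sketch only: the `Tier` class, the `route` function, and the destination strings are hypothetical placeholders, not identifiers from Splunk, Sentinel, or any ticketing product. Treating unknown severities as log-only is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    label: str
    sla_hours: Optional[float]  # None means no real-time SLA
    destination: str            # descriptive placeholder, not a real integration ID


# Mapping follows the tier definitions in this guide.
TIERS = {
    "High": Tier("P1", 0, "page on-call"),
    "Medium": Tier("P2", 4, "ticket + SOC queue email"),
    "Low": Tier("P3", None, "weekly batch review channel"),
}
INFO = Tier("Info", None, "log only")


def route(severity: str) -> Tier:
    """Map a rule severity to a tier.

    Anything that is not High/Medium/Low (for example Informational)
    falls through to log-only. That fallback is an assumption.
    """
    return TIERS.get(severity.strip().capitalize(), INFO)


for sev in ("High", "medium", "Low", "Informational"):
    tier = route(sev)
    print(f"{sev} -> {tier.label} ({tier.destination})")
```

Keeping the mapping in one table makes it harder for individual rules to drift into P1 without a deliberate change.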

Step 3: Allowlisting Known-Good Behavior

Allowlisting suppresses alerts for specific, documented conditions that you have confirmed are benign. Each allowlist entry should have: what it suppresses, why it's known-good, who approved it, and an expiry date.

Splunk: allowlist via lookup table

| inputlookup allowlist_processes.csv

Structure of allowlist_processes.csv:

process_name,justification,approved_by,expiry_date
powershell.exe -EncodedCommand dABlAHMAdAA=,Base64 encoded 'test' -- used by CI pipeline,security@company.com,2026-12-01
psexec.exe,IT uses PsExec for remote admin on server OU only,it-director@company.com,2026-08-01

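Apply in a detection rule. The where clause drops any command that has an allowlist match, so only non-allowlisted commands alert. The lookup is an exact match on the full command line, so each entry must match character for character:

index=sysmon EventCode=1 Image="*powershell.exe*"
| lookup allowlist_processes.csv process_name AS CommandLine OUTPUT justification
| where isnull(justification)
| stats count by ComputerName, CommandLine, User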

Sentinel: allowlist via watchlist

let allowlisted_processes = _GetWatchlist('AllowlistedProcesses')
  | project process_name, justification;
SecurityEvent
| where EventID == 4688
| where NewProcessName has "powershell"
| join kind=leftanti allowlisted_processes
  on $left.CommandLine == $right.process_name

Allowlist governance rules:

  • Maximum 90-day expiry on any allowlist entry (default; extend with re-approval)
  • Monthly review: run a query to show all entries expiring in the next 30 days
  • Any allowlist entry for a process on the LOLBAS list (living-off-the-land binaries) requires CISO sign-off
  • Developers cannot add their own entries -- security team approves all
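
The expiry rules above can be checked mechanically. Below is a minimal Python sketch, assuming the allowlist is available as CSV text with the column layout shown earlier. The function name `review_allowlist` and the review date are hypothetical. Because the file has no creation-date column, the sketch treats any expiry more than 90 days in the future as exceeding the cap, which is an assumption about how to enforce that rule.

```python
import csv
import io
from datetime import date, timedelta

MAX_LIFETIME = timedelta(days=90)   # governance rule: 90-day maximum expiry
REVIEW_WINDOW = timedelta(days=30)  # monthly review horizon


def review_allowlist(csv_text: str, today: date) -> dict:
    """Sort allowlist entries into expired / exceeds_max / expiring_soon / ok."""
    findings = {"expired": [], "exceeds_max": [], "expiring_soon": [], "ok": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        expiry = date.fromisoformat(row["expiry_date"])
        remaining = expiry - today
        if expiry < today:
            status = "expired"
        elif remaining > MAX_LIFETIME:
            status = "exceeds_max"
        elif remaining <= REVIEW_WINDOW:
            status = "expiring_soon"
        else:
            status = "ok"
        findings[status].append(row["process_name"])
    return findings


# The two example rows from allowlist_processes.csv above.
SAMPLE = (
    "process_name,justification,approved_by,expiry_date\n"
    "powershell.exe -EncodedCommand dABlAHMAdAA=,Base64 encoded 'test' -- used by CI pipeline,"
    "security@company.com,2026-12-01\n"
    "psexec.exe,IT uses PsExec for remote admin on server OU only,it-director@company.com,2026-08-01\n"
)

# Hypothetical review date chosen for illustration.
report = review_allowlist(SAMPLE, date(2026, 7, 15))
for status, names in report.items():
    print(status, names)
```

Running a check like this on a schedule turns the monthly review into a list of entries to re-approve or remove rather than a manual read-through.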

Step 4: Threshold Tuning for High-Volume Rules

Some detections fire too frequently because the threshold is wrong for your environment. Failed login rules are the most common example -- a threshold of 5 failures in 10 minutes generates enormous noise in an environment with aggressive password policies or legacy apps.

Finding your right threshold (Splunk):

index=wineventlog EventCode=4625
| bin _time span=10m
| stats count by _time, user, src_ip
| stats avg(count) as avg_count, max(count) as max_count, 
    perc90(count) as p90_count by user
| sort -p90_count
| head 20

This shows, for each user, the 90th percentile of failed-logon counts per source IP per 10-minute window. Set your alert threshold above the p90 for normal accounts (so an ordinary run of mistyped passwords doesn't alert) but below the max for accounts you know have been spray-attacked.

Typical threshold calibration approach:

  1. Disable the alert (or set to report-only/log-only)
  2. Run for 2 weeks; collect the count distribution
  3. Set threshold at 95th percentile + 20% buffer
  4. Re-enable and track true positive rate for 2 weeks
  5. Adjust if needed
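
Step 3 of the calibration is plain arithmetic. The sketch below is one way to compute it in Python, assuming the two-week collection has been exported as a list of per-window counts. It uses the nearest-rank percentile definition, which may differ slightly from what your SIEM's percentile function returns. The function names and sample counts are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of
    observations at or below it."""
    if not values:
        raise ValueError("no observations")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def calibrated_threshold(window_counts: Sequence[int], pct: int = 95, buffer_pct: int = 20) -> int:
    """Percentile plus a percentage buffer, rounded up to a whole count."""
    base = percentile_nearest_rank(window_counts, pct)
    return math.ceil(base * (100 + buffer_pct) / 100)


# Hypothetical failed-logon counts per 10-minute window from a 2-week run.
sample_counts = [1] * 10 + [2] * 5 + [3] * 3 + [5, 40]

print("p95:", percentile_nearest_rank(sample_counts, 95))
print("threshold:", calibrated_threshold(sample_counts))
```

The single window with 40 failures sits well above the resulting threshold, which is the behavior you want: routine noise stays below the line and outliers still alert.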

Sentinel: dynamic thresholds

Sentinel's ML-based Anomaly rules use adaptive baselines rather than fixed thresholds. For key detections (logon anomalies, data exfiltration volume), prefer Anomaly rules over fixed-threshold Scheduled Query rules when the baseline varies significantly across users or time of day.

Step 5: Retire Rules That Generate Zero True Positives

Every SIEM deployment accumulates rules that have never produced a confirmed incident. These consume analyst time, erode trust in the alert stream, and create noise that masks real detections.

Identify candidates for retirement (Splunk):

index=notable earliest=-90d
| stats count as alert_count, 
    count(eval(status="resolved")) as resolved,
    count(eval(owner!="unassigned")) as worked
  by source
| eval worked_rate=round(worked/alert_count*100,1)
| eval resolve_rate=round(resolved/alert_count*100,1)
| where worked_rate < 5
| sort -alert_count

For any rule with <5% worked rate over 90 days, conduct a quick review:

  • Is the alert volume so high analysts gave up working it? (tuning problem)
  • Does the rule detect something real that's never happened in 90 days? (consider moving to P3/hunt)
  • Was the rule designed for a system we no longer have? (retire it)

Retirement process:

  1. Move from P1/P2 to P3 (log-only) for 30 days -- don't delete immediately
  2. Confirm no incidents are missed during the 30-day observation period
  3. Disable with a comment documenting why and the date
  4. Full delete after 6 months if no reason to reinstate

Keep a rule retirement log:

Rule name | Date disabled | Reason | Approved by | Review date
Brute force - 5 failures in 1 min | 2026-03-01 | 0 TPs in 180 days, threshold wrong for env | CISO | 2026-09-01

Measuring Whether Your Tuning Is Working

Tuning without measurement is guesswork. Track these four metrics monthly:

Metric | How to measure | Target
Total alert volume | Count of all alerts in SIEM per day (rolling 30-day avg) | Declining month-over-month
Alert worked rate | Alerts assigned to an analyst / total alerts | >80%
False positive rate | Alerts closed as false positive / total worked alerts | <20%
MTTD (mean time to detect) | Time from attack start to first alert (use purple team data) | Declining

Splunk dashboard query for monthly trend:

index=notable earliest=-6mon
| bin _time span=1mon
| stats count as total_alerts,
    count(eval(status="false positive")) as false_positives,
    count(eval(owner!="unassigned")) as worked
  by _time
| eval fp_rate=round(false_positives/worked*100,1)
| eval worked_rate=round(worked/total_alerts*100,1)
| table _time, total_alerts, worked_rate, fp_rate

If total alert volume is declining and worked rate is increasing: tuning is working. If total alert volume is declining but worked rate is flat: you're suppressing without fixing root causes.
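
The decision rule in the previous paragraph can be written down explicitly. This is a hedged Python sketch: the `MonthlyMetrics` class, the `tuning_verdict` function, and the 1-percentage-point tolerance used to define "flat" are all assumptions made for illustration. Cases the paragraph does not cover are reported as inconclusive.

```python
from dataclasses import dataclass


@dataclass
class MonthlyMetrics:
    total_alerts: int
    worked: int

    @property
    def worked_rate(self) -> float:
        return self.worked / self.total_alerts if self.total_alerts else 0.0


def tuning_verdict(previous: MonthlyMetrics, current: MonthlyMetrics,
                   tolerance: float = 0.01) -> str:
    """Compare two months; `tolerance` defines what counts as a flat worked rate."""
    volume_down = current.total_alerts < previous.total_alerts
    delta = current.worked_rate - previous.worked_rate
    if volume_down and delta > tolerance:
        return "tuning is working"
    if volume_down and abs(delta) <= tolerance:
        return "suppressing without fixing root causes"
    return "inconclusive"


# Hypothetical month-over-month numbers.
last_month = MonthlyMetrics(total_alerts=30000, worked=9000)
print(tuning_verdict(last_month, MonthlyMetrics(total_alerts=20000, worked=12000)))
print(tuning_verdict(last_month, MonthlyMetrics(total_alerts=20000, worked=6000)))
```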

The bottom line

Alert fatigue kills security programs quietly. Analysts stop investigating, real incidents get missed, and the SIEM becomes a compliance checkbox rather than a detection tool. The fix is methodical: source analysis first, tier routing second, allowlisting third, threshold calibration fourth, rule retirement last. Each step takes less than a week. Do them in order and measure every 30 days.

Frequently asked questions

Where do I start when SIEM alert volume is unmanageable?

Start with a source analysis, not a rule analysis. Run a query to count alerts by source or rule over the last 30 days, sorted by volume descending. The top 5 sources almost always account for more than half your total volume. Fix those five before touching anything else. In Splunk: index=notable | stats count by source | sort -count. In Sentinel: SecurityAlert | summarize count() by AlertName | order by count_ desc.

What is the difference between suppression and tuning?

Suppression silences an alert for a specific condition without changing the underlying rule (e.g., suppress alerts from a known vulnerability scanner IP). Tuning modifies the detection logic itself to reduce false positives at the source (e.g., adding a minimum threshold or excluding known-good processes). Suppression is faster but accumulates technical debt; tuning is more work but improves rule quality permanently. Use suppression for one-off exceptions, tuning for systematic false positive patterns.

How do I know if my SIEM tuning is creating detection blind spots?

Run your suppression/exclusion list through a purple team exercise quarterly: execute the ATT&CK technique that your suppressed rule was designed to detect, and confirm the alert still fires for the malicious version. Also review your exclusion list monthly -- entries added for 'temporary' reasons often become permanent. If an allowlisted process or IP starts exhibiting new behavior, your suppression may be hiding a real threat.

What is a tier-based alert routing model?

A three-tier model routes alerts by required response speed: P1 (page the on-call analyst immediately -- high-confidence, high-impact detections like ransomware indicators or admin credential use from a foreign IP), P2 (investigate within 4 business hours -- medium-confidence detections requiring analyst judgment), P3 (review weekly in batch -- low-confidence detections used for hunting, not real-time response). Most organizations classify everything as P1, which means nothing gets P1 treatment.

Should I tune alerts or buy a SOAR to handle the volume?

Tune first. A SOAR automates the response to alerts -- it does not reduce the number of alerts. If you feed a SOAR 10,000 noisy alerts per day, you get 10,000 automated false positive responses per day, each of which may have side effects (blocking legitimate traffic, flooding ticketing systems). SOAR is most valuable when applied to a well-tuned, lower-volume, higher-fidelity alert stream.

How do I justify SIEM tuning time to leadership when it's not 'adding new detections'?

Frame it as analyst capacity. If your team investigates 50 alerts per day and 40 are false positives, they have capacity for 10 real investigations. Tuning to 20 false positives frees capacity for 30, tripling their investigation capacity without hiring. Calculate: average analyst investigation time (e.g., 15 min per alert) x false positive volume per day x days per year = hours wasted. Convert to FTE cost. That's the ROI of tuning.
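
The calculation above works out as follows in a short Python sketch. The 15 minutes per alert and the 40 and 20 false positives per day come from the example in the answer. The 250 working days and 2,000 hours per FTE-year are assumptions to replace with your own figures, and the function name is hypothetical.

```python
def false_positive_cost(fp_per_day: int,
                        minutes_per_alert: float = 15,
                        working_days: int = 250,          # assumption
                        fte_hours_per_year: int = 2000):  # assumption
    """Return (analyst hours per year spent on false positives, FTE equivalent)."""
    hours = fp_per_day * minutes_per_alert * working_days / 60
    return hours, hours / fte_hours_per_year


before_hours, before_fte = false_positive_cost(40)
after_hours, after_fte = false_positive_cost(20)

print(f"Before tuning: {before_hours:.0f} h/year ({before_fte:.2f} FTE)")
print(f"After tuning:  {after_hours:.0f} h/year ({after_fte:.2f} FTE)")
print(f"Recovered:     {before_hours - after_hours:.0f} h/year")
```

Multiply the recovered FTE figure by your loaded analyst cost to get the number leadership will ask for.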

Sources & references

  1. Ponemon Institute / IBM Cost of Data Breach Report
  2. SANS SOC Survey
  3. Splunk Enterprise Security Documentation
  4. Microsoft Sentinel Documentation

Eric Bang
Author

Founder & Cybersecurity Evangelist, Decryption Digest

Cybersecurity professional with expertise in threat intelligence, vulnerability research, and enterprise security. Covers zero-days, ransomware, and nation-state operations for 50,000+ security professionals weekly.
