Practitioner GuideMay 14, 202613 min read

Preventing Sensitive Data Leakage to AI Tools in the Enterprise

Sources:Netskope Cloud and Threat Report 2025|Cyberhaven AI Data Security Report 2025|Microsoft Purview AI Hub Documentation|OWASP Top 10 for LLM Applications 2025|CISA AI Cybersecurity Guidance 2025

Eric Bang

Founder & Cybersecurity Evangelist

15%

of employees regularly paste sensitive enterprise data into AI tools, per Cyberhaven 2025

increase in sensitive data submitted to AI tools between 2023 and 2025

27%

of AI-submitted data classified as confidential or restricted by enterprise DLP policies

Generative AI tools have created a new data exfiltration vector that most DLP programs were not designed to detect: employees typing or pasting sensitive information into a web browser text box. Unlike traditional exfiltration methods (email, USB, cloud upload), AI data leakage is often intentional from the user's perspective (they are trying to be productive) but unintentional as a security event (they do not consider that the data now exists in an external training or inference pipeline). The governance response requires both technical controls and cultural change.

What Data Is Being Shared and With Whom

Cyberhaven's analysis of millions of clipboard and form submission events identifies the most commonly shared sensitive data categories: source code (developers using AI coding assistants to debug proprietary code), customer PII (support agents using AI to draft responses with customer data pasted for context), financial data (analysts using AI to summarize earnings reports containing non-public information), legal documents (lawyers using AI to draft contracts with confidential terms), and HR data (managers drafting performance reviews with identifiable employee information). The AI tools receiving this data span corporate-sanctioned tools (Microsoft 365 Copilot, GitHub Copilot, Google Workspace Gemini), consumer AI tools used at work (ChatGPT, Claude, Gemini), and specialized AI assistants in SaaS applications (Notion AI, Salesforce Einstein, Slack AI).

The Actual Risk

The risk profile of AI data leakage depends on the tool and its data handling policies. Enterprise agreements: tools like Microsoft 365 Copilot, Google Workspace Gemini under enterprise terms, and GitHub Copilot Business explicitly prohibit using customer data for model training and provide data processing agreements. Consumer versions: ChatGPT's consumer product (non-enterprise tier) historically used conversations for training, though OpenAI provides opt-out mechanisms. The actual risk is not always training data exposure: it is data persistence in AI provider infrastructure, potential exposure if the AI provider has a breach, compliance violations (GDPR, HIPAA prohibit sharing regulated data with unauthorized processors), and regulatory risk if non-public financial information is shared with external AI services.

Free daily briefing

Briefings like this, every morning before 9am.

Threat intel, active CVEs, and campaign alerts, distilled for practitioners. 50,000+ subscribers. No noise.

Acceptable Use Policy for AI Tools

Before deploying technical controls, establish a clear AI acceptable use policy. The policy should define which AI tools are approved for enterprise use, categorize acceptable and prohibited data inputs by sensitivity level, require employees to avoid inputting customer data or regulated data into non-enterprise AI tools, clarify that AI-generated output must be reviewed before use in customer communications or regulatory filings, and require disclosure when AI tools are used to draft significant documents. The policy should be enforced through training and acknowledgment, not just posted on the intranet. Role-specific guidance (developers, legal, HR, customer support) is more effective than a generic policy.

Technical Controls: AI-Aware DLP

Traditional DLP tools detect sensitive data in email attachments, file uploads, and clipboard operations but were not designed to inspect data submitted in web forms and browser input fields to AI services. AI-aware DLP extends detection to these vectors:

Browser extension-based DLP

Cyberhaven, Nightfall, and Microsoft Purview browser extensions can intercept data submitted to AI tool URLs and apply DLP policies. They can block submission, warn the user, or log the event for review.

Network-level DLP for AI domains

Block or monitor traffic to unapproved AI services at the proxy or SWG layer. Identify AI service domains (chat.openai.com, claude.ai, gemini.google.com) and apply DLP inspection to HTTPS traffic destined for these domains using TLS inspection.

CASB controls

Cloud Access Security Broker (CASB) platforms can classify AI tools as sanctioned or unsanctioned and enforce policy: allow Microsoft Copilot, block ChatGPT consumer, monitor Claude for sensitive data uploads.

Microsoft Purview AI Hub

For Microsoft 365 environments, Purview AI Hub provides visibility into how employees use Microsoft Copilot and what sensitive data they input, without requiring additional tooling.

AI Gateway Controls

AI gateways are a newer control category that sits between users and AI APIs, providing centralized policy enforcement, logging, and data sanitization. Organizations deploying AI internally via API (rather than consumer web interfaces) can route all LLM traffic through an AI gateway that strips or pseudonymizes sensitive data before it reaches the model. Open-source options include LiteLLM proxy and Portkey. Commercial options include Aporia, Lakera Guard, and AWS Bedrock Guardrails. AI gateways also provide prompt injection detection, output filtering to prevent the model from returning sensitive data it retrieved from internal systems, and rate limiting.

Governance for Sanctioned AI Tools

Blocking all AI tools is not a viable long-term strategy: employees route around blocks using personal devices or other methods, and organizations that block AI tools cede productivity advantages to competitors. The sustainable approach is to provide sanctioned, enterprise-licensed versions of AI tools that meet your data handling requirements, while blocking or monitoring unsanctioned consumer alternatives. This requires: enterprise agreements with data processing terms for approved AI tools, CASB or network controls that allow approved AI services and block unapproved ones, and clear communication to employees about why certain tools are blocked and what approved alternatives exist.

The bottom line

AI data leakage is primarily a governance and awareness problem, not a purely technical one. Employees are not trying to cause breaches; they are trying to work faster. Provide sanctioned AI tools with enterprise data protection, communicate clearly what data cannot be shared with external AI services, and use technical controls to catch edge cases. Blocking everything fails; governing everything thoughtfully can succeed.

Frequently asked questions

Does ChatGPT use my data to train its models?

For the ChatGPT consumer product: conversations are used for model training by default, though users can opt out in settings. For ChatGPT Team and Enterprise tiers: OpenAI explicitly states that conversation data is not used for training and is not retained beyond the session. For the OpenAI API: data submitted via API is not used for training by default under standard terms. Always verify current terms directly with the vendor, as policies change. For enterprise security purposes, only use AI tools under enterprise agreements with explicit data processing terms.

What AI tools are safe to use for work with sensitive data?

Enterprise-licensed versions of major AI tools typically provide appropriate data protection: Microsoft 365 Copilot under enterprise terms, Google Workspace Gemini under enterprise terms, GitHub Copilot Business or Enterprise, and Claude for Enterprise (Anthropic's enterprise offering). These products include data processing agreements, prohibit use of customer data for training, and provide security controls. Consumer versions of the same tools have weaker protections. Never use any AI tool for HIPAA-covered protected health information without a signed Business Associate Agreement.

How do I detect if employees are sharing sensitive data with AI tools?

Three primary detection methods: (1) CASB visibility into which AI services employees access and approximate data volumes submitted, (2) DLP alerts when sensitive data patterns (SSNs, credit card numbers, patient identifiers) appear in browser submissions to AI service domains, (3) endpoint DLP browser extensions that can inspect form submissions before they leave the device. Network-level detection requires TLS inspection for HTTPS traffic to AI service domains.

What is prompt injection and how does it relate to enterprise AI security?

Prompt injection is an attack where malicious instructions embedded in data processed by an AI system override or manipulate the AI's intended behavior. In enterprise contexts, the risk arises when AI tools process external content (emails, documents, web pages) that contains adversarial instructions designed to exfiltrate data the AI has access to, bypass restrictions, or manipulate the AI's output. This is a risk in AI-powered email assistants, document summarizers, and AI agents with tool-use capabilities. AI gateway controls and output filtering mitigate prompt injection risk.

How should we handle GDPR compliance for AI tools used with EU personal data?

GDPR requires a lawful basis for processing personal data and restricts transfers to countries without adequate data protection. Processing EU personal data in AI tools raises: (1) whether the AI vendor is an authorized data processor (requires a Data Processing Agreement), (2) whether data is transferred outside the EU/EEA (US-based AI providers require Standard Contractual Clauses or equivalent), (3) whether the processing purpose is compatible with the original collection purpose. Work with your legal team and DPO to assess each AI tool under GDPR before allowing employees to use it with EU personal data.

Can we use open-source or self-hosted LLMs to avoid data leakage risk?

Self-hosted open-source LLMs (Llama, Mistral, Falcon) run entirely within your infrastructure, eliminating the third-party data sharing risk. This approach requires significant infrastructure investment, ML engineering expertise for model serving and fine-tuning, and ongoing maintenance for security patches. It is viable for organizations with the technical capability and a strong requirement to keep all data on-premises (highly regulated industries, government). For most enterprises, enterprise-licensed versions of commercial AI tools with strong data processing terms are more practical.

Sources & references

Netskope Cloud and Threat Report 2025
Cyberhaven AI Data Security Report 2025
Microsoft Purview AI Hub Documentation
OWASP Top 10 for LLM Applications 2025
CISA AI Cybersecurity Guidance 2025

Free resources

Free download

Critical CVE Reference Card 2025–2026

25 actively exploited vulnerabilities with CVSS scores, exploit status, and patch availability. Print it, pin it, share it with your SOC team.

Free download

Ransomware Incident Response Playbook

Step-by-step 24-hour IR checklist covering detection, containment, eradication, and recovery. Built for SOC teams, IR leads, and CISOs.

Free newsletter

Get threat intel before your inbox does.

50,000+ security professionals read Decryption Digest for early warnings on zero-days, ransomware, and nation-state campaigns. Free, weekly, no spam.

Unsubscribe anytime. We never sell your data.

Author

Eric BangCISSP

Founder & Cybersecurity Evangelist, Decryption Digest

Cybersecurity professional with expertise in threat intelligence, vulnerability research, and enterprise security. Covers zero-days, ransomware, and nation-state operations for 50,000+ security professionals weekly.

View profile →LinkedIn

Back to all briefings

Subscribe for Updates

AI security DLP data leakage ChatGPT Microsoft Copilot shadow IT data governance

Free Brief

The Mythos Brief is free.

AI that finds 27-year-old zero-days. What it means for your security program.