
Privacy and Compliance for LLM Systems

PII in prompts, data residency, model training on user data, GDPR/CCPA implications, and how to build a compliance architecture for production LLMs.

The question your legal team will eventually ask: 'Where does the user's data go when it hits the LLM?' If you don't have a crisp answer to that question, you are not ready for enterprise customers, regulated industries, or any geography with meaningful data protection law.

This isn't about being paranoid. It's about being specific. Privacy and compliance for LLM systems is a solvable problem once you understand the actual requirements.

The data flows that create risk

User messages sent to a third-party model API (OpenAI, Anthropic, Google): data leaves your infrastructure
RAG retrieval: user queries are embedded and matched against your knowledge base, and the query text itself can be sensitive
Tool outputs fed back to the model: if tools return PII from your database, that PII is now in the model's context
Conversation history: accumulates PII over time and needs explicit retention limits
Fine-tuning data: if you fine-tune on user data, that data influences model weights indefinitely
Logging and observability: traces of LLM calls often contain full user messages, so treat logs as sensitive data
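
As one illustration of an explicit retention limit on conversation history, here is a minimal sketch. The `Message` type, the `prune_history` helper, and the 30-day window are assumptions made for this example, not a recommended policy. It only filters an in-memory list; a real system would also need to delete the expired rows from whatever store holds them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Message:
    role: str
    content: str
    created_at: datetime  # timezone-aware UTC timestamp


def prune_history(messages: list[Message], now: datetime,
                  max_age: timedelta = timedelta(days=30)) -> list[Message]:
    """Return only the messages that fall inside the retention window."""
    cutoff = now - max_age
    return [m for m in messages if m.created_at >= cutoff]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
history = [
    Message("user", "old message", now - timedelta(days=45)),
    Message("user", "recent message", now - timedelta(days=2)),
]
kept = prune_history(history, now)
print([m.content for m in kept])
```

Running the pruning step before every model call means expired messages never reach the prompt, even if the deletion job in the backing store lags behind.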

What GDPR and CCPA actually require

Requirement	LLM implication
Data minimisation	Don't send more user data to the LLM than necessary for the task
Purpose limitation	User data collected for support cannot be used to train your model without separate consent
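Right to erasure	If user data was used in fine-tuning, erasure is technically very hard because there is no reliable way to remove one user's data from trained weights; avoid fine-tuning on user data unless you have explicit opt-in and a clear erasure policy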
Data processing agreements	If you use a third-party model API, you need a DPA with that provider — most major providers offer these
Data residency	For EU customers, you may need to use EU-region API endpoints; check provider availability
Consent for AI processing	In some jurisdictions, automated decision-making with significant effects requires explicit consent
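
Data minimisation and purpose limitation can be enforced in code by allow-listing the fields each task may see before a prompt is built. The sketch below is illustrative only: the task names, field names, and the `minimise` helper are hypothetical, and the right allow-list depends on your own data model and legal basis.

```python
# Hypothetical mapping from task to the profile fields that task needs.
ALLOWED_FIELDS = {
    "order_status": {"first_name", "order_id", "order_status"},
    "billing_question": {"first_name", "plan", "last_invoice_total"},
}


def minimise(record: dict, task: str) -> dict:
    """Keep only the fields allow-listed for this task; unknown tasks get nothing."""
    allowed = ALLOWED_FIELDS.get(task, set())
    return {k: v for k, v in record.items() if k in allowed}


user_record = {
    "first_name": "Sarah",
    "email": "sarah@acme.com",
    "date_of_birth": "1990-01-01",
    "order_id": "A-1001",
    "order_status": "shipped",
}
context = minimise(user_record, "order_status")
print(context)
```

Defaulting to an empty set for unknown tasks is a deliberate choice here: a new feature has to declare what it needs before any user data reaches the model.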

PII scrubbing before the LLM

For many use cases, user messages can be de-identified before being sent to the model. Named Entity Recognition can strip names, emails, phone numbers, account numbers, and addresses — replacing them with placeholders the model works with, which you then re-identify in post-processing.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_pii(text: str) -> tuple[str, dict]:
    results = analyzer.analyze(text=text, language="en",
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
                  "CREDIT_CARD", "US_SSN", "LOCATION"])
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
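    # Map each entity type to the original substrings it covered. Keying the
    # dict by type alone would keep only the last match of each type.
    found: dict[str, list[str]] = {}
    for r in results:
        found.setdefault(r.entity_type, []).append(text[r.start:r.end])
    return anonymized.text, found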

scrubbed, pii_map = scrub_pii("My name is Sarah Chen, reach me at sarah@acme.com")
# scrubbed: "My name is <PERSON>, reach me at <EMAIL_ADDRESS>"
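
Generic placeholders such as `<PERSON>` cannot be mapped back to a specific value when a message contains several entities of the same type. One way to make the round trip reversible is to use numbered placeholders and keep the mapping on your side. The sketch below shows the idea with two simplified regular expressions standing in for an NER-based detector. The patterns and the `pseudonymise` and `reidentify` helpers are assumptions for illustration and will miss many real-world formats.

```python
import re

# Simplified stand-in patterns; a real system would use an NER-based detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def pseudonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder and record the original."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            original = match.group(0)
            # Reuse the placeholder if the same value appears again.
            for placeholder, value in mapping.items():
                if value == original:
                    return placeholder
            count = sum(1 for p in mapping if p.startswith(f"<{label}_")) + 1
            placeholder = f"<{label}_{count}>"
            mapping[placeholder] = original
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Swap placeholders in the model's response back to the original values."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


scrubbed, mapping = pseudonymise("Email sarah@acme.com or call +1 415 555 0100.")
model_reply = "I will contact <EMAIL_1> and then phone <PHONE_1>."
restored = reidentify(model_reply, mapping)
print(scrubbed)
print(restored)
```

The mapping is itself PII, so it should stay in your own infrastructure for the lifetime of the request and be discarded afterwards.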

Compliance by industry

Industry	Key regulation	Critical LLM requirements
Healthcare (US)	HIPAA	BAA with model provider required; PHI cannot be sent to any provider API that is not covered by a BAA; audit logs mandatory
Finance (US)	GLBA, SOX	PII controls; model decisions affecting credit/risk must be explainable; audit trails required
EU (any sector)	GDPR	DPA with provider; data residency options; right to erasure documented; AI Act compliance for high-risk uses
Legal	Attorney-client privilege	User data may be privileged; extra care on data retention and third-party sharing
Government (US)	FedRAMP	Model providers must be FedRAMP authorised; Azure OpenAI or self-hosted Llama are common choices

The safest architecture for heavily regulated environments: self-hosted open-weight models (Llama 3, Mistral) on your own infrastructure. Data never leaves your network. You control retention, logging, and access. The tradeoff: you own the serving infrastructure, and model quality typically trails the frontier hosted models.
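
Many teams do not need to self-host everything. A routing layer can keep regulated data on self-hosted infrastructure and send the rest to a region-appropriate provider endpoint. The following is a rough sketch under stated assumptions: the endpoint URLs are placeholders, and the `Request` type, its fields, and `choose_endpoint` are hypothetical. How a request gets flagged as containing regulated data is out of scope here.

```python
from dataclasses import dataclass

# Hypothetical endpoints, for illustration only.
ENDPOINTS = {
    "self_hosted": "https://llm.internal.example/v1",
    "eu_api": "https://eu.api.provider.example/v1",
    "us_api": "https://us.api.provider.example/v1",
}


@dataclass
class Request:
    user_region: str          # e.g. "EU" or "US"
    contains_regulated: bool  # True if the payload holds PHI or similar data


def choose_endpoint(req: Request) -> str:
    """Regulated data stays on self-hosted infrastructure; otherwise route by region."""
    if req.contains_regulated:
        return ENDPOINTS["self_hosted"]
    if req.user_region == "EU":
        return ENDPOINTS["eu_api"]
    return ENDPOINTS["us_api"]


print(choose_endpoint(Request(user_region="EU", contains_regulated=False)))
print(choose_endpoint(Request(user_region="US", contains_regulated=True)))
```

Checking the regulated-data flag first means a misconfigured region can never cause regulated content to leave your network.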

