Privacy and Compliance for LLM Systems
PII in prompts, data residency, model training on user data, GDPR/CCPA implications, and how to build a compliance architecture for production LLMs.
The question your legal team will eventually ask: 'Where does the user's data go when it hits the LLM?' If you don't have a crisp answer to that question, you are not ready for enterprise customers, regulated industries, or any geography with meaningful data protection law.
This isn't about being paranoid. It's about being specific. Privacy and compliance for LLM systems is a solvable problem once you understand the actual requirements.
The data flows that create risk
- User messages sent to a third-party model API (OpenAI, Anthropic, Google): data leaves your infrastructure
- RAG retrieval: user queries are embedded and matched against your knowledge base; the query itself can be sensitive, and it leaves your infrastructure if you use a hosted embedding API
- Tool outputs fed back to the model — if tools return PII from your database, that PII is now in the model's context
- Conversation history — accumulates PII over time; needs explicit retention limits
- Fine-tuning data — if you fine-tune on user data, that data affects model weights indefinitely
- Logging and observability — traces of LLM calls often contain full user messages; treat logs as sensitive data
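As one illustration of treating logs as sensitive data, the sketch below attaches a redacting filter to a Python logger so that email addresses and phone-like numbers are masked before a trace is written. The regular expressions and the `llm_trace` logger name are assumptions made for this example; they are deliberately narrow and are not a complete PII detector.

```python
import logging
import re

# Illustrative patterns only; real deployments need broader PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True


logger = logging.getLogger("llm_trace")  # hypothetical logger name
logger.addFilter(RedactingFilter())
```

Redacting at the logger means every handler attached to it receives the masked message, so call sites do not each have to remember to scrub.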
What GDPR and CCPA actually require
| Requirement | LLM implication |
|---|---|
| Data minimisation | Don't send more user data to the LLM than necessary for the task |
| Purpose limitation | User data collected for support cannot be used to train your model without separate consent |
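| Right to erasure | If user data was used in fine-tuning, erasure is technically very hard because the data cannot be removed from trained weights; avoid fine-tuning on user data unless you have explicit opt-in and a clear policy for handling erasure requests |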
| Data processing agreements | If you use a third-party model API, you need a DPA with that provider — most major providers offer these |
| Data residency | For EU customers, you may need to use EU-region API endpoints; check provider availability |
| Consent for AI processing | In some jurisdictions, automated decision-making with significant effects requires explicit consent |
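Data minimisation from the table above can be enforced in code with a per-task allowlist, so that only the fields a task needs ever reach the prompt. This is a minimal sketch: the task names, field names, and the `build_context` helper are hypothetical and stand in for your own schema.

```python
# Hypothetical task names and field allowlists; adapt to your own schema.
ALLOWED_FIELDS: dict[str, set[str]] = {
    "order_status": {"order_id", "order_state", "shipping_carrier"},
    "billing_question": {"plan_name", "last_invoice_amount"},
}


def minimise(record: dict, task: str) -> dict:
    """Return only the fields the given task is allowed to send to the model."""
    allowed = ALLOWED_FIELDS.get(task, set())  # unknown task: send nothing
    return {k: v for k, v in record.items() if k in allowed}


def build_context(record: dict, task: str) -> str:
    """Render the minimised record as plain 'key: value' lines for the prompt."""
    fields = minimise(record, task)
    return "\n".join(f"{k}: {v}" for k, v in sorted(fields.items()))


user_record = {
    "name": "Sarah Chen",
    "email": "sarah@acme.com",
    "order_id": "A-1001",
    "order_state": "shipped",
    "shipping_carrier": "DHL",
    "plan_name": "Pro",
}
context = build_context(user_record, "order_status")
```

An allowlist fails closed: a new column added to the user record stays out of the prompt until someone decides a task needs it.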
PII scrubbing before the LLM
For many use cases, user messages can be de-identified before being sent to the model. Named entity recognition and pattern matching can detect names, emails, phone numbers, account numbers, and addresses. Each detected value is replaced with a placeholder the model works with, and you re-identify the output in post-processing.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_pii(text: str) -> tuple[str, dict[str, list[str]]]:
    # Detect only the entity types we want to remove
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
                  "CREDIT_CARD", "US_SSN", "LOCATION"],
    )
    # The default operator replaces each detected span with <ENTITY_TYPE>
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    # Keep the original values per entity type for re-identification if needed
    pii_map: dict[str, list[str]] = {}
    for r in results:
        pii_map.setdefault(r.entity_type, []).append(text[r.start:r.end])
    return anonymized.text, pii_map

scrubbed, pii_map = scrub_pii("My name is Sarah Chen, reach me at sarah@acme.com")
# scrubbed: "My name is <PERSON>, reach me at <EMAIL_ADDRESS>"
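Placeholders such as `<PERSON>` are not unique, so a message with two different names cannot be restored unambiguously. One way to make re-identification reliable is to number the placeholders and keep a placeholder-to-value map. The sketch below shows the idea with the standard library only, using a simple email regex as a stand-in for a real detector; the function names and the sample model reply are assumptions for illustration.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # illustrative pattern


def pseudonymise(text: str) -> tuple[str, dict[str, str]]:
    """Swap each distinct email for a numbered placeholder; return text and mapping."""
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"
            seen[value] = placeholder
            mapping[placeholder] = value
        return seen[value]

    return EMAIL_RE.sub(replace, text), mapping


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Put original values back into model output that still contains placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


scrubbed_text, email_map = pseudonymise(
    "Cc sarah@acme.com and bob@acme.com, then reply to sarah@acme.com"
)
# Stand-in for a real model response that echoes the placeholders.
model_reply = "I will reply to <EMAIL_1> and copy <EMAIL_2>."
restored = reidentify(model_reply, email_map)
```

The mapping is itself PII, so it should live only for the duration of the request and stay out of logs.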
Compliance by industry
| Industry | Key regulation | Critical LLM requirements |
|---|---|---|
| Healthcare (US) | HIPAA | BAA with model provider required; PHI must not be sent to any API that is not covered by a BAA; audit logs mandatory |
| Finance (US) | GLBA, SOX | PII controls; model decisions affecting credit/risk must be explainable; audit trails required |
| EU (any sector) | GDPR | DPA with provider; data residency options; right to erasure documented; AI Act compliance for high-risk uses |
| Legal | Attorney-client privilege | User data may be privileged; extra care on data retention and third-party sharing |
| Government (US) | FedRAMP | Model providers must be FedRAMP authorised; Azure OpenAI or self-hosted Llama are common choices |
The safest architecture for heavily regulated environments: self-hosted open-weight models (Llama 3, Mistral) on your own infrastructure. Data never leaves your network. You control retention, logging, and access. The tradeoff: you own the serving infrastructure, and model quality typically lags behind frontier models.
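Many systems do not need self-hosting for every request. A routing layer can send only the most sensitive traffic to the self-hosted model and keep the rest on a hosted API in the right region. The sketch below is one possible policy, not a statement of what any regulation requires: the sensitivity levels, backend names, and regions are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    PERSONAL = 2
    REGULATED = 3  # e.g. PHI or privileged material


@dataclass(frozen=True)
class Backend:
    name: str
    self_hosted: bool
    region: str


# Hypothetical backends; names and regions are placeholders.
BACKENDS = [
    Backend("third-party-us", self_hosted=False, region="us"),
    Backend("third-party-eu", self_hosted=False, region="eu"),
    Backend("self-hosted", self_hosted=True, region="on-prem"),
]


def choose_backend(sensitivity: Sensitivity, user_region: str) -> Backend:
    """Pick the first backend that satisfies the policy, or fail closed."""
    for backend in BACKENDS:
        if backend.self_hosted:
            return backend  # data stays on our own network: always allowed
        if sensitivity is Sensitivity.REGULATED:
            continue  # regulated data never goes to a third-party API
        if sensitivity is Sensitivity.PERSONAL and backend.region != user_region:
            continue  # personal data stays in the user's region
        return backend
    raise LookupError("no compliant backend for this request")
```

Raising when nothing matches means a misconfigured backend list blocks the request instead of silently sending data to a non-compliant endpoint.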