GenAI Systems Lab

Privacy and Compliance for LLM Systems

PII in prompts, data residency, model training on user data, GDPR/CCPA implications, and how to build a compliance architecture for production LLMs.

The question your legal team will eventually ask: 'Where does the user's data go when it hits the LLM?' If you don't have a crisp answer to that question, you are not ready for enterprise customers, regulated industries, or any geography with meaningful data protection law.

This isn't about being paranoid. It's about being specific. Privacy and compliance for LLM systems is a solvable problem once you understand the actual requirements.

The data flows that create risk

What GDPR and CCPA actually require

| Requirement | LLM implication |
| --- | --- |
| Data minimisation | Don't send more user data to the LLM than the task needs |
| Purpose limitation | User data collected for support cannot be used to train your model without a separate legal basis, typically separate consent |
| Right to erasure | If user data was used in fine-tuning, erasure is technically very hard; avoid fine-tuning on user data unless you have explicit opt-in and a clear erasure policy |
| Data processing agreements | If you use a third-party model API, you need a DPA with that provider; most major providers offer these |
| Data residency | For EU customers, you may need to use EU-region API endpoints; check provider availability |
| Consent for AI processing | In some jurisdictions, automated decision-making with legal or similarly significant effects requires explicit consent |
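Two of these rows, data minimisation and data residency, translate fairly directly into code. The following is a minimal sketch, not a definitive implementation: the task names, field names, and endpoint hostnames are all hypothetical placeholders, and the real allow-list would come from your own data inventory.

```python
# Hypothetical allow-list: which user-record fields each task may send to the model.
TASK_FIELDS = {
    "order_status": {"order_id", "order_state", "shipping_country"},
    "billing_question": {"plan", "last_invoice_amount"},
}

# Hypothetical regional endpoints; real hostnames come from your provider.
REGION_ENDPOINTS = {
    "eu": "https://eu.llm.example.invalid/v1",
    "us": "https://us.llm.example.invalid/v1",
}


def minimise(record: dict, task: str) -> dict:
    """Keep only the fields the task is allowed to send to the model."""
    allowed = TASK_FIELDS.get(task, set())  # unknown task: send nothing
    return {k: v for k, v in record.items() if k in allowed}


def endpoint_for(user_region: str) -> str:
    """Fail closed: refuse to route if no approved endpoint exists."""
    try:
        return REGION_ENDPOINTS[user_region]
    except KeyError:
        raise ValueError(f"no approved endpoint for region {user_region!r}")


record = {
    "order_id": "A-1001",
    "order_state": "shipped",
    "email": "sarah@acme.com",
    "date_of_birth": "1990-01-01",
}
context = minimise(record, "order_status")  # email and date of birth are dropped
```

An allow-list is used rather than a block-list so that a newly added field is excluded from prompts until someone decides it is needed.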

PII scrubbing before the LLM

For many use cases, user messages can be de-identified before they are sent to the model. Named entity recognition (NER) and pattern matching can detect names, emails, phone numbers, account numbers, and addresses and replace them with placeholders. The model works with the placeholders, and you restore the original values in post-processing if the response needs them.

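from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

PII_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
                "CREDIT_CARD", "US_SSN", "LOCATION"]

def scrub_pii(text: str) -> tuple[str, dict[str, list[str]]]:
    # Detect PII spans; each result carries entity_type, start, end and score
    results = analyzer.analyze(text=text, language="en", entities=PII_ENTITIES)
    # Record the original values per entity type, in order of appearance.
    # The default placeholders are not unique per value, so exact
    # re-identification needs numbered placeholders instead.
    found: dict[str, list[str]] = {}
    for r in sorted(results, key=lambda r: r.start):
        found.setdefault(r.entity_type, []).append(text[r.start:r.end])
    # The default operator replaces each span with an <ENTITY_TYPE> placeholder
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text, found

scrubbed, pii_map = scrub_pii("My name is Sarah Chen, reach me at sarah@acme.com")
# scrubbed (expected): "My name is <PERSON>, reach me at <EMAIL_ADDRESS>"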
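Re-identification only works if each placeholder maps back to exactly one original value. The sketch below shows one way to do that with numbered placeholders. It is independent of any detection library: it assumes you already have `(start, end, entity_type)` spans, for example taken from the `start`, `end`, and `entity_type` attributes of Presidio's analyzer results. The function names `pseudonymize` and `reidentify` are illustrative, not part of Presidio.

```python
import re


def pseudonymize(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (start, end, entity_type) span with a numbered placeholder."""
    mapping: dict[str, str] = {}           # placeholder -> original value
    seen: dict[tuple[str, str], str] = {}  # (entity_type, value) -> placeholder
    counters: dict[str, int] = {}
    parts: list[str] = []
    cursor = 0
    for start, end, etype in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        value = text[start:end]
        key = (etype, value)
        if key not in seen:
            # Same value gets the same placeholder, so the model can
            # still tell that two mentions refer to the same thing.
            counters[etype] = counters.get(etype, 0) + 1
            seen[key] = f"<{etype}_{counters[etype]}>"
            mapping[seen[key]] = value
        parts.append(text[cursor:start])
        parts.append(seen[key])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in a single pass; unknown placeholders are left as-is."""
    return re.sub(r"<[A-Z_]+_\d+>", lambda m: mapping.get(m.group(0), m.group(0)), text)


message = "Sarah Chen emailed sarah@acme.com; call Sarah Chen back."
# Spans are located with simple searches here, standing in for a real detector.
spans = [(m.start(), m.end(), "PERSON") for m in re.finditer("Sarah Chen", message)]
spans += [(m.start(), m.end(), "EMAIL_ADDRESS") for m in re.finditer(r"sarah@acme\.com", message)]

scrubbed_message, placeholder_map = pseudonymize(message, spans)
restored = reidentify(scrubbed_message, placeholder_map)
```

The placeholder map is itself personal data, so it should stay inside your own systems and follow the same retention rules as the original message.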

Compliance by industry

| Industry | Key regulation | Critical LLM requirements |
| --- | --- | --- |
| Healthcare (US) | HIPAA | BAA with model provider required; PHI cannot be sent to a provider API that is not covered by a BAA; audit logs mandatory |
| Finance (US) | GLBA, SOX | PII controls; model decisions affecting credit/risk must be explainable; audit trails required |
| EU (any sector) | GDPR | DPA with provider; data residency options; right to erasure documented; AI Act compliance for high-risk uses |
| Legal | Attorney-client privilege | User data may be privileged; extra care on data retention and third-party sharing |
| Government (US) | FedRAMP | Model providers must be FedRAMP authorised; Azure OpenAI or self-hosted Llama are common choices |

The safest architecture for heavily regulated environments is self-hosted open-weight models (Llama 3, Mistral) on your own infrastructure. Data never leaves your network, and you control retention, logging, and access. The tradeoff: you own the serving infrastructure, and model quality generally trails frontier models.
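One way to enforce these rules at runtime is a small policy gate in front of the model call. The sketch below is an illustration under stated assumptions: the data classes, the `Provider` attributes, and the mapping between them are hypothetical simplifications of the table above, and it assumes your self-hosted deployment already meets the controls your regulator expects.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provider:
    name: str
    self_hosted: bool = False
    has_dpa: bool = False             # data processing agreement signed
    has_baa: bool = False             # HIPAA business associate agreement signed
    fedramp_authorised: bool = False


# Hypothetical mapping: data class -> provider attribute it requires (None = no requirement).
REQUIRED = {
    "public": None,
    "pii_eu": "has_dpa",
    "phi": "has_baa",
    "us_gov": "fedramp_authorised",
}


def allowed(provider: Provider, data_class: str) -> bool:
    if data_class not in REQUIRED:
        return False  # fail closed on unclassified data
    if provider.self_hosted:
        return True   # assumption: data never leaves your own network
    needed = REQUIRED[data_class]
    return needed is None or getattr(provider, needed)


def route(data_class: str, providers: list[Provider], audit_log: list[dict]) -> Provider:
    """Pick the first permitted provider and record the decision."""
    for provider in providers:
        if allowed(provider, data_class):
            audit_log.append({
                "at": datetime.now(timezone.utc).isoformat(),
                "data_class": data_class,
                "provider": provider.name,
            })
            return provider
    raise PermissionError(f"no provider is approved for data class {data_class!r}")


providers = [
    Provider("hosted-api", has_dpa=True),      # preferred when permitted
    Provider("in-house-llm", self_hosted=True),
]
audit_log: list[dict] = []
eu_choice = route("pii_eu", providers, audit_log)  # hosted API has a DPA
phi_choice = route("phi", providers, audit_log)    # no BAA, so falls back to self-hosted
```

Listing the hosted provider first means regulated data falls back to the self-hosted model only when nothing else qualifies, which keeps frontier-model quality available for everything else.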
