Building AI systems that process protected health information (PHI) is not simply a matter of adding encryption after the fact. HIPAA compliance needs to be an architectural decision, baked into the system from the first line of code. Retrofitting compliance onto a system designed without it is expensive, disruptive, and often results in compromises that leave gaps.
This guide covers the practical engineering requirements for building HIPAA-compliant AI systems, drawn from our experience building healthcare technology for payers, providers, and pharmaceutical organizations.
Understanding What HIPAA Actually Requires
HIPAA has three rules that matter for technology systems:
- The Privacy Rule defines what PHI is and who can access it under what circumstances. For engineers, this translates to access controls, minimum necessary data exposure, and audit logging.
- The Security Rule specifies the administrative, physical, and technical safeguards required to protect electronic PHI (ePHI). This is where encryption, access management, and incident response come in.
- The Breach Notification Rule defines what constitutes a breach and the notification requirements. For engineers, this means building systems that can detect unauthorized access and generate audit trails sufficient to assess scope.
A common misconception is that HIPAA mandates specific technologies. It does not. The regulation is technology-neutral: it specifies outcomes, not implementations. AES-256 encryption is not "HIPAA-compliant" by itself. A system that encrypts data at rest and in transit, restricts access to authorized users, logs all access events, and has a plan for breach detection and response is what compliance looks like in practice.
The Business Associate Agreement
Before any PHI touches your system, you need a Business Associate Agreement (BAA) with every entity that handles the data. This includes your cloud provider, your AI model provider, your logging service, and any third-party API that receives PHI. If your AI system sends patient data to an LLM API for processing, you need a BAA with that provider.
As of 2024, the major cloud providers (AWS, Azure, GCP) all offer BAAs. For AI model APIs, the landscape is more nuanced. OpenAI offers a BAA for its API (not the ChatGPT consumer product). Anthropic offers BAAs for the Claude API. If you are using a model provider that will not sign a BAA, you cannot send PHI to that API. Period.
The alternative is to run models locally or in your own cloud environment. Open-source models like Llama, Mistral, or Med-PaLM derivatives can run within your BAA-covered cloud infrastructure without sending data to external APIs.
Technical Architecture Requirements
1. Encryption
Encrypt ePHI at rest (AES-256 is the standard) and in transit (TLS 1.2 or higher). This applies to databases, file storage, message queues, log files, and any temporary storage. If your AI system writes intermediate results to disk during processing, those files must be encrypted.
```python
# Example: Encrypt PHI fields before database storage
from cryptography.fernet import Fernet

class PHIEncryptor:
    def __init__(self, key):
        self.cipher = Fernet(key)

    def encrypt_record(self, record, phi_fields):
        """Encrypt only the PHI fields in a record."""
        encrypted = record.copy()
        for field in phi_fields:
            if field in encrypted and encrypted[field]:
                encrypted[field] = self.cipher.encrypt(
                    str(encrypted[field]).encode()
                ).decode()
        return encrypted

    def decrypt_record(self, record, phi_fields):
        """Decrypt PHI fields for authorized access."""
        decrypted = record.copy()
        for field in phi_fields:
            if field in decrypted and decrypted[field]:
                decrypted[field] = self.cipher.decrypt(
                    decrypted[field].encode()
                ).decode()
        return decrypted

# PHI fields in a formulary context
PHI_FIELDS = ["member_id", "date_of_birth", "ssn_last_four"]
```
2. Access Controls
Implement role-based access control (RBAC) with the minimum necessary principle. A formulary analyst does not need access to individual member claims data. A clinical reviewer needs access to drug utilization data but not member financial information. Map roles to data access levels explicitly.
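As a minimal sketch, the role-to-scope mapping described above can be made explicit in code. The role names and data scopes here are illustrative, not drawn from any specific system:

```python
# Minimal RBAC sketch: each role maps to the set of data scopes it may read.
# Anything not explicitly granted is denied (minimum necessary principle).
ROLE_SCOPES = {
    "formulary_analyst": {"drug_pricing", "formulary_tiers"},
    "clinical_reviewer": {"drug_utilization", "formulary_tiers"},
    "care_manager": {"member_claims", "drug_utilization"},
}

def can_access(role: str, scope: str) -> bool:
    """Return True only if the role has been granted the requested scope."""
    return scope in ROLE_SCOPES.get(role, set())
```

Keeping the mapping as data rather than scattered `if` statements makes it auditable: reviewers can inspect one table to see exactly which roles reach which PHI.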
3. Audit Logging
Every access to PHI must be logged with the user identity, timestamp, data accessed, and action taken. Logs themselves become ePHI (because they record who accessed what patient data) and must be encrypted and retained per your organization's policies (typically six years minimum).
```python
import logging
import json
from datetime import datetime, timezone

def get_current_session_id():
    """Placeholder: wire this to your application's session context."""
    return None

class PHIAuditLogger:
    def __init__(self, log_path):
        self.logger = logging.getLogger("phi_audit")
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(message)s"))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_access(self, user_id, action, resource_type,
                   resource_id, reason):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "action": action,
            "resource_type": resource_type,
            "resource_id": resource_id,
            "reason": reason,
            "session_id": get_current_session_id(),
        }
        self.logger.info(json.dumps(entry))
```
4. De-identification for Model Training
If you are training or fine-tuning AI models on healthcare data, HIPAA's Safe Harbor method requires removing 18 specific identifier types, including names, geographic subdivisions smaller than a state, dates (except year), phone numbers, email addresses, SSNs, medical record numbers, and several others. Expert determination is the alternative path, requiring a qualified statistician to certify that re-identification risk is very small.
For most AI training scenarios, Safe Harbor de-identification combined with differential privacy techniques provides a practical path. The key is building de-identification into your data pipeline as a required step before any data enters the training workflow.
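To make the pipeline step concrete, here is a sketch of regex-based redaction for a few of the 18 Safe Harbor identifier categories. This is illustrative only: the patterns cover common US formats, not every variant, and a production pipeline needs coverage for all 18 categories (typically via a dedicated de-identification library plus human review):

```python
import re

# Illustrative patterns for a handful of Safe Harbor identifiers.
# A real pipeline must handle all 18 categories and format variants.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def deidentify(text: str) -> str:
    """Replace matched identifiers with bracketed category labels."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running this as a mandatory stage before data enters the training workflow, and failing the pipeline when raw identifiers are detected downstream, is what "building de-identification in" means in practice.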
AI-Specific Considerations
AI systems introduce HIPAA considerations that traditional software does not:
- Model memorization. Large language models can memorize training data. If you fine-tune on PHI, the model itself may become ePHI because it could reproduce patient data. De-identify before training.
- Prompt injection. If your AI system accepts natural language queries about patient data, prompt injection attacks could cause it to reveal PHI that the user is not authorized to see. Input validation and output filtering are essential.
- Inference outputs. Even when inputs are de-identified, AI model outputs may constitute PHI if they can be linked back to individuals. The system's output pipeline needs the same protections as its input pipeline.
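The output-filtering idea above can be sketched as a last-line defense that redacts identifiers the caller is not authorized to see. The member-ID format here is hypothetical, and a real filter would combine pattern matching with the RBAC checks described earlier:

```python
import re

# Hypothetical member-ID format: "M" followed by eight digits.
MEMBER_ID_RE = re.compile(r"\bM\d{8}\b")

def filter_output(text: str, authorized_member_ids: set) -> str:
    """Redact any member ID in model output that the caller
    is not authorized to view."""
    def redact(match):
        member_id = match.group(0)
        return member_id if member_id in authorized_member_ids else "[REDACTED]"
    return MEMBER_ID_RE.sub(redact, text)
```

Output filtering is not a substitute for access controls on the input side; it is defense in depth against prompt injection and unexpected model behavior.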
HIPAA compliance for AI systems is not a checkbox exercise. It is a set of architectural decisions that must be made early, implemented consistently, and maintained continuously. The organizations that treat compliance as a design constraint rather than an afterthought build better systems and avoid the catastrophic cost of breaches.