What HIPAA actually asks for
HIPAA's Privacy Rule and Security Rule together govern protected health information (PHI). For LLM-powered agents this resolves to five engineering deliverables: BAA-covered processing, PHI redaction, minimum-necessary access control, audit logging, and a calibrated eval suite. None of these is optional; all of them are tractable.
The 2024–2026 wave of healthcare AI guidance - HHS Office for Civil Rights bulletins on tracking technologies [1], FDA's Good Machine Learning Practice principles [4], and NIST's AI Risk Management Framework Generative AI Profile [5] - converges on the same engineering checklist. Build to it once; satisfy multiple regulators.
- BAA-covered model providers only
Anthropic Claude on AWS Bedrock, Azure OpenAI, and GCP Vertex AI all offer BAA tiers. Outside these, no PHI.
- PHI redaction at the SDK boundary
Inbound prompts and tool inputs are redacted before they reach the model unless the BAA covers them. PHI tokens get round-tripped via a secure enclave.
- RBAC at retrieval
Tenant + role + clinical relationship on every retrieval query. The patient's chart is filtered server-side before any LLM sees it.
- Audit log per agent action
Append-only log: who asked, what data, what was returned, what happened next. Retained per state retention rules (typically 6 years).
- Calibrated clinical eval suite
Cases include refusal cases for clinical advice, citation cases for any medical claim, and red-team cases for de-identification leakage.
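The redaction bullet above hinges on stable, re-linkable tokens. The sketch below shows the round-trip pattern under loose assumptions: `tokenize` and `rehydrate` are hypothetical names, and an in-memory Map stands in for the secure enclave or KMS-backed vault a production system would use.

```typescript
// Minimal sketch of PHI tokenization with stable, re-linkable tokens.
// The Maps below are illustrative stand-ins for a secure enclave.
const vault = new Map<string, string>();   // token -> raw PHI value
const reverse = new Map<string, string>(); // raw PHI value -> token
let counter = 0;

// Replace a raw identifier with a stable token. The same input always
// yields the same token, so retrieval and audit can re-link records.
function tokenize(raw: string, kind: string): string {
  const existing = reverse.get(raw);
  if (existing) return existing;
  const token = `<${kind}:${(counter++).toString(36).padStart(6, "0")}>`;
  vault.set(token, raw);
  reverse.set(raw, token);
  return token;
}

// Restore tokens in model output only at a BAA-covered or
// customer-controlled sink, never in transit to a non-covered model.
function rehydrate(text: string): string {
  return text.replace(/<[a-z]+:[0-9a-z]{6}>/g, (t) => vault.get(t) ?? t);
}

// Round-trip: redact before the model, re-link after.
const prompt = `Summarize chart for ${tokenize("Jane Doe", "name")}, MRN ${tokenize("443-21-9987", "mrn")}`;
// prompt now carries tokens only; rehydrate(prompt) re-links the raw values.
```

The stable-token property is the point: the audit log and retrieval layer see the same token for the same patient across sessions, without ever storing raw identifiers alongside model traffic.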
Where engineering effort goes in a healthcare agent engagement (share of effort, %). Source: Techimax healthcare engagement scope analysis, 2024–2026

| Work item | Share of effort (%) |
|---|---|
| PHI redaction + BAA wiring | 22 |
| Eval suite (clinical) | 28 |
| RBAC at retrieval | 14 |
| Audit logs + retention | 12 |
| Reviewer queues | 11 |
| Governance docs | 13 |
PHI redaction: what to redact and where
The intuitive redaction pattern - "strip names and dates before sending to the model" - is wrong twice. First, it doesn't strip enough: HIPAA's Safe Harbor standard lists 18 identifier categories. Second, when the BAA covers the model, redaction strips information the model needs to do its job.
The right pattern: classify the deployment surface (BAA-covered? non-covered?) and redact only what isn't covered for that surface. Redact at the SDK boundary, not in application code; replace with stable tokens so retrieval and audit can re-link.
| Surface | Strategy | Reasoning |
|---|---|---|
| Anthropic on Bedrock (BAA) | Pass PHI; log and audit | BAA covers; minimum necessary still applies |
| OpenAI direct API (no BAA in path) | Redact 18 identifiers; tokenize | Not BAA-covered; PHI cannot transit |
| Open-weight on customer GPUs | Pass PHI; encrypt at rest; rotate logs | Customer-owned; covered by infrastructure controls |
| Third-party tool calls | Tokenize before tool call; rehydrate after | Tool vendors typically not BAA-covered |
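The surface table above is easy to encode as a policy map so no call path can reach a model without an explicit redaction decision. A minimal sketch, assuming illustrative surface and strategy names that mirror the table:

```typescript
// Route each model call through the redaction strategy for its
// deployment surface. Surface names and policies are illustrative.
type Surface = "bedrock_baa" | "openai_direct" | "self_hosted" | "third_party_tool";
type Strategy = "pass_phi" | "redact_identifiers" | "tokenize";

const policy: Record<Surface, Strategy> = {
  bedrock_baa: "pass_phi",             // BAA covers; minimum necessary still applies
  openai_direct: "redact_identifiers", // no BAA in path: strip all 18 categories
  self_hosted: "pass_phi",             // customer-owned infrastructure controls apply
  third_party_tool: "tokenize",        // tool vendors typically not BAA-covered
};

function strategyFor(surface: Surface): Strategy {
  return policy[surface];
}

// Hard guard: refuse to let raw PHI transit a non-covered surface.
function assertPhiAllowed(surface: Surface): void {
  if (strategyFor(surface) !== "pass_phi") {
    throw new Error(`PHI must not transit surface "${surface}" unredacted`);
  }
}
```

Making the policy exhaustive over the `Surface` union means adding a new provider without a redaction decision is a compile-time error, not a runtime leak.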
Clinical evals: what 'good' means
Healthcare evals demand a different bar. A general-purpose agent eval might accept a 90% pass rate as production-ready; a clinical-adjacent agent often needs 99%+ on safety-critical refusals (no diagnosis without provider review; no medication advice; cite all clinical claims).
Calibrate against clinician review. We default to: every clinical-adjacent eval graded by an LLM grader, then sampled (typically 10%) for clinician review. Disagreement rate between LLM grader and clinician is reported; > 5% disagreement triggers a re-calibration sprint.
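The disagreement check is a few lines once grades are structured. A sketch under stated assumptions: `GradedCase` and the function names are hypothetical, and `clinicianPass` is set only on the ~10% sampled subset.

```typescript
// Compute LLM-grader vs clinician disagreement on the sampled subset.
// A rate above the 5% threshold triggers a re-calibration sprint.
type GradedCase = {
  llmPass: boolean;        // LLM grader's verdict, set on every case
  clinicianPass?: boolean; // clinician verdict, set on the sampled ~10%
};

function disagreementRate(cases: GradedCase[]): number {
  const sampled = cases.filter((c) => c.clinicianPass !== undefined);
  if (sampled.length === 0) return 0;
  const disagreements = sampled.filter((c) => c.llmPass !== c.clinicianPass).length;
  return disagreements / sampled.length;
}

const RECALIBRATION_THRESHOLD = 0.05;

function needsRecalibration(cases: GradedCase[]): boolean {
  return disagreementRate(cases) > RECALIBRATION_THRESHOLD;
}
```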
Clinical eval pass rates before and after calibration. Source: Techimax healthcare engagement data, 2024–2026

| Eval category | Before calibration (%) | After calibration (%) |
|---|---|---|
| Refusal of clinical advice | 84 | 99 |
| Citation accuracy on medical claims | 71 | 96 |
| PHI de-identification on outputs | 82 | 99 |
| Drug-interaction recognition | 76 | 94 |
Human-in-loop calibration for clinical workflows
Healthcare is the canonical case for human-in-the-loop. The decision rule: agents draft, clinicians decide. Anything that materially affects diagnosis, treatment, medication, or risk classification routes to a human reviewer with a documented SLA. Anything that doesn't (intake summarization, scheduling, administrative notes) can be fully automated with a quality eval gate.
We track three metrics on the reviewer queue: queue depth (alarm at 1.5× SLA), override rate (alarm at any 2× sustained spike), and reviewer time per item (drift indicator for agent quality). Override rate is the canary: when it climbs, the agent is drifting; when it drops below 5%, the agent is over-cautious and routing too conservatively.
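The three alarms reduce to simple threshold checks. A minimal sketch with hypothetical names; the thresholds (1.5× SLA depth, 2× override spike, 5% floor) mirror the text, and the "sustained" qualifier on the spike is simplified to an instantaneous comparison against a trailing baseline.

```typescript
// Reviewer-queue health checks for the three tracked metrics.
type QueueSnapshot = {
  depth: number;            // items currently waiting
  slaCapacity: number;      // items the SLA window can absorb
  overrideRate: number;     // fraction of reviewed items overridden
  baselineOverride: number; // trailing baseline override rate
};

function queueAlarms(q: QueueSnapshot): string[] {
  const alarms: string[] = [];
  if (q.depth > 1.5 * q.slaCapacity) alarms.push("queue_depth");
  if (q.overrideRate >= 2 * q.baselineOverride) alarms.push("override_spike"); // agent drifting
  if (q.overrideRate < 0.05) alarms.push("over_cautious"); // routing too conservatively
  return alarms;
}
```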
| Decision class | Default policy | SLA | Reviewer |
|---|---|---|---|
| Diagnosis suggestion | Required review | < 4h business | Licensed clinician |
| Medication recommendation | Required review | < 1h | Pharmacist or clinician |
| Triage / risk classification | Required review on high-risk | < 30 min | Clinical lead |
| Intake summarization | Sample review | < 24h | Care coordinator |
| Scheduling action | Auto with audit | - | Audit only |
| Patient-facing message | Required review | < 2h | Care coordinator |
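The routing table above is also just data. A sketch under stated assumptions: the `DecisionClass` union values and field names are illustrative, and SLAs are expressed in minutes (`null` for audit-only actions).

```typescript
// The decision-class routing table as data.
type DecisionClass =
  | "diagnosis_suggestion" | "medication_recommendation" | "triage"
  | "intake_summary" | "scheduling" | "patient_message";
type ReviewPolicy = "required" | "required_high_risk" | "sample" | "auto_with_audit";

const routing: Record<DecisionClass, { policy: ReviewPolicy; slaMinutes: number | null }> = {
  diagnosis_suggestion: { policy: "required", slaMinutes: 240 },       // < 4h business
  medication_recommendation: { policy: "required", slaMinutes: 60 },
  triage: { policy: "required_high_risk", slaMinutes: 30 },
  intake_summary: { policy: "sample", slaMinutes: 1440 },
  scheduling: { policy: "auto_with_audit", slaMinutes: null },
  patient_message: { policy: "required", slaMinutes: 120 },
};

// Agents draft, clinicians decide: gate every action on this check.
function requiresHumanReview(cls: DecisionClass, highRisk: boolean): boolean {
  const p = routing[cls].policy;
  return p === "required" || (p === "required_high_risk" && highRisk);
}
```

Keeping the table in code (rather than scattered `if` statements) means the governance document and the runtime behavior come from one artifact, which is exactly what an auditor will ask to see.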
What an audit trail actually contains
OCR investigators ask for specific artifacts. "We have a log" is insufficient. The audit trail must let an investigator reconstruct any decision - what data was retrieved, who could see it, what the agent did, what the clinician decided, when the decision was reversed if applicable. Each event needs an immutable timestamp, an actor (agent ID or user ID), and a payload sufficient to reconstruct context.
Storage: append-only with cryptographic hashing for tamper-evidence (we use Merkle-tree style chaining); 6-year retention default; encrypted at rest with customer-managed keys; access-logged at the row level. The same artifact serves SOC 2 (CC7.2 audit logging), HITRUST, and the OCR's investigation request.
```typescript
// Append-only event store. Every agent action emits one of these
// before the user sees any UI confirmation. Reconstruction is the
// design goal: an investigator must be able to recreate context.

type DecisionClass = string; // e.g. "diagnosis_suggestion", "medication_recommendation"

type AuditEvent = {
  event_id: string;        // UUIDv7 - sortable, immutable
  timestamp: string;       // ISO-8601 UTC
  actor_kind: "agent" | "user" | "system";
  actor_id: string;
  patient_token: string;   // tokenized PHI key - not raw MRN
  agent_id?: string;
  agent_version?: string;
  model_id?: string;
  retrieved_docs: { source: string; doc_id: string; version: string }[];
  prompt_hash: string;
  output_hash: string;
  decision_class: DecisionClass;
  reviewer_id?: string;    // set when human review applied
  override?: boolean;      // reviewer overrode the agent
  prev_event_id: string;   // chain link - Merkle-style integrity
  signature: string;       // HMAC of the canonical event body
};
```

Healthcare agents are not a strategy problem. They're an engineering problem with five known deliverables. Ship the deliverables and the compliance conversation gets shorter.
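The `prev_event_id` and `signature` fields are what make the log tamper-evident. A minimal sketch of the chaining and verification, using Node's `crypto` module; the `ChainEvent` shape is simplified from the full event, and key management (rotation, customer-managed keys) is elided.

```typescript
// Hash-chained append and verification for a tamper-evident audit log.
import { createHmac } from "crypto";

type ChainEvent = { event_id: string; prev_event_id: string; body: string; signature: string };

function sign(body: string, key: string): string {
  return createHmac("sha256", key).update(body).digest("hex");
}

// Append: each event signs its canonical body together with its
// predecessor's id, so reordering or editing breaks the chain.
function appendEvent(chain: ChainEvent[], event_id: string, body: string, key: string): void {
  const prev = chain.length ? chain[chain.length - 1].event_id : "genesis";
  chain.push({
    event_id,
    prev_event_id: prev,
    body,
    signature: sign(`${prev}|${event_id}|${body}`, key),
  });
}

// An investigator (or a nightly job) replays the chain and confirms
// every link: correct predecessor and a valid signature.
function verifyChain(chain: ChainEvent[], key: string): boolean {
  let prev = "genesis";
  for (const e of chain) {
    if (e.prev_event_id !== prev) return false;
    if (e.signature !== sign(`${prev}|${e.event_id}|${e.body}`, key)) return false;
    prev = e.event_id;
  }
  return true;
}
```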
When does an agent become a medical device?
FDA jurisdiction kicks in when an agent's output materially affects diagnosis, treatment, or prevention. Decision support that surfaces information for a clinician to interpret is generally not regulated as a medical device under the 21st Century Cures Act exclusions. Software that recommends a specific course of treatment may be Software as a Medical Device (SaMD) and require clearance [4].
Engineering implication: classify your agent's outputs against the SaMD framework before shipping. Most enterprise healthcare agents we deploy fall outside SaMD because they surface citations and route to clinicians for interpretation. The line moves with how the agent is used, not just what it says - be intentional about the framing in your UI.
References
- [1] HIPAA Security Rule - HHS (2024)
- [2] Anthropic BAA + Trust Center - Anthropic (2025)
- [3] AWS HIPAA-eligible services - AWS (2025)
- [4] FDA Good Machine Learning Practice for medical devices - FDA / Health Canada / MHRA (2024)
- [5] AI Risk Management Framework Generative AI Profile - NIST (2024)
- [6] 21st Century Cures Act information-blocking rule - ONC (2024)
- [7] HITRUST CSF v11 - HITRUST Alliance (2024)
Frequently asked questions
Are open-weight models acceptable for PHI?
Yes, when self-hosted in customer-owned infrastructure with appropriate controls (encryption, RBAC, audit). The BAA question only applies when a third-party vendor processes PHI on your behalf.
What about state laws (CCPA, NY SHIELD)?
Layer on top of HIPAA. State laws typically add notice and rights obligations; the engineering deliverables on this page satisfy most state requirements as long as the audit trail is queryable.
Can we use these patterns for HITRUST certification?
Yes. HITRUST CSF includes the controls covered above. We pair with HITRUST assessors during engagement and have shipped agents into HITRUST-certified environments.
How long does this take?
6–10 weeks for a single clinical-adjacent agent, including governance documentation. We've seen organizations try to compress to 2 weeks; the result is debt that surfaces during audit.
Which model providers offer BAA coverage in 2026?
Anthropic Claude on AWS Bedrock (BAA via AWS), Azure OpenAI (BAA via Microsoft), and GCP Vertex AI (BAA via Google). Direct OpenAI API and direct Anthropic API also have BAA tiers for enterprise customers. Open-weight models self-hosted on customer infrastructure don't require a BAA - no third-party vendor processes PHI on your behalf.
Do we need IRB review for AI features in research contexts?
If the AI is used in research with human subjects, yes - your IRB will want to review. Engineering implication: build the audit log to support research data extracts (de-identified per Safe Harbor or Expert Determination) without re-engineering.
How do we handle minor patient data?
Standard HIPAA pediatric considerations apply (parent/guardian consent, age-out at majority for some states). Engineering implication: tag patient records with age category; route minor-related actions through a stricter reviewer queue; log consent status with each access.
What about TEFCA and information-blocking rules?
ONC's information-blocking rules apply to certified health IT and may govern how an AI agent surfaces or withholds information [6]. Practical pattern: an agent that summarizes records visible to the patient via a portal must not selectively suppress the same information from the patient's view. Build to parity.