
HIPAA-grade agents: a working playbook for healthcare AI

What shipping an LLM-powered agent into a healthcare workflow actually takes - BAA-covered providers, PHI redaction, audit trails that compliance teams accept, and clinical-safety evals.

Techimax Engineering · Forward-deployed engineering team · 14 min read · Updated May 10, 2026

What HIPAA actually asks for

HIPAA's Privacy Rule and Security Rule together govern protected health information (PHI). For LLM-powered agents this resolves to four engineering obligations: covered processing (BAA), minimum-necessary access, audit logging, and breach detection. None of these are optional; all of them are tractable.

The 2024–2026 wave of healthcare AI guidance - HHS Office for Civil Rights bulletins on tracking technologies [1], FDA's Good Machine Learning Practice principles [4], and NIST's AI Risk Management Framework Generative AI Profile [5] - converges on the same engineering checklist. Build to it once; satisfy multiple regulators.

The HIPAA-grade engineering checklist
  • BAA-covered model providers only

    Anthropic Claude on AWS Bedrock, Azure OpenAI, GCP Vertex - all offer BAA tiers. Outside these, no PHI.

  • PHI redaction at the SDK boundary

    Inbound prompts and tool inputs are redacted before they reach the model unless the BAA covers them. PHI tokens get round-tripped via a secure enclave.

  • RBAC at retrieval

    Tenant + role + clinical relationship on every retrieval query. The patient's chart is filtered server-side before any LLM sees it.

  • Audit log per agent action

    Append-only log: who asked, what data, what was returned, what happened next. Retained per state retention rules (typically 6 years).

  • Calibrated clinical eval suite

    The suite includes refusal cases for clinical advice, citation checks for every medical claim, and red-team cases probing de-identification leakage.
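The RBAC-at-retrieval item above can be sketched as a server-side filter applied before any documents reach a model. This is an illustrative sketch, not our production code: the type and function names (`ChartDoc`, `RetrievalRequest`, `filterForRequest`) are hypothetical, and tenant scoping is elided.

```typescript
// Sketch: server-side RBAC filter - role + clinical relationship checked
// before any chart documents are handed to an LLM. Names are illustrative;
// tenant scoping is elided for brevity.
type Role = "clinician" | "care_coordinator" | "billing";

interface ChartDoc {
  docId: string;
  patientToken: string; // tokenized key, never a raw MRN
  sensitivity: "standard" | "restricted";
}

interface RetrievalRequest {
  userRole: Role;
  userPatientPanel: Set<string>; // patients with a documented care relationship
  patientToken: string;
}

function filterForRequest(docs: ChartDoc[], req: RetrievalRequest): ChartDoc[] {
  // 1. Clinical relationship gate: no documented relationship, no documents.
  if (!req.userPatientPanel.has(req.patientToken)) return [];
  // 2. Role gate (minimum necessary): restricted notes for clinicians only.
  return docs.filter(
    (d) =>
      d.patientToken === req.patientToken &&
      (d.sensitivity === "standard" || req.userRole === "clinician")
  );
}
```

The point of the sketch is placement: the filter runs in the retrieval service, so an agent that composes a bad query still only ever sees pre-filtered rows.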

Where HIPAA-grade agent effort lands in our healthcare engagements (last 12 months)

| Workstream | Share of effort (%) |
| --- | --- |
| PHI redaction + BAA wiring | 22 |
| Eval suite (clinical) | 28 |
| RBAC at retrieval | 14 |
| Audit logs + retention | 12 |
| Reviewer queues | 11 |
| Governance docs | 13 |

Source: Techimax healthcare engagement scope analysis, 2024–2026

PHI redaction: what to redact and where

The intuitive redaction pattern - "strip names and dates before sending to the model" - is wrong twice. First, it doesn't strip enough (HIPAA defines 18 identifier categories). Second, when the BAA covers the model, redaction strips information the model needs to do its job.

The right pattern: classify the deployment surface (BAA-covered? non-covered?) and redact only what isn't covered for that surface. Redact at the SDK boundary, not in application code; replace with stable tokens so retrieval and audit can re-link.

| Surface | Strategy | Reasoning |
| --- | --- | --- |
| Anthropic on Bedrock (BAA) | Pass PHI; log and audit | BAA covers; minimum necessary still applies |
| OpenAI direct API (no BAA in path) | Redact 18 identifiers; tokenize | Not BAA-covered; PHI cannot transit |
| Open-weight on customer GPUs | Pass PHI; encrypt at rest; rotate logs | Customer-owned; covered by infrastructure controls |
| Third-party tool calls | Tokenize before tool call; rehydrate after | Tool vendors typically not BAA-covered |

Redaction strategy by deployment surface
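The tokenize-and-rehydrate strategy can be sketched as below. This is a deliberately minimal illustration covering only two of HIPAA's 18 identifier categories (phone numbers and SSNs) with regexes; a production redactor needs all 18 categories and NER-based detection, not regex alone, and the reverse map lives in the secure enclave rather than in process memory.

```typescript
// Sketch: stable, reversible tokenization at the SDK boundary. Same raw
// value always maps to the same token, so retrieval and audit can re-link.
// Regexes cover only phone + SSN here - illustration, not a full redactor.
const patterns: [string, RegExp][] = [
  ["PHONE", /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g],
  ["SSN", /\b\d{3}-\d{2}-\d{4}\b/g],
];

class Tokenizer {
  private forward = new Map<string, string>(); // raw value -> token
  private reverse = new Map<string, string>(); // token -> raw (enclave-held)
  private counter = 0;

  redact(text: string): string {
    let out = text;
    for (const [kind, re] of patterns) {
      out = out.replace(re, (raw) => {
        let token = this.forward.get(raw);
        if (!token) {
          token = `[${kind}_${++this.counter}]`; // stable per raw value
          this.forward.set(raw, token);
          this.reverse.set(token, raw);
        }
        return token;
      });
    }
    return out;
  }

  rehydrate(text: string): string {
    return text.replace(/\[[A-Z]+_\d+\]/g, (t) => this.reverse.get(t) ?? t);
  }
}
```

Stable tokens are the design choice that matters: because `[PHONE_1]` always means the same raw value within a session, the model can still reason about "the same phone number appearing twice" without ever seeing it.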

Clinical evals: what 'good' means

Healthcare evals demand a different bar. A general-purpose agent eval might accept 90% pass-rate as production-ready; a clinical-adjacent agent often needs 99%+ on safety-critical refusals (no diagnosis without provider review; no medication advice; cite all clinical claims).

Calibrate against clinician review. We default to: every clinical-adjacent eval graded by an LLM grader, then sampled (typically 10%) for clinician review. Disagreement rate between LLM grader and clinician is reported; > 5% disagreement triggers a re-calibration sprint.
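The calibration loop above reduces to a small computation. A hedged sketch, with hypothetical names (`GradedCase`, `disagreementRate`): clinician verdicts exist only on the sampled slice, and disagreement above the 5% threshold flags a re-calibration sprint.

```typescript
// Sketch: LLM-grader vs clinician calibration check. Clinician verdicts are
// present only on the sampled ~10% of cases; disagreement above 5% on that
// slice triggers re-calibration.
interface GradedCase {
  caseId: string;
  llmVerdict: "pass" | "fail";
  clinicianVerdict?: "pass" | "fail"; // only set on the sampled slice
}

function disagreementRate(cases: GradedCase[]): number {
  const sampled = cases.filter((c) => c.clinicianVerdict !== undefined);
  if (sampled.length === 0) return 0;
  const disagree = sampled.filter((c) => c.clinicianVerdict !== c.llmVerdict);
  return disagree.length / sampled.length;
}

const RECALIBRATION_THRESHOLD = 0.05;

function needsRecalibration(cases: GradedCase[]): boolean {
  return disagreementRate(cases) > RECALIBRATION_THRESHOLD;
}
```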

Clinical eval pass rates: typical baseline vs post-rescue target (n = 8 healthcare engagements)

| Eval category | Baseline (%) | Target (%) |
| --- | --- | --- |
| Refusal of clinical advice | 84 | 99 |
| Citation accuracy on medical claims | 71 | 96 |
| PHI de-identification on outputs | 82 | 99 |
| Drug-interaction recognition | 76 | 94 |

Source: Techimax healthcare engagement data, 2024–2026

Human-in-loop calibration for clinical workflows

Healthcare is the canonical case for human-in-the-loop. The decision rule: agents draft, clinicians decide. Anything that materially affects diagnosis, treatment, medication, or risk classification routes to a human reviewer with a documented SLA. Anything that doesn't (intake summarization, scheduling, administrative notes) can be fully automated with a quality eval gate.

We track three metrics on the reviewer queue: queue depth (alarm at 1.5× SLA), override rate (alarm at any 2× sustained spike), and reviewer time per item (drift indicator for agent quality). Override rate is the canary: when it climbs, the agent is drifting; when it drops below 5%, the agent is over-cautious and routing too conservatively.

| Decision class | Default policy | SLA | Reviewer |
| --- | --- | --- | --- |
| Diagnosis suggestion | Required review | < 4h business | Licensed clinician |
| Medication recommendation | Required review | < 1h | Pharmacist or clinician |
| Triage / risk classification | Required review on high-risk | < 30 min | Clinical lead |
| Intake summarization | Sample review | < 24h | Care coordinator |
| Scheduling action | Auto with audit | - | Audit only |
| Patient-facing message | Required review | < 2h | Care coordinator |

Default human-in-loop policy by clinical decision class
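The policy table above is the kind of thing worth encoding as data rather than scattering through handler code. A sketch under that assumption, with illustrative names (`ReviewPolicy`, `route`) and SLAs expressed in minutes:

```typescript
// Sketch: human-in-loop policy table as data. Classes, policies, and SLAs
// mirror the table above; names and representation are illustrative.
type DecisionClass =
  | "diagnosis_suggestion"
  | "medication_recommendation"
  | "triage_risk"
  | "intake_summary"
  | "scheduling"
  | "patient_message";

interface ReviewPolicy {
  review: "required" | "required_high_risk" | "sample" | "auto_with_audit";
  slaMinutes?: number; // undefined => no reviewer SLA
  reviewer?: string;
}

const POLICY: Record<DecisionClass, ReviewPolicy> = {
  diagnosis_suggestion:      { review: "required", slaMinutes: 240, reviewer: "licensed_clinician" },
  medication_recommendation: { review: "required", slaMinutes: 60, reviewer: "pharmacist_or_clinician" },
  triage_risk:               { review: "required_high_risk", slaMinutes: 30, reviewer: "clinical_lead" },
  intake_summary:            { review: "sample", slaMinutes: 1440, reviewer: "care_coordinator" },
  scheduling:                { review: "auto_with_audit" },
  patient_message:           { review: "required", slaMinutes: 120, reviewer: "care_coordinator" },
};

function route(cls: DecisionClass, highRisk = false): "reviewer_queue" | "auto" {
  const p = POLICY[cls];
  if (p.review === "required") return "reviewer_queue";
  if (p.review === "required_high_risk") return highRisk ? "reviewer_queue" : "auto";
  return "auto"; // sample / auto_with_audit: agent proceeds, QA samples offline
}
```

Keeping the policy as one table makes it diffable and reviewable by compliance, and the routing function stays a few lines of lookup rather than a thicket of conditionals.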

What an audit trail actually contains

OCR investigators ask for specific artifacts. "We have a log" is insufficient. The audit trail must let an investigator reconstruct any decision - what data was retrieved, who could see it, what the agent did, what the clinician decided, when the decision was reversed if applicable. Each event needs an immutable timestamp, an actor (agent ID or user ID), and a payload sufficient to reconstruct context.

Storage: append-only with cryptographic hashing for tamper-evidence (we use Merkle-tree style chaining); 6-year retention default; encrypted at rest with customer-managed keys; access-logged at the row level. The same artifact serves SOC 2 (CC7.2 audit logging), HITRUST, and the OCR's investigation request.

Audit event schema we ship in HIPAA engagements

```ts
// Append-only event store. Every agent action emits one of these
// before the user sees any UI confirmation. Reconstruction is
// the design goal - an investigator must be able to recreate context.
type AuditEvent = {
  event_id:        string;          // UUIDv7 - sortable, immutable
  timestamp:       string;          // ISO-8601 UTC
  actor_kind:      "agent" | "user" | "system";
  actor_id:        string;
  patient_token:   string;          // tokenized PHI key - not a raw MRN
  agent_id?:       string;
  agent_version?:  string;
  model_id?:       string;
  retrieved_docs:  { source: string; doc_id: string; version: string }[];
  prompt_hash:     string;
  output_hash:     string;
  decision_class:  DecisionClass;   // e.g. diagnosis suggestion - see policy table
  reviewer_id?:    string;          // set when human review applied
  override?:       boolean;         // reviewer overrode the agent
  prev_event_id:   string;          // chain pointer - Merkle-style integrity
  signature:       string;          // HMAC of the canonical event body
};
```
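The `prev_event_id` chaining and `signature` fields can be exercised with a minimal sketch using Node's `crypto` module. Simplified on purpose: one HMAC key and an in-memory array here, where production uses a WORM store and customer-managed keys; the class and method names are illustrative.

```typescript
// Sketch: hash-chained append-only store for the audit schema above.
// Tamper-evidence comes from each signature covering both the payload and
// the link to the previous event.
import { createHmac } from "node:crypto";

interface ChainEvent {
  event_id: string;
  payload: string;        // canonical JSON of the event body
  prev_event_id: string;  // "" for the genesis event
  signature: string;      // HMAC over prev link + payload
}

class AuditChain {
  private events: ChainEvent[] = [];
  constructor(private key: string) {}

  private sign(payload: string, prev: string): string {
    return createHmac("sha256", this.key).update(prev + "|" + payload).digest("hex");
  }

  append(eventId: string, payload: string): void {
    const prev = this.events.at(-1)?.event_id ?? "";
    this.events.push({
      event_id: eventId,
      payload,
      prev_event_id: prev,
      signature: this.sign(payload, prev),
    });
  }

  // Recompute every link: any edited payload or broken prev pointer fails.
  verify(): boolean {
    return this.events.every((e, i) => {
      const expectedPrev = i === 0 ? "" : this.events[i - 1].event_id;
      return (
        e.prev_event_id === expectedPrev &&
        e.signature === this.sign(e.payload, e.prev_event_id)
      );
    });
  }
}
```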

Healthcare agents are not a strategy problem. They're an engineering problem with five known deliverables. Ship the deliverables and the compliance conversation gets shorter.

When does an agent become a medical device?

FDA jurisdiction kicks in when an agent's output materially affects diagnosis, treatment, or prevention. Decision support that surfaces information for a clinician to interpret is generally not regulated as a medical device under the 21st Century Cures Act exclusions. Software that recommends a specific course of treatment may be Software as a Medical Device (SaMD) and require clearance [4].

Engineering implication: classify your agent's outputs against the SaMD framework before shipping. Most enterprise healthcare agents we deploy fall outside SaMD because they surface citations and route to clinicians for interpretation. The line moves with how the agent is used, not just what it says - be intentional about the framing in your UI.

References

  1. [1] HIPAA Security Rule - HHS (2024)
  2. [2] Anthropic BAA + Trust Center - Anthropic (2025)
  3. [3] AWS HIPAA-eligible services - AWS (2025)
  4. [4] FDA Good Machine Learning Practice for medical devices - FDA / Health Canada / MHRA (2024)
  5. [5] AI Risk Management Framework Generative AI Profile - NIST (2024)
  6. [6] 21st Century Cures Act information-blocking rule - ONC (2024)
  7. [7] HITRUST CSF v11 - HITRUST Alliance (2024)

Frequently asked questions

Are open-weight models acceptable for PHI?

Yes when self-hosted in customer-owned infrastructure with appropriate controls (encryption, RBAC, audit). The BAA question only applies when a third-party vendor processes PHI on your behalf.

What about state laws (CCPA, NY SHIELD)?

State laws layer on top of HIPAA. They typically add notice and rights obligations; the engineering deliverables on this page satisfy most state requirements as long as the audit trail is queryable.

Can we use these patterns for HITRUST certification?

Yes. HITRUST CSF includes the controls covered above. We pair with HITRUST assessors during engagement and have shipped agents into HITRUST-certified environments.

How long does this take?

6–10 weeks for a single clinical-adjacent agent, including governance documentation. We've seen organizations try to compress to 2 weeks; the result is debt that surfaces during audit.

Which model providers offer BAA coverage in 2026?

Anthropic Claude on AWS Bedrock (BAA via AWS), Azure OpenAI (BAA via Microsoft), and GCP Vertex AI (BAA via Google). Direct OpenAI API and direct Anthropic API also have BAA tiers for enterprise customers. Open-weight self-hosted on customer infrastructure doesn't require a BAA - the customer is the processor.

Do we need IRB review for AI features in research contexts?

If the AI is used in research with human subjects, yes - your IRB will want to review. Engineering implication: build the audit log to support research data extracts (de-identified per Safe Harbor or Expert Determination) without re-engineering.

How do we handle minor patient data?

Standard HIPAA pediatric considerations apply (parent/guardian consent, age-out at majority for some states). Engineering implication: tag patient records with age category; route minor-related actions through a stricter reviewer queue; log consent status with each access.

What about TEFCA and information-blocking rules?

ONC's information-blocking rules apply to certified health IT and may govern how an AI agent surfaces or withholds information [6]. Practical pattern: an agent that summarizes records visible to the patient via a portal must not selectively suppress the same information from the patient's view. Build to parity.

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch
