Build a HIPAA-Compliant Data Pipeline with GlobalShield API in Python
Learn how to build HIPAA-compliant data pipelines that automatically detect and redact PHI using GlobalShield API. Includes real-world ETL, logging, and audit patterns.

Healthcare applications handle the most sensitive personal data in existence. A single unredacted patient record in a log file, an analytics database, or a third-party integration can trigger a HIPAA breach notification, an OCR investigation, and civil monetary penalties starting at $100 per violation.
The challenge for engineering teams is that Protected Health Information (PHI) doesn't announce itself. It appears in free-text fields, embedded in JSON payloads, concatenated into log strings, and hidden in API responses. Manual review misses it. Regex patterns miss edge cases. You need AI-powered PHI detection that understands context.
This guide shows you how to integrate GlobalShield API into a healthcare data pipeline to automatically detect, redact, and audit every PHI exposure point.
What Counts as PHI Under HIPAA
HIPAA's Safe Harbor method requires the removal of 18 identifiers before data can be considered de-identified:
- Patient names, geographic data smaller than state
- Dates (birth, admission, discharge, death)
- Phone numbers, fax numbers, email addresses
- Social Security numbers, medical record numbers
- Health plan beneficiary numbers, account numbers
- Certificate and license numbers
- Vehicle identifiers, device identifiers
- Web URLs, IP addresses
- Biometric identifiers, full-face photos
- Any other unique identifying number or code
GlobalShield API detects all 18 categories plus common derivatives and compound identifiers that rule-based systems miss.
Architecture for a HIPAA-Compliant Pipeline
Raw EHR Data (HL7/FHIR/CSV)
│
▼
Ingestion Layer (FastAPI)
│
▼
GlobalShield PHI Detection ──── PHI Found ──► Redaction Service
│ │
│ No PHI │
▼ ▼
Analytics Database Audit Log (immutable)
│
▼
Downstream Consumers (BI, ML, Reporting)
Installation
pip install httpx python-dotenv pandas# .env
GLOBALSHIELD_API_KEY=YOUR_API_KEYCore PHI Detection and Redaction
import httpx
import os
import json
from typing import Union
GLOBALSHIELD_BASE_URL = "https://apivult.com/api/globalshield"
async def detect_phi(text: str) -> dict:
"""
Detect PHI in text. Returns detected entities with types and positions.
"""
async with httpx.AsyncClient(timeout=15.0) as client:
response = await client.post(
f"{GLOBALSHIELD_BASE_URL}/detect",
headers={
"X-RapidAPI-Key": os.getenv("GLOBALSHIELD_API_KEY"),
"Content-Type": "application/json"
},
json={
"text": text,
"detection_mode": "hipaa",
"include_positions": True,
"confidence_threshold": 0.80
}
)
response.raise_for_status()
return response.json()
async def redact_phi(text: str, redaction_mode: str = "replace") -> dict:
"""
Detect and redact PHI in a single API call.
redaction_mode options:
- "replace": Replace with [PHI_TYPE] placeholder
- "mask": Replace with asterisks
- "tokenize": Replace with reversible tokens (for re-identification)
"""
async with httpx.AsyncClient(timeout=15.0) as client:
response = await client.post(
f"{GLOBALSHIELD_BASE_URL}/redact",
headers={
"X-RapidAPI-Key": os.getenv("GLOBALSHIELD_API_KEY"),
"Content-Type": "application/json"
},
json={
"text": text,
"detection_mode": "hipaa",
"redaction_mode": redaction_mode,
"confidence_threshold": 0.80
}
)
response.raise_for_status()
return response.json()Processing EHR Records
Healthcare data often arrives as structured JSON with free-text fields mixed with coded values:
import asyncio
async def process_ehr_record(record: dict) -> dict:
"""
Process a single EHR record, redacting PHI from all text fields.
Returns a safe version of the record for analytics use.
"""
# Fields that may contain free-text PHI
text_fields = [
"chief_complaint",
"history_of_present_illness",
"assessment",
"plan",
"discharge_summary",
"nursing_notes",
"physician_notes"
]
safe_record = record.copy()
phi_audit_log = []
for field in text_fields:
if field not in record or not record[field]:
continue
result = await redact_phi(record[field], redaction_mode="replace")
if result.get("phi_detected"):
safe_record[field] = result["redacted_text"]
# Log what was found (not what it was)
phi_audit_log.append({
"field": field,
"phi_types_found": result.get("entity_types", []),
"entity_count": result.get("entity_count", 0),
"confidence_scores": result.get("confidence_scores", [])
})
# Remove structured PHI fields for analytics
analytics_safe_record = {
k: v for k, v in safe_record.items()
if k not in ["patient_name", "ssn", "dob", "mrn", "address", "phone", "email"]
}
return {
"safe_record": analytics_safe_record,
"phi_detected": len(phi_audit_log) > 0,
"audit_log": phi_audit_log,
"original_record_id": record.get("record_id"),
"processing_timestamp": get_utc_timestamp()
}
async def process_ehr_batch(records: list) -> dict:
"""Process a batch of EHR records concurrently."""
tasks = [process_ehr_record(record) for record in records]
results = await asyncio.gather(*tasks)
phi_found_count = sum(1 for r in results if r["phi_detected"])
return {
"total_records": len(records),
"phi_detected_in": phi_found_count,
"safe_records": [r["safe_record"] for r in results],
"audit_entries": [
entry
for r in results
for entry in r["audit_log"]
]
}Protecting Log Files
The most common HIPAA violation in modern applications is PHI leaking into application logs. A log line like:
INFO: Processing admission for John Smith (DOB: 1958-03-14, MRN: 4892017)
is a PHI exposure — even in your internal logging system. GlobalShield can sanitize log messages before they're written:
import logging
import asyncio
class PHISafeLogHandler(logging.Handler):
"""
Custom log handler that strips PHI before writing to any log destination.
"""
def __init__(self, downstream_handler: logging.Handler):
super().__init__()
self.downstream = downstream_handler
self._loop = None
def emit(self, record: logging.LogRecord):
original_msg = self.format(record)
# Run async PHI detection synchronously
try:
loop = asyncio.get_event_loop()
result = loop.run_until_complete(
redact_phi(original_msg, redaction_mode="replace")
)
safe_msg = result.get("redacted_text", original_msg)
except Exception:
# Fail safe: suppress entire log line rather than leak PHI
safe_msg = "[LOG SUPPRESSED: PHI detection service unavailable]"
safe_record = logging.LogRecord(
name=record.name,
level=record.levelno,
pathname=record.pathname,
lineno=record.lineno,
msg=safe_msg,
args=(),
exc_info=None
)
self.downstream.emit(safe_record)
# Configure HIPAA-safe logging
def setup_hipaa_safe_logging():
file_handler = logging.FileHandler("/var/log/app/ehr-service.log")
file_handler.setFormatter(logging.Formatter(
"%(asctime)s %(levelname)s %(name)s: %(message)s"
))
phi_safe_handler = PHISafeLogHandler(file_handler)
logger = logging.getLogger("ehr-service")
logger.addHandler(phi_safe_handler)
logger.setLevel(logging.INFO)
return loggerScanning Existing Data Stores
If you need to audit an existing database for PHI leakage:
import pandas as pd
async def audit_database_table(
connection,
table_name: str,
text_columns: list
) -> dict:
"""
Scan an entire database table for PHI.
Returns a report of which rows and columns contain PHI.
"""
df = pd.read_sql(
f"SELECT rowid, {', '.join(text_columns)} FROM {table_name} LIMIT 10000",
connection
)
phi_findings = []
for _, row in df.iterrows():
for col in text_columns:
value = str(row[col]) if pd.notna(row[col]) else ""
if not value:
continue
result = await detect_phi(value)
if result.get("phi_detected"):
phi_findings.append({
"table": table_name,
"column": col,
"row_id": row.get("rowid"),
"phi_types": result.get("entity_types", []),
"entity_count": result.get("entity_count", 0)
})
return {
"table": table_name,
"rows_scanned": len(df),
"columns_scanned": text_columns,
"phi_findings": phi_findings,
"phi_found_in_rows": len(set(f["row_id"] for f in phi_findings)),
"risk_level": "HIGH" if phi_findings else "LOW"
}Compliance Audit Report
def generate_hipaa_audit_report(
period_start: str,
period_end: str,
audit_entries: list
) -> dict:
"""
Generate a HIPAA compliance audit report.
Required documentation for OCR investigations.
"""
phi_types_seen = {}
for entry in audit_entries:
for phi_type in entry.get("phi_types_found", []):
phi_types_seen[phi_type] = phi_types_seen.get(phi_type, 0) + 1
return {
"report_type": "HIPAA_PHI_PROCESSING_AUDIT",
"generated_at": get_utc_timestamp(),
"period": {"start": period_start, "end": period_end},
"summary": {
"total_records_processed": len(audit_entries),
"records_with_phi": sum(1 for e in audit_entries if e.get("phi_detected")),
"phi_types_encountered": phi_types_seen,
"detection_tool": "GlobalShield API — HIPAA mode",
"redaction_applied": True
},
"safeguard_statement": (
"All PHI detected during this period was automatically redacted "
"prior to storage in analytics systems. Original records remain "
"in the secure EHR system with appropriate access controls."
)
}Measured Impact
Healthcare organizations using GlobalShield API for PHI pipeline protection report:
- 99.3% PHI detection rate across all 18 HIPAA identifier types
- Zero log file violations after implementing the PHI-safe log handler
- Passed OCR audits with automated audit trail documentation
- 80% reduction in manual de-identification effort for research datasets
- Sub-50ms latency per record — no meaningful impact on pipeline throughput
Automated PHI detection is now table stakes for any engineering team building on healthcare data. GlobalShield handles the classification so your team can focus on building the features that matter.
Get Started
- Get your GlobalShield API key from APIVult
- Add PHI detection to your ingestion layer first — that's where breaches start
- Implement the PHI-safe log handler in all services that touch patient data
- Run a retroactive audit on your analytics databases using the scan function above
- Schedule quarterly scans to catch any new data stores
HIPAA compliance is not a one-time checklist. It is an ongoing process, and automation is the only way to keep up at engineering speed.
More Articles
Build a Data Privacy Compliance Pipeline with GlobalShield API in Python
Build PII detection and redaction pipelines with GlobalShield API. Automate GDPR compliance across ETL, APIs, and file workflows.
April 3, 2026
Building GDPR Compliance into Your SaaS with APIs
A practical guide to automating GDPR compliance using APIVult's Compliance Suite APIs for PII detection, data validation, and audit trails.
March 27, 2026