Build a Data Privacy Compliance Pipeline with GlobalShield API in Python
Build PII detection and redaction pipelines with GlobalShield API. Automate GDPR compliance across ETL, APIs, and file workflows.

Every data pipeline that touches customer information is a compliance liability waiting to surface. Customer names in log files. Email addresses in debug outputs. Social security numbers in a CSV export someone emailed to accounting. These aren't hypothetical — they're the kinds of issues that trigger GDPR investigations, generate regulatory fines, and appear in breach notifications.
A data privacy compliance pipeline — one that automatically detects and redacts personally identifiable information (PII) at every layer of your stack — is no longer optional for companies operating in regulated markets. This guide shows you how to build one using the GlobalShield API in Python.
What Is a Data Privacy Compliance Pipeline?
A compliance pipeline intercepts data at defined checkpoints and applies privacy controls before data moves to the next stage. Key checkpoints include:
- Ingestion: Raw data arriving from external sources (webhooks, file uploads, API responses)
- ETL processing: Data transformation stages where PII can be inadvertently duplicated or exposed
- Storage: Database writes, S3 uploads, data warehouse loads
- Egress: API responses, exports, reports, logs
At each checkpoint, the pipeline should: detect PII, apply appropriate handling (redaction, tokenization, or flagging), and log the action for audit purposes.
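The detect → handle → log loop at each checkpoint can be sketched as follows. This is a minimal, API-free illustration: the `find_pii` stub and its regex patterns are stand-ins for the GlobalShield calls introduced later, not the real API.

```python
import re
from datetime import datetime, timezone

# Stand-in detector: in production this would be a GlobalShield API call.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> list:
    """Return (entity_type, matched_text) pairs found in text."""
    return [(etype, m.group()) for etype, rx in PATTERNS.items()
            for m in rx.finditer(text)]

def checkpoint(text: str, stage: str, audit_log: list) -> str:
    """One pipeline checkpoint: detect, redact, and record an audit entry."""
    entities = find_pii(text)
    for etype, value in entities:
        text = text.replace(value, f"[{etype}]")
    audit_log.append({
        "stage": stage,
        "at": datetime.now(timezone.utc).isoformat(),
        "entity_types": [e[0] for e in entities],  # types only, never values
    })
    return text

audit: list = []
clean = checkpoint("Contact jane@example.com, SSN 123-45-6789", "ingestion", audit)
print(clean)  # Contact [EMAIL], SSN [SSN]
```

The same `checkpoint` function can be called at ingestion, before storage writes, and before egress, with the `stage` label keeping the audit trail attributable.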
GlobalShield API Capabilities
The GlobalShield API performs real-time PII detection and redaction across unstructured text, JSON payloads, and file content. It identifies:
- Names, email addresses, phone numbers, postal addresses
- Financial identifiers (account numbers, credit card numbers, IBAN)
- Government identifiers (SSN, passport numbers, national IDs)
- Medical identifiers (MRN, health plan IDs)
- Custom entity types via regex or dictionary rules
Detection confidence is returned per entity, allowing you to apply different handling based on certainty level.
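For example, you might redact high-confidence detections outright but only flag borderline ones for human review. A sketch of such a policy — the thresholds and the high-risk set here are illustrative choices, not API defaults:

```python
def handling_action(entity_type: str, confidence: float) -> str:
    """Map a detection to a handling action based on confidence and risk."""
    # Government and financial IDs are high-risk: redact even borderline hits.
    high_risk = {"GOVERNMENT_ID", "FINANCIAL_ID"}
    if confidence >= 0.9 or entity_type in high_risk:
        return "redact"
    if confidence >= 0.7:
        return "flag_for_review"
    return "ignore"

print(handling_action("GOVERNMENT_ID", 0.55))  # redact
print(handling_action("PERSON", 0.75))         # flag_for_review
```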
Prerequisites
pip install requests python-dotenv boto3 psycopg2-binary
export RAPIDAPI_KEY="YOUR_API_KEY"

Step 1: PII Detection and Redaction
The core function: scan text and redact identified PII.
import requests
import os
import json
from typing import Dict, Any, Optional

RAPIDAPI_KEY = os.environ["RAPIDAPI_KEY"]
GLOBALSHIELD_HOST = "globalshield-pii-detector.p.rapidapi.com"

HEADERS = {
    "X-RapidAPI-Key": RAPIDAPI_KEY,
    "X-RapidAPI-Host": GLOBALSHIELD_HOST,
    "Content-Type": "application/json"
}

def detect_pii(text: str, confidence_threshold: float = 0.7) -> Dict[str, Any]:
    """Detect PII entities in text. Returns detections with positions and confidence."""
    response = requests.post(
        f"https://{GLOBALSHIELD_HOST}/detect",
        headers=HEADERS,
        json={
            "text": text,
            "confidence_threshold": confidence_threshold,
            "entity_types": ["all"]
        },
        timeout=10  # don't let a slow API call stall the pipeline
    )
    response.raise_for_status()
    return response.json()

def redact_pii(
    text: str,
    replacement_style: str = "type_label",  # "asterisks" | "type_label" | "hash"
    confidence_threshold: float = 0.7
) -> Dict[str, Any]:
    """Redact PII from text. Returns redacted text and audit metadata."""
    response = requests.post(
        f"https://{GLOBALSHIELD_HOST}/redact",
        headers=HEADERS,
        json={
            "text": text,
            "replacement_style": replacement_style,
            "confidence_threshold": confidence_threshold
        },
        timeout=10
    )
    response.raise_for_status()
    return response.json()

# Example usage
sample_text = """
Dear John Smith, your account ending in 4521 has been flagged.
Please contact us at [email protected] or call +1 (555) 234-5678.
Your SSN on file is 123-45-6789.
"""

result = redact_pii(sample_text)
print("Redacted text:")
print(result["redacted_text"])
print(f"\nEntities found: {result['entity_count']}")
print(f"Entity types: {[e['type'] for e in result['entities_found']]}")

Output:
Redacted text:
Dear [PERSON], your account ending in [FINANCIAL_ID] has been flagged.
Please contact us at [EMAIL] or call [PHONE_NUMBER].
Your SSN on file is [GOVERNMENT_ID].
Entities found: 5
Entity types: ['PERSON', 'FINANCIAL_ID', 'EMAIL', 'PHONE_NUMBER', 'GOVERNMENT_ID']
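In production, calls to detect_pii and redact_pii should tolerate transient API failures rather than crash a batch. One way to do that is a generic retry helper with exponential backoff — sketched here with an injectable callable so it isn't tied to any particular HTTP client:

```python
import time
from typing import Any, Callable

def with_retries(call: Callable[[], Any], attempts: int = 3,
                 base_delay: float = 0.5) -> Any:
    """Run call(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage sketch: result = with_retries(lambda: redact_pii(sample_text))
```

In a real deployment you would likely narrow the `except` to network and 5xx errors so that a 401 or 400 fails fast instead of retrying.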
Step 2: ETL Pipeline Integration
Wrap the redaction function around your ETL transformation stages:
import logging
from datetime import datetime
from typing import List

# Set up structured audit logging
logging.basicConfig(
    level=logging.INFO,
    format='{"time": "%(asctime)s", "level": "%(levelname)s", "msg": %(message)s}'
)
logger = logging.getLogger("privacy-pipeline")

def process_record_with_privacy(
    record: Dict[str, Any],
    text_fields: List[str],
    record_id: str = "unknown"
) -> tuple:
    """
    Process a data record, redacting PII from specified text fields.
    Returns the sanitized record and an audit trail entry.
    """
    sanitized = dict(record)  # immutable: create a copy, don't modify original
    audit_trail = {
        "record_id": record_id,
        "processed_at": datetime.utcnow().isoformat(),
        "fields_scanned": [],
        "pii_detected": []
    }
    for field in text_fields:
        if field not in record or record[field] is None:
            continue
        field_value = str(record[field])
        audit_trail["fields_scanned"].append(field)
        detection = detect_pii(field_value)
        if detection["entity_count"] > 0:
            # Redact the field
            redaction = redact_pii(field_value)
            sanitized[field] = redaction["redacted_text"]
            # Log what was found (not the actual values)
            pii_summary = {
                "field": field,
                "entity_types": [e["type"] for e in detection["entities"]],
                "entity_count": detection["entity_count"]
            }
            audit_trail["pii_detected"].append(pii_summary)
            logger.info(json.dumps({
                "event": "pii_redacted",
                "record_id": record_id,
                "field": field,
                "entity_types": pii_summary["entity_types"]
            }))
    return sanitized, audit_trail

def process_batch(
    records: List[Dict[str, Any]],
    text_fields: List[str],
    id_field: str = "id"
) -> tuple:
    """Process a batch of records with PII redaction."""
    sanitized_records = []
    all_audits = []
    for record in records:
        record_id = str(record.get(id_field, "unknown"))
        sanitized, audit = process_record_with_privacy(record, text_fields, record_id)
        sanitized_records.append(sanitized)
        all_audits.append(audit)
    total_pii = sum(len(a["pii_detected"]) for a in all_audits)
    logger.info(json.dumps({
        "event": "batch_complete",
        "records_processed": len(records),
        "records_with_pii": sum(1 for a in all_audits if a["pii_detected"]),
        "total_pii_instances": total_pii
    }))
    return sanitized_records, all_audits

Step 3: API Response Sanitization Middleware
Protect outgoing API responses from inadvertently leaking PII:
class PIIRedactionMiddleware:
    """
    Middleware that scans API responses for PII before returning to clients.
    Use this as a last-line-of-defense safety net.
    """
    def __init__(self, redact_fields: Optional[List[str]] = None, full_scan: bool = False):
        """
        redact_fields: specific JSON fields to scan (faster)
        full_scan: serialize entire response and scan as text (comprehensive but slower)
        """
        self.redact_fields = redact_fields or []
        self.full_scan = full_scan

    def sanitize_response(self, response_data: Dict[str, Any]) -> Dict[str, Any]:
        """Sanitize an API response payload."""
        if self.full_scan:
            # Serialize, scan, and rebuild
            serialized = json.dumps(response_data)
            result = redact_pii(serialized)
            if result["entity_count"] > 0:
                logger.warning(json.dumps({
                    "event": "pii_in_api_response",
                    "entity_types": [e["type"] for e in result["entities_found"]],
                    "action": "redacted"
                }))
                return json.loads(result["redacted_text"])
            return response_data
        else:
            # Only scan specified fields
            sanitized = dict(response_data)
            for field in self.redact_fields:
                if field in sanitized and isinstance(sanitized[field], str):
                    result = redact_pii(sanitized[field])
                    sanitized[field] = result["redacted_text"]
            return sanitized

# FastAPI integration example
# from fastapi import FastAPI, Response
# from fastapi.middleware.base import BaseHTTPMiddleware

Step 4: Log Sanitization
Application logs are a major source of accidental PII exposure. Sanitize before writing:
import logging

class PIIRedactingHandler(logging.Handler):
    """
    Custom logging handler that redacts PII from log messages
    before they're written to any output stream.
    """
    def __init__(self, base_handler: logging.Handler, enabled: bool = True):
        super().__init__()
        self.base_handler = base_handler
        self.enabled = enabled
        self._redaction_cache = {}  # Cache results for identical strings

    def emit(self, record: logging.LogRecord):
        if self.enabled:
            message = self.format(record)
            # Cache check to avoid redundant API calls for repeated log lines
            cache_key = hash(message)
            if cache_key not in self._redaction_cache:
                try:
                    result = redact_pii(message, confidence_threshold=0.85)
                    self._redaction_cache[cache_key] = result["redacted_text"]
                except Exception:
                    self._redaction_cache[cache_key] = message  # Fail open, log original
            record.msg = self._redaction_cache[cache_key]
            record.args = ()
        self.base_handler.emit(record)

def setup_privacy_aware_logging():
    """Configure application logging with PII redaction."""
    base_handler = logging.StreamHandler()
    privacy_handler = PIIRedactingHandler(base_handler, enabled=True)
    root_logger = logging.getLogger()
    root_logger.handlers = [privacy_handler]
    root_logger.setLevel(logging.INFO)

Step 5: Audit Report Generation
Generate compliance-ready audit reports from your audit trail:
from collections import Counter
from datetime import datetime

def generate_privacy_audit_report(audit_entries: List[Dict], period: str = "daily") -> Dict:
    """
    Generate a GDPR Article 30-style processing record from audit trail data.
    Suitable for DPA requests and internal compliance reviews.
    """
    total_records = len(audit_entries)
    records_with_pii = [a for a in audit_entries if a["pii_detected"]]
    all_entity_types = []
    for entry in records_with_pii:
        for pii in entry["pii_detected"]:
            all_entity_types.extend(pii["entity_types"])
    entity_counts = Counter(all_entity_types)
    # Guard against an empty batch to avoid dividing by zero
    pii_rate = (
        f"{len(records_with_pii) / total_records * 100:.1f}%" if total_records else "n/a"
    )
    return {
        "report_generated_at": datetime.utcnow().isoformat(),
        "period": period,
        "summary": {
            "total_records_processed": total_records,
            "records_containing_pii": len(records_with_pii),
            "pii_detection_rate": pii_rate,
            "total_pii_instances_redacted": len(all_entity_types)
        },
        "entity_type_breakdown": dict(entity_counts.most_common()),
        "compliance_status": "compliant" if total_records > 0 else "no_data"
    }

Compliance Impact
Companies implementing automated PII detection and redaction pipelines report:
- 60–80% reduction in accidental PII exposure incidents (Gartner, 2025)
- GDPR Article 25 (data protection by design) compliance documentation
- Audit-ready logs for regulatory responses without manual reconstruction
- Faster incident response: knowing exactly where PII appears means faster breach notification scoping
With 2,245 documented GDPR fines on record and enforcement accelerating across both EU and US state privacy laws (19 US states had comprehensive privacy laws in effect as of early 2026), the cost of not having this pipeline in place continues to rise.
Next Steps
- Get your GlobalShield API key at apivult.com
- Identify your highest-risk data flows: log ingestion, webhook handlers, ETL jobs
- Start with detection-only mode to understand your PII exposure baseline
- Layer in redaction at ingestion, then add middleware for egress protection
- Generate your first audit report to establish a compliance baseline
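The detection-only baseline pass suggested above might look like the following. The detector is injected as a callable so the sketch can run against detect_pii from Step 1 or any stand-in; the "entities" response shape follows the fields used earlier in this guide.

```python
from collections import Counter
from typing import Callable, Dict, List

def pii_baseline(texts: List[str], detector: Callable[[str], Dict]) -> Dict:
    """Detection-only pass: count PII by type without altering any data."""
    counts: Counter = Counter()
    scanned = 0
    for text in texts:
        result = detector(text)  # e.g. detect_pii from Step 1
        scanned += 1
        counts.update(e["type"] for e in result.get("entities", []))
    return {"texts_scanned": scanned, "entity_type_counts": dict(counts)}

# Usage sketch with the real detector:
# report = pii_baseline(log_lines, detect_pii)
```

Running this over a sample of logs, webhook payloads, and exports gives you the exposure baseline before you switch any checkpoint into redaction mode.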
A properly implemented data privacy compliance pipeline doesn't just reduce risk — it gives your compliance, legal, and security teams the visibility they need to operate confidently in an increasingly regulated environment.