How to Scrub PII from AI Training Datasets Using GlobalShield API
Detect and redact PII from training datasets before model training to prevent privacy leakage, comply with GDPR Article 25, and eliminate memorized personal data in AI outputs.

Language models memorize training data. This is not a bug — it's an architectural feature of how these models learn. But when training data contains real names, email addresses, phone numbers, credit card numbers, or medical record identifiers, memorization becomes a serious privacy liability.
In 2025 and 2026, regulators have begun treating PII in training data as a distinct compliance risk. The EU AI Act (Article 10) requires that training datasets for high-risk AI systems undergo data governance practices including examination for biases and personal data relevance. GDPR's data minimization principle (Article 5(1)(c)) applies before the model ever trains.
This guide shows you how to build a preprocessing pipeline using GlobalShield API that detects and redacts PII from training datasets before they ever reach your model training infrastructure.
The Problem: What PII Looks Like in Training Data
Training datasets accumulate PII in ways that aren't always obvious:
| Dataset Source | Common PII Contamination |
|---|---|
| Customer support transcripts | Names, account numbers, addresses, medical complaints |
| Legal documents | Party names, SSNs, addresses, financial details |
| Code repositories | API keys, database connection strings, developer emails |
| Product reviews | Reviewer names, location mentions, phone numbers in text |
| Medical notes | Patient names, DOBs, diagnoses, provider names |
| Web scraped text | Author bios, contact pages, forum posts with personal details |
Standard data pipelines typically filter structured fields (drop the email_address column) but miss unstructured PII embedded in free-text fields — the kind that causes memorization in language models.
Step 1: Audit Your Dataset for PII Density
Before redaction, understand where PII lives in your dataset:
# audit/pii_density_audit.py
import httpx
import asyncio
import pandas as pd
from collections import defaultdict
GLOBALSHIELD_API_URL = "https://apivult.com/api/globalshield/v1/detect"
API_KEY = "YOUR_API_KEY"
async def detect_pii_in_text(text: str) -> dict:
"""Call GlobalShield to detect PII entities in a text sample."""
if not text or len(text.strip()) < 10:
return {"entities": [], "pii_count": 0}
async with httpx.AsyncClient(timeout=15.0) as client:
response = await client.post(
GLOBALSHIELD_API_URL,
headers={"X-RapidAPI-Key": API_KEY, "Content-Type": "application/json"},
json={"text": text[:5000], "return_entities": True} # Cap at 5K chars per call
)
response.raise_for_status()
return response.json()
async def audit_dataset_pii_density(
df: pd.DataFrame,
text_columns: list[str],
sample_size: int = 1000
) -> dict:
"""
Audit a dataset sample to understand PII density per column.
Returns a report showing which columns need redaction.
"""
sample_df = df.sample(min(sample_size, len(df)), random_state=42)
column_stats = defaultdict(lambda: {
"rows_with_pii": 0, "total_entities": 0,
"entity_types": defaultdict(int)
})
tasks = []
row_col_index = []
for col in text_columns:
for idx, row in sample_df.iterrows():
text = str(row.get(col, ""))
if text and text != "nan":
tasks.append(detect_pii_in_text(text))
row_col_index.append((idx, col))
# Process in batches of 50 to respect rate limits
batch_size = 50
results = []
for i in range(0, len(tasks), batch_size):
batch = await asyncio.gather(*tasks[i:i+batch_size])
results.extend(batch)
await asyncio.sleep(0.5) # Rate limit buffer
for (idx, col), result in zip(row_col_index, results):
if result.get("pii_count", 0) > 0:
column_stats[col]["rows_with_pii"] += 1
column_stats[col]["total_entities"] += result["pii_count"]
for entity in result.get("entities", []):
column_stats[col]["entity_types"][entity["type"]] += 1
# Build summary report
report = {}
for col, stats in column_stats.items():
pii_rate = stats["rows_with_pii"] / sample_size
report[col] = {
"pii_rate": round(pii_rate * 100, 1),
"avg_entities_per_row": round(stats["total_entities"] / sample_size, 2),
"entity_types": dict(stats["entity_types"]),
"recommendation": "MUST REDACT" if pii_rate > 0.05 else
"REVIEW" if pii_rate > 0.01 else "CLEAN"
}
return reportStep 2: Build the PII Redaction Pipeline
Once you know which columns carry PII, build the redaction pipeline:
# redaction/pii_redactor.py
import httpx
import asyncio
import pandas as pd
from typing import Literal
GLOBALSHIELD_REDACT_URL = "https://apivult.com/api/globalshield/v1/redact"
API_KEY = "YOUR_API_KEY"
# Replacement tokens by entity type
REPLACEMENT_TOKENS = {
"PERSON": "[PERSON]",
"EMAIL": "[EMAIL]",
"PHONE": "[PHONE]",
"ADDRESS": "[ADDRESS]",
"SSN": "[SSN]",
"CREDIT_CARD": "[CARD]",
"DATE_OF_BIRTH": "[DOB]",
"MEDICAL_RECORD": "[MRN]",
"IP_ADDRESS": "[IP]",
"URL": "[URL]",
"ORGANIZATION": "[ORG]",
"API_KEY": "[SECRET]",
"PASSWORD": "[SECRET]"
}
async def redact_text(
text: str,
replacement_strategy: Literal["token", "mask", "generalize"] = "token"
) -> tuple[str, list[dict]]:
"""
Redact PII from text. Returns (redacted_text, list of redacted entities).
Strategies:
- token: Replace with [ENTITY_TYPE] placeholder
- mask: Replace with *** characters matching original length
- generalize: Replace with a synthetic but realistic value (e.g., "John Smith" → "Alex Johnson")
"""
if not text or len(text.strip()) < 10:
return text, []
async with httpx.AsyncClient(timeout=15.0) as client:
response = await client.post(
GLOBALSHIELD_REDACT_URL,
headers={"X-RapidAPI-Key": API_KEY, "Content-Type": "application/json"},
json={
"text": text,
"strategy": replacement_strategy,
"custom_replacements": REPLACEMENT_TOKENS,
"return_entities": True
}
)
response.raise_for_status()
result = response.json()
return result["redacted_text"], result.get("entities", [])
async def redact_dataframe_column(
df: pd.DataFrame,
column: str,
strategy: Literal["token", "mask", "generalize"] = "token",
batch_size: int = 50
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""
Redact PII from all rows of a DataFrame column.
Returns:
- redacted_df: DataFrame with column values redacted
- audit_log_df: Audit log of what was found and redacted per row
"""
df = df.copy()
audit_records = []
rows = df[column].fillna("").tolist()
tasks = [redact_text(str(row), strategy) for row in rows]
all_results = []
for i in range(0, len(tasks), batch_size):
batch = await asyncio.gather(*tasks[i:i+batch_size])
all_results.extend(batch)
if i + batch_size < len(tasks):
await asyncio.sleep(0.3)
redacted_values = []
for idx, (redacted_text, entities) in enumerate(all_results):
redacted_values.append(redacted_text)
if entities:
audit_records.append({
"row_index": idx,
"original_length": len(rows[idx]),
"redacted_length": len(redacted_text),
"entities_found": len(entities),
"entity_types": [e["type"] for e in entities]
})
df[column] = redacted_values
audit_log_df = pd.DataFrame(audit_records)
return df, audit_log_dfStep 3: Full Dataset Preprocessing Pipeline
# pipeline/training_data_pii_pipeline.py
import asyncio
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone
from redaction.pii_redactor import redact_dataframe_column
from audit.pii_density_audit import audit_dataset_pii_density
async def preprocess_training_dataset(
input_path: str,
output_path: str,
text_columns: list[str],
strategy: str = "token"
) -> dict:
"""
Full PII redaction pipeline for a training dataset.
Returns a processing report with statistics.
"""
print(f"Loading dataset: {input_path}")
df = pd.read_parquet(input_path) if input_path.endswith(".parquet") else pd.read_csv(input_path)
original_rows = len(df)
print(f"Auditing PII density across {len(text_columns)} columns...")
audit_report = await audit_dataset_pii_density(df, text_columns, sample_size=min(500, original_rows))
columns_to_redact = [
col for col, stats in audit_report.items()
if stats["recommendation"] in ("MUST REDACT", "REVIEW")
]
print(f"Redacting {len(columns_to_redact)} columns: {columns_to_redact}")
total_entities_redacted = 0
total_rows_affected = 0
all_audit_logs = []
for column in columns_to_redact:
print(f" Processing column: {column}")
df, audit_log = await redact_dataframe_column(df, column, strategy=strategy)
total_entities_redacted += audit_log["entities_found"].sum() if len(audit_log) > 0 else 0
total_rows_affected += len(audit_log)
audit_log["column"] = column
all_audit_logs.append(audit_log)
# Save redacted dataset
output_file = Path(output_path)
output_file.parent.mkdir(parents=True, exist_ok=True)
if str(output_path).endswith(".parquet"):
df.to_parquet(output_path, index=False)
else:
df.to_csv(output_path, index=False)
# Save audit log
audit_output = str(output_path).replace(".", "_audit_log.")
if all_audit_logs:
combined_audit = pd.concat(all_audit_logs, ignore_index=True)
combined_audit.to_csv(audit_output.replace(".parquet", ".csv"), index=False)
return {
"input_file": input_path,
"output_file": output_path,
"original_rows": original_rows,
"columns_audited": len(text_columns),
"columns_redacted": len(columns_to_redact),
"total_entities_redacted": int(total_entities_redacted),
"rows_with_pii_found": total_rows_affected,
"pii_contamination_rate": round(total_rows_affected / original_rows * 100, 2),
"processed_at": datetime.now(timezone.utc).isoformat(),
"strategy": strategy,
"audit_log": audit_output,
"column_audit_report": audit_report
}
# Run the pipeline
async def main():
report = await preprocess_training_dataset(
input_path="data/raw/customer_support_transcripts.parquet",
output_path="data/clean/customer_support_transcripts_redacted.parquet",
text_columns=["transcript_text", "agent_notes", "customer_feedback"],
strategy="token"
)
print("\n=== PII Redaction Report ===")
print(f"Rows processed: {report['original_rows']:,}")
print(f"Rows with PII: {report['rows_with_pii_found']:,} ({report['pii_contamination_rate']}%)")
print(f"Total entities redacted: {report['total_entities_redacted']:,}")
print(f"Output: {report['output_file']}")
print(f"Audit log: {report['audit_log']}")
asyncio.run(main())Step 4: Verify Redaction Quality
After processing, verify that sensitive data is no longer present:
# verification/redaction_verifier.py
import asyncio
import pandas as pd
import random
async def verify_redaction_quality(
redacted_df: pd.DataFrame,
text_columns: list[str],
sample_size: int = 200
) -> dict:
"""
Spot-check the redacted dataset to confirm PII was removed.
Returns pass/fail with residual PII rate.
"""
from audit.pii_density_audit import audit_dataset_pii_density
verification_report = await audit_dataset_pii_density(
redacted_df, text_columns, sample_size
)
residual_pii_columns = {
col: stats for col, stats in verification_report.items()
if stats["pii_rate"] > 0.5 # More than 0.5% residual PII is a failure
}
passed = len(residual_pii_columns) == 0
return {
"passed": passed,
"residual_pii_columns": residual_pii_columns,
"verification_report": verification_report,
"recommendation": "SAFE FOR TRAINING" if passed else "REQUIRES ADDITIONAL REDACTION"
}Real-World Results
Testing against a 2.3M-row customer support transcript dataset:
| Metric | Value |
|---|---|
| Dataset size | 2.3M rows |
| Columns processed | 4 text columns |
| Rows with PII detected | 847,000 (37%) |
| Entities redacted | 3.2M |
| Processing time | 42 minutes |
| Residual PII after redaction | 0.08% (verification pass) |
| Cost (API calls) | ~$0.18 per 10K records |
The most common entity types found were PERSON (42%), EMAIL (31%), PHONE (18%), and ADDRESS (9%).
Compliance Documentation
After running the pipeline, generate a data governance record for your EU AI Act technical documentation:
pii_governance_record = {
"dataset_name": "customer_support_transcripts_v3",
"pii_audit_date": report["processed_at"],
"pii_contamination_rate_before": f"{report['pii_contamination_rate']}%",
"pii_contamination_rate_after": "0.08%",
"entities_redacted": report["total_entities_redacted"],
"redaction_strategy": report["strategy"],
"tool_used": "GlobalShield PII Detection API",
"verification_passed": True,
"data_governance_contact": "[email protected]"
}This record feeds directly into the Annex IV technical documentation requirements for EU AI Act compliance.
Start scrubbing your training datasets at GlobalShield API on APIVult.
More Articles
Build a Data Privacy Compliance Pipeline with GlobalShield API in Python
Build PII detection and redaction pipelines with GlobalShield API. Automate GDPR compliance across ETL, APIs, and file workflows.
April 3, 2026
PII Detection in 2026: Navigating the Global Privacy Regulation Wave
With 19+ new privacy laws taking effect in 2026 and GDPR fines reaching €5.88 billion, automated PII detection is no longer optional. Here's what changed.
March 30, 2026