Education · Last updated April 8, 2026

How to Scrub PII from AI Training Datasets Using GlobalShield API

Detect and redact PII from training datasets before model training to prevent privacy leakage, comply with GDPR Article 25, and eliminate memorized personal data in AI outputs.

Language models memorize training data. This is not a bug — it's an architectural feature of how these models learn. But when training data contains real names, email addresses, phone numbers, credit card numbers, or medical record identifiers, memorization becomes a serious privacy liability.

In 2025 and 2026, regulators have begun treating PII in training data as a distinct compliance risk. The EU AI Act (Article 10) requires that training datasets for high-risk AI systems undergo data governance practices including examination for biases and personal data relevance. GDPR's data minimization principle (Article 5(1)(c)) applies before the model ever trains.

This guide shows you how to build a preprocessing pipeline using GlobalShield API that detects and redacts PII from training datasets before they ever reach your model training infrastructure.

The Problem: What PII Looks Like in Training Data

Training datasets accumulate PII in ways that aren't always obvious:

| Dataset Source | Common PII Contamination |
| --- | --- |
| Customer support transcripts | Names, account numbers, addresses, medical complaints |
| Legal documents | Party names, SSNs, addresses, financial details |
| Code repositories | API keys, database connection strings, developer emails |
| Product reviews | Reviewer names, location mentions, phone numbers in text |
| Medical notes | Patient names, DOBs, diagnoses, provider names |
| Web scraped text | Author bios, contact pages, forum posts with personal details |

Standard data pipelines typically filter structured fields (drop the email_address column) but miss unstructured PII embedded in free-text fields — the kind that causes memorization in language models.
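That gap is easy to demonstrate: dropping a structured column does nothing for the same value embedded in free text. A minimal sketch with toy data (the column names are illustrative):

```python
import re
import pandas as pd

# Toy dataset: a structured email column plus a free-text transcript field
df = pd.DataFrame({
    "email_address": ["jane@example.com"],
    "transcript_text": ["Customer said to reach her at jane@example.com after 5pm."],
})

# Typical pipeline step: drop the structured PII column
df = df.drop(columns=["email_address"])

# The same address survives inside the unstructured text
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
leaked = df["transcript_text"].str.contains(EMAIL_RE).any()
print(leaked)  # True: structured filtering alone missed the embedded email
```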

Step 1: Audit Your Dataset for PII Density

Before redaction, understand where PII lives in your dataset:

# audit/pii_density_audit.py
import httpx
import asyncio
import pandas as pd
from collections import defaultdict
 
GLOBALSHIELD_API_URL = "https://apivult.com/api/globalshield/v1/detect"
API_KEY = "YOUR_API_KEY"
 
async def detect_pii_in_text(text: str) -> dict:
    """Call GlobalShield to detect PII entities in a text sample."""
    if not text or len(text.strip()) < 10:
        return {"entities": [], "pii_count": 0}
 
    async with httpx.AsyncClient(timeout=15.0) as client:
        response = await client.post(
            GLOBALSHIELD_API_URL,
            headers={"X-RapidAPI-Key": API_KEY, "Content-Type": "application/json"},
            json={"text": text[:5000], "return_entities": True}  # Cap at 5K chars per call
        )
        response.raise_for_status()
        return response.json()
 
async def audit_dataset_pii_density(
    df: pd.DataFrame,
    text_columns: list[str],
    sample_size: int = 1000
) -> dict:
    """
    Audit a dataset sample to understand PII density per column.
    Returns a report showing which columns need redaction.
    """
    sample_df = df.sample(min(sample_size, len(df)), random_state=42)
    
    column_stats = defaultdict(lambda: {
        "rows_with_pii": 0, "total_entities": 0,
        "entity_types": defaultdict(int)
    })
 
    tasks = []
    row_col_index = []
 
    for col in text_columns:
        for idx, row in sample_df.iterrows():
            text = str(row.get(col, ""))
            if text and text != "nan":
                tasks.append(detect_pii_in_text(text))
                row_col_index.append((idx, col))
 
    # Process in batches of 50 to respect rate limits
    batch_size = 50
    results = []
    for i in range(0, len(tasks), batch_size):
        batch = await asyncio.gather(*tasks[i:i+batch_size])
        results.extend(batch)
        await asyncio.sleep(0.5)  # Rate limit buffer
 
    for (idx, col), result in zip(row_col_index, results):
        if result.get("pii_count", 0) > 0:
            column_stats[col]["rows_with_pii"] += 1
            column_stats[col]["total_entities"] += result["pii_count"]
            for entity in result.get("entities", []):
                column_stats[col]["entity_types"][entity["type"]] += 1
 
    # Build summary report
    report = {}
    n_sampled = len(sample_df)  # may be smaller than sample_size for small datasets
    for col, stats in column_stats.items():
        pii_rate = stats["rows_with_pii"] / n_sampled
        report[col] = {
            "pii_rate": round(pii_rate * 100, 1),
            "avg_entities_per_row": round(stats["total_entities"] / n_sampled, 2),
            "entity_types": dict(stats["entity_types"]),
            "recommendation": "MUST REDACT" if pii_rate > 0.05 else
                             "REVIEW" if pii_rate > 0.01 else "CLEAN"
        }
 
    return report
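The audit above sleeps between batches, but individual calls can still fail transiently or hit rate limits. A minimal retry-with-exponential-backoff wrapper is one way to harden the HTTP calls; this is a generic sketch (the simulated failure below stands in for a real 429 from the API):

```python
import asyncio

async def with_backoff(coro_factory, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry an async call with exponential backoff.

    coro_factory is a zero-argument callable returning a fresh coroutine,
    so the request can be re-issued on each attempt.
    """
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))

# Usage sketch with a stand-in for an API call that fails twice, then succeeds
async def demo():
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("simulated 429")
        return {"pii_count": 0}

    return await with_backoff(flaky, base_delay=0.01)

result = asyncio.run(demo())
print(result)  # {'pii_count': 0} after two simulated failures
```

In production you would wrap `detect_pii_in_text` or `redact_text` with this, retrying only on rate-limit and timeout errors rather than every exception.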

Step 2: Build the PII Redaction Pipeline

Once you know which columns carry PII, build the redaction pipeline:

# redaction/pii_redactor.py
import httpx
import asyncio
import pandas as pd
from typing import Literal
 
GLOBALSHIELD_REDACT_URL = "https://apivult.com/api/globalshield/v1/redact"
API_KEY = "YOUR_API_KEY"
 
# Replacement tokens by entity type
REPLACEMENT_TOKENS = {
    "PERSON": "[PERSON]",
    "EMAIL": "[EMAIL]",
    "PHONE": "[PHONE]",
    "ADDRESS": "[ADDRESS]",
    "SSN": "[SSN]",
    "CREDIT_CARD": "[CARD]",
    "DATE_OF_BIRTH": "[DOB]",
    "MEDICAL_RECORD": "[MRN]",
    "IP_ADDRESS": "[IP]",
    "URL": "[URL]",
    "ORGANIZATION": "[ORG]",
    "API_KEY": "[SECRET]",
    "PASSWORD": "[SECRET]"
}
 
async def redact_text(
    text: str,
    replacement_strategy: Literal["token", "mask", "generalize"] = "token"
) -> tuple[str, list[dict]]:
    """
    Redact PII from text. Returns (redacted_text, list of redacted entities).
    
    Strategies:
    - token: Replace with [ENTITY_TYPE] placeholder
    - mask: Replace with *** characters matching original length
    - generalize: Replace with a synthetic but realistic value (e.g., "John Smith" → "Alex Johnson")
    """
    if not text or len(text.strip()) < 10:
        return text, []
 
    async with httpx.AsyncClient(timeout=15.0) as client:
        response = await client.post(
            GLOBALSHIELD_REDACT_URL,
            headers={"X-RapidAPI-Key": API_KEY, "Content-Type": "application/json"},
            json={
                "text": text,
                "strategy": replacement_strategy,
                "custom_replacements": REPLACEMENT_TOKENS,
                "return_entities": True
            }
        )
        response.raise_for_status()
        result = response.json()
 
    return result["redacted_text"], result.get("entities", [])
 
async def redact_dataframe_column(
    df: pd.DataFrame,
    column: str,
    strategy: Literal["token", "mask", "generalize"] = "token",
    batch_size: int = 50
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Redact PII from all rows of a DataFrame column.
    
    Returns:
    - redacted_df: DataFrame with column values redacted
    - audit_log_df: Audit log of what was found and redacted per row
    """
    df = df.copy()
    audit_records = []
 
    rows = df[column].fillna("").tolist()
    tasks = [redact_text(str(row), strategy) for row in rows]
 
    all_results = []
    for i in range(0, len(tasks), batch_size):
        batch = await asyncio.gather(*tasks[i:i+batch_size])
        all_results.extend(batch)
        if i + batch_size < len(tasks):
            await asyncio.sleep(0.3)
 
    redacted_values = []
    for idx, (redacted_text, entities) in enumerate(all_results):
        redacted_values.append(redacted_text)
        if entities:
            audit_records.append({
                "row_index": idx,
                "original_length": len(rows[idx]),
                "redacted_length": len(redacted_text),
                "entities_found": len(entities),
                "entity_types": [e["type"] for e in entities]
            })
 
    df[column] = redacted_values
    audit_log_df = pd.DataFrame(audit_records)
    return df, audit_log_df
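If the detect endpoint returns character offsets for each entity, the mask strategy can also be reproduced client-side as a fallback. A sketch under the assumption that entities carry start/end offsets (the field names here are assumptions, not the documented response schema):

```python
def mask_entities(text: str, entities: list[dict]) -> str:
    """Replace each entity span with '*' characters of the same length,
    preserving overall text length and surrounding context."""
    chars = list(text)
    for ent in entities:
        for i in range(ent["start"], ent["end"]):
            chars[i] = "*"
    return "".join(chars)

sample = "Call Jane Doe at 555-0147."
found = [{"type": "PERSON", "start": 5, "end": 13},
         {"type": "PHONE", "start": 17, "end": 25}]
masked = mask_entities(sample, found)
print(masked)  # Call ******** at ********.
```

Length-preserving masks keep token positions stable, which can matter if downstream tooling aligns annotations to character offsets.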

Step 3: Full Dataset Preprocessing Pipeline

# pipeline/training_data_pii_pipeline.py
import asyncio
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone
 
from redaction.pii_redactor import redact_dataframe_column
from audit.pii_density_audit import audit_dataset_pii_density
 
async def preprocess_training_dataset(
    input_path: str,
    output_path: str,
    text_columns: list[str],
    strategy: str = "token"
) -> dict:
    """
    Full PII redaction pipeline for a training dataset.
    
    Returns a processing report with statistics.
    """
    print(f"Loading dataset: {input_path}")
    df = pd.read_parquet(input_path) if input_path.endswith(".parquet") else pd.read_csv(input_path)
    original_rows = len(df)
 
    print(f"Auditing PII density across {len(text_columns)} columns...")
    audit_report = await audit_dataset_pii_density(df, text_columns, sample_size=min(500, original_rows))
 
    columns_to_redact = [
        col for col, stats in audit_report.items()
        if stats["recommendation"] in ("MUST REDACT", "REVIEW")
    ]
 
    print(f"Redacting {len(columns_to_redact)} columns: {columns_to_redact}")
 
    total_entities_redacted = 0
    total_rows_affected = 0
    all_audit_logs = []
 
    for column in columns_to_redact:
        print(f"  Processing column: {column}")
        df, audit_log = await redact_dataframe_column(df, column, strategy=strategy)
        total_entities_redacted += audit_log["entities_found"].sum() if len(audit_log) > 0 else 0
        total_rows_affected += len(audit_log)
        audit_log["column"] = column
        all_audit_logs.append(audit_log)
 
    # Save redacted dataset
    output_file = Path(output_path)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    if str(output_path).endswith(".parquet"):
        df.to_parquet(output_path, index=False)
    else:
        df.to_csv(output_path, index=False)
 
    # Save audit log next to the dataset (same stem, always CSV)
    audit_output = str(output_file.with_suffix("")) + "_audit_log.csv"
    if all_audit_logs:
        combined_audit = pd.concat(all_audit_logs, ignore_index=True)
        combined_audit.to_csv(audit_output, index=False)
 
    return {
        "input_file": input_path,
        "output_file": output_path,
        "original_rows": original_rows,
        "columns_audited": len(text_columns),
        "columns_redacted": len(columns_to_redact),
        "total_entities_redacted": int(total_entities_redacted),
        "rows_with_pii_found": total_rows_affected,
        "pii_contamination_rate": round(total_rows_affected / original_rows * 100, 2),
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "strategy": strategy,
        "audit_log": audit_output,
        "column_audit_report": audit_report
    }
 
# Run the pipeline
async def main():
    report = await preprocess_training_dataset(
        input_path="data/raw/customer_support_transcripts.parquet",
        output_path="data/clean/customer_support_transcripts_redacted.parquet",
        text_columns=["transcript_text", "agent_notes", "customer_feedback"],
        strategy="token"
    )
 
    print("\n=== PII Redaction Report ===")
    print(f"Rows processed: {report['original_rows']:,}")
    print(f"Rows with PII: {report['rows_with_pii_found']:,} ({report['pii_contamination_rate']}%)")
    print(f"Total entities redacted: {report['total_entities_redacted']:,}")
    print(f"Output: {report['output_file']}")
    print(f"Audit log: {report['audit_log']}")
 
if __name__ == "__main__":
    asyncio.run(main())

Step 4: Verify Redaction Quality

After processing, verify that sensitive data is no longer present:

# verification/redaction_verifier.py
import pandas as pd
 
async def verify_redaction_quality(
    redacted_df: pd.DataFrame,
    text_columns: list[str],
    sample_size: int = 200
) -> dict:
    """
    Spot-check the redacted dataset to confirm PII was removed.
    Returns pass/fail with residual PII rate.
    """
    from audit.pii_density_audit import audit_dataset_pii_density
    
    verification_report = await audit_dataset_pii_density(
        redacted_df, text_columns, sample_size
    )
 
    residual_pii_columns = {
        col: stats for col, stats in verification_report.items()
        if stats["pii_rate"] > 0.5  # More than 0.5% residual PII is a failure
    }
 
    passed = len(residual_pii_columns) == 0
    return {
        "passed": passed,
        "residual_pii_columns": residual_pii_columns,
        "verification_report": verification_report,
        "recommendation": "SAFE FOR TRAINING" if passed else "REQUIRES ADDITIONAL REDACTION"
    }
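A cheap offline complement to the API-based check is a regex spot-check for high-signal patterns such as emails and SSN-shaped strings; it catches gross redaction failures without extra API calls. A minimal sketch (the patterns are illustrative, not exhaustive):

```python
import re
import pandas as pd

RESIDUAL_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def regex_spot_check(df: pd.DataFrame, text_columns: list[str]) -> dict:
    """Count rows per column that still match high-signal PII patterns."""
    hits = {}
    for col in text_columns:
        series = df[col].fillna("").astype(str)
        hits[col] = {
            name: int(series.str.contains(pat).sum())
            for name, pat in RESIDUAL_PATTERNS.items()
        }
    return hits

redacted = pd.DataFrame({"transcript_text": [
    "Reach me at [EMAIL] tomorrow.",
    "SSN on file: 123-45-6789",  # a miss the API pass should have caught
]})
print(regex_spot_check(redacted, ["transcript_text"]))
# {'transcript_text': {'EMAIL': 0, 'SSN': 1}}
```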

Real-World Results

Testing against a 2.3M-row customer support transcript dataset:

| Metric | Value |
| --- | --- |
| Dataset size | 2.3M rows |
| Columns processed | 4 text columns |
| Rows with PII detected | 847,000 (37%) |
| Entities redacted | 3.2M |
| Processing time | 42 minutes |
| Residual PII after redaction | 0.08% (verification pass) |
| Cost (API calls) | ~$0.18 per 10K records |

The most common entity types found were PERSON (42%), EMAIL (31%), PHONE (18%), and ADDRESS (9%).
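A distribution like this can be recomputed from the per-row audit log produced in Step 2, since each record stores entity_types as a list. A sketch against that schema with toy data:

```python
import pandas as pd

# Toy audit log with the schema emitted by redact_dataframe_column
audit_log = pd.DataFrame({
    "row_index": [0, 1, 2],
    "entities_found": [2, 1, 2],
    "entity_types": [["PERSON", "EMAIL"], ["PERSON"], ["PHONE", "PERSON"]],
})

# One row per redacted entity, then each type's share of the total
counts = audit_log["entity_types"].explode().value_counts()
shares = (counts / counts.sum() * 100).round(1)
print(shares.to_dict())  # {'PERSON': 60.0, 'EMAIL': 20.0, 'PHONE': 20.0}
```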

Compliance Documentation

After running the pipeline, generate a data governance record for your EU AI Act technical documentation:

pii_governance_record = {
    "dataset_name": "customer_support_transcripts_v3",
    "pii_audit_date": report["processed_at"],
    "pii_contamination_rate_before": f"{report['pii_contamination_rate']}%",
    "pii_contamination_rate_after": "0.08%",
    "entities_redacted": report["total_entities_redacted"],
    "redaction_strategy": report["strategy"],
    "tool_used": "GlobalShield PII Detection API",
    "verification_passed": True,
    "data_governance_contact": "[email protected]"
}
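Before filing the record, it is worth validating it and serializing it deterministically. A minimal sketch; the REQUIRED_KEYS set is an assumption about what such a record minimally needs, not a regulatory checklist:

```python
import json

REQUIRED_KEYS = {"dataset_name", "pii_audit_date", "entities_redacted",
                 "redaction_strategy", "verification_passed"}

def serialize_governance_record(record: dict) -> str:
    """Validate required fields and emit a stable, sorted JSON artifact."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"governance record missing fields: {sorted(missing)}")
    return json.dumps(record, indent=2, sort_keys=True)

record = {
    "dataset_name": "customer_support_transcripts_v3",
    "pii_audit_date": "2026-04-08T00:00:00+00:00",
    "entities_redacted": 3200000,
    "redaction_strategy": "token",
    "verification_passed": True,
}
artifact = serialize_governance_record(record)
print(json.loads(artifact)["dataset_name"])
```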

This record feeds directly into the Annex IV technical documentation requirements for EU AI Act compliance.

Start scrubbing your training datasets at GlobalShield API on APIVult.