Education · Last updated April 11, 2026

How to Automate GDPR Data Subject Access Requests in Python with GlobalShield API

Build a complete automated DSAR (Data Subject Access Request) pipeline in Python using GlobalShield API. Scan, identify, and redact PII across your data systems in under 30 minutes.


GDPR Data Subject Access Requests (DSARs) are one of the most operationally painful compliance obligations for engineering teams. A user emails support saying "please send me all personal data you hold about me" — and suddenly your team needs to query databases, search log files, scan document stores, and compile a complete, accurate response within one month. The UK ICO fined Reddit £14 million in 2026 partly because of systemic failures in data protection workflows. Manual DSAR handling at scale is the same kind of systemic failure waiting to happen.

This guide walks through building a fully automated DSAR pipeline in Python using GlobalShield API to detect and identify PII across your data systems — so that when a DSAR arrives, you can respond accurately and on time without a manual fire drill.

What a DSAR Pipeline Needs to Do

Under GDPR Article 15, when a verified data subject submits an access request, you must:

  1. Identify all personal data held about that individual across all systems
  2. Compile a complete record of what data you hold, where it's stored, and why
  3. Respond within one month (extendable by two further months for complex requests, with notification to the data subject)
  4. Provide a copy of the personal data in a portable format
  5. Explain the processing: purposes, categories, recipients, retention periods

The challenge is that personal data is rarely in one place. It's spread across your production database, support ticketing system, email archives, analytics logs, document storage, and third-party integrations. Manually querying each system is slow, error-prone, and doesn't scale.

The automated approach: build a pipeline that, given an email address or user ID, systematically searches all data stores, uses PII detection to surface personal data, and compiles a structured report.

Architecture Overview

DSAR Request Received
        │
        ▼
Identity Verification
        │
        ▼
Data Source Enumeration (DB, logs, docs, S3)
        │
        ▼
GlobalShield API PII Detection (per record/document)
        │
        ▼
PII Mapping & Attribution
        │
        ▼
DSAR Report Generation (PDF/JSON)
        │
        ▼
Response to Data Subject

The critical step is the PII detection layer. GlobalShield API can scan unstructured text — support tickets, log entries, documents — and identify which fields contain personal data, what type of PII it is, and the confidence level of the detection. This transforms what would be a manual review into an automated scan.
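The exact response schema depends on the GlobalShield plan and version you're on, but the pipeline below assumes a shape along these lines. The field names (`pii_detected`, `entities`, `redacted_text`) match how the code in this guide consumes the response; the values are invented for illustration:

```python
# Hypothetical /detect response for a support ticket. Field names follow the
# way the pipeline code in this guide reads them; values are illustrative.
sample_result = {
    "pii_detected": True,
    "entities": [
        {"type": "email", "confidence": 0.98, "redacted_value": "a***@example.com"},
        {"type": "phone", "confidence": 0.81, "redacted_value": "+44 *** *** 123"},
    ],
    "redacted_text": "Contact a***@example.com or +44 *** *** 123 about the refund.",
}

def summarize_entities(result: dict, min_confidence: float = 0.75) -> dict:
    """Count detected PII entities by type, keeping only confident hits."""
    counts: dict = {}
    for entity in result.get("entities", []):
        if entity.get("confidence", 0.0) >= min_confidence:
            counts[entity["type"]] = counts.get(entity["type"], 0) + 1
    return counts

print(summarize_entities(sample_result))  # {'email': 1, 'phone': 1}
```

Filtering on `confidence` client-side is a useful belt-and-braces check even when you set `confidence_threshold` in the request itself.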

Prerequisites

pip install requests boto3 psycopg2-binary python-dotenv reportlab

You'll need:

  • A GlobalShield API key from RapidAPI
  • Access credentials for your data sources (PostgreSQL, S3, etc.)
  • Python 3.10+

Step 1: Set Up the GlobalShield PII Scanner

import requests
import os
from typing import Optional
 
class GlobalShieldScanner:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://globalshield.p.rapidapi.com"
        self.headers = {
            "x-rapidapi-host": "globalshield.p.rapidapi.com",
            "x-rapidapi-key": api_key,
            "Content-Type": "application/json"
        }
    
    def scan_text(self, text: str, source_label: str = "") -> dict:
        """Scan text for PII entities."""
        response = requests.post(
            f"{self.base_url}/detect",
            json={
                "text": text,
                "entities": [
                    "email", "phone", "name", "address", "dob",
                    "ssn", "passport", "ip_address", "financial",
                    "health", "biometric"
                ],
                "confidence_threshold": 0.75,
                "include_context": True
            },
            headers=self.headers,
            timeout=30
        )
        response.raise_for_status()  # fail fast on auth, quota, or server errors
        result = response.json()
        result["source"] = source_label
        return result
    
    def scan_batch(self, records: list[dict]) -> list[dict]:
        """Scan multiple records, each with 'id', 'text', and 'source' fields."""
        results = []
        for record in records:
            scan = self.scan_text(record["text"], record.get("source", "unknown"))
            scan["record_id"] = record["id"]
            results.append(scan)
        return results

Step 2: Query Your Data Sources

import psycopg2
import boto3
import json
from datetime import datetime
 
class DataSourceConnector:
    def __init__(self, db_dsn: str, s3_bucket: str, aws_region: str):
        self.db_dsn = db_dsn
        self.s3_bucket = s3_bucket
        self.s3_client = boto3.client("s3", region_name=aws_region)
    
    def get_user_database_records(self, email: str) -> list[dict]:
        """Fetch all database records associated with this email."""
        conn = psycopg2.connect(self.db_dsn)
        cur = conn.cursor()
        
        records = []
        
        # Users table
        cur.execute(
            "SELECT id, email, name, phone, address, created_at FROM users WHERE email = %s",
            (email,)
        )
        for row in cur.fetchall():
            records.append({
                "id": f"users_{row[0]}",
                "text": f"name: {row[2]}, email: {row[1]}, phone: {row[3]}, address: {row[4]}",
                "source": "database/users",
                "created_at": str(row[5])
            })
        
        # Support tickets
        cur.execute(
            "SELECT id, content, created_at FROM support_tickets WHERE user_email = %s",
            (email,)
        )
        for row in cur.fetchall():
            records.append({
                "id": f"ticket_{row[0]}",
                "text": row[1],
                "source": "database/support_tickets",
                "created_at": str(row[2])
            })
        
        cur.close()
        conn.close()
        return records
    
    def get_user_documents(self, user_id: str) -> list[dict]:
        """Fetch documents from S3 associated with this user."""
        documents = []
        
        # List objects with user prefix
        response = self.s3_client.list_objects_v2(
            Bucket=self.s3_bucket,
            Prefix=f"users/{user_id}/"
        )
        
        for obj in response.get("Contents", []):
            doc_response = self.s3_client.get_object(
                Bucket=self.s3_bucket,
                Key=obj["Key"]
            )
            content = doc_response["Body"].read().decode("utf-8", errors="ignore")
            documents.append({
                "id": obj["Key"],
                "text": content[:5000],  # Limit per document
                "source": f"s3/{obj['Key']}",
                "last_modified": str(obj["LastModified"])
            })
        
        return documents

Step 3: Build the DSAR Pipeline

from dataclasses import dataclass, field
from typing import List
 
@dataclass
class DSARReport:
    subject_email: str
    request_date: str
    data_found: List[dict] = field(default_factory=list)
    pii_summary: dict = field(default_factory=dict)
    systems_searched: List[str] = field(default_factory=list)
    
    def to_dict(self) -> dict:
        return {
            "report_type": "GDPR Data Subject Access Request",
            "subject_email": self.subject_email,
            "request_date": self.request_date,
            "generated_at": datetime.utcnow().isoformat(),
            "systems_searched": self.systems_searched,
            "pii_summary": self.pii_summary,
            "data_records": self.data_found
        }
 
class DSARPipeline:
    def __init__(
        self,
        globalshield_api_key: str,
        db_dsn: str,
        s3_bucket: str,
        aws_region: str = "us-east-1"
    ):
        self.scanner = GlobalShieldScanner(globalshield_api_key)
        self.connector = DataSourceConnector(db_dsn, s3_bucket, aws_region)
    
    def process_dsar(self, subject_email: str, user_id: Optional[str] = None) -> DSARReport:
        report = DSARReport(
            subject_email=subject_email,
            request_date=datetime.utcnow().isoformat()
        )
        
        # Step 1: Collect records from all data sources
        print(f"[DSAR] Processing request for {subject_email}")
        
        db_records = self.connector.get_user_database_records(subject_email)
        report.systems_searched.append("PostgreSQL database")
        print(f"  → Found {len(db_records)} database records")
        
        doc_records = []
        if user_id:
            doc_records = self.connector.get_user_documents(user_id)
            report.systems_searched.append("S3 document store")
            print(f"  → Found {len(doc_records)} documents")
        
        all_records = db_records + doc_records
        
        # Step 2: Scan all records for PII
        print(f"[DSAR] Scanning {len(all_records)} records for PII...")
        scan_results = self.scanner.scan_batch(all_records)
        
        # Step 3: Build PII map
        pii_by_type: dict = {}
        records_with_pii = []
        
        for result in scan_results:
            if result.get("pii_detected"):
                for entity in result.get("entities", []):
                    entity_type = entity["type"]
                    if entity_type not in pii_by_type:
                        pii_by_type[entity_type] = []
                    pii_by_type[entity_type].append({
                        "source": result["source"],
                        "record_id": result["record_id"],
                        "value_redacted": entity.get("redacted_value", "[REDACTED]"),
                        "confidence": entity["confidence"]
                    })
                
                records_with_pii.append({
                    "record_id": result["record_id"],
                    "source": result["source"],
                    "pii_types_found": [e["type"] for e in result.get("entities", [])],
                    "record_summary": result.get("redacted_text", "")[:500]
                })
        
        report.pii_summary = {
            "total_records_searched": len(all_records),
            "records_containing_pii": len(records_with_pii),
            "pii_categories_found": list(pii_by_type.keys()),
            "pii_by_category": pii_by_type
        }
        report.data_found = records_with_pii
        
        return report
 
# Run the pipeline
pipeline = DSARPipeline(
    globalshield_api_key=os.getenv("GLOBALSHIELD_API_KEY"),
    db_dsn=os.getenv("DATABASE_URL"),
    s3_bucket=os.getenv("S3_BUCKET"),
    aws_region="us-east-1"
)
 
report = pipeline.process_dsar(
    subject_email="[email protected]",
    user_id="usr_12345"
)
 
# Output the DSAR report
import json
print(json.dumps(report.to_dict(), indent=2, default=str))

Step 4: Generate the Response Document

def save_dsar_report(report: DSARReport, output_path: str) -> None:
    """Save DSAR report as JSON for review and potential PDF conversion."""
    report_dict = report.to_dict()
    
    with open(output_path, "w") as f:
        json.dump(report_dict, f, indent=2, default=str)
    
    print(f"\n✅ DSAR Report saved to {output_path}")
    print(f"   Records searched: {report.pii_summary['total_records_searched']}")
    print(f"   Records with PII: {report.pii_summary['records_containing_pii']}")
    print(f"   PII categories: {', '.join(report.pii_summary['pii_categories_found'])}")
    print(f"   Systems searched: {', '.join(report.systems_searched)}")
 
save_dsar_report(report, f"/tmp/dsar_{report.subject_email}_{datetime.now().strftime('%Y%m%d')}.json")

GDPR Compliance Notes for Your DSAR Process

Verify identity before processing: GDPR requires you to verify the requester is who they claim to be before disclosing personal data. Add an email verification step before running the pipeline.

Redact third-party data: If your records contain data about people other than the requester (e.g., support ticket threads that mention other users), redact that before including it in the response.

Document your pipeline: Your DSAR response process itself is subject to GDPR accountability requirements. Document the systems searched, the methodology, and when each DSAR was processed.

Set up SLA tracking: one month is a hard deadline. Build DSAR intake tracking so requests don't fall through the cracks. A simple database table with request_date, status, and due_date is sufficient.
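That intake table can live anywhere; here is a sketch using sqlite3 (swap the in-memory connection for your real database; the one-month deadline is approximated as 30 days):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")  # placeholder: use your real database
conn.execute("""
    CREATE TABLE IF NOT EXISTS dsar_requests (
        id INTEGER PRIMARY KEY,
        subject_email TEXT NOT NULL,
        request_date TEXT NOT NULL,
        due_date TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'open'
    )
""")

def log_dsar(subject_email: str) -> None:
    """Record a new DSAR with a due date roughly one month out."""
    now = datetime.utcnow()
    conn.execute(
        "INSERT INTO dsar_requests (subject_email, request_date, due_date) VALUES (?, ?, ?)",
        (subject_email, now.isoformat(), (now + timedelta(days=30)).isoformat()),
    )
    conn.commit()

def overdue_requests() -> list:
    """Return open requests past their due date, for alerting."""
    cur = conn.execute(
        "SELECT subject_email, due_date FROM dsar_requests "
        "WHERE status = 'open' AND due_date < ?",
        (datetime.utcnow().isoformat(),),
    )
    return cur.fetchall()

log_dsar("[email protected]")
print(overdue_requests())  # [] since nothing is overdue yet
```

Wire `overdue_requests` into whatever alerting you already run (a daily cron posting to Slack is plenty).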

Right to erasure follows different logic: This pipeline covers Article 15 (access). If you need to implement Article 17 (erasure/right to be forgotten), that requires a deletion pipeline — a different workflow from access requests.
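The Article 15 report does give an erasure workflow a head start, though: its PII map already says where the subject's data lives. A hypothetical `build_erasure_plan` helper could turn the report into a deletion checklist, leaving the actual deletes, and the retention-obligation checks that must precede them, to a separate reviewed workflow:

```python
def build_erasure_plan(dsar_report: dict) -> list[dict]:
    """Derive a per-source deletion checklist from an Article 15 report dict.

    Sketch only: real erasure must first check legal retention obligations
    (invoices, fraud records, etc.) before anything is actually deleted.
    """
    plan = []
    for record in dsar_report.get("data_records", []):
        plan.append({
            "record_id": record["record_id"],
            "source": record["source"],
            "action": "delete-or-anonymize",
        })
    return plan

sample_report = {
    "data_records": [
        {"record_id": "users_42", "source": "database/users"},
        {"record_id": "ticket_7", "source": "database/support_tickets"},
    ]
}
print(build_erasure_plan(sample_report))
```

The `data_records` shape here mirrors what `DSARReport.to_dict()` produces, so the access pipeline's output doubles as the erasure pipeline's input.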

What This Saves You

A manual DSAR process for a mid-size SaaS with data spread across 5–10 systems typically takes 4–8 hours of engineering time per request. At scale — even 10–20 DSARs per month — that's significant overhead. The automated pipeline reduces per-request engineering time to under 15 minutes (review + response drafting), with the scan running automatically.

More importantly, automated PII detection eliminates the risk of missing data — which is precisely the kind of systemic failure that resulted in the Reddit and Free Mobile fines. Regulators aren't just checking whether you responded; they're checking whether your response was complete and accurate.

Start with GlobalShield API on RapidAPI — the first 100 scans are available on the free tier, enough to test the full DSAR pipeline against your real data before committing to a paid plan.