Education · April 2, 2026

Global Customer Data Standardization with DataForge API

Learn how to standardize phone numbers, addresses, and names across 190+ countries using DataForge API — eliminating duplicate records and failed deliveries at scale.

APIVult Team

@apivult

Global SaaS companies have a data quality problem that gets worse as they grow: customer records from 30 different countries, collected through web forms, mobile apps, CSV imports, and third-party integrations, each following different local conventions. US phone numbers look nothing like Brazilian or Japanese ones. Addresses in South Korea are structured completely differently from those in Germany or Australia. Names in Chinese, Arabic, and Vietnamese don't follow Western "first/last" conventions.

The result is a customer database full of near-duplicates, failed SMS deliveries, undeliverable mail, and broken CRM segments — costing real money in failed marketing campaigns, lost shipments, and customer support overhead.

This guide shows you how to build a global customer data standardization pipeline using the DataForge API — normalizing customer records from any country into a consistent, validated format your systems can reliably use.

The Problem with Global Customer Data

When you collect customer data internationally without standardization, you accumulate these patterns:

# Same customer, 4 different formats imported from different sources
{"phone": "+1 (415) 555-0142"}       # US web form
{"phone": "4155550142"}               # Mobile SDK
{"phone": "001-415-5550142"}          # CSV import from partner
{"phone": "+14155550142"}             # Internal system

# Address for the same location
{"address": "1600 Pennsylvania Ave NW, Washington DC 20500"}
{"address": "Pennsylvania Avenue Northwest 1600\nWashington\n20500"}
{"address": "1600 Pennsylvania Avenue Northwest, Washington, District of Columbia"}

Without normalization, these create duplicate records, failed lookups, and silent deliverability failures.
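
To see why simple string cleanup is not enough, here is a minimal sketch that only strips non-digit characters from the four phone values above. The helper name `digits_only` is illustrative and is not part of DataForge.

```python
import re

# The four raw values from the example above, all for the same US number.
raw_phones = [
    "+1 (415) 555-0142",
    "4155550142",
    "001-415-5550142",
    "+14155550142",
]


def digits_only(phone: str) -> str:
    """Naive cleanup: drop every character that is not a digit."""
    return re.sub(r"\D", "", phone)


naive_keys = {digits_only(p) for p in raw_phones}
print(sorted(naive_keys))
# Three distinct keys remain, because the country code and the
# international dialing prefix (001) are still encoded differently.
```

Stripping punctuation merges only two of the four variants. Collapsing the rest requires knowing the country's dialing rules, which is the part a standardization service takes over.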

What DataForge Standardizes

DataForge API handles global data standardization across:

Phone numbers — E.164 normalization for 190+ country codes
Postal addresses — structured parsing and validation for 180+ countries
Names — honorific separation, cultural convention handling
Email addresses — syntax validation, domain verification, disposable email detection
Company names — legal suffix normalization (Ltd, LLC, GmbH, K.K., etc.)
Tax IDs — VAT, EIN, GST format validation by jurisdiction

Prerequisites

pip install requests

You'll need a DataForge API key from apivult.com.

Step 1: Standardize Phone Numbers

import requests
from typing import Optional
 
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://apivult.com/dataforge/v1"
 
 
def standardize_phone(
    phone: str,
    default_country: str = "US",
    output_format: str = "e164"
) -> dict:
    """
    Standardize a phone number to E.164 format.
 
    Args:
        phone: Raw phone string in any format
        default_country: ISO 3166-1 alpha-2 country code for number interpretation
        output_format: 'e164' | 'national' | 'international' | 'rfc3966'
 
    Returns:
        {"valid": bool, "formatted": str, "country": str, "type": str, "original": str}
    """
    response = requests.post(
        f"{BASE_URL}/validate/phone",
        headers={"X-API-Key": API_KEY},
        json={
            "value": phone,
            "default_country": default_country,
            "output_format": output_format
        },
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()
 
 
# Examples
phones_to_test = [
    ("(415) 555-0142", "US"),
    ("07700 900142", "GB"),
    ("030 12345678", "DE"),
    ("03-1234-5678", "JP"),
    ("(11) 98765-4321", "BR"),
    ("+86 138 0013 8000", "CN")
]
 
for phone, country in phones_to_test:
    result = standardize_phone(phone, default_country=country)
    status = "✅" if result.get("valid") else "❌"
    # Fall back to a placeholder if the key is missing or null
    formatted = result.get("formatted") or "INVALID"
    print(f"{status} {phone!r:30} → {formatted:20} ({country})")

Output:

✅ '(415) 555-0142'               → +14155550142         (US)
✅ '07700 900142'                  → +447700900142        (GB)
✅ '030 12345678'                  → +493012345678        (DE)
✅ '03-1234-5678'                  → +81312345678         (JP)
✅ '(11) 98765-4321'               → +5511987654321       (BR)
✅ '+86 138 0013 8000'             → +8613800138000       (CN)
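
Before storing a formatted value, it can be worth checking locally that it has the general shape of an E.164 number: a leading plus sign, a country code that does not start with zero, and at most 15 digits in total. The sketch below is only a shape check under that definition. The name `looks_like_e164` is illustrative, and the check does not confirm that a number is assigned or reachable.

```python
import re

# E.164 shape: "+", a first digit from 1 to 9, then up to 14 more digits
# (15 digits maximum, country code included).
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def looks_like_e164(value: str) -> bool:
    """Return True if value has the general shape of an E.164 number."""
    return E164_PATTERN.fullmatch(value) is not None


# The formatted values from the output above all pass the shape check.
for number in ["+14155550142", "+447700900142", "+8613800138000"]:
    print(number, looks_like_e164(number))

# Raw input without a country code does not.
print("4155550142", looks_like_e164("4155550142"))
```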

Step 2: Standardize International Addresses

def standardize_address(
    raw_address: str,
    country: str,
    language: Optional[str] = None
) -> dict:
    """
    Parse and standardize an address for any country.
    Returns structured address components.
    """
    response = requests.post(
        f"{BASE_URL}/validate/address",
        headers={"X-API-Key": API_KEY},
        json={
            "value": raw_address,
            "country": country,
            "language": language,
            "output_format": "structured"
        },
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()
 
 
def format_address_standard(structured: dict) -> str:
    """Format a structured address result into a canonical single-line form."""
    parts = []
    if structured.get("street_number"):
        parts.append(structured["street_number"])
    if structured.get("street_name"):
        parts.append(structured["street_name"])
    if structured.get("unit"):
        parts.append(f"Unit {structured['unit']}")
    if structured.get("city"):
        parts.append(structured["city"])
    if structured.get("state_province"):
        parts.append(structured["state_province"])
    if structured.get("postal_code"):
        parts.append(structured["postal_code"])
    if structured.get("country_name"):
        parts.append(structured["country_name"])
    return ", ".join(parts)
 
 
# Test with international addresses
test_addresses = [
    ("1600 Pennsylvania Ave NW, Washington DC", "US"),
    ("10 Downing Street, London SW1A 2AA", "GB"),
    ("Unter den Linden 77, 10117 Berlin", "DE"),
    ("〒100-8994 東京都千代田区大手町2-3-1", "JP"),
    ("Rua das Flores, 123, São Paulo, SP 01310-100", "BR"),
    ("北京市朝阳区建国路89号华贸中心", "CN")
]
 
for address, country in test_addresses:
    result = standardize_address(address, country)
    if result.get("valid"):
        canonical = format_address_standard(result.get("components", {}))
        print(f"✅ [{country}] {canonical}")
    else:
        print(f"❌ [{country}] Could not parse: {address[:50]}")

Step 3: Bulk Customer Record Standardization

For production, you need to process your entire customer database:

import csv
from concurrent.futures import ThreadPoolExecutor, as_completed
 
 
def standardize_customer_record(record: dict) -> dict:
    """
    Standardize all fields in a single customer record.
    Input: raw customer dict from any source
    Output: standardized customer dict with validation flags
    """
    result = {**record}  # Copy, never mutate
    issues = []
 
    # Standardize phone
    if record.get("phone"):
        # CSV rows yield "" for empty cells, so fall back on any falsy value
        country = record.get("country_code") or "US"
        phone_result = standardize_phone(record["phone"], default_country=country)
        if phone_result.get("valid"):
            result["phone_e164"] = phone_result["formatted"]
            result["phone_country"] = phone_result.get("country")
        else:
            issues.append(f"invalid_phone: {record['phone']!r}")
 
    # Standardize address
    if record.get("address") and record.get("country_code"):
        addr_result = standardize_address(record["address"], record["country_code"])
        if addr_result.get("valid"):
            components = addr_result.get("components", {})
            result["address_standardized"] = format_address_standard(components)
            result["postal_code_validated"] = components.get("postal_code")
            result["city_standardized"] = components.get("city")
        else:
            issues.append(f"invalid_address: {record.get('address', '')[:50]!r}")
 
    # Validate email
    if record.get("email"):
        email_result = validate_email(record["email"])
        result["email_valid"] = email_result.get("valid", False)
        result["email_disposable"] = email_result.get("disposable", False)
        if not email_result.get("valid"):
            issues.append(f"invalid_email: {record['email']!r}")
 
    result["_standardization_issues"] = issues
    result["_data_quality_score"] = max(0, 100 - (len(issues) * 25))
 
    return result
 
 
def validate_email(email: str) -> dict:
    """Validate email address format and domain."""
    response = requests.post(
        f"{BASE_URL}/validate/email",
        headers={"X-API-Key": API_KEY},
        json={"value": email, "check_disposable": True},
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()
 
 
def process_customer_csv(
    input_path: str,
    output_path: str,
    max_workers: int = 5
):
    """
    Process a CSV file of customer records, standardizing all fields.
    """
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        records = list(reader)
 
    print(f"Processing {len(records):,} customer records...")
 
    standardized = []
    errors = []
 
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(standardize_customer_record, r): r for r in records}
        for i, future in enumerate(as_completed(futures), 1):
            if i % 100 == 0:
                print(f"  Processed {i}/{len(records)}...")
            try:
                result = future.result()
                standardized.append(result)
            except Exception as e:
                errors.append({"record": futures[future], "error": str(e)})
 
    # Write standardized output
    if standardized:
        fieldnames = list(standardized[0].keys())
        with open(output_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(standardized)
 
    # Summary
    quality_scores = [r.get("_data_quality_score", 0) for r in standardized]
    avg_quality = sum(quality_scores) / len(quality_scores) if quality_scores else 0
 
    print(f"\n{'═'*50}")
    print(f"Standardization Complete")
    print(f"{'═'*50}")
    print(f"Records processed:   {len(standardized):,}")
    print(f"Errors:              {len(errors):,}")
    print(f"Avg quality score:   {avg_quality:.1f}/100")
    print(f"Output:              {output_path}")
 
    invalid_phones = sum(1 for r in standardized if any("invalid_phone" in i for i in r.get("_standardization_issues", [])))
    invalid_addrs = sum(1 for r in standardized if any("invalid_address" in i for i in r.get("_standardization_issues", [])))
    invalid_emails = sum(1 for r in standardized if any("invalid_email" in i for i in r.get("_standardization_issues", [])))
 
    total = len(standardized) or 1  # guard against division by zero when nothing succeeded
    print(f"\nData Issues Found:")
    print(f"  Invalid phones:    {invalid_phones:,} ({invalid_phones/total*100:.1f}%)")
    print(f"  Invalid addresses: {invalid_addrs:,} ({invalid_addrs/total*100:.1f}%)")
    print(f"  Invalid emails:    {invalid_emails:,} ({invalid_emails/total*100:.1f}%)")
 
    return standardized
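
Bulk runs make many network calls, and some of them will time out or hit connection errors. One common way to handle that is to retry with exponential backoff. The helper below is a generic sketch, not part of DataForge: `call_with_retry` is a hypothetical name, and it retries on any exception because the API's error and rate-limit behavior is not described here. In practice you would narrow the `except` clause to the errors you know to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying failed calls with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


# Demonstration with a function that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> dict:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"valid": True}


result = call_with_retry(flaky, sleep=lambda _: None)  # skip real sleeping in the demo
print(result, "after", attempts["n"], "attempts")
```

In the pipeline above, the worker submitted to the executor could be wrapped as `call_with_retry(lambda: standardize_customer_record(r))`, so that a single transient failure does not land a record in the error list.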

Step 4: Deduplication After Standardization

Once records are standardized, deduplication becomes straightforward:

def deduplicate_customers(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """
    Deduplicate customer records using standardized fields.
    Returns (unique_records, duplicate_records).
    """
    seen_phones = {}
    seen_emails = {}
    unique = []
    duplicates = []
 
    for record in records:
        phone_key = record.get("phone_e164")
        email_key = (record.get("email") or "").lower().strip()
 
        is_duplicate = False
 
        if phone_key and phone_key in seen_phones:
            duplicates.append({
                **record,
                "_duplicate_of": seen_phones[phone_key],
                "_duplicate_reason": "phone_match"
            })
            is_duplicate = True
        elif email_key and email_key in seen_emails:
            duplicates.append({
                **record,
                "_duplicate_of": seen_emails[email_key],
                "_duplicate_reason": "email_match"
            })
            is_duplicate = True
 
        if not is_duplicate:
            unique.append(record)
            if phone_key:
                seen_phones[phone_key] = record.get("id", "unknown")
            if email_key:
                seen_emails[email_key] = record.get("id", "unknown")
 
    dedup_rate = len(duplicates) / len(records) * 100 if records else 0
    print(f"Deduplication: {len(unique):,} unique, {len(duplicates):,} duplicates ({dedup_rate:.1f}% dupe rate)")
 
    return unique, duplicates
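
The single-pass approach above only registers the phone and email of records it keeps, so it can miss chains: if record B matches A by email and record C matches B by phone, C is treated as unique. A different technique, union-find (disjoint sets), groups such chains into one cluster. The sketch below is an alternative for illustration, not the method used above. `cluster_duplicates` is a hypothetical name and the sample records are made up.

```python
def cluster_duplicates(records: list[dict]) -> list[list[dict]]:
    """Group records that share a phone or an email, directly or transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # earliest record becomes the root

    first_seen: dict[tuple[str, str], int] = {}
    for idx, record in enumerate(records):
        phone = record.get("phone_e164")
        email = (record.get("email") or "").strip().lower()
        for key in (("phone", phone), ("email", email)):
            if not key[1]:
                continue  # skip missing values so blanks never match each other
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    clusters: dict[int, list[dict]] = {}
    for idx, record in enumerate(records):
        clusters.setdefault(find(idx), []).append(record)
    return list(clusters.values())


sample = [
    {"id": 1, "phone_e164": "+14155550142", "email": "a@example.com"},
    {"id": 2, "phone_e164": "+14155550199", "email": "A@example.com "},
    {"id": 3, "phone_e164": "+14155550199", "email": "b@example.com"},
    {"id": 4, "phone_e164": "+447700900142", "email": "c@example.com"},
]
for cluster in cluster_duplicates(sample):
    print([r["id"] for r in cluster])
```

Records 1, 2, and 3 end up in one cluster even though 1 and 3 share neither a phone nor an email. Each cluster can then be merged or reviewed as a unit.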

Expected Results

Based on industry benchmarks for global SaaS companies running DataForge standardization:

Data Quality Issue	Typical Rate Before	After DataForge
Invalid phone formats	15–30% of records	< 2%
Duplicate records	8–18%	< 0.5%
Undeliverable address	12–22%	< 3%
Invalid email	5–10%	Flagged for cleanup
Failed SMS delivery rate	18–25%	3–5%

For a company with 100,000 customer records, eliminating 15% duplicate records and fixing 20% of phone numbers translates directly to:

Marketing list efficiency improvement: ~25%
SMS campaign delivery rate improvement: 15–20 percentage points
Shipping failure rate reduction: 60–70%
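
As a quick sanity check on the scale involved, the arithmetic behind the 100,000-record scenario can be written out. The percentages are taken as given from the text and table above, and the variable names are illustrative.

```python
total_records = 100_000
duplicate_pct = 15  # duplicate share assumed in the scenario above
bad_phone_pct = 20  # share of phone numbers assumed to need fixing

duplicates_removed = total_records * duplicate_pct // 100
phones_fixed = total_records * bad_phone_pct // 100
records_after_dedup = total_records - duplicates_removed

# SMS failure rates from the table: 18-25% before, 3-5% after
before_low, before_high = 18, 25
after_low, after_high = 3, 5
smallest_gain = before_low - after_high  # least favorable pairing
largest_gain = before_high - after_low   # most favorable pairing

print(f"Duplicates removed:   {duplicates_removed:,}")
print(f"Phone numbers fixed:  {phones_fixed:,}")
print(f"Records after dedup:  {records_after_dedup:,}")
print(f"SMS delivery gain:    {smallest_gain} to {largest_gain} percentage points")
```

The table's failure-rate ranges imply a gain of anywhere from 13 to 22 percentage points, so the 15–20 point figure quoted above sits inside that range.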

Getting Started

DataForge API is available at apivult.com. Phone validation, address standardization, and email validation are all available as individual endpoints or as a combined record standardization call. Start with a sample of your customer database to see your current data quality baseline — most teams are surprised by how high their dupe rate actually is.
