Global Customer Data Standardization with DataForge API

Learn how to standardize phone numbers, addresses, and names across 190+ countries using DataForge API — eliminating duplicate records and failed deliveries at scale.

Global SaaS companies have a data quality problem that gets worse as they grow: customer records from 30 different countries, collected through web forms, mobile apps, CSV imports, and third-party integrations, each following different local conventions. US phone numbers look nothing like Brazilian or Japanese formats. Addresses in South Korea have a completely different structure than in Germany or Australia. Names in Chinese, Arabic, and Vietnamese don't follow Western "first/last" conventions.

The result is a customer database full of near-duplicates, failed SMS deliveries, undeliverable mail, and broken CRM segments — costing real money in failed marketing campaigns, lost shipments, and customer support overhead.

This guide shows you how to build a global customer data standardization pipeline using the DataForge API — normalizing customer records from any country into a consistent, validated format your systems can reliably use.

The Problem with Global Customer Data

When you collect customer data internationally without standardization, you accumulate these patterns:

# Same customer, 4 different formats imported from different sources
{"phone": "+1 (415) 555-0142"}       # US web form
{"phone": "4155550142"}               # Mobile SDK
{"phone": "001-415-5550142"}          # CSV import from partner
{"phone": "+14155550142"}             # Internal system

# Address for the same location
{"address": "1600 Pennsylvania Ave NW, Washington DC 20500"}
{"address": "Pennsylvania Avenue Northwest 1600\nWashington\n20500"}
{"address": "1600 Pennsylvania Avenue Northwest, Washington, District of Columbia"}

Without normalization, these create duplicate records, failed lookups, and silent deliverability failures.
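
To see why these variants defeat exact-match lookups, the sketch below compares the four phone strings as-is and then after a deliberately naive, US-only cleanup. The helper `naive_us_e164` is a hypothetical illustration and not part of DataForge; it handles only this one country, which is the gap a global standardization service is meant to close.

```python
import re

RAW_PHONES = [
    "+1 (415) 555-0142",   # US web form
    "4155550142",          # Mobile SDK
    "001-415-5550142",     # CSV import from partner
    "+14155550142",        # Internal system
]


def naive_us_e164(raw: str) -> str:
    """Hypothetical US-only cleanup: keep digits, normalize the country prefix."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("001"):
        digits = digits[2:]          # drop the "00" international dialing prefix
    if len(digits) == 10:
        digits = "1" + digits        # assume a bare 10-digit number is US
    return "+" + digits


distinct_raw = set(RAW_PHONES)
distinct_clean = {naive_us_e164(p) for p in RAW_PHONES}

print(len(distinct_raw))    # 4 apparent customers before cleanup
print(len(distinct_clean))  # 1 customer after cleanup
```

The same trick breaks as soon as a second country appears, since a bare 10-digit string is not necessarily a US number.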

What DataForge Standardizes

DataForge API handles global data standardization across:

  • Phone numbers — E.164 normalization for 190+ country codes
  • Postal addresses — structured parsing and validation for 180+ countries
  • Names — honorific separation, cultural convention handling
  • Email addresses — syntax validation, domain verification, disposable email detection
  • Company names — legal suffix normalization (Ltd, LLC, GmbH, K.K., etc.)
  • Tax IDs — VAT, EIN, GST format validation by jurisdiction

Prerequisites

pip install requests

You'll need a DataForge API key from apivult.com.
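
The examples below hardcode the key as `YOUR_API_KEY` for brevity. A common alternative is to read it from the environment so it never lands in version control. This is a minimal sketch; the variable name `DATAFORGE_API_KEY` and the helper `load_api_key` are assumptions, not something DataForge prescribes.

```python
import os


def load_api_key(env_var: str = "DATAFORGE_API_KEY") -> str:
    """Read the API key from the environment; the variable name is an assumption."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, `API_KEY = load_api_key()` can stand in for the hardcoded string in Step 1.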

Step 1: Standardize Phone Numbers

import requests
from typing import Optional
 
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://apivult.com/dataforge/v1"
 
 
def standardize_phone(
    phone: str,
    default_country: str = "US",
    output_format: str = "e164"
) -> dict:
    """
    Standardize a phone number to E.164 format.
 
    Args:
        phone: Raw phone string in any format
        default_country: ISO 3166-1 alpha-2 country code for number interpretation
        output_format: 'e164' | 'national' | 'international' | 'rfc3966'
 
    Returns:
        {"valid": bool, "formatted": str, "country": str, "type": str, "original": str}
    """
    response = requests.post(
        f"{BASE_URL}/validate/phone",
        headers={"X-API-Key": API_KEY},
        json={
            "value": phone,
            "default_country": default_country,
            "output_format": output_format
        },
        timeout=10,  # seconds; without this a stalled connection blocks forever
    )
    return response.json()
 
 
# Examples
phones_to_test = [
    ("(415) 555-0142", "US"),
    ("07700 900142", "GB"),
    ("030 12345678", "DE"),
    ("03-1234-5678", "JP"),
    ("(11) 98765-4321", "BR"),
    ("+86 138 0013 8000", "CN")
]
 
for phone, country in phones_to_test:
    result = standardize_phone(phone, default_country=country)
    status = "✅" if result.get("valid") else "❌"
    formatted = result.get("formatted") or "INVALID"
    print(f"{status} {phone!r:30} → {formatted:20} ({country})")

Output:

✅ '(415) 555-0142'               → +14155550142         (US)
✅ '07700 900142'                  → +447700900142        (GB)
✅ '030 12345678'                  → +493012345678        (DE)
✅ '03-1234-5678'                  → +81312345678         (JP)
✅ '(11) 98765-4321'               → +5511987654321       (BR)
✅ '+86 138 0013 8000'             → +8613800138000       (CN)
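
Network calls like the ones above can fail transiently, which matters once you run them thousands of times. Below is a minimal sketch of a retry wrapper with exponential backoff using only the standard library. `call_with_retries` and its parameters are hypothetical names, and this guide does not state which errors or status codes DataForge returns under load, so the wrapper simply retries on any exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: re-raise the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A call would look like `call_with_retries(lambda: standardize_phone("(415) 555-0142", "US"))`. The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.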

Step 2: Standardize International Addresses

def standardize_address(
    raw_address: str,
    country: str,
    language: Optional[str] = None
) -> dict:
    """
    Parse and standardize an address for any country.
    Returns structured address components.
    """
    response = requests.post(
        f"{BASE_URL}/validate/address",
        headers={"X-API-Key": API_KEY},
        json={
            "value": raw_address,
            "country": country,
            "language": language,
            "output_format": "structured"
        },
        timeout=10,  # seconds; without this a stalled connection blocks forever
    )
    return response.json()
 
 
def format_address_standard(structured: dict) -> str:
    """Format a structured address result into a canonical single-line form."""
    parts = []
    if structured.get("street_number"):
        parts.append(structured["street_number"])
    if structured.get("street_name"):
        parts.append(structured["street_name"])
    if structured.get("unit"):
        parts.append(f"Unit {structured['unit']}")
    if structured.get("city"):
        parts.append(structured["city"])
    if structured.get("state_province"):
        parts.append(structured["state_province"])
    if structured.get("postal_code"):
        parts.append(structured["postal_code"])
    if structured.get("country_name"):
        parts.append(structured["country_name"])
    return ", ".join(parts)
 
 
# Test with international addresses
test_addresses = [
    ("1600 Pennsylvania Ave NW, Washington DC", "US"),
    ("10 Downing Street, London SW1A 2AA", "GB"),
    ("Unter den Linden 77, 10117 Berlin", "DE"),
    ("〒100-8994 東京都千代田区大手町2-3-1", "JP"),
    ("Rua das Flores, 123, São Paulo, SP 01310-100", "BR"),
    ("北京市朝阳区建国路89号华贸中心", "CN")
]
 
for address, country in test_addresses:
    result = standardize_address(address, country)
    if result.get("valid"):
        canonical = format_address_standard(result.get("components", {}))
        print(f"✅ [{country}] {canonical}")
    else:
        print(f"❌ [{country}] Could not parse: {address[:50]}")
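
Note that `format_address_standard` joins every component with a comma, so a result containing both `street_number` and `street_name` renders as `1600, Pennsylvania Avenue NW, ...`. A hedged variant that builds the street line first is sketched below. The sample dicts are invented for illustration and assume the field names used above; the real response shape may differ. The `number_first` flag is also an assumption, added because number-before-street ordering does not hold in every country.

```python
def format_address_line(components: dict, number_first: bool = True) -> str:
    """Join address components, keeping street number and name in one segment."""
    number = components.get("street_number")
    name = components.get("street_name")
    street_bits = [number, name] if number_first else [name, number]
    street = " ".join(bit for bit in street_bits if bit)

    unit = components.get("unit")
    parts = [
        street,
        f"Unit {unit}" if unit else None,
        components.get("city"),
        components.get("state_province"),
        components.get("postal_code"),
        components.get("country_name"),
    ]
    # Drop empty segments so missing components leave no stray commas
    return ", ".join(part for part in parts if part)


# Invented sample data using the field names assumed in this guide
sample_us = {
    "street_number": "1600",
    "street_name": "Pennsylvania Avenue NW",
    "city": "Washington",
    "state_province": "DC",
    "postal_code": "20500",
    "country_name": "United States",
}
sample_de = {
    "street_number": "77",
    "street_name": "Unter den Linden",
    "city": "Berlin",
    "postal_code": "10117",
    "country_name": "Germany",
}

print(format_address_line(sample_us))
print(format_address_line(sample_de, number_first=False))
```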

Step 3: Bulk Customer Record Standardization

For production, you need to process your entire customer database:

import csv
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
 
 
def standardize_customer_record(record: dict) -> dict:
    """
    Standardize all fields in a single customer record.
    Input: raw customer dict from any source
    Output: standardized customer dict with validation flags
    """
    result = {**record}  # Copy, never mutate
    issues = []
 
    # Standardize phone
    if record.get("phone"):
        country = record.get("country_code", "US")
        phone_result = standardize_phone(record["phone"], default_country=country)
        if phone_result.get("valid"):
            result["phone_e164"] = phone_result["formatted"]
            result["phone_country"] = phone_result.get("country")
        else:
            issues.append(f"invalid_phone: {record['phone']!r}")
 
    # Standardize address
    if record.get("address") and record.get("country_code"):
        addr_result = standardize_address(record["address"], record["country_code"])
        if addr_result.get("valid"):
            components = addr_result.get("components", {})
            result["address_standardized"] = format_address_standard(components)
            result["postal_code_validated"] = components.get("postal_code")
            result["city_standardized"] = components.get("city")
        else:
            issues.append(f"invalid_address: {record.get('address', '')[:50]!r}")
 
    # Validate email
    if record.get("email"):
        email_result = validate_email(record["email"])
        result["email_valid"] = email_result.get("valid", False)
        result["email_disposable"] = email_result.get("disposable", False)
        if not email_result.get("valid"):
            issues.append(f"invalid_email: {record['email']!r}")
 
    result["_standardization_issues"] = issues
    result["_data_quality_score"] = max(0, 100 - (len(issues) * 25))
 
    return result
 
 
def validate_email(email: str) -> dict:
    """Validate email address format and domain."""
    response = requests.post(
        f"{BASE_URL}/validate/email",
        headers={"X-API-Key": API_KEY},
        json={"value": email, "check_disposable": True},
        timeout=10,  # seconds; without this a stalled connection blocks forever
    )
    return response.json()
 
 
def process_customer_csv(
    input_path: str,
    output_path: str,
    max_workers: int = 5
):
    """
    Process a CSV file of customer records, standardizing all fields.
    """
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        records = list(reader)
 
    print(f"Processing {len(records):,} customer records...")
 
    standardized = []
    errors = []
 
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(standardize_customer_record, r): r for r in records}
        for i, future in enumerate(as_completed(futures), 1):
            if i % 100 == 0:
                print(f"  Processed {i}/{len(records)}...")
            try:
                result = future.result()
                standardized.append(result)
            except Exception as e:
                errors.append({"record": futures[future], "error": str(e)})
 
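    # Write standardized output. Records gain different keys depending on which
    # validations passed, so build the header from the union of all keys;
    # DictWriter raises ValueError for a row with a key missing from fieldnames.
    if standardized:
        fieldnames = list(dict.fromkeys(k for r in standardized for k in r))
        with open(output_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
            writer.writeheader()
            writer.writerows(standardized)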
 
    # Summary
    quality_scores = [r.get("_data_quality_score", 0) for r in standardized]
    avg_quality = sum(quality_scores) / len(quality_scores) if quality_scores else 0
 
    print(f"\n{'═'*50}")
    print(f"Standardization Complete")
    print(f"{'═'*50}")
    print(f"Records processed:   {len(standardized):,}")
    print(f"Errors:              {len(errors):,}")
    print(f"Avg quality score:   {avg_quality:.1f}/100")
    print(f"Output:              {output_path}")
 
    invalid_phones = sum(1 for r in standardized if any("invalid_phone" in i for i in r.get("_standardization_issues", [])))
    invalid_addrs = sum(1 for r in standardized if any("invalid_address" in i for i in r.get("_standardization_issues", [])))
    invalid_emails = sum(1 for r in standardized if any("invalid_email" in i for i in r.get("_standardization_issues", [])))
 
    total = len(standardized) or 1  # avoid division by zero when every record errored
    print("\nData Issues Found:")
    print(f"  Invalid phones:    {invalid_phones:,} ({invalid_phones/total*100:.1f}%)")
    print(f"  Invalid addresses: {invalid_addrs:,} ({invalid_addrs/total*100:.1f}%)")
    print(f"  Invalid emails:    {invalid_emails:,} ({invalid_emails/total*100:.1f}%)")
 
    return standardized
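
Because validated records gain keys such as `phone_e164` that failed records lack, the CSV header has to cover the union of keys: `csv.DictWriter` raises `ValueError` for a row containing a key that is missing from `fieldnames`. The sketch below shows that pattern in isolation with invented records. It also serializes the issues list as JSON so it can be parsed back later, which is a suggestion rather than something the pipeline above does.

```python
import csv
import io
import json

# Invented records: the second one failed phone validation, so it has no phone_e164
records = [
    {"id": "1", "phone": "(415) 555-0142", "phone_e164": "+14155550142",
     "_standardization_issues": []},
    {"id": "2", "phone": "not a number",
     "_standardization_issues": ["invalid_phone: 'not a number'"]},
]

# Union of keys in first-seen order, so no record has a field missing from the header
fieldnames = list(dict.fromkeys(key for rec in records for key in rec))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
for rec in records:
    row = dict(rec)
    # Store the issues list as JSON so it can be parsed back reliably
    row["_standardization_issues"] = json.dumps(row["_standardization_issues"])
    writer.writerow(row)

print(buffer.getvalue())
```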

Step 4: Deduplication After Standardization

Once records are standardized, deduplication becomes straightforward:

from collections import defaultdict
 
 
def deduplicate_customers(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """
    Deduplicate customer records using standardized fields.
    Returns (unique_records, duplicate_records).
    """
    seen_phones = {}
    seen_emails = {}
    unique = []
    duplicates = []
 
    for record in records:
        phone_key = record.get("phone_e164")
        email_key = (record.get("email") or "").lower().strip()  # tolerate None
 
        is_duplicate = False
 
        if phone_key and phone_key in seen_phones:
            duplicates.append({
                **record,
                "_duplicate_of": seen_phones[phone_key],
                "_duplicate_reason": "phone_match"
            })
            is_duplicate = True
        elif email_key and email_key in seen_emails:
            duplicates.append({
                **record,
                "_duplicate_of": seen_emails[email_key],
                "_duplicate_reason": "email_match"
            })
            is_duplicate = True
 
        if not is_duplicate:
            unique.append(record)
            if phone_key:
                seen_phones[phone_key] = record.get("id", "unknown")
            if email_key:
                seen_emails[email_key] = record.get("id", "unknown")
 
    dedup_rate = len(duplicates) / len(records) * 100 if records else 0
    print(f"Deduplication: {len(unique):,} unique, {len(duplicates):,} duplicates ({dedup_rate:.1f}% dupe rate)")
 
    return unique, duplicates
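
The single pass above compares each record only against keys from records already accepted as unique, so keys carried solely by a duplicate are never registered. If A and B share an email and B and C share a phone, C is kept as unique even though it is linked to A through B. One way to catch such chains is to cluster records with a union-find (disjoint-set) structure. This is a different technique from the function above, sketched under the assumption that every record carries `id`, `phone_e164`, and `email` fields; the sample records are invented.

```python
from collections import defaultdict


def cluster_duplicates(records: list[dict]) -> list[list[str]]:
    """Group record ids that are linked, directly or transitively, by phone or email."""
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen: dict[tuple[str, str], str] = {}
    for rec in records:
        keys = [
            ("phone", rec.get("phone_e164") or ""),
            ("email", (rec.get("email") or "").strip().lower()),
        ]
        for key in keys:
            if not key[1]:
                continue                    # skip missing values
            if key in first_seen:
                union(first_seen[key], rec["id"])
            else:
                first_seen[key] = rec["id"]

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec["id"])].append(rec["id"])
    return list(clusters.values())


# Invented records: A and B share an email, B and C share a phone
sample = [
    {"id": "A", "phone_e164": "+14155550142", "email": "kim@example.com"},
    {"id": "B", "phone_e164": "+14155550199", "email": "Kim@Example.com "},
    {"id": "C", "phone_e164": "+14155550199", "email": "k.lee@example.com"},
    {"id": "D", "phone_e164": "+442079460000", "email": "other@example.com"},
]
print(cluster_duplicates(sample))
```

Each returned cluster with more than one id is a merge candidate. Whether to merge automatically or queue for review is a policy decision, since a shared phone number can also mean a shared household or office line.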

Expected Results

Based on industry benchmarks for global SaaS companies running DataForge standardization:

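| Data Quality Issue       | Typical Rate Before | After DataForge     |
|--------------------------|---------------------|---------------------|
| Invalid phone formats    | 15–30% of records   | < 2%                |
| Duplicate records        | 8–18%               | < 0.5%              |
| Undeliverable address    | 12–22%              | < 3%                |
| Invalid email            | 5–10%               | Flagged for cleanup |
| Failed SMS delivery rate | 18–25%              | 3–5%                |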

For a company with 100,000 customer records, removing the 15% that are duplicates and correcting the 20% with malformed phone numbers translates to roughly:

  • Marketing list efficiency improvement: ~25%
  • SMS campaign delivery rate improvement: 15–20 percentage points
  • Shipping failure rate reduction: 60–70%
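
For a sense of scale, the record counts behind that scenario work out as follows. This is simple arithmetic on the two rates quoted above, not a measured result, and it applies the phone rate to the full 100,000 records before deduplication as a simplification.

```python
total_records = 100_000
duplicate_rate = 0.15       # from the scenario above
bad_phone_rate = 0.20       # from the scenario above

duplicates_removed = round(total_records * duplicate_rate)
unique_records = total_records - duplicates_removed
phones_corrected = round(total_records * bad_phone_rate)

print(f"Duplicates removed: {duplicates_removed:,}")
print(f"Unique records kept: {unique_records:,}")
print(f"Phone numbers corrected: {phones_corrected:,}")
```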

Getting Started

DataForge API is available at apivult.com. Phone validation, address standardization, and email validation are all available as individual endpoints or as a combined record standardization call. Start with a sample of your customer database to see your current data quality baseline — most teams are surprised by how high their dupe rate actually is.