Global Customer Data Standardization with DataForge API
Learn how to standardize phone numbers, addresses, and names across 190+ countries using DataForge API — eliminating duplicate records and failed deliveries at scale.

Global SaaS companies have a data quality problem that gets worse as they grow: customer records from 30 different countries, collected through web forms, mobile apps, CSV imports, and third-party integrations, each following different local conventions. US phone numbers look nothing like Brazilian or Japanese formats. Addresses in South Korea have a completely different structure than in Germany or Australia. Names in Chinese, Arabic, and Vietnamese don't follow Western "first/last" conventions.
The result is a customer database full of near-duplicates, failed SMS deliveries, undeliverable mail, and broken CRM segments — costing real money in failed marketing campaigns, lost shipments, and customer support overhead.
This guide shows you how to build a global customer data standardization pipeline using the DataForge API — normalizing customer records from any country into a consistent, validated format your systems can reliably use.
The Problem with Global Customer Data
When you collect customer data internationally without standardization, you accumulate these patterns:
# Same customer, 4 different formats imported from different sources
{"phone": "+1 (415) 555-0142"} # US web form
{"phone": "4155550142"} # Mobile SDK
{"phone": "001-415-5550142"} # CSV import from partner
{"phone": "+14155550142"} # Internal system
# Address for the same location
{"address": "1600 Pennsylvania Ave NW, Washington DC 20500"}
{"address": "Pennsylvania Avenue Northwest 1600\nWashington\n20500"}
{"address": "1600 Pennsylvania Avenue Northwest, Washington, District of Columbia"}
Without normalization, these create duplicate records, failed lookups, and silent deliverability failures.
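To make the duplication concrete, here is a minimal sketch that collapses the four phone variants above into a single key. It is a deliberately naive, US-only illustration: the helper `naive_us_e164` is hypothetical, it is not how DataForge works internally, and it ignores everything that makes real numbering plans hard.

```python
import re
from typing import Optional

def naive_us_e164(raw: str) -> Optional[str]:
    """Rough US-only normalizer, for illustration only."""
    digits = re.sub(r"\D", "", raw)      # keep digits only
    if raw.strip().startswith("+"):
        pass                             # country code already present
    elif digits.startswith("00"):
        digits = digits[2:]              # drop the "00" international prefix
    elif len(digits) == 10:
        digits = "1" + digits            # assume a bare US national number
    # Accept only 11 digits starting with the US country code
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None

variants = [
    "+1 (415) 555-0142",   # US web form
    "4155550142",          # Mobile SDK
    "001-415-5550142",     # CSV import from partner
    "+14155550142",        # Internal system
]
print({naive_us_e164(v) for v in variants})  # {'+14155550142'}
```

Once all four strings map to the same key, a lookup or a uniqueness check sees one customer instead of four. The hard part, and the reason to use a validation service, is doing this correctly for every country rather than one.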
What DataForge Standardizes
DataForge API handles global data standardization across:
- Phone numbers — E.164 normalization for 190+ country codes
- Postal addresses — structured parsing and validation for 180+ countries
- Names — honorific separation, cultural convention handling
- Email addresses — syntax validation, domain verification, disposable email detection
- Company names — legal suffix normalization (Ltd, LLC, GmbH, K.K., etc.)
- Tax IDs — VAT, EIN, GST format validation by jurisdiction
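As a rough picture of what "legal suffix normalization" means in practice, the sketch below splits a company name into a base name and a canonical suffix. The function `split_legal_suffix` and the small mapping table are assumptions made for illustration; they do not describe DataForge's rules or its response format.

```python
from typing import Optional, Tuple

# Illustrative mapping only: lookup key (lowercase, dots removed) -> canonical suffix
LEGAL_SUFFIXES = {
    "ltd": "Ltd",
    "limited": "Ltd",
    "llc": "LLC",
    "gmbh": "GmbH",
    "kk": "K.K.",
}

def split_legal_suffix(name: str) -> Tuple[str, Optional[str]]:
    """Return (base_name, canonical_suffix), or (name, None) if no known suffix."""
    parts = name.strip().rsplit(None, 1)          # split off the last word
    if len(parts) == 2:
        key = parts[1].replace(".", "").lower()   # "K.K." -> "kk", "Ltd." -> "ltd"
        if key in LEGAL_SUFFIXES:
            return parts[0].rstrip(" ,"), LEGAL_SUFFIXES[key]
    return name.strip(), None

for raw in ["Acme Ltd.", "ACME LIMITED", "Acme, LLC", "Sakura K.K.", "Acme"]:
    print(raw, "->", split_legal_suffix(raw))
```

With the suffix separated, "Acme Ltd." and "ACME LIMITED" differ only by case in the base name, which makes them easy to match during deduplication.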
Prerequisites
pip install requests
You'll need a DataForge API key from apivult.com.
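Step 1 hardcodes the key as `API_KEY = "YOUR_API_KEY"`, which is fine for a quick test but easy to commit by accident. One common alternative is to read it from the environment. This is a minimal sketch; the variable name `DATAFORGE_API_KEY` and the helper `load_api_key` are assumptions, not something DataForge prescribes.

```python
import os

def load_api_key(env_var: str = "DATAFORGE_API_KEY") -> str:
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your DataForge API key"
        )
    return key
```

You could then write `API_KEY = load_api_key()` in place of the hardcoded string, so a missing key fails loudly at startup rather than on the first request.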
Step 1: Standardize Phone Numbers
import requests
from typing import Optional

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://apivult.com/dataforge/v1"

def standardize_phone(
    phone: str,
    default_country: str = "US",
    output_format: str = "e164"
) -> dict:
    """
    Standardize a phone number to E.164 format.

    Args:
        phone: Raw phone string in any format
        default_country: ISO 3166-1 alpha-2 country code for number interpretation
        output_format: 'e164' | 'national' | 'international' | 'rfc3966'

    Returns:
        {"valid": bool, "formatted": str, "country": str, "type": str, "original": str}
    """
    response = requests.post(
        f"{BASE_URL}/validate/phone",
        headers={"X-API-Key": API_KEY},
        json={
            "value": phone,
            "default_country": default_country,
            "output_format": output_format
        },
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()

# Examples
phones_to_test = [
    ("(415) 555-0142", "US"),
    ("07700 900142", "GB"),
    ("030 12345678", "DE"),
    ("03-1234-5678", "JP"),
    ("(11) 98765-4321", "BR"),
    ("+86 138 0013 8000", "CN")
]

for phone, country in phones_to_test:
    result = standardize_phone(phone, default_country=country)
    # .get() avoids a KeyError if the response has no "valid" field
    status = "✅" if result.get("valid") else "❌"
    print(f"{status} {phone!r:30} → {result.get('formatted', 'INVALID'):20} ({country})")
Output:
✅ '(415) 555-0142' → +14155550142 (US)
✅ '07700 900142' → +447700900142 (GB)
✅ '030 12345678' → +493012345678 (DE)
✅ '03-1234-5678' → +81312345678 (JP)
✅ '(11) 98765-4321' → +5511987654321 (BR)
✅ '+86 138 0013 8000' → +8613800138000 (CN)
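The helper above assumes every request succeeds. In a real pipeline, transient failures such as timeouts or dropped connections are common, so it can help to wrap calls in a retry with exponential backoff. The sketch below is generic: `call_with_retry` and its parameters are hypothetical, and how DataForge signals rate limits or errors is not covered in this guide.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception, wait base_delay * 2**attempt and retry."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise                      # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")    # the loop always returns or raises
```

A call might look like `call_with_retry(lambda: standardize_phone("(415) 555-0142", "US"), retry_on=(requests.RequestException,))`. Passing `sleep` as a parameter keeps the function easy to test without real delays.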
Step 2: Standardize International Addresses
def standardize_address(
    raw_address: str,
    country: str,
    language: Optional[str] = None
) -> dict:
    """
    Parse and standardize an address for any country.
    Returns structured address components.
    """
    response = requests.post(
        f"{BASE_URL}/validate/address",
        headers={"X-API-Key": API_KEY},
        json={
            "value": raw_address,
            "country": country,
            "language": language,
            "output_format": "structured"
        },
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()

def format_address_standard(structured: dict) -> str:
    """Format a structured address result into a canonical single-line form."""
    parts = []
    if structured.get("street_number"):
        parts.append(structured["street_number"])
    if structured.get("street_name"):
        parts.append(structured["street_name"])
    if structured.get("unit"):
        parts.append(f"Unit {structured['unit']}")
    if structured.get("city"):
        parts.append(structured["city"])
    if structured.get("state_province"):
        parts.append(structured["state_province"])
    if structured.get("postal_code"):
        parts.append(structured["postal_code"])
    if structured.get("country_name"):
        parts.append(structured["country_name"])
    return ", ".join(parts)

# Test with international addresses
test_addresses = [
    ("1600 Pennsylvania Ave NW, Washington DC", "US"),
    ("10 Downing Street, London SW1A 2AA", "GB"),
    ("Unter den Linden 77, 10117 Berlin", "DE"),
    ("〒100-8994 東京都千代田区大手町2-3-1", "JP"),
    ("Rua das Flores, 123, São Paulo, SP 01310-100", "BR"),
    ("北京市朝阳区建国路89号华贸中心", "CN")
]

for address, country in test_addresses:
    result = standardize_address(address, country)
    if result.get("valid"):
        canonical = format_address_standard(result.get("components", {}))
        print(f"✅ [{country}] {canonical}")
    else:
        print(f"❌ [{country}] Could not parse: {address[:50]}")
Step 3: Bulk Customer Record Standardization
For production, you need to process your entire customer database:
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

def standardize_customer_record(record: dict) -> dict:
    """
    Standardize all fields in a single customer record.
    Input: raw customer dict from any source
    Output: standardized customer dict with validation flags
    """
    result = {**record}  # Copy, never mutate
    issues = []

    # Standardize phone
    if record.get("phone"):
        country = record.get("country_code", "US")
        phone_result = standardize_phone(record["phone"], default_country=country)
        if phone_result.get("valid"):
            result["phone_e164"] = phone_result["formatted"]
            result["phone_country"] = phone_result.get("country")
        else:
            issues.append(f"invalid_phone: {record['phone']!r}")

    # Standardize address
    if record.get("address") and record.get("country_code"):
        addr_result = standardize_address(record["address"], record["country_code"])
        if addr_result.get("valid"):
            components = addr_result.get("components", {})
            result["address_standardized"] = format_address_standard(components)
            result["postal_code_validated"] = components.get("postal_code")
            result["city_standardized"] = components.get("city")
        else:
            issues.append(f"invalid_address: {record.get('address', '')[:50]!r}")

    # Validate email (validate_email is defined below)
    if record.get("email"):
        email_result = validate_email(record["email"])
        result["email_valid"] = email_result.get("valid", False)
        result["email_disposable"] = email_result.get("disposable", False)
        if not email_result.get("valid"):
            issues.append(f"invalid_email: {record['email']!r}")

    result["_standardization_issues"] = issues
    # Each issue costs 25 points, floored at 0
    result["_data_quality_score"] = max(0, 100 - (len(issues) * 25))
    return result

def validate_email(email: str) -> dict:
    """Validate email address format and domain."""
    response = requests.post(
        f"{BASE_URL}/validate/email",
        headers={"X-API-Key": API_KEY},
        json={"value": email, "check_disposable": True},
        timeout=10  # fail fast instead of hanging on a stalled connection
    )
    return response.json()

def process_customer_csv(
    input_path: str,
    output_path: str,
    max_workers: int = 5
) -> list[dict]:
    """
    Process a CSV file of customer records, standardizing all fields.
    """
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        records = list(reader)

    print(f"Processing {len(records):,} customer records...")
    standardized = []
    errors = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(standardize_customer_record, r): r for r in records}
        for i, future in enumerate(as_completed(futures), 1):
            if i % 100 == 0:
                print(f"  Processed {i}/{len(records)}...")
            try:
                result = future.result()
                standardized.append(result)
            except Exception as e:
                errors.append({"record": futures[future], "error": str(e)})

    # Write standardized output
    if standardized:
        # Records gain different keys (phone_e164 exists only for valid phones),
        # so use the union of all keys, in first-seen order, as the CSV header.
        fieldnames = list(dict.fromkeys(k for r in standardized for k in r))
        with open(output_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(standardized)

    # Summary
    quality_scores = [r.get("_data_quality_score", 0) for r in standardized]
    avg_quality = sum(quality_scores) / len(quality_scores) if quality_scores else 0
    print(f"\n{'═'*50}")
    print("Standardization Complete")
    print(f"{'═'*50}")
    print(f"Records processed: {len(standardized):,}")
    print(f"Errors: {len(errors):,}")
    print(f"Avg quality score: {avg_quality:.1f}/100")
    print(f"Output: {output_path}")

    def count_issue(prefix: str) -> int:
        """Count records with at least one issue containing the given prefix."""
        return sum(
            1 for r in standardized
            if any(prefix in i for i in r.get("_standardization_issues", []))
        )

    def pct(count: int) -> float:
        # Guard against division by zero when no record was standardized
        return count / len(standardized) * 100 if standardized else 0.0

    invalid_phones = count_issue("invalid_phone")
    invalid_addrs = count_issue("invalid_address")
    invalid_emails = count_issue("invalid_email")
    print("\nData Issues Found:")
    print(f"  Invalid phones: {invalid_phones:,} ({pct(invalid_phones):.1f}%)")
    print(f"  Invalid addresses: {invalid_addrs:,} ({pct(invalid_addrs):.1f}%)")
    print(f"  Invalid emails: {invalid_emails:,} ({pct(invalid_emails):.1f}%)")
    return standardized
Step 4: Deduplication After Standardization
Once records are standardized, deduplication becomes straightforward:
def deduplicate_customers(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """
    Deduplicate customer records using standardized fields.
    Returns (unique_records, duplicate_records).
    """
    seen_phones = {}  # phone_e164 -> id of the first record that used it
    seen_emails = {}  # normalized email -> id of the first record that used it
    unique = []
    duplicates = []

    for record in records:
        phone_key = record.get("phone_e164")
        # "or ''" also covers records where email is present but None
        email_key = (record.get("email") or "").lower().strip()
        is_duplicate = False

        if phone_key and phone_key in seen_phones:
            duplicates.append({
                **record,
                "_duplicate_of": seen_phones[phone_key],
                "_duplicate_reason": "phone_match"
            })
            is_duplicate = True
        elif email_key and email_key in seen_emails:
            duplicates.append({
                **record,
                "_duplicate_of": seen_emails[email_key],
                "_duplicate_reason": "email_match"
            })
            is_duplicate = True

        if not is_duplicate:
            unique.append(record)
            if phone_key:
                seen_phones[phone_key] = record.get("id", "unknown")
            if email_key:
                seen_emails[email_key] = record.get("id", "unknown")

    dedup_rate = len(duplicates) / len(records) * 100 if records else 0
    print(f"Deduplication: {len(unique):,} unique, {len(duplicates):,} duplicates ({dedup_rate:.1f}% dupe rate)")
    return unique, duplicates
Expected Results
Based on industry benchmarks for global SaaS companies running DataForge standardization:
| Data Quality Issue | Typical Rate Before | After DataForge |
|---|---|---|
| Invalid phone formats | 15–30% of records | < 2% |
| Duplicate records | 8–18% | < 0.5% |
| Undeliverable address | 12–22% | < 3% |
| Invalid email | 5–10% | Flagged for cleanup |
| Failed SMS delivery rate | 18–25% | 3–5% |
For a company with 100,000 customer records, removing a 15% duplicate rate and fixing the 20% of phone numbers that fail validation translates directly to:
- Marketing list efficiency improvement: ~25%
- SMS campaign delivery rate improvement: 15–20 percentage points
- Shipping failure rate reduction: 60–70%
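The record counts behind that 100,000-record example are simple to work out. The sketch below only does the arithmetic for the two inputs stated in the text (15% duplicates, 20% phone numbers). It assumes both percentages apply to the full record count, and it does not attempt to derive the downstream improvement figures in the list above.

```python
def cleanup_counts(total_records: int, duplicate_pct: int, bad_phone_pct: int) -> dict:
    """Turn percentage rates into record counts using integer arithmetic."""
    duplicates = total_records * duplicate_pct // 100
    phones_fixed = total_records * bad_phone_pct // 100
    return {
        "duplicates_removed": duplicates,
        "unique_records": total_records - duplicates,
        "phones_fixed": phones_fixed,
    }

print(cleanup_counts(100_000, 15, 20))
# {'duplicates_removed': 15000, 'unique_records': 85000, 'phones_fixed': 20000}
```

In other words, the example describes 15,000 records removed and 20,000 phone numbers corrected.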
Getting Started
DataForge API is available at apivult.com. Phone validation, address standardization, and email validation are all available as individual endpoints or as a combined record standardization call. Start with a sample of your customer database to see your current data quality baseline — most teams are surprised by how high their dupe rate actually is.
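One way to take that sample is to draw a fixed number of random rows from your export before running the full pipeline. This is a minimal sketch: `sample_rows` is a hypothetical helper, it loads the whole file into memory, and the sample size and seed are arbitrary choices.

```python
import csv
import io
import random

def sample_rows(csv_file, k: int, seed: int = 42) -> list[dict]:
    """Return up to k randomly chosen rows from an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    if len(rows) <= k:
        return rows                      # file is smaller than the sample size
    rng = random.Random(seed)            # fixed seed makes the sample repeatable
    return rng.sample(rows, k)

# Small in-memory demo standing in for a real export file
demo_csv = "id,phone\n" + "".join(f"{i},555-01{i:02d}\n" for i in range(10))
sample = sample_rows(io.StringIO(demo_csv), k=3)
print(len(sample))  # 3
```

Each sampled row can then go through `standardize_customer_record` from Step 3, and the average `_data_quality_score` across the sample gives a first estimate of your baseline.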