How to Build an Automated Regulatory Website Archiving System with WebShot API
Build a compliance-grade website archiving system using WebShot API. Capture timestamped screenshots of regulatory pages, detect changes, and store admissible evidence.

Regulated industries have a documentation problem. Compliance teams need to prove what was on a regulator's website on a specific date. Legal teams need to demonstrate what a competitor's pricing page said on a particular day. Financial services firms need archived copies of their own disclosures as they appeared to users.
Manual screenshots make weak evidence. They are easy to fake or back-date, and they lack the metadata and chain of custody that courts and regulators expect. You need automated, timestamped, hash-verified archives with an immutable audit trail.
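The hash-verified part does the heavy lifting: any change to the captured bytes, however small, yields a different SHA-256 digest, so a digest recorded at capture time pins down exactly what was archived. A tiny demonstration (the disclosure strings are illustrative, not from any real filing):

```python
import hashlib

# Two disclosures that differ by a single character produce unrelated digests,
# which is what makes a recorded SHA-256 usable as tamper evidence.
original = hashlib.sha256(b"Fee schedule: 1.25% annual").hexdigest()
tampered = hashlib.sha256(b"Fee schedule: 1.20% annual").hexdigest()
print(original == tampered)  # → False
```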
This guide shows you how to build a regulatory website archiving system using the WebShot API that captures, hashes, stores, and alerts on changes across any list of URLs on a schedule you define.
Use Cases for Automated Regulatory Archiving
Financial Services Compliance
- Archive your own public disclosures (prospectuses, fee schedules, rate tables) as they appeared to users on each date
- Archive regulator guidance pages to document what the rule said when you relied on it
- Monitor competitor pricing for fair advertising compliance
Legal and Litigation Support
- Capture third-party websites as evidence with timestamp verification
- Archive terms of service changes for contract dispute documentation
- Monitor for infringing content with timestamped proof of infringement dates
Pharmaceutical and Medical Device
- Archive FDA guidance documents as they existed when your approval application was submitted
- Monitor competitor labeling claims for regulatory complaint filings
Real Estate and Financial Disclosure
- Archive MLS listings as they appeared at time of purchase agreement
- Capture loan disclosure pages at time of signing
Architecture
URL Watch List (YAML config)
        │
        ▼
Scheduler (APScheduler / cron)
        │
        ▼
WebShot API ── Screenshot + Full-Page Capture
        │
        ├──────────────────┬──────────────────┐
        ▼                  ▼                  ▼
Change Detection    S3/GCS Storage      Hash Registry
        │           (with metadata)       (SHA-256)
        ▼
Alert Service (Email / Slack / PagerDuty)
Setup
pip install httpx python-dotenv apscheduler boto3 pyyaml

# .env
WEBSHOT_API_KEY=YOUR_API_KEY
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
S3_BUCKET=your-compliance-archive-bucket

Core WebShot API Integration
import httpx
import os
import hashlib
import base64
from datetime import datetime

WEBSHOT_BASE_URL = "https://apivult.com/api/webshot"

async def capture_regulatory_page(
    url: str,
    full_page: bool = True,
    viewport_width: int = 1920,
    wait_for_idle: bool = True
) -> dict:
    """
    Capture a full-page screenshot with complete metadata for compliance archiving.
    """
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{WEBSHOT_BASE_URL}/capture",
            headers={
                "X-RapidAPI-Key": os.getenv("WEBSHOT_API_KEY"),
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "full_page": full_page,
                "viewport": {
                    "width": viewport_width,
                    "height": 1080
                },
                "format": "png",
                "quality": 100,  # maximum quality for archival
                "wait_for_network_idle": wait_for_idle,
                "include_metadata": True,
                "capture_timestamp": True,
                "capture_headers": True
            }
        )
        response.raise_for_status()
        result = response.json()

    # Compute content hash for tamper detection
    if result.get("screenshot_base64"):
        image_bytes = base64.b64decode(result["screenshot_base64"])
        result["sha256_hash"] = hashlib.sha256(image_bytes).hexdigest()

    result["capture_time_utc"] = datetime.utcnow().isoformat()
    result["url"] = url
    return result

Change Detection
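The comparison logic below leans on a visual_diff_score field assumed to be present in the WebShot response. When it is absent, a crude local stand-in over two equal-length raw pixel buffers can fill the gap (a sketch only; real perceptual diffing belongs to an image library):

```python
def byte_diff_ratio(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equal-length byte buffers differ.
    A rough stand-in for visual_diff_score when the API omits it."""
    if len(a) != len(b):
        return 1.0  # different dimensions: treat as a full change
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

Decode both PNGs to raw RGBA buffers of the same size before comparing; diffing compressed PNG bytes directly will wildly overstate the change.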
def compare_captures(
    previous: dict,
    current: dict,
    sensitivity: str = "medium"
) -> dict:
    """
    Compare two captures and determine if the page has changed.

    sensitivity:
    - "low": flag hash changes and visual diffs above 5%
    - "medium": flag hash changes and visual diffs above 2%
    - "high": flag hash changes and visual diffs above 0.5%
    """
    # Hash comparison: the definitive change signal
    hash_changed = previous.get("sha256_hash") != current.get("sha256_hash")

    # Visual diff score (0.0 = identical, 1.0 = completely different)
    visual_diff_score = current.get("visual_diff_score", 0.0)

    thresholds = {
        "low": 0.05,
        "medium": 0.02,
        "high": 0.005
    }
    threshold = thresholds.get(sensitivity, 0.02)
    significant_change = visual_diff_score > threshold

    return {
        "change_detected": hash_changed or significant_change,
        "hash_changed": hash_changed,
        "visual_diff_score": visual_diff_score,
        "significant_visual_change": significant_change,
        "previous_capture_time": previous.get("capture_time_utc"),
        "current_capture_time": current.get("capture_time_utc"),
        "previous_hash": previous.get("sha256_hash"),
        "current_hash": current.get("sha256_hash")
    }

Archive Storage with Full Metadata
import boto3
import base64
import json
import os
from datetime import datetime

s3_client = boto3.client("s3")

def archive_capture(
    capture_result: dict,
    url: str,
    url_label: str,
    bucket: str = None
) -> str:
    """
    Store a capture in S3 with full compliance metadata.
    Returns the S3 key for the archived screenshot.
    """
    bucket = bucket or os.getenv("S3_BUCKET")
    timestamp = datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    safe_label = url_label.replace(" ", "_").replace("/", "-").lower()
    screenshot_key = f"archive/{safe_label}/{timestamp}/screenshot.png"
    metadata_key = f"archive/{safe_label}/{timestamp}/metadata.json"

    # Store screenshot
    image_bytes = base64.b64decode(capture_result["screenshot_base64"])
    s3_client.put_object(
        Bucket=bucket,
        Key=screenshot_key,
        Body=image_bytes,
        ContentType="image/png",
        Metadata={
            "url": url,
            "capture_time": capture_result.get("capture_time_utc", ""),
            "sha256": capture_result.get("sha256_hash", ""),
            "full_page": str(capture_result.get("full_page", True))
        },
        # Enable S3 Object Lock for immutable compliance archiving:
        # ObjectLockMode="COMPLIANCE",
        # ObjectLockRetainUntilDate=compliance_retention_date
    )

    # Store metadata separately for fast querying
    metadata = {
        "url": url,
        "label": url_label,
        "capture_time_utc": capture_result.get("capture_time_utc"),
        "sha256_hash": capture_result.get("sha256_hash"),
        "http_status": capture_result.get("http_status"),
        "page_title": capture_result.get("page_title"),
        "response_headers": capture_result.get("response_headers", {}),
        "screenshot_s3_key": screenshot_key,
        "full_page_captured": capture_result.get("full_page", True)
    }
    s3_client.put_object(
        Bucket=bucket,
        Key=metadata_key,
        Body=json.dumps(metadata, indent=2),
        ContentType="application/json"
    )
    return screenshot_key

Scheduled Archiving with Change Alerts
from apscheduler.schedulers.asyncio import AsyncIOScheduler
import asyncio
import yaml

async def archive_url_and_detect_changes(url_config: dict):
    """
    Capture a URL, store it, and alert if the page changed since the last capture.
    """
    url = url_config["url"]
    label = url_config["label"]
    sensitivity = url_config.get("sensitivity", "medium")

    # Capture current state
    current = await capture_regulatory_page(
        url=url,
        full_page=url_config.get("full_page", True)
    )
    if not current.get("sha256_hash"):
        print(f"Capture failed for {url}: {current.get('error')}")
        return

    # Archive to S3
    archive_key = archive_capture(current, url=url, url_label=label)
    print(f"Archived {label} → {archive_key}")

    # Compare against the previous capture
    # (get_previous_capture / save_latest_capture / send_change_alert are thin
    # wrappers around your state store and alerting channel)
    previous = get_previous_capture(url)
    if previous:
        diff = compare_captures(previous, current, sensitivity=sensitivity)
        if diff["change_detected"]:
            send_change_alert(
                url=url,
                label=label,
                diff=diff,
                archive_key=archive_key
            )
            print(f"CHANGE DETECTED on {label}: visual diff score = {diff['visual_diff_score']:.4f}")

    # Store as the new "previous" for the next comparison
    save_latest_capture(url, current)

def load_watch_list(config_path: str = "watch_list.yaml") -> list:
    with open(config_path) as f:
        return yaml.safe_load(f)["urls"]

async def run_archiving_cycle(schedule: str = None):
    """Archive all configured URLs, optionally filtered by schedule tier."""
    watch_list = load_watch_list()
    if schedule:
        watch_list = [u for u in watch_list if u.get("schedule") == schedule]
    tasks = [archive_url_and_detect_changes(url_config) for url_config in watch_list]
    await asyncio.gather(*tasks)
    print(f"Archiving cycle complete: {len(watch_list)} URLs processed")

# Configure the archiving schedule
def start_scheduler():
    scheduler = AsyncIOScheduler()
    # Daily archiving of "daily" URLs at 00:01 UTC
    scheduler.add_job(
        run_archiving_cycle,
        trigger="cron",
        hour=0,
        minute=1,
        kwargs={"schedule": "daily"},
        id="daily_archive"
    )
    # Additional hourly capture for high-priority "hourly" URLs.
    # AsyncIOScheduler runs coroutine functions on the active event loop,
    # so the coroutine is passed directly rather than wrapped in asyncio.run().
    scheduler.add_job(
        run_archiving_cycle,
        trigger="interval",
        hours=1,
        kwargs={"schedule": "hourly"},
        id="hourly_priority"
    )
    scheduler.start()
    return scheduler

Sample Watch List Configuration
# watch_list.yaml
urls:
  - url: "https://www.sec.gov/rules/final.shtml"
    label: "SEC Final Rules"
    sensitivity: "low"
    full_page: true
    schedule: "daily"
  - url: "https://home.treasury.gov/policy-issues/financial-sanctions/recent-actions"
    label: "OFAC Recent Actions"
    sensitivity: "medium"
    full_page: true
    schedule: "hourly"
  - url: "https://yourcompany.com/legal/privacy-policy"
    label: "Own Privacy Policy"
    sensitivity: "high"
    full_page: true
    schedule: "daily"
  - url: "https://yourcompany.com/rates"
    label: "Own Rate Disclosure"
    sensitivity: "high"
    full_page: false
    schedule: "hourly"

Generating a Chain-of-Custody Report
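The report code below closes with a verify_hash_chain check that this guide does not define. One hedged reading, sketched here, is a structural pass over the capture log rather than a cryptographic linked chain: every entry must carry a hash, and consistently formatted ISO-8601 UTC timestamps (which compare correctly as strings) must never run backwards:

```python
def verify_hash_chain(archives: list) -> dict:
    """Structural integrity pass over an archive log: every entry must carry
    a SHA-256 hash, and capture timestamps must be non-decreasing.
    (An assumption about what the report expects; adapt to your hash registry.)"""
    issues = []
    prev_time = None
    for i, entry in enumerate(archives):
        if not entry.get("sha256_hash"):
            issues.append(f"entry {i}: missing sha256_hash")
        t = entry.get("capture_time_utc")
        if t is not None:
            if prev_time is not None and t < prev_time:
                issues.append(f"entry {i}: capture time out of order")
            prev_time = t
    return {"intact": not issues, "issues": issues}
```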
from datetime import datetime

def generate_chain_of_custody_report(
    url_label: str,
    from_date: str,
    to_date: str
) -> dict:
    """
    Generate a chain-of-custody report for a URL's archive history.
    Suitable for legal proceedings and regulatory inquiries.
    """
    # get_archives_for_label queries the metadata.json objects written by
    # archive_capture for this label within the date range
    archives = get_archives_for_label(url_label, from_date, to_date)
    return {
        "subject": url_label,
        "period": {"from": from_date, "to": to_date},
        "report_generated": datetime.utcnow().isoformat(),
        "capture_count": len(archives),
        "methodology": (
            "Automated full-page screenshots captured via WebShot API. "
            "Each capture includes SHA-256 content hash, UTC timestamp, "
            "HTTP response headers, and page metadata. "
            "Archives stored in S3 with server-side encryption."
        ),
        "capture_log": [
            {
                "timestamp": a["capture_time_utc"],
                "sha256_hash": a["sha256_hash"],
                "http_status": a["http_status"],
                "page_title": a["page_title"],
                "s3_key": a["screenshot_s3_key"]
            }
            for a in archives
        ],
        "hash_chain_integrity": verify_hash_chain(archives)
    }

Results at Scale
Compliance teams using WebShot API for regulatory archiving report:
- 100% coverage of configured URLs with zero manual intervention
- Sub-5-minute detection of changes on monitored pages
- Full audit trail accepted by legal counsel for litigation support
- 90% cost reduction vs. commercial web archiving services
- Instant retrieval of historical states for any date in the archive
Start with 5–10 high-priority regulatory URLs, verify the archive quality, and then expand to your full watch list.
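Archive quality can be spot-checked by pulling a stored screenshot back out and re-hashing it. verify_archived_bytes is an illustrative helper, not part of the WebShot API:

```python
import hashlib

def verify_archived_bytes(image_bytes: bytes, recorded_sha256: str) -> bool:
    """Recompute the digest of bytes retrieved from the archive and compare
    it to the hash recorded in metadata.json at capture time."""
    return hashlib.sha256(image_bytes).hexdigest() == recorded_sha256
```

Fetch the object with s3_client.get_object(Bucket=..., Key=...)["Body"].read() and pass the bytes in; any mismatch against the recorded hash means the stored image no longer matches what was captured.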
Get Started
- Get your WebShot API key from APIVult
- Create your watch_list.yaml with your priority regulatory URLs
- Set up an S3 bucket with versioning enabled for your archive store
- Deploy the scheduler as a background service
- Generate your first chain-of-custody report to verify archive quality
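One way to run the scheduler as a background service is a systemd unit; the paths and module name below are illustrative, not prescribed by this guide:

```ini
# /etc/systemd/system/regulatory-archiver.service
[Unit]
Description=Regulatory website archiving scheduler
After=network-online.target

[Service]
WorkingDirectory=/opt/regulatory-archiver
EnvironmentFile=/opt/regulatory-archiver/.env
# Assumed entry point: a module that calls start_scheduler() and keeps the loop alive
ExecStart=/usr/bin/python3 -m archiver
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now regulatory-archiver, then check journalctl for the per-cycle log lines.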
Regulatory archiving is not a project you want to start after you need the records. Build the system now so the evidence exists when it matters.