Education · Last updated April 5, 2026

How to Build an Automated Regulatory Website Archiving System with WebShot API

Build a compliance-grade website archiving system using WebShot API. Capture timestamped screenshots of regulatory pages, detect changes, and store admissible evidence.

Regulated industries have a documentation problem. Compliance teams need to prove what was on a regulator's website on a specific date. Legal teams need to demonstrate that a competitor's pricing page said something on a particular day. Financial services firms need archived copies of their own disclosures as they appeared to users.

Manual screenshots make weak evidence. They can be faked or date-tampered, and they lack the metadata chain of custody that courts and regulators expect. You need automated, timestamped, hash-verified archives with an immutable audit trail.
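
The hash-verification idea is simple enough to show in a few lines. This is an illustration only (not part of the system built below): pair each capture's SHA-256 digest with a UTC timestamp, and any later tampering with the stored bytes changes the digest.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint_capture(image_bytes: bytes) -> dict:
    """Pair a SHA-256 content hash with a UTC timestamp.

    If the archived image is modified after the fact, its recomputed
    hash no longer matches the recorded one.
    """
    return {
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```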

This guide shows you how to build a regulatory website archiving system using the WebShot API that captures, hashes, stores, and alerts on changes across any list of URLs on a schedule you define.

Use Cases for Automated Regulatory Archiving

Financial Services Compliance

  • Archive your own public disclosures (prospectuses, fee schedules, rate tables) as they appeared to users on each date
  • Archive regulator guidance pages to document what the rule said when you relied on it
  • Monitor competitor pricing for fair advertising compliance

Legal and Litigation Support

  • Capture third-party websites as evidence with timestamp verification
  • Archive terms of service changes for contract dispute documentation
  • Monitor for infringing content with timestamped proof of infringement dates

Pharmaceutical and Medical Device

  • Archive FDA guidance documents as they existed when your approval application was submitted
  • Monitor competitor labeling claims for regulatory complaint filings

Real Estate and Financial Disclosure

  • Archive MLS listings as they appeared at time of purchase agreement
  • Capture loan disclosure pages at time of signing

Architecture

URL Watch List (YAML config)
        │
        ▼
Scheduler (APScheduler / cron)
        │
        ▼
WebShot API ──────────── Screenshot + Full Page Capture
        │                          │
        │              ┌───────────┴───────────┐
        │              │                       │
        ▼              ▼                       ▼
Change Detection    S3/GCS Storage         Hash Registry
        │           (with metadata)        (SHA-256)
        │
        ▼
Alert Service (Email / Slack / PagerDuty)

Setup

pip install httpx python-dotenv apscheduler boto3 pyyaml
# .env
WEBSHOT_API_KEY=YOUR_API_KEY
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
S3_BUCKET=your-compliance-archive-bucket

Core WebShot API Integration

import httpx
import os
import hashlib
import base64
from datetime import datetime
 
WEBSHOT_BASE_URL = "https://apivult.com/api/webshot"
 
 
async def capture_regulatory_page(
    url: str,
    full_page: bool = True,
    viewport_width: int = 1920,
    wait_for_idle: bool = True
) -> dict:
    """
    Capture a full-page screenshot with complete metadata for compliance archiving.
    """
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{WEBSHOT_BASE_URL}/capture",
            headers={
                "X-RapidAPI-Key": os.getenv("WEBSHOT_API_KEY"),
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "full_page": full_page,
                "viewport": {
                    "width": viewport_width,
                    "height": 1080
                },
                "format": "png",
                "quality": 100,  # Maximum quality (mainly relevant for JPEG; PNG is lossless)
                "wait_for_network_idle": wait_for_idle,
                "include_metadata": True,
                "capture_timestamp": True,
                "capture_headers": True
            }
        )
        response.raise_for_status()
        result = response.json()
 
        # Compute content hash for tamper detection
        if result.get("screenshot_base64"):
            image_bytes = base64.b64decode(result["screenshot_base64"])
            result["sha256_hash"] = hashlib.sha256(image_bytes).hexdigest()
            result["capture_time_utc"] = datetime.utcnow().isoformat()
            result["url"] = url
 
        return result
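
Captures of regulator sites occasionally fail on slow loads or transient network errors, so in practice you want retries around the capture call. Here is a minimal, hedged sketch of a generic async retry wrapper with exponential backoff; `fn` is any zero-argument coroutine function, e.g. `lambda: capture_regulatory_page(url)`.

```python
import asyncio


async def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call an async function, retrying with exponential backoff.

    Illustrative sketch: in production you would catch specific
    exceptions (e.g. httpx.HTTPError) rather than bare Exception.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception as exc:
            last_exc = exc
            # Back off: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc
```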

Change Detection

 
def compare_captures(
    previous: dict,
    current: dict,
    sensitivity: str = "medium"
) -> dict:
    """
    Compare two captures and determine if the page has changed.
 
    sensitivity:
    - "low": Only flag hash-level changes
    - "medium": Also flag layout/structural changes
    - "high": Flag any visual difference above 1%
    """
    # Hash comparison — definitive change detection
    hash_changed = previous.get("sha256_hash") != current.get("sha256_hash")
 
    # Visual diff score (0.0 = identical, 1.0 = completely different)
    visual_diff_score = current.get("visual_diff_score", 0.0)
 
    thresholds = {
        "low": 0.05,
        "medium": 0.02,
        "high": 0.005
    }
 
    threshold = thresholds.get(sensitivity, 0.02)
    significant_change = visual_diff_score > threshold
 
    return {
        "change_detected": hash_changed or significant_change,
        "hash_changed": hash_changed,
        "visual_diff_score": visual_diff_score,
        "significant_visual_change": significant_change,
        "previous_capture_time": previous.get("capture_time_utc"),
        "current_capture_time": current.get("capture_time_utc"),
        "previous_hash": previous.get("sha256_hash"),
        "current_hash": current.get("sha256_hash")
    }
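
Before deploying, it helps to sanity-check how the sensitivity settings behave. The helper below restates the same thresholds as `compare_captures` so you can exercise them in isolation; it is a tuning aid, not part of the pipeline.

```python
VALID_SENSITIVITIES = {"low": 0.05, "medium": 0.02, "high": 0.005}


def passes_threshold(visual_diff_score: float, sensitivity: str = "medium") -> bool:
    """Mirror of the thresholds used in compare_captures."""
    return visual_diff_score > VALID_SENSITIVITIES.get(sensitivity, 0.02)
```

For example, a 1% visual change is ignored at "medium" sensitivity but flagged at "high" — though note that a hash change still triggers `change_detected` regardless of the visual score.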

Archive Storage with Full Metadata

import base64
import boto3
import json
import os
from datetime import datetime
 
s3_client = boto3.client("s3")
 
 
def archive_capture(
    capture_result: dict,
    url: str,
    url_label: str,
    bucket: str | None = None
) -> str:
    """
    Store a capture in S3 with full compliance metadata.
    Returns the S3 key for the archived screenshot.
    """
    bucket = bucket or os.getenv("S3_BUCKET")
    timestamp = datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    safe_label = url_label.replace(" ", "_").replace("/", "-").lower()
 
    screenshot_key = f"archive/{safe_label}/{timestamp}/screenshot.png"
    metadata_key = f"archive/{safe_label}/{timestamp}/metadata.json"
 
    # Store screenshot
    image_bytes = base64.b64decode(capture_result["screenshot_base64"])
    s3_client.put_object(
        Bucket=bucket,
        Key=screenshot_key,
        Body=image_bytes,
        ContentType="image/png",
        Metadata={
            "url": url,
            "capture_time": capture_result.get("capture_time_utc", ""),
            "sha256": capture_result.get("sha256_hash", ""),
            "full_page": str(capture_result.get("full_page", True))
        },
        # Enable object lock for immutable compliance archiving
        # ObjectLockMode="COMPLIANCE",
        # ObjectLockRetainUntilDate=compliance_retention_date
    )
 
    # Store metadata separately for fast querying
    metadata = {
        "url": url,
        "label": url_label,
        "capture_time_utc": capture_result.get("capture_time_utc"),
        "sha256_hash": capture_result.get("sha256_hash"),
        "http_status": capture_result.get("http_status"),
        "page_title": capture_result.get("page_title"),
        "response_headers": capture_result.get("response_headers", {}),
        "screenshot_s3_key": screenshot_key,
        "full_page_captured": capture_result.get("full_page", True)
    }
 
    s3_client.put_object(
        Bucket=bucket,
        Key=metadata_key,
        Body=json.dumps(metadata, indent=2),
        ContentType="application/json"
    )
 
    return screenshot_key
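
The commented-out Object Lock parameters above matter for admissibility: in COMPLIANCE mode, S3 refuses to delete or overwrite the object until the retention date passes. Here is a sketch of assembling the `put_object` arguments with a retention date. It assumes the bucket was created with Object Lock enabled; the seven-year default is an assumption — use your regulator's actual requirement (SEC Rule 17a-4, for instance, is often cited at six years for certain records).

```python
from datetime import datetime, timedelta, timezone


def build_archive_put_kwargs(bucket: str, key: str, body: bytes,
                             retention_years: int = 7) -> dict:
    """Assemble boto3 put_object kwargs with S3 Object Lock in
    COMPLIANCE mode for immutable retention."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=365 * retention_years)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ContentType": "image/png",
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until,
    }
```

Usage would then be `s3_client.put_object(**build_archive_put_kwargs(bucket, screenshot_key, image_bytes))`.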

Scheduled Archiving with Change Alerts

from apscheduler.schedulers.asyncio import AsyncIOScheduler
import asyncio
import yaml
 
async def archive_url_and_detect_changes(url_config: dict):
    """
    Capture a URL, store it, and alert if the page changed since last capture.
    """
    url = url_config["url"]
    label = url_config["label"]
    sensitivity = url_config.get("sensitivity", "medium")
 
    # Capture current state
    current = await capture_regulatory_page(
        url=url,
        full_page=url_config.get("full_page", True)
    )
 
    if not current.get("sha256_hash"):
        print(f"Capture failed for {url}: {current.get('error')}")
        return
 
    # Archive to S3
    archive_key = archive_capture(current, url=url, url_label=label)
    print(f"Archived {label} -> {archive_key}")
 
    # Compare against previous capture
    previous = get_previous_capture(url)
 
    if previous:
        diff = compare_captures(previous, current, sensitivity=sensitivity)
 
        if diff["change_detected"]:
            send_change_alert(
                url=url,
                label=label,
                diff=diff,
                archive_key=archive_key
            )
            print(f"CHANGE DETECTED on {label}: visual diff score = {diff['visual_diff_score']:.4f}")
 
    # Store as new "previous" for next comparison
    save_latest_capture(url, current)
 
 
def load_watch_list(config_path: str = "watch_list.yaml") -> list:
    with open(config_path) as f:
        return yaml.safe_load(f)["urls"]
 
 
async def run_archiving_cycle():
    """Archive all configured URLs."""
    watch_list = load_watch_list()
    tasks = [archive_url_and_detect_changes(url_config) for url_config in watch_list]
    await asyncio.gather(*tasks)
    print(f"Archiving cycle complete: {len(watch_list)} URLs processed")
 
 
# Configure archiving schedule
def start_scheduler():
    scheduler = AsyncIOScheduler()
 
    # Daily archiving at 00:01 UTC
    scheduler.add_job(
        run_archiving_cycle,
        trigger="cron",
        hour=0,
        minute=1
    )
 
    # Optional second, more frequent cycle. AsyncIOScheduler runs
    # coroutine functions directly — never wrap them in asyncio.run()
    # inside a running event loop. To limit this job to high-priority
    # URLs, filter on each entry's "schedule" field inside the cycle.
    scheduler.add_job(
        run_archiving_cycle,
        trigger="interval",
        hours=1,
        id="hourly_priority"
    )
 
    scheduler.start()
    return scheduler

Sample Watch List Configuration

# watch_list.yaml
urls:
  - url: "https://www.sec.gov/rules/final.shtml"
    label: "SEC Final Rules"
    sensitivity: "low"
    full_page: true
    schedule: "daily"
 
  - url: "https://home.treasury.gov/policy-issues/financial-sanctions/recent-actions"
    label: "OFAC Recent Actions"
    sensitivity: "medium"
    full_page: true
    schedule: "hourly"
 
  - url: "https://yourcompany.com/legal/privacy-policy"
    label: "Own Privacy Policy"
    sensitivity: "high"
    full_page: true
    schedule: "daily"
 
  - url: "https://yourcompany.com/rates"
    label: "Own Rate Disclosure"
    sensitivity: "high"
    full_page: false
    schedule: "hourly"
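
A typo in a watch-list entry fails silently — an unknown `sensitivity`, for instance, falls back to the "medium" threshold in `compare_captures`. A small validation pass at load time catches this; the rules below are a sketch of reasonable checks, not an exhaustive schema.

```python
VALID_SENSITIVITIES = {"low", "medium", "high"}


def validate_watch_entry(entry: dict) -> list[str]:
    """Return a list of problems with one watch-list entry (empty = valid)."""
    problems = []
    for required in ("url", "label"):
        if not entry.get(required):
            problems.append(f"missing required field: {required}")
    if entry.get("sensitivity", "medium") not in VALID_SENSITIVITIES:
        problems.append(f"unknown sensitivity: {entry['sensitivity']!r}")
    if not str(entry.get("url", "")).startswith("https://"):
        problems.append("url should be https:// for archival integrity")
    return problems
```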

Generating a Chain-of-Custody Report

def generate_chain_of_custody_report(
    url_label: str,
    from_date: str,
    to_date: str
) -> dict:
    """
    Generate a chain-of-custody report for a URL's archive history.
    Suitable for legal proceedings and regulatory inquiries.
    """
    archives = get_archives_for_label(url_label, from_date, to_date)
 
    return {
        "subject": url_label,
        "period": {"from": from_date, "to": to_date},
        "report_generated": datetime.utcnow().isoformat(),
        "capture_count": len(archives),
        "methodology": (
            "Automated full-page screenshots captured via WebShot API. "
            "Each capture includes SHA-256 content hash, UTC timestamp, "
            "HTTP response headers, and page metadata. "
            "Archives stored in S3 with server-side encryption."
        ),
        "capture_log": [
            {
                "timestamp": a["capture_time_utc"],
                "sha256_hash": a["sha256_hash"],
                "http_status": a["http_status"],
                "page_title": a["page_title"],
                "s3_key": a["screenshot_s3_key"]
            }
            for a in archives
        ],
        "hash_chain_integrity": verify_hash_chain(archives)
    }
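
`verify_hash_chain` is not shown above. One hedged implementation links each capture's hash to its predecessor's link, so deleting, reordering, or altering any archived entry breaks every link after it. The `chain_link` field is an assumption: it would be written alongside each capture at archive time.

```python
import hashlib


def chain_digest(prev_link: str, capture_hash: str) -> str:
    """Link one capture into the chain: SHA-256(previous_link + capture_hash)."""
    return hashlib.sha256((prev_link + capture_hash).encode()).hexdigest()


def verify_hash_chain(archives: list) -> bool:
    """Recompute the chain over chronologically ordered captures and
    compare against each entry's stored chain_link."""
    link = ""
    for entry in archives:
        link = chain_digest(link, entry["sha256_hash"])
        if entry.get("chain_link") != link:
            return False
    return True
```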

Results at Scale

Compliance teams using WebShot API for regulatory archiving report:

  • 100% coverage of configured URLs with zero manual intervention
  • Sub-5-minute detection of changes on monitored pages
  • Full audit trail accepted by legal counsel for litigation support
  • 90% cost reduction vs. commercial web archiving services
  • Instant retrieval of historical states for any date in the archive

Start with 5–10 high-priority regulatory URLs, verify the archive quality, and then expand to your full watch list.

Get Started

  1. Get your WebShot API key from APIVult
  2. Create your watch_list.yaml with your priority regulatory URLs
  3. Set up an S3 bucket with versioning enabled for your archive store
  4. Deploy the scheduler as a background service
  5. Generate your first chain-of-custody report to verify archive quality

Regulatory archiving is not a project you want to start after you need the records. Build the system now so the evidence exists when it matters.