January 27, 2025

Cross-Site Schema Federation: Building a Unified API Interface Across Diverse Web Platforms

At Anon, we enable developers to build automated integrations with any website, even those without public APIs. One of our core technical challenges is automatically mapping diverse website structures and data models into a consistent API interface that developers can reliably build against. This post describes how we built our schema federation system to solve this challenge while maintaining backwards compatibility as website structures evolve.

The Challenge: Schema Diversity at Web Scale

When building integrations across thousands of websites, we face three key technical challenges:

  1. Structural Diversity: Every website structures its data differently - from DOM hierarchies to state management approaches. An e-commerce product page on one site might store crucial data in nested DOM elements, while another uses JSON embedded in script tags.
  2. Schema Evolution: Website structures frequently change with deployments. Our system needs to handle these changes gracefully without breaking existing integrations.
  3. Interface Consistency: Despite the underlying diversity, we need to present a unified, stable API interface that developers can build against confidently.
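
To make the first and third challenges concrete, here is a minimal sketch, not the production implementation. Two hypothetical sites expose the same price differently: one as display text in a DOM node, the other as JSON embedded in a script tag. The `Product` shape, both adapter functions, and the sample payloads are illustrative assumptions.

```typescript
// Unified shape that API consumers see (hypothetical, for illustration).
interface Product {
  name: string;
  price: number;
}

// Site A: the price arrives as display text scraped from a DOM node.
function fromDomText(name: string, priceText: string): Product {
  return { name, price: parseFloat(priceText.replace(/[^0-9.]/g, "")) };
}

// Site B: product data arrives as JSON embedded in a script tag.
function fromEmbeddedJson(json: string): Product {
  const raw = JSON.parse(json) as { title: string; offer: { amount: string } };
  return { name: raw.title, price: Number(raw.offer.amount) };
}

const a = fromDomText("Desk Lamp", "$1,299.50");
const b = fromEmbeddedJson('{"title":"Desk Lamp","offer":{"amount":"1299.50"}}');
```

Both adapters produce the same `Product` value, so code written against `Product` does not need to know which site the data came from.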

Our Solution: The Schema Federation Engine

To address these challenges, we built a schema federation engine that sits between our automation runtime and the API layer. Here's how it works:

1. Site-Specific Schema Mappings

For each supported website, we maintain a schema mapping configuration that defines how to extract structured data:

const productMapping = {
  version: "2.0",
  site: "example.com",
  extractors: [{
    // Primary CSS selector, with an XPath fallback that reads the price from a meta tag
    selector: ".product-price",
    fallbacks: ["//meta[@property='product:price']/@content"]
  }],
  transformers: [{
    field: "price",
    // Strip currency symbols and separators, then parse the remainder as a number
    transform: (value) => parseFloat(value.replace(/[^0-9.]/g, ''))
  }]
};
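
As a rough illustration of how the `transformers` section of such a mapping might be applied, the sketch below runs each transformer over a record of raw extracted strings. The `FieldTransformer` type and the `applyTransformers` helper are assumptions made for this example, not part of the mapping format itself.

```typescript
// Hypothetical type mirroring one entry of the `transformers` array.
interface FieldTransformer {
  field: string;
  transform: (value: string) => unknown;
}

// Apply each transformer to its field; fields without a transformer pass through unchanged.
function applyTransformers(
  raw: Record<string, string>,
  transformers: FieldTransformer[]
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...raw };
  for (const t of transformers) {
    const value = raw[t.field];
    if (value !== undefined) {
      out[t.field] = t.transform(value);
    }
  }
  return out;
}

const transformers: FieldTransformer[] = [
  { field: "price", transform: (v) => parseFloat(v.replace(/[^0-9.]/g, "")) },
];

// Raw strings as they might come back from the extractors.
const record = applyTransformers({ price: "$19.99", title: "Mug" }, transformers);
```

Keeping transformation separate from extraction means a selector can change without touching the normalization logic for that field.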

2. Version-Aware Schema Resolution

When a website's structure changes, we keep multiple schema versions side by side and resolve to the first one that validates against the live page:

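class SchemaResolver {
  // Returns the first stored schema version that validates against the live page.
  async resolveSchema(site: string, page: Page): Promise<SchemaMapping> {
    // All known schema versions for this site, in the order they should be tried
    const candidates = await this.getSchemaVersions(site);

    for (const schema of candidates) {
      if (await this.validateSchema(schema, page)) {
        return schema;
      }
    }
    // No stored version matches the current page structure
    throw new SchemaResolutionError(site);
  }
}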
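
The resolver above depends on helpers that are not shown. The following self-contained sketch illustrates the same first-match idea with in-memory data. It assumes, purely for illustration, that a schema validates when all of its required selectors are present on the page; the `requiredSelectors` field and the `present` set are hypothetical stand-ins for a live page check.

```typescript
// Hypothetical minimal schema: a version label plus the selectors it requires.
interface SchemaMapping {
  version: string;
  requiredSelectors: string[];
}

class SchemaResolutionError extends Error {
  constructor(site: string) {
    super(`No schema version matched ${site}`);
    this.name = "SchemaResolutionError";
  }
}

// Return the first candidate whose required selectors are all present on the page.
function resolveSchema(
  site: string,
  candidates: SchemaMapping[],
  present: Set<string>
): SchemaMapping {
  for (const schema of candidates) {
    if (schema.requiredSelectors.every((s) => present.has(s))) {
      return schema;
    }
  }
  throw new SchemaResolutionError(site);
}

const candidates: SchemaMapping[] = [
  { version: "2.0", requiredSelectors: [".product-price", ".product-title"] },
  { version: "1.0", requiredSelectors: ["#price"] },
];

// The page still serves the old layout, so resolution falls through to version 1.0.
const resolved = resolveSchema("example.com", candidates, new Set(["#price"]));
```

Because candidates are tried in order, listing the newest version first means older versions are only used when the page has not yet moved to the new layout.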

3. Adaptive Data Extraction

Our extraction engine adapts to different data storage patterns using a cascading fallback system:

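// Assumes a Puppeteer/Playwright-style Page whose evaluate() runs a function in the page context.
async function extractWithFallbacks(
  page: Page,
  selector: string,
  fallbacks: string[]
): Promise<string | null> {
  // Try the primary selector first, then each fallback in order
  for (const candidate of [selector, ...fallbacks]) {
    // Pass the selector as an argument; evaluate() would otherwise run the string as a script
    const value = await page.evaluate((sel) => {
      // Selectors starting with "/" are treated as XPath, everything else as CSS
      if (sel.startsWith("/")) {
        return document.evaluate(sel, document, null, XPathResult.STRING_TYPE, null).stringValue;
      }
      return document.querySelector(sel)?.textContent ?? null;
    }, candidate);
    if (value && value.trim()) return value.trim();
  }
  return null;
}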
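
The cascading logic can also be exercised without a browser. The sketch below replaces the page with a simple lookup function so the fallback order is easy to see and test. The `Lookup` type, `firstMatch`, and the simulated page data are assumptions introduced for this example.

```typescript
// A lookup maps a selector to the text found for it, or null when nothing matches.
type Lookup = (selector: string) => string | null;

// Try the primary selector, then each fallback in order; return the first non-empty value.
function firstMatch(lookup: Lookup, selector: string, fallbacks: string[]): string | null {
  for (const candidate of [selector, ...fallbacks]) {
    const value = lookup(candidate);
    if (value !== null && value.trim() !== "") {
      return value.trim();
    }
  }
  return null;
}

// Simulated page: the primary selector no longer matches, but the meta tag does.
const pageData = new Map<string, string>([
  ["//meta[@property='product:price']/@content", "24.00"],
]);
const lookup: Lookup = (sel) => pageData.get(sel) ?? null;

const price = firstMatch(lookup, ".product-price", [
  "//meta[@property='product:price']/@content",
]);
```

Here the primary selector finds nothing, so the value comes from the first fallback; if every candidate fails, the function returns `null` and the caller can decide how to handle the missing field.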

4. Schema Evolution and Compatibility

To maintain backwards compatibility while allowing schemas to evolve, we version every schema: breaking changes create a new major version, while compatible changes update the current version in place:

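class SchemaEvolutionManager {
  async evolveSchema(site: string, changes: SchemaChange[]): Promise<string> {
    // Load the schema currently in use for this site (getCurrentSchema is an assumed helper)
    const currentSchema = await this.getCurrentSchema(site);
    // Any breaking change forces a new major version; otherwise update in place
    return changes.some(c => c.type === 'breaking')
      ? this.createNewMajorVersion(currentSchema, changes)
      : this.updateCurrentVersion(currentSchema, changes);
  }
}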
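
One way to picture the version decision is as a function from the current version string and a list of changes to the next version string. The sketch below assumes a `major.minor` format like the `"2.0"` used in the mapping example, and assumes that compatible changes bump the minor number; the `description` field and the `"compatible"` change type are hypothetical.

```typescript
// Hypothetical change record; only `type` with the value "breaking" comes from the snippet above.
interface SchemaChange {
  type: "breaking" | "compatible";
  description: string;
}

// Breaking changes bump the major number and reset the minor; other changes bump the minor.
function nextVersion(current: string, changes: SchemaChange[]): string {
  const [major, minor] = current.split(".").map(Number);
  if (major === undefined || minor === undefined || Number.isNaN(major) || Number.isNaN(minor)) {
    throw new Error(`Unexpected version format: ${current}`);
  }
  return changes.some((c) => c.type === "breaking")
    ? `${major + 1}.0`
    : `${major}.${minor + 1}`;
}

const compatible = nextVersion("2.0", [
  { type: "compatible", description: "added fallback selector" },
]);
const breaking = nextVersion("2.3", [
  { type: "breaking", description: "price field moved to embedded JSON" },
]);
```

A single breaking change in the list is enough to trigger a new major version, which lets the older major version keep serving pages that have not changed yet.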

Results and Future Work

This schema federation system has enabled us to:

  • Support 2000+ websites with unique structures through ~100 base schema templates
  • Maintain 99.9% API stability during website structure changes
  • Reduce new site integration time by 80%

We're currently working on:

  • Machine learning-based schema detection to automate mapping creation
  • Real-time schema validation and anomaly detection
  • Automatic schema migration suggestions based on historical patterns

Key Learnings

Building a robust schema federation system taught us several lessons:

  1. Progressive Enhancement: Start with basic extractors and progressively add complexity only where needed.
  2. Defensive Extraction: Always implement fallback mechanisms - websites can change unexpectedly.
  3. Version Tolerance: Design for schema coexistence rather than immediate deprecation.

The challenge of cross-site schema federation exemplifies how modern web automation requires sophisticated infrastructure to handle the diversity and evolution of web platforms reliably.
