The Challenge: Schema Diversity at Web Scale
When building integrations across thousands of websites, we face three key technical challenges:
- Structural Diversity: Every website structures its data differently - from DOM hierarchies to state management approaches. An e-commerce product page on one site might store crucial data in nested DOM elements, while another uses JSON embedded in script tags.
- Schema Evolution: Website structures frequently change with deployments. Our system needs to handle these changes gracefully without breaking existing integrations.
- Interface Consistency: Despite the underlying diversity, we need to present a unified, stable API interface that developers can build against confidently.
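To make the first and third challenges concrete, here is a minimal sketch of two hypothetical sites that expose the same price in different places. Each is read by its own function, and both return one shared shape. The `UnifiedProduct` type, the sample HTML, and both reader functions are illustrative assumptions. The regular expressions only stand in for the DOM queries a real automation runtime would run.

```typescript
// Shared shape that API consumers see, regardless of the source site (hypothetical).
interface UnifiedProduct {
  price: number;
}

// Site A (hypothetical): the price lives in a nested DOM element.
const siteAHtml = `<div class="product"><span class="product-price">$19.99</span></div>`;

// Site B (hypothetical): the price lives in JSON embedded in a script tag.
const siteBHtml = `<script type="application/json" id="state">{"product":{"price":19.99}}</script>`;

// Reads the text of the .product-price element. A regex is used only to keep
// the sketch dependency-free; a real runtime would query the live DOM.
function readFromDom(html: string): UnifiedProduct | null {
  const match = html.match(/class="product-price"[^>]*>([^<]+)</);
  if (!match) return null;
  const price = parseFloat(match[1].replace(/[^0-9.]/g, ""));
  return Number.isNaN(price) ? null : { price };
}

// Parses the embedded JSON state and picks the price out of it.
function readFromEmbeddedJson(html: string): UnifiedProduct | null {
  const match = html.match(/<script[^>]*id="state"[^>]*>([\s\S]*?)<\/script>/);
  if (!match) return null;
  try {
    const state = JSON.parse(match[1]) as { product?: { price?: unknown } };
    const price = state.product?.price;
    return typeof price === "number" ? { price } : null;
  } catch {
    return null; // malformed JSON is treated as "not found"
  }
}
```

Both readers return the same `UnifiedProduct` shape, so callers do not need to know which storage pattern a given site uses.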
Our Solution: The Schema Federation Engine
To address these challenges, we built a schema federation engine that sits between our automation runtime and the API layer. Here's how it works:
1. Site-Specific Schema Mappings
For each supported website, we maintain a schema mapping configuration that defines how to extract structured data:
const productMapping = {
  version: "2.0",
  site: "example.com",
  // Primary CSS selector, with an XPath fallback that reads the price meta tag
  extractors: [{
    selector: ".product-price",
    fallbacks: ["//meta[@property='product:price']/@content"]
  }],
  // Strip currency symbols and thousands separators, then parse as a number
  transformers: [{
    field: "price",
    transform: (value) => parseFloat(value.replace(/[^0-9.]/g, ''))
  }]
};
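As a rough sketch of how a mapping like this might be consumed, the function below applies each transformer to the raw string extracted for its field and builds a record of parsed values. `applyTransformers` and the `Transformer` type are assumptions for illustration, since the engine's internals are not shown in this post.

```typescript
// Hypothetical shape of one entry in a mapping's `transformers` array.
interface Transformer {
  field: string;
  transform: (value: string) => unknown;
}

// Applies each transformer to the raw string extracted for its field.
// Fields without a raw value are skipped rather than set to NaN or undefined.
function applyTransformers(
  raw: Record<string, string | null>,
  transformers: Transformer[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const { field, transform } of transformers) {
    const value = raw[field];
    if (value == null) continue;
    out[field] = transform(value);
  }
  return out;
}

// Same price transformer as in the mapping above.
const priceTransformer: Transformer = {
  field: "price",
  transform: (value) => parseFloat(value.replace(/[^0-9.]/g, "")),
};

const product = applyTransformers({ price: "$1,299.99" }, [priceTransformer]);
console.log(product); // { price: 1299.99 }
```

Skipping missing fields keeps "not found" distinct from "found but unparseable", which can matter when deciding whether a schema has gone stale.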
2. Version-Aware Schema Resolution
When a website's structure changes, we keep multiple schema versions for that site and resolve to the first version that validates against the live page:
class SchemaResolver {
  async resolveSchema(site: string, page: Page): Promise<SchemaMapping> {
    // All known schema versions for this site, in the order they should be tried
    const candidates = await this.getSchemaVersions(site);

    // Return the first version that validates against the current page
    for (const schema of candidates) {
      if (await this.validateSchema(schema, page)) {
        return schema;
      }
    }
    // No known version matches the page
    throw new SchemaResolutionError(site);
  }
}
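One plausible way to implement the validation step is to treat a schema as valid when all of its required selectors match the current page, and to try candidates newest first. The sketch below assumes that approach. `VersionedSchema`, `requiredSelectors`, and the synchronous `probe` callback are hypothetical simplifications, since a real check against a live page would be asynchronous.

```typescript
interface VersionedSchema {
  version: string;             // "major.minor", e.g. "2.0"
  requiredSelectors: string[]; // selectors that must match for this version to apply
}

// Returns true when the selector matches something on the page.
type Probe = (selector: string) => boolean;

// Compare "major.minor" strings numerically, so "10.0" sorts after "2.0".
function compareVersions(a: string, b: string): number {
  const [aMajor = 0, aMinor = 0] = a.split(".").map(Number);
  const [bMajor = 0, bMinor = 0] = b.split(".").map(Number);
  return aMajor - bMajor || aMinor - bMinor;
}

// Try the newest schema first and return the first one whose required
// selectors all match; return null when no version fits the page.
function resolveSchema(
  candidates: VersionedSchema[],
  probe: Probe
): VersionedSchema | null {
  const newestFirst = [...candidates].sort((x, y) =>
    compareVersions(y.version, x.version)
  );
  for (const schema of newestFirst) {
    if (schema.requiredSelectors.every((s) => probe(s))) return schema;
  }
  return null;
}

// Usage: a page still serving the old layout resolves to the older schema.
const schemas: VersionedSchema[] = [
  { version: "1.0", requiredSelectors: [".price"] },
  { version: "2.0", requiredSelectors: [".product-price"] },
];
const oldLayout = new Set([".price"]);
const resolved = resolveSchema(schemas, (s) => oldLayout.has(s));
```

Trying newer versions first means a site that has finished rolling out a redesign is matched quickly, while pages still on the old layout fall through to the older schema.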
3. Adaptive Data Extraction
Our extraction engine adapts to different data storage patterns using a cascading fallback system:
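// Runs inside the page. Queries starting with "/" are treated as XPath;
// everything else is treated as a CSS selector.
const queryText = (query: string): string | null => {
  if (query.startsWith("/")) {
    const found = document.evaluate(query, document, null, XPathResult.STRING_TYPE, null);
    return found.stringValue.trim() || null;
  }
  return document.querySelector(query)?.textContent?.trim() || null;
};

async function extractWithFallbacks(
  page: Page,
  selector: string,
  fallbacks: string[]
): Promise<string | null> {
  // Try the primary selector, then each fallback in order.
  // page.evaluate runs queryText inside the page context.
  for (const query of [selector, ...fallbacks]) {
    const result = await page.evaluate(queryText, query);
    if (result) return result;
  }
  return null;
}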
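When a fallback fires, it is often a sign that the primary selector has gone stale. The sketch below is a small variation on the cascade that also reports which strategy produced the value, so fallback usage can be logged. `Strategy`, `Extraction`, and `extractWithProvenance` are hypothetical names, and the strategies are synchronous only to keep the example short.

```typescript
interface Strategy {
  name: string;             // label used in logs, e.g. "css" or "meta-tag"
  run: () => string | null; // returns the extracted text, or null if not found
}

interface Extraction {
  value: string;
  source: string;        // name of the strategy that succeeded
  usedFallback: boolean; // true when the primary strategy did not succeed
}

// Runs strategies in order and returns the first non-empty result,
// together with the name of the strategy that produced it.
function extractWithProvenance(strategies: Strategy[]): Extraction | null {
  for (let i = 0; i < strategies.length; i++) {
    const { name, run } = strategies[i];
    let value: string | null = null;
    try {
      value = run();
    } catch {
      // A throwing strategy is treated like a miss so later fallbacks still run.
    }
    if (value !== null && value.trim() !== "") {
      return { value: value.trim(), source: name, usedFallback: i > 0 };
    }
  }
  return null;
}

// Usage: the primary strategy misses, so the meta-tag fallback supplies the value.
const extraction = extractWithProvenance([
  { name: "css", run: () => null },
  { name: "meta-tag", run: () => " 19.99 " },
]);
```

A rising share of results with `usedFallback` set to `true` for a given site is a cheap early signal that its mapping needs attention.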
4. Schema Evolution and Compatibility
To maintain backwards compatibility while allowing schemas to evolve, we version every schema: breaking changes create a new major version, while compatible changes update the current version in place:
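class SchemaEvolutionManager {
  async evolveSchema(site: string, changes: SchemaChange[]): Promise<string> {
    // Assumed helper: loads the schema currently active for this site
    const currentSchema = await this.getCurrentSchema(site);

    // Breaking changes get a new major version; others update in place
    return changes.some(c => c.type === 'breaking')
      ? this.createNewMajorVersion(currentSchema, changes)
      : this.updateCurrentVersion(currentSchema, changes);
  }
}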
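The version arithmetic behind that decision can be sketched as a pure function. This assumes `major.minor` version strings like the `"2.0"` used in the mapping earlier, and a hypothetical `"additive"` change type for anything that is not breaking. The actual version format and change taxonomy are not specified in this post.

```typescript
// "breaking" comes from the code above; "additive" is an assumed label
// for any backwards-compatible change.
type ChangeType = "breaking" | "additive";

interface SchemaChange {
  type: ChangeType;
  description: string;
}

// Bumps "major.minor": any breaking change starts a new major version,
// otherwise the minor version is incremented in place.
function nextVersion(current: string, changes: SchemaChange[]): string {
  const [major, minor] = current.split(".").map(Number);
  if (!Number.isInteger(major) || !Number.isInteger(minor)) {
    throw new Error(`Unsupported version format: ${current}`);
  }
  if (changes.length === 0) return current; // nothing to apply
  return changes.some((c) => c.type === "breaking")
    ? `${major + 1}.0`
    : `${major}.${minor + 1}`;
}

const afterBreaking = nextVersion("2.0", [
  { type: "breaking", description: "price moved into embedded JSON" },
]); // "3.0"
const afterAdditive = nextVersion("2.0", [
  { type: "additive", description: "added optional currency field" },
]); // "2.1"
```

Keeping the old major version available alongside the new one lets pages that have not yet picked up the site's change continue to resolve.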
Results and Future Work
This schema federation system has enabled us to:
- Support 2000+ websites with unique structures through ~100 base schema templates
- Maintain 99.9% API stability during website structure changes
- Reduce new site integration time by 80%
We're currently working on:
- Machine learning-based schema detection to automate mapping creation
- Real-time schema validation and anomaly detection
- Automatic schema migration suggestions based on historical patterns
Key Learnings
Building a robust schema federation system taught us several lessons:
- Progressive Enhancement: Start with basic extractors and progressively add complexity only where needed.
- Defensive Extraction: Always implement fallback mechanisms - websites can change unexpectedly.
- Version Tolerance: Design for schema coexistence rather than immediate deprecation.
The challenge of cross-site schema federation exemplifies how modern web automation requires sophisticated infrastructure to handle the diversity and evolution of web platforms reliably.