Introduction
Core concepts and architecture of expand.ai
expand.ai handles the hard parts of web data extraction so you can focus on building your application. This page covers how things work under the hood.
Raw HTML is messy. It's full of scripts, styles, navigation elements, and ads that add noise without adding information. When you feed this directly to an LLM, you're wasting context window space on content that doesn't help the model understand the page.
The API converts pages to clean Markdown, preserving the semantic structure (headings, lists, tables, links) while stripping away the clutter. The result is a representation that's both token-efficient and easy for models to reason about.
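To make the idea concrete, here is a minimal sketch of HTML-to-Markdown cleanup using only Python's standard library. It is not how expand.ai implements conversion. The class name `MarkdownSketch`, the function `html_to_markdown`, and the choice of which tags count as noise are all assumptions made for illustration, and it only handles headings, paragraphs, list items, and links.

```python
from html.parser import HTMLParser

# Tags whose entire content is treated as noise in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownSketch(HTMLParser):
    """Toy converter: keeps headings, paragraphs, list items, and links."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.current = []    # text fragments of the block being built
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.prefix = ""     # Markdown prefix for the current block
        self.href = ""       # target of the link currently being read

    def _flush(self):
        # Collapse whitespace and emit the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADING_PREFIX:
            self._flush()
            self.prefix = HEADING_PREFIX[tag]
        elif tag == "li":
            self._flush()
            self.prefix = "- "
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.current.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "a":
            self.current.append(f"]({self.href})")
        elif tag in HEADING_PREFIX or tag in ("p", "li"):
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to a small subset of Markdown."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown` on a page with a `<nav>` bar, inline scripts, a heading, and a paragraph returns only the heading and paragraph as Markdown, which is the token saving the paragraph above describes.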
We also go beyond what's visible in the rendered DOM. Many modern sites load data through XHR requests and render it client-side, so this data often never appears in the raw HTML. Our engine intercepts these network requests and surfaces the underlying JSON, giving you access to structured data that standard scrapers miss entirely.
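As a rough illustration of what "surfacing the underlying JSON" can look like, the sketch below assumes the browser's network activity has already been recorded as a HAR log (the standard HTTP Archive format) and picks out the responses whose MIME type is JSON. The function name `json_payloads_from_har` is hypothetical, and this is not a description of expand.ai's engine, which intercepts requests live.

```python
import base64
import json


def json_payloads_from_har(har: dict) -> list[dict]:
    """Return the parsed JSON bodies found in a HAR network log."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content.get("mimeType", "").split(";")[0].strip().lower()
        if mime != "application/json" and not mime.endswith("+json"):
            continue
        text = content.get("text")
        if not text:
            continue
        # HAR stores binary or non-UTF-8 bodies base64-encoded.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # labelled as JSON but not parseable; skip it
        found.append({"url": entry.get("request", {}).get("url", ""), "data": data})
    return found
```

Filtering on MIME type rather than URL patterns keeps the sketch site-agnostic: it needs no knowledge of how a given site names its API endpoints.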
For cases where you only need part of a page, you can use semantic search to extract snippets relevant to a natural language query. This is useful when you're looking for specific information on a long page and want to minimize token usage.
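The snippet-extraction idea can be sketched as "split the page into chunks, score each chunk against the query, keep the best ones". The example below swaps in a plain bag-of-words cosine similarity for the scoring step. That is a deliberate simplification: a real semantic search would use embeddings, and nothing here reflects expand.ai's actual ranking. The function name `top_snippets` and its parameters are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Lowercased word counts; a stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_snippets(markdown: str, query: str, k: int = 2) -> list[str]:
    """Return up to k blank-line-separated chunks most similar to the query."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", markdown) if c.strip()]
    q = _vector(query)
    scored = [(_cosine(_vector(c), q), c) for c in chunks]
    # Keep only chunks that share at least one term with the query.
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in relevant[:k]]
```

Because only the top `k` chunks are returned, the amount of text passed to the model is bounded by the chunk size rather than the page length.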
Most interesting websites block automated access. They fingerprint browsers, track behavioral patterns, and deploy CAPTCHAs when something looks suspicious.
Under the hood, we run full headless browsers that render pages exactly as a real user would see them. This means JavaScript-heavy sites built with React, Vue, or similar frameworks just work: there's no need to reverse-engineer APIs or deal with pre-rendered content.
Each browser session is configured with a realistic fingerprint (headers, canvas, WebGL, etc.) and routed through residential or mobile proxies. The proxy pool is diverse enough to avoid IP-based rate limiting, and proxy selection is automatic based on the target domain and current network conditions.
When CAPTCHAs do appear, the system handles them automatically so your requests don't get stuck.
When you call the API, your request goes through a few stages: