Concepts

Core concepts and architecture of expand.ai

expand.ai handles the hard parts of web data extraction so you can focus on building your application. This page covers how things work under the hood.

Extraction for LLMs

Raw HTML is messy. It's full of scripts, styles, navigation elements, and ads that add noise without adding information. When you feed this directly to an LLM, you're wasting context window space on content that doesn't help the model understand the page.

The API converts pages to clean Markdown, preserving the semantic structure (headings, lists, tables, links) while stripping away the clutter. The result is a representation that's both token-efficient and easy for models to reason about.
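To make the idea concrete, here is a minimal sketch of HTML-to-Markdown conversion using only the Python standard library. It is illustrative, not expand.ai's actual pipeline: it keeps headings, links, and list items while dropping scripts, styles, and navigation.

```python
# Illustrative HTML-to-Markdown converter: preserve semantic structure,
# strip noise. The real pipeline handles far more (tables, nesting, etc.).
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    SKIP = {"script", "style", "nav", "footer", "aside"}  # noise regions

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a noise region
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth -= 1
        elif tag == "a" and self.href:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("p", "h1", "h2", "h3"):
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def to_markdown(html: str) -> str:
    conv = MarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()
```

Feeding this `<nav>menu</nav><h2>Title</h2><p>See <a href='/x'>docs</a>.</p>` yields `## Title` and `[docs](/x)` with the nav text gone: same information, far fewer tokens.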

We also go beyond what's visible in the rendered DOM. Many modern sites load data through XHR requests and render it client-side—this data often never appears in the raw HTML. Our engine intercepts these network requests and surfaces the underlying JSON, giving you access to structured data that standard scrapers miss entirely.
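Conceptually, surfacing XHR data amounts to filtering a page load's network traffic for JSON responses. The sketch below works over a hypothetical recorded log; the field names (`content_type`, `body`) are assumptions for this demo, not the engine's real schema.

```python
# Illustrative sketch: given responses captured during a page load,
# surface only the decoded JSON payloads that carry the page's data.
import json


def extract_json_responses(responses):
    """Return decoded JSON bodies from API-style responses."""
    payloads = []
    for r in responses:
        if "application/json" in r.get("content_type", ""):
            try:
                payloads.append(json.loads(r["body"]))
            except json.JSONDecodeError:
                continue  # malformed body: skip rather than fail the run
    return payloads
```

A product listing rendered client-side would show up here as the raw `{"products": [...]}` payload, even though none of it appears in the initial HTML.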

For cases where you only need part of a page, you can use semantic search to extract snippets relevant to a natural language query. This is useful when you're looking for specific information on a long page and want to minimize token usage.
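The shape of that feature can be shown with a toy version. Real semantic search uses embeddings; this stand-in scores paragraphs by word overlap with the query, which is enough to demonstrate "return only the relevant snippets" without a model.

```python
# Toy snippet extraction: rank paragraphs by lexical overlap with a
# query. A stand-in for embedding-based semantic search.
def top_snippets(page_text: str, query: str, k: int = 2):
    q = set(query.lower().split())
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    scored = [(len(q & set(p.lower().split())), p) for p in paragraphs]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]
```

Against a long page, querying for "return policy" surfaces only the paragraph about returns, so you spend tokens on the answer rather than the whole document.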

Browser Automation

Most interesting websites block automated access. They fingerprint browsers, track behavioral patterns, and deploy CAPTCHAs when something looks suspicious.

Under the hood, we run full headless browsers that render pages exactly as a real user would see them. JavaScript-heavy sites built with React, Vue, or similar frameworks just work: there's no need to reverse-engineer private APIs or settle for the incomplete HTML a server sends before scripts run.

Each browser session is configured with a realistic fingerprint (headers, canvas, WebGL, etc.) and routed through residential or mobile proxies. The proxy pool is diverse enough to avoid IP-based rate limiting, and proxy selection is automatic based on the target domain and current network conditions.
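Automatic proxy selection can be pictured as a routing table keyed by target domain. The pool names, endpoints, and selection rule below are all hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical domain-based proxy selection. Pools, endpoints, and the
# "strict domains get mobile proxies" rule are illustrative assumptions.
import random
from urllib.parse import urlparse

POOLS = {
    "residential": ["res-1.example:8080", "res-2.example:8080"],
    "mobile": ["mob-1.example:8080"],
}

# Domains known to be aggressive about blocking get mobile proxies here.
STRICT_DOMAINS = {"shop.example.com"}


def select_proxy(url: str, rng=random) -> str:
    domain = urlparse(url).netloc
    pool = "mobile" if domain in STRICT_DOMAINS else "residential"
    return rng.choice(POOLS[pool])
```

Spreading requests across a large pool like this is what keeps per-IP rate limits from ever triggering; the real system also factors in current network conditions.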

When CAPTCHAs do appear, the system handles them automatically so your requests don't get stuck.

What Happens During a Request


When you call the API, your request goes through a few stages:

  1. A proxy is selected based on the target URL and assigned to the request.
  2. A browser instance spins up with a unique fingerprint.
  3. The page loads and JavaScript executes. We wait for the DOM to stabilize before extracting content.
  4. The raw content passes through a pruning engine that removes noise (navbars, footers, sidebars, ads).
  5. The cleaned content is transformed into your requested format and returned.
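The five stages above can be sketched end to end. Every function and class here is an illustrative stub standing in for real infrastructure, not expand.ai's actual code:

```python
import re


class StubBrowser:
    def __init__(self, proxy: str):
        self.proxy = proxy  # stage 2: fresh instance behind the chosen proxy

    def load(self, url: str) -> str:
        # stage 3: in the real system, JavaScript runs and we wait for
        # the DOM to stabilize; here we return canned markup.
        return "<nav>menu</nav><main>Hello</main><footer>legal</footer>"


def pick_proxy(url: str) -> str:  # stage 1: proxy chosen per target URL
    return "res-1.example:8080"


def prune(html: str) -> str:  # stage 4: drop navbars, footers, ads
    return re.sub(r"<(nav|footer)>.*?</\1>", "", html, flags=re.S)


def transform(html: str, fmt: str) -> str:  # stage 5: requested format
    return re.sub(r"<[^>]+>", "", html) if fmt == "markdown" else html


def handle_request(url: str, fmt: str = "markdown") -> str:
    browser = StubBrowser(pick_proxy(url))
    return transform(prune(browser.load(url)), fmt)
```

Calling `handle_request("https://example.com")` walks all five stages and returns just the pruned, converted content.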