TrueFoundry is recognized in the 2025 Gartner® Market Guide for AI Gateways! Read the full report

No items found.

Accelerator Series: Building a Resilient Web Scraper with LangGraph and TrueFoundry

March 9, 2026
|
9:30
min read
SHARE

The sales team is panicking – there is a major healthcare conference next week. The event website lists 200 speakers -- physicians, executives, and researchers -- spread across a dozen paginated sub-pages. To build a lead list, someone needs to open the site, click a name, copy the details into a spreadsheet, open a new tab, search for that person on LinkedIn, copy the profile URL, and paste it back.

They have to do this 200 times.

For engineers, this request usually results in a quick Python script using Selenium or BeautifulSoup. You inspect the page source, find the div with the class speaker-name, and extract the text. It works perfectly for about a week. Then the website updates its frontend framework, the CSS classes change, and the script crashes.

We built the Profile Crawler accelerator to stop this cycle. It is an autonomous agent that navigates websites and extracts data based on what the page says, not how the HTML is structured.

Here is how we architected the solution using LangGraph for orchestration, Playwright for interaction, and TrueFoundry to manage the infrastructure.

The Shift: From DOM Selectors to Semantic Extraction

The main reason scraping scripts fail is their reliance on the Document Object Model (DOM). If you tell a script to look for div.content-wrapper > h2.title, it will break the moment a developer changes a class name.

We moved to an agentic approach. We don't tell the bot where the data is located pixel-wise. Instead, we feed the rendered HTML (converted to Markdown) to an LLM. The model reads the text just like a human would. It understands that a section labeled "Keynote Speakers" contains the data we want, regardless of the underlying tags.

  • Old Way (Fragile): Hard-coded CSS selectors that break on UI updates.
  • New Way (Resilient): Semantic understanding that adapts to layout changes.

Architecture Deep Dive

We needed a system that could handle decision-making, not just a linear script. The application needs to decide: Is this input a URL or just a company name? Did we hit a captcha? Is this page a list of people or a single bio?

We chose LangGraph to model this workflow as a state machine.

The Logic Flow

The system operates in a loop rather than a straight line:

  1. Input Router: The system checks if the user provided a direct URL or just a company name. If it's a name, it uses a search tool to find the correct domain first.
  2. Stealth Navigation: We use a modified Playwright instance to load the page. It handles cookie consent banners and lazy-loading images automatically.
  3. Vector Filtering (The Optimization): A single conference page might have 200 navigation links. Feeding all of them to an LLM context window is slow and expensive. We use FastEmbed to embed the link text and query a local Qdrant instance. This filters the list down to the top 10 links relevant to "Team" or "Speakers."
  4. Extraction: The LLM parses the filtered content and extracts structured entities (Name, Role, Company).
  5. Enrichment: Finally, we loop through the extracted names and use a search tool (Tavily) to find their specific LinkedIn profiles.

Here is the system architecture:

Infrastructure and TrueFoundry Integration

Running headless browsers and LLM agents in production creates operational headaches: memory leaks from Chromium, rate limits on LLM APIs, and the need for process isolation.

We deployed this on TrueFoundry to handle these specific constraints.

1. The AI Gateway (Observability & Caching)

This application makes heavy use of LLMs for navigation decisions. Without governance, costs spiral quickly. We route all model calls through the TrueFoundry AI Gateway.

  • Caching: If the agent scrapes the same site twice, the Gateway serves cached LLM responses for the extraction steps. This significantly reduces latency and cost.
  • Rate Limiting: We set strict limits per user to prevent API quota exhaustion during large batch jobs.
  • Failover: If OpenAI experiences downtime, the Gateway automatically reroutes requests to Anthropic or Azure OpenAI without failing the crawl.

2. Model Context Protocol (MCP)

We structured the application using the Model Context Protocol (MCP). The "Crawler" is not just a Python function; it is an MCP Server. This allows us to sandbox the browser environment. If the browser crashes (which happens often with heavy JavaScript sites), it doesn't take down the main application logic.

Infrastructure Diagram

Comparison: Script vs. Agent

We benchmarked the standard Python script approach against this architecture.

Feature Standard Script (Selenium) TrueFoundry Accelerator
Resilience Brittle. Fails on minor UI updates (CSS class changes). High. Semantic extraction tolerates layout changes.
Logic Linear. Hard-coded "if/else" logic. Adaptive. Agent decides which links to follow based on context.
Maintenance High. Requires code updates for every target site. Low. One codebase works for 90% of targets.
Scale Single-threaded. Difficult to parallelize state. Containerized. Auto-scales on TrueFoundry based on queue depth.

Handling Edge Cases

Building the happy path is easy. Making it reliable required solving three specific engineering problems:

  1. Duplicate Data: Profiles often appear on multiple pages (e.g., both "Leadership" and "About Us"). We added a Deduplication Node at the end of the graph. It passes the full list to a smaller, cheaper LLM to merge records based on name similarity before enrichment.
  2. Anti-Bot Detection: Standard Playwright is easily detected by modern WAFs. We implemented undetected-playwright within the Docker container, which patches the browser fingerprint (Navigator object, WebGL vendor) to appear as a standard user device.
  3. Token Limits: Large pages with privacy policies and footers waste tokens. We use header-based chunking to split the markdown. The LLM only processes chunks relevant to "Team" or "Speakers," discarding the rest.

Conclusion

This architecture solves the "Last Mile" of data acquisition by replacing brittle scripts with adaptive agents. By running it on TrueFoundry, we ensure the system is observable, cost-controlled, and scalable.

You can deploy this exact architecture -- including the Gateway configuration and Dockerized agents -- from the TrueFoundry application library today.

The fastest way to build, govern and scale your AI

Discover More

No items found.
March 9, 2026
|
5 min read

TrueMem: Building a Model-Agnostic Memory Layer for AI

No items found.
March 9, 2026
|
5 min read

Accelerator Series: Building a Resilient Web Scraper with LangGraph and TrueFoundry

No items found.
What is LLM Observability ? Complete Guide
March 5, 2026
|
5 min read

What is LLM Observability ? Complete Guide

LLM Tools
Amazon SageMaker Review 2026: Features, Pricing, Pros & Cons
March 5, 2026
|
5 min read

Amazon SageMaker Review: Features, Pricing, Pros and Cons (+ Better Alternative)

No items found.
No items found.
Take a quick product tour
Start Product Tour
Product Tour