Lightfeed has published an open-source TypeScript library that pairs large language models with Playwright browser automation to extract structured data from websites. The Lightfeed Extractor, released under the Apache-2.0 license, targets developers building production data pipelines that require reliable, schema-validated output from arbitrary web pages. Author details were not available at time of publication.
- Lightfeed Extractor is a TypeScript library that uses natural language prompts and Zod schemas to pull structured data from HTML, markdown, or plain text.
- The library supports four LLM backends: OpenAI, Google Gemini, Anthropic, and locally hosted Ollama models.
- Built-in stealth-mode Playwright automation, JSON recovery for failed extractions, and per-run token tracking are included for production deployments.
- The repository carries an Apache-2.0 license and had 295 stars, 8 forks, and 63 commits at time of writing.
What Happened
Lightfeed released the library on GitHub at github.com/lightfeed/extractor under the Apache-2.0 license. It lets developers extract structured data from websites using natural language prompts rather than hand-written CSS selectors, validates output with Zod, and supports four LLM providers. At time of writing, the repository had 295 stars, 8 forks, and 63 commits on the main branch.
The project README describes the library as built for “robust web data extraction using LLMs,” designed to deliver “complete, accurate results with great token efficiency — critical for production data pipelines.” The documentation states that developers can “use natural language prompts to extract structured data from HTML, markdown, or plain text” according to a developer-defined schema.
Why It Matters
LLM-assisted web extraction addresses a persistent problem in data engineering: modern websites rely on JavaScript rendering, dynamic content loading, and anti-bot protections that defeat static CSS-selector scrapers, and the selectors themselves need constant maintenance when page layouts change. By delegating extraction against a schema to a language model, developers can often absorb page structure changes without rewriting parsing logic.
Lightfeed’s release adds a TypeScript-native, production-oriented option to a growing set of tools that includes Python-based scrapers and browser-agent frameworks. Its multi-provider architecture — spanning OpenAI, Google Gemini, Anthropic, and Ollama — allows teams to swap LLM backends based on cost, latency, or data-residency requirements without rewriting extraction logic.
Technical Details
The extraction pipeline begins by converting raw HTML into what the project calls “LLM-ready markdown,” with optional filters to isolate main content and strip navigation menus, headers, and advertising markup before sending content to an LLM. This preprocessing step reduces token consumption per extraction run, which the project positions as a key differentiator for cost-sensitive production workloads.
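The preprocessing idea can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: the function names, the list of stripped tags, and the four-characters-per-token estimate are all assumptions made for illustration, and a production converter would use a real HTML parser rather than regular expressions.

```typescript
// Illustrative sketch only; not the Lightfeed Extractor implementation.
// htmlToRoughMarkdown and estimateTokens are hypothetical names.

// Elements whose entire content is dropped before conversion.
const NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside"];

function htmlToRoughMarkdown(html: string): string {
  let text = html;
  // Remove noisy blocks together with everything inside them.
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    text = text.replace(block, "");
  }
  // Turn h1..h6 into markdown headings.
  text = text.replace(
    /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
    (_match, level: string, inner: string) =>
      `\n${"#".repeat(Number(level))} ${inner.trim()}\n`,
  );
  // Treat closing block-level tags and <br> as line breaks.
  text = text.replace(/<\/(p|div|li|tr)>|<br\s*\/?>/gi, "\n");
  // Drop every remaining tag.
  text = text.replace(/<[^>]+>/g, "");
  // Normalize whitespace: collapse runs, trim each line, discard empty lines.
  return text
    .split("\n")
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// Very rough heuristic (assumption): about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

Comparing `estimateTokens(html)` with `estimateTokens(htmlToRoughMarkdown(html))` on a typical product page gives a feel for how much of the raw markup never needs to reach the model.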
Structured output is generated using LLMs in JSON mode and validated against a Zod schema supplied by the developer, so the pipeline returns typed, checked data without the developer writing a separate validation layer. A JSON recovery mechanism is included to parse and repair malformed or partial model outputs, reducing hard failures caused by inconsistent model responses, a known failure mode when using JSON-mode constraints with smaller models.
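To make the recovery idea concrete, here is one way a repair step could work, written as a hedged sketch rather than a description of the library's actual logic. The helper name `repairTruncatedJson` is hypothetical. It handles three common cases: prose wrapped around the JSON, output cut off mid-string, and unclosed brackets or a dangling comma. Anything it cannot repair still throws from `JSON.parse`.

```typescript
// Hypothetical helper; the library's real recovery mechanism may differ.
function repairTruncatedJson(raw: string): unknown {
  // Start at the first brace or bracket, skipping any leading prose.
  const start = raw.search(/[{[]/);
  if (start === -1) throw new Error("no JSON object or array found");
  const body = raw.slice(start);

  const closers: string[] = []; // stack of closing characters still owed
  let inString = false;
  let escaped = false;
  let end = body.length;

  for (let i = 0; i < body.length; i++) {
    const ch = body[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") {
      closers.pop();
      if (closers.length === 0) {
        end = i + 1; // top-level value is complete; ignore trailing text
        break;
      }
    }
  }

  let candidate = body.slice(0, end);
  if (inString) candidate += '"'; // close a string that was cut off
  candidate = candidate.replace(/,\s*$/, ""); // drop a dangling comma
  candidate += closers.reverse().join(""); // close open arrays and objects
  return JSON.parse(candidate);
}
```

Note that a string cut off mid-value is kept in its shortened form. A stricter pipeline might instead discard the incomplete trailing element and rely on schema validation to reject partial records.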
Browser automation is handled by Playwright with stealth-mode patches applied to reduce detection by anti-bot systems. The library supports three deployment configurations: launching a local browser process, running headlessly in serverless environments, and connecting to a remote browser server. URL validation normalizes relative links to absolute URLs. Token usage is tracked per extraction run, and configurable per-run limits are exposed for production safety.
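Two of these safeguards are simple enough to sketch without the library. The example below uses the standard WHATWG `URL` constructor to resolve relative links and a small class to enforce a per-run token ceiling. The names `toAbsoluteUrl` and `TokenBudget` are hypothetical and are not part of the library's API, and the rule of rejecting non-http(s) links is an assumption.

```typescript
// Illustrative helpers; names are hypothetical, not the library's API.

// Resolve a possibly relative href against the page it was found on.
// Returns null for anything that is not an http(s) link.
function toAbsoluteUrl(href: string, pageUrl: string): string | null {
  try {
    const url = new URL(href, pageUrl);
    return url.protocol === "http:" || url.protocol === "https:" ? url.href : null;
  } catch {
    return null; // href or pageUrl could not be parsed
  }
}

// Accumulates token usage for one extraction run and enforces a ceiling.
class TokenBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record one LLM call; throws once the run total exceeds the limit.
  record(inputTokens: number, outputTokens: number): void {
    this.used += inputTokens + outputTokens;
    if (this.used > this.limit) {
      throw new Error(`token limit exceeded: ${this.used} > ${this.limit}`);
    }
  }

  get total(): number {
    return this.used;
  }
}
```

For example, `toAbsoluteUrl("/products/42", "https://shop.example.com/catalog/page1")` resolves against the page's origin, while a `javascript:` link yields `null`.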
Who’s Affected
The library’s documentation and example code center on e-commerce use cases, with sample code demonstrating product catalog extraction that captures fields including product name, brand, price, and ratings from retail websites. Developers building price monitoring tools, catalog aggregators, or competitive intelligence pipelines are the primary stated audience.
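As an illustration of what such a product record might look like, the sketch below defines a record shape from the fields named above and checks it with a hand-rolled function. This stands in for the Zod schema a real pipeline would use. The `Product` interface, the `parseProduct` name, the nullable rating, and the 0 to 5 rating scale are all assumptions, not taken from the library's examples.

```typescript
// Hypothetical record shape based on the fields the article lists.
interface Product {
  name: string;
  brand: string;
  price: number;
  rating: number | null; // assumption: some pages show no rating
}

// Hand-rolled check standing in for a Zod schema.
function parseProduct(value: unknown): Product {
  if (typeof value !== "object" || value === null) {
    throw new Error("expected an object");
  }
  const { name, brand, price, rating: rawRating } = value as Record<string, unknown>;

  if (typeof name !== "string" || name.length === 0) {
    throw new Error("name must be a non-empty string");
  }
  if (typeof brand !== "string") {
    throw new Error("brand must be a string");
  }
  if (typeof price !== "number" || !Number.isFinite(price) || price < 0) {
    throw new Error("price must be a non-negative number");
  }

  let rating: number | null = null;
  if (rawRating !== undefined && rawRating !== null) {
    // Assumption: ratings use a 0 to 5 scale.
    if (typeof rawRating !== "number" || rawRating < 0 || rawRating > 5) {
      throw new Error("rating must be a number between 0 and 5");
    }
    rating = rawRating;
  }

  return { name, brand, price, rating };
}
```

A record whose price arrives as the string `"24.99"` is rejected rather than silently coerced, which is the kind of check that keeps malformed model output from reaching a downstream price monitor.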
Lightfeed operates a commercial retail intelligence platform at app.lightfeed.ai, and the open-source library appears to derive from that production infrastructure. Teams evaluating hosted scraping APIs alongside in-house alternatives now have access to the underlying extraction layer that Lightfeed uses in its own product.
What’s Next
The repository contains no public roadmap. One pull request was open and no issues were listed at time of publication; GitHub Actions workflows in the repository suggest automated testing is active. Contributions are governed by a CONTRIBUTING.md file, and a CHANGELOG.md tracks changes across the project’s 63-commit history.
Support for locally hosted Ollama models makes the library viable for teams operating under data-residency constraints or seeking to avoid per-token API costs at scale. Whether Lightfeed intends to maintain feature parity between the open-source library and its commercial platform’s extraction capabilities was not addressed in the available documentation.