Scrappling
Paste a URL, get clean data.
What it is
Paste a URL and Scrappling renders it with a stealth fetcher, then hands back clean JSON and Markdown side by side. It is built on Scrapling behind a small FastAPI service, with agent-readable endpoints so AI tools can discover and call it on their own.
System design
A thin Next.js display client on Vercel proxies to a FastAPI scraper on Boltic, so no Python or headless browser ever runs in the Vercel lambda. The backend offers three fetcher tiers: plain HTTP for speed, a Camoufox stealth mode that clears Cloudflare and JS challenges, and a Playwright dynamic mode. It reuses the launched browser context across requests, so only the first request after a cold start pays the boot cost. Every surface, UI and API alike, publishes llms.txt and agents.md so an agent can self-discover and call the scraper.
- Next.js 15
- FastAPI
- Scrapling
- Camoufox
- Playwright
- Boltic
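To make the three-tier idea concrete, here is a minimal sketch of how a request's mode could be routed to a fetcher. The names `Tier`, `Page`, `Fetcher`, and `make_dispatcher` are hypothetical and not taken from the Scrappling code or from Scrapling's API; the fetchers are plain callables so the sketch runs without a browser.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Tier(str, Enum):
    HTTP = "http"        # plain HTTP, fastest
    STEALTH = "stealth"  # Camoufox-backed stealth mode
    DYNAMIC = "dynamic"  # Playwright-backed dynamic mode


@dataclass
class Page:
    url: str
    status: int
    html: str
    tier: Tier


# Hypothetical fetcher signature: takes a URL, returns (status, html).
Fetcher = Callable[[str], Tuple[int, str]]


def make_dispatcher(fetchers: Dict[Tier, Fetcher]) -> Callable[..., Page]:
    """Return a function that routes a request to the fetcher for its tier."""

    def dispatch(url: str, mode: str = "http") -> Page:
        tier = Tier(mode)  # raises ValueError for an unknown mode
        status, html = fetchers[tier](url)
        return Page(url=url, status=status, html=html, tier=tier)

    return dispatch
```

In a real service the three callables would wrap the actual fetchers, and the API layer would turn the `ValueError` into a 4xx response instead of letting it propagate.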
What I got wrong, then fixed.
01 · the problem
A scrape of a paywalled or login-walled page came back as a clean 200, but the content was just the wall: 'subscribe to read', cookie banners, login prompts. The scraper was treating access-control text as the page.
what I did
Added wall detection that flags auth, paywall, subscribe, and cookie-gate text, labels the result blocked or partial, and returns quality metadata, so a successful status with junk content no longer reads as a win.
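A rough sketch of what such wall detection could look like, using only the standard library. The patterns, the `MIN_WORDS` threshold, and the `Quality` and `assess` names are illustrative assumptions, not the detector Scrappling actually ships.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative patterns only; a real list would be longer and tuned on real pages.
WALL_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "auth": re.compile(r"\b(log ?in|sign ?in) to (continue|read|view)\b", re.I),
    "paywall": re.compile(r"\b(subscribers? only|premium (article|content))\b", re.I),
    "subscribe": re.compile(r"\bsubscribe to (read|continue|unlock)\b", re.I),
    "cookie_gate": re.compile(r"\b(accept (all )?cookies|we use cookies)\b", re.I),
}

# Assumed threshold: a wall signal with fewer words than this counts as "blocked".
MIN_WORDS = 150


@dataclass
class Quality:
    label: str                                   # "ok", "partial", or "blocked"
    walls: List[str] = field(default_factory=list)
    word_count: int = 0


def assess(text: str, status: int = 200) -> Quality:
    """Label scraped text by wall signals rather than by HTTP status alone."""
    walls = [name for name, pat in WALL_PATTERNS.items() if pat.search(text)]
    words = len(text.split())
    if status >= 400 or (walls and words < MIN_WORDS):
        label = "blocked"      # error status, or little besides the wall
    elif walls:
        label = "partial"      # real content, but a wall is also present
    else:
        label = "ok"
    return Quality(label=label, walls=walls, word_count=words)
```

For example, `assess("Subscribe to read the full story.")` is labelled `blocked` even though the status defaults to 200, which is the case the fix targets.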
02 · the problem
The Jina Reader fallback and all the HTML cleanup lived in the frontend, duplicating logic the backend should own and bloating a client that was meant to just display results.
what I did
Moved the fallback and the content-cleaning pipeline into the FastAPI backend, keeping the frontend a thin display client and giving every caller, not just the UI, the cleaned output.
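One way the backend-side pipeline could be shaped is sketched below: clean the primary fetch, and fall back to a reader service when that yields too little. `clean_html`, `scrape`, and the `min_chars` threshold are hypothetical names and values; the sketch assumes Jina Reader is called by prefixing the target URL with `https://r.jina.ai/`, and both fetchers are injected callables so nothing here touches the network.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

# Assumption: the reader is addressed by prepending this prefix to the target URL.
JINA_READER_PREFIX = "https://r.jina.ai/"


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that rarely hold content."""

    SKIP = {"script", "style", "noscript", "nav", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))  # collapse whitespace


def clean_html(html: str) -> str:
    """Reduce raw HTML to newline-separated visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def scrape(
    url: str,
    primary: Callable[[str], str],    # hypothetical: returns raw HTML
    fallback: Callable[[str], str],   # hypothetical: returns reader output
    min_chars: int = 200,
) -> Dict[str, str]:
    """Clean the primary result; use the fallback if it fails or is too thin."""
    text: Optional[str]
    try:
        text = clean_html(primary(url))
    except Exception:
        text = None  # treat any fetch failure as "no content"
    if text is None or len(text) < min_chars:
        content = fallback(JINA_READER_PREFIX + url)
        return {"url": url, "source": "fallback", "content": content}
    return {"url": url, "source": "primary", "content": text}
```

Because this logic sits behind the API, a script or agent calling the endpoint receives the same cleaned `content` as the UI does.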
03 · the problem
The first stealth or dynamic request booted a browser from cold, a 10 to 20 second wait, and Vercel's lambda cannot run a headless browser at all.
what I did
Kept the scraping on a Boltic service sized for it (1 vCPU, 1.5 GB, 120s timeout) and reused the launched browser context across requests, so only the first request pays the boot cost while Vercel just proxies.
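The reuse pattern can be sketched as a lazily launched, shared resource. `WarmResource` and its `launch` callable are hypothetical stand-ins for whatever boots the browser context; this is a thread-based illustration under that assumption, not the service's actual code.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class WarmResource(Generic[T]):
    """Launch an expensive resource once and hand the same instance to every caller."""

    def __init__(self, launch: Callable[[], T]) -> None:
        self._launch = launch              # hypothetical: boots the browser context
        self._instance: Optional[T] = None
        self._lock = threading.Lock()
        self.launches = 0                  # exposed so cold starts can be observed

    def get(self) -> T:
        instance = self._instance
        if instance is None:               # fast path skips the lock once warm
            with self._lock:
                if self._instance is None:  # re-check: another thread may have launched
                    self._instance = self._launch()
                    self.launches += 1
                instance = self._instance
        return instance

    def reset(self) -> None:
        """Drop the instance (e.g. after a crash) so the next call relaunches."""
        with self._lock:
            self._instance = None
```

Only the first `get()` runs `launch`; later calls return the warm instance immediately. An async service would more likely use an `asyncio.Lock` and create the resource in the app's startup hook, but the shape is the same.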