Puppeteer: Dynamic Page Scraping
Use Puppeteer when the content you need is rendered by JavaScript after page load — single-page apps, infinite scroll, lazy-loaded images, login-gated content.
import puppeteer, { type Browser, type Page } from 'puppeteer'

async function launchBrowser(): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    // flags commonly needed when running Chromium inside containers or CI
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
    ],
  })
}
async function scrapeDynamicPage(
  url: string
): Promise<Array<{ title: string; price: string; imageUrl: string }>> {
  const browser = await launchBrowser()
  try {
    // open the page inside try so the browser is closed even if this fails
    const page = await browser.newPage()
    // block images, fonts, and media to speed up loads
    await page.setRequestInterception(true)
    page.on('request', (req) => {
      if (['image', 'font', 'media'].includes(req.resourceType())) {
        req.abort()
      } else {
        req.continue()
      }
    })
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 })
    // wait for a specific element before extracting
    await page.waitForSelector('.product-list', { timeout: 10_000 })
    // extract data in page context (this callback runs in the browser)
    const items = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map((el) => ({
        title: el.querySelector('.title')?.textContent?.trim() ?? '',
        price: el.querySelector('.price')?.textContent?.trim() ?? '',
        imageUrl: (el.querySelector('img') as HTMLImageElement | null)?.src ?? '',
      }))
    })
    return items
  } finally {
    await browser.close()
  }
}
Handling Infinite Scroll
async function scrapeInfiniteScroll(page: Page): Promise<string[]> {
  const results: string[] = []
  let previousHeight = 0
  while (true) {
    // extract every item currently in the DOM
    const newItems = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.item-title')).map(
        (el) => el.textContent?.trim() ?? ''
      )
    )
    // keep only the items appended since the previous pass
    results.push(...newItems.slice(results.length))
    // scroll to bottom
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    await new Promise((r) => setTimeout(r, 1500)) // wait for load
    const newHeight = await page.evaluate(() => document.body.scrollHeight)
    if (newHeight === previousHeight) break // no more content
    previousHeight = newHeight
  }
  return [...new Set(results)] // deduplicate
}
Rate Limiting — The Most Important Part
Scraping without rate limiting will get you banned. Treat rate limiting as mandatory, not optional.
class RateLimiter {
  private queue: Array<() => void> = []
  private running = 0

  constructor(
    private readonly maxConcurrent: number,
    private readonly minDelayMs: number,
    private readonly maxDelayMs: number
  ) {}

  // resolves with a release function once a slot is free and the delay has passed
  async acquire(): Promise<() => void> {
    // re-check after waking in case another caller took the freed slot first
    while (this.running >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.queue.push(resolve))
    }
    this.running++
    // jitter: random delay between min and max
    const delay =
      this.minDelayMs + Math.random() * (this.maxDelayMs - this.minDelayMs)
    await new Promise((r) => setTimeout(r, delay))
    return () => {
      this.running--
      // wake the next waiter, if any
      this.queue.shift()?.()
    }
  }
}
// usage
const limiter = new RateLimiter(2, 1000, 3000) // max 2 concurrent, 1-3s delay

async function fetchWithRateLimit(url: string): Promise<string> {
  const release = await limiter.acquire()
  try {
    const res = await fetch(url)
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    return res.text()
  } finally {
    // always free the slot, even when the request fails
    release()
  }
}
Proxy Rotation
const proxies = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
  'http://proxy3.example.com:8080',
]

// pick a proxy at random for each request
function getProxy(): string {
  return proxies[Math.floor(Math.random() * proxies.length)]
}

// with Puppeteer
async function launchWithProxy(proxy: string): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`, '--no-sandbox'],
  })
}

// with fetch (via undici or https-proxy-agent)
import { ProxyAgent, fetch as undiciFetch } from 'undici'

async function fetchViaProxy(url: string): Promise<string> {
  const proxy = getProxy()
  const dispatcher = new ProxyAgent(proxy)
  const res = await undiciFetch(url, { dispatcher })
  return res.text()
}
Retry Logic with Exponential Backoff
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // out of attempts: surface the last error to the caller
      if (attempt === maxAttempts) throw err
      // the delay doubles after every failure: base, 2x base, 4x base, ...
      const delay = baseDelayMs * Math.pow(2, attempt - 1)
      const jitter = Math.random() * 500
      await new Promise((r) => setTimeout(r, delay + jitter))
    }
  }
  throw new Error('unreachable')
}

// usage (inside an async function or an ES module; url is the page to fetch)
const html = await withRetry(() => fetchWithRateLimit(url))
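To make the backoff schedule concrete, the sketch below lists the base delays that withRetry would sleep between attempts, before the random 0-500 ms jitter is added. The helper name backoffDelays is hypothetical and exists only for illustration.

```typescript
// Hypothetical helper (not part of the scraper itself): returns the base delay
// slept after each failed attempt, excluding the random jitter.
function backoffDelays(maxAttempts = 3, baseDelayMs = 1000): number[] {
  const delays: number[] = []
  // No delay follows the final attempt, because the error is rethrown instead.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(baseDelayMs * Math.pow(2, attempt - 1))
  }
  return delays
}

console.log(backoffDelays())       // [1000, 2000]
console.log(backoffDelays(5, 500)) // [500, 1000, 2000, 4000]
```

With the defaults, a request that keeps failing is attempted three times over roughly three to four seconds before the error reaches the caller.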
Structured Data Extraction with Zod Validation
import { z } from 'zod'
import * as cheerio from 'cheerio'
// Element comes from domhandler, the DOM library that cheerio is built on
import type { Element } from 'domhandler'

const ProductSchema = z.object({
  id: z.string(),
  title: z.string().min(1),
  price: z.number().positive(),
  currency: z.string().length(3),
  inStock: z.boolean(),
  imageUrl: z.string().url().nullable(),
})

type Product = z.infer<typeof ProductSchema>

function extractProduct(el: cheerio.Cheerio<Element>, $: cheerio.CheerioAPI): Product {
  // a missing price becomes 0, which the schema rejects as non-positive
  const priceText = $(el).find('[data-price]').attr('data-price') ?? '0'
  const raw = {
    id: $(el).attr('data-product-id') ?? '',
    title: $(el).find('.product-title').text().trim(),
    price: parseFloat(priceText),
    currency: $(el).find('[data-currency]').attr('data-currency') ?? 'USD',
    inStock: $(el).find('.stock-badge').text().includes('In Stock'),
    imageUrl: $(el).find('img').attr('src') ?? null,
  }
  return ProductSchema.parse(raw) // throws ZodError if invalid
}
Choosing Between Puppeteer and Cheerio
Use Cheerio when: the HTML you need is in the initial response body, you need speed, or you are running many concurrent requests. Use Puppeteer when: content is loaded by JavaScript after initial render, you need to interact with the page (click, scroll, fill forms), or you need to handle authentication flows.
A common pattern is to use Cheerio first, and fall back to Puppeteer only when the Cheerio output is empty — this gives you fast paths for static pages and automatic fallback for dynamic ones.
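The Cheerio-first pattern could be sketched roughly as follows. The names scrapeWithFallback, Extractor, scrapeStatic, and scrapeRendered are hypothetical; the two scrapers are passed in as functions so the sketch does not assume any particular Cheerio or Puppeteer code.

```typescript
// Hypothetical types and names, for illustration only.
type Extractor<T> = (url: string) => Promise<T[]>

async function scrapeWithFallback<T>(
  url: string,
  scrapeStatic: Extractor<T>,  // e.g. fetch + Cheerio
  scrapeRendered: Extractor<T> // e.g. Puppeteer
): Promise<{ items: T[]; source: 'static' | 'rendered' }> {
  const staticItems = await scrapeStatic(url)
  // A non-empty result means the data was already in the initial HTML.
  if (staticItems.length > 0) {
    return { items: staticItems, source: 'static' }
  }
  // Empty static output suggests the content is rendered by JavaScript,
  // so pay the cost of a real browser only for this page.
  const renderedItems = await scrapeRendered(url)
  return { items: renderedItems, source: 'rendered' }
}
```

Returning the source alongside the items makes it easy to log how often the slow path is taken. Errors from the static scraper are left to propagate here; whether they should also trigger the fallback is a design decision for the caller.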
People Also Ask
Is web scraping legal?
It depends on the site’s terms of service, what data you are collecting, and your jurisdiction. Publicly available data with no ToS prohibition is generally safe. Scraping behind authentication, storing personal data, or violating a site’s ToS can create legal exposure. Always check robots.txt and the site’s ToS before scraping commercially.
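As a rough illustration of the robots.txt check, the sketch below tests a path against the Disallow rules in the `User-agent: *` group. The function name isPathAllowed is hypothetical, and the parsing is deliberately simplified: it handles plain prefix rules only and ignores Allow lines, wildcards, and agent-specific groups, so treat it as a starting point and not a complete parser.

```typescript
// Simplified sketch: true if `path` matches no Disallow prefix that applies
// to every crawler (User-agent: *). Allow rules and wildcards are ignored.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = []
  let inStarGroup = false
  let lastWasAgent = false
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim() // strip comments
    const sep = line.indexOf(':')
    if (sep === -1) continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      inStarGroup = lastWasAgent ? inStarGroup || value === '*' : value === '*'
      lastWasAgent = true
    } else {
      lastWasAgent = false
      // An empty Disallow value means nothing is disallowed.
      if (field === 'disallow' && inStarGroup && value !== '') {
        disallowed.push(value)
      }
    }
  }
  return !disallowed.some((prefix) => path.startsWith(prefix))
}

const robots = 'User-agent: *\nDisallow: /private/\n'
console.log(isPathAllowed(robots, '/products/1')) // true
console.log(isPathAllowed(robots, '/private/x'))  // false
```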
How do I handle CAPTCHAs in Puppeteer?
For light usage, keep a human in the loop: pause the script and prompt the user to solve the CAPTCHA. For automation, CAPTCHA-solving services such as 2Captcha or CapSolver provide APIs that return solved tokens. Some CAPTCHAs can be avoided entirely by using realistic browser fingerprints, Puppeteer Stealth, and rate limiting that mimics human behaviour.
What is the difference between waitUntil: 'networkidle0' and 'networkidle2'?
networkidle0 waits until there are zero network connections for 500ms — good for pages that stop all requests when fully loaded. networkidle2 waits until there are at most 2 connections for 500ms — better for pages that maintain persistent WebSocket or polling connections. Use networkidle2 as the default and only switch to networkidle0 if you know the page fully quiesces.
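One way to encode that default is a small wrapper around navigation, sketched below. The names gotoWhenIdle, PageLike, and fullyQuiesces are hypothetical; PageLike is a structural stand-in for the part of Puppeteer's Page used here, so the sketch compiles without the library, and a real Page should satisfy the same shape.

```typescript
type WaitUntil = 'networkidle0' | 'networkidle2'

// Minimal stand-in for the goto method of Puppeteer's Page (an assumption made
// to keep the sketch self-contained).
interface PageLike {
  goto(url: string, options: { waitUntil: WaitUntil; timeout: number }): Promise<unknown>
}

// Hypothetical helper: fullyQuiesces is what the caller knows about the page.
async function gotoWhenIdle(
  page: PageLike,
  url: string,
  fullyQuiesces = false
): Promise<WaitUntil> {
  // Default to networkidle2 so pages with WebSocket or polling connections
  // do not hang until the timeout; use networkidle0 only when told it is safe.
  const waitUntil: WaitUntil = fullyQuiesces ? 'networkidle0' : 'networkidle2'
  await page.goto(url, { waitUntil, timeout: 30_000 })
  return waitUntil
}
```

Returning the chosen mode is only for logging and debugging; the navigation itself is done by page.goto.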