
Ever wondered how businesses track prices, monitor competitors, or gather huge datasets behind the scenes? The answer is web scraping. When you combine Puppeteer with proxies, you get a powerful setup that makes the process smooth and efficient.
Many websites defend against bots with JavaScript-loaded content, CAPTCHAs, and IP bans, so you can't just send a simple HTTP request anymore. Puppeteer acts like a real browser, handling dynamic content and letting you interact with pages as if you were browsing manually. Paired with smart proxies, it becomes a formidable scraping tool.
Puppeteer is a Node.js library that gives you control over Chrome or Chromium. Unlike traditional scraping tools, it can fully render JavaScript-heavy pages — the kind that break basic scrapers.
Think of it as having a fully automated Chrome browser at your fingertips, perfect for modern, interactive sites.
Dynamic content scraping: Many sites don't load data until you scroll or click. Puppeteer handles it all.
Automated testing: Run UI tests without staring at a screen.
SEO tracking: Keep an eye on competitor changes or ranking shifts.
However, websites can (and will) block repeated requests from the same IP. That's where proxies come in.
First, install it:
npm install puppeteer
By default, Puppeteer runs in headless mode. No browser window. No distractions. Faster and lighter on resources. If you need to debug, just set headless: false and watch it work.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');
  console.log('Page loaded!');
  await browser.close();
})();
Simple, clean, and you're already scraping.
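When something goes wrong, it helps to watch the browser work. One optional tweak (a sketch, not required for scraping) is to launch with headless: false and a slowMo delay so each action is visible:

```javascript
// Hypothetical debug configuration: show the browser window and slow
// each Puppeteer action down so you can follow along.
const debugLaunchOptions = {
  headless: false, // open a visible browser window
  slowMo: 250,     // pause 250 ms between each Puppeteer action
};

console.log(debugLaunchOptions);
```

While debugging, pass these options to puppeteer.launch(debugLaunchOptions); switch back to headless: true for production runs.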
Once the page is loaded, you can grab whatever you want from the DOM. Let's pull book titles, prices, and stock info from Books to Scrape.
const titleSelector = 'article.product_pod h3 a';
const priceSelector = 'article.product_pod p.price_color';
const availabilitySelector = 'article.product_pod p.instock.availability';

const bookData = await page.evaluate((titleSelector, priceSelector, availabilitySelector) => {
  const books = [];
  const titles = document.querySelectorAll(titleSelector);
  const prices = document.querySelectorAll(priceSelector);
  const availability = document.querySelectorAll(availabilitySelector);

  titles.forEach((title, index) => {
    books.push({
      // The visible link text is truncated for long titles on this site,
      // so read the full title from the link's title attribute instead.
      title: title.getAttribute('title'),
      price: prices[index].textContent.trim(),
      availability: availability[index].textContent.trim(),
    });
  });

  return books;
}, titleSelector, priceSelector, availabilitySelector);

console.log(bookData);
console.log(bookData);
Now you’ve got a neat JSON array ready to analyze or store.
JavaScript-heavy pages often load elements after the initial request. If you scrape too early, you’ll get nothing but air.
Key commands to handle this:
page.waitForSelector(): Waits for a specific element to show up.
page.waitForNavigation(): Waits for a full page load or redirect.
Example:
await page.goto('https://books.toscrape.com/');
await page.waitForSelector('article.product_pod');
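For waitForNavigation(), a common pattern is to start waiting before triggering the navigation, since a fast redirect can finish before the listener is attached. A small helper sketch (the selector in the usage comment is just an example):

```javascript
// Hypothetical helper: click a link and wait for the resulting navigation.
// Promise.all starts both operations together, avoiding a race where the
// navigation completes before waitForNavigation begins listening.
async function clickAndWait(page, selector) {
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click(selector),
  ]);
}

// Usage: await clickAndWait(page, 'li.next a'); // e.g. follow a pagination link
```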
Let's use residential proxies as an example. Here's a quick setup.
const puppeteer = require('puppeteer');

(async () => {
  const proxyServer = 'rp.scrapegw.com:6060';
  const proxyUsername = 'proxy_username';
  const proxyPassword = 'proxy_password';

  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=http://${proxyServer}`],
  });

  const page = await browser.newPage();
  await page.authenticate({
    username: proxyUsername,
    password: proxyPassword,
  });

  await page.goto('https://httpbin.org/ip', { waitUntil: 'networkidle2' });
  const content = await page.evaluate(() => document.body.innerText);
  console.log('IP Info:', content);

  await browser.close();
})();
Proxy integration: Routes all traffic through the proxy, with credentials supplied via page.authenticate().
Verification step: Visits httpbin.org/ip to confirm the proxy is active — the IP printed should be the proxy's, not yours.
Headless by default: Fast and quiet.
Puppeteer combined with solid proxies is a game-changer: you get dynamic scraping power and the stealth needed to avoid bans. Remember that the quality of your proxies is critical — cheap or overused proxies can ruin your scraping and get you blocked fast.