
If you want to harvest large volumes of web data without getting bogged down by slow speeds or IP bans, Scrapy is your powerhouse. This open-source Python framework isn't just popular for nothing — it's fast, flexible, and built to handle serious scale.
But scraping isn't just about pulling data. It's about doing it smartly. That means integrating proxies, rotating user agents, and dodging detection like a pro.
Ready to dive in? Let's build a rock-solid Scrapy project from scratch, solve common headaches, and show you how to make your scrapers stealthy and scalable.
First, get Python installed. Grab the latest version (3.13.3 as of now) from the official site. Windows users, here's a tip: check the “Add python.exe to PATH” box during installation.
Next, fire up your Command Prompt and install Scrapy with:
pip install scrapy
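You can confirm the install worked by checking the version:
scrapy version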
Now create your project folder:
scrapy startproject ScrapyTutorial
This lays down a clean, organized structure. Think of it as your project's blueprint — scrapy.cfg for settings, items.py for your data models, pipelines.py for processing scraped info, and the all-important spiders folder, where the scraping magic happens.
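If the command succeeded, the layout looks roughly like this (minor details vary between Scrapy versions):
ScrapyTutorial/
    scrapy.cfg
    ScrapyTutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py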
To create your first spider, navigate into your project folder:
cd ScrapyTutorial
scrapy genspider SpiderName example.com
Here, SpiderName is your spider's name and example.com is the target domain.
Open that spider file in your favorite IDE — Visual Studio Code works wonders. Notice the allowed_domains list? It's your safety net, making sure your spider stays on target.
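For reference, the generated spider looks roughly like this (Scrapy derives the class name from whatever you passed to genspider):
import scrapy


class SpidernameSpider(scrapy.Spider):
    name = "SpiderName"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Your extraction logic goes here.
        pass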
Right now, your spider grabs the whole HTML page, which is messy. Time to zoom in.
Use CSS selectors to pinpoint exactly what you want. Open the target website in a browser, press Ctrl + Shift + I, and inspect the element you need.
Say you want pricing info displayed like:
<p class="tp-headline-m text-neutral-0">$0.22</p>
Target it precisely:
pricing = response.css('[class="tp-headline-m text-neutral-0"]::text').getall()
if pricing:
    print("Price details:")
    for price in pricing:
        print(f"- {price.strip()}")
Simple. Elegant. Accurate.
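In a real spider that snippet lives inside parse(), and you'd normally yield items instead of printing so the data flows through Scrapy's pipelines and exporters. A minimal sketch; the class name and the /pricing URL are placeholders for illustration:
import scrapy


class PricingSpider(scrapy.Spider):
    name = "pricing"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/pricing"]  # placeholder URL

    def parse(self, response):
        # Yield one item per price found on the page.
        for price in response.css('[class="tp-headline-m text-neutral-0"]::text').getall():
            yield {"price": price.strip()}
Run it with scrapy crawl pricing -O prices.json and Scrapy writes the results straight to a JSON file.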
CSS selectors are great, but sometimes you need surgical precision. Enter XPath, a powerful query language for navigating HTML documents.
Example:
//*/parent::p
This selects every <p> element that is the immediate parent of another element.
Use XPath when your data sits inside tricky nested structures or when attributes alone don't cut it.
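In Scrapy, XPath works through response.xpath() exactly the way response.css() does. A quick sketch with illustrative selectors:
# Text from every <p> that has a child element (the expression above).
paragraphs = response.xpath('//*/parent::p//text()').getall()

# Attribute-based selection inside a hypothetical pricing container.
price = response.xpath('//div[@id="pricing"]//p[contains(@class, "tp-headline-m")]/text()').get()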
Scrapy handles static HTML like a champ, but JavaScript? That's a different beast.
If your target site loads data dynamically — think content appearing after button clicks, scrolls, or API calls — Scrapy alone won't cut it.
Here's where Selenium and Playwright step in:
Selenium: Browser automation powerhouse. Perfect for sites requiring login, clicks, or any interaction.
Playwright: Microsoft's fast, reliable alternative. Auto-waits for page loads and handles complex JS effortlessly.
Integrate these with Scrapy middleware to combine speed with dynamic content handling.
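As one example, the scrapy-playwright package wires Playwright into Scrapy's request handling. A minimal sketch, assuming you've run pip install scrapy-playwright and playwright install (check the package docs for your version):
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In your spider: flag the requests that need a real browser.
def start_requests(self):
    yield scrapy.Request("https://example.com", meta={"playwright": True})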
Websites hate being scraped too aggressively. They throw up barriers like IP bans, CAPTCHAs, and throttling.
If you use a single IP, expect quick detection.
Solution? Proxies.
Residential proxies, in particular, make your traffic look like it comes from genuine users, offering excellent anonymity.
Set them up quickly with scrapy-rotating-proxies:
pip install scrapy-rotating-proxies
Add your proxies to settings.py:
ROTATING_PROXY_LIST = [
    'http://username:password@proxy_address:port',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Rotate IPs seamlessly to avoid bans and keep scraping uninterrupted.
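If you'd rather keep proxy credentials out of settings.py, the package also documents a ROTATING_PROXY_LIST_PATH setting that loads proxies from a text file, one URL per line:
ROTATING_PROXY_LIST_PATH = 'proxies.txt'  # one proxy URL per line; the path is an example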
Changing IPs isn't enough. Websites also track your user-agent — the digital fingerprint telling them which browser and OS you're using.
Rotate those user agents to appear human and varied.
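One lightweight way to do that, with no extra packages, is a small custom downloader middleware. A minimal sketch; the user-agent strings are placeholders you'd swap for real, current ones:
# middlewares.py
import random

USER_AGENTS = [
    # Placeholder strings; substitute real, up-to-date user agents.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
Then register it in DOWNLOADER_MIDDLEWARES, for example 'ScrapyTutorial.middlewares.RandomUserAgentMiddleware': 400 (the path and priority are just examples for this project layout).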
Also, manage sessions and cookies wisely. Different cookies per session reduce detection risk. Scrapy's CookiesMiddleware does this out of the box but can be fine-tuned for your needs.
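For separate sessions, Scrapy's documented cookiejar meta key lets one spider juggle several independent cookie sessions at once:
def start_requests(self):
    # Each distinct 'cookiejar' value keeps its own, isolated cookie session.
    for i, url in enumerate(self.start_urls):
        yield scrapy.Request(url, meta={"cookiejar": i})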
Bots tend to scream “I’m automated!” by requesting pages lightning-fast.
Scrapy lets you add delays easily:
DOWNLOAD_DELAY = 2 # seconds
Slowing down requests mimics human browsing patterns, dramatically lowering your ban risk.
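Scrapy also randomizes DOWNLOAD_DELAY by default and ships an AutoThrottle extension that adapts the delay to how quickly the server responds. A sketch of the relevant settings.py knobs:
DOWNLOAD_DELAY = 2                 # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True    # vary each delay between 0.5x and 1.5x of the base
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10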
A few common errors and quick fixes:
407 Proxy Authentication Error: Make sure proxy strings are formatted like http://username:password@host:port.
Proxy Downtime: Check your proxies with online tools before scraping.
403 Forbidden: Increase privacy measures — rotate IPs and user agents, add delays.
Using Scrapy together with smart proxies creates a powerful and scalable solution for stealthy web scraping. After getting comfortable with these fundamentals, you can move on to tools like Selenium or Playwright to handle dynamic content more effectively. Stay vigilant, as anti-scraping measures are always advancing. Continuously refine your approach by rotating IP addresses and user agents, managing sessions carefully, and always practicing responsible scraping.