Scraping search engines like Yandex is no small feat. It's a task that comes with hurdles, especially if you're trying to do it at scale. But if you know how to handle the challenge, the rewards are immense. You get to unlock valuable data for SEO analysis, competitive research, and more. Ready to dive in?
In this guide, we'll walk you through building a custom Yandex scraper using proxies, then show you how to leverage a scraper API to extract Yandex search results with far less effort. No fluff, just practical steps you can apply right away.
Yandex, like other major search engines, displays results based on relevance, quality, location, and personalization. The Yandex SERP is split into two main sections: Advertisements and Organic Results.
Let's imagine you searched for "iPhone". Here's how the results look:
Advertisements: These are clearly marked as "Sponsored" or "Advertisement" and show product details like prices and links.
Organic results: These are the pages that appear because they're most relevant to the query.
While ads are easy to identify, scraping organic results is trickier: Yandex is notorious for its anti-bot protection, especially the dreaded CAPTCHA. So, let's talk about how you can get past it.
Yandex doesn't make it easy. Their CAPTCHA and anti-bot system are designed to stop scrapers dead in their tracks. If you're not careful, you’ll find your IP blocked in no time. To make matters worse, Yandex continuously updates its anti-bot measures, forcing scrapers to constantly adapt.
But don't sweat it: there's a solution. Proxies and a scraper API are your best friends here. Proxies hide your real IP, making requests look like they come from many different users. A scraper API takes this a step further, handling proxy rotation and CAPTCHAs for you.
Now, let's jump into the meat of the tutorial.
Before we get into scraping, let's make sure your environment is ready. You'll need Python installed on your system. If you haven't done that yet, head to the official Python website and grab the latest version.
Next, let's install the Python libraries we’ll use for this project: requests, BeautifulSoup, and pandas. Open your terminal and run this command:
pip install requests pandas beautifulsoup4
These libraries are the building blocks:
Requests: Sends the HTTP requests (with proxy support).
BeautifulSoup: Extracts the data you need from raw HTML.
Pandas: Saves the scraped data into a clean CSV file.
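To confirm the installation worked, you can run a quick sanity check. This is a minimal sketch that just imports each library and prints its version:

# Sanity check: if these imports succeed, the environment is ready
import requests
import bs4
import pandas as pd

print(requests.__version__, bs4.__version__, pd.__version__)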
This part is where the fun begins. We’ll build a basic scraper that uses residential proxies to bypass Yandex's CAPTCHA and IP blocks.
Here's how to configure your proxies and headers to mimic a real user.
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Proxies and Authentication Details
USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'
proxies = {
    'http': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777'
}

# Request headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) Gecko/20100101 Firefox/137.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8',
    'Connection': 'keep-alive'
}
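Before pointing the scraper at Yandex, it's worth confirming that traffic actually flows through the proxy. A minimal sketch, assuming the pr.swiftproxy.net endpoint above and using httpbin.org purely as a neutral IP-echo service:

# Ask an IP-echo service which IP it sees; it should be the proxy's exit IP,
# not your own. httpbin.org is used here only for illustration.
check = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
check.raise_for_status()
print('Requests exit through:', check.json()['origin'])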
Next, we send a GET request to Yandex using the proxies and headers. This will fetch the search results.
response = requests.get(
    'https://yandex.com/search/?text=what%20is%20web%20scraping',
    proxies=proxies,
    headers=headers
)
response.raise_for_status() # Ensure we get a successful response
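Even with residential proxies, the occasional request can still get challenged. One defensive pattern is to retry with exponential backoff when the response looks like a CAPTCHA page. This is a sketch, not part of the original flow, and the check for a "showcaptcha" redirect is an assumption about how Yandex reroutes suspected bots, so verify it against real responses:

import time

def fetch_with_retries(url, retries=3):
    # Retry a few times with exponential backoff; on a rotating residential
    # proxy, each attempt typically exits through a different IP.
    for attempt in range(retries):
        resp = requests.get(url, proxies=proxies, headers=headers, timeout=15)
        # Heuristic (assumption): Yandex redirects suspected bots to a CAPTCHA URL.
        if resp.ok and 'showcaptcha' not in resp.url:
            return resp
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
    resp.raise_for_status()
    return resp

You could then swap the plain requests.get call above for fetch_with_retries and keep the rest of the flow unchanged.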
Now, let's parse the raw HTML response to extract the search results. We’ll use BeautifulSoup to grab the title and link for each result.
soup = BeautifulSoup(response.text, 'html.parser')
data = []
for listing in soup.select('li.serp-item_card'):
    title_el = listing.select_one('h2 > span')
    title = title_el.text if title_el else None
    link_el = listing.select_one('.organic__url')
    link = link_el.get('href') if link_el else None
    data.append({'Title': title, 'Link': link})
Once you’ve extracted the data, it's time to save it to a CSV file. This step is easy with pandas.
df = pd.DataFrame(data)
df.to_csv('yandex_results.csv', index=False)
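One page is rarely enough. Yandex paginates results with a p query parameter (zero-indexed, as far as we can tell; treat that as an assumption and confirm it in your browser). Here's a sketch that loops over the first few pages and reuses the same parsing logic:

all_data = []
for page in range(3):  # first three result pages
    resp = requests.get(
        f'https://yandex.com/search/?text=what%20is%20web%20scraping&p={page}',
        proxies=proxies,
        headers=headers
    )
    resp.raise_for_status()
    page_soup = BeautifulSoup(resp.text, 'html.parser')
    for listing in page_soup.select('li.serp-item_card'):
        title_el = listing.select_one('h2 > span')
        link_el = listing.select_one('.organic__url')
        all_data.append({
            'Title': title_el.text if title_el else None,
            'Link': link_el.get('href') if link_el else None,
        })

pd.DataFrame(all_data).to_csv('yandex_results_paginated.csv', index=False)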
Building your own scraper works, but it can become a hassle when you need to scale. This is where a scraper API shines.
You'll need to define the search parameters for the API. Here's a simple setup for scraping Yandex.
import requests
import pandas as pd
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
}
The API lets you define your parsing logic with CSS or XPath selectors. Let's extract the titles and links from the Yandex search results.
payload['parsing_instructions'] = {
    'listings': {
        '_fns': [{'_fn': 'css', '_args': ['li.serp-item_card']}],
        '_items': {
            'title': {'_fns': [{'_fn': 'css_one', '_args': ['h2 > span']}, {'_fn': 'element_text'}]},
            'link': {'_fns': [{'_fn': 'xpath_one', '_args': ['.//a[contains(@class, "organic__url")]/@href']}]}
        }
    }
}
Send the request to the API.
response = requests.post(
    'https://realtime.swiftproxy.net/v1/queries',
    auth=('API_USERNAME', 'API_PASSWORD'),
    json=payload
)
response.raise_for_status()
Once the response is received, extract the data and save it to a CSV file.
data = response.json()['results'][0]['content']['listings']
df = pd.DataFrame(data)
df.to_csv('yandex_results_API.csv', index=False)
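If the listings come back empty, the parsing instructions are the usual suspect. A quick way to debug is to pretty-print what the API actually returned before blaming the selectors; this sketch assumes the results[0]['content'] response shape used above:

import json

result = response.json()
# Pretty-print the parsed content so you can see what the selectors matched.
print(json.dumps(result['results'][0]['content'], indent=2))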
Approach         | Advantages                                          | Disadvantages
No Proxies       | Simple setup, no proxy costs                        | IP blocks, CAPTCHA, scaling issues
With Proxies     | Avoids IP blocks, access to geo-specific data       | Proxy service costs, maintenance
Scraper API      | Scalable, automatic CAPTCHA bypass, no setup hassle | Recurring subscription costs, vendor lock-in
Custom Solutions | Full control, ideal for JavaScript-heavy sites      | Requires technical expertise, can be slow
Scraping Yandex may seem daunting, but with the right tools and techniques it's entirely manageable. Whether you rely on proxies, a scraper API, or a fully custom scraper, you can get past Yandex's anti-bot measures and extract the valuable data you need.