
Over 70% of China's internet users turn to Baidu for information daily. Yet, scraping its search results? That's a whole different ballgame. Baidu's dynamic pages and anti-scraping defenses make it tough to grab data smoothly — but not impossible.
In this guide, we'll walk you through exactly how to scrape Baidu's organic search results with Python, using Swiftproxy API and Residential Proxies. By the end, you'll have a solid, maintainable scraper that delivers clean, actionable data — no headaches.
Baidu's SERP looks familiar if you know Google or Bing: it shows paid ads, organic results, and related searches. But there's nuance.
Organic Results: These are the real gems. They reflect Baidu's algorithmic ranking of pages matching your query.
Paid Results: Clearly labeled "广告" (ad), these are sponsored placements.
Related Searches: Found at the bottom, these help users explore similar topics.
Baidu uses:
CAPTCHAs that block bots.
IP and user-agent blocking.
Dynamic HTML that shifts structure regularly.
Bottom line? Your scraper has to be adaptive and persistent. That's why a solid API is a game-changer — it handles these challenges under the hood. No more endless tweaking or manual blocks.
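If you ever roll your own requests instead, your code needs to absorb failures and vary its fingerprint. Here's a minimal sketch of that idea; the user-agent strings, retry count, and backoff values are illustrative choices, not anything Baidu requires:

import random
import time

import requests

# Illustrative desktop user-agent strings; rotate whatever pool you maintain
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]

def fetch_with_retries(url: str, max_attempts: int = 3) -> str:
    """Fetch a page, rotating the User-Agent and backing off between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = requests.get(url, headers=headers, timeout=30)
        if response.ok:
            return response.text
        time.sleep(2 ** attempt)  # exponential backoff before the next try
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")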
Scraping publicly available data usually sits in a legal grey area but is often allowed if you avoid:
Logging into accounts.
Downloading copyrighted or private content.
Always get professional legal advice before starting large-scale scraping projects. Stay ethical. Stay safe.
You'll need Python 3.9 or newer (the parser below uses built-in generic type hints like list[dict]) and three libraries:
pip install requests beautifulsoup4 pandas
Import these in your script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Start by pointing your script at the Swiftproxy API endpoint and setting your account credentials:
url = 'https://realtime.swiftproxy.net/v1/queries'
auth = ('your_api_username', 'your_api_password')
Replace with your actual credentials.
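Rather than hard-coding credentials, you can pull them from environment variables. A small sketch, assuming you've exported SWIFTPROXY_USERNAME and SWIFTPROXY_PASSWORD (names of our choosing, not ones the API requires):

import os

# Hypothetical variable names; export them in your shell before running the script
auth = (os.environ["SWIFTPROXY_USERNAME"], os.environ["SWIFTPROXY_PASSWORD"])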
The payload tells the API what to scrape:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
}
Swap the URL and location as needed.
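If you plan to scrape more than one keyword, it helps to build the Baidu URL programmatically rather than hand-editing the query string. A small sketch, where build_baidu_url is a hypothetical helper of our own (in the URL, ie sets the character encoding, wd the keyword, and rn the number of results per page):

from urllib.parse import urlencode

def build_baidu_url(keyword: str, results_per_page: int = 50) -> str:
    # ie = character encoding, wd = search keyword, rn = results per page
    params = {"ie": "utf-8", "wd": keyword, "rn": results_per_page}
    return f"https://www.baidu.com/s?{urlencode(params)}"

payload["url"] = build_baidu_url("nike")

With the payload ready, send the request and check that the API actually returned something: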
response = requests.post(url, json=payload, auth=auth, timeout=180)
response.raise_for_status()
json_data = response.json()
if not json_data["results"]:
    print("No results found for the given query.")
    exit()
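If the API does return results, the rendered Baidu HTML sits in the first result's content field, which is what you'll hand to the parser:

html_content = json_data["results"][0]["content"]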
Extract titles and URLs with BeautifulSoup. Here's a robust parser function:
def parse_baidu_html_results(html_content: str) -> list[dict]:
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue

        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})

    return parsed_results
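To sanity-check the parser, run it on the HTML you pulled from the API and print the first few hits:

parsed_data = parse_baidu_html_results(html_content)
for result in parsed_data[:5]:
    print(result["title"], "->", result["url"])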
Saving parsed data is straightforward:
def store_to_csv(data: list[dict], filename="baidu_results.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
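One caveat: Baidu titles are mostly Chinese, and Excel often mangles plain UTF-8 CSVs. If your results will be opened in Excel, a variant of the same helper that writes a byte-order mark avoids that:

def store_to_csv(data: list[dict], filename="baidu_results.csv"):
    df = pd.DataFrame(data)
    # utf-8-sig writes a BOM so Excel displays Chinese titles correctly
    df.to_csv(filename, index=False, encoding="utf-8-sig")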
Putting it all together, here's the full script:

import requests
import pandas as pd
from bs4 import BeautifulSoup


def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Extract organic result titles and URLs from a Baidu SERP page."""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        # Baidu's markup shifts, so try several known title selectors
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue

        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})

    return parsed_results


def store_to_csv(data: list[dict], filename="baidu_results.csv"):
    """Write the parsed results to a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)


def main():
    # Swiftproxy API endpoint and credentials
    url = 'https://realtime.swiftproxy.net/v1/queries'
    auth = ('your_api_username', 'your_api_password')

    payload = {
        'source': 'universal',
        'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
        'geo_location': 'United States',
    }

    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()

    json_data = response.json()
    if not json_data["results"]:
        print("No results found for the given query.")
        return

    html_content = json_data["results"][0]["content"]
    parsed_data = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data)
    print(f"Scraped {len(parsed_data)} results. Saved to baidu_results.csv")


if __name__ == "__main__":
    main()
Run this, and you’ll get a clean CSV of Baidu search results.
Prefer controlling requests yourself? Residential Proxies route your traffic through real household IPs, which helps you avoid blocks, and they rotate automatically.
Here's how to tweak the main() function to use proxies:
def main():
    url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
    proxy_entry = "http://customer-<your_username>:<your_password>@pr.swiftproxy.net:10000"
    proxies = {
        "http": proxy_entry,
        "https": proxy_entry,
    }

    response = requests.get(url, proxies=proxies, timeout=180)
    html_content = response.text

    parsed_data = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data)
    print(f"Scraped {len(parsed_data)} results with proxies. Saved to baidu_results.csv")
Replace placeholders with your proxy credentials.
The rest of the parsing and storing functions stay the same.
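When you manage requests yourself, two extra habits pay off: send a browser-like User-Agent and pause between queries. Here's a hedged sketch for scraping several keywords through the same proxies, reusing the parse_baidu_html_results function from above; the header string and three-second delay are illustrative defaults, not magic values:

import time

import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
}

def scrape_keywords(keywords: list[str], proxies: dict) -> list[dict]:
    """Fetch one Baidu results page per keyword through the proxies, pausing between requests."""
    all_results = []
    for keyword in keywords:
        response = requests.get(
            "https://www.baidu.com/s",
            params={"ie": "utf-8", "wd": keyword, "rn": 50},
            headers=BROWSER_HEADERS,
            proxies=proxies,
            timeout=180,
        )
        response.raise_for_status()
        all_results.extend(parse_baidu_html_results(response.text))
        time.sleep(3)  # space out requests to lower the chance of blocks
    return all_results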
Here's how the three approaches compare:

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Manual Scraping | Full control, no extra costs | IP bans, no geo-targeting | Small projects, unrestricted sites |
| Residential Proxies | IP rotation, geo-targeting, better success | Proxy costs, setup required | Medium to large scale, restricted sites |
| API | Maintenance-free, handles blocks & CAPTCHAs | Higher cost, less customization | Enterprise scale, complex sites |
Scraping Baidu is tricky but rewarding. Use APIs or proxies to overcome Baidu's defenses without endless maintenance. You get reliable, scalable data—ready to power your business insights or research.