The Ultimate Guide to Scraping Baidu Efficiently

SwiftProxy
By Martin Koenig
2025-06-03 15:24:12

Over 70% of China's internet users turn to Baidu for information daily. Yet, scraping its search results? That's a whole different ballgame. Baidu's dynamic pages and anti-scraping defenses make it tough to grab data smoothly — but not impossible.

In this guide, we'll walk you through exactly how to scrape Baidu's organic search results with Python, using Swiftproxy API and Residential Proxies. By the end, you'll have a solid, maintainable scraper that delivers clean, actionable data — no headaches.

What's on Baidu's Search Results Page

Baidu's SERP looks familiar if you know Google or Bing: it shows paid ads, organic results, and related searches. But there's nuance.

Organic Results: These are the real gems. They reflect Baidu's algorithmic ranking of pages matching your query.

Paid Results: Marked with an ad label ("广告"), these are sponsored placements.

Related Searches: Found at the bottom, these help users explore similar topics.

Why Scraping Baidu Isn't Easy

Baidu uses:

CAPTCHAs that block bots.

IP and user-agent blocking.

Dynamic HTML that shifts structure regularly.

Bottom line? Your scraper has to be adaptive and persistent. That's why a solid API is a game-changer — it handles these challenges under the hood. No more endless tweaking or manual blocks.
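If you do roll your own requests, "adaptive and persistent" usually means rotating browser-like User-Agent headers and backing off between retries. Here's a minimal sketch of that pattern; the User-Agent pool, retry count, and helper name are illustrative assumptions, not anything Baidu or Swiftproxy prescribes:

import random
import time

import requests

# Illustrative pool of browser-like User-Agent strings (assumed, not exhaustive)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def fetch_with_retries(url: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before the next attempt
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")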

Can You Scrape Baidu?

Scraping publicly available data usually sits in a legal grey area but is often allowed if you avoid:

Logging into accounts.

Downloading copyrighted or private content.

Always get professional legal advice before starting large-scale scraping projects. Stay ethical. Stay safe.

Step-by-Step Guide to Scraping Baidu

Step 1: Set Up Your Python Environment

You'll need:

pip install requests beautifulsoup4 pandas

Import these in your script:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Step 2: Define API Endpoint and Credentials

Here's the URL for Swiftproxy API:

url = 'https://realtime.swiftproxy.net/v1/queries'
auth = ('your_api_username', 'your_api_password')

Replace with your actual credentials.
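Hard-coding credentials is fine for a quick test, but for anything shared it's safer to pull them from environment variables. A small sketch, with hypothetical variable names:

import os

# Hypothetical variable names; set these in your shell before running the script
auth = (
    os.environ["SWIFTPROXY_API_USERNAME"],
    os.environ["SWIFTPROXY_API_PASSWORD"],
)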

Step 3: Build Your Request Payload

The payload tells the API what to scrape:

payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
}

Swap the URL and location as needed.
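In that Baidu URL, wd carries the query term, rn the number of results per page, and ie the input encoding. If you want to build it from an arbitrary query, a tiny helper keeps the encoding correct (the function name is ours, not part of the API):

from urllib.parse import quote

def build_baidu_url(query: str, results_per_page: int = 50) -> str:
    # wd = query term, rn = results per page, ie = input encoding
    return f"https://www.baidu.com/s?ie=utf-8&wd={quote(query)}&rn={results_per_page}"

payload['url'] = build_baidu_url('nike')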

Step 4: Make the API Request and Handle Response

response = requests.post(url, json=payload, auth=auth, timeout=180)
response.raise_for_status()
json_data = response.json()

if not json_data.get("results"):
    print("No results found for the given query.")
    exit()
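Long-running scraping jobs should expect timeouts and transient server errors, so in practice you may want to wrap the call. A sketch using requests' built-in exception types:

try:
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("The API request timed out; a retry is usually the right move.")
    raise
except requests.exceptions.HTTPError as err:
    print(f"The API returned an error response: {err}")
    raise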

Step 5: Parse Baidu's HTML Search Results

Extract titles and URLs with BeautifulSoup. Here's a robust parser function:

def parse_baidu_html_results(html_content: str) -> list[dict]:
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    # Each organic result sits in a div.c-container that carries an id attribute
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        # Title markup varies between result types, so try several known selectors
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        # Keep only results that have both a title and a link
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results
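To test the parser without spending API requests, you can run it against a locally saved copy of a results page (baidu_page.html below is a hypothetical file saved from an earlier response):

# Offline test: feed the parser a locally saved Baidu results page
with open("baidu_page.html", encoding="utf-8") as f:
    results = parse_baidu_html_results(f.read())
print(results[:3])  # spot-check the first few parsed entries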

Step 6: Store Results in a CSV

Saving parsed data is straightforward:

def store_to_csv(data: list[dict], filename="baidu_results.csv"):
    df = pd.DataFrame(data)
    # utf-8-sig adds a BOM so Chinese titles open cleanly in Excel
    df.to_csv(filename, index=False, encoding="utf-8-sig")
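If you scrape several queries in one session, you may prefer appending to one file instead of overwriting it. A sketch (the helper name is ours):

import os

import pandas as pd

def append_to_csv(data: list[dict], filename="baidu_results.csv"):
    df = pd.DataFrame(data)
    # Write the header only when the file does not exist yet
    df.to_csv(
        filename,
        mode="a",
        index=False,
        header=not os.path.exists(filename),
        encoding="utf-8-sig",
    )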

Full Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

def parse_baidu_html_results(html_content: str) -> list[dict]:
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results

def store_to_csv(data: list[dict], filename="baidu_results.csv"):
    df = pd.DataFrame(data)
    # utf-8-sig adds a BOM so Chinese titles open cleanly in Excel
    df.to_csv(filename, index=False, encoding="utf-8-sig")

def main():
    url = 'https://realtime.swiftproxy.net/v1/queries'
    auth = ('your_api_username', 'your_api_password')
    payload = {
        'source': 'universal',
        'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
        'geo_location': 'United States',
    }

    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
    json_data = response.json()

    if not json_data.get("results"):
        print("No results found for the given query.")
        return

    html_content = json_data["results"][0]["content"]
    parsed_data = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data)
    print(f"Scraped {len(parsed_data)} results. Saved to baidu_results.csv")

if __name__ == "__main__":
    main()

Run this, and you’ll get a clean CSV of Baidu search results.

Scraping Baidu Using Residential Proxies

Prefer controlling requests yourself? Residential Proxies route your traffic through real household IPs, so your requests look like ordinary visitors, and the addresses can rotate automatically to avoid blocks.
Here's how to tweak the main() function to use proxies:

def main():
    url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
    proxy_entry = "http://customer-<your_username>:<your_password>@pr.swiftproxy.net:10000"
    proxies = {
        "http": proxy_entry,
        "https": proxy_entry,
    }

    headers = {
        # A browser-like User-Agent; Baidu often blocks default client UAs
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    }
    response = requests.get(url, headers=headers, proxies=proxies, timeout=180)
    response.raise_for_status()
    html_content = response.text
    parsed_data = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data)
    print(f"Scraped {len(parsed_data)} results with proxies. Saved to baidu_results.csv")

Replace placeholders with your proxy credentials.
The rest of the parsing and storing functions stay the same.
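Because rotating residential proxies typically exit from a different IP on each request, you can loop over several queries in one run. Here's a sketch that reuses parse_baidu_html_results and store_to_csv from earlier; the query list and two-second delay are arbitrary choices:

import time
from urllib.parse import quote

import requests

# Assumes parse_baidu_html_results and store_to_csv are defined as above
proxy_entry = "http://customer-<your_username>:<your_password>@pr.swiftproxy.net:10000"
proxies = {"http": proxy_entry, "https": proxy_entry}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

queries = ["nike", "adidas", "new balance"]
for query in queries:
    search_url = f"https://www.baidu.com/s?ie=utf-8&wd={quote(query)}&rn=50"
    response = requests.get(search_url, headers=headers, proxies=proxies, timeout=180)
    response.raise_for_status()
    results = parse_baidu_html_results(response.text)
    store_to_csv(results, filename=f"baidu_{query.replace(' ', '_')}.csv")
    time.sleep(2)  # polite delay between queries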

API vs Proxies vs Manual

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Manual Scraping | Full control, no extra costs | IP bans, no geo-targeting | Small projects, unrestricted sites |
| Residential Proxies | IP rotation, geo-targeting, better success rates | Proxy costs, setup required | Medium to large scale, restricted sites |
| API | Maintenance-free, handles blocks and CAPTCHAs | Higher cost, less customization | Enterprise scale, complex sites |

Final Thoughts

Scraping Baidu is tricky but rewarding. Use APIs or proxies to overcome Baidu's defenses without endless maintenance. You get reliable, scalable data—ready to power your business insights or research.

About the Author

SwiftProxy
Martin Koenig
Head of Commerce
Martin Koenig is a seasoned business strategist with more than a decade of experience across the technology, telecommunications, and consulting industries. As Head of Commerce, he combines cross-industry expertise with data-driven thinking to uncover growth opportunities and create measurable business value.
The Swiftproxy blog provides content for informational purposes only and makes no warranties of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, and accepts no responsibility for the content of third-party websites referenced in the blog. Readers are strongly advised to consult qualified legal counsel and to review the target website's terms of service before undertaking any web scraping or automated data collection. In some cases, explicit authorization or scraping permission may be required.