How to Scrape Yelp for Local Business Data with Python

SwiftProxy
By - Martin Koenig
2025-01-09 14:51:42

In a world dominated by reviews, Yelp stands as a goldmine of local business insights. If you're looking to tap into data from Yelp—restaurant names, ratings, cuisines, and more—scraping can give you the edge. Here's how you can extract valuable information from Yelp using Python, leveraging libraries like requests and lxml.

Step 1: Preparing Your Environment

Before diving into code, let's make sure you're ready to scrape. First, you'll need Python installed, then set up the necessary libraries:

pip install requests  
pip install lxml  

These tools will allow you to send HTTP requests, parse HTML content, and extract the data you need from Yelp.

Step 2: Making a Request to Yelp

We'll begin by requesting the Yelp search page. This fetches the HTML, which we'll later parse to extract the information we want.

import requests  

# Yelp search page URL  
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"  

# Send GET request to fetch the HTML content  
response = requests.get(url)  

# Check if request was successful  
if response.status_code == 200:  
    print("Page fetched successfully!")  
else:  
    print(f"Failed to retrieve page, status code: {response.status_code}")  

Yelp's servers will likely block bare requests that lack proper headers. So, we need to make our request look like it comes from a regular browser.

Step 3: Handling HTTP Headers

Here's how you can set headers that simulate a real browser request. This reduces the chances of being blocked:

headers = {  
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',  
    'Accept-Language': 'en-US,en;q=0.5'  
}  

response = requests.get(url, headers=headers)  

Without these headers, your requests might get rejected. It's a simple trick that goes a long way.

Step 4: Proxies for Uninterrupted Scraping

If you're scraping a large number of pages, there's a risk your IP will get blocked. The solution? Proxies. By rotating IP addresses, you make it harder for Yelp to detect and block your scraping activity.

proxies = {  
    'http': 'http://username:password@proxy-server:port',  
    'https': 'https://username:password@proxy-server:port'  
}  

response = requests.get(url, headers=headers, proxies=proxies)  

Using rotating proxies can keep your scraper running smoothly for longer periods.
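
One simple way to rotate is to cycle through a pool of proxy endpoints and build a fresh proxies dict for each request. The addresses below are placeholders, not real endpoints — substitute the ones your provider gives you. A minimal sketch:

```python
from itertools import cycle

# Placeholder proxy endpoints -- replace with your provider's real addresses
PROXY_POOL = [
    "http://username:password@proxy1.example.com:8080",
    "http://username:password@proxy2.example.com:8080",
    "http://username:password@proxy3.example.com:8080",
]

proxy_cycle = cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each call hands back the next endpoint, wrapping around at the end:
# response = requests.get(url, headers=headers, proxies=next_proxies())
```

Because cycle() wraps around forever, long scraping runs spread requests evenly across the whole pool without any extra bookkeeping.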

Step 5: Parsing HTML Content

Now, let's extract data from the HTML we've fetched. We'll use lxml to parse the content and pinpoint the specific data we need, like restaurant names, ratings, and URLs.

from lxml import html  

# Parse HTML content  
parser = html.fromstring(response.content)  

# Extract restaurant listing cards
# (the slice trims non-listing cards such as sponsored results; adjust it
# if Yelp's layout changes)
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]  

Here's where things get interesting. We need to isolate the relevant data—restaurant names, ratings, cuisines—from these HTML elements.

Step 6: Extracting Specific Data

Using XPath, we can target individual pieces of data from each restaurant listing:
Restaurant Name
Restaurant URL
Cuisines
Rating
Here's the XPath to target each:

name_xpath = './/div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()'  
url_xpath = './/div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href'  
cuisine_xpath = './/div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()'  
rating_xpath = './/div[@class="y-css-9tnml4"]/@aria-label'  

We'll use these expressions to extract the relevant data from each restaurant listing. Keep in mind that Yelp's auto-generated class names (such as y-css-ohs7lg) change frequently, so verify these selectors against the current page source before running the scraper.

Step 7: Extracting and Storing Data

Now, let's loop through each restaurant and collect the details:

restaurants_data = []  

for element in elements:  
    names = element.xpath(name_xpath)  
    urls = element.xpath(url_xpath)  
    cuisines = element.xpath(cuisine_xpath)  
    ratings = element.xpath(rating_xpath)  

    # Skip cards that don't contain a business name (e.g. ads or map cards),
    # so one malformed card can't crash the whole run with an IndexError
    if not names:  
        continue  

    restaurant_info = {  
        "name": names[0],  
        "url": urls[0] if urls else None,  
        "cuisines": cuisines,  
        "rating": ratings[0] if ratings else None  
    }  

    restaurants_data.append(restaurant_info)  

This loop goes through every restaurant card, pulls out the necessary details, and stores each one as a dictionary in a list.

Step 8: Saving the Data

Finally, let's save all the scraped data in a JSON file. JSON is a clean, easy-to-read format that's perfect for storing structured data.

import json  

# Save data to JSON file  
with open('yelp_restaurants.json', 'w') as f:  
    json.dump(restaurants_data, f, indent=4)  

print("Data extraction complete. Saved to yelp_restaurants.json")  
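
If you'd rather open the results in a spreadsheet, the same list of dictionaries can be flattened to CSV with the standard library's csv module. Here's a sketch that joins the cuisines list into a single column; the sample record is made up for illustration:

```python
import csv
import io

def to_csv(restaurants_data):
    """Serialize the scraped records to CSV text, one row per restaurant."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "url", "cuisines", "rating"])
    for r in restaurants_data:
        writer.writerow([
            r["name"],
            r["url"],
            "; ".join(r["cuisines"]),  # flatten the list into one cell
            r["rating"],
        ])
    return buffer.getvalue()

# Hypothetical record, just to show the shape of the output
sample = [{"name": "Example Diner", "url": "/biz/example-diner",
           "cuisines": ["Ethiopian", "Vegan"], "rating": "4.5 star rating"}]
csv_text = to_csv(sample)
```

Writing to an io.StringIO buffer keeps the function easy to test; in the real script you'd pass an open file to csv.writer instead.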

Complete Script

Here's a consolidated view of the entire scraping process:

import requests  
from lxml import html  
import json  

url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"  

headers = {  
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',  
    'Accept-Language': 'en-US,en;q=0.5'  
}  

proxies = {  
    'http': 'http://username:password@proxy-server:port',  
    'https': 'https://username:password@proxy-server:port'  
}  

response = requests.get(url, headers=headers, proxies=proxies)  

if response.status_code != 200:  
    print(f"Failed to retrieve page, status code: {response.status_code}")  
    exit()  

parser = html.fromstring(response.content)  
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]  

restaurants_data = []  

for element in elements:  
    names = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')  
    business_urls = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')  
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')  
    ratings = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')  

    # Skip cards that don't contain a business name (e.g. ads or map cards)
    if not names:  
        continue  

    restaurant_info = {  
        "name": names[0],  
        "url": business_urls[0] if business_urls else None,  
        "cuisines": cuisines,  
        "rating": ratings[0] if ratings else None  
    }  

    restaurants_data.append(restaurant_info)  

with open('yelp_restaurants.json', 'w') as f:  
    json.dump(restaurants_data, f, indent=4)  

print("Data extraction complete. Saved to yelp_restaurants.json")  

Final Thoughts

Scraping Yelp is a powerful way to unlock a treasure trove of local business insights. But remember, scraping can be tricky. Always ensure you're respecting Yelp's terms of service. And when scraping at scale, consider using proxies, rotating them regularly, and setting up proper headers to avoid getting blocked.
With these tips and techniques, you are all set to scrape Yelp and extract meaningful data to make your projects shine.
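
The advice above — rotating proxies, proper headers, not hammering the site — can be sketched as a small pagination loop. Yelp's search results appear to paginate via a start offset that increases by roughly 10 per page; treat that as an assumption and confirm it against the live site before relying on it:

```python
import random
import time

BASE_URL = ("https://www.yelp.com/search"
            "?find_desc=restaurants&find_loc=San+Francisco%2C+CA")

def paginated_urls(base_url, pages, page_size=10):
    """Build search URLs for consecutive result pages via a start offset.

    The page_size of 10 is an assumption about Yelp's pagination scheme.
    """
    return [f"{base_url}&start={page * page_size}" for page in range(pages)]

def polite_delay(min_s=2.0, max_s=5.0):
    """Sleep for a randomized interval between requests to avoid bursts."""
    time.sleep(random.uniform(min_s, max_s))

urls = paginated_urls(BASE_URL, pages=3)
# for page_url in urls:
#     response = requests.get(page_url, headers=headers, proxies=proxies)
#     ...parse the page and collect results as in the script above...
#     polite_delay()
```

Randomizing the delay makes the request pattern less regular than a fixed sleep, which, combined with rotating proxies, helps long runs stay under the radar.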

About the Author

SwiftProxy
Martin Koenig
Head of Commercial
Martin Koenig is an accomplished commercial strategist with more than a decade of experience across the technology, telecommunications, and consulting industries. As Head of Commercial, he combines cross-industry expertise with a data-driven approach to identify growth opportunities and deliver measurable business impact.
The content provided on the Swiftproxy blog is for informational purposes only and is presented without any warranty. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it assume responsibility for the content of third-party sites referenced in the blog. Before engaging in any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and to review the applicable terms of service of the target site. In some cases, explicit permission or a scraping license may be required.