
In a world dominated by reviews, Yelp stands as a goldmine of local business insights. If you're looking to tap into data from Yelp—restaurant names, ratings, cuisines, and more—scraping can give you the edge. Here's how you can extract valuable information from Yelp using Python, leveraging libraries like requests and lxml.
Before diving into code, let's make sure you're ready to scrape. First, you'll need Python installed, then set up the necessary libraries:
pip install requests
pip install lxml
These tools will allow you to send HTTP requests, parse HTML content, and extract the data you need from Yelp.
We'll begin by requesting the Yelp search page. This fetches the HTML, which we'll later parse to extract the information we want.
import requests

# Yelp search page URL
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

# Send a GET request to fetch the HTML content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Page fetched successfully!")
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")
Yelp's server will likely block simple requests without proper headers. So, we need to trick it into thinking the request comes from a legitimate browser.
Here's how you can set headers that simulate a real browser request. This reduces the chances of being blocked:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

response = requests.get(url, headers=headers)
Without these headers, your requests might get rejected. It's a simple trick that goes a long way.
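You can take this a step further by rotating the User-Agent on each request, so your traffic doesn't all carry one browser fingerprint. A minimal sketch (the User-Agent strings below are just examples; swap in whatever browsers you like):

import random

# A small example pool of User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

headers = {
    'User-Agent': random.choice(user_agents),  # pick a different UA per request
    'Accept-Language': 'en-US,en;q=0.5'
}
response = requests.get(url, headers=headers)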
If you're scraping a large number of pages, there's a risk your IP will get blocked. The solution? Proxies. By rotating IP addresses, you make it harder for Yelp to detect and block your scraping activity.
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
Using rotating proxies can keep your scraper running smoothly for longer periods.
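As a rough sketch of what rotation might look like in practice (the proxy endpoints below are placeholders; substitute the addresses your provider gives you):

import random
import requests

# Placeholder proxy endpoints; replace with your own provider's addresses
proxy_pool = [
    'http://username:password@proxy1.example.com:8080',
    'http://username:password@proxy2.example.com:8080',
    'http://username:password@proxy3.example.com:8080',
]

def fetch_with_rotation(url, headers):
    """Send the request through a randomly chosen proxy from the pool."""
    proxy = random.choice(proxy_pool)
    return requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy})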
Now, let's extract data from the HTML we've fetched. We'll use lxml to parse the content and pinpoint the specific data we need, like restaurant names, ratings, and URLs.
from lxml import html

# Parse the HTML content
parser = html.fromstring(response.content)

# Extract the restaurant listing cards, trimming the first two and the last
# (these slots are typically occupied by sponsored results rather than organic listings)
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
Here's where things get interesting. We need to isolate the relevant data—restaurant names, ratings, cuisines—from these HTML elements.
Using XPath, we can target individual pieces of data from each restaurant listing:
Restaurant Name
Restaurant URL
Cuisines
Rating
Here's the XPath to target each:
name_xpath = './/div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()'
url_xpath = './/div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href'
cuisine_xpath = './/div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()'
rating_xpath = './/div[@class="y-css-9tnml4"]/@aria-label'
We'll use these expressions to extract the relevant data from each restaurant listing.
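One caveat: the hashed class names (like y-css-ohs7lg) change whenever Yelp ships new styles, so expect to re-inspect the page and update these expressions from time to time. A slightly more resilient option is to anchor on the stable-looking class prefixes with contains(); a sketch, assuming the businessName prefix survives restyles:

# Match on the stable-looking prefix instead of the full hashed class name
name_xpath = './/div[contains(@class, "businessName")]//h3/a/text()'
url_xpath = './/div[contains(@class, "businessName")]//h3/a/@href'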
Now, let's loop through each restaurant and collect the details:
restaurants_data = []

for element in elements:
    # xpath() returns a list, so index into it only when a match was found
    name = element.xpath(name_xpath)
    business_url = element.xpath(url_xpath)  # named to avoid shadowing the search-page url above
    cuisines = element.xpath(cuisine_xpath)
    rating = element.xpath(rating_xpath)

    restaurant_info = {
        "name": name[0] if name else None,
        "url": business_url[0] if business_url else None,
        "cuisines": cuisines,
        "rating": rating[0] if rating else None
    }
    restaurants_data.append(restaurant_info)
This loop walks through every restaurant element, pulls out the details, and stores each restaurant as a dictionary in the list. Indexing into each XPath result only when it's non-empty keeps a single incomplete card from crashing the whole run.
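Each entry ends up looking something like this (the values are invented, purely to show the shape of a record):

{
    "name": "Golden Gate Trattoria",
    "url": "/biz/golden-gate-trattoria-san-francisco",
    "cuisines": ["Italian", "Pizza"],
    "rating": "4.5 star rating"
}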
Finally, let's save all the scraped data in a JSON file. JSON is a clean, easy-to-read format that's perfect for storing structured data.
import json

# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
Here's a consolidated view of the entire scraping process:
import json
import sys

import requests
from lxml import html

url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
if response.status_code != 200:
    sys.exit(f"Failed to retrieve page, status code: {response.status_code}")

parser = html.fromstring(response.content)

# Grab the listing cards, trimming the first two and the last
# (typically sponsored results rather than organic listings)
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

restaurants_data = []
for element in elements:
    # xpath() returns a list, so index into it only when a match was found
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')
    business_url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')

    restaurants_data.append({
        "name": name[0] if name else None,
        "url": business_url[0] if business_url else None,
        "cuisines": cuisines,
        "rating": rating[0] if rating else None
    })

with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete. Saved to yelp_restaurants.json")
Scraping Yelp is a powerful way to unlock a treasure trove of local business insights. But remember, scraping can be tricky. Always ensure you're respecting Yelp's terms of service. And when scraping at scale, consider using proxies, rotating them regularly, and setting up proper headers to avoid getting blocked.
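One more habit worth building in: space out your requests. A short randomized pause between pages makes your traffic look less machine-like and reduces load on the server. A minimal sketch (the 2-5 second range is just a reasonable starting point, not a Yelp-specific rule):

import random
import time

import requests

def polite_get(url, headers, proxies=None):
    """Pause for a random interval before each request to avoid hammering the server."""
    time.sleep(random.uniform(2, 5))  # assumed range; tune to taste
    return requests.get(url, headers=headers, proxies=proxies)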
With these tips and techniques, you are all set to scrape Yelp and extract meaningful data to make your projects shine.