
Imagine extracting hundreds of hotel details (prices, ratings, descriptions, and more) with just a few lines of Python code. Whether you're a developer, a data analyst, or a business looking to gather insights, scraping Booking.com can surface a lot of useful information.
In this article, we'll walk you through how to scrape Booking.com data, including names, locations, ratings, prices, and more. We'll be using Python's powerful libraries to extract JSON data embedded in hotel pages and save it in a structured CSV file for analysis.
Before diving into scraping, make sure you have the libraries below. Two of them (Requests and LXML) are third-party packages that must be installed; the other two ship with Python.
· Requests: Used to send HTTP requests to Booking.com and fetch HTML data.
· LXML: Allows us to parse HTML content and extract data using XPath.
· JSON: A built-in Python module to handle structured JSON data.
· CSV: Built-in module for saving the data into a CSV file.
Here's how to install the required libraries:
pip install requests lxml
Now, you're ready to scrape.
To scrape data effectively, it's crucial to understand the page structure and how the data is stored. Each Booking.com hotel page embeds structured data in JSON-LD format, inside a script tag of type application/ld+json. This JSON data contains the details we need: hotel names, pricing, locations, and more.
So, we'll be targeting that data format for our extraction process.
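To make the target concrete, here is a minimal sketch of what such an embedded block can look like and how it decodes into a Python dictionary. The HTML fragment and all of its values are invented for illustration and are not taken from a real Booking.com page. The sketch also uses the standard library's html.parser rather than LXML, purely so that it runs without any installs.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment; a real hotel page is far larger
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Hotel", "name": "Example Hotel", "priceRange": "From $100",
 "aggregateRating": {"ratingValue": 8.7, "reviewCount": 1234}}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the open block
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
hotel = json.loads(collector.blocks[0])
print(hotel["name"], hotel["aggregateRating"]["ratingValue"])
```

Once decoded, the block is an ordinary nested dictionary, which is why the rest of the workflow is mostly key lookups.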
Booking.com is no stranger to anti-scraping measures. To keep things smooth and avoid getting blocked, we must mimic a legitimate user session. This is where custom headers come into play. Plus, proxies help prevent detection by distributing requests across multiple IP addresses.
Here's the code for sending an HTTP request with headers:
import requests
from lxml.html import fromstring

urls = ["https://www.booking.com/hotel/xyz"]

# A browser-like User-Agent makes the request look like a normal visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

for url in urls:
    response = requests.get(url, headers=headers)
When scraping Booking.com, you must deal with rate limits and IP tracking. Routing requests through proxies helps with both. You can use free proxies or opt for a paid service that offers IP address authentication. Below is how you can use proxies to send requests:
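# Keys are the scheme of the target URL; values are the proxy's own URL.
# Most HTTP proxies are reached over http:// even for HTTPS targets.
proxies = {
    'http': 'http://your_proxy',
    'https': 'http://your_proxy',
}
response = requests.get(url, headers=headers, proxies=proxies)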
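Distributing requests across multiple IP addresses usually means picking a different proxy for each request. One simple way to do that is a round-robin rotation, sketched below. The proxy addresses are placeholders and proxy_rotation is a hypothetical helper name, not part of Requests.

```python
from itertools import cycle

# Placeholder proxy addresses; replace with proxies you actually have access to
PROXY_URLS = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_rotation(proxy_urls):
    """Yield a requests-style proxies dict for each proxy, round-robin, forever."""
    for proxy_url in cycle(proxy_urls):
        # The same proxy is used for both plain HTTP and HTTPS targets
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_URLS)
first = next(rotation)
second = next(rotation)
print(first["https"])
print(second["https"])
```

Each request can then take the next entry, for example requests.get(url, headers=headers, proxies=next(rotation)).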
Once the request is sent, it's time to parse the HTML and locate the embedded JSON-LD script that holds the valuable hotel data. We’ll use XPath to extract it.
import json

parser = fromstring(response.text)
# Take the first JSON-LD script block on the page and decode it into a dict
json_data = json.loads(parser.xpath('//script[@type="application/ld+json"]/text()')[0])
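Indexing with [0] assumes that at least one JSON-LD block exists and that the first one is the hotel record. A more defensive sketch scans every block and skips any that fail to decode. It assumes the hotel record is marked with "@type": "Hotel", which is worth verifying on the pages you target. The name find_hotel_block is hypothetical, and raw_blocks stands in for the list of strings returned by the XPath query.

```python
import json


def find_hotel_block(raw_blocks, wanted_type="Hotel"):
    """Return the first decoded JSON-LD dict whose @type matches, else None."""
    for raw in raw_blocks:
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        if isinstance(candidate, dict) and candidate.get("@type") == wanted_type:
            return candidate
    return None


# Invented sample input standing in for the XPath result
raw_blocks = [
    '{"@type": "BreadcrumbList"}',
    'not valid json',
    '{"@type": "Hotel", "name": "Example Hotel"}',
]
hotel = find_hotel_block(raw_blocks)
print(hotel)
```

Returning None instead of raising lets the calling loop log the URL and move on to the next hotel.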
Now that we have the JSON data, it's time to extract specific details like hotel name, location, price range, and more.
Here's how to pull out some of the most critical data points:
name = json_data['name']
location = json_data['hasMap']  # URL of a map showing where the hotel is
price_range = json_data['priceRange']
rating = json_data['aggregateRating']['ratingValue']
review_count = json_data['aggregateRating']['reviewCount']
address = json_data['address']['streetAddress']
hotel_url = json_data['url']  # named hotel_url so it does not overwrite the loop variable url
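Direct key lookups raise a KeyError as soon as one field is absent, for instance on a listing that has no reviews yet. A tolerant variant, sketched here with a hypothetical helper named extract_fields, falls back to None for anything missing. The sample dictionary is invented.

```python
def extract_fields(json_data):
    """Flatten a decoded JSON-LD dict into one row, using None for missing keys."""
    # 'or {}' also covers the case where the key exists but holds None
    rating = json_data.get("aggregateRating") or {}
    address = json_data.get("address") or {}
    return {
        "Name": json_data.get("name"),
        "Location": json_data.get("hasMap"),
        "Price Range": json_data.get("priceRange"),
        "Rating": rating.get("ratingValue"),
        "Review Count": rating.get("reviewCount"),
        "Address": address.get("streetAddress"),
        "URL": json_data.get("url"),
    }


# Invented record with no aggregateRating and no hasMap
sample = {
    "name": "Example Hotel",
    "priceRange": "From $100",
    "address": {"streetAddress": "1 Example Street"},
    "url": "https://www.booking.com/hotel/xyz",
}
row = extract_fields(sample)
print(row)
```

The returned keys match the CSV column names used later, so each row can be appended to the results list as is.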
Finally, after scraping all the data, let's save it into a CSV file for easy analysis.
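import csv

# all_data is a list of dicts, one per hotel, whose keys match fieldnames
fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address", "URL"]
# utf-8 keeps hotel names with non-ASCII characters intact on every platform
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)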
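To see what the writer produces without touching the disk, the same calls can be pointed at an in-memory buffer. This is only an illustrative sketch: the row is invented, and restval="" is an optional DictWriter argument that fills in columns a row does not provide.

```python
import csv
import io

fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address", "URL"]

# Invented row that supplies only some of the columns
all_data = [
    {"Name": "Example Hotel", "Price Range": "From $100", "Rating": 8.7},
]

buffer = io.StringIO()
# restval is written for any fieldname a row dict does not contain
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(all_data)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

The first line printed is the header, followed by one comma-separated line per hotel, with empty cells where a value was missing.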
Below is the complete code for your convenience. Copy and run it to start scraping.
import requests
from lxml.html import fromstring
import json
import csv

urls = ["https://www.booking.com/hotel/xyz"]
all_data = []

# A browser-like User-Agent makes the request look like a normal visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

for url in urls:
    response = requests.get(url, headers=headers)
    parser = fromstring(response.text)
    # Decode the first JSON-LD script block on the page
    json_data = json.loads(parser.xpath('//script[@type="application/ld+json"]/text()')[0])
    data = {
        "Name": json_data['name'],
        "Location": json_data['hasMap'],
        "Price Range": json_data['priceRange'],
        "Rating": json_data['aggregateRating']['ratingValue'],
        "Review Count": json_data['aggregateRating']['reviewCount'],
        "Address": json_data['address']['streetAddress'],
        "URL": json_data['url']
    }
    all_data.append(data)

# Save to CSV
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as file:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address", "URL"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)

print("Data successfully saved to booking_data.csv")
In this guide, we explored how to scrape hotel data from Booking.com using Python. We covered installing the required libraries, setting request headers and proxies to reduce the chance of being blocked, extracting the embedded JSON-LD data, and saving the results to a CSV file.
By using this method, you can gather critical insights about hotel listings, helping you make data-driven decisions whether you're analyzing market trends, creating a travel website, or simply automating data collection for business needs.