
Every second, thousands of news stories are published online. The ability to collect and analyze these stories can provide invaluable insights into current events, market trends, and even public sentiment. Scraping Google News offers a powerful way to tap into this vast stream of information—and with Python, it's easier than ever.
In this guide, we'll show you how to scrape the latest news from Google News using Python. By the end of this article, you'll have the tools to fetch headlines, extract valuable links, and store everything in a structured JSON format, ready for analysis.
Before diving into code, let's make sure you're set up. First, ensure you have Python installed on your machine. Then, you'll need two libraries: requests for making HTTP requests, and lxml for parsing HTML.
Install them by running the following commands:
pip install requests
pip install lxml
These tools will allow us to send requests to Google News, fetch the content, and parse the HTML to extract headlines.
Google News doesn't just give you a flat list of headlines. It groups each main story together with related coverage in nested blocks, and that structure is what our scraper needs to navigate. Here's a brief rundown of the elements we'll target:
Main News Articles: The primary headlines.
Related Articles: News that’s contextually connected to the main story.
We'll use XPath expressions to locate these elements within the HTML. Here are the key ones:
Main News: //c-wiz[@jsrenderer="ARwRbe"]
Main Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
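If you want to sanity-check these expressions before writing the full scraper, a quick sketch like the following helps. It assumes you've saved a local copy of the page as page.html (a hypothetical file name):

from lxml import html

# Quick sanity check: count how many main news blocks the XPath matches
# (page.html is a hypothetical local copy of the Google News page)
with open("page.html", "r", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

print(len(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]')))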
Now, let's retrieve the Google News content. We’ll use requests.get to make an HTTP request and pull the content of the page. Here’s how to do it:
import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
response = requests.get(url)
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
This sends a GET request to Google News and stores the page's content in page_content. If the request fails, it will let you know with the status code.
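If you prefer exceptions over manual status checks, here's a slightly more defensive sketch (the 10-second timeout is an arbitrary choice, not a requirement):

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    page_content = response.content
except requests.RequestException as e:
    print(f"Failed to retrieve the page: {e}")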
Once we have the page's HTML, we'll use lxml to parse it. lxml allows us to easily navigate the HTML structure and extract the elements we need.
from lxml import html
# Parse the HTML content
parser = html.fromstring(page_content)
Now for the fun part—extracting the news. We'll start by targeting the main news container and iterating over the first 10 news items to grab the titles and links.
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
news_data = []
for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    # Ensure data exists before appending (xpath returns an empty list on no match)
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            # Hrefs are relative ("./..."), so strip the dot and prepend the domain
            "main_link": "https://news.google.com" + links[0][1:],
        })
This extracts the main headline titles and links from the first 10 items. The xpath() method always returns a list of matches, so we check that the list is non-empty before grabbing the first result with [0].
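To see this list behavior in isolation, here's a tiny self-contained example that uses an inline HTML snippet rather than the live page:

from lxml import html

# xpath() always returns a list, even for a single match
doc = html.fromstring('<article><a href="./read/abc">Headline</a></article>')
print(doc.xpath('//a/text()'))   # ['Headline']
print(doc.xpath('//a/@href'))    # ['./read/abc']
print(doc.xpath('//h1/text()'))  # [] (no match, so indexing [0] would raise IndexError)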
Inside each main news item, there's a list of related articles. We'll fold those into the same pass. Since this version replaces the simpler loop above, we reset news_data and rebuild each entry with its related articles attached:
# This pass replaces the simpler loop above, so start from an empty list
news_data = []
for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')
    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })
    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles,
    })
This code loops through each main article and extracts the related articles as well.
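To confirm the nesting looks right, you can print a short preview of the structure (a sketch, assuming news_data was built as above):

# Preview the first couple of stories and their related articles
for story in news_data[:2]:
    print(story["main_title"])
    for related in story["related_articles"]:
        print("  ->", related["title"])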
Now that we've scraped the data, it's time to save it in a structured format. JSON is perfect for this, so let's store the data in a file.
import json

with open('google_news_data.json', 'w') as f:
    json.dump(news_data, f, indent=4)
This writes all the scraped news data into a JSON file called google_news_data.json. You can easily load and analyze this data later.
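Loading it back later is just as simple; for example:

import json

# Reload the saved data for analysis
with open("google_news_data.json", "r") as f:
    saved_news = json.load(f)

print(f"Loaded {len(saved_news)} stories")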
When scraping large amounts of data, especially from high-traffic sites like Google News, you may encounter rate limits or IP blocks. To avoid this, use proxies.
Here's how you can configure a proxy:
proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
}
response = requests.get(url, proxies=proxies)
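If your proxy provider requires authentication, requests also accepts credentials embedded in the proxy URL. A sketch, where all values are placeholders:

# Proxy with basic auth (username, password, host, and port are placeholders)
proxies = {
    "http": "http://username:password@your_proxy_ip:port",
    "https": "https://username:password@your_proxy_ip:port",
}
response = requests.get(url, proxies=proxies)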
Additionally, some sites block requests that look like they're coming from a bot. You can set custom headers to simulate a browser request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
response = requests.get(url, headers=headers)
This makes your requests look more like they're coming from a regular browser.
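If you're making many requests, it can also help to rotate User-Agent strings and pause between requests. A sketch (the list of agents is illustrative, not exhaustive):

import random
import time

# Rotate between a few browser User-Agent strings (illustrative list)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15',
]
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)
time.sleep(random.uniform(1, 3))  # polite delay before the next request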
Here's the complete code with all steps integrated:
import requests
from lxml import html
import json

# URL to scrape
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
# Replace the placeholders with real proxy details, or drop the proxies
# argument from requests.get below to connect directly
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}

# Fetch the page content
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# Parse the HTML content
parser = html.fromstring(response.content)

# Extract main news and related articles
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
news_data = []
for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue

    related_articles = []
    for related_element in element.xpath('.//c-wiz/div/div/article'):
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles,
    })

# Save the data to a JSON file
with open("google_news_data.json", "w") as json_file:
    json.dump(news_data, json_file, indent=4)

print("Data extraction complete. Saved to google_news_data.json")
Scraping Google News with Python is a straightforward process that opens up a wealth of opportunities for real-time data analysis. Whether you're tracking breaking news, monitoring trends, or diving deep into sentiment analysis, this tutorial equips you with the tools you need. By implementing proxies and custom headers, you can scrape efficiently and avoid common pitfalls like rate limiting and IP blocking.