Mastering Scraping Google News with Python

SwiftProxy
By Linh Tran
2025-01-09 14:38:47

Every second, thousands of news stories are published online. The ability to collect and analyze these stories can provide invaluable insights into current events, market trends, and even public sentiment. Scraping Google News offers a powerful way to tap into this vast stream of information—and with Python, it's easier than ever.
In this guide, we'll show you how to scrape the latest news from Google News using Python. By the end of this article, you'll have the tools to fetch headlines, extract valuable links, and store everything in a structured JSON format, ready for analysis.

Step 1: Preparing Your Environment

Before diving into code, let's make sure you're set up. First, ensure you have Python installed on your machine. Then, you'll need two libraries: requests for making HTTP requests, and lxml for parsing HTML.
Install them by running the following commands:

pip install requests  
pip install lxml  

These tools will allow us to send requests to Google News, fetch the content, and parse the HTML to extract headlines.
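
If you prefer to keep project dependencies isolated, you can optionally create a virtual environment first; this is standard Python practice rather than anything specific to scraping:

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install requests lxml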

Step 2: Understanding the Google News Structure

Google News doesn't just give you a static list of headlines. It organizes them in a dynamic, nested layout, which our scraper needs to navigate. Here's a brief rundown of the elements we'll target:
Main News Articles: The primary headlines.
Related Articles: News that’s contextually connected to the main story.
We'll use XPath expressions to locate these elements within the HTML. Here are the key ones:
Main News: //c-wiz[@jsrenderer="ARwRbe"]
Main Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
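
One caveat: these selectors depend on markup that Google generates automatically, so attribute values like jsrenderer="ARwRbe" can change without notice. Before building on them, a quick sanity check such as the sketch below, which simply counts how many elements each expression matches, can save debugging time later:

import requests
from lxml import html

url = "https://news.google.com/?hl=en-US&gl=US&ceid=US%3Aen"  # the Google News front page; any news listing works
tree = html.fromstring(requests.get(url, timeout=30).content)
print("Main news containers:", len(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]')))
print("Related articles:", len(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article')))

If both counts come back as zero, the layout has likely changed and the expressions need to be re-derived from the page source.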

Step 3: Gathering Google News Content

Now, let's retrieve the Google News content. We’ll use requests.get to make an HTTP request and pull the content of the page. Here’s how to do it:

import requests  

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"  
response = requests.get(url, timeout=30)  # a timeout prevents the request from hanging indefinitely

if response.status_code == 200:  
    page_content = response.content  
else:  
    print(f"Failed to retrieve the page. Status code: {response.status_code}")  

This sends a GET request to Google News and stores the page's content in page_content. If the request fails, it will let you know with the status code.
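
If you'd rather have a failed request raise an exception instead of checking the status code by hand, requests also provides raise_for_status():

import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
response = requests.get(url, timeout=30)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
page_content = response.content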

Step 4: Parsing the HTML with lxml

Once we have the page's HTML, we’ll use lxml to parse it. lxml allows us to easily navigate the HTML structure and extract the elements we need.

from lxml import html  

# Parse the HTML content  
parser = html.fromstring(page_content)  

Step 5: Extracting News Information

Now for the fun part—extracting the news. We'll start by targeting the main news container and iterating over the first 10 news items to grab the titles and links.

main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')  
news_data = []  

for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')

    # Ensure data exists before appending
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            # hrefs are relative ("./..."), so drop the leading "." before joining
            "main_link": "https://news.google.com" + links[0][1:],
        })

This extracts the main headline titles and links from the first 10 items. The xpath() method returns a list of matches, so we check that the list is non-empty before taking the first result with [0]; indexing an empty list would otherwise crash the script.
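
At this point it's worth spot-checking what we've collected; for example, by pretty-printing the first few entries:

import json

print(json.dumps(news_data[:3], indent=2, ensure_ascii=False))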

Extracting Related Articles

Inside each main news item, there's a list of related articles. Let's extend the loop from Step 5 so it also captures those, rebuilding news_data in a single pass; re-extracting the title and link inside the loop keeps each entry paired with its own related articles:

news_data = []

for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue

    # Collect the related articles nested inside this main item
    related_articles = []
    for related_element in element.xpath('.//c-wiz/div/div/article'):
        related_title = related_element.xpath('.//a/text()')
        related_link = related_element.xpath('.//a/@href')
        if related_title and related_link:
            related_articles.append({
                "title": related_title[0],
                "link": "https://news.google.com" + related_link[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles
    })

This single pass extracts each main article together with its related articles, so every entry in news_data is complete and correctly paired.

Step 6: Storing the Data

Now that we've scraped the data, it's time to save it in a structured format. JSON is perfect for this, so let's store the data in a file.

import json  

with open('google_news_data.json', 'w') as f:  
    json.dump(news_data, f, indent=4)  

This writes all the scraped news data into a JSON file called google_news_data.json. You can easily load and analyze this data later.
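
Loading the data back later is just as simple; for example, to reload the file and list the headlines:

import json

with open('google_news_data.json') as f:
    news_data = json.load(f)

for item in news_data:
    print(item["main_title"])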

Bonus: Using Proxies and Custom Headers

When scraping large amounts of data, especially from high-traffic sites like Google News, you may encounter rate limits or IP blocks. To avoid this, use proxies.
Here's how you can configure a proxy:

proxies = {  
    "http": "http://your_proxy_ip:port",  
    "https": "https://your_proxy_ip:port",  
}  
response = requests.get(url, proxies=proxies)  
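
If your proxy requires authentication, requests also accepts credentials embedded in the proxy URL in the standard user:password@host form (the values below are placeholders):

proxies = {
    "http": "http://your_username:your_password@your_proxy_ip:port",
    "https": "https://your_username:your_password@your_proxy_ip:port",
}
response = requests.get(url, proxies=proxies)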

Additionally, some sites block requests that look like they're coming from a bot. You can set custom headers to simulate a browser request:

headers = {  
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',  
}  

response = requests.get(url, headers=headers)  

This makes your requests look more like they're coming from a regular browser.
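
To make a long-running scraper harder to fingerprint, a common refinement is to rotate through a small pool of User-Agent strings rather than reusing one; a minimal sketch (the strings here are only examples) looks like this:

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)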

Full Code Example

Here's the complete code with all steps integrated:

import requests  
from lxml import html  
import json  

# URL to scrape  
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"  
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}  
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}  

# Fetch the page content  
response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
if response.status_code != 200:
    raise SystemExit(f"Failed to retrieve the page. Status code: {response.status_code}")

# Parse the HTML content  
parser = html.fromstring(response.content)  

# Extract main news and related articles  
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')  
news_data = []  

for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue

    # Collect the related articles nested inside this main item
    related_articles = []
    for related_element in element.xpath('.//c-wiz/div/div/article'):
        related_title = related_element.xpath('.//a/text()')
        related_link = related_element.xpath('.//a/@href')
        if related_title and related_link:
            related_articles.append({
                "title": related_title[0],
                "link": "https://news.google.com" + related_link[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles
    })

# Save the data to a JSON file  
with open("google_news_data.json", "w") as json_file:  
    json.dump(news_data, json_file, indent=4)  

print("Data extraction complete. Saved to google_news_data.json")  
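
For reference, each entry in the resulting JSON file has the following shape (the values here are placeholders, not real output):

[
    {
        "main_title": "Example headline",
        "main_link": "https://news.google.com/articles/...",
        "related_articles": [
            {
                "title": "Example related headline",
                "link": "https://news.google.com/articles/..."
            }
        ]
    }
]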

Conclusion

Scraping Google News with Python is a straightforward process that opens up a wealth of opportunities for real-time data analysis. Whether you're tracking breaking news, monitoring trends, or diving deep into sentiment analysis, this tutorial equips you with the tools you need. By implementing proxies and custom headers, you can scrape efficiently and avoid common pitfalls like rate limiting and IP blocking.

About the author

Linh Tran
Senior Technology Analyst at Swiftproxy
Linh Tran is a Hong Kong-based technology writer with a background in computer science and over eight years of experience in the digital infrastructure space. At Swiftproxy, she specializes in making complex proxy technologies accessible, offering clear, actionable insights for businesses navigating the fast-evolving data landscape across Asia and beyond.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.