Mastering Scraping Google News with Python

SwiftProxy
By Linh Tran
2025-01-09 14:38:47


Every second, thousands of news stories are published online. The ability to collect and analyze these stories can provide invaluable insights into current events, market trends, and even public sentiment. Scraping Google News offers a powerful way to tap into this vast stream of information—and with Python, it's easier than ever.
In this guide, we'll show you how to scrape the latest news from Google News using Python. By the end of this article, you'll have the tools to fetch headlines, extract valuable links, and store everything in a structured JSON format, ready for analysis.

Step 1: Preparing Your Environment

Before diving into code, let's make sure you're set up. First, ensure you have Python installed on your machine. Then, you'll need two libraries: requests for making HTTP requests, and lxml for parsing HTML.
Install them by running the following commands:

pip install requests  
pip install lxml  

These tools will allow us to send requests to Google News, fetch the content, and parse the HTML to extract headlines.
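To confirm both installs worked, you can run a quick one-liner that should print the two version numbers:

python -c "import requests, lxml.etree; print(requests.__version__, lxml.etree.__version__)"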

Step 2: Understanding the Google News Structure

Google News doesn't just give you a static list of headlines. It organizes them in a dynamic layout, which our scraper needs to navigate. Here's a brief rundown of the elements we'll target:
Main News Articles: The primary headlines.
Related Articles: News that’s contextually connected to the main story.
We'll use XPath expressions to locate these elements within the HTML. Here are the key ones:
Main News: //c-wiz[@jsrenderer="ARwRbe"]
Main Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
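Keep in mind that these jsrenderer attribute values are tied to Google News's current markup and can change without notice. If you want to sanity-check the expressions before building the full scraper, you can run them against a locally saved copy of the page. Here is a minimal sketch (google_news.html is a hypothetical filename for a page you've saved yourself):

from lxml import html

# Parse a locally saved copy of the Google News page
with open("google_news.html", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

# Count how many main news containers the XPath matches
matches = tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
print(f"Matched {len(matches)} main news containers")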

Step 3: Gathering Google News Content

Now, let's retrieve the Google News content. We’ll use requests.get to make an HTTP request and pull the content of the page. Here’s how to do it:

import requests  

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"  
response = requests.get(url)  

if response.status_code == 200:  
    page_content = response.content  
else:  
    print(f"Failed to retrieve the page. Status code: {response.status_code}")  

This sends a GET request to Google News and stores the page's content in page_content. If the request fails, it will let you know with the status code.
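In practice, it's also worth adding a timeout and letting requests raise on HTTP errors instead of checking the status code by hand. Here's a slightly hardened variant of the same request (the 10-second timeout is an arbitrary choice):

import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

# timeout prevents the request from hanging indefinitely;
# raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx responses
response = requests.get(url, timeout=10)
response.raise_for_status()
page_content = response.content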

Step 4: Parsing the HTML with lxml

Once we have the page's HTML, we’ll use lxml to parse it. lxml allows us to easily navigate the HTML structure and extract the elements we need.

from lxml import html  

# Parse the HTML content  
parser = html.fromstring(page_content)  
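To confirm the parse succeeded, you can print the page title from the parsed tree as a quick sanity check:

print(parser.xpath('//title/text()'))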

Step 5: Extracting News Information

Now for the fun part—extracting the news. We'll start by targeting the main news container and iterating over the first 10 news items to grab the titles and links.

main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')  
news_data = []  

for element in main_news_elements[:10]:  
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')

    # Ensure both pieces of data exist before indexing and appending
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            "main_link": "https://news.google.com" + links[0][1:],
        })

This extracts the titles and links of the first 10 main headlines. The xpath() method returns a list of matches, so we check that it's non-empty before taking the first result with [0]; the [1:] slice strips the leading dot from the page's relative hrefs before prepending the domain.
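At this point, news_data is a list of dictionaries, one per headline. Its shape looks roughly like this (illustrative placeholder values, not real output):

[
    {
        "main_title": "Example headline",
        "main_link": "https://news.google.com/..."
    },
    ...
]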

Extracting Related Articles

Inside each main news item, there's a list of related articles. We'll extract those next:

# Pair each stored entry with its source element
# (assumes no items were skipped in the previous loop)
for entry, element in zip(news_data, main_news_elements[:10]):
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    # Attach the related articles to the entry built in the previous loop
    entry["related_articles"] = related_articles

This pairs each entry built in the previous loop with its source element, extracts that element's related articles, and attaches them to the entry. (Appending new dictionaries here instead would duplicate the main stories.)
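As an aside, the manual slicing ("https://news.google.com" + href[1:]) works because the hrefs on the page are relative paths beginning with a dot. The standard library's urllib.parse.urljoin handles this resolution more robustly:

from urllib.parse import urljoin

# urljoin resolves a relative href against the base URL
relative_href = "./read/abc"  # hypothetical href taken from an <a> tag
print(urljoin("https://news.google.com/", relative_href))
# -> https://news.google.com/read/abc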

Step 6: Storing the Data

Now that we've scraped the data, it's time to save it in a structured format. JSON is perfect for this, so let's store the data in a file.

import json  

with open('google_news_data.json', 'w') as f:  
    json.dump(news_data, f, indent=4)  

This writes all the scraped news data into a JSON file called google_news_data.json. You can easily load and analyze this data later.
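Loading the file back later for analysis takes only a couple of lines:

import json

with open('google_news_data.json') as f:
    news_data = json.load(f)

print(f"Loaded {len(news_data)} main stories")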

Bonus: Using Proxies and Custom Headers

When scraping large amounts of data, especially from high-traffic sites like Google News, you may encounter rate limits or IP blocks. To avoid this, use proxies.
Here's how you can configure a proxy:

proxies = {  
    "http": "http://your_proxy_ip:port",  
    "https": "https://your_proxy_ip:port",  
}  
response = requests.get(url, proxies=proxies)  

Additionally, some sites block requests that look like they're coming from a bot. You can set custom headers to simulate a browser request:

headers = {  
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',  
}  

response = requests.get(url, headers=headers)  

This makes your requests look more like they're coming from a regular browser.
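Putting the two together, a requests.Session lets you set the headers and proxies once and reuse them across every request. A sketch (swap in your own proxy address):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
})
session.proxies.update({
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
})

response = session.get(url, timeout=10)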

Full Code Example

Here's the complete code with all steps integrated:

import requests  
from lxml import html  
import json  

# URL to scrape  
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"  
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}  
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}  

# Fetch the page content  
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)  
if response.status_code != 200:  
    print(f"Failed to retrieve the page. Status code: {response.status_code}")  
    exit()  

# Parse the HTML content  
parser = html.fromstring(response.content)  

# Extract main news and related articles  
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')  
news_data = []  

for element in main_news_elements[:10]:  
    titles = element.xpath('.//c-wiz/div/article/a/text()')  
    links = element.xpath('.//c-wiz/div/article/a/@href')  
    if not (titles and links):  
        continue  # skip items that don't match the expected layout  

    related_articles = []  
    for related_element in element.xpath('.//c-wiz/div/div/article'):  
        related_titles = related_element.xpath('.//a/text()')  
        related_links = related_element.xpath('.//a/@href')  
        if related_titles and related_links:  
            related_articles.append({  
                "title": related_titles[0],  
                "link": "https://news.google.com" + related_links[0][1:],  
            })  

    news_data.append({  
        "main_title": titles[0],  
        "main_link": "https://news.google.com" + links[0][1:],  
        "related_articles": related_articles  
    })  

# Save the data to a JSON file  
with open("google_news_data.json", "w") as json_file:  
    json.dump(news_data, json_file, indent=4)  

print("Data extraction complete. Saved to google_news_data.json")  

Conclusion

Scraping Google News with Python is a straightforward process that opens up a wealth of opportunities for real-time data analysis. Whether you're tracking breaking news, monitoring trends, or diving deep into sentiment analysis, this tutorial equips you with the tools you need. By implementing proxies and custom headers, you can scrape efficiently and avoid common pitfalls like rate limiting and IP blocking.

About the Author

SwiftProxy
Linh Tran
Linh Tran is a Hong Kong-based technical writer with a background in computer science and more than eight years of experience in digital infrastructure. At Swiftproxy, she specializes in demystifying complex proxy technologies, delivering clear, actionable insights for businesses navigating the fast-evolving data landscape in Asia and beyond.
Senior Technology Analyst at Swiftproxy
The content provided on the Swiftproxy blog is for informational purposes only and is presented without any warranty. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it accept responsibility for the content of third-party sites referenced in the blog. Before engaging in any web scraping or automated data collection activity, readers are strongly advised to consult qualified legal counsel and review the target site's applicable terms of service. In some cases, explicit authorization or a scraping license may be required.