How to Scrape Medium Articles Using Python for Data Analysis

SwiftProxy
By Emily Chan
2025-02-14 15:10:29


When it comes to extracting valuable insights from Medium, scraping articles can be an effective approach. Whether you're analyzing trends, evaluating content quality, or tracking your favorite authors, Python simplifies this process. Let's explore how to scrape data from a Medium article efficiently.

Why Scrape Medium

Medium has a wealth of information—stories, insights, tutorials, and more. But how do you harness this data for research or personal projects? Python offers a simple and effective way to gather content like article titles, author names, publication dates, and the full text body.

Step 1: Setting Up Your Environment

Before we get into the code, let's install the necessary libraries. These will help us send requests, parse HTML, and save data in a readable format:

pip install requests  
pip install lxml  
pip install pandas

These tools are essential for scraping data efficiently. Here's a quick breakdown of their roles:

· Requests: Handles sending HTTP requests to Medium.

· lxml: Parses HTML content from the response.

· Pandas: Helps store and organize the extracted data, saving it into CSV files.

Step 2: Handling Medium's Anti-Scraping Measures

Medium doesn't make scraping easy. It uses bot detection mechanisms to block unauthorized access. To avoid being flagged, you'll need to use headers that mimic a real browser request, and in some cases, proxies to rotate IP addresses.

Crafting the Perfect Headers

Headers are key to simulating a legitimate browser request; Medium inspects them to decide whether traffic is coming from a real browser or a bot. Here's an example of headers you can use:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

These headers help make your request look like it's coming from a regular browser session.
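
As a quick sanity check, you can send a request with these headers and confirm Medium returns a 200 status before building the rest of the scraper. This is a minimal sketch; the homepage URL below is just a placeholder for any public Medium page.

import requests

test_url = 'https://medium.com/'  # placeholder: any public Medium page works for a quick check
resp = requests.get(test_url, headers=headers, timeout=10)
print(resp.status_code)  # 200 means the request got through; 403 usually means it was flagged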

Using Proxies to Avoid Blocking

Proxies are another line of defense. They mask your real IP address, making it harder for Medium to detect multiple requests from the same source. Here's an example of how to set up a proxy:

proxies = {
    'http': 'http://IP:PORT',
    'https': 'http://IP:PORT'
}

Rotate your IP periodically to prevent Medium from blocking you.
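
If you have more than one proxy endpoint, a simple rotation scheme is to cycle through a list and use the next address for each request. The sketch below assumes placeholder addresses; substitute your own proxy details.

from itertools import cycle

# Placeholder proxy endpoints; replace with your own
proxy_pool = cycle([
    'http://IP1:PORT',
    'http://IP2:PORT',
    'http://IP3:PORT',
])

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}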

Step 3: Making Requests to Medium

Now that we've set up headers and proxies, it's time to send a request to the article URL. Here's how you do it:

import requests  

url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'  
response = requests.get(url, headers=headers, proxies=proxies)

This will fetch the article content from Medium.
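
Before parsing, it's worth confirming that the request actually succeeded, since a blocked request often returns an error page instead of the article. A minimal check:

# Raise an exception if Medium returned an error status (e.g. 403 from bot detection)
response.raise_for_status()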

Step 4: Extracting Article Data

Once the response comes back, we can extract key details. Using lxml, we'll parse the HTML and grab elements like the article title, author, and content.

from lxml.html import fromstring  

parser = fromstring(response.text)

# Extracting data  
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]  
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]  
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]  
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]  
content = '\n'.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))  # class names are auto-generated and specific to this page
auth_followers = parser.xpath('//span[@class="pw-follower-count bf b bg z bk"]/a/text()')[0]  # likewise page-specific
sub_title = parser.xpath('//h2[@id="1de6"]/text()')[0]  # this id is specific to this article's subtitle

These XPath queries target specific elements in the HTML and extract the necessary data.
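
Keep in mind that indexing with [0] raises an IndexError when an element is missing, and the auto-generated class names used for the content and follower count tend to change between pages and over time. A small helper that falls back to a default value, sketched below, makes the extraction more forgiving; the XPath expressions are the same ones used above.

def first_or_default(tree, xpath_expr, default=''):
    """Return the first XPath match, or a default value if nothing matches."""
    matches = tree.xpath(xpath_expr)
    return matches[0] if matches else default

title = first_or_default(parser, '//h1[@data-testid="storyTitle"]/text()')
author = first_or_default(parser, '//a[@data-testid="authorName"]/text()')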

Step 5: Storing Data in a Dictionary

Once we have all the extracted content, let's organize it into a neat dictionary. This makes it easy to save the data into a CSV file for future use.

# Store data in a dictionary  
article_data = {  
    'Title': title,  
    'Author': author,  
    'Publication': publication_name,  
    'Date': publication_date,  
    'Followers': auth_followers,  
    'Subtitle': sub_title,  
    'Content': content,  
}

Step 6: Saving to CSV

With the data ready, let's save it in a CSV file using Pandas. This step is crucial for anyone who wants to analyze the data later.

import pandas as pd

# Save data to CSV  
df = pd.DataFrame([article_data])  
df.to_csv('medium_article_data.csv', index=False)  
print("Data saved to medium_article_data.csv")

The Complete Example

Here's the entire process wrapped up in one script:

import requests  
from lxml.html import fromstring  
import pandas as pd  

# Headers to simulate a real browser request  
headers = {  
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',  
    'accept-language': 'en-IN,en;q=0.9',  
    'cache-control': 'no-cache',  
    'dnt': '1',  
    'pragma': 'no-cache',  
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',  
}  

# Proxies to rotate IP addresses  
proxies = {  
    'http': 'http://IP:PORT',  
    'https': 'http://IP:PORT'  
}

# Send the request  
url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'  
response = requests.get(url, headers=headers, proxies=proxies)

# Parse the page  
parser = fromstring(response.text)

# Extract article data  
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]  
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]  
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]  
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]  
content = '\n'.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))  # class names are auto-generated and specific to this page
auth_followers = parser.xpath('//span[@class="pw-follower-count bf b bg z bk"]/a/text()')[0]  # likewise page-specific
sub_title = parser.xpath('//h2[@id="1de6"]/text()')[0]  # this id is specific to this article's subtitle

# Save the extracted data to CSV  
article_data = {  
    'Title': title,  
    'Author': author,  
    'Publication': publication_name,  
    'Date': publication_date,  
    'Followers': auth_followers,  
    'Subtitle': sub_title,  
    'Content': content,  
}

df = pd.DataFrame([article_data])  
df.to_csv('medium_article_data.csv', index=False)  
print("Data saved to medium_article_data.csv")

Respecting Medium's Terms

As you scrape Medium, remember that scraping can violate a website's terms of service if not done responsibly. Always check Medium's robots.txt and respect their guidelines. Avoid overwhelming their servers with too many requests, and ensure you're scraping ethically.
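
One easy way to keep your request rate polite is to pause between requests. A minimal sketch; the five-second delay is only illustrative, and article_urls is your own list of URLs.

import time

for article_url in article_urls:
    response = requests.get(article_url, headers=headers, proxies=proxies)
    # ... parse and store the article as shown above ...
    time.sleep(5)  # illustrative delay; adjust to stay well within polite limits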

Conclusion

Now you're ready to start scraping Medium articles using Python for your data analysis or research. Go ahead and try it out, and remember to respect the boundaries of responsible web scraping.

About the author

Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, bringing over a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.