How to Scrape Medium Articles Using Python for Data Analysis

SwiftProxy
By Emily Chan
2025-02-14 15:10:29

When it comes to extracting valuable insights from Medium, scraping articles can be an effective approach. Whether you're analyzing trends, evaluating content quality, or tracking your favorite authors, Python simplifies this process. Let's explore how to scrape data from a Medium article efficiently.

Why Scrape Medium

Medium has a wealth of information—stories, insights, tutorials, and more. But how do you harness this data for research or personal projects? Python offers a simple and effective way to gather content like article titles, author names, publication dates, and the full text body.

Step 1: Setting Up Your Environment

Before we get into the code, let's install the necessary libraries. These will help us send requests, parse HTML, and save data in a readable format:

pip install requests  
pip install lxml  
pip install pandas

These tools are essential for scraping data efficiently. Here's a quick breakdown of their roles:

· requests: sends HTTP requests to Medium.

· lxml: parses the HTML returned in the response.

· pandas: organizes the extracted data and saves it to CSV files.
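A quick way to confirm the installs succeeded is to import all three libraries and print their versions. This is just a sanity check, not part of the scraper itself:

import requests
import lxml.etree
import pandas as pd

# If any import fails, re-run the corresponding pip install above.
print(requests.__version__, lxml.etree.__version__, pd.__version__)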

Step 2: Handling Medium's Anti-Scraping Measures

Medium doesn't make scraping easy. It uses bot detection mechanisms to block unauthorized access. To avoid being flagged, you'll need to use headers that mimic a real browser request, and in some cases, proxies to rotate IP addresses.

Crafting the Perfect Headers

Headers are key to simulating a legitimate browser request; Medium inspects them when deciding whether a client looks like a bot. Here's an example of headers you can use:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

These headers make your request look like it's coming from an ordinary browser, though on their own they're no guarantee against detection.
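If you plan to send more than a handful of requests, a single fixed User-Agent makes your traffic easy to fingerprint. Here's a minimal sketch of rotating it per request; the extra User-Agent strings are illustrative examples, and build_headers is a helper name invented for this tutorial:

import random

# Illustrative User-Agent strings; swap in any current browser strings.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
]

def build_headers():
    # Copy the base headers and swap in a random User-Agent per request.
    rotated = dict(headers)
    rotated['user-agent'] = random.choice(USER_AGENTS)
    return rotated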

Using Proxies to Avoid Blocking

Proxies are another line of defense. They mask your real IP address, making it harder for Medium to detect multiple requests from the same source. Here's an example of how to set up a proxy:

proxies = {
    'http': 'http://IP:PORT',   # replace IP:PORT with your proxy endpoint
    'https': 'http://IP:PORT'
}

Rotate your IP periodically to prevent Medium from blocking you.
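One simple way to rotate is a round-robin over a pool of endpoints. The sketch below assumes you have several proxy URLs from your provider; the pool entries are placeholders, and next_proxies is a helper name invented here:

from itertools import cycle

# Placeholder endpoints; substitute real ones from your proxy provider.
PROXY_POOL = cycle([
    'http://IP1:PORT',
    'http://IP2:PORT',
    'http://IP3:PORT',
])

def next_proxies():
    # Return a requests-style proxies dict using the next pool entry.
    endpoint = next(PROXY_POOL)
    return {'http': endpoint, 'https': endpoint}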

Step 3: Making Requests to Medium

Now that we've set up headers and proxies, it's time to send a request to the article URL. Here's how you do it:

import requests  

url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'  
response = requests.get(url, headers=headers, proxies=proxies)

This will fetch the article content from Medium.
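Before parsing, it's worth confirming the request actually succeeded; Medium returns error codes such as 403 when it flags a request, and parsing an error page yields empty results:

# Raise an exception on 4xx/5xx responses instead of parsing an error page.
response.raise_for_status()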

Step 4: Extracting Article Data

Once the page loads, we can extract key details. Using lxml, we'll parse the HTML and grab elements like the article title, author, and content.

from lxml.html import fromstring

parser = fromstring(response.text)

# Extract data. The data-testid selectors are relatively stable, but the
# class-based and id-based ones below (e.g. "ci bh ga gb gc gd", id="1de6")
# are auto-generated and page-specific; verify them against the live HTML.
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
content = '\n'.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))
auth_followers = parser.xpath('//span[@class="pw-follower-count bf b bg z bk"]/a/text()')[0]
sub_title = parser.xpath('//h2[@id="1de6"]/text()')[0]

These XPath queries target specific elements in the HTML and extract the necessary data.
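Each [0] index above raises an IndexError if its selector matches nothing, which is common when Medium's markup changes. A small helper, shown here as an illustrative sketch (first_or_none is a name invented for this tutorial), fails more gracefully:

def first_or_none(tree, expression):
    # Return the first XPath text match, stripped, or None if nothing matched.
    matches = tree.xpath(expression)
    return matches[0].strip() if matches else None

# Example: returns None instead of crashing if the selector breaks.
title = first_or_none(parser, '//h1[@data-testid="storyTitle"]/text()')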

Step 5: Storing Data in a Dictionary

Once we have all the extracted content, let's organize it into a neat dictionary. This makes it easy to save the data into a CSV file for future use.

# Store data in a dictionary  
article_data = {  
    'Title': title,  
    'Author': author,  
    'Publication': publication_name,  
    'Date': publication_date,  
    'Followers': auth_followers,  
    'Subtitle': sub_title,  
    'Content': content,  
}

Step 6: Saving to CSV

With the data ready, let's save it to a CSV file using pandas so it's easy to analyze later.

import pandas as pd

# Save data to CSV  
df = pd.DataFrame([article_data])  
df.to_csv('medium_article_data.csv', index=False)  
print("Data saved to medium_article_data.csv")

The Complete Example

Here's the entire process wrapped up in one script:

import requests  
from lxml.html import fromstring  
import pandas as pd  

# Headers to simulate a real browser request  
headers = {  
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',  
    'accept-language': 'en-IN,en;q=0.9',  
    'cache-control': 'no-cache',  
    'dnt': '1',  
    'pragma': 'no-cache',  
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',  
}  

# Proxies to rotate IP addresses  
proxies = {
    'http': 'http://IP:PORT',   # replace IP:PORT with your proxy endpoint
    'https': 'http://IP:PORT'
}

# Send the request
url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'
response = requests.get(url, headers=headers, proxies=proxies)
response.raise_for_status()  # fail fast if Medium blocked the request

# Parse the page  
parser = fromstring(response.text)

# Extract article data (the class- and id-based selectors are brittle; see the note in Step 4)
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
content = '\n'.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))
auth_followers = parser.xpath('//span[@class="pw-follower-count bf b bg z bk"]/a/text()')[0]
sub_title = parser.xpath('//h2[@id="1de6"]/text()')[0]

# Save the extracted data to CSV  
article_data = {  
    'Title': title,  
    'Author': author,  
    'Publication': publication_name,  
    'Date': publication_date,  
    'Followers': auth_followers,  
    'Subtitle': sub_title,  
    'Content': content,  
}

df = pd.DataFrame([article_data])  
df.to_csv('medium_article_data.csv', index=False)  
print("Data saved to medium_article_data.csv")

Respecting Medium's Terms

As you scrape Medium, remember that scraping can violate a website's terms of service if not done responsibly. Always check Medium's robots.txt and respect their guidelines. Avoid overwhelming their servers with too many requests, and ensure you're scraping ethically.
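As a concrete starting point, Python's standard library can check robots.txt before you fetch, and a short pause between requests keeps your crawl gentle. This is a hedged sketch: the 2-second delay is an illustrative choice, and url, headers, and proxies come from the script above.

import time
import urllib.robotparser

# Consult Medium's robots.txt before fetching a page.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://medium.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):
    time.sleep(2)  # throttle between consecutive requests
    response = requests.get(url, headers=headers, proxies=proxies)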

Conclusion

Now you're ready to start scraping Medium articles using Python for your data analysis or research. Go ahead and try it out, and remember to respect the boundaries of responsible web scraping.

About the Author

SwiftProxy
Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, with over a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with clear, practical writing to help businesses navigate the evolving landscape of proxy IP solutions and data-driven growth.
The content on the Swiftproxy blog is provided for informational purposes only and comes with no warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, and accepts no responsibility for the content of third-party websites referenced in the blog. Before undertaking any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and to review the target website's terms of service carefully. In some cases, explicit authorization or a scraping license may be required.