
Instagram's data is a treasure trove for researchers, marketers, and developers—but getting it? That's a different story. With sophisticated anti-bot systems, login hurdles, and rate limits, scraping Instagram can feel like trying to break into a vault. But don't worry, it's not impossible.
In this guide, we'll walk you through scraping Instagram user data with Python. By sending API requests, parsing JSON responses, and using a few clever tools, you'll be able to collect valuable insights from public profiles. Let's dive in.
Before we get into the code, let's set you up with the right tools. To scrape Instagram efficiently, you'll need a couple of Python libraries. Make sure you have these installed:
pip install requests python-box
· requests: Used for making HTTP requests to Instagram's backend.
· python-box: This simplifies navigating and accessing JSON data.
Instagram's frontend is locked down tight. But here's the trick: Instagram has an exposed backend API that's not as heavily protected. With the right headers, you can pull public data without authentication.
We'll target the user profile endpoint to grab key data like follower count, bio, and post details. Here's how we make the request:
import requests
headers = {
    "x-ig-app-id": "936619743392459",  # Essential to mimic the Instagram app
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
username = 'testtest' # Replace with the username you want
response = requests.get(
    f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
    headers=headers,
    timeout=10,  # Don't hang forever if Instagram stalls the request
)
response_json = response.json()  # Parse the response body as JSON
Explanation:
· Headers: These mimic a real browser request. Instagram looks at headers to detect bots—so using valid headers like x-ig-app-id and User-Agent makes your requests appear legitimate.
· Backend API: We're hitting the web_profile_info endpoint, which pulls detailed user profile data.
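Before trusting the response, it's worth checking the status code: Instagram returns 429 when you're rate limited and 404 for unknown usernames. A minimal guard might look like this (a sketch; the function name and error handling are our own, not part of any library):

```python
def parse_profile_response(status_code, body):
    """Return the user dict from a web_profile_info response, or None on failure."""
    if status_code == 429:
        # Rate limited - back off (or rotate proxies) before retrying
        raise RuntimeError("Rate limited by Instagram; slow down your requests")
    if status_code != 200:
        return None
    # The user payload lives under data.user in the JSON body
    return body.get("data", {}).get("user")
```

Then instead of parsing blindly, call user = parse_profile_response(response.status_code, response.json()) and handle a None result gracefully.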
Instagram's rate-limiting can be a challenge. If you're scraping multiple profiles or making a lot of requests, you might get blocked. Proxies can help overcome this issue.
Proxies allow you to send requests through different IPs, masking your real location and preventing Instagram from flagging you.
Here's how to integrate proxies into your requests:
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
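If you have a pool of proxies, rotating through them and pausing between requests spreads the load further. Here's one possible sketch (the pool addresses are placeholders, and the delay bounds are arbitrary choices, not Instagram-documented limits):

```python
import itertools
import random
import time

# Placeholder proxy endpoints - substitute your provider's addresses
proxy_pool = [
    "http://user:pass@203.0.113.1:8080",
    "http://user:pass@203.0.113.2:8080",
]
proxy_cycle = itertools.cycle(proxy_pool)

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

def polite_delay(min_s=2.0, max_s=5.0):
    """Sleep a randomized interval so request timing doesn't look scripted."""
    time.sleep(random.uniform(min_s, max_s))
```

Each request then becomes requests.get(url, headers=headers, proxies=next_proxies()), with polite_delay() between calls.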
Instagram's API returns a deep, nested JSON structure. Navigating this with standard dictionary syntax can be a hassle. This is where Box comes in.
Box wraps JSON in an object you can access with dot notation. It's cleaner and easier to read than long chains of bracket lookups.
Here's how to use it:
from box import Box
response_json = Box(response.json())
user_data = {
    'full_name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'bio': response_json.data.user.biography,
    'is_verified': response_json.data.user.is_verified,
    'profile_pic': response_json.data.user.profile_pic_url_hd,
}
Instead of accessing data with response_json['data']['user']['full_name'], you can just write response_json.data.user.full_name. Much easier, right?
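One caveat: dot access still raises an error if a field is missing from the response. If you'd rather not depend on a third-party package at all, a small stdlib helper gives similar safety (deep_get is our own name, not a library function):

```python
from functools import reduce

def deep_get(data, path, default=None):
    """Walk a nested dict by dotted path, e.g. 'data.user.full_name'."""
    try:
        return reduce(lambda acc, key: acc[key], path.split("."), data)
    except (KeyError, TypeError):
        return default
```

With this, deep_get(response.json(), "data.user.full_name") returns None (or your chosen default) instead of crashing when a field is absent.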
Once you have the profile data, you can dig deeper into a user's posts and videos. Instagram gives you a goldmine of insights here: view counts, like counts, and more.
For videos, here's the extraction method:
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'video_url': element.node.video_url,
        'views': element.node.video_view_count,
        'likes': element.node.edge_liked_by.count,
    }
    profile_video_data.append(video_data)
Similarly, you can extract regular timeline posts (photos and videos) and pull data like media URL, comment counts, and like counts.
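The shape mirrors the video loop above. A sketch using plain dicts (the field names follow Instagram's GraphQL schema at the time of writing and may change):

```python
def extract_timeline_posts(user):
    """Collect basic stats from a user's timeline media.

    `user` is the parsed data['user'] dict from the profile response.
    """
    posts = []
    for element in user["edge_owner_to_timeline_media"]["edges"]:
        node = element["node"]
        posts.append({
            "id": node["id"],
            "media_url": node["display_url"],
            "comments": node["edge_media_to_comment"]["count"],
            "likes": node["edge_liked_by"]["count"],
        })
    return posts
```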
Once you've scraped the data, you'll likely want to save it. Use Python's built-in json library to export the data into readable JSON files for later analysis.
import json

# Save the profile data
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
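One detail worth knowing: bios and captions often contain emoji and non-Latin text, which json.dump escapes into \uXXXX sequences by default. Passing ensure_ascii=False along with an explicit UTF-8 encoding keeps the file human-readable (the sample data below is made up for illustration):

```python
import json

user_data = {"full_name": "Tëst Account 🚀", "followers": 123}  # sample data

# ensure_ascii=False writes emoji and accented characters as-is
with open("profile_data.json", "w", encoding="utf-8") as file:
    json.dump(user_data, file, indent=4, ensure_ascii=False)
```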
Now, you have a neatly structured JSON file with all the data you need. Easy to read, easy to process.
Here's the full Python script for your convenience:
import requests
from box import Box
import json

headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
}
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
username = 'testtest'  # Replace with the username you want

response = requests.get(
    f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
    headers=headers,
    proxies=proxies,
    timeout=10,
)
response_json = Box(response.json())

# Profile data
user_data = {
    'full_name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'bio': response_json.data.user.biography,
    'is_verified': response_json.data.user.is_verified,
    'profile_pic': response_json.data.user.profile_pic_url_hd,
}

# Video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    profile_video_data.append({
        'id': element.node.id,
        'video_url': element.node.video_url,
        'views': element.node.video_view_count,
        'likes': element.node.edge_liked_by.count,
    })

# Save the data
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
Scraping Instagram data can seem daunting, but with the right tools, it's entirely doable. By leveraging Instagram's backend API, using headers to mimic a real browser, and applying proxies to avoid detection, you can scrape Instagram data from public profiles and gain valuable insights. Always ensure you're respecting Instagram's terms of service. Stay ethical and avoid overloading their servers with requests.