
Web scraping is an indispensable skill for anyone working with data from the web. But when it comes to scraping dynamic websites like Instagram or Pinterest, traditional scraping methods fall short. These sites load content dynamically using JavaScript, meaning the information you need might not be available in the initial HTML. So, how do you scrape data from these dynamic pages? The answer: Playwright and lxml.
In this guide, we'll walk you through scraping Instagram posts by automating user interactions like scrolling and waiting for posts to load. We'll use Playwright to automate browser actions and lxml to extract data. Proxies will be used to bypass anti-bot measures and keep you under the radar.
Let's dive into the tools and the step-by-step process.
Before you start scraping, let's get the right tools in place:
· Playwright (for browser automation)
· lxml (for data extraction using XPath)
· Python (of course!)
You'll be simulating user behavior on Instagram, scrolling through the page to trigger the loading of more posts, and then extracting the URLs. Simple, right?
First, you need to install Playwright, lxml, and a few dependencies. Fire up your terminal and run:
pip install playwright
pip install lxml
Playwright also needs browsers to work with. You can install them using:
playwright install
Playwright automates the browser for you, interacting with Instagram's dynamic content and scrolling through the page to load new posts.
Here's a basic script to start scraping Instagram:
import asyncio
from playwright.async_api import async_playwright
async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Open Instagram profile
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Click "Show more posts from" if the button is present
        try:
            await page.get_by_role("button", name="Show more posts from").click()
        except Exception:
            pass  # The button does not appear on every profile
        
        # Scroll to trigger AJAX requests and load more posts
        for _ in range(5):  # Customize the scroll count
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Wait for posts to load
        
        content = await page.content()
        await browser.close()
        
        return content
# Run the scraper and keep the rendered HTML for parsing
page_content = asyncio.run(scrape_instagram())
This script will simulate a user visiting an Instagram profile, clicking to load more posts, and scrolling to trigger more content to load.
Now that you have the content, let's parse the HTML and extract the post URLs. We'll use XPath to locate the URLs of the posts.
from lxml import html
def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    
    # XPath for extracting post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    
    post_urls = tree.xpath(post_urls_xpath)
    
    # Convert relative URLs to absolute URLs
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]
This function will grab all the post URLs from the page content and return them as absolute URLs.
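As a quick sanity check, assuming page_content holds the HTML returned by scrape_instagram(), you can preview the first few results like this:
# Parse the rendered HTML and preview a few post URLs
post_urls = extract_post_urls(page_content)
print(f"Found {len(post_urls)} post URLs")
for url in post_urls[:5]:
    print(url)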
Dynamic sites like Instagram use infinite scrolling to load more content as the user scrolls. To handle this, we simulate scrolling with JavaScript:
await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000)  # Adjust the wait time based on load speed
await page.wait_for_load_state("networkidle")
This ensures that more posts are loaded each time you scroll. Customize the scroll count based on the profile you're scraping.
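A fixed scroll count works for smaller profiles. If you'd rather keep scrolling until nothing new loads, one option is to watch the page height and stop once it stops growing. Here's a rough sketch (the scroll_until_stable helper and its max_scrolls cap are illustrative, not part of the script above):
async def scroll_until_stable(page, max_scrolls=20):
    """Scroll until the page height stops growing, or max_scrolls is reached."""
    last_height = await page.evaluate('document.body.scrollHeight')
    for _ in range(max_scrolls):
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight);')
        await page.wait_for_timeout(3000)  # Give new posts time to load
        new_height = await page.evaluate('document.body.scrollHeight')
        if new_height == last_height:
            break  # Nothing new loaded; assume we've reached the end
        last_height = new_height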
Instagram has strict anti-bot measures. If you're scraping a lot, your IP might get blocked. This is where proxies come in handy.
Playwright makes it easy to rotate IPs by using proxies. Here's how to add a proxy to your Playwright script:
async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Continue scraping...
Routing your traffic through a proxy hides your real IP, and with a rotating proxy service each session can come from a different address, reducing the chances of getting blocked.
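If your proxy requires authentication, Playwright's proxy option also accepts username and password fields. A minimal sketch, with the server address and credentials as placeholders:
import asyncio
from playwright.async_api import async_playwright
async def scrape_with_auth_proxy():
    async with async_playwright() as p:
        # Placeholder proxy address and credentials: substitute your own
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://your-proxy-server:port",
                "username": "your-username",
                "password": "your-password",
            },
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Continue scraping...
        await browser.close()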
Once you have the URLs, it’s time to save them. We’ll store them in a JSON file for easy access:
import json
def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print("Data saved to instagram_posts.json")
This function will save all the extracted post URLs in a clean, structured JSON file.
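For example, calling it with a couple of placeholder URLs (not real posts) produces a file keyed by the profile URL:
# Example usage with placeholder URLs (not real posts)
save_data(
    "https://www.instagram.com/username/",
    ["https://www.instagram.com/p/POST_ID_1/", "https://www.instagram.com/p/POST_ID_2/"],
)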
Here's the full script, from scraping the profile to saving the URLs:
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json
async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        browser_options = {'headless': True}
        if proxy:
            browser_options['proxy'] = proxy
        
        browser = await p.chromium.launch(**browser_options)
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="networkidle")
        
        try:
            await page.click('button:has-text("Show more posts from")')
        except Exception as e:
            print(f"No 'Show more posts' button found: {e}")
        for _ in range(5):  # Scroll and wait for posts to load
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)
            await page.wait_for_load_state("networkidle")
        
        content = await page.content()
        await browser.close()
        return content
def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]
def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
async def main():
    profile_url = "https://www.instagram.com/username/"
    proxy = {"server": "server", "username": "username", "password": "password"}  # Optional
    page_content = await scrape_instagram(profile_url, proxy)
    post_urls = extract_post_urls(page_content)
    save_data(profile_url, post_urls)
if __name__ == '__main__':
    asyncio.run(main())
While Playwright is a powerful tool, it's not the only option out there. Here are a few alternatives:
· Selenium: The old faithful of browser automation. It's versatile but not as fast or modern as Playwright.
· Puppeteer: Ideal for JavaScript-heavy sites, but only supports Chrome and Chromium.
· Requests + BeautifulSoup: Great for simple, static websites, but struggles with dynamic content.
Each tool has its strengths. Choose one based on the complexity of your project.
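For comparison, here's roughly what the Requests + BeautifulSoup approach looks like on a static page (example.com is just a stand-in URL). It works when the data is already in the raw HTML, but it won't see anything Instagram renders with JavaScript:
import requests
from bs4 import BeautifulSoup
# Static-page scraping: only works when the links exist in the raw HTML
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)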
Scraping dynamic websites is no longer a daunting task. With Playwright and lxml, you can easily automate browsing, simulate user behavior, and extract data from pages like Instagram. By using proxies, you can avoid detection and keep your scraping smooth and uninterrupted.
Remember, scraping dynamic websites takes patience—especially with infinite scrolling. But with the right tools and approach, you'll be collecting the data you need in no time.