
Web scraping is an indispensable skill for anyone working with data from the web. But when it comes to scraping dynamic websites like Instagram or Pinterest, traditional scraping methods fall short. These sites load content dynamically using JavaScript, meaning the information you need might not be available in the initial HTML. So, how do you scrape data from these dynamic pages? The answer: Playwright and lxml.
In this guide, we'll walk you through scraping Instagram posts by automating user interactions like scrolling and waiting for posts to load. We'll use Playwright to automate browser actions and lxml to extract data. Proxies will be used to bypass anti-bot measures and keep you under the radar.
Let's dive into the tools and the step-by-step process.
Before you start scraping, let's get the right tools in place:
· Playwright (for browser automation)
· lxml (for data extraction using XPath)
· Python (of course!)
You'll be simulating user behavior on Instagram, scrolling through the page to trigger the loading of more posts, and then extracting the URLs. Simple, right?
First, you need to install Playwright, lxml, and a few dependencies. Fire up your terminal and run:
pip install playwright
pip install lxml
Playwright also needs browsers to work with. You can install them using:
playwright install
Playwright automates the browser for you, interacting with Instagram's dynamic content and scrolling through the page to load new posts.
Here's a basic script to start scraping Instagram:
import asyncio
from playwright.async_api import async_playwright
async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Open Instagram profile
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Click "Show more posts from" if the button is present
        try:
            await page.get_by_role("button", name="Show more posts from").click()
        except Exception:
            pass  # The button does not appear on every profile
        
        # Scroll to trigger AJAX requests and load more posts
        for _ in range(5):  # Customize the scroll count
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Wait for posts to load
        
        content = await page.content()
        await browser.close()
        
        return content
# Run the scraper and keep the rendered HTML for parsing
page_content = asyncio.run(scrape_instagram())
This script will simulate a user visiting an Instagram profile, clicking to load more posts, and scrolling to trigger more content to load.
Now that you have the content, let's parse the HTML and extract the post URLs. We'll use XPath to locate the URLs of the posts.
from lxml import html
def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    
    # XPath for extracting post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    
    post_urls = tree.xpath(post_urls_xpath)
    
    # Convert relative URLs to absolute URLs
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]
This function will grab all the post URLs from the page content and return them as absolute URLs.
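As a quick sanity check, assuming page_content holds the HTML returned by scrape_instagram(), you can preview the first few results like this:
# Parse the rendered HTML and preview a few post URLs
post_urls = extract_post_urls(page_content)
print(f"Found {len(post_urls)} post URLs")
for url in post_urls[:5]:
    print(url)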
Dynamic sites like Instagram use infinite scrolling to load more content as the user scrolls. To handle this, we simulate scrolling with JavaScript:
await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000)  # Adjust the wait time based on load speed
await page.wait_for_load_state("networkidle")
This ensures that more posts are loaded each time you scroll. Customize the scroll count based on the profile you're scraping.
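A fixed scroll count works for smaller profiles. If you'd rather keep scrolling until nothing new loads, one option is to watch the page height and stop once it stops growing. Here's a rough sketch (the scroll_until_stable helper and its max_scrolls cap are illustrative, not part of the script above):
async def scroll_until_stable(page, max_scrolls=20):
    """Scroll until the page height stops growing, or max_scrolls is reached."""
    last_height = await page.evaluate('document.body.scrollHeight')
    for _ in range(max_scrolls):
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight);')
        await page.wait_for_timeout(3000)  # Give new posts time to load
        new_height = await page.evaluate('document.body.scrollHeight')
        if new_height == last_height:
            break  # Nothing new loaded; assume we've reached the end
        last_height = new_height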
Instagram has strict anti-bot measures. If you're scraping a lot, your IP might get blocked. This is where proxies come in handy.
Playwright makes it easy to rotate IPs by using proxies. Here's how to add a proxy to your Playwright script:
async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Continue scraping...
Routing your traffic through a proxy hides your real IP, and with a rotating proxy service each session can come from a different address, reducing the chances of getting blocked.
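If your proxy requires authentication, Playwright's proxy option also accepts username and password fields. A minimal sketch, with the server address and credentials as placeholders:
import asyncio
from playwright.async_api import async_playwright
async def scrape_with_auth_proxy():
    async with async_playwright() as p:
        # Placeholder proxy address and credentials: substitute your own
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://your-proxy-server:port",
                "username": "your-username",
                "password": "your-password",
            },
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Continue scraping...
        await browser.close()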
Once you have the URLs, it’s time to save them. We’ll store them in a JSON file for easy access:
import json
def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print("Data saved to instagram_posts.json")
This function will save all the extracted post URLs in a clean, structured JSON file.
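For example, calling it with a couple of placeholder URLs (not real posts) produces a file keyed by the profile URL:
# Example usage with placeholder URLs (not real posts)
save_data(
    "https://www.instagram.com/username/",
    ["https://www.instagram.com/p/POST_ID_1/", "https://www.instagram.com/p/POST_ID_2/"],
)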
Here's the full script, from scraping the profile to saving the URLs:
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json
async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        browser_options = {'headless': True}
        if proxy:
            browser_options['proxy'] = proxy
        
        browser = await p.chromium.launch(**browser_options)
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="networkidle")
        
        try:
            await page.click('button:has-text("Show more posts from")')
        except Exception as e:
            print(f"No 'Show more posts' button found: {e}")
        for _ in range(5):  # Scroll and wait for posts to load
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)
            await page.wait_for_load_state("networkidle")
        
        content = await page.content()
        await browser.close()
        return content
def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]
def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
async def main():
    profile_url = "https://www.instagram.com/username/"
    proxy = {"server": "server", "username": "username", "password": "password"}  # Optional
    page_content = await scrape_instagram(profile_url, proxy)
    post_urls = extract_post_urls(page_content)
    save_data(profile_url, post_urls)
if __name__ == '__main__':
    asyncio.run(main())
While Playwright is a powerful tool, it's not the only option out there. Here are a few alternatives:
· Selenium: The old faithful of browser automation. It's versatile but not as fast or modern as Playwright.
· Puppeteer: Ideal for JavaScript-heavy sites, but only supports Chrome and Chromium.
· Requests + BeautifulSoup: Great for simple, static websites, but struggles with dynamic content.
Each tool has its strengths. Choose one based on the complexity of your project.
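For comparison, here's roughly what the Requests + BeautifulSoup approach looks like on a static page (example.com is just a stand-in URL). It works when the data is already in the raw HTML, but it won't see anything Instagram renders with JavaScript:
import requests
from bs4 import BeautifulSoup
# Static-page scraping: only works when the links exist in the raw HTML
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)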
Scraping dynamic websites is no longer a daunting task. With Playwright and lxml, you can easily automate browsing, simulate user behavior, and extract data from pages like Instagram. By using proxies, you can avoid detection and keep your scraping smooth and uninterrupted.
Remember, scraping dynamic websites takes patience—especially with infinite scrolling. But with the right tools and approach, you'll be collecting the data you need in no time.