Unlocking the Potential of Scraping Dynamic Websites

SwiftProxy
By - Linh Tran
2025-01-22 15:39:42

Web scraping is an indispensable skill for anyone working with data from the web. But when it comes to scraping dynamic websites like Instagram or Pinterest, traditional scraping methods fall short. These sites load content dynamically using JavaScript, meaning the information you need might not be available in the initial HTML. So, how do you scrape data from these dynamic pages? The answer: Playwright and lxml.
In this guide, we'll walk you through scraping Instagram posts by automating user interactions like scrolling and waiting for posts to load. We'll use Playwright to automate browser actions and lxml to extract data. Proxies will be used to bypass anti-bot measures and keep you under the radar.
Let's dive into the tools and the step-by-step process.

Tools You Need

Before you start scraping, let's get the right tools in place:

· Playwright (for browser automation)

· lxml (for data extraction using XPath)

· Python (of course!)

You'll be simulating user behavior on Instagram, scrolling through the page to trigger the loading of more posts, and then extracting the URLs. Simple, right?

Step 1: Installing Necessary Libraries

First, you need to install Playwright and lxml. Fire up your terminal and run:

pip install playwright
pip install lxml

Playwright also needs browsers to work with. You can install them using:

playwright install
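
If you only plan to use Chromium, as the examples below do, you can limit the download to that one browser:

playwright install chromium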

Step 2: Setting Up Playwright for Scraping Dynamic Websites

Playwright automates the browser for you, interacting with Instagram's dynamic content and scrolling through the page to load new posts.
Here's a basic script to start scraping Instagram:

import asyncio
from playwright.async_api import async_playwright

async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Open Instagram profile
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")

        # Simulate clicking to load more posts
        await page.get_by_role("button", name="Show more posts from").click()
        
        # Scroll to trigger AJAX requests and load more posts
        for _ in range(5):  # Customize the scroll count
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Wait for posts to load
        
        content = await page.content()
        await browser.close()
        
        return content

# Run the scraper
asyncio.run(scrape_instagram())

This script will simulate a user visiting an Instagram profile, clicking to load more posts, and scrolling to trigger more content to load.

Step 3: Parsing HTML with lxml

Now that you have the content, let's parse the HTML and extract the post URLs. We'll use XPath to locate the URLs of the posts.

from lxml import html

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    
    # XPath for extracting post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    
    post_urls = tree.xpath(post_urls_xpath)
    
    # Convert relative URLs to absolute URLs
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]

This function will grab all the post URLs from the page content and return them as absolute URLs.
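
For example, assuming the scrape_instagram() coroutine from Step 2 is in scope, the two steps can be chained like this:

import asyncio

async def main():
    page_content = await scrape_instagram()      # HTML captured in Step 2
    post_urls = extract_post_urls(page_content)  # XPath extraction from Step 3
    print(f"Found {len(post_urls)} post URLs")
    for url in post_urls[:5]:                    # Preview the first few
        print(url)

asyncio.run(main())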

Step 4: Overcoming Infinite Scrolling

Dynamic sites like Instagram use infinite scrolling to load more content as the user scrolls. To handle this, we simulate scrolling with JavaScript:

await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000)  # Adjust the wait time based on load speed
await page.wait_for_load_state("networkidle")

This ensures that more posts are loaded each time you scroll. Customize the scroll count based on the profile you're scraping.
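
If you don't know in advance how many scrolls a profile needs, a common approach is to keep scrolling until the page height stops growing. Here's a minimal sketch (the max_scrolls cap is an arbitrary safety limit):

async def scroll_until_done(page, max_scrolls=20):
    previous_height = await page.evaluate('document.body.scrollHeight')
    for _ in range(max_scrolls):
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight);')
        await page.wait_for_timeout(3000)  # Give new posts time to load
        current_height = await page.evaluate('document.body.scrollHeight')
        if current_height == previous_height:
            break  # Page height unchanged: no more posts to load
        previous_height = current_height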

Step 5: Avoiding Detection with Proxies

Instagram has strict anti-bot measures. If you're scraping a lot, your IP might get blocked. This is where proxies come in handy.
Playwright makes it easy to route your traffic through a proxy. Here's how to add one to your Playwright script:

async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/username/", wait_until="networkidle")
        # Continue scraping...

Routing your traffic through a proxy hides your own IP; to spread requests across multiple IPs and further reduce the chances of getting blocked, rotate between several proxies, as in the sketch below.
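
If your proxies require authentication, the proxy dict also accepts username and password. A rough way to rotate them is to launch a fresh browser per proxy (the proxy addresses below are placeholders):

proxies = [
    {"server": "http://proxy1:port", "username": "user", "password": "pass"},
    {"server": "http://proxy2:port", "username": "user", "password": "pass"},
]

async def scrape_with_rotating_proxies(urls):
    async with async_playwright() as p:
        for url, proxy in zip(urls, proxies):
            # Launch a separate browser instance for each proxy
            browser = await p.chromium.launch(headless=True, proxy=proxy)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            # ... extract and store the page content here ...
            await browser.close()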

Step 6: Saving the Data

Once you have the URLs, it’s time to save them. We’ll store them in a JSON file for easy access:

import json

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data saved to instagram_posts.json")

This function will save all the extracted post URLs in a clean, structured JSON file.
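
Calling it is straightforward, and reloading the file later is just as easy (the URLs below are placeholders):

save_data("https://www.instagram.com/username/", ["https://www.instagram.com/p/abc123/"])

with open('instagram_posts.json') as json_file:
    saved = json.load(json_file)
print(saved)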

Full Code Example

Here's the full script, from scraping the profile to saving the URLs:

import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json

async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        browser_options = {'headless': True}
        if proxy:
            browser_options['proxy'] = proxy
        
        browser = await p.chromium.launch(**browser_options)
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="networkidle")
        
        try:
            await page.click('button:has-text("Show more posts from")')
        except Exception as e:
            print(f"No 'Show more posts' button found: {e}")

        for _ in range(5):  # Scroll and wait for posts to load
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)
            await page.wait_for_load_state("networkidle")
        
        content = await page.content()
        await browser.close()
        return content

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)

async def main():
    profile_url = "https://www.instagram.com/username/"
    proxy = {"server": "server", "username": "username", "password": "password"}  # Optional
    page_content = await scrape_instagram(profile_url, proxy)
    post_urls = extract_post_urls(page_content)
    save_data(profile_url, post_urls)

if __name__ == '__main__':
    asyncio.run(main())

Alternatives to Playwright

While Playwright is a powerful tool, it's not the only option out there. Here are a few alternatives:

· Selenium: The old faithful of browser automation. It's versatile but not as fast or modern as Playwright.

· Puppeteer: Ideal for JavaScript-heavy sites, but it primarily targets Chrome and Chromium.

· Requests + BeautifulSoup: Great for simple, static websites, but struggles with dynamic content.

Each tool has its strengths. Choose one based on the complexity of your project.
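
To see why the last option struggles here, compare a minimal Requests + BeautifulSoup version of the same extraction (a rough sketch assuming the requests and beautifulsoup4 packages are installed). On a JavaScript-heavy page like an Instagram profile, it typically returns few or no post links, because the posts are injected after the initial HTML is delivered:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.instagram.com/username/")
soup = BeautifulSoup(response.text, "html.parser")

# Same "/p/" link pattern as before -- usually empty on a dynamic page
post_links = [a["href"] for a in soup.find_all("a", href=True) if "/p/" in a["href"]]
print(post_links)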

Final Thoughts

Scraping dynamic websites is no longer a daunting task. With Playwright and lxml, you can easily automate browsing, simulate user behavior, and extract data from pages like Instagram. By using proxies, you can avoid detection and keep your scraping smooth and uninterrupted.
Remember, scraping dynamic websites takes patience—especially with infinite scrolling. But with the right tools and approach, you'll be collecting the data you need in no time.

About the Author

Linh Tran
Senior Technology Analyst at Swiftproxy
Linh Tran is a technical writer based in Hong Kong with a background in computer science and more than eight years of experience in digital infrastructure. At Swiftproxy, she specializes in making complex proxy technologies accessible, offering clear, actionable insights to businesses navigating the rapidly evolving data landscape in Asia and beyond.
The content provided on the Swiftproxy blog is for informational purposes only and is presented without any warranty. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it accept responsibility for the content of third-party sites referenced in the blog. Before engaging in any web scraping or automated data collection activity, readers are strongly advised to consult a qualified legal advisor and review the target site's applicable terms of service. In some cases, explicit authorization or a scraping permit may be required.