The Complete Guide to Scrape YouTube Videos for Data

SwiftProxy
By Emily Chan
2025-01-09


YouTube hosts billions of videos, making it an invaluable source of data. Scraping that goldmine isn't easy, though: dynamic content, anti-scraping measures, and heavy traffic can make getting the data you need feel like an uphill battle. With the right approach, however, you can efficiently extract video details using Python, Playwright, and lxml.

Preparing Your Environment

Before jumping into the code, make sure your environment is ready. Here's what you'll need:

1. Playwright: Automates headless browsers, allowing you to interact with YouTube just like a real user.

2. lxml: A powerhouse for parsing and querying HTML.

3. CSV module: A built-in Python tool for storing extracted data.

Install the necessary packages using pip:

pip install playwright  
pip install lxml  

Then, install Playwright's browser binaries:

playwright install  

Or, to install only Chromium (the browser used in this guide):

playwright install chromium  
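
To confirm the setup works before touching YouTube, you can run a quick smoke test (a minimal sketch; example.com is just a stand-in target):

import asyncio
from playwright.async_api import async_playwright

async def check() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())  # expect "Example Domain"
        await browser.close()

asyncio.run(check())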

Step 1: Import Necessary Libraries

You'll need these Python libraries to get started:

import asyncio  
from playwright.async_api import Playwright, async_playwright  
from lxml import html  
import csv  

Step 2: Automating the Browser

Playwright lets us launch a headless browser to scrape YouTube content. The first step is navigating to the video URL and waiting for the page to load fully.

browser = await playwright.chromium.launch(headless=True)  
context = await browser.new_context()  
page = await context.new_page()  

# Navigate to the YouTube video URL
await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")  

# Scroll to load more comments  
for _ in range(20):  
    await page.mouse.wheel(0, 200)  
    await asyncio.sleep(0.2)  

# Give it time to load additional content  
await page.wait_for_timeout(1000)  
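
A fixed number of scroll steps may load too few or too many comments depending on the video. A hedged alternative is to keep scrolling until the page height stops growing, which usually means nothing more is being lazy-loaded (a sketch; the height-based stop condition is an assumption, not part of the original flow):

# Keep scrolling until the page stops getting taller
previous_height = 0
while True:
    current_height = await page.evaluate("document.documentElement.scrollHeight")
    if current_height == previous_height:
        break  # no new content loaded since the last scroll
    previous_height = current_height
    await page.mouse.wheel(0, current_height)
    await asyncio.sleep(0.5)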

Step 3: Parsing HTML for Data

Next, grab the raw HTML content and parse it with lxml.

page_content = await page.content()  
parser = html.fromstring(page_content)  

Step 4: Gathering the Data

Now for the exciting part: extracting key data such as the video title, channel info, views, and comments. XPath queries pull out the exact elements you need. Keep in mind that these selectors target YouTube's current markup and may break when the page layout changes.

# Video title
title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
# Channel name and link
channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
# Upload date and view count sit at fixed positions within the info line
posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
# Comment count and the visible comment texts
total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')
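
Note that indexing with [0] raises an IndexError whenever a selector matches nothing, which happens easily on a page as dynamic as YouTube. A small helper (a hedged sketch, not part of the original code) lets the extraction fail soft instead:

def first(results, default="N/A"):
    # Return the first XPath match, or a default when nothing matched
    return results[0] if results else default

title = first(parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()'))
channel = first(parser.xpath('//yt-formatted-string[@id="text"]/a/text()'))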

Step 5: Saving the Information

Now that you've extracted the data, save it to a CSV file for easy access and analysis.

with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:  
    writer = csv.writer(file)  
    writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])  
    writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])  
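
Packing every comment into a single cell keeps the file compact but is awkward to analyze later. If you prefer one row per comment, a hedged variation (the youtube_comments.csv filename is an assumption) looks like this:

with open('youtube_comments.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Channel", "Comment"])
    for comment in comments_list:
        writer.writerow([title, channel, comment])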

The Power of Proxies

If you want to scrape YouTube at scale or avoid being blocked, proxies are a must. By routing your browser traffic through proxies, you reduce the chance of detection and avoid getting your IP banned.

Here's how to set up a proxy in Playwright:

browser = await playwright.chromium.launch(  
    headless=True,  
    proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}  
)  

Why Use Proxies

IP Address Masking: Hides your real IP, making it harder for YouTube to detect scraping attempts.
Distribute Requests: Rotate proxies to simulate traffic from multiple users.
Circumvent Restrictions: Overcome regional or content access restrictions.
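
Playwright accepts one proxy per browser launch, so rotation in practice means choosing a different endpoint for each run or each video. A minimal sketch, assuming a hypothetical pool of proxy endpoints:

import random

# Hypothetical proxy pool; substitute your provider's endpoints
proxy_pool = [
    {"server": "http://proxy1.example:8000", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example:8000", "username": "user", "password": "pass"},
]

browser = await playwright.chromium.launch(
    headless=True,
    proxy=random.choice(proxy_pool),  # a different proxy each launch
)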

Full Code Implementation

Here's everything in one place:

import asyncio  
from playwright.async_api import Playwright, async_playwright  
from lxml import html  
import csv  

async def run(playwright: Playwright) -> None:  
    browser = await playwright.chromium.launch(  
        headless=True,  
        proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}  
    )  
    context = await browser.new_context()  
    page = await context.new_page()  

    # Load the video page and wait for network activity to settle
    await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

    # Scroll to trigger lazy-loading of comments
    for _ in range(20):
        await page.mouse.wheel(0, 200)  
        await asyncio.sleep(0.2)  

    await page.wait_for_timeout(1000)  

    # Capture the rendered HTML, then shut the browser down
    page_content = await page.content()
    await context.close()  
    await browser.close()  

    # Parse the captured HTML and extract the fields with lxml
    parser = html.fromstring(page_content)
    title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]  
    channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]  
    channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]  
    posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]  
    total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]  
    total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]  
    comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')  

    # Write everything to a CSV file
    with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)  
        writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])  
        writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])  

async def main():  
    async with async_playwright() as playwright:  
        await run(playwright)  

asyncio.run(main())  
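
To reuse the script across videos, you could pass the URL on the command line instead of hard-coding it. A hedged sketch using sys.argv (run() would need to accept the URL as a parameter):

import sys

async def run(playwright: Playwright, url: str) -> None:
    ...  # same body as above, with page.goto(url, wait_until="networkidle")

async def main():
    # Fall back to the demo video when no URL is supplied
    url = sys.argv[1] if len(sys.argv) > 1 else "https://www.youtube.com/watch?v=Ct8Gxo8StBU"
    async with async_playwright() as playwright:
        await run(playwright, url)

asyncio.run(main())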

Final Thoughts

By following these steps, you can scrape YouTube data with ease. Remember, though, that scraping raises ethical and legal concerns and carries a real risk of being blocked. Use proxies wisely, follow ethical guidelines, and respect YouTube's terms of service.

About the author

Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, bringing over a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.