
YouTube hosts billions of videos, making it an invaluable source of data. But scraping this goldmine isn't easy: between dynamic content, anti-scraping measures, and heavy traffic, getting the data you need can feel like an uphill battle. Fortunately, with the right approach, you can efficiently extract video details using Python, Playwright, and lxml.
Before jumping into the code, make sure your environment is ready. Here's what you'll need:
1. Playwright: Automates headless browsers, allowing you to interact with YouTube just like a real user.
2. lxml: A powerhouse for parsing and querying HTML.
3. CSV module: A built-in Python tool for storing extracted data.
Install the necessary packages using pip:
pip install playwright
pip install lxml
Then, install Playwright's browser binaries:
playwright install
Or, if you only need Chromium:
playwright install chromium
You'll need these Python libraries to get started:
import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv
Playwright lets us launch a headless browser to scrape YouTube content. The first step is navigating to the video URL and waiting for the page to load fully.
browser = await playwright.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Navigating to the YouTube video URL
await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")
# Scroll to load more comments
for _ in range(20):
    await page.mouse.wheel(0, 200)
    await asyncio.sleep(0.2)
# Give it time to load additional content
await page.wait_for_timeout(1000)
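The fixed 20-wheel loop works for a first pass, but it loads roughly the same amount of content no matter how long the page is. If you want to pull in as many comments as possible, one option is to keep scrolling until the page height stops growing. Here's a minimal sketch (the 30-round cap and 0.5-second pause are assumptions; tune them to your connection):
# Sketch: scroll until the page stops growing, capped at 30 rounds
previous_height = 0
for _ in range(30):
    current_height = await page.evaluate("document.documentElement.scrollHeight")
    if current_height == previous_height:
        break  # nothing new loaded since the last scroll
    previous_height = current_height
    await page.mouse.wheel(0, 2000)
    await asyncio.sleep(0.5)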
Next, grab the raw HTML content and parse it with lxml.
page_content = await page.content()
parser = html.fromstring(page_content)
Now for the exciting part—extracting key data like the video title, channel info, views, comments, and more. XPath queries help pull out the exact elements you need.
# Video title
title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
# Channel name and link
channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
# Upload date and view count from the info line
posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
# Comment count and the loaded comment texts
total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')
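One caveat: indexing into the first match with [0] raises an IndexError whenever a query comes back empty, and YouTube's markup changes often. A small defensive helper (hypothetical, not part of the original snippet) keeps the script from crashing:
# Hypothetical helper: return the first XPath match, or a default when nothing is found
def first_or_default(results, default="N/A"):
    return results[0] if results else default

# Example usage with one of the queries above
title = first_or_default(parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()'))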
Now that you've extracted the data, save it to a CSV file for easy access and analysis.
with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
    writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])
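The layout above packs every comment into a single cell. If you plan to analyze comments individually, a one-row-per-comment layout may be more convenient. Here's a possible variant (the youtube_comments.csv filename is just an example):
# Variant: write each comment on its own row for easier filtering
with open('youtube_comments.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Channel", "Comment"])
    for comment in comments_list:
        writer.writerow([title, channel, comment])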
If you want to scrape YouTube at scale or avoid being blocked, proxies are a must. By routing your browser traffic through proxies, you can reduce the chance of detection and avoid getting your IP banned.
Here's how you can set up proxies in Playwright:
browser = await playwright.chromium.launch(
    headless=True,
    proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
)
Why are proxies so effective here? Three reasons:
1. IP address masking: hides your real IP, making it harder for YouTube to detect scraping attempts.
2. Request distribution: rotating proxies simulates traffic from multiple users.
3. Circumventing restrictions: overcome regional or content-access restrictions.
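If you run the scraper repeatedly, you can rotate through a pool of proxies by picking a different one for each launch. A minimal sketch, assuming you have your own list of endpoints and credentials (the pool below is a placeholder):
import random

# Placeholder proxy pool: replace with your own endpoints and credentials
PROXIES = [
    {"server": "http://proxy1_ip:port", "username": "your_username", "password": "your_password"},
    {"server": "http://proxy2_ip:port", "username": "your_username", "password": "your_password"},
]

browser = await playwright.chromium.launch(headless=True, proxy=random.choice(PROXIES))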
Here's everything in one place:
import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv
async def run(playwright: Playwright) -> None:
    browser = await playwright.chromium.launch(
        headless=True,
        proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
    )
    context = await browser.new_context()
    page = await context.new_page()

    await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

    # Scroll to load more comments
    for _ in range(20):
        await page.mouse.wheel(0, 200)
        await asyncio.sleep(0.2)

    await page.wait_for_timeout(1000)

    page_content = await page.content()
    await context.close()
    await browser.close()

    parser = html.fromstring(page_content)

    title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
    channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
    channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
    posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
    total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
    total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
    comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')

    with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
        writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

asyncio.run(main())
By following these steps, you can scrape YouTube data with ease. But remember: scraping raises ethical concerns and carries a real risk of being blocked. Use proxies wisely, follow ethical guidelines, and respect YouTube's terms of service.