
More than 500 hours of video are uploaded to YouTube every minute. For creators, analyzing their own content's performance, along with competitor videos, can be an overwhelming task, and manually sifting through all that data is tedious. That's where automation steps in, especially with a well-crafted YouTube scraping script. Let's dive into building one from scratch.
To begin scraping, we need the right tools. Python's Selenium is a go-to for automating web browsers, but we'll need a few extra packages to make it all run smoothly. First up:
selenium-wire: A Selenium extension that lets us configure authenticated proxies (crucial for avoiding IP bans).
selenium: The standard library for browser automation.
blinker: Pinned to 1.7.0, because selenium-wire relies on an internal blinker module that newer releases removed, which causes an import error at runtime.
Install them using the command below:
pip install selenium-wire selenium blinker==1.7.0
Next, let's import the libraries that will drive our script.
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
Here's what each one does:
selenium.webdriver (with By, Keys, WebDriverWait, and expected_conditions): Locates page elements, waits for them to load, and interacts with them.
json: Serializes the scraped data into a clean, readable format.
time: Adds delays between actions so the browsing pattern doesn't look robotic.
ActionChains: Mimics real human scrolling and clicking behavior.
YouTube's robots.txt file makes it clear—they're not fond of scraping. So, to avoid triggering anti-scraping measures, we need to route our requests through a proxy. Here's how:
Set up proxy credentials.
Pass them to selenium-wire's proxy options (Chrome itself can't take proxy credentials as a command-line flag).
Launch the browser through selenium-wire.
proxy_address = "your.proxy.address"
proxy_username = "your-username"
proxy_password = "your-password"
chrome_options = Options()
# Chrome has no flag for proxy credentials, so selenium-wire handles the authenticated proxy
proxy_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)
With selenium-wire routing every request through the authenticated proxy, repeated visits no longer come from your own IP, so the script stays stealthy and doesn't attract unwanted attention.
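Before pointing the driver at YouTube, it's worth confirming that traffic really flows through the proxy. A minimal sanity check, assuming the driver created above and that an IP echo service such as httpbin.org is reachable:

# The IP printed here should belong to the proxy, not your own machine
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)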
With your proxy in place, we can now focus on extracting data. Here's the flow:
Open the Video URL: We'll fetch the video page URL and load it into the driver.
Wait for Page Elements: We'll use WebDriverWait to ensure that the elements we need are fully loaded before extracting any data.
Scroll and Load More: To get all comments, we simulate a user scrolling through the page.
youtube_url_to_scrape = "your_video_url"
driver.get(youtube_url_to_scrape)
def extract_information() -> dict:
    try:
        # Expand the description so the full text and metadata become visible
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )
        element.click()

        time.sleep(10)
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()  # Scroll down to trigger comment loading
        time.sleep(10)
        actions.send_keys(Keys.END).perform()  # Scroll again to load more comments
        time.sleep(10)

        # Video and channel details
        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text
        description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])

        # Additional details
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text

        # Scrape comments (author names and comment text are returned in matching order)
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comments = [
            {"name": name.text, "comment": content.text}
            for name, content in zip(comment_names, comment_content)
        ]

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comments
        }
        return data
    except Exception as err:
        print(f"Error: {err}")
This function does all the heavy lifting—pulling video details, the owner’s stats, likes, views, and a collection of comments.
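Two fixed presses of the END key are enough for videos with a handful of comments, but busier comment sections load in batches as you scroll. A possible refinement, sketched here as a hypothetical helper rather than part of the original function, is to keep scrolling until the number of loaded comments stops growing:

# Sketch: scroll until the comment count stops growing
# Assumes the driver is already on the video page
def scroll_until_comments_loaded(max_rounds: int = 20):
    actions = ActionChains(driver)
    previous_count = -1
    for _ in range(max_rounds):
        actions.send_keys(Keys.END).perform()
        time.sleep(5)  # give YouTube time to fetch the next batch
        current_count = len(driver.find_elements(By.XPATH, '//*[@id="content-text"]/span'))
        if current_count == previous_count:
            break  # nothing new appeared, so we've reached the end
        previous_count = current_count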
Once we've gathered the data, the next step is saving it in an easy-to-use format: JSON.
def organize_write_data(data: dict):
    # Serialize to pretty-printed JSON, keeping non-ASCII characters intact
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")
This function neatly stores everything in an output.json file, ready for analysis or further processing.
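As a quick example of that further processing, here's a sketch that loads output.json back in and prints a short summary (it assumes the file was written by the function above):

# Load the saved JSON and summarize it
with open("output.json", encoding="utf-8") as file:
    video = json.load(file)

print(f"{video['video_title']} by {video['owner']}")
print(f"Views: {video['views']}, likes: {video['likes']}")
print(f"Comments scraped: {len(video['comments'])}")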
Here's how everything ties together:
# Importing necessary packages
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
# Proxy setup
proxy_address = "your.proxy.address"
proxy_username = "your-username"
proxy_password = "your-password"
chrome_options = Options()
# Chrome has no flag for proxy credentials, so selenium-wire handles the authenticated proxy
proxy_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)
# Target URL
youtube_url_to_scrape = "your_video_url"
driver.get(youtube_url_to_scrape)
def extract_information() -> dict:
    # Scraping logic (paste the full function body from above)
    ...

def organize_write_data(data: dict):
    # Save scraped data to JSON (paste the full function body from above)
    ...
organize_write_data(extract_information())
driver.quit()
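One optional refinement, not shown in the script above: wrap the last two calls in try/finally so the browser always closes, even if an XPath changes and extraction raises an error.

# Sketch: make sure the browser closes even if scraping fails partway through
try:
    organize_write_data(extract_information())
finally:
    driver.quit()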
Scraping vital YouTube data can be an incredibly powerful way to gain insights into what works—and what doesn't—in your content strategy. With the approach outlined above, you can build a tool that pulls in relevant data on video views, likes, comments, and more—all while staying under the radar. The key? Always use a proxy, respect YouTube's policies, and automate wisely.