
Over 500 hours of video are uploaded to YouTube every minute. For creators, analyzing their own content's performance, along with competitor videos, can be an overwhelming task. Manually sifting through all that data? Tedious. That's where automation steps in, especially with a well-crafted YouTube scraping script. Let's dive into building one from scratch.
To begin scraping, we need the right tools. Python's Selenium is a go-to for automating web browsers, but we'll need a few extra packages to make it all run smoothly. First up:
selenium-wire: A Selenium extension that lets us configure proxies (crucial for avoiding IP bans).
selenium: Standard tool for web automation.
blinker==1.7.0: selenium-wire relies on blinker internals that newer releases removed, so pinning this version avoids import errors at runtime.
Install them using the command below:
pip install selenium-wire selenium blinker==1.7.0  
Next, let's import the libraries that will drive our script.
from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  
Here's what each one does:
selenium.webdriver: Interacts with web elements.
json: Converts our scraped data into a clean format.
time: Helps us add delays, so we don't look like a robot.
ActionChains: For mimicking real human scrolling and clicking behaviors.
YouTube's robots.txt file makes it clear—they're not fond of scraping. So, to avoid triggering anti-scraping measures, we need to route our requests through a proxy. Here's how:
Set up proxy credentials.
Pass those credentials to selenium-wire's proxy options (Chrome itself doesn't accept proxy credentials on the command line).
Launch the browser through selenium-wire.
proxy_address = "your.proxy.address"  
proxy_username = "your-username"  
proxy_password = "your-password"  
chrome_options = Options()  
# Chrome has no --proxy-auth flag, so authenticated proxies go through selenium-wire's own options  
proxy_options = {'proxy': {  
    'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',  
    'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',  
}}  
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)  
This setup ensures your script remains stealthy and doesn't attract unwanted attention.
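Before pointing the browser at YouTube, it's worth confirming the proxy is actually in use. Here's a minimal, optional check; httpbin.org/ip is just one convenient IP-echo endpoint you could use:
# Optional sanity check: the IP printed here should belong to the proxy, not your machine  
driver.get("https://httpbin.org/ip")  
print(driver.find_element(By.TAG_NAME, "body").text)  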
With your proxy in place, we can now focus on extracting data. Here's the flow:
Open the Video URL: We'll fetch the video page URL and load it into the driver.
Wait for Page Elements: We'll use WebDriverWait to ensure that the elements we need are fully loaded before extracting any data.
Scroll and Load More: To get all comments, we simulate a user scrolling through the page.
youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)
def extract_information() -> dict:  
    try:  
        element = WebDriverWait(driver, 15).until(  
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))  
        )  
        element.click()  
        time.sleep(10)  
        actions = ActionChains(driver)  
        actions.send_keys(Keys.END).perform()  # Scroll down  
        time.sleep(10)  
        actions.send_keys(Keys.END).perform()  # Scroll again  
        time.sleep(10)  
        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text  
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text  
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text  
        description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])  
        # Additional details  
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text  
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text  
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartination/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text  
        # Scrape comments  
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')  
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')  
        comments = [  
            {"name": comment_names[i].text, "comment": comment_content[i].text}  
            for i in range(len(comment_names))  
        ]  
        data = {  
            'owner': owner,  
            'subscribers': total_number_of_subscribers,  
            'video_title': video_title,  
            'description': description,  
            'date': publish_date,  
            'views': total_views,  
            'likes': number_of_likes,  
            'comments': comments  
        }  
        return data  
    except Exception as err:  
        print(f"Error: {err}")  
This function does all the heavy lifting—pulling video details, the owner’s stats, likes, views, and a collection of comments.
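Note that if anything goes wrong, the function only prints the error and implicitly returns None, so it's worth checking the result before using it. A quick, hypothetical check:
# extract_information() returns None on failure, so guard before using the result  
video_data = extract_information()  
if video_data is not None:  
    print(f"Scraped {len(video_data['comments'])} comments from '{video_data['video_title']}'")  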
Once we've gathered the data, the next step is saving it in an easy-to-use format: JSON.
def organize_write_data(data: dict):  
    # Keep non-ASCII characters (titles, comments) intact; the file is written as UTF-8  
    output = json.dumps(data, indent=2, ensure_ascii=False)  
    try:  
        with open("output.json", 'w', encoding='utf-8') as file:  
            file.write(output)  
    except Exception as err:  
        print(f"Error encountered: {err}")  
This function will neatly store everything into an output.json file, ready for analysis or further processing.
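Once saved, the file can be pulled straight back into Python for analysis. As a minimal sketch (the "tutorial" keyword filter is just an arbitrary example), you might do something like:
import json  

with open("output.json", encoding="utf-8") as file:  
    video_data = json.load(file)  

# For example, list commenters whose comment mentions a keyword  
mentions = [c["name"] for c in video_data["comments"] if "tutorial" in c["comment"].lower()]  
print(mentions)  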
Here's how everything ties together:
# Importing necessary packages  
from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  
# Proxy setup  
proxy_address = "your.proxy.address"  
proxy_username = "your-username"  
proxy_password = "your-password"  
chrome_options = Options()  
# Chrome has no --proxy-auth flag, so authenticated proxies go through selenium-wire's own options  
proxy_options = {'proxy': {  
    'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',  
    'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',  
}}  
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)  
# Target URL  
youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)
def extract_information() -> dict:  
    # Scraping logic (refer to Step 3)  
    return data  
def organize_write_data(data: dict):  
    # Save scraped data to JSON  
    return output  
organize_write_data(extract_information())  
driver.quit()  
Scraping vital YouTube data can be an incredibly powerful way to gain insights into what works—and what doesn't—in your content strategy. With the approach outlined above, you can build a tool that pulls in relevant data on video views, likes, comments, and more—all while staying under the radar. The key? Always use a proxy, respect YouTube's policies, and automate wisely.