How to Scrape Vital YouTube Data for Performance Tracking

SwiftProxy
By Martin Koenig
2025-01-23 15:29:31

More than 500 hours of video are uploaded to YouTube every minute. For creators, analyzing their own content's performance, along with competitor videos, can be an overwhelming task. Manually sifting through all that data? Tedious. That's where automation steps in, especially with a well-crafted YouTube scraping script. Let's dive into building one from scratch.

What You Need to Get Started

To begin scraping, we need the right tools. Python's Selenium is a go-to for automating web browsers, but we'll need a few extra packages to make it all run smoothly. First up:
selenium-wire: A Selenium extension that lets us configure proxies (crucial for avoiding IP bans).
selenium: Standard tool for web automation.
blinker: Pinned to version 1.7.0, because newer releases can break selenium-wire at runtime.
Install them using the command below:

pip install selenium-wire selenium blinker==1.7.0  

Step 1: Importing the Essentials

Next, let's import the libraries that will drive our script.

from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  

Here's what each one does:
selenium.webdriver: Drives the browser and interacts with page elements.
json: Serializes the scraped data into a clean, structured format.
time: Adds delays between actions so the script behaves less like a bot.
ActionChains: Mimics real human scrolling and clicking behavior.

Step 2: Setting Up Your Chrome Driver with a Proxy

YouTube's robots.txt makes it clear that the platform isn't fond of scraping. So, to avoid triggering anti-scraping measures, we need to route our requests through a proxy. Here's how:

Set up proxy credentials.

Pass those credentials to selenium-wire's proxy options.

Launch the browser using Selenium.

proxy_address = "your.proxy.address"  # host:port
proxy_username = "your-username"
proxy_password = "your-password"

# Chrome itself can't take proxy credentials on the command line, so the
# authenticated proxy is handed to selenium-wire instead.
chrome_options = Options()
proxy_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"https://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)

This setup routes every browser request through the authenticated proxy, which helps the script avoid IP-based blocking and keeps it from attracting unwanted attention.
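
Before pointing the driver at YouTube, it's worth a quick check that traffic really leaves through the proxy. Here's a minimal sketch; it assumes the driver created above is still open and uses https://httpbin.org/ip purely as an example IP-echo endpoint (any similar service works).

# Sanity check: the page body should show the proxy's IP, not your own.
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)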

Step 3: Navigating and Scraping Vital YouTube Data

With your proxy in place, we can now focus on extracting data. Here's the flow:

Open the Video URL: We'll fetch the video page URL and load it into the driver.

Wait for Page Elements: We'll use WebDriverWait to ensure that the elements we need are fully loaded before extracting any data.

Scroll and Load More: To load comments, we simulate a user scrolling down the page; each scroll triggers YouTube to load another batch.

youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)

def extract_information() -> dict:  
    try:  
        element = WebDriverWait(driver, 15).until(  
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )  
        element.click()  

        time.sleep(10)  
        actions = ActionChains(driver)  
        actions.send_keys(Keys.END).perform()  # Scroll down  
        time.sleep(10)  
        actions.send_keys(Keys.END).perform()  # Scroll again  
        time.sleep(10)  

        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text
        description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])

        # Additional details  
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartination/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text

        # Scrape comments  
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')

        comments = [  
            {"name": comment_names[i].text, "comment": comment_content[i].text}  
            for i in range(len(comment_names))  
        ]  

        data = {  
            'owner': owner,  
            'subscribers': total_number_of_subscribers,  
            'video_title': video_title,  
            'description': description,  
            'date': publish_date,  
            'views': total_views,  
            'likes': number_of_likes,  
            'comments': comments  
        }  

        return data  
    except Exception as err:
        print(f"Error: {err}")
        return {}

This function does all the heavy lifting—pulling video details, the owner’s stats, likes, views, and a collection of comments.
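
One caveat: numbers such as views, likes, and subscriber counts come back as display strings like "1.2M views" or "356K subscribers". If you plan to track performance over time, a small helper that turns them into integers is handy. The sketch below is illustrative only; parse_count is not part of the script above, and the suffix handling is an assumption about how YouTube abbreviates counts.

def parse_count(text: str) -> int:
    # Convert a display count such as "1.2M views" or "356K" into an integer.
    token = text.split()[0].replace(",", "") if text else "0"
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = token[-1].upper()
    if suffix in multipliers:
        return int(float(token[:-1]) * multipliers[suffix])
    return int(float(token))

For example, parse_count("1.2M views") returns 1200000 and parse_count("356") returns 356.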

Step 4: Saving the Data into a Neat JSON File

Once we've gathered the data, the next step is saving it in an easy-to-use format: JSON.

def organize_write_data(data: dict):
    # Serialize to JSON, dropping any characters (e.g. emoji in comments) that can't be encoded as ASCII.
    output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
    try:  
        with open("output.json", 'w', encoding='utf-8') as file:  
            file.write(output)  
    except Exception as err:  
        print(f"Error encountered: {err}")  

This function neatly stores everything in an output.json file, ready for analysis or further processing.
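
Reading the file back later for analysis is just as simple. A minimal example, assuming output.json was produced by the function above:

import json

with open("output.json", encoding="utf-8") as file:
    video_data = json.load(file)

print(video_data["video_title"], video_data["views"])
print(f"{len(video_data['comments'])} comments scraped")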

The Complete Code

Here's how everything ties together:

# Importing necessary packages  
from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  

# Proxy setup  
proxy_address = "your.proxy.address"  
proxy_username = "your-username"  
proxy_password = "your-password"  
chrome_options = Options()
proxy_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"https://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)

# Target URL  
youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)

def extract_information() -> dict:  
    # Scraping logic (refer to Step 3)  
    return data  

def organize_write_data(data: dict):  
    # Save scraped data to JSON (refer to Step 4)
    return output  

organize_write_data(extract_information())  
driver.quit()  

Final Thoughts

Scraping vital YouTube data can be an incredibly powerful way to gain insights into what works—and what doesn't—in your content strategy. With the approach outlined above, you can build a tool that pulls in relevant data on video views, likes, comments, and more—all while staying under the radar. The key? Always use a proxy, respect YouTube's policies, and automate wisely.
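
If you want to see for yourself what YouTube's robots.txt permits before running anything, Python's built-in urllib.robotparser can check a URL against it. A small sketch; the "*" user agent and the placeholder URL are just examples:

from urllib import robotparser

parser = robotparser.RobotFileParser("https://www.youtube.com/robots.txt")
parser.read()
# Replace the placeholder with the watch-page URL you intend to scrape.
print(parser.can_fetch("*", "your_video_url"))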

About the author

SwiftProxy
Martin Koenig
Head of Commerce
Martin Koenig is an accomplished commercial strategist with over a decade of experience in the technology, telecommunications, and consulting industries. As Head of Commerce, he combines cross-sector expertise with a data-driven mindset to unlock growth opportunities and deliver measurable business impact.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.