How to Scrape Vital YouTube Data for Performance Tracking

SwiftProxy
By - Martin Koenig
2025-01-23 15:29:31


Over 500 hours of video are uploaded to YouTube every minute. For creators, analyzing their own content's performance, along with competitor videos, can be an overwhelming task. Manually sifting through all that data? Tedious. That's where automation steps in, especially with a well-crafted YouTube scraping script. Let's dive into building one from scratch.

What You Need to Get Started

To begin scraping, we need the right tools. Python's Selenium is a go-to for automating web browsers, but we'll need a few extra packages to make it all run smoothly. First up:
selenium-wire: A Selenium extension that lets us configure proxies (crucial for avoiding IP bans).
selenium: Standard tool for web automation.
blinker: Pinned to 1.7.0, since newer releases can break selenium-wire at runtime.
Install them using the command below:

pip install selenium-wire selenium blinker==1.7.0  

Step 1: Importing the Essentials

Next, let's import the libraries that will drive our script.

from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  

Here's what each one does:
selenium.webdriver: Interacts with web elements.
json: Converts our scraped data into a clean format.
time: Helps us add delays, so we don't look like a robot.
ActionChains: For mimicking real human scrolling and clicking behaviors.

Step 2: Setting Up Your Chrome Driver with a Proxy

YouTube's robots.txt file makes it clear—they're not fond of scraping. So, to avoid triggering anti-scraping measures, we need to route our requests through a proxy. Here's how:

Set up proxy credentials.

Pass the proxy address and credentials to selenium-wire, which routes the browser's traffic through the proxy.

Launch the browser using Selenium.

proxy_address = "your.proxy.address"
proxy_username = "your-username"
proxy_password = "your-password"

chrome_options = Options()

# Chrome has no flag for authenticated proxies, so the credentials go to selenium-wire
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

This setup ensures your script remains stealthy and doesn't attract unwanted attention.
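
Before pointing the driver at YouTube, it's worth a quick check that traffic actually leaves through the proxy. The snippet below is an optional sketch: it loads an IP-echo service (httpbin.org/ip is just one example) and prints the address it reports. If you see your own IP, the proxy isn't being used.

# Optional sanity check: httpbin.org/ip echoes the caller's public IP as JSON
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)  # should show the proxy's IP, not yours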

Step 3: Navigating and Scraping Vital YouTube Data

With your proxy in place, we can now focus on extracting data. Here's the flow:

Open the Video URL: We'll fetch the video page URL and load it into the driver.

Wait for Page Elements: We'll use WebDriverWait to ensure that the elements we need are fully loaded before extracting any data.

Scroll and Load More: To get all comments, we simulate a user scrolling through the page.

youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)

def extract_information() -> dict:
    try:
        # Expand the video description so the full text and date are present in the DOM
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )
        element.click()

        # Scroll a few times so YouTube lazy-loads the comment section
        time.sleep(10)
        actions = ActionChains(driver)
        actions.send_keys(Keys.END).perform()  # Scroll down
        time.sleep(10)
        actions.send_keys(Keys.END).perform()  # Scroll again
        time.sleep(10)

        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text
        description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])

        # Additional details
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartination/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text

        # Scrape comments
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')

        comments = [
            {"name": comment_names[i].text, "comment": comment_content[i].text}
            for i in range(len(comment_names))
        ]

        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comments
        }

        return data
    except Exception as err:
        print(f"Error: {err}")
        return {}  # Return an empty dict so the caller still receives a dict on failure

This function does all the heavy lifting—pulling video details, the owner’s stats, likes, views, and a collection of comments.
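
The fixed time.sleep calls work, but they're a blunt instrument: too short and comments get missed, too long and the script crawls. A more robust (if still hedged) alternative is to keep pressing END until the number of loaded comments stops growing. The max_rounds and pause values below are illustrative, not tuned:

def load_all_comments(max_rounds: int = 20, pause: float = 3.0) -> None:
    """Scroll until the comment count stops growing or max_rounds is reached."""
    actions = ActionChains(driver)
    previous_count = -1
    for _ in range(max_rounds):
        actions.send_keys(Keys.END).perform()
        time.sleep(pause)
        current_count = len(driver.find_elements(By.XPATH, '//*[@id="content-text"]/span'))
        if current_count == previous_count:
            break  # nothing new loaded; assume we've reached the last comment
        previous_count = current_count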

Step 4: Saving the Data into a Neat JSON File

Once we've gathered the data, the next step is saving it in an easy-to-use format: JSON.

def organize_write_data(data: dict):
    # ensure_ascii=False keeps emoji and non-English comments readable in the file
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")

This function will neatly store everything in an output.json file, ready for analysis or further processing.
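
Once the file exists, it plugs straight into whatever analysis you prefer. As a quick example (the field names match the dictionary built in Step 3), here's how you might load it back and print a few headline numbers:

import json

with open("output.json", encoding="utf-8") as f:
    video = json.load(f)

print(f"{video['video_title']} by {video['owner']}")
print(f"Views: {video['views']} | Likes: {video['likes']}")
print(f"Comments scraped: {len(video['comments'])}")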

The Complete Code

Here's how everything ties together:

# Importing necessary packages  
from selenium.webdriver.chrome.options import Options  
from seleniumwire import webdriver as wiredriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.action_chains import ActionChains  
import json  
import time  

# Proxy setup  
proxy_address = "your.proxy.address"  
proxy_username = "your-username"  
proxy_password = "your-password"  
chrome_options = Options()
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

# Target URL  
youtube_url_to_scrape = "your_video_url"  
driver.get(youtube_url_to_scrape)

def extract_information() -> dict:
    ...  # Scraping logic from Step 3 goes here

def organize_write_data(data: dict):
    ...  # JSON-writing logic from Step 4 goes here

organize_write_data(extract_information())  
driver.quit()  
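
To use this for real performance tracking, you'll usually want more than one video. A small extension, sketched below with a placeholder URL list and filename pattern, swaps the last two lines of the complete code for a loop over your own uploads and competitor videos, writing one JSON file per video:

# Hypothetical list of videos to track: your own uploads plus a few competitors'
urls_to_track = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
]

for index, url in enumerate(urls_to_track):
    driver.get(url)
    video_data = extract_information()
    # One JSON file per video keeps later comparisons simple
    with open(f"video_{index}.json", "w", encoding="utf-8") as f:
        json.dump(video_data, f, indent=2, ensure_ascii=False)

driver.quit()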

Final Thoughts

Scraping vital YouTube data can be an incredibly powerful way to gain insights into what works—and what doesn't—in your content strategy. With the approach outlined above, you can build a tool that pulls in relevant data on video views, likes, comments, and more—all while staying under the radar. The key? Always use a proxy, respect YouTube's policies, and automate wisely.

About the Author

SwiftProxy
Martin Koenig
Head of Commerce
Martin Koenig is a seasoned business strategist with more than a decade of experience across the technology, telecommunications, and consulting industries. As Head of Commerce, he combines cross-industry expertise with a data-driven mindset to uncover growth opportunities and create measurable business value.
The content on the Swiftproxy blog is provided for informational purposes only and comes with no warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, and accepts no responsibility for the content of third-party websites referenced in the blog. Before undertaking any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and to read the target website's terms of service carefully. In some cases, explicit authorization or scraping permission may be required.