How to Use Playwright to Scrape Glassdoor Data

SwiftProxy
By Linh Tran
2025-02-28 15:35:24

Imagine accessing thousands of job listings, salary figures, and employer reviews at your fingertips. That's the power of Glassdoor, one of the most valuable resources for job seekers and recruiters alike. But here's the catch: scraping Glassdoor's rich data isn't as simple as it sounds. With strong anti-bot measures in place, traditional approaches like Python's requests library can quickly land you in the "blocked" zone.
This is where Playwright comes into play. It's not just another scraping tool: it drives a real browser session and simulates human browsing behavior, sidestepping typical blockers like CAPTCHAs and IP bans. Let's dive into scraping job listings from Glassdoor using Python and Playwright.

Why Use Playwright for Scraping

If you've tried scraping Glassdoor before, you know the challenges. The site's anti-bot mechanisms can easily flag suspicious activity. That's why Playwright is crucial: it lets us control a real browser session, complete with proxies and browser headers, so our requests look like they come from a genuine visitor. This dramatically reduces the chance of detection and lets you gather data smoothly.
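
As a quick illustration of what "looking genuine" means in practice, here is a minimal sketch that launches Chromium through a proxy and sets a realistic user agent and viewport. The proxy address and user-agent string below are placeholders, not working values:

import asyncio
from playwright.async_api import async_playwright

async def open_disguised_page():
    async with async_playwright() as p:
        # Route traffic through a proxy (placeholder address) so Glassdoor
        # never sees your own IP directly.
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://your_proxy_server:8080"},
        )
        # A browser context carries the user agent and viewport, making
        # requests resemble an ordinary desktop browsing session.
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1366, "height": 768},
        )
        page = await context.new_page()
        await page.goto("https://www.glassdoor.com", timeout=60000)
        print(await page.title())
        await browser.close()

asyncio.run(open_disguised_page())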

Necessary Tools and Resources

Before we get into the nitty-gritty of the code, here's what you need to set up:

· Python: We'll be using Python, of course.

· Playwright: The tool that makes browser automation possible.

· lxml: A fast and efficient library for parsing HTML.

To install these dependencies, run the following commands:
pip install playwright lxml
playwright install

How to Scrape Glassdoor Job Listings Effectively

Let's break it down. We're going to use Playwright to launch a browser, navigate to the job listings page, extract the job data, and save it into a CSV file.

Step 1: Configuring Playwright and Making Requests

We begin by launching a browser with Playwright. Don't forget to use a proxy to avoid getting blocked.

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible Chromium session through a proxy; replace the
        # placeholder values with your own proxy credentials.
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password",
            },
        )
        page = await browser.new_page()
        # The generous timeout gives slow proxy connections time to load the page.
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000,
        )
        content = await page.content()  # full rendered HTML of the results page
        await browser.close()
        return content

The key here is launching the browser with a proxy, which will help us bypass detection. We navigate to the desired job listings page, retrieve the HTML content, and then close the browser.
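
Since scrape_job_listings is an async coroutine, it needs an event loop to run. One simple way to call it and capture the HTML for the next step is shown below; the html_content name matches the variable the parsing code in Step 2 expects:

import asyncio

# Run the coroutine to completion and keep the rendered HTML for parsing.
html_content = asyncio.run(scrape_job_listings())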

Step 2: Parsing the HTML to Extract Data

Now that we have the page's HTML, we can use lxml to parse and extract the job details.

parser = fromstring(html_content)
# Each job card is an <li> element marked with data-test="jobListing".
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    # Hrefs are relative, so prepend the domain to get a clickable link.
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # This hashed class name is generated by Glassdoor's front-end build and
    # may change at any time; prefer data-test attributes where they exist.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)

Here, we loop through each job posting, extracting key details like job title, location, salary, and company name.
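
One caveat: the [0] indexing above assumes every field is present in every card. A listing without a salary, or a renamed class, would raise an IndexError and stop the loop. A small helper makes the extraction tolerant of missing fields; first_or_default is a hypothetical name introduced here for illustration, not part of lxml:

def first_or_default(element, xpath, default=''):
    # Return the first XPath match, or a default when the field is absent.
    matches = element.xpath(xpath)
    return matches[0] if matches else default

# Example usage inside the loop: salary is the field most often missing.
job_title = first_or_default(element, './/a[@data-test="job-title"]/text()')
salary = first_or_default(element, './/div[@data-test="detailSalary"]/text()')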

Step 3: Saving the Data

Once we've collected the job data, we save it into a CSV file for further analysis.

import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)

This will save all the extracted data in a neat CSV file, which you can then analyze or import into a database.
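
If you want a quick look at the results, here is a short sketch using pandas, assuming it is installed (pip install pandas):

import pandas as pd

# Load the scraped listings and print a quick summary.
df = pd.read_csv('glassdoor_job_listings.csv')
print(df.head())
print(df['easy_apply'].value_counts())  # how many listings offer Easy Apply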

Full Code in Action

Here's the complete code all put together:

import asyncio
import csv

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password",
            },
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        content = await page.content()
        await browser.close()
        
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
        
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)

if __name__ == '__main__':
    asyncio.run(scrape_job_listings())

Key Takeaways

· Playwright is the game-changer for scraping Glassdoor. It allows us to bypass detection by simulating real browser behavior.

· Proxies and headers are essential to avoid getting blocked.

· After scraping, you can easily store and analyze the data in CSV format.

Conclusion

Remember, scraping is not a free-for-all. Always ensure that your actions align with Glassdoor's terms of service. To be respectful of their resources, implement rate limits to avoid bombarding the site with too many requests in a short period. Using rotating proxies can help minimize the risk of being flagged, and ethical scraping practices should always be followed. Regularly review the terms of service to ensure that your methods to scrape Glassdoor data comply with them.
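
As a minimal sketch of what polite scraping can look like in code, the snippet below pauses a random interval between page loads and rotates across a proxy pool. The delay range and proxy addresses are arbitrary placeholders, not Glassdoor-sanctioned values:

import asyncio
import random

PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholder addresses

async def polite_pause():
    # Sleep a random 5-15 seconds between requests so traffic stays
    # human-paced and gentle on the site's servers.
    await asyncio.sleep(random.uniform(5, 15))

def pick_proxy():
    # Choose a proxy at random so requests are spread across several IPs.
    return {"server": random.choice(PROXIES)}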

About the Author

Linh Tran
Linh Tran is a technical writer based in Hong Kong with a background in computer science and more than eight years of experience in digital infrastructure. At Swiftproxy, she specializes in making complex proxy technologies approachable, delivering clear, actionable insights for businesses navigating the fast-moving data landscape in Asia and beyond.
Senior Technology Analyst at Swiftproxy
The content on the Swiftproxy blog is provided for informational purposes only and comes with no warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it assume responsibility for the content of third-party sites referenced in the blog. Before undertaking any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and to review the target site's applicable terms of service. In some cases, explicit permission or a scraping license may be required.