How to get and parse proxies from a URL in a Python 3 environment

SwiftProxy
By Emily Chan
2024-12-09 20:33:16

In the process of web crawling and data collection, proxy servers play a vital role. They can help us bypass IP restrictions, hide our true identity, and improve the efficiency of crawling. This article will detail how to obtain and parse proxy information from URLs in a Python 3 environment for use in subsequent crawling tasks.

What is a proxy?

A proxy server is an intermediary server located between a client and a server. It receives requests from clients, forwards them to the target server, and returns the server's response to the client. Using a proxy can hide our real IP address and prevent being blocked or restricted by the target website.
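
For example, the requests library can route a request through a proxy by passing a proxies dictionary. This is a minimal sketch; the proxy address below is a placeholder for illustration, not a real proxy:

import requests

# Placeholder proxy address; replace it with a real proxy host and port
proxy = 'http://203.0.113.10:8080'

# Route both HTTP and HTTPS traffic through the proxy
response = requests.get(
    'http://httpbin.org/ip',
    proxies={'http': proxy, 'https': proxy},
    timeout=5,
)
print(response.json())  # shows the IP address the target server saw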

Install related libraries

Before we start, make sure that Python 3 is installed, along with the requests library for sending network requests and BeautifulSoup for parsing HTML. Both can be installed easily with pip:

pip install requests beautifulsoup4

Get proxy list from URL

First, we need a URL that contains proxy information. This could be a site that publishes free proxies or a paid proxy service. We will use the requests library to send an HTTP request and fetch the page content.

import requests

# Suppose we have a URL containing a list of proxies
proxy_url = 'http://example.com/proxies'

# Send a GET request to obtain the web page content; the timeout
# prevents the call from hanging on an unresponsive server
response = requests.get(proxy_url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
else:
    print(f"Failed to retrieve page with status code: {response.status_code}")
    exit()

Parsing proxy information

Next, we need to parse the web page content to extract the proxy information. This usually involves parsing HTML, and we can use the BeautifulSoup library to accomplish this task.

from bs4 import BeautifulSoup

# Parsing web content using BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')

# Assume the proxy information is stored in a specific HTML table format,
# for example: <tr><td>IP address</td><td>Port</td></tr>
proxies = []
for row in soup.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) == 2:
        ip = columns[0].text.strip()    # first cell: the IP address
        port = columns[1].text.strip()  # second cell: the port
        proxies.append(f"{ip}:{port}")

# Print the parsed proxy list
print(proxies)

Verify proxies

After getting the list of proxies, we need to verify that they actually work. This can be done by requesting a test URL through each proxy and checking the response.

import requests

# Define a test URL
test_url = 'http://httpbin.org/ip'

# Verify each proxy
valid_proxies = []
for proxy in proxies:
    try:
        # requests expects proxy URLs with a scheme (http://ip:port); the
        # timeout keeps unresponsive proxies from hanging the loop
        response = requests.get(
            test_url,
            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
            timeout=5,
        )
        if response.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Valid proxy: {proxy}")
        else:
            print(f"Invalid proxy: {proxy} (Status code: {response.status_code})")
    except requests.exceptions.RequestException as e:
        print(f"Error testing proxy {proxy}: {e}")

Conclusion

Through the steps above, we have obtained and parsed proxy information from a URL and verified which proxies actually work. These proxies can now be used in our crawling tasks to improve their efficiency and stability. Keep in mind that you should comply with the target website's usage policies and with applicable laws and regulations so that your crawling remains legal and ethical.
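
As a simple illustration of that last step, the validated proxies can be rotated across requests. The following is a minimal sketch that assumes a non-empty valid_proxies list from the verification step above; the crawl URLs are hypothetical placeholders:

import random
import requests

# Hypothetical pages to crawl; valid_proxies comes from the verification step
urls_to_crawl = ['http://example.com/page1', 'http://example.com/page2']

for url in urls_to_crawl:
    # Pick a different proxy at random for each request
    proxy = random.choice(valid_proxies)
    try:
        response = requests.get(
            url,
            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
            timeout=5,
        )
        print(f"{url} -> {response.status_code} via {proxy}")
    except requests.exceptions.RequestException as e:
        print(f"Request to {url} via {proxy} failed: {e}")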

About the author

SwiftProxy
Emily Chan
Editor-in-Chief at Swiftproxy
Emily Chan is the Editor-in-Chief at Swiftproxy, with more than a decade of experience in technology, digital infrastructure, and strategic communication. Based in Hong Kong, she combines deep regional knowledge with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy blog is for informational purposes only and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it assume responsibility for the content of third-party sites referenced in the blog. Before engaging in any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and review the applicable terms of service of the target site. In some cases, explicit authorization or a scraping permit may be required.