
[Python] urllib: Sending Requests

news 2026/2/17 6:38:59

urllib is a package in the Python standard library for working with URLs and making HTTP requests. It consists of four submodules: urllib.request, urllib.parse, urllib.error, and urllib.robotparser.

GET requests

GET is the most common HTTP request method; it retrieves data from a server.

import urllib.request

# Send a GET request to the given URL
url = 'http://example.com'
response = urllib.request.urlopen(url)

# Read the response body and decode it
html = response.read().decode('utf-8')
print(html)

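urlopen(url): sends a GET request to the given URL and returns a response object.
read(): reads the response body as bytes (often HTML).
decode('utf-8'): decodes the bytes into a string; this assumes the server sent UTF-8.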
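The three steps above can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the name fetch_text, the timeout value, and the fallback encoding are assumptions made for illustration. It closes the response with a with-statement and takes the charset from the Content-Type header when the server provides one.

```python
import urllib.request


def fetch_text(url, timeout=10.0, fallback_encoding="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # The with-statement closes the connection even if reading fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
        return response.read().decode(charset)
```

Calling fetch_text('http://example.com') would return the page as a string. The helper also accepts data: URLs, which is handy for trying it without network access.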

POST requests

POST requests send data, such as form fields, to a server.

import urllib.request
import urllib.parse

url = 'http://example.com/api'
data = {'key1': 'value1', 'key2': 'value2'}
# Encode the data as a URL-encoded string, then convert it to bytes
data = urllib.parse.urlencode(data).encode('utf-8')

# Send the POST request with the data attached
response = urllib.request.urlopen(url, data=data)

# Read the response body and decode it
result = response.read().decode('utf-8')
print(result)

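urlencode(data): encodes the data in URL-encoded form (for example, key1=value1&key2=value2).
urlopen(url, data=data): sends a POST request; urlopen switches from GET to POST whenever data is supplied. The data argument must be bytes, not str.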
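To see how the body determines the method, a request can be built and inspected without sending it. The sketch below is illustrative only: build_form_request and build_json_request are hypothetical helper names, and whether the endpoint at http://example.com/api accepts JSON is an assumption.

```python
import json
import urllib.parse
import urllib.request


def build_form_request(url, fields):
    """Build (but do not send) a POST request with a form-encoded body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


def build_json_request(url, payload):
    """Build (but do not send) a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


form_req = build_form_request(
    "http://example.com/api", {"key1": "value1", "key2": "value2"}
)
print(form_req.get_method())  # POST, because a body is attached
print(form_req.data)          # b'key1=value1&key2=value2'
```

Either request object can then be passed to urllib.request.urlopen in place of a URL string.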

Setting request headers

Sometimes a request needs to look like it comes from a browser; custom request headers make this possible.

import urllib.request

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}

# Create a Request object that carries the custom headers
req = urllib.request.Request(url, headers=headers)

# Send the request
response = urllib.request.urlopen(req)
# Read the response body and decode it
html = response.read().decode('utf-8')
print(html)

Request(url, headers=headers): creates a request object that carries the custom HTTP headers.
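Headers can also be added after the Request object exists, and they can be inspected before anything is sent. The following sketch shows one detail that is easy to trip over: Request stores header names in capitalized form, so lookups are case-sensitive. The Accept-Language value is an assumption chosen for illustration.

```python
import urllib.request

req = urllib.request.Request(
    "http://example.com", headers={"User-Agent": "Mozilla/5.0"}
)
# Headers can also be added after construction.
req.add_header("Accept-Language", "en-US")

# Request stores names via str.capitalize(), so "User-Agent" becomes "User-agent".
print(req.get_header("User-agent"))  # Mozilla/5.0
print(req.has_header("User-Agent"))  # False: the lookup is case-sensitive
print(sorted(req.header_items()))
```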

Handling exceptions

Network requests can fail, so handling error cases (such as a 404 response) is important.

import urllib.request
import urllib.error

url = 'http://example.com/nonexistent'
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(f'HTTP Error: {e.code} - {e.reason}')
except urllib.error.URLError as e:
    print(f'URL Error: {e.reason}')

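HTTPError: raised for HTTP error statuses (404, 500, and so on); it exposes the status code and the reason.
URLError: raised for network-level failures, such as a hostname that cannot be resolved. HTTPError is a subclass of URLError, so the HTTPError clause must come first.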
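One way to keep this handling in a single place is a wrapper that returns a result instead of raising. This is a sketch under stated assumptions: try_fetch and its 'ok' / 'http_error' / 'url_error' labels are made up for this example, and the foo:// URL is used only because an unsupported scheme fails locally without network traffic.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10.0):
    """Return (kind, detail) instead of raising (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "ok", response.read()
    except urllib.error.HTTPError as e:
        # Must be caught before URLError, which is its base class.
        return "http_error", e.code
    except urllib.error.URLError as e:
        return "url_error", str(e.reason)


# An unsupported scheme raises URLError before any connection is attempted.
print(try_fetch("foo://example.com/"))
```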

Parsing URLs

urllib.parse can both parse and build URLs.

import urllib.parse

url = 'http://www.example.com/path/to/page?name=ferret&color=purple#section2'
parsed_url = urllib.parse.urlparse(url)
print(parsed_url)

urlparse(url): parses a URL and returns its components (scheme, network location, path, parameters, query string, and fragment) as a named tuple.
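The individual components are available as attributes of the result, and the query string can be split further with parse_qs. A short sketch using the same URL:

```python
import urllib.parse

url = "http://www.example.com/path/to/page?name=ferret&color=purple#section2"
parts = urllib.parse.urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /path/to/page
print(parts.query)     # name=ferret&color=purple
print(parts.fragment)  # section2

# parse_qs turns the query string into a dict of lists,
# because a key may appear more than once.
query = urllib.parse.parse_qs(parts.query)
print(query)           # {'name': ['ferret'], 'color': ['purple']}
```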

Building a URL:

import urllib.parse

base_url = 'http://www.example.com/path'
params = {'name': 'ferret', 'color': 'purple'}
query_string = urllib.parse.urlencode(params)

# Build the full URL
url = f'{base_url}?{query_string}'
print(url)

urlencode(): encodes a dictionary as a URL query string.
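String formatting works for simple cases, but urllib.parse has a few more tools for building URLs. The sketch below uses made-up parameter values to show escaping of reserved characters, list values with doseq=True, replacing one component of a parsed URL, and resolving a relative path with urljoin.

```python
import urllib.parse

# urlencode escapes reserved characters; doseq=True expands list values
# into repeated keys.
params = {"name": "ferret & friends", "color": ["purple", "green"]}
query_string = urllib.parse.urlencode(params, doseq=True)
print(query_string)  # name=ferret+%26+friends&color=purple&color=green

# _replace swaps one component of a parsed URL; geturl() reassembles it.
base = urllib.parse.urlparse("http://www.example.com/path")
full_url = base._replace(query=query_string).geturl()
print(full_url)

# urljoin resolves a relative reference against a base URL.
joined = urllib.parse.urljoin("http://www.example.com/path/", "to/page")
print(joined)        # http://www.example.com/path/to/page
```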

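Handling cookies

Cookies play an important role in HTTP sessions. urllib can store and resend them automatically once an opener is built with an HTTPCookieProcessor.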

import urllib.request
import http.cookiejar

# Create a CookieJar object to hold the cookies
cookie_jar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)

# Send the request through the opener
url = 'http://example.com'
response = opener.open(url)

# Show the cookies
for cookie in cookie_jar:
    print(cookie)

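HTTPCookieProcessor(cookie_jar): a handler that stores cookies from responses in the jar and attaches them to later requests.
CookieJar: the object that stores and manages the cookies.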
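To watch the cookie round trip without relying on an external site, the sketch below starts a throwaway server on the loopback interface. Everything about the server is an assumption made for this demonstration: the CookieEchoHandler class, the session=abc123 cookie, and the demo_cookie_round_trip function are all hypothetical. The server sets a cookie on every response and echoes back whatever Cookie header it received.

```python
import http.cookiejar
import http.server
import threading
import urllib.request


class CookieEchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie, echoes the Cookie header."""

    def do_GET(self):
        body = self.headers.get("Cookie", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


def demo_cookie_round_trip():
    """Request the same URL twice and report what the server saw each time."""
    server = http.server.HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # bypass any configured proxy
            urllib.request.HTTPCookieProcessor(jar),
        )
        with opener.open(url, timeout=5) as first:
            first_seen = first.read().decode("utf-8")
        with opener.open(url, timeout=5) as second:
            second_seen = second.read().decode("utf-8")
        return first_seen, second_seen, {c.name: c.value for c in jar}
    finally:
        server.shutdown()
        server.server_close()
```

On the first request the jar is empty, so the server sees no Cookie header; on the second, the opener sends back the cookie it stored from the first response.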

Downloading files

urllib.request.urlretrieve downloads a file in a single call.

import urllib.request

url = 'http://example.com/somefile.zip'
local_filename, headers = urllib.request.urlretrieve(url, 'somefile.zip')
print(f'Downloaded file: {local_filename}')

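urlretrieve(url, filename): downloads the resource at the URL and saves it to the local file system. It returns the local file name and the response headers. The Python documentation lists it as a legacy interface that may be deprecated in the future.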
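An alternative that avoids urlretrieve is to combine urlopen with shutil.copyfileobj. This is a sketch of that approach, not a drop-in replacement: the download function name and the timeout default are assumptions for illustration.

```python
import shutil
import urllib.request


def download(url, destination, timeout=30.0):
    """Stream the response body to a local file and return the path."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(destination, "wb") as out_file:
            # copyfileobj copies in fixed-size chunks, so a large file
            # is never held in memory all at once.
            shutil.copyfileobj(response, out_file)
    return destination
```

For the file above, the call would be download('http://example.com/somefile.zip', 'somefile.zip').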

Setting a proxy

When external resources must be reached through a proxy server, use ProxyHandler.

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://10.10.1.10:3128/'})
opener = urllib.request.build_opener(proxy_handler)

response = opener.open('http://example.com')
print(response.read().decode('utf-8'))

ProxyHandler({'protocol': 'proxy_url'}): sets the proxy for each listed protocol (such as http or https). Protocols that are not listed connect directly.
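The same pattern extends to several protocols, and an empty dictionary turns proxying off. In this sketch, make_proxy_opener is a hypothetical helper and the https proxy address is invented for illustration; only the http address comes from the example above. No request is sent.

```python
import urllib.request


def make_proxy_opener(proxies):
    """Build an opener that routes each listed scheme through its proxy."""
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# The https entry is a made-up address for illustration.
proxy_map = {
    "http": "http://10.10.1.10:3128/",
    "https": "http://10.10.1.10:1080/",
}
proxied = make_proxy_opener(proxy_map)

# An empty dict disables proxies, including those set in environment variables.
direct = make_proxy_opener({})

# With no argument, ProxyHandler reads settings such as http_proxy from the
# environment; getproxies() shows what it would find on this machine.
print(urllib.request.getproxies())
```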

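Handling SSL verification

urllib verifies SSL/TLS certificates by default. In some situations, such as an environment that uses self-signed certificates, you may want to skip verification. Doing so removes the protection against man-in-the-middle attacks, so it is best limited to testing.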

import urllib.request
import ssl

# Skip SSL certificate verification
context = ssl._create_unverified_context()
response = urllib.request.urlopen('https://example.com', context=context)
print(response.read().decode('utf-8'))

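_create_unverified_context(): creates an SSL context that performs no certificate verification. The leading underscore marks it as a private function of the ssl module.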
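The same effect can be reached through the public ssl API, and for self-signed certificates there is usually a safer option: keep verification on and trust the specific certificate. The sketch below shows both; the function names and the ca_file parameter are assumptions for illustration, and no connection is made.

```python
import ssl


def make_unverified_context():
    """Public-API equivalent of an unverified context. For testing only."""
    context = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def make_context_trusting(ca_file):
    """Keep verification on, but trust the certificate(s) stored in ca_file."""
    return ssl.create_default_context(cafile=ca_file)


default_context = ssl.create_default_context()
print(default_context.verify_mode == ssl.CERT_REQUIRED)  # True
print(default_context.check_hostname)                    # True
```

Either context can be passed to urllib.request.urlopen through its context parameter, exactly as in the example above.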

Respecting robots.txt

The urllib.robotparser module parses a site's robots.txt file so that a crawler can follow its rules.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# Check whether a given URL may be crawled
url = 'http://example.com/somepage'
user_agent = 'MyBot'
can_fetch = rp.can_fetch(user_agent, url)
print(f'Can fetch: {can_fetch}')
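The parser can also be fed rules directly through parse(), which makes it easy to see how can_fetch decides. The robots.txt content below is hypothetical and written only for this sketch; it is not the actual file served by example.com.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here instead of a download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyBot", "http://example.com/somepage"))      # True
print(rp.can_fetch("MyBot", "http://example.com/private/data"))  # False
print(rp.crawl_delay("MyBot"))                                   # 2
```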