Note: most of the code examples in this article are taken from 《Python3 网络爬虫开发实战(第2版)》.
一、GET Requests
import requests

res = requests.get('https://www.python.org')
print(type(res))                    # <class 'requests.models.Response'>
print(res.status_code)              # 200
print(type(res.text))               # <class 'str'>
print(res.text[:150])
print(res.encoding)                 # utf-8
print(res.headers['content-type'])  # text/html; charset=utf-8
print(res.cookies)                  # a RequestsCookieJar object
- Extension: other request types
r = requests.get('http://www.httpbin.org/get')
r = requests.post('http://www.httpbin.org/post')
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://www.httpbin.org/delete')
r = requests.patch('http://www.httpbin.org/patch')
1、Basic example
import requests

r = requests.get('https://www.httpbin.org/get')
print(r.text)
2、Passing parameters with params
import requests

data = {
    'name': 'germey',
    'age': 25
}
r = requests.get('https://www.httpbin.org/get', params=data)
print(r.text)
3、JSON response content
import requests

r = requests.get('https://www.httpbin.org/get')
print(type(r.text))    # <class 'str'>
print(r.json())
print(type(r.json()))  # <class 'dict'>
If the response body is not in JSON format, parsing it fails:
import requests

res = requests.get('https://www.baidu.com')
"""
Note: if the returned content is not JSON, parsing fails and a
json.decoder.JSONDecodeError exception is raised.
"""
print(res.json())  # raises an error
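If you are not sure whether a response body is JSON, a defensive pattern like the following sketch can help (it catches ValueError, the parent class of json.decoder.JSONDecodeError; the URL is just the example above):

import requests

res = requests.get('https://www.baidu.com')
try:
    data = res.json()
except ValueError:
    # json.decoder.JSONDecodeError is a subclass of ValueError, so this also
    # catches the parsing error raised when the body is not valid JSON
    data = None
    print('not JSON, content-type:', res.headers.get('content-type'))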
4、Raw response content
import requests

r = requests.get('https://api.github.com/events', stream=True)
print(r.raw.read())
Extension: save the streamed content to a file with the following pattern (downloading an image):
import requests

# stream=True avoids loading the whole body into memory before writing it out
r = requests.get('https://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png', stream=True)
with open('requests-sidebar.png', mode='wb') as f:
    for chunk in r.iter_content(chunk_size=20):
        f.write(chunk)
5、Binary response content
import requests

r = requests.get('https://www.sogou.com')
print(r.text)     # text content (str)
print(r.content)  # binary content (bytes)
Extension: download an image as binary data:
import requests

r = requests.get('https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png')
with open('baidu_logo.png', mode='wb') as f:
    f.write(r.content)
- Create an image from the binary data returned by a request
import requests
from PIL import Image
from io import BytesIO

r = requests.get('https://pic.ntimg.cn/file/20220402/19727910_161258533101_2.jpg')
img = Image.open(BytesIO(r.content))
# open the image for a look
img.show()
# save the image to local disk
img.save('BytesIO_IMG.png')
6、Adding request headers
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36'
}
r = requests.get('https://www.so.com', headers=headers)
print(r.text)
Note: custom headers have lower priority than certain more specific sources of information, for example (a sketch after this list illustrates the auth= case):
- If authentication credentials are set in .netrc, the authorization set via headers= does not take effect; if the auth= parameter is set, the .netrc settings are in turn ignored.
- If you are redirected to another host, the authorization header is removed.
- Proxy authorization headers are overridden by proxy credentials provided in the URL.
- The Content-Length header is rewritten whenever the content length can be determined.
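As a rough illustration of this precedence: the Authorization header generated by the auth= parameter replaces one set manually through headers=. A minimal sketch (assuming https://httpbin.org/headers echoes back the request headers):

import requests

# Authorization set manually via headers=
headers = {'Authorization': 'Bearer my-manual-token'}
# auth= generates its own Basic Authorization header, which takes precedence
r = requests.get('https://httpbin.org/headers', headers=headers, auth=('user', 'pass'))
print(r.json()['headers'].get('Authorization'))  # expected to start with 'Basic ', not 'Bearer'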
7、Scraping a web page
import requests
import re

r = requests.get('https://ssr1.scrape.center')
# match the movie titles inside <h2> tags
pattern = re.compile('<h2.*?>(.*?)</h2>', re.S)
titles = re.findall(pattern, r.text)
print(titles)
二、POST Requests
1、Passing a dict to data
import requests

data = {'name': 'germey', 'age': '25'}
r = requests.post('https://httpbin.org/post', data=data)
print(r.text)
2、Passing a JSON string to data
import requests
import json

url = 'https://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)
3、Passing JSON via the json parameter
import requests

url = 'https://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, json=payload)
print(r.text)
4、Accessing the response
a、Status code
import requests

r = requests.get('https://www.qq.com/')
# the numeric status code 200 tells us the request succeeded
print(type(r.status_code), r.status_code)  # <class 'int'> 200
- Extension: r.status_code == requests.codes.ok
import requests

r = requests.get('https://www.baidu.com/')
# exit if the status code does not indicate a successful request
exit() if not r.status_code == requests.codes.ok else print('request successfully')
The output is as follows:
request successfully
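An alternative to checking the code yourself is Response.raise_for_status(), which raises requests.exceptions.HTTPError for 4xx/5xx responses; a brief sketch using the same URL as above:

import requests

r = requests.get('https://www.baidu.com/')
# raises requests.exceptions.HTTPError for 4xx/5xx status codes, otherwise does nothing
r.raise_for_status()
print('request successfully')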
b、Response headers
import requests

r = requests.get('https://www.qq.com/')
# response headers
print(type(r.headers), r.headers)
print(r.headers['content-type'])
print(r.headers.get('content-type'))
c、Cookie
import requests

r = requests.get('https://www.qq.com/')
# cookies
print(type(r.cookies), r.cookies)
d、URL
import requests

r = requests.get('https://www.qq.com/')
# URL
print(type(r.url), r.url)
e、Request history
import requests

r = requests.get('https://www.qq.com/')
# request history
print(type(r.history), r.history)
三、Advanced Usage
1、File upload
import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post('https://www.httpbin.org/post', files=files)
if r.status_code == requests.codes.ok:
    print(r.text)
2、Getting cookies
import requests

r = requests.get('https://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)
3、Setting cookies
# Python version: 3.6
# -*- coding:utf-8 -*-
import requests

"""
Copy the cookie from the browser's developer tools and set it in the
request headers.
"""
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/98.0.4758.80 Safari/537.36',
    'Cookie': '__yjs_duid=1_86f7e6bca32c93983e1d41bcd81ba1a61643807279657; '
              'BAIDUID=0B4418964BD37E97784AE7BADE39438E:FG=1; BAIDUID_BFESS=0B4418964BD37E9747028796C4E53DBE:FG=1; '
              'BIDUPSID=0B4418964BD37E97784AE7BADE39438E; PSTM=1643894126; BD_UPN=12314753; '
              'COOKIE_SESSION=332_0_1_5_0_1_0_0_1_1_0_0_0_0_0_0_0_0_1643981338|5#0_0_1643981338|1; '
              'baikeVisitId=20c3652e-6429-4306-9c4f-631ab804fa2c; BD_HOME=1; '
              'H_PS_PSSID=35411_35106_34584_35490_*****_35322_26350_35752_35746; BA_HECTOR=81al04ag8k8h0ha04m1gvsv740r',
}
r = requests.get('https://www.baidu.com', headers=headers)
print(r.text)
# Inspect the cookies; they match the ones we set, e.g.
# "H_PS_PSSID=35411_35106_34584_35490_*****_35322_26350_35752_35746"
print(r.cookies)  # prints the RequestsCookieJar returned by the server
4、Setting cookies via the cookies parameter
# Python version: 3.6
# -*- coding:utf-8 -*-
import requests

# First log in to the site in a browser, then copy the cookies over; with these
# cookies we simulate the logged-in state, so pages that are only visible after
# logging in can be scraped.
# Cookies may expire, in which case they have to be copied from the browser again.
cookies = ('_octo=GH1.1.46492411.1643975102; tz=Asia%2FShanghai; _device_id=397217a23bfb6953f8994efb95142319; '
           'has_recent_activity=1; tz=Asia%2FShanghai; '
           'color_mode=%7B%22color_mode%2uto%22%2C%22light_theme%22%3A%7B%22name%22%3A%22light%22%2C'
           '%22color_mode%22%3A%22light%22%7D%2C%22dark_theme%22%3A%7B%22name%22%3A%22dark%22%2C%22color_mode%22%3A'
           '%22dark%22%7D%7D; user_session=6ULX20UhnDk08MsEjP7ngv3_PW4GgIAbOmSAKi9jwrvgS; '
           '__Host-user_session_same_site=6ULX20UhnDk08MsEjP7ngv3_PWlOb4GgIAbOmSAKi9jwrvgS; logged_in=yes; '
           'dotcom_user=juerson; _gh_sess=zR%2FV9HmFlSokFIt1F%2B63Ltg6igWu1GOYumK'
           '%2BAsflH38KNrDgzdpnKMYNcZ8Kg1lpjxvAkQp1kZQ5zKLsaJBwTo62x9MMg2mK6yvNOb0Z3fVWDUYQbCIdZvy7bzR74NoJ7KBaG7D6ckAU2mANSFZWEdkIw5oOyAY6trLHZEVz4HCZRrgUA4fSB8OTvmruruAq%2BMDwWDcqlQvk2Hbg9uPJHVQ9yXt0nPyvXpprc5gjdRNlyurM7LBL6UHrw71%2B4vLy1SmjeI2mbji9xf97p0mj2vF0AXYNL5N9b8i8InTF%2BaUAZVqkawg4MKqfuj0GMsFjrcVEnlkRpNIkD8Y6QDfRABcClI2IdjoGtVY9YQIY8EM65nc7dPmlTR7yPowAQ1mjddHk0eZ%2BlNZXXz6xU6NsWAoUwFg5pWngLLXdJhpPDbNk1%2B4EYfOAgOUbIBh3nuJDrOfSbkAficQKoN7WczLFe%2BLK6QSdZUJpvXUfNm25%2FVBHr32POUC8nH3cGLVZnVYXflAwChW6JemDvMvMNbZmnTCx4z%2FnRSOU2c10ogSwsabfxq7JnmfGYMGrCn7%2FRd5YGkaBN44Q0nc%2B1vJDKTM%2FZXfNw%3D%3D--n%2F%2B2Aj9tuz9j--Ng18Cfe2ofjEQ%2B55GQMozA%3D%3D')

# construct a RequestsCookieJar object
jar = requests.cookies.RequestsCookieJar()

# request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/98.0.4758.80 Safari/537.36'
}

# process and assign the cookies: split the copied cookie string
for cookie in cookies.split(';'):
    # split each item into a key and a value
    key, value = cookie.split('=', 1)
    # use the set method to store each cookie name and value in the jar
    jar.set(key, value)

# pass the RequestsCookieJar object via the cookies parameter
r = requests.get('https://github.com/', cookies=jar, headers=headers)
# the page as seen after logging in
print(r.text)
5、Getting cookies set within a session
import requests """方法一:不能获取cookie信息""" requests.get('https://www.httpbin.org/cookies/set/number/123456789') r = requests.get('https://www.httpbin.org/cookies') print(r.text)# { "cookies": {}} """ 方法二:使用session获取当前cookie信息 """ s = requests.session() s.get('https://www.httpbin.org/cookies/set/number/123456789') r = s.get('https://www.httpbin.org/cookies') print(r.text) # {"cookies": { "number": "123456789" }}
6、Sending cookies to the server with the cookies parameter
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)
7、SSL certificate verification
import requests """SSL证书验证错误:""" response = requests.get('https://ssr2.scrape.center') print(response.status_code) # 抛出SSLError错误,原因我们请求的URL的证书是无效
Set the verify parameter; its default value is verify=True (certificates are verified automatically):
import requests

response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)  # 200
8、Suppressing warnings
- The disable_warnings method:
# Python version: 3.6
# -*- coding:utf-8 -*-
import requests
from requests.packages import urllib3

# Suppress the following two-line warning:
# D:\Python\Python36\lib\site-packages\urllib3\connectionpool.py:1050: InsecureRequestWarning: Unverified HTTPS request is being made to host 'ssr2.scrape.center'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
#   InsecureRequestWarning

# ignore warnings
urllib3.disable_warnings()
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)  # 200
- The captureWarnings method
# Python version: 3.6
# -*- coding:utf-8 -*-
import logging
import requests

# ignore the warning by capturing warnings into the logging system:
logging.captureWarnings(True)
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)  # 200
- Specifying a local certificate
import requests

# requires local .crt and .key files; pass their paths via the cert parameter
response = requests.get('https://ssr2.scrape.center/', cert=('/path/server.crt', '/path/server.key'))
print(response.status_code)
9、Timeout settings
- Wait indefinitely
If the timeout parameter is not set, or is set to None, requests waits indefinitely for the site to finish responding.
import requests

response = requests.get('https://httpbin.org/get', timeout=None)
print(response.status_code)  # 200
- Setting a single overall timeout
import requests """ 本机网络状况不好或服务器网络响应太慢甚至无响应,超出这个设置的数就会抛出异常 参数timeout用于超时,超过这个时间就会抛出异常。 """ response = requests.get('https://httpbin.org/get',timeout=1) print(response.status_code) # 200
- Setting separate connect and read timeouts
import requests

'''
The timeout parameter: a request actually has two phases, connecting and reading.
timeout=1 means an exception is raised if there is no response within 1 second,
and that single value is used for both the connect and the read timeout.
To specify them separately, pass a tuple such as timeout=(1, 5), where 1 is the
connect timeout and 5 is the read timeout.
'''
# set the connect and read timeouts separately as a tuple
response = requests.get('https://httpbin.org/get', timeout=(5, 30))
print(response.status_code)  # 200
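A minimal sketch of catching the timeout exception mentioned above (requests.exceptions.Timeout covers both connect and read timeouts; the URL and the 1-second value are only examples):

import requests

try:
    response = requests.get('https://httpbin.org/get', timeout=1)
    print(response.status_code)
except requests.exceptions.Timeout:
    # raised when connecting or reading takes longer than the configured timeout
    print('the request timed out')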
10、Authentication
- Use the authentication support built into requests, set via the auth parameter
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('https://ssr3.scrape.center/', auth=HTTPBasicAuth('admin', 'admin'))
print(r.status_code)  # 200
- Shorthand for the auth parameter: pass a tuple directly (the simpler form)
import requests

# shorthand for the auth parameter
res = requests.get('https://ssr3.scrape.center/', auth=('admin', 'admin'))
print(res.status_code)  # 200
11、Proxy settings
Note: the proxy IPs used in the examples below may already be expired or invalid; find a working proxy and substitute it before trying them (one place to look is https://www.proxy-list.download/).
- HTTP proxies
import requests

proxies = {
    'http': 'http://27.203.215.138:8060',
    # 'http': 'http://120.196.112.6:3128',
    # 'http': 'http://47.242.242.32:80',
    # 'https': 'http://120.196.112.6:3128',
}
r = requests.get('https://httpbin.org/get', proxies=proxies)
print(r.status_code)  # 200
- Proxies that require credentials, in the form http://user:password@host:port
import requests

proxies = {
    # 'http': 'http://user:password@61.216.185.88:60808/',
    # 'http': 'http://user:password@128.199.108.29:3128/',
    'http': 'http://user:password@165.154.23.222:80/',
}
r = requests.get('https://www.httpbin.org/get', proxies=proxies)
print(r.status_code)  # 200
- SOCKS proxies (SOCKS4, SOCKS5); these require the socks extra of requests, installed with pip install "requests[socks]"
import requests

proxies = {
    'http': 'socks4://user:password@139.159.48.155:39593',
    # 'http': 'socks5://user:password@139.162.108.196:12347',
    # 'http': 'socks4://user:password@35.220.160.28:10808',
    # 'http': 'socks5://user:password@72.206.181.123:4145',
    # 'http': 'socks5://user:password@103.9.159.235:59350',
    # 'http': 'socks5://user:password@112.105.12.63:1111',
    # 'http': 'socks4://user:password@192.111.138.29:4145',
    # 'http': 'socks5://user:password@72.223.168.86:57481',
}
r = requests.get('https://www.httpbin.org/get', proxies=proxies)
print(r.status_code)  # 200
- Creating a request with a Prepared Request object
Besides calling the get and post methods of requests directly, you can also use a Prepared Request. Internally, when requests sends a request it constructs a Request object, assigns it the various parameters (url, headers, data, and so on), and then sends that object; once the request succeeds you get back a Response object, which you then parse.
from requests import Request, Session

URL = 'https://www.httpbin.org/post'
data = {'name': 'germey'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/98.0.4758.80 Safari/537.36'
}
s = Session()
req = Request('POST', URL, data=data, headers=headers)
# call the Session's prepare_request method to convert it into a Prepared Request object
prepped = s.prepare_request(req)
# send it with the send method
r = s.send(prepped)
print(r.text)
12、Disabling redirects
⑴、GitHub redirects all HTTP requests to HTTPS
import requests

r = requests.get('http://github.com')
print(r.url)          # https://github.com/
print(r.status_code)  # 200
print(r.history)      # [<Response [301]>]
⑵、Disabling redirects: the allow_redirects parameter
import requests

r = requests.get('http://github.com', allow_redirects=False)
print(r.url)          # http://github.com/
print(r.status_code)  # 301
print(r.history)      # []
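With redirects disabled, the response you get back is the 301 itself, so you can inspect where it points; a small sketch (Location is the standard redirect header):

import requests

r = requests.get('http://github.com', allow_redirects=False)
print(r.is_redirect)              # True for a 3xx response that carries a Location header
print(r.headers.get('Location'))  # the redirect target, e.g. https://github.com/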
⑶、Enabling redirects
import requests

r = requests.head('http://github.com', allow_redirects=True)
print(r.url)          # https://github.com/
print(r.status_code)  # 200
print(r.history)      # [<Response [301]>]