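Opening Remarks
In part three, Bohengjun showed how to disguise a crawler with a browser User-Agent. This part goes a step further and covers disguising the IP address. Some sites ban an IP, typically because the crawl rate was too high and an administrator added the address to a blacklist, so it helps to have several IPs to rotate through. How do you disguise the IP, and how do you scrape multiple pages while doing so? Read on.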
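Previously in This Series
- "Quick Start to Python Web Scraping with Bohengjun: Beautifulsoup (Part 1)"
- "Quick Start to Python Web Scraping with Bohengjun: Beautifulsoup (Part 2)"
- "Quick Start to Python Web Scraping with Bohengjun: Beautifulsoup (Part 3)"
- "Quick Start to Python Web Scraping with Bohengjun: Beautifulsoup (Part 4)"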
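Official Library Documentation
Details
To obtain usable proxy IPs, see this earlier article by Bohengjun, which lists a number of sites that publish them:
Bohengjun picked one of them at random to use as the example.
Inspect the HTML to find where the proxy IPs sit in the page, then extract them and print them: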
import requests
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

ua = UserAgent()
header = {"User-Agent": ua.random}  # random browser User-Agent
url = "https://free-proxy-list.net/"
req = requests.get(url, headers=header)
soup = bs(req.text, "html.parser")

proxies = []
# Skip the header row, then read the IP and port from the first two cells of each row
for row in soup.find("table", attrs={"id": "proxylisttable"}).find_all("tr")[1:]:
    tds = row.find_all("td")
    try:
        ip = tds[0].text.strip()
        port = tds[1].text.strip()
        host = f"{ip}:{port}"
        proxies.append(host)
    except IndexError:
        # Row without enough cells; ignore it
        continue

for i in proxies:
    print(i)
Running it prints the list of proxy IPs, as shown below:
177.37.240.52:8080 104.198.108.238:8080 46.99.163.241:8080 41.78.212.62:8080 195.158.3.117:3128 59.125.123.129:81 185.18.212.227:3128 191.242.179.138:3128 161.35.4.201:80 5.252.161.48:8080 103.152.5.80:8080 208.138.24.254:80 119.206.242.196:80 193.29.104.185:3128 185.245.84.131:3128 152.67.48.62:3128 141.164.56.244:8080 34.203.142.175:80 173.212.202.65:80 74.143.245.221:80 136.233.215.136:80 103.253.146.44:8080 103.134.168.180:80 103.152.5.70:8080 14.99.225.212:80 14.99.225.213:80 179.52.186.159:999 136.233.215.139:80 195.78.112.235:42549 51.158.180.179:8811 103.134.168.81:80 109.73.13.132:34693 136.243.254.196:80 103.134.168.16:80 185.198.188.49:8080 195.206.106.186:3128 54.179.49.83:1080 159.89.221.73:3128 37.120.222.132:3128 88.198.24.108:8080 89.249.67.57:3128 136.233.215.142:80 65.160.224.144:80 134.209.29.120:8080 51.75.147.33:3128 85.133.183.66:8080 73.144.10.167:80 89.221.223.204:80 103.218.240.75:80 104.238.81.186:56227 191.103.219.225:48612 178.150.148.38:8282 165.22.108.115:8080 154.16.63.16:3128 160.16.203.39:80 122.15.211.125:80 103.80.61.79:8080 124.41.243.72:44716 94.180.106.94:32767 150.129.148.99:35101 51.75.147.40:3128 5.189.133.231:80 45.82.245.34:3128 82.99.217.18:8080 193.239.86.247:3128 45.7.205.103:39750 49.204.79.81:80 103.84.70.49:84 116.202.108.45:8008 203.142.69.69:8080 190.53.38.98:46340 89.221.223.234:80 46.4.96.137:3128 202.141.233.166:48995 185.198.188.53:8080 134.3.255.10:8080 88.198.50.103:8080 138.68.60.8:8080 191.96.71.118:3128 209.97.150.167:8080 159.203.61.169:8080 102.129.249.120:8080 154.16.202.22:8080 161.35.70.249:3128 191.96.42.80:8080 139.59.1.14:8080 128.199.202.122:3128 167.71.5.83:8080 198.199.86.11:3128 54.146.128.205:80 139.162.78.109:8080 51.158.68.133:8811 185.236.203.209:3128 85.185.159.74:8080 217.150.77.31:53281 136.233.215.137:80 103.227.255.43:80 125.163.190.51:3128 176.9.75.42:8080 176.9.119.170:8080 91.132.139.177:3128 185.189.112.157:3128 159.65.171.69:80 185.189.112.133:3128 184.82.235.73:8080 
180.247.72.9:8080 36.75.202.120:8080 193.34.55.64:32767 187.243.253.2:8080 36.37.74.60:8080 200.85.169.18:47548 190.95.214.178:8080 118.174.220.14:43473 124.107.182.196:8118 3.22.0.212:8080 161.202.226.194:80 117.102.87.138:41757 192.109.165.129:80 191.101.39.154:80 122.15.211.124:80 89.45.4.138:3128 192.46.215.101:8080 191.101.39.81:80 191.101.39.238:80 14.99.225.208:80 193.239.86.137:3128 51.158.119.88:8811 51.158.68.68:8811 79.110.52.252:3128 14.97.2.107:80 78.47.16.54:80 185.236.202.205:3128 186.10.82.22:59880 46.21.153.16:3128 46.102.153.48:3128 208.80.28.208:8080 200.62.96.71:80 193.239.86.248:3128 67.43.239.169:3128 193.56.255.181:3128 164.132.112.237:80 185.236.202.170:3128 46.175.186.24:8081 185.236.203.208:3128 193.56.255.131:3128 193.29.104.90:3128 185.236.202.168:3128 89.249.65.191:3128 37.120.140.158:3128 46.253.45.24:8080 212.234.67.60:8080 80.48.119.28:8080 85.196.183.162:8080 217.8.51.206:8080 12.186.206.85:80 46.209.63.177:3128 96.9.77.203:55667 178.63.240.212:80 191.242.178.209:3128 186.125.59.8:46316 200.94.140.50:30682 94.130.179.24:8009 185.198.188.50:8080 185.198.188.54:8080 180.183.26.121:3128 95.165.233.60:2020 138.97.200.225:8080 45.230.171.17:999 88.255.92.37:8080 213.6.28.85:8080 45.173.6.70:999 103.19.129.34:83 61.19.145.66:8080 36.91.51.233:3128 45.172.108.44:9991 189.50.9.250:8080 36.67.27.153:8080 36.94.253.189:8080 14.97.2.106:80 179.96.28.58:80 62.109.21.59:80 167.99.146.95:8888 159.65.140.227:8080 128.199.115.226:3128 154.72.199.202:41201 175.111.15.2:42483 83.216.224.41:8080 47.75.90.57:80 185.198.189.21:8080 185.198.188.51:8080 103.150.239.25:8080 136.232.209.70:47423 64.4.94.129:80 184.147.26.69:8080 185.198.188.52:8080 185.236.203.156:3128 103.156.225.18:80 89.223.80.30:8080 31.172.105.144:8080 191.100.20.187:8080 103.36.11.240:14571 103.217.173.210:53905 152.67.24.187:80 77.94.112.234:32222 62.23.15.92:3128 175.111.181.26:56297 2.187.213.38:8080 183.88.33.147:8080 160.0.219.21:8080 190.90.24.12:999 190.109.168.217:8080 
103.15.60.225:8080 185.198.188.55:8080 160.202.40.20:55655 3.25.29.231:3128 43.241.141.27:35101 79.104.25.218:8080 118.175.207.180:40017 213.79.122.82:8080 187.243.255.174:8080 176.56.107.184:46973 182.253.168.161:8080 187.243.240.54:8080 54.151.132.183:3128 176.62.178.247:47556 103.109.59.242:53281 132.145.18.53:80 139.162.1.237:80 177.72.81.39:8080 51.158.165.18:8811 122.102.27.172:23500 198.50.163.192:3129 51.158.172.165:8811 202.131.103.67:80 103.146.17.97:80 45.79.23.35:3128 130.226.140.40:80 193.56.255.179:3128 142.44.148.56:8080 14.99.225.209:80 188.247.20.1:80 189.146.126.85:80 185.198.188.48:8080 172.104.65.13:3128 217.19.217.151:8080 36.90.101.91:8080 41.204.87.90:8080 103.24.126.182:84 189.52.154.213:3128 139.162.41.219:8889 118.179.173.253:40836 24.172.34.114:49920 157.245.86.213:8118 181.198.97.241:30072 154.72.204.122:8080 139.99.105.5:80 58.96.148.49:8080 37.17.38.196:53281 144.217.101.245:3129 41.65.146.38:8080 193.56.255.180:3128 103.134.168.154:80 20.50.107.111:80 14.97.2.104:80 45.236.169.150:999 117.121.202.44:8080 116.0.3.140:8080 201.190.184.22:46740 149.100.165.85:8080 97.87.248.14:80 103.224.36.209:8080 43.224.10.27:6666 125.141.117.36:80 150.129.58.190:31111 192.158.15.201:60684 113.53.83.212:44664 47.91.242.160:3128 31.14.49.1:8080 46.5.252.59:3128 52.149.152.236:80 110.44.117.26:43922 213.230.110.39:3128 103.11.106.70:8181 190.214.27.106:48586 125.26.99.223:36506 103.81.77.65:84 180.211.192.61:8080 103.78.252.89:8080 80.78.237.2:55443 103.122.60.5:8080 36.92.107.194:8080 182.253.21.26:46977 103.107.92.1:52827 14.97.2.105:80 51.75.147.44:3128 78.42.42.42:8080 113.254.178.224:80 103.62.232.26:8080 109.86.182.203:3128 91.92.180.45:8080
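The row-to-`ip:port` step can also be sketched without third-party packages. The block below is only an illustration that swaps BeautifulSoup for the standard library's `html.parser`; the class name `ProxyTableParser` and the sample table (which uses reserved documentation addresses) are assumptions made for the example, not the real page's markup.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collect 'ip:port' strings from the first two <td> cells of each <tr>."""

    def __init__(self):
        super().__init__()
        self.proxies = []
        self._cells = None    # cell texts of the row being read, or None outside a row
        self._in_td = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Header rows use <th>, so they contribute no <td> cells and are skipped
            if len(self._cells) >= 2:
                ip, port = self._cells[0].strip(), self._cells[1].strip()
                self.proxies.append(f"{ip}:{port}")
            self._cells = None
            self._in_td = False


# Hypothetical sample markup standing in for the downloaded page
sample_html = """
<table>
  <tr><th>IP Address</th><th>Port</th></tr>
  <tr><td>192.0.2.10</td><td>8080</td></tr>
  <tr><td>192.0.2.11</td><td>3128</td></tr>
</table>
"""

parser = ProxyTableParser()
parser.feed(sample_html)
print(parser.proxies)  # ['192.0.2.10:8080', '192.0.2.11:3128']
```

For real pages BeautifulSoup remains the more convenient choice; this version is mainly useful for seeing what the row loop does.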
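With this many candidates, some will work and some will not, so the IPs need to be filtered. This site can be used to check them:
The check leaves the following working addresses: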
152.67.48.62:3128 103.134.168.180:80 103.152.5.70:8080 51.158.180.179:8811 103.134.168.81:80 103.134.168.16:80 54.179.49.83:1080 159.89.221.73:3128 103.218.240.75:80
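By the time you read this, these addresses have probably expired, because the source site refreshes its list frequently. Do not copy them; fetch a fresh list and filter it yourself, on this or any other checking site.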
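The filtering step can also be done locally instead of through a website. The following is a minimal sketch, not the checker used above: `probe` and `filter_working` are hypothetical helper names, and `http://icanhazip.com` is only an assumed test URL (any page you are allowed to request would do). It uses only the standard library and tests the candidates in parallel threads.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def probe(proxy, test_url="http://icanhazip.com", timeout=3.0):
    """Return True if test_url can be fetched through the given 'ip:port' proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, refused connections, bad responses: treat all as "not working"
        return False


def filter_working(proxies, check=probe, workers=20):
    """Keep only the proxies for which check(proxy) is truthy, preserving order."""
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

Calling `filter_working(proxies)` on the scraped list returns the subset that answered within the timeout. Free proxies come and go quickly, so the result will differ from run to run.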
Before continuing, replace the proxies list with the filtered addresses, like this:
proxies = [
    '152.67.48.62:3128',
    '103.134.168.180:80',
    '103.152.5.70:8080',
    '51.158.180.179:8811',
    '103.134.168.81:80',
    '103.134.168.16:80',
    '54.179.49.83:1080',
    '159.89.221.73:3128',
    '103.218.240.75:80',
]
Next, we define a function and use it to check whether requests through the proxies succeed:
import random
import requests

def get_session(proxies):
    # construct an HTTP session
    session = requests.Session()
    # choose one random proxy
    proxy = random.choice(proxies)
    session.proxies = {"http": proxy, "https": proxy}
    return session

for i in range(5):
    s = get_session(proxies)
    try:
        print("Request page with IP:", s.get("https://luckydesigner.space", timeout=1.5).text.strip())
    except Exception:
        # This proxy failed or timed out; move on to the next attempt
        continue
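If a line starting with "Request page with IP:" is printed, followed by the response (the IP address, if the requested page echoes it back), the proxy is usable.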
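The pick-a-random-proxy-and-skip-failures pattern above can be separated from the HTTP library so that it is easy to try out. This is a sketch under assumptions: `fetch_with_rotation` is a hypothetical helper, `fetch` is any callable taking `(url, proxy)`, and `fake_fetch` is a stand-in that simulates one working proxy using reserved documentation addresses.

```python
import random


def fetch_with_rotation(url, proxies, fetch, attempts=5):
    """Try up to `attempts` randomly chosen proxies.

    Returns (proxy, result) for the first success, or None if every attempt fails.
    """
    for _ in range(attempts):
        proxy = random.choice(proxies)
        try:
            return proxy, fetch(url, proxy)
        except Exception:
            continue  # this proxy failed; pick another one
    return None


def fake_fetch(url, proxy):
    # Stand-in for a real HTTP call: only one "proxy" works
    if proxy != "192.0.2.2:8080":
        raise ConnectionError(f"{proxy} is down")
    return f"fetched {url}"


print(fetch_with_rotation("https://example.com", ["192.0.2.2:8080"], fake_fetch))
print(fetch_with_rotation("https://example.com", ["192.0.2.1:80"], fake_fetch))  # None
```

With requests, `fetch` could be something like `lambda url, proxy: requests.get(url, proxies={"http": "http://" + proxy, "https": "http://" + proxy}, timeout=1.5).text`. Returning the proxy along with the result makes it easy to see which address actually worked.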
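Next, we use a proxy IP in this way to scrape the video page addresses from multiple list pages of Avgle.com, an adult site, so they are at hand for later viewing.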
import os
import time
import random
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup as bs

ua = UserAgent()
header = {"User-Agent": ua.random}  # disguise the browser header
proxies = [
    '152.67.48.62:3128',
    '103.134.168.180:80',
    '103.152.5.70:8080',
    '51.158.180.179:8811',
    '103.134.168.81:80',
    '103.134.168.16:80',
    '54.179.49.83:1080',
    '159.89.221.73:3128',
    '103.218.240.75:80',
]
ip = random.choice(proxies)
# Free proxies are usually plain HTTP proxies, so the proxy URL uses http://
# for both keys; an https:// proxy URL would require TLS to the proxy itself.
proxy_url = 'http://' + ip
proxy = {'https': proxy_url, 'http': proxy_url}

url = "https://avgle.com/videos?page={}"  # the site's video list pages follow this pattern
lst = []
for page in range(1, 11):  # scrape pages 1-10
    time.sleep(2)  # wait 2 seconds between requests
    soup = bs(requests.get(url.format(page), headers=header, proxies=proxy).text, "lxml")
    chat = soup.find_all("div", "well well-sm")
    link = list(x.find_next("a")["href"] for x in chat)
    for i in link:
        lst.append("https://www.avgle.com{}".format(i))

dir_name = "avlge"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
with open(dir_name + "/" + "avgleList.txt", "w") as f:  # save to a local text file named "avgleList"
    for a in sorted(list(set(lst)), key=lst.index):  # drop duplicates, keep first-seen order
        f.write(a + "\n")
The scraped result is shown in the figure below:
With this list you can quickly skim the addresses and paste the one you want straight into the browser.
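Two small steps of the multi-page script are worth isolating: generating the page URLs from a template, and removing duplicate links while keeping their original order. The sketch below uses hypothetical helper names (`page_urls`, `unique_in_order`) and an `example.com` template as a placeholder; `dict.fromkeys` is offered as a shorter alternative to sorting a set by first index, since dictionaries preserve insertion order.

```python
def page_urls(template, first, last):
    """Return the formatted URL for every page number from first to last inclusive."""
    return [template.format(n) for n in range(first, last + 1)]


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


print(page_urls("https://example.com/videos?page={}", 1, 3))
print(unique_in_order(["/a", "/b", "/a", "/c", "/b"]))  # ['/a', '/b', '/c']
```

For a few hundred links either approach is fine; `dict.fromkeys` simply avoids a repeated list search for every element.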
To search Avgle.com instead, Bohengjun wrote the following code, which takes search keywords and collects the addresses of the matching videos:
import os
import math
import time
import random
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup as bs

ua = UserAgent()
header = {"User-Agent": ua.random}
url = "https://avgle.com/search/videos?search_query={}&search_type=videos&page={}"
lst = []
keywords = input("Please input your keywords:")  # enter the keywords to search for

# Read the total result count from the first page; each page shows 18 results,
# so dividing by 18 and rounding up gives the number of pages.
sup = bs(requests.get(url.format(keywords, 1), headers=header).text, "html.parser")
page = math.ceil(int(sup.find_all("span", "text-white")[-1].get_text()) / 18)

for i in range(1, page + 1):
    time.sleep(2)  # wait 2 seconds between requests
    soup = bs(requests.get(url.format(keywords, i), headers=header).text, "lxml")
    chat = soup.find_all("div", "well well-sm")
    link = list(x.find_next("a")["href"] for x in chat)
    for w in link:
        lst.append("https://www.avgle.com{}".format(w))

dir_name = "avlge"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
with open(dir_name + "/" + "avgleSearch.txt", "w") as f:
    for a in sorted(list(set(lst)), key=lst.index):  # drop duplicates, keep first-seen order
        f.write(a + "\n")
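The page-count arithmetic and the search URL construction can be checked on their own. This is a small sketch with hypothetical helper names (`page_count`, `search_url`); it assumes the 18-results-per-page figure used in the code above, and it adds `urllib.parse.quote_plus` so that keywords containing spaces or non-ASCII characters are encoded explicitly.

```python
import math
from urllib.parse import quote_plus

RESULTS_PER_PAGE = 18  # page size assumed by the code above


def page_count(total_results, per_page=RESULTS_PER_PAGE):
    """Number of pages needed to show total_results items."""
    return math.ceil(total_results / per_page)


def search_url(keywords, page):
    """Build the search URL for one page, URL-encoding the keywords."""
    template = "https://avgle.com/search/videos?search_query={}&search_type=videos&page={}"
    return template.format(quote_plus(keywords), page)


print(page_count(37))         # 3  (18 + 18 + 1)
print(page_count(36))         # 2
print(search_url("a b", 2))   # ...search_query=a+b&search_type=videos&page=2
```

A total of 37 results needs three pages because the last page holds the single leftover item, which is why the division is rounded up rather than truncated.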
The same approach works for scraping novels, multi-page articles, and similar paginated content.