Learn Python Web Scraping Quickly with Bohengjun: BeautifulSoup (Part 5)

Published 2021-02-09 22:13:19

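Opening Remarks

In Part 3, Bohengjun showed how to disguise a crawler with a browser User-Agent. This part goes a step further and covers disguising the IP address. Some sites ban an IP, typically because the crawl ran too fast and an administrator added the address to a blacklist, so it helps to have several IPs to switch between. How do you disguise the IP and scrape multiple pages along the way? Read on.

Background

Official library documentation

Details

To find usable proxy IPs, see this earlier article by Bohengjun, which lists quite a few sites of this kind:

Bohengjun picked one of them at random as the example.

Inspect the HTML, locate the section that holds the proxy IPs, then filter them out and print them: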

import requests
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

ua = UserAgent()
header = {"User-Agent": ua.random}  # random browser User-Agent
url = "https://free-proxy-list.net/"
req = requests.get(url, headers=header)
soup = bs(req.text, "html.parser")
proxies = []
# Skip the header row, then read the first two cells (IP, port) of each row.
# The table id is the one used in this article; the page layout may change.
for row in soup.find("table", attrs={"id": "proxylisttable"}).find_all("tr")[1:]:
    tds = row.find_all("td")
    try:
        ip = tds[0].text.strip()
        port = tds[1].text.strip()
        host = f"{ip}:{port}"
        proxies.append(host)
    except IndexError:
        # rows without enough cells (for example a footer) are ignored
        continue
for i in proxies:
    print(i)

After that, you will see the generated list of proxy IPs, as shown below:

177.37.240.52:8080
104.198.108.238:8080
46.99.163.241:8080
41.78.212.62:8080
195.158.3.117:3128
59.125.123.129:81
185.18.212.227:3128
191.242.179.138:3128
161.35.4.201:80
5.252.161.48:8080
103.152.5.80:8080
208.138.24.254:80
119.206.242.196:80
193.29.104.185:3128
185.245.84.131:3128
152.67.48.62:3128
141.164.56.244:8080
34.203.142.175:80
173.212.202.65:80
74.143.245.221:80
136.233.215.136:80
103.253.146.44:8080
103.134.168.180:80
103.152.5.70:8080
14.99.225.212:80
14.99.225.213:80
179.52.186.159:999
136.233.215.139:80
195.78.112.235:42549
51.158.180.179:8811
103.134.168.81:80
109.73.13.132:34693
136.243.254.196:80
103.134.168.16:80
185.198.188.49:8080
195.206.106.186:3128
54.179.49.83:1080
159.89.221.73:3128
37.120.222.132:3128
88.198.24.108:8080
89.249.67.57:3128
136.233.215.142:80
65.160.224.144:80
134.209.29.120:8080
51.75.147.33:3128
85.133.183.66:8080
73.144.10.167:80
89.221.223.204:80
103.218.240.75:80
104.238.81.186:56227
191.103.219.225:48612
178.150.148.38:8282
165.22.108.115:8080
154.16.63.16:3128
160.16.203.39:80
122.15.211.125:80
103.80.61.79:8080
124.41.243.72:44716
94.180.106.94:32767
150.129.148.99:35101
51.75.147.40:3128
5.189.133.231:80
45.82.245.34:3128
82.99.217.18:8080
193.239.86.247:3128
45.7.205.103:39750
49.204.79.81:80
103.84.70.49:84
116.202.108.45:8008
203.142.69.69:8080
190.53.38.98:46340
89.221.223.234:80
46.4.96.137:3128
202.141.233.166:48995
185.198.188.53:8080
134.3.255.10:8080
88.198.50.103:8080
138.68.60.8:8080
191.96.71.118:3128
209.97.150.167:8080
159.203.61.169:8080
102.129.249.120:8080
154.16.202.22:8080
161.35.70.249:3128
191.96.42.80:8080
139.59.1.14:8080
128.199.202.122:3128
167.71.5.83:8080
198.199.86.11:3128
54.146.128.205:80
139.162.78.109:8080
51.158.68.133:8811
185.236.203.209:3128
85.185.159.74:8080
217.150.77.31:53281
136.233.215.137:80
103.227.255.43:80
125.163.190.51:3128
176.9.75.42:8080
176.9.119.170:8080
91.132.139.177:3128
185.189.112.157:3128
159.65.171.69:80
185.189.112.133:3128
184.82.235.73:8080
180.247.72.9:8080
36.75.202.120:8080
193.34.55.64:32767
187.243.253.2:8080
36.37.74.60:8080
200.85.169.18:47548
190.95.214.178:8080
118.174.220.14:43473
124.107.182.196:8118
3.22.0.212:8080
161.202.226.194:80
117.102.87.138:41757
192.109.165.129:80
191.101.39.154:80
122.15.211.124:80
89.45.4.138:3128
192.46.215.101:8080
191.101.39.81:80
191.101.39.238:80
14.99.225.208:80
193.239.86.137:3128
51.158.119.88:8811
51.158.68.68:8811
79.110.52.252:3128
14.97.2.107:80
78.47.16.54:80
185.236.202.205:3128
186.10.82.22:59880
46.21.153.16:3128
46.102.153.48:3128
208.80.28.208:8080
200.62.96.71:80
193.239.86.248:3128
67.43.239.169:3128
193.56.255.181:3128
164.132.112.237:80
185.236.202.170:3128
46.175.186.24:8081
185.236.203.208:3128
193.56.255.131:3128
193.29.104.90:3128
185.236.202.168:3128
89.249.65.191:3128
37.120.140.158:3128
46.253.45.24:8080
212.234.67.60:8080
80.48.119.28:8080
85.196.183.162:8080
217.8.51.206:8080
12.186.206.85:80
46.209.63.177:3128
96.9.77.203:55667
178.63.240.212:80
191.242.178.209:3128
186.125.59.8:46316
200.94.140.50:30682
94.130.179.24:8009
185.198.188.50:8080
185.198.188.54:8080
180.183.26.121:3128
95.165.233.60:2020
138.97.200.225:8080
45.230.171.17:999
88.255.92.37:8080
213.6.28.85:8080
45.173.6.70:999
103.19.129.34:83
61.19.145.66:8080
36.91.51.233:3128
45.172.108.44:9991
189.50.9.250:8080
36.67.27.153:8080
36.94.253.189:8080
14.97.2.106:80
179.96.28.58:80
62.109.21.59:80
167.99.146.95:8888
159.65.140.227:8080
128.199.115.226:3128
154.72.199.202:41201
175.111.15.2:42483
83.216.224.41:8080
47.75.90.57:80
185.198.189.21:8080
185.198.188.51:8080
103.150.239.25:8080
136.232.209.70:47423
64.4.94.129:80
184.147.26.69:8080
185.198.188.52:8080
185.236.203.156:3128
103.156.225.18:80
89.223.80.30:8080
31.172.105.144:8080
191.100.20.187:8080
103.36.11.240:14571
103.217.173.210:53905
152.67.24.187:80
77.94.112.234:32222
62.23.15.92:3128
175.111.181.26:56297
2.187.213.38:8080
183.88.33.147:8080
160.0.219.21:8080
190.90.24.12:999
190.109.168.217:8080
103.15.60.225:8080
185.198.188.55:8080
160.202.40.20:55655
3.25.29.231:3128
43.241.141.27:35101
79.104.25.218:8080
118.175.207.180:40017
213.79.122.82:8080
187.243.255.174:8080
176.56.107.184:46973
182.253.168.161:8080
187.243.240.54:8080
54.151.132.183:3128
176.62.178.247:47556
103.109.59.242:53281
132.145.18.53:80
139.162.1.237:80
177.72.81.39:8080
51.158.165.18:8811
122.102.27.172:23500
198.50.163.192:3129
51.158.172.165:8811
202.131.103.67:80
103.146.17.97:80
45.79.23.35:3128
130.226.140.40:80
193.56.255.179:3128
142.44.148.56:8080
14.99.225.209:80
188.247.20.1:80
189.146.126.85:80
185.198.188.48:8080
172.104.65.13:3128
217.19.217.151:8080
36.90.101.91:8080
41.204.87.90:8080
103.24.126.182:84
189.52.154.213:3128
139.162.41.219:8889
118.179.173.253:40836
24.172.34.114:49920
157.245.86.213:8118
181.198.97.241:30072
154.72.204.122:8080
139.99.105.5:80
58.96.148.49:8080
37.17.38.196:53281
144.217.101.245:3129
41.65.146.38:8080
193.56.255.180:3128
103.134.168.154:80
20.50.107.111:80
14.97.2.104:80
45.236.169.150:999
117.121.202.44:8080
116.0.3.140:8080
201.190.184.22:46740
149.100.165.85:8080
97.87.248.14:80
103.224.36.209:8080
43.224.10.27:6666
125.141.117.36:80
150.129.58.190:31111
192.158.15.201:60684
113.53.83.212:44664
47.91.242.160:3128
31.14.49.1:8080
46.5.252.59:3128
52.149.152.236:80
110.44.117.26:43922
213.230.110.39:3128
103.11.106.70:8181
190.214.27.106:48586
125.26.99.223:36506
103.81.77.65:84
180.211.192.61:8080
103.78.252.89:8080
80.78.237.2:55443
103.122.60.5:8080
36.92.107.194:8080
182.253.21.26:46977
103.107.92.1:52827
14.97.2.105:80
51.75.147.44:3128
78.42.42.42:8080
113.254.178.224:80
103.62.232.26:8080
109.86.182.203:3128
91.92.180.45:8080

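With this many results, some will work and some will not, so the IPs need to be filtered. Use this site to filter them:

That yields the following working addresses: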

152.67.48.62:3128
103.134.168.180:80
103.152.5.70:8080
51.158.180.179:8811
103.134.168.81:80
103.134.168.16:80
54.179.49.83:1080
159.89.221.73:3128
103.218.240.75:80

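By the time you read this, these addresses have probably expired, because the site refreshes its list frequently. There is no need to copy them; you can filter a fresh list on other sites.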
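Filtering does not have to go through a website. As a rough sketch, and not the method used in this article, the check can be done locally with the standard library: first discard entries that are not a valid `IPv4:port` pair, then try one request through each remaining proxy. The helper names `is_well_formed` and `is_alive`, the test URL `http://example.com`, and the 3-second timeout are all assumptions made for illustration.

```python
import ipaddress
import urllib.request


def is_well_formed(host: str) -> bool:
    """Return True if host looks like 'IPv4:port' with a port in 1-65535."""
    ip, sep, port = host.rpartition(":")
    if not sep or not port.isdigit():
        return False
    try:
        ipaddress.IPv4Address(ip)
    except ValueError:
        return False
    return 1 <= int(port) <= 65535


def is_alive(host: str, test_url: str = "http://example.com", timeout: float = 3.0) -> bool:
    """Send one request through the proxy; any error counts as a dead proxy."""
    handler = urllib.request.ProxyHandler({"http": f"http://{host}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


candidates = ["152.67.48.62:3128", "999.1.1.1:80", "not-a-proxy"]
well_formed = [h for h in candidates if is_well_formed(h)]
print(well_formed)  # prints ['152.67.48.62:3128']

# The liveness check needs network access, so it is left commented out here:
# usable = [h for h in well_formed if is_alive(h)]
```

The format check is cheap and catches scraping mistakes such as a header cell ending up in the list; the liveness check is the slow part, which is why it only runs on entries that pass the first step.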

Before going on, replace the proxies list with something like this:

proxies = [
    '152.67.48.62:3128',
    '103.134.168.180:80',
    '103.152.5.70:8080',
    '51.158.180.179:8811',
    '103.134.168.81:80',
    '103.134.168.16:80',
    '54.179.49.83:1080',
    '159.89.221.73:3128',
    '103.218.240.75:80',
]

Next, we write a function to check whether a request through a proxy succeeds:

import random
import requests

def get_session(proxies):
    # construct an HTTP session
    session = requests.Session()
    # choose one random proxy
    proxy = random.choice(proxies)
    session.proxies = {"http": proxy, "https": proxy}
    return session

for i in range(5):
    s = get_session(proxies)
    try:
        resp = s.get("https://luckydesigner.space", timeout=1.5)
        # report which proxy was used; the page body itself is not an IP address
        print("Request page with IP:", s.proxies["http"], resp.status_code)
    except Exception:
        # dead or slow proxy: move on to the next random pick
        continue

If a line beginning with "Request page with IP:" is printed, the request through that proxy succeeded.
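Picking a proxy at random can hit the same dead address more than once. One possible variation, sketched here under assumptions, tries the proxies in a shuffled order and stops at the first success. The names `proxy_map`, `fetch_with_rotation`, and `fake_fetch` are made up for this example, and the `192.0.2.x` addresses are documentation placeholders, not real proxies. The fetch step is passed in as a function so the sketch runs without network access.

```python
import random


def proxy_map(host):
    """Build the dict that requests accepts as its `proxies` argument."""
    # The key is the scheme of the target site; the value is how to reach
    # the proxy, so both entries point at an http:// proxy URL.
    return {"http": f"http://{host}", "https": f"http://{host}"}


def fetch_with_rotation(fetch, url, hosts, attempts=5, rng=None):
    """Try proxies in shuffled order; return the first result or None."""
    rng = rng or random.Random()
    order = rng.sample(list(hosts), k=len(hosts))
    for host in order[:attempts]:
        try:
            return fetch(url, proxy_map(host))
        except Exception:
            # this proxy failed; fall through to the next one
            continue
    return None


def fake_fetch(url, proxies):
    """Stand-in for a real request: pretend every port-80 proxy is dead."""
    if proxies["http"].endswith(":80"):
        raise ConnectionError("simulated dead proxy")
    return f"fetched {url} via {proxies['http']}"


hosts = ["192.0.2.1:80", "192.0.2.2:3128"]
result = fetch_with_rotation(fake_fetch, "https://luckydesigner.space", hosts)
print(result)
```

With requests installed, the real fetch step could be something like `lambda url, proxies: requests.get(url, proxies=proxies, timeout=1.5).text` in place of `fake_fetch`.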

Next, let's use these proxy IPs to scrape several pages of video playback URLs from the adult site Avgle.com, so they are easy to find later.

import os
import random
import time
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup as bs

ua = UserAgent()
header = {"User-Agent": ua.random}  # spoof the browser User-Agent header
proxies = [
    '152.67.48.62:3128',
    '103.134.168.180:80',
    '103.152.5.70:8080',
    '51.158.180.179:8811',
    '103.134.168.81:80',
    '103.134.168.16:80',
    '54.179.49.83:1080',
    '159.89.221.73:3128',
    '103.218.240.75:80',
]
ip = random.choice(proxies)
# The dict key is the scheme of the target site; the value is how to reach the
# proxy. Plain HTTP proxies like these generally expect an http:// proxy URL
# for both entries, so "https://" is not used on the value side.
proxy_ip = 'http://' + ip
proxy = {'https': proxy_ip, 'http': proxy_ip}
url = "https://avgle.com/videos?page={}"  # the site's video list URLs follow this pattern
lst = []
for page in range(1, 11):  # scrape pages 1-10
    time.sleep(2)  # wait 2 seconds between requests
    soup = bs(requests.get(url.format(page), headers=header, proxies=proxy).text, "lxml")
    chat = soup.find_all("div", "well well-sm")
    link = list(x.find_next("a")["href"] for x in chat)
    for i in link:
        lst.append("https://www.avgle.com{}".format(i))
dir_name = "avlge"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
with open(dir_name+"/"+"avgleList.txt", "w") as f:  # save to a local text file named "avgleList"
    # remove duplicates while keeping the first-seen order
    for a in sorted(list(set(lst)), key=lst.index):
        f.write(a+"\n")

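The scraped results are shown in the figure below:

With this list you can browse the video URLs quickly and paste any of them straight into a browser, which is very convenient.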
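The script removes duplicate links with `sorted(list(set(lst)), key=lst.index)`, which keeps the order in which links were first seen but calls `lst.index` once per item. A shorter alternative, shown here only as a sketch, relies on dictionaries preserving insertion order (Python 3.7 and later). The function name `unique_in_order` and the sample paths are hypothetical.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


links = ["/video/1", "/video/2", "/video/1", "/video/3", "/video/2"]
print(unique_in_order(links))  # prints ['/video/1', '/video/2', '/video/3']

# Same result as the expression used in the script above
same = sorted(list(set(links)), key=links.index)
print(same == unique_in_order(links))  # prints True
```

For a few hundred links either form is fine; the dictionary version mainly reads more directly.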

To search Avgle.com instead, Bohengjun wrote the following code, which looks up video playback URLs by keyword:

import os
import math
import time
import requests
from urllib.parse import quote_plus
from fake_useragent import UserAgent
from bs4 import BeautifulSoup as bs

ua = UserAgent()
header = {"User-Agent": ua.random}
url = "https://avgle.com/search/videos?search_query={}&search_type=videos&page={}"
lst = []
# enter the keywords to search for; URL-encode them so spaces and non-ASCII text are safe
keywords = quote_plus(input("Please input your keywords:"))
sup = bs(requests.get(url.format(keywords, 1), headers=header).text, "html.parser")
# Each page shows 18 results: divide the total by 18 and round up to get the page count
page = math.ceil(int(sup.find_all("span", "text-white")[-1].get_text())/18)
for i in range(1, page+1):
    time.sleep(2)  # wait 2 seconds between requests
    soup = bs(requests.get(url.format(keywords, i), headers=header).text, "lxml")
    chat = soup.find_all("div", "well well-sm")
    link = list(x.find_next("a")["href"] for x in chat)
    for w in link:
        lst.append("https://www.avgle.com{}".format(w))
dir_name = "avlge"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
with open(dir_name+"/"+"avgleSearch.txt", "w") as f:
    # remove duplicates while keeping the first-seen order
    for a in sorted(list(set(lst)), key=lst.index):
        f.write(a+"\n")

In the same way, you can scrape novels, multi-page articles, and so on.
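The pagination step carries over to other sites with little change. The sketch below pulls it out into two small helpers: one turns a total result count into a page count, the other fills a URL template for every page. The names `page_count` and `page_urls`, the `example.com` template, and the default of 18 results per page (taken from the search script above) are assumptions; a different site will have its own template and page size.

```python
import math
from urllib.parse import quote_plus


def page_count(total_results, per_page=18):
    """Number of pages needed to show total_results items."""
    return math.ceil(total_results / per_page)


def page_urls(template, keywords, total_results, per_page=18):
    """Fill a '{}'-style template with the encoded keywords and each page number."""
    query = quote_plus(keywords)
    last = page_count(total_results, per_page)
    return [template.format(query, n) for n in range(1, last + 1)]


template = "https://example.com/search?q={}&page={}"  # hypothetical site
urls = page_urls(template, "python tutorial", 40)
for u in urls:
    print(u)
# 40 results at 18 per page need 3 pages, so three URLs are printed
```

Each generated URL can then be fetched and parsed in a loop, with a `time.sleep` between requests as in the scripts above.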
