Opening Remarks
In the previous article, I gave a quick introduction to scraping site titles with Python's BeautifulSoup library (see the background links below). This time I will scrape the titles, links, and article bodies, and save everything into an Excel spreadsheet. It is actually quite simple, so let me share it with you.
Background
Official library documentation
Walkthrough
First, install the dependencies. I have already covered requests and BeautifulSoup, so this time let me introduce one more: xlwt, a Python library for writing Excel (.xls) files:
```shell
pip install xlwt
```
Once it is installed, create a new file in your editor and start by importing these three libraries:
```python
import xlwt
from xlwt import Workbook
from bs4 import BeautifulSoup
import requests
```
Next, grab the front-page titles and links, just as in the previous article:
```python
import xlwt
from xlwt import Workbook
from bs4 import BeautifulSoup
import requests

# Create the Excel workbook
wb = Workbook()
sheet1 = wb.add_sheet('Sheet 1')
sheet1.write(0, 0, "Title")
sheet1.write(0, 1, "Link")
sheet1.write(0, 2, "Contents")
# The lines above create a sheet with a three-column header row

url = "https://www.luckydesigner.space"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

arr = []
for i in soup.select("h2>a"):
    arr.append(i.get_text())
lst = []
for i in soup.select("h2>a"):
    lst.append(i.get("href"))
# The titles and links are now stored in arr and lst, respectively
```
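As a side note, the two loops over `soup.select("h2>a")` can be collapsed into a single pass. Here is a minimal, self-contained sketch of that idea; the HTML snippet is invented for illustration and stands in for the real front page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real front page (invented for illustration)
html = """
<h2><a href="/post-1/">First post</a></h2>
<h2><a href="/post-2/">Second post</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")

arr = []  # titles
lst = []  # links
for a in soup.select("h2>a"):  # one pass instead of two
    arr.append(a.get_text())
    lst.append(a.get("href"))

print(arr)  # ['First post', 'Second post']
print(lst)  # ['/post-1/', '/post-2/']
```

Each `<a>` tag is visited once, so the page is only traversed a single time.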
The next step is to locate the part of the page that holds the article body. Pick any link from lst, create a new file, and probe the page with some code. The body text usually sits inside <p> or <span> tags, but the exact location varies from site to site. For this site, the search looks like this:
```python
from bs4 import BeautifulSoup
import requests

url = "https://www.luckydesigner.space/howtotestssorssrifnotwithsscap/"
req = requests.get(url)
sup = BeautifulSoup(req.text, "lxml")

# select() returns a list, so prettify each match in turn
# (calling .prettify() on the list itself would raise AttributeError)
for chi in sup.select("div"):
    print(chi.prettify())
```
This prints the full HTML of the article page. Looking through it, you can see that the body text always sits in the <p> tag immediately following each <h4> tag, so the filter can be tightened:
```python
from bs4 import BeautifulSoup
import requests

url = "https://www.luckydesigner.space/howtotestssorssrifnotwithsscap/"
req = requests.get(url)
sup = BeautifulSoup(req.text, "lxml")

chi = sup.select("div>h4")[0:-1]  # skip the last <h4>
for i in chi:
    print(i.find_next("p"))
The output now contains exactly the paragraphs we want.
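That selector logic can also be wrapped in a small helper and exercised offline before pointing it at the live site. A minimal sketch, where both the helper name `extract_paragraphs` and the HTML snippet are invented to mimic this site's `div > h4` + `p` layout (the sketch uses the built-in `html.parser` so it runs without lxml installed):

```python
from bs4 import BeautifulSoup

def extract_paragraphs(html):
    """Return the text of the <p> that follows each <h4> inside a <div>,
    dropping the last <h4> (mirroring the [0:-1] slice used above)."""
    sup = BeautifulSoup(html, "html.parser")
    return [h4.find_next("p").get_text() for h4 in sup.select("div>h4")[0:-1]]

# Invented test page imitating the article structure
html = """
<div><h4>Section 1</h4><p>First paragraph.</p></div>
<div><h4>Section 2</h4><p>Second paragraph.</p></div>
<div><h4>Related posts</h4><p>not part of the body</p></div>
"""

print(extract_paragraphs(html))  # ['First paragraph.', 'Second paragraph.']
```

Testing the extraction against a small fixture like this makes it easy to confirm the selector before issuing real HTTP requests.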
With the selector worked out, return to the first file and finish the script:
```python
import xlwt
from xlwt import Workbook
from bs4 import BeautifulSoup
import requests

# Create the Excel workbook
wb = Workbook()
sheet1 = wb.add_sheet('Sheet 1')
sheet1.write(0, 0, "Title")
sheet1.write(0, 1, "Link")
sheet1.write(0, 2, "Contents")
# The lines above create a sheet with a three-column header row

url = "https://www.luckydesigner.space"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

arr = []
for i in soup.select("h2>a"):
    arr.append(i.get_text())
lst = []
for i in soup.select("h2>a"):
    lst.append(i.get("href"))
# The titles and links are now stored in arr and lst, respectively

cat = []
for i in lst:
    ret = requests.get(i)
    sup = BeautifulSoup(ret.text, "lxml")
    chi = [x.find_next("p").get_text() for x in sup.select("div>h4")[0:-1]]
    cat.append(chi)
# The article bodies are now stored in the cat list

i = 1
while i <= len(lst):
    sheet1.write(i, 0, "{}".format(arr[i - 1]))
    sheet1.write(i, 1, "{}".format(lst[i - 1]))
    sheet1.write(i, 2, "{}".format(cat[i - 1]))
    i += 1
# Fill the sheet row by row

wb.save("demo.xls")
```
Once the script finishes, you will find an Excel file named "demo.xls" in the working directory.
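If you want to verify the file programmatically rather than opening it by hand, the companion xlrd library can read it back (using xlrd here is my assumption; any .xls reader would do). A small sketch that writes a tiny workbook the same way the main script does, then reads the cells back:

```python
import xlrd
import xlwt
from xlwt import Workbook

# Write a tiny workbook the same way the main script does
wb = Workbook()
sheet1 = wb.add_sheet('Sheet 1')
sheet1.write(0, 0, "Title")
sheet1.write(0, 1, "Link")
sheet1.write(0, 2, "Contents")
sheet1.write(1, 0, "Example title")  # invented sample row
wb.save("demo.xls")

# Read it back to confirm the cells landed where expected
book = xlrd.open_workbook("demo.xls")
sheet = book.sheet_by_index(0)
print(sheet.cell_value(0, 0))  # Title
print(sheet.cell_value(1, 0))  # Example title
```

Note that xlrd 2.0 and later only reads the old .xls format, which is exactly what xlwt produces, so the two pair well.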
On an iPad, I opened the "demo" spreadsheet with the built-in Numbers app, as shown below:
And that's it: the titles, links, and article bodies are all saved in a single Excel sheet. Pretty fun, isn't it?