Web Scraping: Extracting the Links from a Simple Page

December 09, 2023

测试

1 min read

Task: fetch a page, extract every link on it, and print the links along with their count.

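from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://blog.csdn.net/mercury_lc")  # open the URL
bsObj = BeautifulSoup(html, features='lxml')  # hand the page's HTML to bs4 (needs the lxml parser installed)
# print(bsObj)
cnt = 0
for link in bsObj.find_all("a"):  # find_all is the current name for findAll
    if 'href' in link.attrs:  # attrs is the dict of the tag's HTML attributes
        # print(link.attrs)
        print(link.attrs['href'])  # attrs can hold several attributes; only href is needed
        cnt += 1
print("Number of links on the page:")
print(cnt)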
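
The task statement talks about returning the links, while the script above prints them. As a rough sketch of the "return" variant that also runs without a network connection, the block below collects `href` values from an HTML string and returns them as a list. Note that it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`; the names `LinkCollector`, `extract_links`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)
                    break  # only the first href of a tag is kept


def extract_links(html_text):
    """Return the list of href values found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Made-up sample HTML: two links with href, one anchor without.
sample = '<a href="/a">A</a><a name="x">no href</a><a href="https://example.com/b">B</a>'
links = extract_links(sample)
print(links)
print("Number of links on the page:", len(links))
```

Returning a list instead of printing inside the loop makes the result easy to count, filter, or pass to another function.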

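Of course, this is copy-pasted straight from the textbook; the point is to learn BeautifulSoup's four object types (Tag, NavigableString, BeautifulSoup, and Comment).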
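
To see each of those object types concretely, here is a minimal sketch. It assumes a made-up inline HTML snippet rather than the page fetched above, and it uses the built-in `html.parser` backend so that lxml is not required; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical inline HTML so the example runs without a network request.
sample = '<p><a href="/home">Home</a><!-- note --></p>'
soup = BeautifulSoup(sample, "html.parser")

link = soup.a                 # the first <a> element: a Tag
text = link.string            # the text inside the tag: a NavigableString
comment = soup.p.contents[1]  # the HTML comment after the link: a Comment

for obj in (soup, link, text, comment):
    print(type(obj).__name__)

# A Tag exposes its attributes like a dict, which is what the link loop relies on.
print(link["href"])
```

The whole parsed document is itself a `BeautifulSoup` object, which is why `find_all` can be called on it directly in the script above.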
