AI Web Crawler Basics: Write Your First Web Scraping Script in Python

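Many people want to learn web scraping but get scared off by overly complicated tutorials. In fact, a basic crawler in Python takes far less code than you might expect. Today we will start from scratch and write a small script that fetches a web page, step by step.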

Environment Setup

To get started, you only need to install one library:

pip install requests

Your First Crawler

Suppose we want to scrape the heading of a web page. The plan is simple: request the page → parse the content → extract the data.

import requests

url = "https://example.com"
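response = requests.get(url, timeout=10)  # give up after 10 seconds instead of hanging forever
print(response.status_code)  # print the HTTP status code (200 means success)
print(response.text[:500])   # print the first 500 characters of the HTML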

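After running it, you will see the page's HTML source. In your browser, right-click the page and view its source, find the element you want to scrape (for example the <h1> tag), and write your parsing rules to match it.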

Parsing with BeautifulSoup

Reading raw HTML by eye is painful, so let BeautifulSoup parse it. Install it first:

pip install beautifulsoup4

from bs4 import BeautifulSoup  # the beautifulsoup4 package is imported as bs4

html = response.text
soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('h1')
for t in titles:
    print(t.get_text())
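
To make the parsing step easier to try without hitting a real site, here is a minimal sketch that runs BeautifulSoup on an inline HTML string. The SAMPLE_HTML snippet, the "post" class name, and the link paths are made up for illustration; on a real page you would pass response.text instead and adjust the selectors to match what you see in the page source.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text (not taken from a real site)
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>  First Heading </h1>
    <div class="post"><a href="/a">Article A</a></div>
    <div class="post"><a href="/b">Article B</a></div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the <title> tag
page_title = soup.title.get_text()

# find_all returns every matching tag; strip=True trims surrounding whitespace
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# select takes a CSS selector; tag["href"] reads an attribute value
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("div.post a")]

print(page_title)
print(headings)
print(links)
```

find_all is handy when you only need a tag name, while select is usually more convenient once the rule involves classes or nesting.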

Things to Keep in Mind

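Follow the rules: check the site's robots.txt, and do not crawl what it disallows
Control your request rate: add time.sleep(1) between requests instead of hammering the server
User-Agent: some sites block the default Python request header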
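
The three points above can be sketched with the standard library's urllib.robotparser plus a small loop. This is only an illustration under stated assumptions: the crawler name in USER_AGENT, the SAMPLE_ROBOTS rules, the helper names, and fake_fetch are all hypothetical, and the sketch parses an inline robots.txt so it runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-first-crawler/0.1"  # hypothetical crawler name

# Hypothetical robots.txt content, used instead of downloading a real one
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def crawl_politely(urls, parser, fetch, delay_seconds=1.0):
    """Fetch each URL that robots.txt allows, pausing after every request."""
    results = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            print(f"Skipping (disallowed by robots.txt): {url}")
            continue
        results[url] = fetch(url)
        time.sleep(delay_seconds)  # rate limiting between requests
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request, so the sketch works offline."""
    return f"<html>{url}</html>"


parser = build_robots_parser(SAMPLE_ROBOTS)
pages = crawl_politely(
    ["https://example.com/", "https://example.com/private/data"],
    parser,
    fake_fetch,
    delay_seconds=0,  # no delay for the offline demo; use about 1 second on real sites
)
print(list(pages))
```

Against a real site, you would load the rules with parser.set_url(...) followed by parser.read(), and replace fake_fetch with a function that calls requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10).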

Complete Example

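import time

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites block the default Python User-Agent

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # stop early on a 4xx/5xx response
soup = BeautifulSoup(resp.text, 'html.parser')

# Print the text of every <h1> on the page
for item in soup.select('h1'):
    print(item.get_text().strip())

time.sleep(1)  # crawl politely: pause before sending the next request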

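That is a complete crawler. The core flow is: send the request → get the content → parse → save (the script above prints its results instead of writing them to a file). Once you are comfortable with these four steps, you can handle most basic web scraping tasks.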
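
Since the complete example stops at printing, here is one possible sketch of the final "save" step using the standard csv module. The save_titles helper, the column name, and the output file name are assumptions for illustration, and the titles list is a stand-in for the strings your crawler actually extracts.

```python
import csv
import os
import tempfile


def save_titles(titles, path):
    """Write one title per row to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])  # header row (column name is an assumption)
        for title in titles:
            writer.writerow([title])


# Stand-in data; in the crawler this would be
# [item.get_text().strip() for item in soup.select('h1')]
titles = ["Example Domain", "Another Heading"]

# Written to the system temp directory so the sketch leaves nothing in your project
output_path = os.path.join(tempfile.gettempdir(), "titles.csv")
save_titles(titles, output_path)
print(f"Saved {len(titles)} titles to {output_path}")
```

CSV is a reasonable default for small jobs because it opens directly in a spreadsheet; for nested data, json.dump would be the closer fit.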

Tags: Tech #Tutorial #Crawler #Python
