Scraping Zaihang Data with Scrapy
Prerequisites
- MongoDB service
- Python environment
Setup
# Start the MongoDB service
$ mongod.exe --dbpath C:\dev\soft\mongodb-win32-x86_64-windows-6.0.1\data
# Check the Scrapy project's dependencies
$ cat requirements.txt
scrapy
pymongo
$ pwd
/c/dev/python/scrapy
# Install the dependencies
$ python -m pip install -r requirements.txt
# Create the project
$ python -m scrapy startproject zaihang
New Scrapy project 'zaihang', using template directory 'C:\Users\chenh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\templates\project', created in:
C:\dev\python\scrapy\zaihang
You can start your first spider with:
cd zaihang
scrapy genspider example example.com
# Generate the spider
$ cd zaihang
$ python -m scrapy genspider zaih zaihang.com
Created spider 'zaih' using template 'basic' in module:
zaihang.spiders.zaih
Development
Final code structure
Source: github.com:xchenhao/crawl.git
$ pwd
/c/dev/python/scrapy/zaihang
$ tree
.
│ requirements.txt
│ scrapy.cfg
│
└─zaihang
│ api_resp.py
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├─mongo_
│ __init__.py
│
└─spiders
zaih.py
__init__.py
Start Crawling the Data
$ python -m scrapy crawl zaih
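For items to actually reach MongoDB during the crawl, the pipeline has to be registered in `settings.py`. A fragment might look like this (the pipeline path and settings key names follow the sketch assumptions above and may differ from the actual project):

```python
# settings.py (fragment) -- key names are assumptions
ITEM_PIPELINES = {
    # Lower numbers run earlier; 300 is a conventional middle priority
    "zaihang.pipelines.MongoPipeline": 300,
}
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "zaihang"
```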