
Crawling Zaihang (在行) Data with Scrapy

Prerequisites

  • MongoDB service
  • Python environment

Setup

# Start the MongoDB service
$ mongod.exe --dbpath C:\dev\soft\mongodb-win32-x86_64-windows-6.0.1\data

# View the Scrapy project dependencies
$ cat requirements.txt
scrapy
pymongo

$ pwd
/c/dev/python/scrapy

# Install dependencies
$ python -m pip install -r requirements.txt

# Create the project
$ python -m scrapy startproject zaihang
New Scrapy project 'zaihang', using template directory 'C:\Users\chenh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\templates\project', created in:
    C:\dev\python\scrapy\zaihang

You can start your first spider with:
    cd zaihang
    scrapy genspider example example.com

# Generate the spider
$ cd zaihang
$ python -m scrapy genspider zaih zaihang.com
Created spider 'zaih' using template 'basic' in module:
  zaihang.spiders.zaih

Development

Final code structure

Source code: github.com:xchenhao/crawl.git

$ pwd
/c/dev/python/scrapy/zaihang

$ tree
.
│  requirements.txt
│  scrapy.cfg
│
└─zaihang
    │  api_resp.py
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─mongo_
    │  __init__.py
    │
    └─spiders
       zaih.py
       __init__.py
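`pipelines.py` is where scraped items get written into MongoDB. A minimal sketch of such a pipeline, assuming dict-like items and conventional `MONGO_URI`/`MONGO_DATABASE` settings (the class name, setting names, and collection naming are assumptions, not the repo's actual code):

```python
class MongoPipeline:
    """Hypothetical sketch of zaihang/pipelines.py: persist each item to MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", mongo_db="zaihang"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection info from settings.py (setting names are assumptions)
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "zaihang"),
        )

    def open_spider(self, spider):
        import pymongo  # deferred import so this module loads even without pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One document per item, in a collection named after the spider
        self.db[spider.name].insert_one(dict(item))
        return item
```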

Start crawling

$ python -m scrapy crawl zaih
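For crawled items to actually reach MongoDB, an item pipeline has to be registered in `zaihang/settings.py`. A hypothetical fragment (the pipeline class name and the `MONGO_*` setting names are assumptions, not the repo's actual values):

```python
# zaihang/settings.py (hypothetical additions)

# Connection info read by the MongoDB pipeline; these are project conventions,
# not built-in Scrapy settings.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "zaihang"

# Enable the pipeline; 300 is its priority in the pipeline chain.
ITEM_PIPELINES = {
    "zaihang.pipelines.MongoPipeline": 300,
}
```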

Inspect the data

[Screenshot: crawled Zaihang data in MongoDB (scrapy_zaihang_mongodb_data)]
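Besides the MongoDB shell, the stored documents can be spot-checked from Python with pymongo. A small helper sketch; the database and collection names here are guesses based on the project name, not confirmed by the source:

```python
def peek(uri="mongodb://localhost:27017", db_name="zaihang", coll_name="zaih", limit=3):
    """Fetch a few documents written by the crawl (names are assumptions)."""
    from pymongo import MongoClient  # deferred so this file imports without pymongo

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        return list(client[db_name][coll_name].find().limit(limit))
    finally:
        client.close()


if __name__ == "__main__":
    for doc in peek():
        print(doc)
```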

References