Scrapy Module

1 Introduction to Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archiving. It offers the following features:

  • Full-site data crawling
  • Built-in XPath support
  • Asynchronous data downloading
  • High-performance persistent storage
  • Distributed crawling support

Official site: Scrapy | A Fast and Powerful Scraping and Web Crawling Framework (https://scrapy.org)

1.1 Installation

# Twisted is an event-driven networking engine written in Python; Scrapy is built on top of Twisted
pip install twisted

# Install Scrapy
pip install scrapy
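
Installing Scrapy with pip pulls in Twisted automatically as a dependency, so the separate Twisted installation above is usually not required.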

1.2 Scrapy Global Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
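
For example, scrapy shell 'https://scrapy.org' downloads the page and opens an interactive console with the response object preloaded, which is convenient for testing XPath expressions before writing them into a spider.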
    

1.3 Scrapy Project Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results

2 Scrapy Operations

2.1 Creating a Project

# Create the project
scrapy startproject <scrapyPJname>

# Create a spider file
cd <scrapyPJname>
scrapy genspider <spiderName> www.xxx.com

# Run the spider
scrapy crawl <spiderName>
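
After these two commands the project has roughly the following layout (names follow the placeholders above):

scrapyPJname/
├── scrapy.cfg              # deployment configuration
└── scrapyPJname/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # downloader / spider middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── spiderName.py   # spider generated by genspider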

2.2 Configuring the Project

# settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 ' \
             'Safari/537.36 '

## Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False

## Log
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'

# 300 is the priority; the lower the number, the higher the priority
ITEM_PIPELINES = {
   'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}

2.3 Data Parsing

extract(): returns a list of strings, one for each Selector object in the result
extract_first(): returns only the string from the first Selector, or None when nothing matched (see the example below)
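
For example, given a response, the two calls differ as follows (the XPath here is only illustrative):

titles = response.xpath('//div[@class="title"]/a/text()').extract()        # ['title 1', 'title 2', ...]
first  = response.xpath('//div[@class="title"]/a/text()').extract_first()  # 'title 1', or None if nothing matched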

2.4 Persistent Storage

Via terminal command:
  - Only the return value of the parse method can be written to a file on disk
  - scrapy crawl first -o file.csv
Via pipelines: pipelines.py
  - Workflow:
      - 1. Parse the data
      - 2. Define the corresponding fields in the item class
      - 3. Store the parsed data in an item object, e.g. item['p']
      - 4. Submit the item object to the pipeline
      - 5. The process_item method of the pipeline class receives the item object and persists it in whatever form you need
      - 6. Enable the pipeline in the settings file
  - Additional details (see the sketch below):
      - Each pipeline class in the pipelines file stores the data on one kind of platform.
      - If the pipelines file defines several pipeline classes, the item submitted by the spider goes to the pipeline class with the highest priority first.
      - Returning item from process_item passes the item on to the next pipeline class to be executed
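
A minimal sketch of how two pipeline classes chain together (the module path uses the <scrapyPJname> placeholder from 2.1; the print calls only stand in for real storage logic):

# pipelines.py
class FilePipeline(object):
    def process_item(self, item, spider):
        # the pipeline with the lowest number receives the item first
        print('file pipeline:', item)
        return item  # returning item hands it on to the next pipeline class

class BackupPipeline(object):
    def process_item(self, item, spider):
        print('backup pipeline:', item)
        return item

# settings.py
ITEM_PIPELINES = {
   'scrapyPJname.pipelines.FilePipeline': 300,
   'scrapyPJname.pipelines.BackupPipeline': 301,
}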

3 Examples

3.1 Persistence via Terminal Command

  • ctspider.py
import scrapy


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        data_list = []
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # Note: the list returned by xpath contains Selector objects; the string data we want
            # is stored inside those objects and must be pulled out with extract()
            # title = div.xpath('./div/div/div[1]/a/text()')  # [<Selector xpath='./div/div/div[1]/a/text()' data='泽连斯基何以当选《时代》2022年度人物?'>]
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            # the list returned by xpath may contain several Selector objects; extract() pulls the string out of one
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()  # ['知世']
            content = div.xpath('./div[1]/div/div[1]/div[3]/text()').extract_first()  # 美国《时代》杂志将乌克兰总统泽连斯基及“乌克兰精神”评为2022年度风云人...
            # collect the parsed fields into the list that will be returned
            data = {
                'title':title,
                'author':author,
                'content':content
            }
            data_list.append(data)
        return data_list

  • scrapy crawl ctspider -o ctresult.csv
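
The -o option infers the feed format from the file extension; besides .csv, extensions such as .json, .jl (JSON lines), and .xml work the same way.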

3.2 Introducing Items

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # Field is a universal field type that can hold any kind of data
    title = scrapy.Field()
    author = scrapy.Field()

  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # persistence via the terminal export command (-o)
    def parse(self, response):
        title = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div/div[1]/a/text()').extract_first()
        author = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()

        # instantiate an item object
        ctitem = items.Scrapypj01Item()
        ctitem['title'] = title
        ctitem['author'] = author

        return ctitem

  • scrapy crawl ctspider -o ctspider.csv
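
Because parse now returns an Item, the -o export uses the Field names defined in items.py (title and author) as the column headers of ctspider.csv.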

3.3 Pipeline-Based Persistence: pipelines.py

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Pipeline
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
  • settings.py
ITEM_PIPELINES = {
   'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem
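
  • scrapy crawl ctspider (no -o option is needed here; the enabled pipeline itself writes ./ctresult.txt)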

3.4 Persistence to MySQL

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# MySQL
import pymysql


# dedicated to persistent storage
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started')
        # charset utf8mb4 keeps the Chinese titles intact in the database
        self.conn = pymysql.connect(host='10.1.1.8', port=3306, user='root', password='Admin@123', db='spiderdb',
                                    charset='utf8mb4')

    def process_item(self, item, spider):
        sql = 'insert into ctinfo values(%s,%s)'
        data = (item['title'], item['author'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except Exception as error:
            print(error)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
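
The insert statement above assumes a ctinfo table with two columns matching title and author. A minimal one-off sketch for creating it (the column types and lengths are assumptions, adjust them to your data):

import pymysql

conn = pymysql.connect(host='10.1.1.8', port=3306, user='root',
                       password='Admin@123', db='spiderdb', charset='utf8mb4')
with conn.cursor() as cursor:
    # column names follow the insert statement in MysqlPipeline.process_item
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS ctinfo (
            title  VARCHAR(255),
            author VARCHAR(64)
        ) DEFAULT CHARSET = utf8mb4
    ''')
conn.commit()
conn.close()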

  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.MysqlPipeline': 301,
}
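
If the file-writing Scrapypj01Pipeline defined above should keep running alongside MySQL, both classes can be registered at once; the item then flows through them in ascending priority order, which is why each process_item returns item:

ITEM_PIPELINES = {
      'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
      'scrapypj01.pipelines.MysqlPipeline': 301,
}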
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # parse the page and submit items to the pipelines
    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem

3.5 Persistence to Redis

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Redis
import json

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')

    def process_item(self, item, spider):
        # redis-py can only store bytes/str/numbers, so serialize the item to JSON first
        self.conn.lpush('ctlist', json.dumps(dict(item), ensure_ascii=False))
        return item
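
To check what has been stored, the list can be read back with lrange; a minimal sketch (the key name ctlist comes from the pipeline above):

import json

from redis import Redis

conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')
# read every element of the ctlist list and decode it back into a dict
for raw in conn.lrange('ctlist', 0, -1):
    print(json.loads(raw))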
  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.RedisPipeline': 302,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # parse the page and submit items to the pipelines
    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem