Scrapy Module

1 Introduction to Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archiving. It offers the following features:

  • Full-site data crawling
  • Built-in XPath support
  • Asynchronous data downloading
  • High-performance persistent storage
  • Distributed crawling support

Official site: Scrapy | A Fast and Powerful Scraping and Web Crawling Framework (https://scrapy.org)

1.1 Installation

# Twisted is an event-driven networking engine written in Python; Scrapy is built on top of Twisted
pip install twisted

# Install Scrapy
pip install scrapy
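
Installing Scrapy with pip pulls in Twisted automatically as a dependency, so the separate Twisted installation above is usually not required.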

1.2 Scrapy Global Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
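
For example, scrapy shell 'https://scrapy.org' downloads the page and opens an interactive console with the response object preloaded, which is convenient for testing XPath expressions before writing them into a spider.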
    

1.3 Scrapy Project Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results

2 Scrapy Operations

2.1 Creating a Project

# Create the project
scrapy startproject <scrapyPJname>

# Create a spider file
cd <scrapyPJname>
scrapy genspider <spiderName> www.xxx.com

# Run the spider
scrapy crawl <spiderName>
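
After these two commands the project has roughly the following layout (names follow the placeholders above):

scrapyPJname/
├── scrapy.cfg              # deployment configuration
└── scrapyPJname/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # downloader / spider middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── spiderName.py   # spider generated by genspider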

2.2 Configuring the Project

# settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 ' \
             'Safari/537.36 '

## Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False

## Log
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'

# 300 is the priority; the lower the number, the higher the priority
ITEM_PIPELINES = {
   'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}

2.3 Data Parsing

extract(): returns a list of strings, one for each Selector object in the result
extract_first(): returns only the string from the first Selector, or None when nothing matched (see the example below)
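
For example, given a response, the two calls differ as follows (the XPath here is only illustrative):

titles = response.xpath('//div[@class="title"]/a/text()').extract()        # ['title 1', 'title 2', ...]
first  = response.xpath('//div[@class="title"]/a/text()').extract_first()  # 'title 1', or None if nothing matched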

2.4 Persistent Storage

Via terminal command:
  - Only the return value of the parse method can be written to a file on disk
  - scrapy crawl first -o file.csv
Via pipelines: pipelines.py
  - Workflow:
      - 1. Parse the data
      - 2. Define the corresponding fields in the item class
      - 3. Store the parsed data in an item object, e.g. item['p']
      - 4. Submit the item object to the pipeline
      - 5. The process_item method of the pipeline class receives the item object and persists it in whatever form you need
      - 6. Enable the pipeline in the settings file
  - Additional details (see the sketch below):
      - Each pipeline class in the pipelines file stores the data on one kind of platform.
      - If the pipelines file defines several pipeline classes, the item submitted by the spider goes to the pipeline class with the highest priority first.
      - Returning item from process_item passes the item on to the next pipeline class to be executed
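
A minimal sketch of how two pipeline classes chain together (the module path uses the <scrapyPJname> placeholder from 2.1; the print calls only stand in for real storage logic):

# pipelines.py
class FilePipeline(object):
    def process_item(self, item, spider):
        # the pipeline with the lowest number receives the item first
        print('file pipeline:', item)
        return item  # returning item hands it on to the next pipeline class

class BackupPipeline(object):
    def process_item(self, item, spider):
        print('backup pipeline:', item)
        return item

# settings.py
ITEM_PIPELINES = {
   'scrapyPJname.pipelines.FilePipeline': 300,
   'scrapyPJname.pipelines.BackupPipeline': 301,
}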

3 Examples

3.1 Persistence via Terminal Command

  • ctspider.py
import scrapy


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        data_list = []
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # Note: the list returned by xpath contains Selector objects; the string data we want
            # is stored inside those objects and must be pulled out with extract()
            # title = div.xpath('./div/div/div[1]/a/text()')  # [<Selector xpath='./div/div/div[1]/a/text()' data='泽连斯基何以当选《时代》2022年度人物?'>]
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            # the list returned by xpath may contain several Selector objects; extract() pulls the string out of one
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()  # ['知世']
            content = div.xpath('./div[1]/div/div[1]/div[3]/text()').extract_first()  # 美国《时代》杂志将乌克兰总统泽连斯基及“乌克兰精神”评为2022年度风云人...
            # collect the parsed fields into the list that will be returned
            data = {
                'title':title,
                'author':author,
                'content':content
            }
            data_list.append(data)
        return data_list

  • scrapy crawl ctspider -o ctresult.csv
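
The -o option infers the feed format from the file extension; besides .csv, extensions such as .json, .jl (JSON lines), and .xml work the same way.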

3.2 Introducing Items

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # Field is a universal field type that can hold any kind of data
    title = scrapy.Field()
    author = scrapy.Field()

  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # persistence via the terminal export command (-o)
    def parse(self, response):
        title = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div/div[1]/a/text()').extract_first()
        author = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()

        # instantiate an item object
        ctitem = items.Scrapypj01Item()
        ctitem['title'] = title
        ctitem['author'] = author

        return ctitem

  • scrapy crawl ctspider -o ctspider.csv
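
Because parse now returns an Item, the -o export uses the Field names defined in items.py (title and author) as the column headers of ctspider.csv.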

3.3 Pipeline-Based Persistence: pipelines.py

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Pipeline
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
  • settings.py
ITEM_PIPELINES = {
   'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem
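
  • scrapy crawl ctspider (no -o option is needed here; the enabled pipeline itself writes ./ctresult.txt)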

3.4 Persistence to MySQL

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# MySQL
import pymysql


# dedicated to persistent storage
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started')
        # charset utf8mb4 keeps the Chinese titles intact in the database
        self.conn = pymysql.connect(host='10.1.1.8', port=3306, user='root', password='Admin@123', db='spiderdb',
                                    charset='utf8mb4')

    def process_item(self, item, spider):
        sql = 'insert into ctinfo values(%s,%s)'
        data = (item['title'], item['author'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except Exception as error:
            print(error)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
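
The insert statement above assumes a ctinfo table with two columns matching title and author. A minimal one-off sketch for creating it (the column types and lengths are assumptions, adjust them to your data):

import pymysql

conn = pymysql.connect(host='10.1.1.8', port=3306, user='root',
                       password='Admin@123', db='spiderdb', charset='utf8mb4')
with conn.cursor() as cursor:
    # column names follow the insert statement in MysqlPipeline.process_item
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS ctinfo (
            title  VARCHAR(255),
            author VARCHAR(64)
        ) DEFAULT CHARSET = utf8mb4
    ''')
conn.commit()
conn.close()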

  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.MysqlPipeline': 301,
}
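
If the file-writing Scrapypj01Pipeline defined above should keep running alongside MySQL, both classes can be registered at once; the item then flows through them in ascending priority order, which is why each process_item returns item:

ITEM_PIPELINES = {
      'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
      'scrapypj01.pipelines.MysqlPipeline': 301,
}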
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # parse the page and submit items to the pipelines
    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem

3.5 Persistence to Redis

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Redis
import json

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')

    def process_item(self, item, spider):
        # redis-py can only store bytes/str/numbers, so serialize the item to JSON first
        self.conn.lpush('ctlist', json.dumps(dict(item), ensure_ascii=False))
        return item
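
To check what has been stored, the list can be read back with lrange; a minimal sketch (the key name ctlist comes from the pipeline above):

import json

from redis import Redis

conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')
# read every element of the ctlist list and decode it back into a dict
for raw in conn.lrange('ctlist', 0, -1):
    print(json.loads(raw))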
  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.RedisPipeline': 302,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # parse the page and submit items to the pipelines
    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # submit the item object to the pipeline
            yield ctitem