Installing Python Scrapy on Linux and building an example crawler

Author: 简简单单, 2015-08-17

1: Installation steps

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

sudo apt-get update && sudo apt-get install scrapy-0.25
2: Once the commands above finish, run scrapy. If it prints no error, the installation succeeded. Then run:

scrapy startproject TestMore

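This creates a Scrapy project directory named TestMore under the current directory. Next, let's write a spider that scrapes a web page.

The example site is http://www.111com.net. (Note that the spider listing itself crawls the Douban Top 250 movie list at http://movie.douban.com/top250.)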

TestMore/spiders/spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector

# DoubanmovieItem must be defined in the project's items.py
# with the fields title, movieInfo, star and quote.
from TestMore.items import DoubanmovieItem


class TestMore(CrawlSpider):
    name = "TestMore"
    # Not used by CrawlSpider itself.
    redis_key = 'TestMore:start_urls'
    start_urls = ['http://movie.douban.com/top250']

    # Base URL that the relative "next page" link is appended to.
    url = 'http://movie.douban.com/top250'

    def parse(self, response):
        # print(response.body)
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for eachMovie in movies:
            # Create a fresh item per movie so yielded items do not share state.
            item = DoubanmovieItem()
            # The title is split across several <span> elements; join the pieces.
            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''.join(title)
            movieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
            star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span/em/text()').extract()[0]
            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            # quote may be missing, so check before indexing.
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieInfo'] = ';'.join(movieInfo)
            item['star'] = star
            item['quote'] = quote
            yield item

        # Pagination: follow the "next page" link if there is one.
        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextLink:
            nextLink = nextLink[0]
            print(nextLink)
            yield Request(self.url + nextLink, callback=self.parse)
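The last step of parse() builds the next page's address by string concatenation (self.url + nextLink), which only works while the extracted link is a bare query string. One possible alternative, sketched here with the standard library's urljoin, resolves the link against the current page URL instead. The helper name next_page_url and the sample links are illustrative assumptions, not part of the original spider.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def next_page_url(current_url, next_link):
    """Resolve a possibly relative next-page link against the current URL."""
    return urljoin(current_url, next_link)


if __name__ == "__main__":
    base = 'http://movie.douban.com/top250'
    # A query-only link keeps the path of the current page.
    print(next_page_url(base, '?start=25&filter='))
    # A root-relative link replaces the path but keeps the host.
    print(next_page_url(base, '/top250?start=50'))
```

Inside the spider this could be called as next_page_url(response.url, nextLink), which removes the need for the separate url class attribute.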
