Python : Web Crawling with Scrapy

Introduction

A long time ago... In my parents' house far, far away... I used to collect Star Wars CCG cards... My card collection was a fun part of my childhood that was, during some unfortunate purge of my stuff in my teenage years, lost forever.

A few weeks ago, my wife and I rewatched the Star Wars trilogy for the first time in over 10 years. My interest in Star Wars reawakened, I decided to see if I could reassemble my old card collection digitally (the original Star Wars CCG game now having gone out of print).

After some searching and a little luck with Google, I found a website that hosted individual GIF images of each card! But downloading the 1000+ card images by hand was a little daunting, even with my nostalgic fervor. And since I needed an excuse to learn more Python on my Raspberry Pi anyway, I decided to automate the downloads using a web crawler / scraper library written in Python called Scrapy.

Installation

Scrapy is installed through pip, Python's package installer. If you do not have pip installed it can be installed through apt-get:

sudo apt-get install python-pip  

Before installing scrapy, there are a few additional dependencies needed:

sudo apt-get install python-lxml libffi-dev  

Once those are installed you can install Scrapy through pip:

sudo pip install Scrapy  

Scrapy also needs an additional Python dependency to handle hostname verification when using SSL. You may not need it, but it is better to install it now while we are here:

sudo pip install service_identity  

Finally, for downloading images (the goal of this little project), we need to ensure that Python has the necessary image handling libraries. (Pillow is reinstalled after libjpeg-dev so that it is rebuilt with JPEG support.)

sudo pip uninstall pillow  
sudo apt-get install libjpeg-dev  
sudo pip install -I pillow  

NOTE: Scrapy is not to be confused with Scrappy. Scrappy is a Python library for renaming video files... Mind your Ps and Qs! But especially those Ps...

Once installed, you should be able to type scrapy at your terminal and see Scrapy's usage information.

To start a new scrapy project you can use Scrapy's scaffolding:

scrapy startproject <project name>  

This will create a default project structure (including files for Items, Item Pipelines, Spiders, and other goodness discussed below). You can use these as a starting point to write your own!
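
For reference, the generated layout looks roughly like this (the exact files can vary a little between Scrapy versions; ccg is the project name I use for the rest of this post):

ccg/
    scrapy.cfg            # deploy configuration file
    ccg/                  # the project's Python module
        __init__.py
        items.py          # Item definitions
        pipelines.py      # Item Pipelines
        settings.py       # project settings
        spiders/          # Spider modules
            __init__.py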

Scrapy

Scrapy has several key concepts:

  • Spider - A Spider is a class that encapsulates the logic for how to traverse URLs and how to extract information from a page for processing.

  • Item - An Item is a container that holds information from a page. Items are created by Spiders and can be processed in different ways depending on the type of Item.

  • Item Pipeline - An Item Pipeline is a processor that handles an Item and performs some action on it or with it. Item Pipelines can be chained together to form a pipeline... (You see what happened there?) 
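
To make the relationships concrete, here is a bare-bones sketch (not this project's code, just the shape of each piece): a Spider yields Items, and each Item is handed to an Item Pipeline's process_item method.

import scrapy

class PageTitle(scrapy.Item):            # hypothetical Item
    title = scrapy.Field()

class TitleSpider(scrapy.Spider):        # hypothetical Spider
    name = "titles"
    start_urls = ["http://www.starwarsccg.org/cardlists/PremiereType.html"]

    def parse(self, response):
        item = PageTitle()
        item['title'] = response.xpath('//head/title/text()').extract()[0]
        yield item

class LogTitlePipeline(object):          # hypothetical Item Pipeline
    def process_item(self, item, spider):
        spider.log(item['title'])
        return item                      # pass the Item on to the next pipeline, if any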

The project I am undertaking (scraping all the card images from this site) is relatively simple, so I created a single Spider that goes through each page to find image links. Each image link is encapsulated as an Item, and a single Item Pipeline handles downloading the image.

Scrapy already provides an "ImagesPipeline" with some useful default behavior. If an Item has an image_urls field, all images in that field are downloaded by ImagesPipeline. The images are saved as files in a configurable directory, with a hash (derived from the image URL) as the filename, and metadata about the downloads is saved to an images field on the Item.
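
If that default behavior were enough, enabling the stock pipeline would just be a matter of two entries in settings.py (shown for the Scrapy version used here, where the pipeline lives under scrapy.contrib; the directory is only a placeholder):

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'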

While it is good that Scrapy can handle the heavy lifting of downloading the images, its default file names are not very helpful. So I decided to extend ImagesPipeline and give it more helpful behavior.

First of all, I wrote a simple Item module:

import scrapy

class CardImage(scrapy.Item):  
    page_title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

As you can see, my CardImage extends Scrapy's Item. I gave it a page_title to assist with debugging, plus the image_urls and images fields required for the default behavior of ImagesPipeline.

Next, I wrote a modified version of ImagesPipeline. (This required digging around in the Scrapy source code... Fortunately, the ImagesPipeline module was written in a well-factored way and I was able to override a single method to achieve the desired behavior.)

import string
import scrapy

from scrapy.contrib.pipeline.images import ImagesPipeline

class CardImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # Name each file <expansion>/<card>.gif, based on the image URL's path segments
        parts = string.split(request.url, '/')
        return parts[-3] + '/' + parts[-1]

The default implementation of file_path just builds the filename from a computed hash. This one takes the image URL, splits it on /, and takes the last and third-from-last elements (which, because of the structure of the site, correspond to the card name and the expansion name, respectively). This creates a nice directory structure in which there is one directory per expansion and the files inside are named for the cards they depict.
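
To make the path logic concrete, here is what that split produces for a made-up image URL that follows the expansion/card layout described above (the real site's URLs may differ slightly in shape):

# Hypothetical image URL, mirroring the <expansion>/.../<card>.gif convention
url = 'http://www.starwarsccg.org/cardlists/Premiere/Dark/DarthVader.gif'
parts = url.split('/')
# parts[-3] == 'Premiere' (the expansion), parts[-1] == 'DarthVader.gif' (the card)
print parts[-3] + '/' + parts[-1]   # prints: Premiere/DarthVader.gif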

NOTE: This highlights an interesting observation about writing web scrapers. Here I am leveraging a specific convention the site authors used. While this makes solving the problem much easier, it means this web scraper is not at all generic. There seems to be a trade-off between "generic" and "easy" when it comes to web scraping!

In order to configure Scrapy to use your pipelines, you need to edit the settings.py file and add an entry to define your pipelines:

ITEM_PIPELINES = {'ccg.pipelines.CardImagePipeline': 1}  

The value (1 in my case) is a priority that determines the order in which pipelines are executed; lower values run earlier.
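
So chaining a second (hypothetical) pipeline to run after the image download would look something like this:

ITEM_PIPELINES = {
    'ccg.pipelines.CardImagePipeline': 1,   # runs first
    'ccg.pipelines.ReportPipeline': 2,      # hypothetical second pipeline, runs after
}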

As mentioned above, ImagesPipeline by default stores images in a configurable directory. That directory is also configured in settings.py as follows:

IMAGES_STORE = '/home/pi/work/scrapy/images'  

Finally we come to the Spider:

import scrapy
import urlparse

from ccg.items import CardImage

class CcgSpider(scrapy.Spider):  
  name = "ccg"
#  allowed_domains = "starwarsccg.org"
  start_urls = [
    "http://www.starwarsccg.org/cardlists/PremiereType.html"
  ]

  seen_urls = []

  def parse(self, response):
    title = response.xpath('//head/title/text()').extract()[0]
    for sel in response.xpath('//a'):
      link = str(sel.xpath('@href').extract()[0])
      if (link.endswith('.gif')):
        cardImage = CardImage()
        cardImage['page_title'] = title
        cardImage['image_urls'] = ['http://www.starwarsccg.org/cardlists/' + link]
        yield cardImage
      if (not link.startswith('V') and link.endswith('Type.html')):
        if (not link in self.seen_urls):
          self.seen_urls.append(link) 
          yield scrapy.Request(urlparse.urljoin('http://www.starwarsccg.org/cardlists/', link), callback=self.parse)

A lot to process here! Let's start from the top:

  name = "ccg"
#  allowed_domains = "starwarsccg.org"
  start_urls = [
    "http://www.starwarsccg.org/cardlists/PremiereType.html"
  ]

Important metadata for the Spider. The start_urls will (obviously) be processed first automatically, and a spider can generate new URLs to follow as it processes them. I commented out the allowed_domains field, but it can be used to limit Scrapy to scraping within certain sites if desired (note that, if enabled, it should be a list, e.g. ["starwarsccg.org"]). It is applicable but not important here, since I already limit the URLs the spider processes through other means.

NOTE: Scrapy gives a strange error (something to the effect of ImportError: No module named items) if you give your Spider module the same name as your project. (My project is named ccg and my Spider's name is "ccg", but the module containing the Spider code is called CcgSpider.)

The parse method contains the core logic of the spider. In this case we use xpath to parse out the page title (which is later saved in the item for debugging), and then process every link on the page.

Inspection of the target website shows that each expansion page (such as Premiere) contains <a> tags for each card image (all ending in .gif) and links to other expansion pages (each ending in "Type.html" by internal convention). I wrote the spider to specifically exploit this, skipping the expansion links that start with "V".

Thus you have this code:

    for sel in response.xpath('//a'):
      link = str(sel.xpath('@href').extract()[0])
      if (link.endswith('.gif')):
        <process image>
      if (not link.startswith('V') and link.endswith('Type.html')):
        if (not link in self.seen_urls):
            <process expansion page>

The spider can yield an arbitrary number of Items or Requests (which are themselves then processed by the spider). For this spider, we encapsulate each image as an Item:

        cardImage = CardImage()
        cardImage['page_title'] = title
        cardImage['image_urls'] = ['http://www.starwarsccg.org/cardlists/' + link]
        yield cardImage

And each expansion page as a request:

          self.seen_urls.append(link) 
          yield scrapy.Request(urlparse.urljoin('http://www.starwarsccg.org/cardlists/', link), callback=self.parse)

Note: To prevent infinite loops, I keep track of which expansion pages I have already visited so that I do not revisit them; hence the seen_urls list. For a general-purpose web crawler you would need a much more scalable solution to this problem.
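
As a small aside (not something this spider needs at its scale), switching seen_urls to a set makes the membership check constant-time rather than a scan of the list; a minimal sketch of that tweak:

# In the spider class, use a set instead of a list:
seen_urls = set()

# And in parse(), add() instead of append():
if link not in self.seen_urls:
    self.seen_urls.add(link)
    yield scrapy.Request(urlparse.urljoin('http://www.starwarsccg.org/cardlists/', link), callback=self.parse)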

Note that when returning a Request you can specify a different method in your spider to handle that Request. This allows you to have different logic for different URLs as needed.
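
For illustration (a hypothetical spider, not the one above, which only needs a single parse method), routing certain links to a second callback looks like this:

import scrapy
import urlparse

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.starwarsccg.org/cardlists/PremiereType.html"]

    def parse(self, response):
        # Hand .html links off to a different callback than the default parse
        for sel in response.xpath('//a'):
            hrefs = sel.xpath('@href').extract()
            if hrefs and hrefs[0].endswith('.html'):
                yield scrapy.Request(urlparse.urljoin(response.url, hrefs[0]), callback=self.parse_detail)

    def parse_detail(self, response):
        # Different logic for these pages; here, just log the page title
        self.log(response.xpath('//head/title/text()').extract()[0])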

That is all there is to this simple web spider!

Thoughts? Questions? Comments? Email me at: [email protected]