Python: Web Crawling IMDB with Scrapy, Neo4J, and AWS

In my last blog I introduced Scrapy, a web crawling library for Python, and did some very simple image scraping with Scrapy. In this post, I want to dive a little deeper into Scrapy's capabilities and show you some more complex web crawling! 

A little while ago I went to a chalk talk on Neo4J. I was immediately intrigued by the power and simplicity of Neo4J and the different problems it could solve so effectively at scale.

NOTE: For those unfamiliar with Neo4J, it is a graph database that stores nodes (which can have an arbitrary number of attributes) and relationships between nodes. It scales to very large graphs and can perform graph calculations (such as "find the nodes that are connected to a node that is connected to another node through a certain relationship") very efficiently.

I was immediately inspired to (a) install Neo4J on my Raspberry Pi, and (b) scrape IMDB for acting data so I could do things like solve my own "6 Degrees of Kevin Bacon" problem for an arbitrary actor.

For the record, (a) did not work out at all. I was able, after a lot of effort, to get Neo4J installed on a Raspberry Pi B+, but it crashed as soon as it tried to handle its first query. I am told that it works much better on Java 8 on the Raspberry Pi 2, but I will have to wait for my Raspberry Pi 2 to arrive to verify this. I relocated my efforts to an EC2 instance in AWS and proceeded to build my IMDB spider.

With (b) though, I had good success! 

NOTE: Source code can be downloaded from GitHub here.

Installing Neo4J

I launched a t2.medium EC2 instance in AWS with the stock Amazon AMI. I assigned it a security group that allowed TCP ports 22 and 7474 (for PuTTY and the Neo4J Console respectively). I also assigned it an Elastic IP so I would have a single fixed IP address to work with and not have to reconfigure PuTTY each time I restarted the instance.

Once the instance was running and I had set up my local PuTTY with the AWS keys, I was able to log in and install Neo4J.

To download Neo4J, go to their download page. The community edition download autodetects your operating system, so if you are not on a Mac/Linux computer it will offer the wrong version for AWS. Instead, click on Other Release and download the latest community edition version for Mac/Linux. Once it has downloaded, you can transfer it to your AWS EC2 instance using the PSCP command line tool that comes with PuTTY.

Alternatively, you can start the download on your computer, and then find the actual source URL of the download (in Chrome by looking in the Downloads page for example.) You can then download Neo4J directly from the URL on your EC2 instance using wget:

wget -O neo4j.tar.gz 'http://neo4j.com/artifact.php?name=neo4j-community-2.2.1-unix.tar.gz'  

Once the file is on your EC2 instance, all you need to do is untar it:

tar xvf neo4j.tar.gz  

You should also modify your .bash_profile or .bashrc to export NEO4J_HOME and add Neo4J to your PATH for convenience.

export NEO4J_HOME=/home/ec2-user/neo4j-community-2.2.1  
export PATH=$NEO4J_HOME/bin:$PATH  

You can confirm that Neo4J is installed by running the command:

neo4j status  

You can then start Neo4j by running:

neo4j start  

After Neo4J starts there should be a console available at:

<EC2 instance ip address>:7474/browser  

The console, among many other things, allows you to run Cypher queries against your Neo4J database. For the purposes of this project, you really only need to do two simple things. One, see all the nodes and node relationships:

MATCH (n)  
RETURN n;  

Two, delete all the nodes and node relationships in the database so you can run the crawler multiple times without conflicting data:

MATCH (n)  
OPTIONAL MATCH (n)-[r]-()  
DELETE n,r  

Installing Scrapy / py2neo

The next step is to install Scrapy and a library called py2neo on your EC2 instance. 

By default, AWS preinstalls pip, so you only need to install Scrapy and some dependencies (found, as usual, by trial and error and repeated use of Google on my part):

sudo yum install libffi-devel  
sudo yum install libxslt-devel  
sudo yum install gcc  
sudo yum install openssl-devel  
sudo pip install Scrapy  
sudo pip install service_identity  

Scrapy should now be installed on your EC2 instance.

After Scrapy, you can install py2neo, a compact Python library for interacting with Neo4J from within a Python script.

The command to install py2neo is:

sudo pip install py2neo  

To test py2neo you can run the Python console:

sudo python  

And run this script:

from py2neo import Graph  
from py2neo import Node  
from py2neo import Relationship  
graph = Graph()  
stephen = graph.merge_one("Person", "name", "Stephen Mouring")  
kathryn = graph.merge_one("Person", "name", "Kathryn Mouring")  
stephen_loves_kathryn = Relationship(stephen, "LOVES", kathryn)  
graph.create_unique(stephen_loves_kathryn)  

(Feel free to substitute your own name and the name of a loved one if you are feeling particularly romantic...)

You should be able to run the "select all" query in the Neo4J Console (MATCH (n) RETURN n;) and see your two nodes connected by a relationship.
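
If you prefer to check from Python rather than the console, a minimal sketch along these lines should also work (assuming py2neo 2.x, whose Graph.match method lets you filter relationships by start node and type):

from py2neo import Graph

graph = Graph()
# look up the node we just merged, then list its outgoing LOVES relationships
stephen = list(graph.find("Person", property_key="name", property_value="Stephen Mouring"))[0]
for rel in graph.match(start_node=stephen, rel_type="LOVES"):
    print(rel)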

Designing The Code

For this project, we will need to write a slightly more complex web crawler. In the prior blog post, we only needed to write a crawler that processed one kind of page.

IMDB (our target site in this project) has several different kinds of pages. Each actor has a page listing (among other things) all their films. Each film has a page listing (among other things) all its actors. To solve problems like the "6 Degrees of Separation" problem, we want to create a graph in Neo4J where actors and films are nodes, joined by ACTED_IN relationships. Two actors are then connected, through a shared film node, whenever they acted in the same film.

To do this in Scrapy, we will need to process actor pages differently than we process film pages. When we process an actor page, we will create a unique node for that actor, a unique node for each film on that actor's page, and a unique relationship between that actor and that film. We then process each film page listed on that actor page. These pages then generate requests for new actor pages which are processed as before.

It is important that the nodes and node relationships are unique. Neo4J allows you to have identical nodes and identical relationships between nodes, but that would prevent us from searching relationships between actors correctly. We will need to take steps to enforce uniqueness, discussed below.

Writing The Item

You might think we need two different Scrapy Items for this problem (one for Actor and one for Film), but in reality we can do it with a single item:

File: items.py

import scrapy

class ImdbPersonPage(scrapy.Item):  
    person = scrapy.Field()
    person_id = scrapy.Field()
    films = scrapy.Field()
    pass

We have a person field for the actor's name, a person_id for the actor, and films, a mapping of film ids to film names for all the films the actor has acted in. This allows us to create both kinds of nodes and the relationships between them using a single Scrapy Item.

NOTE: IMDB already has a unique id assigned to each actor (a seven digit number prefixed with nm) and each film (a seven digit number prefixed with tt). We will use these in our program for convenience.
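
To make the item concrete, here is a hypothetical example of what a populated ImdbPersonPage might look like for the spider's starting actor (the film ids and titles are illustrative):

page = ImdbPersonPage()
page['person'] = 'Bruce Willis'
page['person_id'] = 'nm0000246'           # IMDB's nm-prefixed actor id
page['films'] = {
    'tt0110912': 'Pulp Fiction',          # IMDB's tt-prefixed film id -> film title
    'tt0095016': 'Die Hard',
}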

Writing The Pipeline

Although it is putting the cart before the horse a little bit, I think it is easier to see the Pipeline code before we dive into the Spider code.

The ImdbPersonPagePipeline accepts ImdbPersonPage items. It then creates a unique node for the actor, a unique node for each film, and a unique relationship between the actor and each film.

File: pipelines.py

from py2neo import Graph  
from py2neo import Node  
from py2neo import Relationship

class ImdbPersonPagePipeline(object):  
    graph = Graph()

    def process_item(self, item, spider):
        print('Putting Person in Neo4J: ' + item['person_id'])
        person_node = self.graph.merge_one("Person", "id", item['person_id'])
        person_node.properties['name'] = item['person']
        person_node.push()
        for film in item['films']:
            film_node = self.graph.merge_one("Film", "id", film)
            film_node.properties['name'] = item['films'][film]
            film_node.push()
            self.graph.create_unique(Relationship(person_node, "ACTED_IN", film_node))
        return item

A few things to note... First of all, the graph.merge_one method takes a node label and a property key/value pair. If a node with that label and property already exists, it returns it; otherwise it creates it. This is what ensures our actor (and film) nodes are unique.

For convenience we also add the actor's real name to the node as a property to make it easier to view the graph in the Neo4J Console.

We then loop over the films associated with the actor and create unique nodes for those, again using the merge_one method.

Finally, we now have a reference to both the actor node and the film node, so we create an ACTED_IN relationship between them using the create_unique method. (Surprise! The create_unique method creates a relationship if it is not present, or does nothing if it is already present.)
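
If you want to convince yourself of the uniqueness behavior, a quick sanity check along these lines should do it (assuming a running Neo4J and the py2neo 2.x API used above; the id is just an example):

from py2neo import Graph

graph = Graph()
graph.merge_one("Person", "id", "nm0000246")
graph.merge_one("Person", "id", "nm0000246")   # second call matches the existing node instead of creating one
people = list(graph.find("Person", property_key="id", property_value="nm0000246"))
print(len(people))   # -> 1, not 2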

Writing The Spider

The Spider is definitely the most challenging part of the code to write. There are several things to note. First, it uses two different callbacks, one for each type of page it is processing. Second, we have to account for coming across the same page multiple times. (This is a real problem in a general-purpose web crawler, as two pages can link to each other (even indirectly) and trap your crawler in an infinite loop.) Third, there is a significant gotcha in Scrapy that I discovered during this project. You will see some lines commented out, which will be discussed shortly.

I will show you the whole class and then dive into each component:

File: spiders/imdb6degrees.py

import string

import scrapy

# from py2neo import Graph
# from py2neo import Node

from imdb.items import ImdbPersonPage

class Imdb6DegreesSpider(scrapy.Spider):  
    name = "imdb"
    start_urls = (
        'http://www.imdb.com/name/nm0000246',
    )

    people = 0
    peopleLimit = 10

    people_crawled = []

    films_crawled = []

#    graph = Graph()

    def parse(self, response):
        person_id = string.split(response.url, "/")[-2]

#        if (self.has_person_been_crawled(person_id)):
#            return

        personPage = ImdbPersonPage()
        personPage['person'] = response.xpath("//span[@itemprop='name']/text()").extract()[0]
        personPage['person_id'] = person_id
        personPage['films'] = {}
        print('Person: ' + personPage['person_id'])
        for filmElement in response.xpath("//div[@id='filmography']/div[@id='filmo-head-actor']/following-sibling::div[contains(@class, 'filmo-category-section')][1]/div[contains(@class, 'filmo-row')]//a[starts-with(@href, '/title/tt')]"):
            film_id = string.split(filmElement.xpath('@href').extract()[0], '/')[-2]

            personPage['films'][film_id] = filmElement.xpath('text()').extract()[0]

            if (not film_id in self.films_crawled):
                self.films_crawled.append(film_id)
                yield scrapy.Request('http://www.imdb.com' + filmElement.xpath('@href').extract()[0], callback=self.parse_film_page)
        yield personPage
        return

    def parse_film_page(self, response):
        for personElement in response.xpath("//table[contains(@class, 'cast_list')]//td[contains(@itemprop, 'actor')]//a"):
            person_id = string.split(personElement.xpath('@href').extract()[0], "/")[-2]
            if (person_id in self.people_crawled):
                print('Person: ' + person_id + ' ALREADY CRAWLED')
                return

            self.people_crawled.append(person_id)

            self.people += 1
            if (self.people <= self.peopleLimit):
                yield scrapy.Request('http://www.imdb.com' + personElement.xpath('@href').extract()[0], callback=self.parse)
        return

#    def has_person_been_crawled(self, person_id):
#        person = list(self.graph.find('Person', property_key='id', property_value=person_id))
#        if (len(person) > 0):
#            print('ALREADY CRAWLED: ' + person_id)
#            return True
#        return False

All right, taking it from the top. 

    name = "imdb"
    start_urls = (
        'http://www.imdb.com/name/nm0000246',
    )

    people = 0
    peopleLimit = 10

    people_crawled = []

    films_crawled = []

Both the name and start_urls variables are required by Scrapy. Our Spider starts on a single actor page. (I chose Bruce Willis... Why? Because of this.) I also added a count of the total number of actors processed (people) and, for testing purposes, a limit to the number of actors we will process (peopleLimit). I also added two arrays to keep track of which actors and films we have already processed.

Next is the actual parse method. Since we are starting on an actor page, the parse method will handle actor pages. We will then write a separate callback for the film pages.

    def parse(self, response):
        person_id = string.split(response.url, "/")[-2]

#        if (self.has_person_been_crawled(person_id)):
#            return

        personPage = ImdbPersonPage()
        personPage['person'] = response.xpath("//span[@itemprop='name']/text()").extract()[0]
        personPage['person_id'] = person_id
        personPage['films'] = {}
        print('Person: ' + personPage['person_id'])

First, the method extracts the person_id (using IMDB's id scheme) from the URL of the page. It then constructs an ImdbPersonPage item and initializes the name of the actor, the person_id, and an empty dict to store the films this actor has acted in. (Note that IMDB tags metadata about the actor with a custom HTML attribute (itemprop), making it easy to locate the actor's name on the page.)
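
As an aside, the [-2] index relies on the response URL ending with a trailing slash (which IMDB's canonical URLs appear to), so the id is the second-to-last path segment. A quick illustration:

import string

# assuming IMDB redirects to the canonical URL form with a trailing slash
url = 'http://www.imdb.com/name/nm0000246/'
print(string.split(url, '/'))        # -> ['http:', '', 'www.imdb.com', 'name', 'nm0000246', '']
print(string.split(url, '/')[-2])    # -> 'nm0000246'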

After collecting the information about the actor for the item, we then process all the films that the actor has been in.

        for filmElement in response.xpath("//div[@id='filmography']/div[@id='filmo-head-actor']/following-sibling::div[contains(@class, 'filmo-category-section')][1]/div[contains(@class, 'filmo-row')]//a[starts-with(@href, '/title/tt')]"):
            film_id = string.split(filmElement.xpath('@href').extract()[0], '/')[-2]

            personPage['films'][film_id] = filmElement.xpath('text()').extract()[0]

            if (not film_id in self.films_crawled):
                self.films_crawled.append(film_id)
                yield scrapy.Request('http://www.imdb.com' + filmElement.xpath('@href').extract()[0], callback=self.parse_film_page)
        yield personPage
        return

That is a doozy of an XPath expression! So let me explain... Each actor page can have different sections to cover different roles the actor has in a film (some actors also direct films, produce films, etc.) So first we need to find the right section (under the div#filmography there is a subsection div#filmo-head-actor). Then we narrow it down to a row for each film (div.filmo-row), and then find all URLs in those rows that start with /title/tt. (This is important because there are other links in the rows that do not point to the film but rather to other metadata about the film.)
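
To make the expression a little more concrete, here is a minimal sketch that runs the same XPath against a stripped-down HTML fragment (this is not IMDB's real markup, just the shape the expression expects):

import string
from scrapy.selector import Selector

# a simplified stand-in for an actor page's filmography section
html = """
<div id="filmography">
  <div id="filmo-head-actor">Actor (2 credits)</div>
  <div class="filmo-category-section">
    <div class="filmo-row"><a href="/title/tt0110912/">Pulp Fiction</a></div>
    <div class="filmo-row"><a href="/title/tt0095016/">Die Hard</a></div>
  </div>
</div>
"""

sel = Selector(text=html)
for filmElement in sel.xpath("//div[@id='filmography']/div[@id='filmo-head-actor']"
        "/following-sibling::div[contains(@class, 'filmo-category-section')][1]"
        "/div[contains(@class, 'filmo-row')]//a[starts-with(@href, '/title/tt')]"):
    film_id = string.split(filmElement.xpath('@href').extract()[0], '/')[-2]
    print(film_id + ': ' + filmElement.xpath('text()').extract()[0])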

For each film URL we find, we extract the id and add it to the item's films dict. We then generate a new request for the film URL to go crawl that page. Notice that I am specifying a different callback (parse_film_page) when I yield the request.

After processing all the films on the page, we commit the item to the pipeline. We also add each film id to the films_crawled array and check that array before requesting the film page. This prevents us from processing the same film twice.

NOTE: You may notice some of the code in the Spider is commented out, where I initially tried to be clever and query Neo4J for the actor and film nodes instead of storing them locally in an array. My thought was that this would scale better and be a more realistic solution. I discovered, however, that the Spider and the Pipeline operate on different threads!

Once a Spider commits an item to the Pipeline it continues crawling. Since it can take a long time to commit objects to Neo4J, the Pipeline can become backed up, creating a long delay between when an Item is committed and when the nodes appear in Neo4J. The Spider therefore cannot rely on the Pipeline to process the Item quickly enough for it to use Neo4J to detect which actors and films it has already processed.

The callback for the film page is much simpler:

    def parse_film_page(self, response):
        for personElement in response.xpath("//table[contains(@class, 'cast_list')]//td[contains(@itemprop, 'actor')]//a"):
            person_id = string.split(personElement.xpath('@href').extract()[0], "/")[-2]
            if (person_id in self.people_crawled):
                print('Person: ' + person_id + ' ALREADY CRAWLED')
                return

            self.people_crawled.append(person_id)

            self.people += 1
            if (self.people <= self.peopleLimit):
                yield scrapy.Request('http://www.imdb.com' + personElement.xpath('@href').extract()[0], callback=self.parse)
        return

The actor page actually contains all the information we need to build the actor and film nodes, and the relationships between them. All we need from the film page is more actor pages to crawl.

All parse_film_page does is find the section of the page that lists the actors who acted in the film, and yield a new request for each actor page. When it yields the request it sets the callback to be the original parse method, thus completing the loop between the two methods.

Like the actor page handler, it also adds the id of each actor to an array (people_crawled) and checks that array before crawling, to make sure it does not process the same actor page twice.

Conclusion

That is all there is! You can now crawl IMDB and watch a network of nodes and relationships populate inside of Neo4J, allowing you to run whatever queries you choose against the data!
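
For instance, to actually answer the "6 Degrees of Kevin Bacon" question for an actor in your graph, something like the following should work. This is a hedged sketch assuming py2neo 2.x's graph.cypher.execute and that both actors have been crawled; note that each "degree" is two ACTED_IN hops (actor to film to actor), so 6 degrees means up to 12 hops.

from py2neo import Graph

graph = Graph()
# shortest chain of ACTED_IN hops between two actors (6 degrees = up to 12 hops)
query = """
MATCH (a:Person {name: {name1}}), (b:Person {name: {name2}}),
      p = shortestPath((a)-[:ACTED_IN*..12]-(b))
RETURN p
"""
results = graph.cypher.execute(query, {"name1": "Bruce Willis", "name2": "Kevin Bacon"})
print(results)   # prints the shortest path found, if any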

Questions? Comments? Email me at: [email protected]