Scrape BoxOfficeMojo Data with Scrapy
For a linear-regression project on movies called the “Liamometer” (you can see it here), I scraped data using Scrapy. This article covers building a web crawler that scrapes movie titles from 2017–2020 as listed on BoxOfficeMojo and stores them in a CSV file.
Getting started with Scrapy (terms & setting up)
Scrapy is a Python framework for web scraping. Web scraping means pulling data from websites for use in your own application; we do it when the site offers no API. Scrapy is very fast, which makes it nice to use.
Terms you should know
Spiders: Scrapy uses “spiders”, which are our blueprints: classes that hold the instructions for where the spider should go on a page (where to “crawl”) and the logic for extracting data.
Selectors: Selectors are mechanisms for parsing HTML. A page is made up of p tags, anchor tags, divs, and so on, and selectors let you pick out specific elements.
Items: Items are dictionary-like objects in Scrapy, which will hold the extracted data.
Installation
pip install scrapy
mkdir myproject
cd myproject
scrapy startproject tutorial
This will create a directory structure like this:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

items.py and the files in spiders/ are where we will define our Items and Spiders, respectively.
Scraping BoxOfficeMojo
Step 1. Getting a link to each movie from 2017–2020
spiders/ is currently empty, so we will create our spider in a new file called BoxOfficeMojo_spider.py.
cd tutorial/tutorial/spiders
touch BoxOfficeMojo_spider.py
In BoxOfficeMojo_spider.py, we can start to scrape the website. We create a class MojoSpider that inherits from scrapy.Spider, the base spider class. We name the spider “mojo”; later on, when we run the spider, we will call it by this name. We define the URLs to scrape in the start_urls list; let’s say we want movies from 2017–2020.
#in BoxOfficeMojo_spider.py
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]
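Since the four URLs differ only in the year, the same list could be built with a comprehension. This is just a convenience, not something Scrapy requires:

```python
# Build the same start_urls list programmatically for 2017-2020.
# range(2017, 2021) covers 2017 through 2020 inclusive.
start_urls = [
    f"https://www.boxofficemojo.com/year/{year}/"
    for year in range(2017, 2021)
]

print(start_urls[0])
print(len(start_urls))
```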
Note: at this point, check the website’s robots.txt page. BoxOfficeMojo’s robots.txt allows this scraping, so we can keep going without worrying about getting blocked.
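You can also check a robots.txt policy programmatically with the standard library’s urllib.robotparser. This is a minimal sketch that parses a hypothetical permissive policy from a list of lines; a live check would instead call rp.set_url("https://www.boxofficemojo.com/robots.txt") followed by rp.read():

```python
from urllib.robotparser import RobotFileParser

# Hypothetical permissive rules, parsed from strings rather than
# fetched from the live site.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Allow: /"])

# Under this policy any user agent may fetch the yearly pages.
print(rp.can_fetch("*", "https://www.boxofficemojo.com/year/2017/"))
```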
Let’s say we want some basic information, like the link to each movie on the page. We can define a function called parse() to do this. It takes two arguments, self and response, which refer to the spider object and the HTTP response, respectively.
def parse(self, response):
    #how do I find each link?
The response will be the entire glob of HTML. To find a link within it, we can use what is called an XPath selector. (More on XPath syntax can be found here.) By right-clicking a link and clicking “Inspect”, we see the raw HTML for the BoxOfficeMojo page. Each link is within a tr tag, inside a table, inside a div, inside a div with the attribute id="table". Working backwards like this, from the element you want up to a uniquely identifiable ancestor, is a useful web-scraping strategy.
From there, we can design an XPath selector that finds the first 9 table rows, skipping the header row.
def parse(self, response):
    table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
    for row in table_rows:
        #how is each link defined within a row?
Each link is within the second td
tag in each row
. Our whole block of code will then look like this:
#in BoxOfficeMojo_spider.py
"""
This code scrapes the first 9 links for movies in the years 2017-2020.
"""
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            link = row.xpath('./td[2]/a/@href')
            print(link)
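To see the selector logic in isolation, here is the same two-step idea run against a made-up fragment shaped like BoxOfficeMojo’s table markup, using the standard library’s xml.etree.ElementTree (which supports a limited XPath subset, unlike Scrapy’s full selectors). The hrefs and titles below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented markup mimicking the BoxOfficeMojo table: a header row,
# then one row per movie with the link in the second td.
html = """
<html><body>
  <div id="table"><div><table>
    <tr><th>Rank</th><th>Release</th></tr>
    <tr><td>1</td><td><a href="/release/rl1/">Movie A</a></td></tr>
    <tr><td>2</td><td><a href="/release/rl2/">Movie B</a></td></tr>
  </table></div></div>
</body></html>
"""

root = ET.fromstring(html)
# Same shape as the spider's selector: rows of the table under
# id="table", skipping the header, then the href of the a tag
# inside the second td of each row.
rows = root.findall(".//*[@id='table']/div/table/tr")[1:]
links = [row.find("td[2]/a").get("href") for row in rows]
print(links)  # ['/release/rl1/', '/release/rl2/']
```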
To run this code, cd tutorial and type scrapy crawl mojo. To suppress the large chunk of log output and get cleaner results, you can type scrapy crawl mojo -L WARN.
Step 2. Getting data (movie title) from each link’s page
This section covers executing a request from within a request. For our project, let’s say we want to get some information off of each URL we just scraped. This is called “following a link”.
As per the documentation, we can use response.urljoin in parse() to build an absolute URL and issue a new request. You can think of this as a parse within a parse.
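Under the hood, this is the same join that the standard library’s urllib.parse.urljoin performs: a relative href is resolved against the page’s URL. The release path below is made up for illustration:

```python
from urllib.parse import urljoin

base = "https://www.boxofficemojo.com/year/2017/"
href = "/release/rl0000001/"  # hypothetical href scraped from a table row

# An absolute-path href replaces the path portion of the base URL.
print(urljoin(base, href))  # https://www.boxofficemojo.com/release/rl0000001/
```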
#in BoxOfficeMojo_spider.py
"""
This code parses each link, then executes a second request on that link to open it and get the movie title.
"""
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            #get the href attribute of the a tag in the 2nd td in the row
            link = row.xpath('./td[2]/a/@href')
            #build an absolute url via response.urljoin
            url = response.urljoin(link[0].extract())
            #parse again via another parse function called parse_link
            yield scrapy.Request(url, self.parse_link)

    def parse_link(self, response):
        movie_title = response.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0].extract()
        print(movie_title)
By running scrapy crawl mojo -L WARN
, we can see a bunch of movie titles.
Step 3. Storing data in a CSV file: Scrapy Items and Pipelines
Now that we’ve got our movie titles printing to the console, let’s say we want to generate a CSV file with that information. There are two steps: first we store the data in a Scrapy Item, and then we write that dictionary-like Item to a CSV file via a Scrapy Pipeline.
Scrapy Items
Now we will move onto items.py
.
Let’s say we want to call our Item MovieInfo. Dictionaries in Python have key-value pairs; Items in Scrapy have field-value pairs. In items.py, we define whatever keys we want as Scrapy fields using the syntax field_one = scrapy.Field().
For MovieInfo
, we just want movie titles. In items.py
, we code the below:
#in items.py
import scrapy

class MovieInfo(scrapy.Item):
    title = scrapy.Field()
Now that we’ve defined our dictionary with keys, we need to fill it with values. Back in BoxOfficeMojo_spider.py
, we make some tweaks to our code:
#in BoxOfficeMojo_spider.py
import scrapy
from ..items import MovieInfo

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            #for each row in the table, use xpath selectors
            link = row.xpath('./td[2]/a/@href')
            url = response.urljoin(link[0].extract())
            yield scrapy.Request(url, self.parse_link)

    def parse_link(self, response):
        item = MovieInfo()
        movie_title = response.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0].extract()
        #the key ('title') and the Field name must match
        item['title'] = movie_title
        yield item
Scrapy Pipelines
One use of Scrapy pipelines is to write our Scrapy Items to a data store; in our case, that is just a CSV output file.
In our constructor, __init__, we create a CSV file called MovieInfo.csv and write a header row. We use process_item to process the Item for each link we visit: we build a row from the item’s data and write it to the CSV file. We print "Done" in close_spider, which runs once the spider is closed and everything has been iterated through.
#in pipelines.py
import csv

class MoviePipeline():
    def __init__(self):
        self.file = open("MovieInfo.csv", "w", newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(["Movie Titles"])

    def process_item(self, item, spider):
        row = []
        row.append(item["title"])
        self.csvwriter.writerow(row)
        return item

    def close_spider(self, spider):
        #close the file so the last rows are flushed to disk
        self.file.close()
        print("Done")
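The pipeline’s CSV logic can be exercised without running a spider by feeding it plain dicts in place of Scrapy Items. This sketch writes to an in-memory buffer instead of a file, and the class and titles are stand-ins, not part of the real project:

```python
import csv
import io

# Stand-in for the pipeline above, writing to any file-like object.
class CsvWriterSketch:
    def __init__(self, f):
        self.csvwriter = csv.writer(f)
        self.csvwriter.writerow(["Movie Titles"])

    def process_item(self, item, spider=None):
        self.csvwriter.writerow([item["title"]])
        return item

buf = io.StringIO()
pipeline = CsvWriterSketch(buf)
for title in ["Movie A", "Movie B"]:
    pipeline.process_item({"title": title})

print(buf.getvalue().splitlines())  # ['Movie Titles', 'Movie A', 'Movie B']
```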
Step 4. settings.py
The last step in our process is just uncommenting ITEM_PIPELINES and changing the default name to the name of our pipeline. (The number, here 300, sets the order in which pipelines run when there are several; by convention it falls in the 0–1000 range, and with a single pipeline any value is fine.)
#in settings.py
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tutorial.pipelines.MoviePipeline': 300,
}
Conclusion
Now you can run scrapy crawl mojo -L WARN
to scrape movie titles from BoxOfficeMojo.
Thank you and happy web-scraping!
You can find the GitHub repository used for this tutorial here.