Scrape BoxOfficeMojo Data with Scrapy
For a linear-regression project on movies called the “Liamometer” (you can see it here), I scraped data using Scrapy. This article covers building a web crawler that scrapes movie titles from 2017–2020 as listed on BoxOfficeMojo and stores them in a CSV file.
Getting started with Scrapy (terms & setting up)
Scrapy is a Python framework for web scraping. Web scraping means pulling data from websites for use in your own application; we do it when the site offers no API. Scrapy is very fast, which makes it nice to use.
Terms you should know
Spiders: Scrapy uses “spiders”, which are our blueprints: classes that hold the instructions for where the spider should go on a page (where to “crawl”) and the logic for extracting data.
Selectors: Selectors are mechanisms for parsing HTML. A page is made up of p tags, anchor tags, divs, and so on, and selectors let you pick out specific elements.
Items: Items are dictionary-like objects in Scrapy, which will hold the extracted data.
Installation
pip install scrapy
mkdir myproject
cd myproject
scrapy startproject tutorial
This will create a directory structure like this:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

items.py and the files in spiders/ are where we will define our Items and Spiders, respectively.
Scraping BoxOfficeMojo
Step 1. Getting a link to each movie from 2017–2020
spiders/ is currently empty, so we will create our spider in a new file called BoxOfficeMojo_spider.py.
cd tutorial/tutorial/spiders
touch BoxOfficeMojo_spider.py
In BoxOfficeMojo_spider.py, we can start to scrape the website. We create a class MojoSpider that inherits from scrapy.Spider, the base spider class. We name the spider “mojo”; later on, when we run the spider, we will call it by this name. We define the URLs to scrape in the start_urls list; let’s say we want movies from 2017–2020.
#in BoxOfficeMojo_spider.py
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]
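Since the four URLs differ only in the year, the same list could be built with a comprehension. This is just a convenience, not something Scrapy requires:

```python
# Build the same start_urls list programmatically for 2017-2020.
# range(2017, 2021) covers 2017 through 2020 inclusive.
start_urls = [
    f"https://www.boxofficemojo.com/year/{year}/"
    for year in range(2017, 2021)
]

print(start_urls[0])
print(len(start_urls))
```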
Note: at this point, check the website’s robots.txt page. BoxOfficeMojo’s robots.txt allows this scraping, so we can keep going without worrying about getting blocked.
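You can also check a robots.txt policy programmatically with the standard library’s urllib.robotparser. This is a minimal sketch that parses a hypothetical permissive policy from a list of lines; a live check would instead call rp.set_url("https://www.boxofficemojo.com/robots.txt") followed by rp.read():

```python
from urllib.robotparser import RobotFileParser

# Hypothetical permissive rules, parsed from strings rather than
# fetched from the live site.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Allow: /"])

# Under this policy any user agent may fetch the yearly pages.
print(rp.can_fetch("*", "https://www.boxofficemojo.com/year/2017/"))
```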
Let’s say we want some basic information, like the link to each movie on the page. We can define a function called parse() to do this. It takes two arguments, self and response, which refer to the spider object and the HTTP response, respectively.
def parse(self, response):
    #how do I find each link?
The response will be the entire glob of HTML. To find a link within it, we can use what is called an XPath selector. (More on XPath syntax can be found here.) By right-clicking a link and clicking “Inspect”, we see the raw HTML for the BoxOfficeMojo page. Each link is within a tr tag, inside a table, inside a div, inside a div with the attribute id="table". Working backwards like this, from the element you want up to a uniquely identifiable ancestor, is a useful web-scraping strategy.
From there, we can design an XPath selector that finds the first 9 table rows, skipping the header row.
def parse(self, response):
    table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
    for row in table_rows:
        #how is each link defined within a row?
Each link is within the second td
tag in each row
. Our whole block of code will then look like this:
#in BoxOfficeMojo_spider.py
"""
This code scrapes the first 9 links for movies in the years 2017-2020.
"""
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            link = row.xpath('./td[2]/a/@href')
            print(link)
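To see the selector logic in isolation, here is the same two-step idea run against a made-up fragment shaped like BoxOfficeMojo’s table markup, using the standard library’s xml.etree.ElementTree (which supports a limited XPath subset, unlike Scrapy’s full selectors). The hrefs and titles below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented markup mimicking the BoxOfficeMojo table: a header row,
# then one row per movie with the link in the second td.
html = """
<html><body>
  <div id="table"><div><table>
    <tr><th>Rank</th><th>Release</th></tr>
    <tr><td>1</td><td><a href="/release/rl1/">Movie A</a></td></tr>
    <tr><td>2</td><td><a href="/release/rl2/">Movie B</a></td></tr>
  </table></div></div>
</body></html>
"""

root = ET.fromstring(html)
# Same shape as the spider's selector: rows of the table under
# id="table", skipping the header, then the href of the a tag
# inside the second td of each row.
rows = root.findall(".//*[@id='table']/div/table/tr")[1:]
links = [row.find("td[2]/a").get("href") for row in rows]
print(links)  # ['/release/rl1/', '/release/rl2/']
```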
To run this code, cd tutorial and type scrapy crawl mojo. To suppress the large chunk of log output and get cleaner results, you can type scrapy crawl mojo -L WARN.
Step 2. Getting data (movie title) from each link’s page
This section covers executing a request from within a request. For our project, let’s say we want to get some information off of each URL we just scraped. This is called “following a link”.
As per the documentation, we can use response.urljoin in parse() to build an absolute URL and issue a new request. You can think of this as a parse within a parse.
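Under the hood, this is the same join that the standard library’s urllib.parse.urljoin performs: a relative href is resolved against the page’s URL. The release path below is made up for illustration:

```python
from urllib.parse import urljoin

base = "https://www.boxofficemojo.com/year/2017/"
href = "/release/rl0000001/"  # hypothetical href scraped from a table row

# An absolute-path href replaces the path portion of the base URL.
print(urljoin(base, href))  # https://www.boxofficemojo.com/release/rl0000001/
```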
#in BoxOfficeMojo_spider.py
"""
This code parses each link, then executes a second request on that link to open it and get the movie title.
"""
import scrapy

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            #get the href attribute of the a tag in the 2nd td in the row
            link = row.xpath('./td[2]/a/@href')
            #build an absolute url via response.urljoin
            url = response.urljoin(link[0].extract())
            #parse again via another parse function called parse_link
            yield scrapy.Request(url, self.parse_link)

    def parse_link(self, response):
        movie_title = response.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0].extract()
        print(movie_title)
By running scrapy crawl mojo -L WARN
, we can see a bunch of movie titles.
Step 3. Storing data in a CSV file: Scrapy Items and Pipelines
Now that we’ve got our movie titles printing to the console, let’s say we want to generate a CSV file with that information. There are two steps: first we store the data in a Scrapy Item, and then we write that dictionary-like Item to a CSV file via a Scrapy Pipeline.
Scrapy Items
Now we will move onto items.py
.
Let’s say we want to call our Item MovieInfo. Dictionaries in Python have key-value pairs; Items in Scrapy have field-value pairs. In items.py, we define whatever keys we want as Scrapy fields using the syntax field_one = scrapy.Field().
For MovieInfo
, we just want movie titles. In items.py
, we code the below:
#in items.py
import scrapy

class MovieInfo(scrapy.Item):
    title = scrapy.Field()
Now that we’ve defined our dictionary with keys, we need to fill it with values. Back in BoxOfficeMojo_spider.py
, we make some tweaks to our code:
#in BoxOfficeMojo_spider.py
import scrapy
from ..items import MovieInfo

class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = [
        "https://www.boxofficemojo.com/year/2017/",
        "https://www.boxofficemojo.com/year/2018/",
        "https://www.boxofficemojo.com/year/2019/",
        "https://www.boxofficemojo.com/year/2020/"
    ]

    def parse(self, response):
        table_rows = response.xpath('//*[@id="table"]/div/table/tr')[1:10]
        for row in table_rows:
            #for each row in the table, use xpath selectors
            link = row.xpath('./td[2]/a/@href')
            url = response.urljoin(link[0].extract())
            yield scrapy.Request(url, self.parse_link)

    def parse_link(self, response):
        item = MovieInfo()
        movie_title = response.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0].extract()
        #the key ('title') and the Field name must match
        item['title'] = movie_title
        yield item
Scrapy Pipelines
One use of Scrapy pipelines is to write our Scrapy Items to a data store; in our case, that is just a CSV output file.
In our constructor, __init__, we create a CSV file called MovieInfo.csv and write a header row. We use process_item to process the Item for each link we visit: we build a row from the item’s data and write it to the CSV file. We print "Done" in close_spider, which runs once the spider is closed and everything has been iterated through.
#in pipelines.py
import csv

class MoviePipeline():
    def __init__(self):
        self.file = open("MovieInfo.csv", "w", newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(["Movie Titles"])

    def process_item(self, item, spider):
        row = []
        row.append(item["title"])
        self.csvwriter.writerow(row)
        return item

    def close_spider(self, spider):
        #close the file so the last rows are flushed to disk
        self.file.close()
        print("Done")
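The pipeline’s CSV logic can be exercised without running a spider by feeding it plain dicts in place of Scrapy Items. This sketch writes to an in-memory buffer instead of a file, and the class and titles are stand-ins, not part of the real project:

```python
import csv
import io

# Stand-in for the pipeline above, writing to any file-like object.
class CsvWriterSketch:
    def __init__(self, f):
        self.csvwriter = csv.writer(f)
        self.csvwriter.writerow(["Movie Titles"])

    def process_item(self, item, spider=None):
        self.csvwriter.writerow([item["title"]])
        return item

buf = io.StringIO()
pipeline = CsvWriterSketch(buf)
for title in ["Movie A", "Movie B"]:
    pipeline.process_item({"title": title})

print(buf.getvalue().splitlines())  # ['Movie Titles', 'Movie A', 'Movie B']
```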
Step 4. settings.py
The last step in our process is just uncommenting ITEM_PIPELINES and changing the default name to the name of our pipeline. (The number, here 300, sets the order in which pipelines run when there are several; by convention it falls in the 0–1000 range, and with a single pipeline any value is fine.)
#in settings.py
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tutorial.pipelines.MoviePipeline': 300,
}
Conclusion
Now you can run scrapy crawl mojo -L WARN
to scrape movie titles from BoxOfficeMojo.
Thank you and happy web-scraping!
You can find the GitHub repository used for this tutorial here.