Now, we will create a new GitHub repository, which will host all our Scrapy files. Then, we will open a terminal in the location of the repository and type:
scrapy startproject TMDB_scraper
cd TMDB_scraper
c. Tweak Settings
The GitHub repository will now have a lot of files in it, but let’s direct our attention to the file called settings.py.
In this file, we will set USER_AGENT to 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'.
Sending a browser-like user agent instead of Scrapy's default helps us avoid 403 errors while scraping.
2. Scraper
The fun part! First, we will create a file named tmdb_spider.py inside the spiders directory. Then, we add the following lines:
import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    start_urls = ['https://www.themoviedb.org/tv/2316-the-office']
Note that start_urls holds our link from Part 1a. Anyone with a different favorite movie or TV show will need to replace this URL with that title's TMDB page. Now, we will write three parsing methods:
a. parse()
This method will navigate from start_urls to the Full Cast & Crew page:
    def parse(self, response):
        """
        directs to the cast page given the starting tv/movie site
        """
        yield scrapy.Request("https://www.themoviedb.org/tv/2316-the-office/cast",
                             callback=self.parse_full_credits)
Since the Full Cast & Crew page has the url <start_urls>/cast, we simply request that page using scrapy.Request. Our callback method is parse_full_credits(), which will start from the Full Cast & Crew page and lead to each actor’s own profile page (not the crew!).
Note
If one were to run this scraper with a different movie/TV show, they would also need to change the hard-coded link in parse().
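One way to avoid the hard-coded link is to derive the cast URL from whichever page the spider started on. The following is a minimal sketch rather than part of the original scraper: the helper name cast_url is hypothetical, and it assumes the Full Cast & Crew page always lives at the title's URL plus /cast, as it does for The Office.

```python
def cast_url(show_url):
    """Return the assumed Full Cast & Crew URL for a TMDB movie/TV page."""
    # Drop any trailing slash so we never produce ".../the-office//cast".
    return show_url.rstrip("/") + "/cast"


# Inside the spider, parse() could then request (sketch, not the original code):
#     yield scrapy.Request(cast_url(response.url), callback=self.parse_full_credits)

print(cast_url("https://www.themoviedb.org/tv/2316-the-office"))
```

With a helper like this, only start_urls would need editing to scrape a different title.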
b. parse_full_credits()
    def parse_full_credits(self, response):
        """
        goes through each actor in the cast page
        """
        actors_list = response.css('ol.people.credits:not(.crew) a::attr(href)').getall()
        for actor in actors_list:
            yield response.follow(actor, callback=self.parse_actor_page)
Here, actors_list will contain all the actors’ individual profile links. We iterate through this list and follow each link. The callback method is parse_actor_page(), which will start from the actor profile page and yield a dictionary containing all of the movies/TV shows this particular actor has been part of.
Note
Using the CSS pseudo-class :not(.crew), we were able to filter out the lists of crew members.
c. parse_actor_page()
We want to return a dictionary with two key-value pairs, of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}.
    def parse_actor_page(self, response):
        """
        parses through each actor and creates a dictionary containing
        movies/shows the actor has been in
        """
        actor_name = response.css("h2 a::text").get()
        for movie_or_TV_name in response.css("div.credits_list bdi::text").getall():
            yield {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}
Using the appropriate CSS selectors, we extract the actor's name and the titles of the movies/TV shows they have appeared in.
The scraper is now complete!
3. Recommendations
Now that the scraper is written, we will use it to compile our results.
We will run the following command in the terminal:
scrapy crawl tmdb_spider -o results.csv
The above will run the spider and generate a csv file containing the data. In our case, results.csv will look like this:
import pandas as pd
df = pd.read_csv("results.csv")
df
       actor             movie_or_TV_name
0      Rainn Wilson      Robodog
1      Rainn Wilson      Hitpig
2      Rainn Wilson      Empire Waist
3      Rainn Wilson      Inappropriate Behaviour
4      Rainn Wilson      Rainn Wilson and the Geography of Bliss
...    ...               ...
10783  Damani Roberts    The King of Queens
10784  Damani Roberts    Buffy the Vampire Slayer
10785  Damani Roberts    Shamitabh
10786  Tanveer K. Atwal  The Office
10787  Tanveer K. Atwal  The Matrix Revolutions

10788 rows × 2 columns
Using the above data, let’s create a visualization! We want to find the movies/TV shows that share the most actors with The Office. To do this, we can simply look at the most frequent entries in the movie_or_TV_name column. Then, we will use these counts to create a pie chart with Plotly.
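As a rough illustration of the counting step, here is a pandas-free sketch on a handful of rows. The rows are a made-up sample in the same shape as results.csv, not real scraped output, and collections.Counter stands in for whatever counting method is applied to the DataFrame. Because every scraped actor appears in The Office by construction, the show itself is dropped before ranking.

```python
from collections import Counter

# Toy rows shaped like results.csv (actor, movie_or_TV_name); illustrative only.
rows = [
    ("Actor A", "The Office"),
    ("Actor A", "Show X"),
    ("Actor B", "The Office"),
    ("Actor B", "Show X"),
    ("Actor B", "Show Y"),
    ("Actor C", "The Office"),
    ("Actor C", "Show X"),
]

# Count how often each title appears, skipping the starting show itself.
title_counts = Counter(title for _, title in rows if title != "The Office")

# The most frequent titles share the most actors with the starting show.
top_titles = title_counts.most_common(2)
print(top_titles)  # [('Show X', 3), ('Show Y', 1)]
```

In real data an actor's page may list the same title more than once, so dropping duplicate (actor, title) rows before counting may be worth considering.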
# import packages
from plotly import express as px
import numpy as np
import matplotlib.pyplot as plt