Web Scraping with Scrapy

Categories: python, visualizations

Author: Jun Ryu

Published: January 26, 2023

What movie or TV shows share actors with your favorite movie or show?

We will try to answer the above question by building a simple “recommender system” that looks at the number of shared actors between two movies/shows.

This blog post will consist of two main parts:

  1. We will build a web scraper to scrape TMDB (The Movie Database).
  2. We will sort our scraped results and return an appropriate visualization.

1. Setup


a. Pick a Movie/TV Show

First, we will locate the starting page. For this post, let’s use The Office, a classic American sitcom.

The Office’s TMDB page is found here: https://www.themoviedb.org/tv/2316-the-office. We will use this link later.

b. Initialize Project

Now, we will create a new GitHub repository, which will host all our Scrapy files. Then, we will open a terminal in the location of the repository and type:

scrapy startproject TMDB_scraper
cd TMDB_scraper
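The startproject command generates the standard Scrapy project skeleton; it should look roughly like this (exact files may vary slightly by Scrapy version):

```
TMDB_scraper/
├── scrapy.cfg
└── TMDB_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```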

c. Tweak Settings

The GitHub repository will now have a lot of files in it, but let’s direct our attention to the file called settings.py.

In this file, we will set USER_AGENT to equal 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'.

This will prevent us from getting 403 errors while scraping.
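Concretely, the change is a one-line assignment in settings.py (this particular string is just the user agent used in this post; any common browser user agent works):

```python
# TMDB_scraper/settings.py
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
```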

2. Scraper


The fun part! First, we will create a file named tmdb_spider.py inside the spiders directory. Then, we add the following lines:

import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    
    start_urls = ['https://www.themoviedb.org/tv/2316-the-office']

Note that start_urls is defined as our link from Part 1a. If one had a different favorite movie/TV show, they will need to replace this url with their favorite movie/TV show’s TMDB page. Now, we will write three parsing methods:

a. parse()

This method will navigate from start_urls to the Full Cast & Crew page:

def parse(self, response):
    """
    directs to the cast page given the starting tv/movie site
    """
    yield scrapy.Request("https://www.themoviedb.org/tv/2316-the-office/cast", callback=self.parse_full_credits)

Since the Full Cast & Crew page has the URL <start_urls>/cast, we simply request that page using scrapy.Request. Our callback method is parse_full_credits(), which starts from the Full Cast & Crew page and follows the link to each actor’s own profile page (not the crew!).

Note

If one were to run this scraper with a different movie/TV show, they would need to change the link.
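Rather than editing the hardcoded link, the cast-page URL can also be derived from the start URL, since the cast page always lives at <start_url>/cast. A minimal sketch of the URL construction using only the standard library:

```python
from urllib.parse import urljoin

start_url = "https://www.themoviedb.org/tv/2316-the-office"
# the Full Cast & Crew page lives at <start_url>/cast
cast_url = urljoin(start_url + "/", "cast")
print(cast_url)
```

Inside the spider, the same effect is available via response.urljoin(), which would let parse() work unchanged for any start URL.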

b. parse_full_credits()

def parse_full_credits(self, response):
    """
    goes through each actor in the cast page
    """
    actors_list = response.css('ol.people.credits:not(.crew) a::attr(href)').getall()
    for actor in actors_list:
        yield response.follow(actor, callback=self.parse_actor_page)

Here, actors_list will contain all the actors’ individual profile links. We iterate through this list and follow each link. The callback method is parse_actor_page(), which will start from the actor profile page and yield a dictionary containing all of the movies/TV shows this particular actor has been part of.

Note

Using the :not(.crew) pseudo-class in the CSS selector, we were able to filter out the crew members and keep only the cast.

c. parse_actor_page()

We want to return a dictionary with two key-value pairs, of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}.

def parse_actor_page(self, response):
    """
    parses through each actor and creates a dictionary containing movies/shows the actor has been in
    """
    actor_name = response.css("h2 a::text").get()
    for movie_or_TV_name in response.css("div.credits_list bdi::text").getall():
        yield {
            "actor": actor_name,
            "movie_or_TV_name": movie_or_TV_name
        }

Using the proper HTML tags, we extract the actor names and movie/TV show titles.

The scraper is now complete!

3. Recommendations


Now that the scraper is written, we will use this to compile our results.

We will run the following command in the terminal:

scrapy crawl tmdb_spider -o results.csv

The above will run the spider and generate a CSV file containing the data. In our case, results.csv will look like this:

import pandas as pd
df = pd.read_csv("results.csv")
df
```
                  actor                         movie_or_TV_name
0          Rainn Wilson                                  Robodog
1          Rainn Wilson                                   Hitpig
2          Rainn Wilson                             Empire Waist
3          Rainn Wilson                  Inappropriate Behaviour
4          Rainn Wilson  Rainn Wilson and the Geography of Bliss
...                 ...                                      ...
10783    Damani Roberts                       The King of Queens
10784    Damani Roberts                 Buffy the Vampire Slayer
10785    Damani Roberts                                Shamitabh
10786  Tanveer K. Atwal                               The Office
10787  Tanveer K. Atwal                    The Matrix Revolutions

10788 rows × 2 columns
```

Using the above data, let’s create a visualization! We want to find movies/TV shows that share the most actors with The Office. To do this, we can simply look at the most frequent entries in the movie_or_TV_name column. Then, we will use this to create a pie chart using Plotly.

# import packages
from plotly import express as px

# skip index 0 (The Office itself) and pull the next 23 results
top_result = df["movie_or_TV_name"].value_counts()[1:24].reset_index()
top_result.columns = ["movie_or_TV_name", "shared_actors_count"] # reset column names

fig = px.pie(top_result, values='shared_actors_count', names='movie_or_TV_name', title='Top Recommendations')
fig.update_traces(textposition='inside', textinfo='label+text', text=top_result['shared_actors_count'])
fig.update_layout(showlegend=False)
fig.update_layout(height=800, width=800)
fig.show()
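As a sanity check on why the slice starts at index 1: every actor in results.csv appeared in The Office, so the show itself always tops the value counts and must be dropped before recommending. A toy illustration with made-up actors and titles:

```python
import pandas as pd

# made-up mini version of results.csv
df_toy = pd.DataFrame({
    "actor":            ["A", "A", "B", "B", "B", "C"],
    "movie_or_TV_name": ["The Office", "Show X", "The Office", "Show X", "Show Y", "The Office"],
})

# count how many scraped actors appear in each title
counts = df_toy["movie_or_TV_name"].value_counts()
# index 0 is always the starting show, so drop it
top = counts.iloc[1:].reset_index()
top.columns = ["movie_or_TV_name", "shared_actors_count"]
print(top)
```

Here every actor shares The Office by construction (count 3), while Show X (2 shared actors) and Show Y (1) are the actual recommendations.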


Awesome! The above graphic clearly encapsulates what we were looking for.

Now, what should I watch next…?