How to scrape IMDB to determine overlap between casts and crews

Analyzing the recent trend of superhero shows, and the people working on them

Anjali Shrivastava
5 min read · May 7, 2022

I can’t be the only one who has watched a TV show and been eerily reminded of another show or movie. For me, the latest instance came while I was watching Doom Patrol on HBO Max and picking up vibes very similar to those of Netflix’s Umbrella Academy.

I wasn’t the only one to see the similarities between these two shows, and after some digging, I found that a number of producers worked on both Doom Patrol and Umbrella Academy. That goes some way toward explaining why these shows have similar aesthetics and character moments.

The data scientist in me then got an idea. What if we could use data to determine overlap between the cast/crew of a television show, and use that as a proxy to analyze similarities between the shows?

As always, the code for this project lives in my “data-science-projects” repo on GitHub. The specific folder for this project is here.

I developed this project for a YouTube video analyzing similarities between “surreal superhero shows,” which you can watch below!

Scraping the Data

The first step in this project is to procure a list of all of the people who worked on a given show. IMDB makes this very easy.

Screenshot of IMDB credits for Doom Patrol.

IMDB has a “full cast and crew” page for each of the television shows in its database (here’s the link for Doom Patrol). This list is handy because it buckets all of the people into their roles (i.e. writing, cast, makeup, etc.). Note that people can be listed multiple times if they have more than one role on a given production.

Scraping this information is relatively straightforward since the page is constructed using HTML tables. Below is the function I used to scrape this data:

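from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup


def scrapeCrew(url):
    """Scrape the cast and crew tables from an IMDB full-credits page."""
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    headers = soup.find_all('h4')  # the i-th header labels the i-th table
    dfs = []
    for i in range(len(tables)):
        header = headers[i].text.strip().replace('\n', '')
        rows = tables[i].find_all('tr')
        names = []
        roles = []
        presence = []
        if header == 'Series Cast':
            # The cast table is laid out differently: take every other row,
            # starting from the second. Link 0 is skipped; links 1-3 hold
            # the name, role and presence.
            for row in rows[1::2]:
                links = row.find_all('a')
                names.append(links[1].text)
                roles.append(links[2].text)
                presence.append(links[3].text)
            df = pd.DataFrame()
            df['name'] = names
            df['role'] = roles
            df['presence'] = presence
            dfs.append(df)
        else:
            # Crew tables: the first link is the name and the "credit" cell
            # holds the presence text.
            for row in rows:
                names.append(row.find('a').text)
                presence.append(row.find('td', class_='credit').text)
            df = pd.DataFrame()
            df['name'] = names
            df['role'] = header  # the section header doubles as the role
            df['presence'] = presence
            dfs.append(df)
    temp = pd.concat(dfs)
    # Tag every row with the show title, taken from the page's <h3>.
    temp['show'] = soup.find('h3').text.strip()
    return temp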

It’s worth noting that IMDB displays cast information differently from the information for everyone else in the crew, which is why I wrote an if condition to check whether or not the given header is “Series Cast.”
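To see the header-and-table pairing in isolation, here is a minimal offline sketch of the crew branch. The HTML string is a hypothetical, heavily simplified imitation of the page structure the function assumes (an h4 header followed by a table whose rows contain a link and a “credit” cell), and the names in it are placeholders rather than real IMDB data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed structure: one <h4> per table.
# Names and credits are placeholders, not real IMDB data.
SAMPLE_HTML = """
<h4>Directed by</h4>
<table>
  <tr><td><a href="#">Person A</a></td><td class="credit">(2 episodes)</td></tr>
  <tr><td><a href="#">Person B</a></td><td class="credit">(1 episode)</td></tr>
</table>
<h4>Writing Credits</h4>
<table>
  <tr><td><a href="#">Person A</a></td><td class="credit">(written by)</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
# Pair each header with the table that follows it, then walk the rows.
for header_tag, table in zip(soup.find_all("h4"), soup.find_all("table")):
    role = header_tag.text.strip()
    for row in table.find_all("tr"):
        records.append({
            "name": row.find("a").text.strip(),
            "role": role,
            "presence": row.find("td", class_="credit").text.strip(),
        })

for record in records:
    print(record)
```

Person A shows up twice here, once per role, which is the same behavior scrapeCrew has on a real credits page.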

After feeding that function a URL, you’ll wind up with a dataframe that might look like this:

DataFrame of Doom Patrol’s cast and crew.

Obviously, this dataframe will need to be cleaned (a simple string replacement for “\n” and any other special characters should do the trick).
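As a rough sketch of that cleanup step, the helper below collapses newlines and runs of spaces in the text columns. The function name clean_text_columns and the sample rows are made up for illustration; the column names match the ones produced by the scraper.

```python
import pandas as pd

# Placeholder rows imitating raw scraped text; not real IMDB data.
raw = pd.DataFrame({
    "name": [" Person A\n", "Person B\n"],
    "role": ["Directed by", "Writing Credits"],
    "presence": ["\n  (2 episodes, 2019)\n", "(written by)  \n"],
})


def clean_text_columns(df, columns=("name", "role", "presence")):
    """Collapse whitespace runs (including newlines) and trim both ends."""
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = (
            cleaned[col]
            .str.replace(r"\s+", " ", regex=True)  # newlines, tabs, repeated spaces
            .str.strip()
        )
    return cleaned


cleaned = clean_text_columns(raw)
print(cleaned)
```

Any other special characters you run into can be handled with further str.replace calls in the same chain.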

The other thing you might notice is that I included a “show” column, even though all rows in the dataframe have the same value (in this case, Doom Patrol). The purpose of this exercise is to determine cast overlap between multiple shows, so you will have to run the scrapeCrew function once per show and then concatenate the resulting dataframes using pandas’ concat method. The show column becomes useful once multiple shows share one dataframe.
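The concatenation step might look something like the sketch below. To keep it runnable offline, two hand-made placeholder frames stand in for the output of scrapeCrew; the people and credits in them are invented.

```python
import pandas as pd

# Placeholder frames standing in for scrapeCrew(url) output, one per show.
doom_patrol = pd.DataFrame({
    "name": ["Person A", "Person B"],
    "role": ["Directed by", "Series Cast"],
    "presence": ["(2 episodes)", "10 episodes"],
    "show": "Doom Patrol",
})
wandavision = pd.DataFrame({
    "name": ["Person A", "Person C"],
    "role": ["Produced by", "Writing Credits"],
    "presence": ["(producer)", "(written by)"],
    "show": "Wandavision",
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of
# repeating each per-show index.
all_shows = pd.concat([doom_patrol, wandavision], ignore_index=True)
print(all_shows)

# With real pages this would be something like the following, where `urls`
# is a hypothetical list of full-credits URLs:
# all_shows = pd.concat([scrapeCrew(u) for u in urls], ignore_index=True)
```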

Determining Cast Overlap

For my project, I scraped the cast and crew of Doom Patrol, Wandavision, Umbrella Academy and Legion — all of which are popular, surreal television shows based on comic book characters.

As I said earlier, some people are listed multiple times in the credits of a show if they had multiple roles. We can use a combination of pandas’ drop_duplicates and value_counts methods to get a list of people who worked on multiple shows (rather than people who merely had multiple roles).

Below is the code I used to determine people who worked on more than one show out of Doom Patrol, Wandavision, Umbrella Academy and Legion. The variable “all_shows” refers to my dataframe of all cast and crew members.

all_shows[["name", "show"]].drop_duplicates()["name"].value_counts().head(20) #drop duplicates due to multiple roles
Pandas series object of people who worked on multiple shows.
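To see why the drop_duplicates call matters, here is a small sketch on made-up data (the people and show names are placeholders). Counting names directly mixes up “many roles” with “many shows,” while counting unique name/show pairs does not.

```python
import pandas as pd

# Placeholder credits: Person A has two roles on Show X plus one on Show Y,
# Person C has two roles on Show X only.
all_shows = pd.DataFrame({
    "name": ["Person A", "Person A", "Person A", "Person B", "Person C", "Person C"],
    "role": ["Directed by", "Produced by", "Produced by",
             "Series Cast", "Writing Credits", "Produced by"],
    "show": ["Show X", "Show X", "Show Y", "Show X", "Show X", "Show X"],
})

# Naive count: number of credit rows per person.
naive = all_shows["name"].value_counts()

# Deduplicated count: number of distinct shows per person.
deduped = all_shows[["name", "show"]].drop_duplicates()["name"].value_counts()

# Keep only the people credited on more than one show.
multi_show_people = deduped[deduped > 1].index.tolist()

print(naive)
print(deduped)
print(multi_show_people)
```

In this toy data, Person C has two credit rows but only one show, so only the deduplicated count excludes them.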

We can also group our “all_shows” dataframe by the “name” column to achieve a similar effect. This way we not only get the names of all of the people who worked on multiple shows, but also retain their show and role information.

grouped = all_shows.groupby("name").agg(lambda x: pd.unique(x))

The aggregator function calls pandas’ unique method so that we keep all information for each name. As you can see in the screenshot below, the “role”, “presence” and “show” columns for Aaron Shore now hold arrays instead of individual strings. This is because Aaron Shore worked on both Doom Patrol and Wandavision.

DataFrame grouped by name.

I also created the “num_shows” column by calculating the length of the array in the “show” column. However, if you take a look at the screenshot, you will definitely see that something wacky is going on with this “num_shows” column.

And that is because the “show” column has mixed types. If a person has only worked on one show, the value’s type is string. Otherwise, it is an array. Because we used the Python function len to calculate “num_shows,” a string value makes “num_shows” represent how many characters are in the string, rather than how many shows the person worked on.

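There are ways around this. However, I was only interested in people who worked on multiple shows, so I ended up filtering out any row that had a num_shows value greater than 4 anyway. With only four shows in the dataset, any value above 4 has to be the character count of a single show title rather than a real show count.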
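One possible way around the mixed-type problem, sketched below on made-up data, is to count distinct shows with the groupby nunique aggregation instead of calling len on the aggregated value. This is an alternative to the filter described above, not the approach used in the project, and the variable names with_counts and multi_show are my own.

```python
import pandas as pd

# The pitfall in plain Python: len of a string counts characters.
print(len("Doom Patrol"))                   # 11, not 1 show
print(len(["Doom Patrol", "Wandavision"]))  # 2 shows

# Placeholder credits, not real IMDB data.
all_shows = pd.DataFrame({
    "name": ["Person A", "Person A", "Person A", "Person B"],
    "role": ["Directed by", "Produced by", "Produced by", "Series Cast"],
    "show": ["Doom Patrol", "Doom Patrol", "Wandavision", "Doom Patrol"],
})

# transform("nunique") writes each person's distinct-show count onto every
# one of their rows, so the value is always an integer.
with_counts = all_shows.assign(
    num_shows=all_shows.groupby("name")["show"].transform("nunique")
)

# Keep every credit row for people who worked on more than one show.
multi_show = with_counts[with_counts["num_shows"] > 1]
print(multi_show)
```

Because the rows are never collapsed into arrays, the role and show information stays in ordinary string columns.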

DataFrame of crew members who worked on multiple shows out of my itemset (Doom Patrol, Wandavision, Legion and Umbrella Academy).

What you’re left with is a DataFrame of people who worked on multiple shows out of the ones you’ve specified (for me, that was Doom Patrol, Wandavision, Legion and Umbrella Academy).

You can do some pretty interesting network analysis with this type of data, or even look at how the roles differ between shows for certain people (e.g. maybe they directed an entire season of one show, and only guest-starred in another). The options are pretty endless here!
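As one possible starting point for the network angle, the sketch below counts how many people each pair of shows has in common, which gives a weighted edge list you could hand to a graph library. The data is made up, and the names shows_by_person and edge_weights are hypothetical.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Placeholder credits, not real IMDB data.
all_shows = pd.DataFrame({
    "name": ["Person A", "Person A", "Person A", "Person B",
             "Person C", "Person C", "Person C"],
    "show": ["Show X", "Show Y", "Show X", "Show X",
             "Show X", "Show Y", "Show Z"],
})

# Collect the set of shows each person is credited on.
pairs = all_shows[["name", "show"]].drop_duplicates()
shows_by_person = {}
for name, show in zip(pairs["name"], pairs["show"]):
    shows_by_person.setdefault(name, set()).add(show)

# Every person adds 1 to the edge between each pair of their shows.
edge_weights = Counter()
for shows in shows_by_person.values():
    for show_a, show_b in combinations(sorted(shows), 2):
        edge_weights[(show_a, show_b)] += 1

for (show_a, show_b), weight in edge_weights.most_common():
    print(f"{show_a} - {show_b}: {weight} shared people")
```

Sorting each person’s shows before pairing them keeps every edge key in a consistent order, so (Show X, Show Y) and (Show Y, Show X) are never counted separately.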
