How to scrape Rotten Tomatoes for historic Tomatometer scores

Leverage the Wayback Machine’s archived URLs for time series analysis of movie reviews

Anjali Shrivastava
8 min read · Feb 18, 2021

I, like many others, was fascinated by how Wonder Woman 1984’s Rotten Tomatoes score fell from “certified fresh” to “rotten” over the course of a few weeks. I wanted to scrape Rotten Tomatoes for historic Tomatometer scores to see if this kind of drop in ratings can be seen for other films.

Image by author.

There are two different methods for scraping these historic scores, and there are pros and cons to each. The first method is to scrape all of the reviews for a film, and then use each review's date to calculate the Rotten Tomatoes score for that day. The second method is to use the Wayback Machine to scrape historic scores from archived versions of the page.

The first method is certainly faster, but the second is more accurate (although it might leave missing data for a few days). I ended up going with the second method for my analysis, but I will walk you through both in this tutorial.

I scraped this data for a broader analysis of Rotten Tomatoes reviews. My video below summarizes my findings!

Did 2020 break Rotten Tomatoes?

The complete code for this project can be seen here.

Method #1: Scraping Rotten Tomatoes reviews

The code for the first method can be seen in my review scraper Jupyter Notebook.

To reiterate, this first method scrapes all critic reviews for a given film on Rotten Tomatoes, and uses those reviews to calculate the Tomatometer score for each day. This method can be inaccurate for a couple of reasons:

  1. Rotten Tomatoes typically waits 1–2 days after reviews start pouring in before publishing the Tomatometer score as a number they can “certify.” This method has no way of telling when that certification occurs, so the calculated score may be inaccurate in the first few days.
  2. This method requires scraping reviews from multiple pages, and it's possible that it misses reviews or pages, which would affect the score calculation. For the most part the scraper is comprehensive, but I have seen it miss a few reviews on a couple of occasions, which is why I ended up going with method #2.

Still, I think this code is useful for scraping the *text* of reviews, especially if you’re interested in doing some type of sentiment analysis on film reviews.

I first defined regex patterns for each of the elements I was interested in scraping (e.g. page numbers, reviews, ratings) and then created a make_soup function that turns a URL into a BeautifulSoup object.

import re
import requests
from bs4 import BeautifulSoup
from requests import TooManyRedirects

# regex patterns
page_pat = re.compile(r'Page 1 of \d+')
review_pat = re.compile(r'<div class=\"the_review\" data-qa=\"review-text\">[;a-zA-Z\s,-.\'\/\?\[\]\":\']*</div>')
rating_pat = re.compile(r'Original Score:\s([A-Z](\+|-)?|\d(.\d)?(\/\d)?)')
fresh_pat = re.compile(r'small\s(fresh|rotten)\"')
critic_pat = re.compile(r'\/\"\>([A-Z][a-zA-Z]+\s[A-Z][a-zA-Z\-]+)|([A-Z][a-zA-Z.]+\s[A-Z].?\s[A-Z][a-zA-Z]+)|([A-Z][a-zA-Z]+\s[A-Z]+\'[A-Z][a-zA-Z]+)')
publisher_pat = re.compile(r'\"subtle\">[a-zA-Z\s,.\(\)\'\-&;!\/\d+]+</em>')
date_pat = re.compile(r'[a-zA-Z]+\s\d+,\s\d+')

def make_soup(url):
    # fetch a URL and return a BeautifulSoup object (or '' if the request redirects endlessly)
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
    except TooManyRedirects:
        soup = ''
    return soup

Using these regex patterns, I then defined two helper functions that take in a BeautifulSoup object: one to get the number of pages of reviews, and another to get the reviews from each page.

def get_num_pages(soup):
    match = re.findall(page_pat, str(list(soup)))
    if len(match) > 0:
        match = match[0]
        match = match.split(' of ')[-1]
        return match
    else:
        return None

def get_critic_reviews_from_page(soup):
    reviews = list()
    rating = list()
    fresh = list()
    critic = list()
    top_critic = list()
    publisher = list()
    date = list()

    soup = str(soup)
    review_soup = soup.split('="review_table')[1].split('row review_table_row')
    review_soup.pop(0)

    for review in review_soup:
        # extract review text
        match = re.findall(review_pat, str(review))
        if len(match) > 0:
            m = match[0]
            for iden in ['<div class="the_review" data-qa="review-text"> ', '</div>']:
                m = m.replace(iden, '')
            reviews.append(m.strip('"'))
        else:
            reviews.append(None)  # keep the lists aligned when no review text is found
        # extract rating
        match = re.findall(rating_pat, str(review))
        if len(match) > 0:
            m = match[0][0]
            if '/1' in m:
                sp_m = m.split('/')
                if sp_m[-1] == '1':
                    sp_m[-1] = '10'
                m = '/'.join(sp_m)
            rating.append(m)
        else:
            rating.append(None)
        # extract fresh indicator
        match = re.findall(fresh_pat, str(review))
        if len(match) > 0:
            fresh.append(match[0])
        else:
            fresh.append(None)
        # extract critic
        match = re.findall(critic_pat, str(review))
        if len(match) > 0:
            critic.append(''.join(match[0]))
        else:
            critic.append(None)
        # check if top critic
        if '> Top Critic<' in str(review):
            top_critic.append(1)
        else:
            top_critic.append(0)
        # extract publisher
        match = re.findall(publisher_pat, str(review))
        if len(match) > 0:
            m = match[0]
            m = m.replace('"subtle">', '')
            m = m.replace('</em>', '')
            publisher.append(m)
        else:
            publisher.append(None)
        # extract date
        match = re.findall(date_pat, str(review))
        if len(match) > 0:
            date.append(match[0].strip('"'))
        else:
            date.append(None)

    return [reviews, rating, fresh, critic, top_critic, publisher, date]

Once I had those functions defined, I could get to scraping. I defined a get_critic_reviews function that calls those helper functions and takes in a Rotten Tomatoes URL (e.g. https://www.rottentomatoes.com/m/wonder_woman_1984/).

def get_critic_reviews(page):
    info = [[], [], [], [], [], [], []]
    soup = make_soup(page + "reviews")
    pages = get_num_pages(soup)
    if pages is not None:
        for page_num in range(1, int(pages) + 1):
            soup = make_soup(page + "reviews?page=" + str(page_num) + "&sort=")
            c_info = get_critic_reviews_from_page(soup)

            # accumulate review info across pages
            for i in range(len(c_info)):
                info[i] = info[i] + c_info[i]

        c_info = dict()
        keys = ['reviews', 'rating', 'fresh', 'critic', 'top_critic', 'publisher', 'date']
        for k in range(len(keys)):
            c_info[keys[k]] = info[k]
    else:
        c_info = None
    return c_info

This function should return a dictionary, which you can then convert into a dataframe using pd.DataFrame.from_dict(). Your dataframe should look something like this:

Image by author.
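In code, the conversion might look like this (ww84 is just a hypothetical variable name for the dictionary returned by get_critic_reviews):

import pandas as pd

# hypothetical usage: scrape one film and build a DataFrame from the returned dictionary
ww84 = get_critic_reviews("https://www.rottentomatoes.com/m/wonder_woman_1984/")
ww84_df = pd.DataFrame.from_dict(ww84)
ww84_df['Film'] = "Wonder Woman 1984"  # label the film if you plan to combine several into one frame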

Now obviously this dataframe needs to be cleaned a bit (especially the reviews column). I wasn't interested in doing text analysis, but if I were, I would simply add ".text" to the div element in the scraper.
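As a rough sketch of that cleanup (the exact replacements you need will depend on what residue your scrape leaves behind):

# collapse whitespace and strip stray quotes from the review text
ww84_df['reviews'] = (
    ww84_df['reviews']
    .astype(str)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip(' "')
)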

What I am interested in is converting the "fresh" column to integers {1, 0} so I can calculate the cumulative Rotten Tomatoes score for each day. And this is pretty easy to do in pandas.

all_films['score'] = all_films['fresh'].apply(lambda x: 1 if x == 'fresh' else 0)

We can now use this "score" column to finally calculate the Rotten Tomatoes score! I first grouped the reviews by date and applied two aggregator functions (sum and count) to the score column. This way, we can take the cumulative sum of scores and divide it by the cumulative count of scores.

Essentially, we are calculating what percent of total critics rated the film fresh on each day, which is how the Tomatometer score is calculated.

df = all_films[all_films['Film'] == "Wonder Woman 1984"]
grouped_1 = df[['date', 'score']].groupby('date').agg([sum, 'count'])
grouped_1.columns = grouped_1.columns.droplevel(0)
grouped_1.cumsum()['sum']/grouped_1.cumsum()['count']

And presto! You should get a series object like this:

Image by author.

Method #2: Scraping Rotten Tomatoes scores from the Wayback Machine

The code for the second method can be seen in my score scraper Jupyter Notebook.

The second method may seem more obvious, but it is much more complicated. The Rotten Tomatoes website has changed a lot in the past 3 years, and because of that I had to write multiple scrapers to account for the different versions of the website.

Image by author.

I’ve only verified that these scrapers work for archived websites from 2018 onwards. If you’re going with this method, make sure you’re using the correct scraper for the correct year!

These scrapers make use of the waybackpy Python package, which lets you access archived versions of a URL programmatically. I wrote functions to scrape the number of critic and audience reviews, as well as the critic and audience scores, from each URL.
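For a sense of how waybackpy is used here (a minimal sketch matching the API version used in this article; the user agent string is just an example):

import waybackpy

user_agent = "Mozilla/5.0"  # any descriptive user agent string
wayback = waybackpy.Url("https://www.rottentomatoes.com/m/wonder_woman_1984/", user_agent)

# URL of the archived snapshot closest to a given date
snapshot_url = wayback.near(year=2021, month=1, day=1).archive_url
print(snapshot_url)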

The good news is that the code for this method is much shorter! Here's the function that scrapes the number of critic and audience reviews:

# get number of critic and audience reviews
def getNumReviews(soup):
    critic = soup.find_all("small", class_="mop-ratings-wrap__text--small")
    audience = soup.find_all("strong", class_="mop-ratings-wrap__text--small")
    if critic:
        critic = critic[0].text.replace("\n", '').strip()
    else:
        # newer (2021) layout exposes the counts through scoreboard links
        critic = soup.find_all("a", class_='scoreboard__link scoreboard__link--tomatometer')
        critic = critic[0].text
    if len(audience) > 1:
        audience = audience[1].text.replace("Verified Ratings: ", '')
    else:
        audience = soup.find_all("a", class_='scoreboard__link scoreboard__link--audience')
        if audience:
            audience = audience[0].text
        else:
            audience = "Coming"
    return [critic, audience]

And here are the functions to scrape the critic and audience scores:

# 2018 scraper
def get_score(soup):
    critic = soup.find('span', {'class': "meter-value superPageFontColor"})
    audience = soup.find('div', {'class': "audience-score meter"}).find('span', {'class': "superPageFontColor"})
    return [critic.text, audience.text]

# 2019-21 scraper
def getScore(soup):
    temp = soup.find_all('div', class_='mop-ratings-wrap__half')
    try:
        critic = temp[0].text.strip().replace('\n', '').split(' ')[0]  # 2019-20 layout
        if len(temp) > 1:
            audience = temp[1].text.strip().replace('\n', '').split(' ')[0]
        else:
            audience = "Coming"
    except IndexError:
        # 2021 layout: scores live on a <score-board> custom element's attributes
        scores = soup.find("score-board")
        return [scores["tomatometerscore"], scores["audiencescore"]]
    return [critic, audience]

One slightly annoying thing is that the format of the scores varies slightly across different versions of the Rotten Tomatoes website. Sometimes there will be a ‘%’ attached to the string, sometimes it will have ‘score’ or other related words included with it. Make sure to look out for those edge cases when cleaning the data!
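One way to normalize those strings is to just pull out the first number (a rough sketch; clean_score is a hypothetical helper, and the exact junk you see depends on the snapshot year):

import re

def clean_score(raw):
    # extract the first integer from strings like "84%", "Tomatometer 84%", or " 84 "
    match = re.search(r'\d+', str(raw))
    return int(match.group()) if match else None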

Anyhow, after defining the scraping helper functions, we can scrape the data! I wrote a build function that takes in a film key (which maps to its URL) and a start and end date to scrape between.

def build(key, date1, date2):
    dates = pd.date_range(date1, date2).tolist()
    wayback = waybackpy.Url(movie_urls[key], user_agent)
    scores = []
    for date in dates:
        try:
            archived = wayback.near(year=date.year, month=date.month, day=date.day).archive_url
        except Exception:
            print(date)  # no archived snapshot for this date
            continue
        page = requests.get(archived).text
        soup = BeautifulSoup(page, 'lxml')
        scores.append(getScore(soup) + getNumReviews(soup) + [date, key])
    return scores
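
Here is how you might call it, assuming movie_urls is a dictionary mapping film names to their Rotten Tomatoes URLs and user_agent is defined as above (neither is shown in the original snippet):

user_agent = "Mozilla/5.0"  # example user agent
movie_urls = {
    "Wonder Woman 1984": "https://www.rottentomatoes.com/m/wonder_woman_1984/",
}

# example window around the film's release
ww84_scores = build("Wonder Woman 1984", "2020-12-15", "2021-01-31")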

The output of this function will be an array of arrays, like in the below image.

Image by author.

Again, we can easily convert this output into a DataFrame for our analysis!

df = pd.DataFrame(ww84_scores, columns=["critic", "audience", "criticNum", "audienceNum", "date", "film"])

The pitfall of this method is that if the Wayback Machine has not archived the URL for a given day, you will not be able to scrape data for that day. I tried this method for upwards of 20 films and it generally wasn't a problem: there might be 1–5 days of missing data depending on your date range. But if you're looking to scrape scores for a more obscure film that isn't likely to be archived, you might have to go with method #1.
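If you do want an unbroken daily series despite those gaps, one option (a sketch, not part of the original pipeline, and assuming one film per frame) is to reindex on the full date range and carry the last archived score forward:

daily = df.set_index('date').sort_index()
full_range = pd.date_range(daily.index.min(), daily.index.max())
daily = daily.reindex(full_range).ffill()  # forward-fill days the Wayback Machine skipped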

And finally, check out my video if you’re interested in hearing what I found from analyzing this data!
