Leverage the Wayback Machine’s archived URLs for time series analysis of movie reviews

I, like many others, was fascinated by how Wonder Woman 1984’s Rotten Tomatoes score fell from “certified fresh” to “rotten” over the course of a few weeks. I wanted to scrape Rotten Tomatoes for historical Tomatometer scores to see whether other films show the same kind of drop in ratings.

Image by author.

There are two different methods to scrape these historical reviews, each with its own pros and cons. The first method is to scrape all of the reviews for a film, and then use each review’s date to calculate the Rotten Tomatoes score for that day. …
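As a rough illustration of the first method, here is a minimal sketch that turns dated reviews into a running score per day. It assumes the reviews have already been scraped into (date, is_fresh) pairs; that input shape, the function name daily_tomatometer, and the sample reviews are made up for illustration, and the score is treated as the percentage of fresh reviews published so far.

```python
from collections import Counter
from datetime import date


def daily_tomatometer(reviews):
    """Return {day: percent fresh}, counting every review published on or before that day.

    `reviews` is an iterable of (review_date, is_fresh) pairs (a hypothetical input shape).
    """
    fresh_by_day = Counter()
    total_by_day = Counter()
    for review_date, is_fresh in reviews:
        total_by_day[review_date] += 1
        if is_fresh:
            fresh_by_day[review_date] += 1

    scores = {}
    fresh_so_far = 0
    total_so_far = 0
    # Walk the days in order and keep cumulative counts of fresh and total reviews.
    for day in sorted(total_by_day):
        fresh_so_far += fresh_by_day[day]
        total_so_far += total_by_day[day]
        scores[day] = round(100 * fresh_so_far / total_so_far)
    return scores


# Made-up sample data: two fresh reviews, then two rotten ones.
sample_reviews = [
    (date(2020, 12, 15), True),
    (date(2020, 12, 15), True),
    (date(2020, 12, 16), False),
    (date(2020, 12, 18), False),
]
scores = daily_tomatometer(sample_reviews)
```

Days with no new reviews are simply absent from the result, so a plot would need to carry the previous day’s score forward.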


Looking back through the lens of Wikipedia’s most popular articles

Image by author.

People often go to Wikipedia to make sense of current events — perhaps to get a synopsis of Netflix’s latest release, or to read about the achievements of a recently deceased celebrity.

I had the idea to use Wikipedia page view data to create a “2020 rewind” of sorts — an animated timeline of current events and trends throughout the year of 2020. Here’s how I did it.

The code for this project can be accessed here.

I first defined a “get_traffic” function to get the page view data for a given day:

TOP_API_URL = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'\
              'top/{lang}.{project}/all-access/{year}/{month}/{day}'

def get_traffic(year, month…
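The function above is cut off in this preview, so the following is not the original code. It is a minimal sketch of how the same URL template could be filled in and how a response could be read, assuming the endpoint returns JSON shaped like {"items": [{"articles": [{"article": ..., "views": ..., "rank": ...}]}]}. The helper names build_top_url, parse_top_articles, and fetch_top_articles, the User-Agent string, and the sample payload are all hypothetical.

```python
import json
from urllib.request import Request, urlopen

TOP_API_URL = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/'
               'top/{lang}.{project}/all-access/{year}/{month}/{day}')


def build_top_url(year, month, day, lang='en', project='wikipedia'):
    """Fill in the URL template, zero-padding the month and day."""
    return TOP_API_URL.format(lang=lang, project=project, year=year,
                              month=f'{month:02d}', day=f'{day:02d}')


def parse_top_articles(payload, limit=10):
    """Pull (article, views) pairs out of a decoded response (assumed shape)."""
    articles = payload['items'][0]['articles']
    return [(a['article'], a['views']) for a in articles[:limit]]


def fetch_top_articles(year, month, day, limit=10):
    """Request one day of data over the network; not called in this sketch."""
    # A descriptive User-Agent is sent as a courtesy to the API operators.
    request = Request(build_top_url(year, month, day),
                      headers={'User-Agent': 'pageviews-sketch/0.1 (example)'})
    with urlopen(request) as response:
        return parse_top_articles(json.load(response), limit)


# Made-up payload in the assumed response shape, for offline experimentation.
sample_payload = {'items': [{'articles': [
    {'article': 'Main_Page', 'views': 100, 'rank': 1},
    {'article': 'Example', 'views': 50, 'rank': 2},
]}]}
top_two = parse_top_articles(sample_payload, limit=2)
```

Keeping URL building and parsing separate from the network call makes both easy to check without hitting the API.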


From “simp” to “wfh,” these 10 words took over the internet in the past year

Google Trends can tell us a lot about language: it’s often used to analyze regional colloquialisms or trending topics. For this project, I wanted to see if I could use Google Trends to identify rising slang words throughout the year.

My criteria for a “slang” word were the following:

  1. the word had to emerge in 2020, meaning that it was rarely searched for in previous years
  2. people didn’t know the definition of the word, meaning that they were searching for “____ definition”, “what is ____”, “____ dictionary”, etc.

The first point ensures that our list won’t be full of…
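The two criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the search interest for a word has already been collected into a {year: value} dictionary on Google Trends’ relative 0 to 100 scale, related queries are available as plain strings, and the cutoff rare_threshold=5 is an arbitrary choice. The function name is_rising_slang and the sample data are hypothetical.

```python
import re


def is_rising_slang(word, interest_by_year, related_queries,
                    target_year=2020, rare_threshold=5):
    """Check both criteria for one word (hypothetical helper, assumed inputs)."""
    # Criterion 1: rarely searched before the target year, but searched during it.
    earlier = [v for year, v in interest_by_year.items() if year < target_year]
    emerged = (all(v <= rare_threshold for v in earlier)
               and interest_by_year.get(target_year, 0) > rare_threshold)

    # Criterion 2: at least one related query asks what the word means.
    w = re.escape(word.lower())
    pattern = re.compile(rf'\b({w} definition|what is {w}|{w} dictionary)\b')
    wants_definition = any(pattern.search(q.lower()) for q in related_queries)

    return emerged and wants_definition


# Made-up numbers for illustration only.
sample_interest = {2017: 1, 2018: 2, 2019: 4, 2020: 62}
sample_queries = ['simp definition', 'simp meme']
result = is_rising_slang('simp', sample_interest, sample_queries)
```

The word-boundary match keeps a query such as “what is simple interest” from counting as a definition search for “simp”.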


Data Science

Using Bayesian statistics and Normal distributions to model the coronavirus

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

By now, you’ve probably read Tomas Pueyo’s articles on the coronavirus. Remember this graph?

The above graph shows the number of confirmed COVID-19 cases in the province of Hubei, China, but it also shows the estimated date that each of these cases first contracted the virus. The graph clearly communicates why it was important for Hubei to lock down on Jan. …


Using text mining techniques to analyze Trump, Biden’s speech

COVID-19 may have taken away our in-person debate watch parties, but it’s not stopping us from making a drinking game out of it! In my latest YouTube video, I used text mining techniques to develop the ultimate data-driven drinking game rules for the upcoming presidential debates. This post will walk you through exactly how I did that.

To start, I scraped the transcripts from campaign rallies, speeches, and any other events that have taken place in the last few weeks during which Biden or Trump (or both!) spoke. The full list of events I scraped can be seen in the…


Data Visualization

The barely known trick that should become industry standard

Have you ever noticed that inline plots in Jupyter notebooks tend to look… bad?

They’re either blurry or too small, and the default behavior for rescaling images in IPython is to retain the original resolution, meaning that enlarged images look extra blurry.

Here’s an example from my latest project, in which I analyzed the popular web series “Content Cop.” Notice how fuzzy the text and lines look? And I bet you couldn’t even tell that the red vertical line is dashed.

Image by the author.

What if I told you these problems could be solved with one line of code?

Simply type the…


GDPR and YouTube regulations make it harder than you might think

In my latest video, I analyzed the impact of the popular YouTube series Content Cop, and acquiring the data I needed took much longer than anticipated. In this post, I’ll explain how to scrape subscriber and view data for a YouTube channel so you don’t have to go through the same frustrations that I did.

If you’ve ever used the YouTube API, you’ll know that you can only get current subscription numbers from it — you cannot go back in time or fetch historical subscription data with the API. Now, you could build your…


A tutorial on using BeautifulSoup to scrape DeviantArt

The Sonic fandom has achieved a level of notoriety that few fandoms on the Internet enjoy. The art is known for being distorted, disturbing, and in many cases explicit. In my latest YouTube video, I scraped DeviantArt to analyze fan art and determine whether it truly lives up to this reputation. This post will walk you through exactly how I did that.

I first wanted to get a sense of how many Sonic artworks there are on DeviantArt, a fan art sharing website. No scraping required here: I simply searched for “Sonic” on DeviantArt and recorded how…

Anya Vastava

Data Science senior at UC Berkeley; I use data science and data visualization to answer questions I have about pop culture and Internet trends.
