IMDB Dataset

Introduction

A few days back, the NYC Data Science Academy scraped the IMDB website, acquiring information on more than 5000 movies. Being a movie buff myself, I could not wait to dive into it.

This data set was posted on Kaggle. The entire process of data acquisition and cleaning can be found here.

The data set

We import our dataset into the system. Before diving in, we need to understand the structure of this data. Are there any variables that need changing for efficient analysis?

Let's see.

What are the variables? What are the dimensions of the data?

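library(dplyr); library(plotly); library(formattable)  # packages used by the code in this post

movie <- read.csv('movie_metadata.csv', header = TRUE, stringsAsFactors = FALSE)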

str(movie)

## ‘data.frame’:    5043 obs. of  28 variables:

##  $ color                    : chr  “Color” “Color” “Color” “Color” …

##  $ director_name            : chr  “James Cameron” “Gore Verbinski” “Sam Mendes” “Christopher Nolan” …

##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 …

##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 …

##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 …

##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 …

##  $ actor_2_name             : chr  “Joel David Moore” “Orlando Bloom” “Rory Kinnear” “Christian Bale” …

##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 …

##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 …

##  $ genres                   : chr  “Action|Adventure|Fantasy|Sci-Fi” “Action|Adventure|Fantasy” “Action|Adventure|Thriller” “Action|Thriller” …

##  $ actor_1_name             : chr  “CCH Pounder” “Johnny Depp” “Christoph Waltz” “Tom Hardy” …

##  $ movie_title              : chr  “Avatar ” “Pirates of the Caribbean: At World’s End ” “Spectre ” “The Dark Knight Rises ” …

##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 …

##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 …

##  $ actor_3_name             : chr  “Wes Studi” “Jack Davenport” “Stephanie Sigman” “Joseph Gordon-Levitt” …

##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 …

##  $ plot_keywords            : chr  “avatar|future|marine|native|paraplegic” “goddess|marriage ceremony|marriage proposal|pirate|singapore” “bomb|espionage|sequel|spy|terrorist” “deception|imprisonment|lawlessness|police officer|terrorist plot” …

##  $ movie_imdb_link          : chr  “http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1” “http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1” “http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1” “http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1” …

##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 …

##  $ language                 : chr  “English” “English” “English” “English” …

##  $ country                  : chr  “USA” “USA” “UK” “USA” …

##  $ content_rating           : chr  “PG-13” “PG-13” “PG-13” “PG-13” …

##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA …

##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 …

##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 …

##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 …

##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 …

##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 …

dim(movie)

## [1] 5043   28

We see that the data has 5043 rows and 28 columns. Let's see what we can understand from this data. For this, we ask a few questions.
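Two things stand out in the str() output: several columns contain NA values, and the movie titles carry a trailing space. Here is a minimal sketch of how one might check for both, using a small made-up data frame (`toy`, with hypothetical values) in place of `movie`. It assumes the trailing character is an ordinary space; if it turns out to be a non-breaking space, `trimws()` would need a custom `whitespace` argument.

```r
# Toy stand-in for `movie`, mirroring a few columns of the str() output
# (hypothetical values, not read from the real file).
toy <- data.frame(
  movie_title = c("Avatar ", "Spectre ", "Unknown "),
  duration    = c(178L, 148L, NA),
  gross       = c(760505847, 200074175, NA),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Strip leading/trailing whitespace from the titles
# (assumption: the trailing character is an ordinary space)
toy$movie_title <- trimws(toy$movie_title)
print(toy$movie_title)
```

The same two calls could be run on `movie` itself to decide which columns need cleaning before plotting.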

How many movies were reviewed each year?

Let's see the trend. This gives us an idea of how the number of movies has increased or decreased over time.

# Count movies per release year, dropping rows with a missing year
temp <- movie %>% select(movie_title, title_year)
temp <- temp %>% group_by(title_year) %>% summarise(n = n())
temp <- na.omit(temp)

# Note: bare column names follow the plotly 3.x API; plotly 4.0+ expects formulas, e.g. x = ~title_year
p <- plot_ly(temp, x = title_year, y = n, name = "Number of Movies by Year")
p %>%
  add_trace(y = fitted(loess(n ~ as.numeric(title_year))), x = title_year) %>%
  layout(title = "Year and Movies",
         showlegend = FALSE) %>%
  dplyr::filter(n == max(n)) %>%
  layout(annotations = list(x = title_year, y = n, text = "Peak", showarrow = T))


We see the following trend.

The highest number of movies was in 2009, with 260 movies. The trend seems to be exponential. The number of movies reviewed for 2016 seems to be 160.
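To make the smoothed trace less of a black box, here is a minimal base R sketch of what `fitted(loess(...))` computes. The yearly counts are made up for illustration; only the 2009 and 2016 values follow the numbers quoted in the text.

```r
# Hypothetical yearly counts; only 2009 (260) and 2016 (160) follow the text
year <- 2000:2016
n <- c(150, 160, 175, 170, 190, 200, 215, 205, 225,
       260, 230, 225, 220, 235, 250, 225, 160)

# loess fits a local regression around each year (default span = 0.75);
# fitted() returns one smoothed value per observation
fit <- loess(n ~ year)
smoothed <- fitted(fit)

# The annotated "Peak" is simply the year with the largest raw count
peak_year <- year[which.max(n)]

print(round(smoothed))
print(peak_year)
```

The smoothed values are what the second trace draws, while the peak annotation comes from the raw counts rather than from the fit.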

What is the average IMDB rating by year?

What is the average quality of movies released each year?

temp <- movie %>% select(imdb_score, title_year)
temp <- temp %>% group_by(title_year) %>% summarise(score = mean(imdb_score))
temp <- na.omit(temp)

p <- plot_ly(temp, x = title_year, y = score, name = "Avg Score by Year")
p %>%
  add_trace(y = fitted(loess(score ~ as.numeric(title_year))), x = title_year) %>%
  layout(title = "Year and Score",
         showlegend = FALSE) %>%
  dplyr::filter(score == max(score)) %>%
  layout(annotations = list(x = title_year, y = score, text = "Peak", showarrow = T))

The overall trend of the average score seems to be declining. The highest average score belongs to 1957 and the lowest to 1920. As the years pass, the average keeps decreasing. This is a question of quality vs. quantity. Average scores seem to be constant from 1925 to 1929.

How does the average score change for each type of content rating?

Are content ratings tied to IMDB scores?

temp <- movie %>% select(content_rating, imdb_score)
temp <- temp %>% group_by(content_rating) %>% summarise(score = mean(imdb_score))

p <- plot_ly(
  x = temp$content_rating,
  y = temp$score,
  name = "Avg score by Rating",
  type = "bar")
p


We see that the highest average score belongs to the TV-MA category. This is not enough. How do these scores vary within each category?

temp <- movie %>% select(imdb_score, content_rating)
temp <- na.omit(temp)

plot_ly(temp, x = imdb_score, color = content_rating, type = "box")


We see that the interquartile range of each distribution lies above a score of 5. The highest imdb_scores tend to be of the TV-MA content rating type. The R rated category has the largest number of outliers, ranging from a score of 1.9 to 4.2.
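For readers wondering how a box plot decides what counts as an outlier, here is a minimal sketch using base R's `boxplot.stats()`, which applies the conventional 1.5 × IQR rule. The scores are made up, and I am assuming the plotting library uses a comparable rule; its quartile computation may differ slightly.

```r
# Hypothetical scores for a single content rating (illustrative values only)
scores <- c(1.9, 4.2, 5.8, 6.1, 6.3, 6.5, 6.7, 6.9, 7.2, 7.4, 7.8)

# coef = 1.5 is the default: points farther than 1.5 * IQR
# from the hinges are reported as outliers
stats <- boxplot.stats(scores, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# Points that would be drawn individually beyond the whiskers
print(stats$out)
```

Running this per content rating would give a count of outliers for each category instead of reading them off the plot.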

Which director has the highest average IMDB rating?

Let's display the top twenty directors.

temp <- movie %>% select(director_name, imdb_score)
temp <- temp %>% group_by(director_name) %>% summarise(avg = mean(imdb_score))
temp <- temp %>% arrange(desc(avg))
temp <- temp[1:20, ]

temp %>%
  formattable(list(avg = color_bar("orange")), align = 'l')
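One caveat with ranking by a plain average is that a director with a single highly rated film can top the list. Here is a hedged sketch of one possible refinement, not something done in this post: keep only directors with at least `min_films` titles before ranking. The data frame and the threshold are both hypothetical.

```r
# Hypothetical director/score pairs standing in for the real columns
toy <- data.frame(
  director_name = c("A", "A", "A", "B", "C", "C"),
  imdb_score    = c(8.0, 7.0, 7.5, 9.5, 6.0, 7.0),
  stringsAsFactors = FALSE
)

min_films <- 2  # hypothetical threshold, chosen only for illustration

# Group the scores by director, then compute mean and film count per group
by_director <- split(toy$imdb_score, toy$director_name)
avg   <- sapply(by_director, mean)
films <- lengths(by_director)

# Rank only the directors that meet the threshold
ranked <- sort(avg[films >= min_films], decreasing = TRUE)
print(ranked)
```

In this toy data, director B has the highest single score but only one film, so B drops out of the ranking.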


Is the number of Facebook likes an indicator of the IMDB score?

temp <- movie %>% select(movie_facebook_likes, imdb_score, content_rating)

plot_ly(temp, x = movie_facebook_likes, y = imdb_score,
        color = content_rating, mode = "markers", text = paste('Content:', content_rating))

We color this scatter plot by content rating. We do not see any trend here. There seem to be movies that have high IMDB scores but low Facebook likes.
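A scatter plot can hide a weak relationship, so one way to back up the visual impression is a rank correlation. Here is a minimal sketch with made-up likes and scores; Spearman's method is my choice here because like counts are heavily skewed, not something taken from the original analysis.

```r
# Hypothetical likes/score pairs standing in for
# movie_facebook_likes and imdb_score
likes <- c(0, 0, 1000, 24000, 33000, 85000, 118000, 164000)
score <- c(7.1, 6.2, 7.5, 6.6, 7.9, 6.8, 7.5, 8.5)

# Spearman's rho compares ranks rather than raw values, so a handful of
# movies with huge like counts does not dominate the result
rho <- cor(likes, score, method = "spearman", use = "complete.obs")
print(rho)
```

A value near 0 would support the "no trend" reading, while a value near 1 or -1 would contradict it.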

How are the likes of the total cast related to budget? Do producers spend more on popular actors?

temp <- movie %>% select(cast_total_facebook_likes, budget, movie_title, content_rating)

plot_ly(temp, x = cast_total_facebook_likes, y = budget,
        color = content_rating, mode = "markers", text = paste('Movie:', movie_title))

Oops! We can't see much in this plot. We try to reduce our data space by removing outliers.
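The filtering step itself is not shown here. Here is a minimal sketch of one way to do it, assuming outliers are defined by the 1.5 × IQR rule applied to both columns; the helper `within_fences` and the toy values are hypothetical.

```r
# Hypothetical stand-in for the cast_total_facebook_likes and budget columns
toy <- data.frame(
  cast_total_facebook_likes = c(143, 1873, 2036, 4834, 11700,
                                46055, 48350, 58753, 92000, 656730),
  budget = c(2.0e6, 2.6e7, 6.0e7, 2.37e8, 2.45e8,
             2.09e8, 3.0e8, 2.5e8, 2.6e8, 1.2e10)
)

# TRUE where a value lies within k * IQR of the quartiles;
# missing values are treated as not kept
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  !is.na(x) & x >= q[[1]] - k * spread & x <= q[[2]] + k * spread
}

# Keep rows that are inside the fences on both variables
keep <- within_fences(toy$cast_total_facebook_likes) & within_fences(toy$budget)
trimmed <- toy[keep, ]
print(trimmed)
```

A percentile cutoff (for example, dropping everything above the 99th percentile) would be an equally reasonable choice; the point is only to shrink the axes so the bulk of the movies becomes visible.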

We do not see much here either. The Legend of Ron Burgundy has the highest number of cast Facebook likes with a budget of just 26M! The Gladiator, on the other hand, has 6521 cast likes with a budget of 103M!

Is the number of likes for a director tied to the IMDB score?

temp <- movie %>% select(director_facebook_likes, imdb_score, content_rating, movie_title)

plot_ly(temp, x = director_facebook_likes, y = imdb_score,
        color = content_rating, mode = "markers", text = paste('Movie:', movie_title))

As we see, this is not true. There are directors with low Facebook likes whose movies have high IMDB scores. We find that the movie Towering Inferno has an IMDB score of 9.5 with director likes of 0.

How many films have grossed less than their budgets each year?

temp <- movie %>% select(gross, budget, title_year)
temp$diff <- temp$gross - temp$budget   # negative when a film grossed less than its budget
temp <- na.omit(temp)

temp$profit <- rep('', dim(temp)[1])
temp$profit <- ifelse(temp$diff < 0, 'No', 'Yes')
temp <- temp %>% group_by(title_year) %>% summarise(n = sum(profit == 'No'))

p <- plot_ly(temp, x = title_year, y = n, name = "Number of low grossing movies by Year")
p %>%
  add_trace(y = fitted(loess(n ~ as.numeric(title_year))), x = title_year) %>%
  layout(title = "Year and Low Gross",
         showlegend = FALSE) %>%
  dplyr::filter(n == max(n)) %>%
  layout(annotations = list(x = title_year, y = n, text = "Peak", showarrow = T))

The year with the largest number of low-grossing movies was 2004, with 104 such movies. The trend seems to be increasing, with a lot of indie films coming into the picture.

There are many ways we can look into this data. Accurately predicting movie ratings through machine learning is one of the biggest challenges. We will address this in the next post.

Thank You for Reading!

Interactive Plots for this project can be found here.
