Exploring Data: Successfully Start a Business in the Movie Making Industry

Andrea Cabello
Python in Plain English
8 min read · Sep 17, 2020

In this article, I present an exploratory data analysis (EDA) that sets out to answer one question: what type of movies should a new studio produce?

Scenario

Let’s say Microsoft has decided to create a new movie studio. They have hired me to help them better understand the movie industry and make sound decisions based on analysis and science.

To get started, I was given 11 sets of data obtained from the following sources:

  • Rotten Tomatoes
  • Box Office Mojo
  • IMDB
  • TheMovieDB.org
  • The-Numbers.com

I was assigned the following tasks:

  • Explore the given data and/or find complementary data,
  • Obtain meaningful, actionable insights from it that will
  • Help the new head of the studio decide what type of films to create.

Business Problem

What type of films to create?

As I prepared to begin reviewing and understanding the data, I put together a preliminary list of questions that could help answer the ultimate question: what type of movies to create?

  • What type of films are currently doing the best at the box office?
  • What does “type of films” mean? The Genre becomes relevant right away!
  • What do people like?
  • How much does it cost to make each type of movie?
  • Who is my competition? And how are they doing?
  • How much money are they making?
  • What’s the influence of Netflix and other streaming platforms?

Understanding the Data

Objective: Identify which, if any, of the provided data sets can help me answer the questions above.

  • The preliminary list of questions above helped me identify what kind of information I should be looking for within the data.
  • Our sources are well-known experts in the subject.
  • I inspected all of the provided data sets using Python and the pandas library.

After carefully reviewing all of the given data sets, I selected 4 of them (from IMDB and The-Numbers.com) based on:

  • the number of null values present in each data set, and
  • the variables (such as gross income, year released and budget) that will help answer my list of questions.
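
A screening step like this could be sketched in pandas roughly as follows. This is a minimal illustration, not the code from the original analysis: the helper `null_summary`, the data set names and the toy values are all hypothetical stand-ins for the real files.

```python
import pandas as pd

def null_summary(datasets):
    """Return one row per data set with its shape and share of missing cells."""
    rows = []
    for name, df in datasets.items():
        total_cells = df.shape[0] * df.shape[1]
        missing = int(df.isna().sum().sum())
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_cells": missing,
            "missing_pct": round(100 * missing / total_cells, 1) if total_cells else 0.0,
        })
    # Cleanest data sets first.
    return pd.DataFrame(rows).sort_values("missing_pct").reset_index(drop=True)

# Toy frames standing in for the real files (hypothetical names and values).
datasets = {
    "budgets": pd.DataFrame({"movie": ["A", "B"], "production_budget": [1e6, 2e6]}),
    "ratings": pd.DataFrame({"movie": ["A", "B"], "averagerating": [7.1, None]}),
}
summary = null_summary(datasets)
print(summary)
```

Sorting by the share of missing cells makes it easy to see at a glance which data sets are complete enough to work with.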

The Netflix Effect

None of the data sets provided contains observations that could help us measure the Netflix effect. I wondered why, so I went searching for any available Netflix data.

According to Google search results, Netflix used to have an API but shut it down back in 2014. I was able to find a data set containing some basic info on the titles currently available on Netflix. Even though this data won’t let us measure Netflix users’ preferences or Netflix’s finances, I consider it worth a look as a general reference on what type of content they are offering.

Data Preparation

Objective: to identify what type of movies to create.

  • I applied statistics to these data sets with the help of Python and the pandas library.
  • I used matplotlib and seaborn to put together some bar graphs to display the information.
  • For method and approach, I focused on genres and finances.
  • Since we cannot measure the effect of streaming services directly, I decided to only include movies released in 2010 or after to limit how much that gap distorts the results.
  • Patterns over outliers. There’s only one Avatar or one Titanic.
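
The release-year cut-off could be applied along these lines. This is only a sketch: the column names `title` and `release_date` and the three toy rows are assumptions, not the actual schema of the data sets.

```python
import pandas as pd

# Toy rows (hypothetical); the real data sets have many more columns.
movies = pd.DataFrame({
    "title": ["Avatar", "Inception", "Frozen"],
    "release_date": ["2009-12-18", "2010-07-16", "2013-11-27"],
})

# Parse the date strings and keep only films released in 2010 or later.
movies["release_year"] = pd.to_datetime(movies["release_date"]).dt.year
recent = movies[movies["release_year"] >= 2010].reset_index(drop=True)
print(recent)
```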

Hypothesis

‘Finding a balance between what is most profitable for the studio and what the people like will tell us what type of movies to create’.

With a better understanding of the data, we now know which questions can be answered and it’s time to prepare and clean the data to address the following:

  • Question 1: Who is my competition? What are they doing?
  • Question 2: What about the numbers? How much does it cost to make a movie, and how much does it return?
  • Question 3: What do people like?
  • Appendix: What type of content is Netflix offering the most?

Let’s see what I found!

Experimentation

Question 1: Who is my competition?

  • Top 10 Studios producing the most films.
  • Top 10 Studios making the highest gross income on average.

First I looked at the information we had on studios and found some interesting things.

  • Here, we can see the Top 10 Studios with the most movies produced since 2010. All of these studios are of American origin except for SPC, which is an Argentinian company.
  • On average, each studio produced 14 movies a year between 2010 and 2018.
  • I was interested in looking at the average gross income by studio and, surprise, surprise, I found that two studios made it to the Top 10 with just one movie each.
  • These studios happen to be foreign. We see HC from China (Wolf Warrior 2) in first place and GRT from India (Baahubali 2: The Conclusion) in fourth place.
  • In second place, Paramount/DreamWorks had a count of 10 movies: 9 of them are animated movies (think Shrek and Kung Fu Panda) and 1 is a comedy (A Thousand Words).
  • A slightly negative correlation between the number of movies produced and the average gross income indicates that producing more movies does not imply a higher average gross.
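
The two rankings and the correlation above can be reproduced with a single groupby. The sketch below assumes one row per movie with hypothetical columns `studio` and `total_gross`; the figures are made up purely to show the mechanics.

```python
import pandas as pd

# One row per movie (toy values, hypothetical column names).
gross = pd.DataFrame({
    "studio": ["S1", "S1", "S1", "S2", "S2", "S3"],
    "total_gross": [100.0, 50.0, 30.0, 200.0, 100.0, 400.0],
})

# Count of movies and average gross per studio.
by_studio = gross.groupby("studio")["total_gross"].agg(
    movie_count="count", avg_gross="mean"
)

top_by_count = by_studio.sort_values("movie_count", ascending=False).head(10)
top_by_avg = by_studio.sort_values("avg_gross", ascending=False).head(10)

# Pearson correlation between output volume and average gross.
corr = by_studio["movie_count"].corr(by_studio["avg_gross"])
print(top_by_count, top_by_avg, corr, sep="\n\n")
```

Note that a studio with a single blockbuster (like `S3` here) tops the average-gross ranking, which is the same effect described above for the one-movie studios.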

Inferences

  • Making more movies does not mean making more money.
  • The movie industry is mostly concentrated in the USA.
  • Interesting presence of China, India and Argentina in our Top 10s.
  • Now that we understand our competition a little better, let’s look at the numbers.

Question 2: What about the numbers?

  • Top 10 most produced Genres
  • Average Budget per Genre
  • Calculate Profit by Genre
  • Calculate Gross Profit Margin by Genre (return per every dollar invested)

Above we can see the Top 10 genres with the highest average budget and the highest average gross profit, which I obtained by subtracting the budget from the worldwide gross. I used stars to mark the genres that stand out as interesting. I was not expecting at all to see Musical as the number one genre. That made me ask: does the budget have anything to do with the profit?

Yes, it does! Here you can see how the profit increases as the budget does.
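
As a rough sketch of that calculation, the snippet below computes profit as worldwide gross minus budget, averages it per genre, and checks the budget-profit correlation. The column names (`genres`, `production_budget`, `worldwide_gross`) and the comma-separated genre format are assumptions, and the values are toy numbers.

```python
import pandas as pd

# Toy rows (hypothetical columns and values, in millions of dollars).
films = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy,Adventure", "Drama", "Adventure"],
    "production_budget": [10.0, 4.0, 20.0],
    "worldwide_gross": [40.0, 6.0, 50.0],
})

# Gross profit: worldwide gross minus production budget.
films["profit"] = films["worldwide_gross"] - films["production_budget"]

# A film tagged with several genres counts once for each of them.
per_genre = (
    films.assign(genre=films["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")[["production_budget", "profit"]]
    .mean()
    .sort_values("profit", ascending=False)
)

# Does profit move with budget?
budget_profit_corr = films["production_budget"].corr(films["profit"])
print(per_genre)
print(budget_profit_corr)
```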

Up next, I applied my knowledge of finances to go a step further and calculate the average Gross Profit Margin or Ratio. This is obtained:

This operation returns the margin as a percentage (%), which represents the return an investor gets for every dollar invested.
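
For reference, here is a sketch of how such a percentage could be computed. It is an assumption about the calculation, not the exact formula used in the analysis: the textbook gross profit margin divides profit by revenue (here, worldwide gross), while a "return per dollar invested" reading divides profit by the budget, so both are shown. Column names and values are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical columns and values, in millions of dollars).
films = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy"],
    "production_budget": [4.0, 10.0, 5.0],
    "worldwide_gross": [10.0, 20.0, 20.0],
})

# Rows with a zero gross or budget would need filtering first to avoid dividing by zero.
films["profit"] = films["worldwide_gross"] - films["production_budget"]

# Textbook gross profit margin: profit as a share of revenue.
films["margin_pct"] = 100 * films["profit"] / films["worldwide_gross"]

# Alternative reading: profit as a share of the money invested.
films["return_on_budget_pct"] = 100 * films["profit"] / films["production_budget"]

margin_by_genre = films.groupby("genre")[["margin_pct", "return_on_budget_pct"]].mean()
print(margin_by_genre)
```

The two measures rank genres similarly but are on different scales: the margin can never exceed 100%, while the return on budget can.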

These are great numbers. If they hold, these should be the genres that get produced the most, right? Let’s look into that.

Inferences

  • Drama, Documentary and Comedy are the Top 3 most produced genres.
  • There’s a positive correlation between Budget, Gross Profit and Gross Profit Margin for our Top 10 Genres.
  • Comedy, Adventure and Sci-Fi appear in all Top 10s.

Question 3: What do people like?

  • Top 10 Best Rated Genres

For this question, to reduce bias, I kept only movies with a comparable number of votes and filtered out movies with extremely high or extremely low ratings (scale 1–10).
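
A filter of this kind might look like the sketch below. The exact cut-offs used in the analysis are not stated, so the ranges passed to the hypothetical helper `filter_ratings` are placeholders, as are the column names `averagerating` and `numvotes` and the toy rows.

```python
import pandas as pd

# Toy rows (hypothetical columns and values).
ratings = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Music", "Music"],
    "averagerating": [7.0, 9.9, 6.0, 6.4, 8.0, 1.2],
    "numvotes": [1200, 5, 1100, 1300, 1250, 900000],
})

def filter_ratings(df, votes_range, rating_range):
    """Keep rows whose vote count and rating both fall inside the given inclusive ranges."""
    votes_ok = df["numvotes"].between(*votes_range)
    rating_ok = df["averagerating"].between(*rating_range)
    return df[votes_ok & rating_ok]

# Placeholder thresholds: comparable vote counts, no extreme ratings.
kept = filter_ratings(ratings, votes_range=(1000, 2000), rating_range=(2.0, 9.5))

best_rated = (
    kept.groupby("genre")["averagerating"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(best_rated)
```

Here the 9.9 rating backed by only 5 votes and the 1.2 rating with an outsized vote count are both dropped before the genres are ranked.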

Inferences

  • In our bar plot we saw that the Top 3 Best Rated Genres did not appear in our previous Top 10s.
  • A slightly negative correlation between Ratings and Profit Margin reveals that best rated does not always mean most profitable.
  • Music, Musical, Drama, Biography and Adventure are well rated and also highly profitable.
  • Drama is our most produced genre. It also appears in the top 12 best rated and in the top 12 by return percentage, at 61.53%.
  • Documentary is both often produced and generally liked; however, it is not the most profitable kind of production.
  • Adventure and Animation are amongst our best rated and most profitable genres but are not the most produced.

Appendix: Netflix Analysis

  • Top 5 Genres
  • Top 5 Origin Countries

Even though I could not find data on Netflix users’ preferences or Netflix’s finances, I consider it worth looking at what is available as a general reference on the type of content they are offering.

Inferences

  • Netflix content splits at about 70% movies and 30% TV shows.
  • International Movies is one of the most popular categories.
  • Looking at it by country, we see a clear dominance of the US in the movie industry, followed by India.
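
Breakdowns like these can be obtained with `value_counts`. In the sketch below, the column names `type`, `country` and `listed_in`, the comma-separated multi-value format and the toy rows are all assumptions about the Netflix titles data set; `top_values` is a hypothetical helper.

```python
import pandas as pd

# Toy rows (hypothetical columns and values).
titles = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "country": ["United States", "India", "United States, India", None],
    "listed_in": ["Dramas, International Movies", "International Movies",
                  "TV Dramas", "Comedies"],
})

# Share of movies vs. TV shows, as percentages.
type_share = titles["type"].value_counts(normalize=True).mul(100).round(1)

def top_values(series, n=5):
    """Split comma-separated cells, count each value once per title, return the top n."""
    return series.dropna().str.split(", ").explode().value_counts().head(n)

top_countries = top_values(titles["country"])
top_genres = top_values(titles["listed_in"])
print(type_share, top_countries, top_genres, sep="\n\n")
```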

Future Work

  • It is important to note that we did not take into consideration the effect of the COVID-19 pandemic on consumer behavior and preferences.
  • Hopefully, more data on Netflix and other streaming services will become available in the near future. That will allow us to measure the effect they have on distribution and how the money is made.

Recommendation

  • In our EDA we looked at the Genres individually, but more often than not movies fall within more than one category or genre. Having said that, some sort of combination between the genres shown in the image below will produce the best movies in terms of profitability and acceptance amongst consumers.
  • Set the budget per movie between $4M and $10M.
  • Consider opportunities for expansion into the international market.
  • Use the technology knowledge and tools of Microsoft (the parent company) to maximize quality of films as well as the distribution.

SWOT Analysis

Thank you for reading through my work. I hope you found it relevant.

Here is a link to the GitHub Repository that contains the technical work.

Please do contact me if there are any questions and/or comments.

She. Her. Passionate mind. Stubborn soul. Bohemian heart. Born and raised Peruvian. Proud New Yorker. Self made bilingual. Data Scientist in the making.