IMDB Movie Analysis

Aryan
Oct 15, 2022
4 min read

Project Description:

1. Cleaning the data: This is one of the most important steps to perform before moving forward with the analysis. Use your knowledge learned till now to do this. (Dropping columns, removing null values, etc.).

2. Movies with the highest profit: Create a new column called profit which contains the difference between the two columns: gross and budget. Sort the column using the profit column as a reference. Plot profit (y-axis) vs budget (x-axis) and observe the outliers using the appropriate chart type.

3. Find IMDB Top 250: Create a new column IMDb_Top_250 and store the top 250 movies with the highest IMDb Rating (corresponding to the column: imdb_score). Also make sure that for all of these movies, the num_voted_users is greater than 25,000. Also, add a Rank column containing the values 1 to 250 indicating the ranks of the corresponding films. Extract all the movies in the IMDb_Top_250 column which are not in the English language and store them in a new column named Top_Foreign_Lang_Film.

4. Best Directors: Group the column using the director_name column. Find out the top 10 directors for whom the mean of imdb_score is the highest and store them in a new column top10director. In case of a tie in IMDb score between two directors, sort them alphabetically.

5. Popular Genres: Perform this step using the knowledge gained while performing previous steps.

6. Find the critic-favourite and audience-favourite actors: Create three new columns namely, Meryl_Streep, Leo_Caprio, and Brad_Pitt which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the actor_1_name column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction. Append the rows of all these columns and store them in a new column named Combined. Group the combined column using the actor_1_name column. Find the mean of the num_critic_for_reviews and num_users_for_review and identify the actors which have the highest mean. Observe the change in the number of voted users over decades using a bar chart. Create a column called decade which represents the decade to which every movie belongs. For example, the title_year year 1923, 1925 should be stored as the 1920s. Sort the column based on the column decade, group it by decade and find the sum of users who voted in each decade. Store this in a new data frame called df_by_decade.

Approach:

All the questions are clearly explained and can be solved using Microsoft Excel only.

Tech-Stack Used:

Microsoft Excel for Microsoft 365

Dataset:

https://drive.google.com/file/d/1XpGThHzLnXxL_7aQo2sCpYL3SeB18MMB/view

Insights:

All the questions asked can be answered through Excel. I used various operators and functions to get the solutions

1. Cleaning the data

Inspect Null values using COUNTBLANK() function to count the number of empty cells in all the columns and rows. Also, we will find the % of Null values in each column.
Here, we are analyzing the movies with respect to gross collection, ratings, popularity, etc and many columns are not required in the dataset. like : color, director_facebook_likes, actor_1_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, actor_2_name, cast_total_facebook_likes, actor_3_name, duration, facenumber_in_poster, content_rating, country, movie_imdb_link, aspect_ratio, plot_keywords.
We can also see that some columns have large % of null values, which will drop all such rows.
I observed that the language column has some NaN Values and believe that it is okay to replace all the missing values with 'English'.
After checking the number and percentage of the rows retained after doing the above tasks, I have noticed that there are still 77.16 % of the rows.

2. Movies with the highest profit:

Convert the unit of the budget and gross columns from $ to million $ by dividing the amount by 1000000
Then, created a new column called profit which contains the difference between gross and budget.
Sorted the data frame using the profit column as a reference.
Extracted the top 10 profiting movies in descending order and stored them in a new data frame - top10

3. IMDb Top 250:

Sort all the values by imdb_score in descending order
After that, filter the num_voted_users to more than 25,000
Extracted the top 250 movies in descending order and stored them in a new dataframe - IMDb_Top_250
Extracted the non-English movies from dataframe - IMDb_Top_250 and stored in Top_Foreign_Lang_Film

4. Best Directors:

Group all the values by director_name and imdb_score and mean
Sorted all the values by director_name(ascending) and imdb_score(descending).
Extracted the top 10 directors in descending order and stored them in a new dataframe - top10director

5. Popular Genres:

As the genres column data is separated by a pipe (|), the first two are most significant for any film and I have split the values from the genres column and extracted them into 3 columns, i.e. genre_1, genre_2, remainder
If genre_2 is empty, then copy the same value from genre_1 to genre_2 also.
Group the data using genre_1 as the primary column and genre_2 as the second column
The mean of the gross values using the gross column and stored in a new data frame named PopGenre.

Most popular: Family and Sci-Fi.

6. Critic-favorite and Audience-favorite Actors:

Created three new data frames namely, Meryl_Streep, Leo_Caprio, and Brad_Pitt which contained the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Then, appended the rows of all these data frames and stored them in a new data frame named 'Combined'.
Then, group the 'Combined' dataframe using the actor_1_name column.
Calculated mean of the num_critic_for_reviews and num_users_for_review