GitHub Repository: suyashi29/python-su
Path: blob/master/Python for Data Science/EDA Movies.ipynb
³⁰⁷⁴ views

Kernel: Python 3 (ipykernel)

Problem Statemet

The dataset consists of the survey data of the Movies within the years 2006-2016
Various variables present in the dataset inclueds genre,ratings,votes,revenues,etc .
The dataset comprises of 1000 observations of 12 columns. Below is a table showing names of all the columns and their description.

| Column Name | Description | | ------------- |:------------- 😐 | Rank | Rank of the Movie | | Title | Title of the movie | | Genre | To what Genre the movie belongs | | Description | The description of the movie which gives the backgound of the movie | | Director | Name of the director of the movie | | Actors | Actors of the Movie | | Year | Year in which the Movie was released | | Runtime (Minutes) | The duration of the movie in minutes| | Rating | Rating of the Movie | | Votes | Votes given to the Movie | | Revenue (Millions) | Revenue made by the Movie in millions | |Metascore |Score of the movie on the metacritic website|

Importing Packages

1. Read Data
2.Check shape of data
3.check for null values
4.Replace null values
5.Decscribe your data
6. Drop if not acceptable for analysis
7. Remove duplicates
8. What is the count of moving with years?

In [ ]:


##pip install --proxy http://uername:[email protected]:8000 sweetviz

In [ ]:

import numpy as np                                                 # Implemennts milti-dimensional array and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling as prof
import matplotlib.pyplot as plt                                    # Plotting library for Python programming language and it's numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high level interface for drawing attractive and informative statistical graphics
%matplotlib inline 
sns.set() #To switch to seaborn defaults, simply call the set() function
from subprocess import check_output 
#to run new applications or programs through Python code by creating new processes.
import warnings 
warnings.filterwarnings('ignore')

Importing the Movies Dataset

movies=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\movies.xlsx").parse("Sheet1")

In [ ]:


#movies=pd.read_csv("https://raw.githubusercontent.com/suyashi29/python-su/master/Data%20Analysis%20using%20Python/imdb.csv")

In [ ]:

movies=pd.read_excel("movies.xlsx")
## Movie trend: Rank, drop Title, Genre, drop description, drop director, drop actor, year, runtime(duration), rating(), drop vote, renvnue, metascore

In [ ]:

movies.shape

The Movies dataset has 1000 observations and 12 columns

In [ ]:

#movies['Year'] = movies['Year'].astype(str)

In [ ]:

movies.columns

These are the columns present in the dataset

In [ ]:

movies.head(2)

In [ ]:

movies.tail(1)

In [ ]:

movies.info()

In [ ]:

movies.describe()

It is seen from the data that revenue has a high standard deviation and hence a large spread of revenue genration for the movies

In [ ]:

movies.describe(include='object')

##Checking null values

In [ ]:

m=movies.isnull().sum()
miss= (movies.isnull().sum()/len(movies))*100
miss_data=pd.concat([m,miss],axis=1,keys=['Total','%'])
print(miss_data)

From the above output we can see that Revenue(Millions) and Metascore columns contains maximum null values

Pre Profiling

profile = prof.ProfileReport(movies) profile.to_file(outputfile="Movies_before_preprocessing.html")

Here, we have done Pandas Profiling before preprocessing our dataset, so we have named the html file as Movies_before_preprocessing.html.Now we will process our data to better understand it.

Preprocessing

Adding new features

In [ ]:

new = movies['Genre'].str.split(",", 2)
movies['Genre 1']=new.str.get(0)
movies['Genre 2']=new.str.get(1)
movies['Genre 3']=new.str.get(2)

In [ ]:

movies['Genre 1'].describe(include='all')

In [ ]:

movies['Genre 2'].describe(include='all')

In [ ]:

movies['Genre 3'].describe(include='all')

In [ ]:

movies.drop('Genre', axis = 1,inplace = True)

In [ ]:

movies.drop('Description', axis = 1,inplace = True)

In [ ]:

movies.drop_duplicates(subset=None,keep="first",inplace=False)

In [ ]:

movies["Rating"].describe() # Describing a particular field

In [ ]:

movies['RB']=0
movies.loc[movies['Rating']<=5,'RB']="LOW"
movies.loc[(movies['Rating']>5)&(movies['Rating']<=7),'RB']="MEDIUM"
movies.loc[movies['Rating']>7,'RB']="HIGH"
movies.head(2)

## Create a Field for movie duration

Treating missing values

In [ ]:

# Metascore by Median()
movies["Metascore"]=movies["Metascore"].fillna(0)

In [ ]:

# Metascore by 
r=movies["Revenue (Millions)"].median()
movies["Revenue (Millions)"]=movies["Revenue (Millions)"].fillna(r)

Post Pandas Profiling

profile = prof.ProfileReport(movies)
profile.to_file(outputfile="Movies_after_preprocessing.html")

Questions

Q1) Movies made on year basis?

In [ ]:


movies.groupby(['Genre 1'])['Year'].count()

MDF = pd.read_excel('Movies.xlsx')
base = ['Adventure','Crime','Comedy','Sci-Fi','Action']
for var in base:
    MDF[var]= MDF['Genre'].apply(lambda x: 1 if var in str(x) else 0)
MDF['Other'] = MDF.apply(lambda x: 0 if (x['Adventure']+
                                        x['Crime']+
                                        x['Comedy']+
                                        x['Sci-Fi']+
                                        x['Action'])>=1 else 1,axis=1)
check = MDF.groupby(['Year']).agg({'Adventure':sum,
                                       'Crime':sum,
                                      'Sci-Fi':sum,
                                      'Action':sum,
                                      'Comedy':sum,
                                       'Other':sum})
check.reset_index(inplace=True)
check['Total'] = check.apply(lambda x: x['Adventure']+x['Crime']+x['Sci-Fi']+
                                       x['Action']+x['Comedy']+x['Other'],axis=1)
base2 = base+ ['Other']
for var in base2:
    check[var+'_norm']=check.apply(lambda x: float(x[var])/float(x['Total'])*100,axis=1)
base3 = [x+'_norm' for x in base2]
check.plot(kind='bar',x=['Year'],y=base3,stacked=True)
plt.show()

In [ ]:

movies.groupby(['Year'])['Year'].count()

Observation:

From the above we can see that with passing years more movies are being made year by year
The above data shows that there is a sudden increase in the creation of movies in year 2016 as compared to year 2015

In [ ]:

sns.countplot(x='Year', data=movies).set_title('Count plot for Movies with passing Years.')
sns.set(rc={'figure.figsize':(25,30)})
plt.show()

Q3) What is the Rating that a movie normally gets?

In [ ]:

movies.groupby(['RB'])['RB'].count()

In [ ]:


#movies.groupby(['Rating'])['Rating'].count()

In [ ]:

sns.set(rc={'figure.figsize':(10,10)})
sns.countplot(x='RB', data=movies).set_title('Count plot for Movies with according to their Rating.')

The data is negatively skewed

In [ ]:

movies['Rating'].plot.hist()

Observation

It is observed that on an average a large number of movies get ratings from the range of 6.1-7.5

Observation

It is observed that maximum Directors only direct a movie in a span of 10 years

In [ ]:

sns.boxplot(x="Year", y="Rating", data=movies)

Q5) What is the Revenue generated by Movies, does rating affect revenue generation?

In [ ]:

movies['Revenue (Millions)'].sort_values().plot.hist()

Observation

Maximum movies product a revenue in the range of 0-100 (Millions)
Very few Movies generate a revenue in the range of 380-650 (Millions)
Positively skewed data

In [ ]:

sns.boxplot(x="RB", y="Revenue (Millions)", data=movies)

Observation

Maximum movies with Rating 6-7 generate a revenue of 100-200

Q6) Runtimes of movies affect revenue and ratings?

In [ ]:

movies['Runtime (Minutes)'].sort_values().plot.hist()

Observation:

Very few movies have a runtime of less than 80 mins and more than 165 mins
Generally Movies have a runtime of 90 mins to 130 mins

Observation

A large number of Movies with a runtime of 95-110 minutes produce a revenue of 0-100 millions

Size of confidence intervals(ci) to draw around estimated values. If “sd”, skip bootstrapping and draw the standard deviation of the observations. If None, no bootstrapping will be performed, and error bars will not be drawn.

In [ ]:

sns.barplot("RB","Runtime (Minutes)", data=movies)
#sns.barplot("Year","Runtime (Minutes)", data=movies,color="lightblue")

In [ ]:

movies.plot.hexbin(x='Revenue (Millions)', y='Runtime (Minutes)', gridsize=10)

Observation

A large number of movies with runtime 95-125 mins get an average rating of 6-7.5

Q7) Effect of Metascore, are metascore and rating related ?

In [ ]:

movies['Metascore'].sort_values().plot.hist()

Observation

A large amount of movies get a metascore of 55-70
Around 20 movies get a good metascore of 90-100
The metascore has a perfect distribution

In [ ]:

movies.plot.scatter(x='Revenue (Millions)', y='Rating')

In [ ]:

movies.plot.scatter(x='Metascore', y='Rating')

Observation

The Scatter plot defines that Rating and Metascore are closely related

In [ ]:

movies["Rating"].plot.hist(color='r', alpha=0.7)

In [ ]:

movies[movies['Genre 1'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6),autopct='%2.2f%%')
plt.axis('equal')
plt.show()

movies.plot(kind="bar",x='Genre 2',y="Year",stacked=True)

In [ ]:

# print (movies["Year"].pct_change())

In [ ]:

sns.heatmap(movies.corr(),annot=True)
#Use annotation to add correlation numbers to the Seaborn heatmap
plt.figure(figsize=(5,5))

In [ ]:

movies.plot.area(x='Votes',y="Rating")

In [ ]:

movies.plot.scatter(x='Votes', y='Rating')

Observation

It will be very rare case in which the movie may receive a lot of votes but when a movie gets a large number of votes it has a good rating
The larger the number of votes the best is the rating.
There is a close co relation between Votes and Rating

Q9) What kind of Genres movies are made?

In [ ]:

print("The count of different Genres for Genre 1")
movies.groupby(['Genre 1'])['Genre 1'].count().sort_values()

In [ ]:

import collections
print("The count of different Genres")
g1 = dict(movies.groupby(['Genre 1'])['Genre 1'].count().sort_values())
g2 = dict(movies.groupby(['Genre 2'])['Genre 2'].count().sort_values())
g3 = dict(movies.groupby(['Genre 3'])['Genre 3'].count().sort_values())
g0 = [g1, g2, g3]
counter = collections.Counter()
for d in g0:
    counter.update(d)
g = dict(counter)
pd.DataFrame(g.values(), index=g.keys(), columns=['cnt']).plot(kind='pie', y='cnt',figsize=(10, 10),autopct='%1.1f%%', legend=False)
plt.axis('equal')
plt.show()

From the above data of Genre 1 we conclude that the top three primary Genre movies that are made are __ Action,Drama,Comedy__

In [ ]:


movies.plot.line(y="Runtime (Minutes)",x="Year" )

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 1', data=movies).set_title('Count plot for Genre 1')

In [ ]:

print("The count of different Genres for Genre 2")
movies.groupby(['Genre 2'])['Genre 2'].count().sort_values()

From the above data of Genre 2 we conclude that the top Genre movies that are made are __ Drama,Adventure,Romance,Comedy__
Romance, Comedy, Crime, Thriller, Mystery, Horror made at the same level

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 2', data=movies).set_title('Count plot for Genre 2')

In [ ]:

print("The count of different Genres for Genre 3")
movies.groupby(['Genre 3'])['Genre 3'].count().sort_values()

From the above data of Genre 3 we conclude that the top Genre movies that are made are __ Thriller,Sci-Fi,Drama,Romance and Fantasy__

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 3', data=movies).set_title('Count plot for Genre 3')

Observation

From evaluating the above data on Genres it is seen that Maximum movies have Drama in them within the data of 1000 Movies, 513 being for Drama meaning around 50% movies have Drama as their Genre
Action Movies are also made but not as Drama with a count of 303 for a data of 1000 movies
Comedy Movies are made more than Adventure with a count of 279 for Comedy and 259 for Adventure
Thriller movies are made more than Crime Movies with a count of 195 and 150
Crime and Romance movies are equally made as the count suggests from the above 150 for Crime and 141 for Romance

Q10) Which genre produces best Revenue in years 2015,2016?

In [ ]:

sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin(['2015', '2016'])])

In [ ]:

sns.violinplot(x='Genre 1',y='Revenue (Millions)',hue='Year',data=movies.loc[movies['Year'].isin(['2015','2016'])],split=True)

In [ ]:

sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin(['2015', '2016'])])

In [ ]:

sns.swarmplot(x="Genre 3", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin(['2015', '2016'])])

From the above plots it is observed that Drama,Adventure,Action,Comedy movies are made and they have generated large amount of revenue in the year 2015,2016
2016 has generated more revenue as compared to 2015

Conclusions

The Movies data set helps us in deducting that year by year more and more Movies are being made,also there is a sudden increase in the creation of movies in year 2016 as compared to year 2015
There are 128 Movies which did not generate revenue at all even after releasing in the year Range 2006-2016
Maximum number of Movies get the rating of 7.1, there is only one movie which got a rating on 9
A large number of movies get ratings from the range of 6.1-7.5
Director Ridley Scott has Directed the maximum number of movies which are 8 in a period of 2006-2016,maximum Directors only direct a movie in a span of 10 years
Maximum movies produce a revenue in the range of 0-100 (Millions),Very few Movies generate a revenue in the range of 380-650 (Millions)
Generally Movies have a runtime of 90 mins to 130 mins.A large number of movies with runtime 95-125 mins get an average rating of 6-7.5
When a movie gets a large number of votes it has a good rating.The larger the number of votes the best is the rating. There is a close co relation between Votes and Rating
Maximum movies have Drama in them within the data of 1000 Movies, 513 being for Drama meaning around 50% movies have Drama as their Genre
It is observed that Drama,Adventure,Action,Comedy movies are made and they have generated large amount of revenue in the year 2015,2016, Mostly movies with revenue of less than 100 million contribute to the revenue generation of the year.

Problem Statemet

Importing Packages

Importing the Movies Dataset

Pre Profiling

Preprocessing

Adding new features

Treating missing values

Post Pandas Profiling

Questions

Observation:

Observation

Observation

Observation

Observation

Observation:

Observation

Observation

Observation

Observation

Observation

Observation

Conclusions

Product

Resources

Company