GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analysis using Python/EDA Movies.ipynb

Kernel: Python 3

Problem Statement

The dataset consists of survey data on movies released in the years 2006-2016.
The variables in the dataset include genre, ratings, votes, revenue, etc.
The dataset comprises 1000 observations of 12 columns. The table below lists each column and its description.

| Column Name | Description |
| ------------- |:------------- |
| Rank | Rank of the movie |
| Title | Title of the movie |
| Genre | Genre(s) the movie belongs to |
| Description | Description of the movie, giving its background |
| Director | Name of the director of the movie |
| Actors | Actors of the movie |
| Year | Year in which the movie was released |
| Runtime (Minutes) | Duration of the movie in minutes |
| Rating | Rating of the movie |
| Votes | Votes given to the movie |
| Revenue (Millions) | Revenue made by the movie in millions |
| Metascore | Score of the movie on the Metacritic website |

Importing Packages

In [5]:

import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
#import pandas_profiling as prof
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
# Display the output of plotting commands inline within frontends like the Jupyter notebook.
# The comment sits on its own line because IPython treats trailing text after a magic as arguments.
%matplotlib inline
sns.set()  # Switch to seaborn defaults
from subprocess import check_output  # Runs external programs from Python code by creating new processes
import warnings
warnings.filterwarnings('ignore')

In [ ]:

1- Describe data
2- Check for null values
3- Replace null values
4- Drop Columns
5- Which industry has highest average Revenue?
6- What is the Count according to industry
7- What is Distribution of Revenue and Growth and Expense in 2005?

Importing the Movies Dataset

movies=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\movies.xlsx").parse("Sheet1")

In [ ]:

company=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\CompanyDetails.xlsx").parse("Overview")
company

In [6]:

movies=pd.read_excel("movies.xlsx") 
## Movie trend: Rank, drop Title, Genre, drop Description, drop Director, drop Actors, Year, Runtime (duration), Rating, drop Votes, Revenue, Metascore

In [7]:

movies.shape

Out[7]:

(1000, 12)

The Movies dataset has 1000 observations and 12 columns

In [ ]:

#movies['Year'] = movies['Year'].astype(str)

In [ ]:

movies.columns

These are the columns present in the dataset

In [8]:

movies.head(2)

Out[8]:

In [9]:

movies.tail(1)

Out[9]:

In [10]:

movies.info()

Out[10]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB

In [11]:

movies.describe()

Out[11]:

The data shows that revenue has a high standard deviation, i.e. revenue generation varies widely across movies.
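
One hedged way to put a number on that spread is the coefficient of variation (standard deviation divided by the mean). The sketch below uses a small made-up revenue series, not the values from movies.xlsx; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical revenue values in millions (illustrative only, not from movies.xlsx)
revenue = pd.Series([10.0, 20.0, 30.0, 40.0, 400.0], name="Revenue (Millions)")

mean = revenue.mean()
std = revenue.std()   # sample standard deviation (ddof=1), as reported by describe()
cv = std / mean       # coefficient of variation; values above 1 signal a very wide spread

print(f"mean={mean:.1f}, std={std:.1f}, cv={cv:.2f}")
```

A single large value is enough to push the standard deviation above the mean, which is the pattern the describe() output points to for revenue.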

In [12]:

movies.describe(include='object')

##Checking null values

Out[12]:

In [13]:

m=movies.isnull().sum()
miss= (m)/len(movies)*100
miss_data=pd.concat([m,miss],axis=1,keys=['Total','%'])
print(miss_data)

Out[13]:

                    Total     %
Rank                    0   0.0
Title                   0   0.0
Genre                   0   0.0
Description             0   0.0
Director                0   0.0
Actors                  0   0.0
Year                    0   0.0
Runtime (Minutes)       0   0.0
Rating                  0   0.0
Votes                   0   0.0
Revenue (Millions)    128  12.8
Metascore              64   6.4

From the above output we can see that Revenue (Millions) and Metascore are the only columns with null values: 128 (12.8%) and 64 (6.4%) respectively.

Pre Profiling

profile = prof.ProfileReport(movies)
profile.to_file(outputfile="Movies_before_preprocessing.html")

Here, we ran Pandas Profiling before preprocessing the dataset, so the HTML file is named Movies_before_preprocessing.html. Next, we process the data to understand it better.

Preprocessing

Adding new features

In [14]:

new = movies['Genre'].str.split(",", n=2)  # split into at most 3 parts; recent pandas requires n as a keyword
movies['Genre 1']=new.str.get(0)
movies['Genre 2']=new.str.get(1)
movies['Genre 3']=new.str.get(2)
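
As an alternative sketch, the same three columns could be produced in one step with expand=True. The toy frame below is hypothetical and only mimics the comma-separated Genre column; it assumes at least one row has three genres, otherwise fewer than three columns come back and the column renaming would fail.

```python
import pandas as pd

# Hypothetical rows that mimic the comma-separated Genre column
toy = pd.DataFrame({"Genre": ["Action,Adventure,Sci-Fi", "Comedy", "Drama,Romance"]})

# expand=True returns one column per piece; missing pieces become None/NaN
parts = toy["Genre"].str.split(",", n=2, expand=True)
parts.columns = ["Genre 1", "Genre 2", "Genre 3"]
toy = toy.join(parts)

print(toy)
```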

In [15]:

movies['Genre 1'].describe(include='all')

Out[15]:

count       1000
unique        13
top       Action
freq         293
Name: Genre 1, dtype: object

In [16]:

movies['Genre 2'].describe(include='all')

Out[16]:

count       895
unique       19
top       Drama
freq        238
Name: Genre 2, dtype: object

In [17]:

movies['Genre 3'].describe(include='all')

Out[17]:

count          660
unique          18
top       Thriller
freq           133
Name: Genre 3, dtype: object

In [18]:

movies.drop('Genre', axis = 1,inplace = True)

In [19]:

movies.drop_duplicates(subset=None,keep="first",inplace=False)

Out[19]:

In [20]:

movies["Rating"].describe() # Describing a particular field

Out[20]:

count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64

In [21]:

movies['RB']=0
movies.loc[movies['Rating']<=4,'RB']="LOW"
movies.loc[(movies['Rating']>4)&(movies['Rating']<=7),'RB']="MEDIUM"
movies.loc[movies['Rating']>7,'RB']="HIGH"
movies.head(2)

## Create a Field for movie duration

Out[21]:
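
The three .loc assignments above implement a binning rule (up to 4 is LOW, above 4 and up to 7 is MEDIUM, above 7 is HIGH). A minimal sketch of the same rule with pd.cut, on made-up ratings rather than the real column:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.9, 4.0, 4.1, 7.0, 7.1, 9.0], name="Rating")  # toy values

# Right-closed bins: (-inf, 4] -> LOW, (4, 7] -> MEDIUM, (7, inf] -> HIGH
rb = pd.cut(ratings, bins=[-np.inf, 4, 7, np.inf], labels=["LOW", "MEDIUM", "HIGH"])

print(rb.tolist())
```

pd.cut returns an ordered categorical, so the bands sort as LOW, MEDIUM, HIGH instead of alphabetically.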

Treating missing values

In [22]:

# Fill missing Metascore values with 0 (note: this is not the mean)
movies["Metascore"]=movies["Metascore"].fillna(0)

In [23]:

# Fill missing Revenue values with the column median
r=movies["Revenue (Millions)"].median()
movies["Revenue (Millions)"]=movies["Revenue (Millions)"].fillna(r)

In [25]:

movies.isnull().sum()

Out[25]:

Rank                    0
Title                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)      0
Metascore               0
Genre 1                 0
Genre 2               105
Genre 3               340
RB                      0
dtype: int64
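
The two fill strategies used above behave differently. The sketch below, on a made-up score series, shows how filling with 0 pulls the mean down while filling with the median leaves the center of the observed values in place. It is an illustration of the trade-off, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

scores = pd.Series([50.0, 60.0, 70.0, np.nan, np.nan])  # toy values with two gaps

filled_zero = scores.fillna(0)                  # strategy used for Metascore above
filled_median = scores.fillna(scores.median())  # strategy used for Revenue above

print(filled_zero.mean())    # pulled toward 0 by the imputed rows
print(filled_median.mean())  # stays at the center of the observed values
```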

Post Pandas Profiling

profile = prof.ProfileReport(movies)
profile.to_file(outputfile="Movies_after_preprocessing.html")

Questions

Q1) How many movies are made per year?

In [26]:


movies.groupby(['Genre 1'])['Year'].count()

Out[26]:

Genre 1
Action       293
Adventure     75
Animation     49
Biography     64
Comedy       175
Crime         71
Drama        195
Fantasy        4
Horror        46
Mystery       13
Romance        2
Sci-Fi         3
Thriller      10
Name: Year, dtype: int64

In [ ]:

MDF = pd.read_excel('Movies.xlsx')
base = ['Adventure','Crime','Comedy','Sci-Fi','Action']
for var in base:
    MDF[var]= MDF['Genre'].apply(lambda x: 1 if var in str(x) else 0)
MDF['Other'] = MDF.apply(lambda x: 0 if (x['Adventure']+
                                        x['Crime']+
                                        x['Comedy']+
                                        x['Sci-Fi']+
                                        x['Action'])>=1 else 1,axis=1)
check = MDF.groupby(['Year']).agg({'Adventure':sum,
                                       'Crime':sum,
                                      'Sci-Fi':sum,
                                      'Action':sum,
                                      'Comedy':sum,
                                       'Other':sum})
check.reset_index(inplace=True)
check['Total'] = check.apply(lambda x: x['Adventure']+x['Crime']+x['Sci-Fi']+
                                       x['Action']+x['Comedy']+x['Other'],axis=1)
base2 = base+ ['Other']
for var in base2:
    check[var+'_norm']=check.apply(lambda x: float(x[var])/float(x['Total'])*100,axis=1)
base3 = [x+'_norm' for x in base2]
check.plot(kind='bar',x='Year',y=base3,stacked=True)  # x must be a single column label, not a list
plt.show()

In [27]:

movies.groupby(['Year'])['Year'].count()

Out[27]:

Year
2006     44
2007     53
2008     52
2009     51
2010     60
2011     63
2012     64
2013     91
2014     98
2015    127
2016    297
Name: Year, dtype: int64

Observation:

The output above shows that, in general, more movies are made with each passing year.
There is a sharp increase in the number of movies in 2016 compared with 2015.
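
To make the size of that jump concrete, the year-over-year change can be sketched with pct_change. The counts are copied from Out[27]; the year labels are an assumption (2006 to 2016 in order, matching the range given in the problem statement).

```python
import pandas as pd

# Counts copied from Out[27]; the 2006-2016 labels are assumed, in ascending order
counts = pd.Series(
    [44, 53, 52, 51, 60, 63, 64, 91, 98, 127, 297],
    index=range(2006, 2017),
    name="Movies",
)

growth = counts.pct_change().mul(100).round(1)  # percent change versus the previous year

print(growth.tail(3))
```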

In [28]:

sns.set(rc={'figure.figsize':(20,20)})  # set the figure size before drawing so it applies to this plot
sns.countplot(x='Year', data=movies).set_title('Count plot for Movies with passing Years.')
plt.show()

Out[28]:

Q3) What is the Rating that a movie normally gets?

In [29]:

movies.groupby(['RB'])['RB'].count()

Out[29]:

RB
HIGH      399
LOW        12
MEDIUM    589
Name: RB, dtype: int64

In [ ]:


#movies.groupby(['Rating'])['Rating'].count()

In [30]:

sns.set(rc={'figure.figsize':(10,10)})
sns.countplot(x='RB', data=movies).set_title('Count plot for Movies with according to their Rating.')

Out[30]:

Text(0.5, 1.0, 'Count plot for Movies with according to their Rating.')

The Rating data is negatively skewed: most ratings sit at the higher end, with a tail toward low ratings.

In [31]:

movies['Rating'].plot.hist()

Out[31]:

<AxesSubplot:ylabel='Frequency'>

Observation

A large share of movies have ratings in the range 6.1-7.5.
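
Both the skew and the share of ratings inside a range can be checked numerically instead of being read off a histogram. A minimal sketch on made-up ratings (the values are not taken from the dataset):

```python
import pandas as pd

ratings = pd.Series([3.0, 6.0, 6.5, 6.8, 7.0, 7.2, 7.4, 7.5])  # toy values

skewness = ratings.skew()                       # negative => longer tail on the low side
share = ratings.between(6.1, 7.5).mean() * 100  # percent of values inside [6.1, 7.5]

print(f"skew={skewness:.2f}, share in 6.1-7.5 = {share:.1f}%")
```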

Observation

Most directors directed only one movie in the dataset over the 10-year span.
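
No cell above computes the per-director counts, so here is a hedged sketch of how such a check might look with value_counts. The director names are placeholders, not rows from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"Director": ["A", "B", "A", "C", "D", "A"]})  # hypothetical names

per_director = toy["Director"].value_counts()       # movies per director, most prolific first
one_movie_share = (per_director == 1).mean() * 100  # percent of directors with a single movie

print(per_director.head(1))
print(f"{one_movie_share:.0f}% of directors appear exactly once")
```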

In [32]:

sns.boxplot(x="Year", y="Rating", data=movies)

Out[32]:

<AxesSubplot:xlabel='Year', ylabel='Rating'>

Q5) What is the Revenue generated by Movies, does rating affect revenue generation?

In [33]:

movies['Revenue (Millions)'].sort_values().plot.hist()

Out[33]:

<AxesSubplot:ylabel='Frequency'>

Observation

Most movies produce revenue in the range of 0-100 (millions).
Very few movies generate revenue in the range of 380-650 (millions).
The revenue data is positively skewed.

In [34]:

sns.boxplot(x="Year", y="Revenue (Millions)", data=movies)

Out[34]:

<AxesSubplot:xlabel='Year', ylabel='Revenue (Millions)'>

Observation

Most movies with a rating of 6-7 generate revenue of 100-200 (millions).

Q6) Do runtimes of movies affect revenue and ratings?

In [35]:

movies['Runtime (Minutes)'].sort_values().plot.hist()

Out[35]:

<AxesSubplot:ylabel='Frequency'>

Observation:

Very few movies have a runtime of less than 80 mins or more than 165 mins.
Movies generally have a runtime of 90 to 130 mins.

Observation

A large number of movies with a runtime of 95-110 minutes produce revenue of 0-100 million.

The ci parameter of sns.barplot sets the size of the confidence intervals to draw around estimated values. If "sd", bootstrapping is skipped and the standard deviation of the observations is drawn. If None, no bootstrapping is performed and error bars are not drawn.
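
For intuition about what bootstrapping means here, the sketch below builds a 95% percentile bootstrap interval for a mean with NumPy. It is one common way to bootstrap and is not claimed to match seaborn's internal implementation; the runtimes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for repeatability
runtimes = np.array([95, 100, 105, 110, 120, 130, 150], dtype=float)  # toy data

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(runtimes, size=runtimes.size, replace=True).mean()
    for _ in range(2000)
])

low, high = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval

print(f"mean={runtimes.mean():.1f}, 95% CI=({low:.1f}, {high:.1f})")
```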

In [36]:

sns.barplot(x="Year", y="Runtime (Minutes)", data=movies, ci=None)  # pass x and y as keywords
#sns.barplot(x="Year", y="Runtime (Minutes)", data=movies, color="lightblue")

Out[36]:

<AxesSubplot:xlabel='Year', ylabel='Runtime (Minutes)'>

In [37]:

movies.plot.hexbin(x='Rating', y='Runtime (Minutes)', gridsize=10)

Out[37]:

<AxesSubplot:xlabel='Rating', ylabel='Runtime (Minutes)'>

Observation

A large number of movies with runtime 95-125 mins get an average rating of 6-7.5

Q7) Effect of Metascore: are Metascore and rating related?

In [38]:

movies['Metascore'].sort_values().plot.hist()

Out[38]:

<AxesSubplot:ylabel='Frequency'>

Observation

A large number of movies get a Metascore of 55-70.
Around 20 movies get a high Metascore of 90-100.
Apart from the missing values that were filled with 0 earlier, the Metascore distribution looks roughly symmetric.

In [39]:

movies.plot.scatter(x='Revenue (Millions)', y='Rating')

Out[39]:

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

<AxesSubplot:xlabel='Revenue (Millions)', ylabel='Rating'>

In [40]:

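movies.select_dtypes(include='number').corr()  # correlate numeric columns only; text columns cannot be correlated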

Out[40]:

In [41]:


movies.plot.scatter(x='Metascore', y='Rating')

Out[41]:

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

<AxesSubplot:xlabel='Metascore', ylabel='Rating'>

Observation

The scatter plot suggests that Rating and Metascore are positively related.
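
The strength of such a relationship can be summarized with a Pearson correlation coefficient. A minimal sketch on made-up pairs (not the real Metascore and Rating columns):

```python
import pandas as pd

toy = pd.DataFrame({
    "Metascore": [30, 45, 60, 75, 90],
    "Rating": [4.5, 5.8, 6.4, 7.6, 8.1],
})  # made-up values

r = toy["Metascore"].corr(toy["Rating"])  # Pearson correlation by default

print(round(r, 3))
```

On the real data, rows where Metascore was filled with 0 would weaken this number, so it may be worth excluding them before computing it.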

In [46]:

movies["Rating"].plot.hist(color='r', alpha=0.8)

Out[46]:

<AxesSubplot:ylabel='Frequency'>

In [47]:

movies[movies['Genre 1'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6),autopct='%1.1f%%')
plt.axis('equal')
plt.show()

Out[47]:

movies.plot(kind="bar",x='Genre 2',y="Year",stacked=True)

In [48]:

movies[movies['Genre 2'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6),autopct='%1.1f%%')
plt.axis('equal')
plt.show()

Out[48]:

In [50]:

plt.figure(figsize=(5,5))  # create the figure first so the heatmap is drawn on it
# Use annotation to add correlation numbers to the Seaborn heatmap
sns.heatmap(movies.select_dtypes(include='number').corr(), annot=True)

Out[50]:

In [51]:

movies.plot.area(x='Votes',y="Rating")

Out[51]:

<AxesSubplot:xlabel='Votes'>

In [ ]:

movies.plot.scatter(x='Votes', y='Rating')

Observation

Few movies receive a very large number of votes, but those that do tend to have a good rating.
In general, the larger the number of votes, the better the rating.
Votes and Rating are positively correlated.
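
Vote counts are usually heavy-tailed, so a rank-based (Spearman) correlation can be a useful companion to Pearson. The sketch below computes Spearman as the Pearson correlation of the ranks, which avoids any extra dependency; the numbers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Votes": [500, 2_000, 15_000, 90_000, 1_200_000],
    "Rating": [5.1, 6.0, 6.4, 7.3, 8.8],
})  # made-up values

pearson = toy["Votes"].corr(toy["Rating"])
# Spearman correlation = Pearson correlation computed on the ranks
spearman = toy["Votes"].rank().corr(toy["Rating"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

Here the relationship is perfectly monotonic, so the rank correlation is 1 even though the single very large vote count keeps the Pearson value lower.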

Q9) Which genres of movies are made?

In [ ]:

print("The count of different Genres for Genre 1")
movies.groupby(['Genre 1'])['Genre 1'].count().sort_values()

From the above data for Genre 1, we conclude that the top three primary genres are Action, Drama, and Comedy.

In [54]:


movies.plot.line(y="Revenue (Millions)",x="Year" )

Out[54]:

<AxesSubplot:xlabel='Year'>

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 1', data=movies).set_title('Count plot for Genre 1')

In [ ]:

print("The count of different Genres for Genre 2")
movies.groupby(['Genre 2'])['Genre 2'].count().sort_values()

From the above data for Genre 2, we conclude that the top genres are Drama, Adventure, Romance, and Comedy.
Romance, Comedy, Crime, Thriller, Mystery, and Horror are made at similar levels.

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 2', data=movies).set_title('Count plot for Genre 2')

In [ ]:

print("The count of different Genres for Genre 3")
movies.groupby(['Genre 3'])['Genre 3'].count().sort_values()

From the above data for Genre 3, we conclude that the top genres are Thriller, Sci-Fi, Drama, Romance, and Fantasy.

In [ ]:

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 3', data=movies).set_title('Count plot for Genre 3')

Observation

Across the three genre columns, Drama is the most common genre: 513 of the 1000 movies, around 50%, have Drama as one of their genres.
Action movies come next but well behind Drama, with a count of 303 out of 1000 movies.
Comedy movies are more common than Adventure movies, with counts of 279 and 259.
Thriller movies are more common than Crime movies, with counts of 195 and 150.
Crime and Romance movies are made in similar numbers: 150 for Crime and 141 for Romance.
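
Totals like these combine all genre slots. One possible way to compute them directly from a comma-separated Genre column is split, explode, and value_counts. This is a sketch on hypothetical rows, not the cell that produced the numbers above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Genre": ["Action,Drama", "Drama", "Comedy,Drama,Romance", "Action,Comedy"]
})  # hypothetical rows

# One row per (movie, genre) pair, then count how many movies mention each genre
genre_counts = toy["Genre"].str.split(",").explode().str.strip().value_counts()

print(genre_counts)
```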

Q10) Which genre produces best Revenue in years 2015,2016?

In [55]:

sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])  # Year is int64, so compare against integers

Out[55]:

C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 32.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 11.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 15.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 62.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 26.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 5.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

<AxesSubplot:xlabel='Genre 2', ylabel='Revenue (Millions)'>

In [ ]:

sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])  # Year is int64, so compare against integers

In [57]:

sns.swarmplot(x="Genre 3", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])  # Year is int64, so compare against integers

Out[57]:

C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 25.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 24.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 43.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 8.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

<AxesSubplot:xlabel='Genre 3', ylabel='Revenue (Millions)'>

From the above plots, Drama, Adventure, Action, and Comedy movies generated large amounts of revenue in 2015 and 2016.
2016 generated more revenue than 2015.
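
A tabular check can complement the swarm plots. The sketch below filters to 2015 and 2016 and sums revenue per year and genre; the rows are made up and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2014, 2015, 2015, 2016, 2016, 2016],
    "Genre 1": ["Drama", "Action", "Drama", "Action", "Action", "Comedy"],
    "Revenue (Millions)": [50.0, 200.0, 80.0, 300.0, 150.0, 60.0],
})  # made-up rows

recent = toy[toy["Year"].isin([2015, 2016])]  # integers, matching an int64 Year column
by_genre = (
    recent.groupby(["Year", "Genre 1"])["Revenue (Millions)"]
    .sum()
    .unstack(fill_value=0)  # years as rows, genres as columns
)
per_year = by_genre.sum(axis=1)  # total revenue per year

print(by_genre)
print(per_year)
```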

Conclusions

The Movies dataset suggests that more movies are made year after year, with a sharp increase in 2016 compared with 2015.
128 movies released in 2006-2016 have no revenue value recorded. These are missing values, not movies with zero revenue.
The most common rating is 7.1, and only one movie has a rating of 9.
A large share of movies have ratings in the range 6.1-7.5.
Director Ridley Scott directed the most movies, 8 in the period 2006-2016. Most directors directed only one movie in that span.
Most movies produce revenue in the range of 0-100 (millions). Very few movies generate revenue in the range of 380-650 (millions).
Movies generally have a runtime of 90 to 130 mins. A large number of movies with a runtime of 95-125 mins get a rating of 6-7.5.
Movies that receive a large number of votes tend to have good ratings, so Votes and Rating are positively correlated.
Drama is the most common genre: 513 of the 1000 movies, around 50%, have Drama as one of their genres.
Drama, Adventure, Action, and Comedy movies generated large amounts of revenue in 2015 and 2016. Most of the movies contributing to each year's revenue earned less than 100 million.
