suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analysis using Python/EDA Movies.ipynb
Kernel: Python 3

Problem Statement

image.png

  • The dataset consists of survey data on movies released between 2006 and 2016.

  • Variables in the dataset include genre, rating, votes, revenue, etc.

  • The dataset comprises 1000 observations of 12 columns. The table below lists each column and its description.

| Column Name | Description |
| ------------- |:------------- |
| Rank | Rank of the movie |
| Title | Title of the movie |
| Genre | Genre(s) the movie belongs to |
| Description | Description of the movie, which gives its background |
| Director | Name of the director of the movie |
| Actors | Actors in the movie |
| Year | Year in which the movie was released |
| Runtime (Minutes) | Duration of the movie in minutes |
| Rating | Rating of the movie |
| Votes | Votes given to the movie |
| Revenue (Millions) | Revenue made by the movie in millions |
| Metascore | Score of the movie on the Metacritic website |

Importing Packages

import numpy as np               # Implements multi-dimensional arrays and matrices
import pandas as pd              # For data manipulation and analysis
#import pandas_profiling as prof
import matplotlib.pyplot as plt  # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns            # Provides a high-level interface for drawing attractive and informative statistical graphics
# Display the output of plotting commands inline within frontends like the Jupyter notebook.
# This comment must sit on its own line: text after a line magic is passed to the magic as arguments.
%matplotlib inline
sns.set()                        # Switch to seaborn defaults
from subprocess import check_output  # To run new applications or programs from Python by creating new processes
import warnings
warnings.filterwarnings('ignore')
1- Describe the data
2- Check for null values
3- Replace null values
4- Drop columns
5- Which industry has the highest average revenue?
6- What is the count according to industry?
7- What is the distribution of revenue, growth and expense in 2005?

Importing the Movies Dataset

movies=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\movies.xlsx").parse("Sheet1")

company=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\CompanyDetails.xlsx").parse("Overview")
company
movies=pd.read_excel("movies.xlsx")
## Movie trend: Rank, drop Title, Genre, drop description, drop director, drop actor, year, runtime(duration), rating(), drop vote, revenue, metascore
movies.shape
(1000, 12)
  • The Movies dataset has 1000 observations and 12 columns

#movies['Year'] = movies['Year'].astype(str)
movies.columns
  • These are the columns present in the dataset

movies.head(2)
movies.tail(1)
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Rank                1000 non-null   int64
 1   Title               1000 non-null   object
 2   Genre               1000 non-null   object
 3   Description         1000 non-null   object
 4   Director            1000 non-null   object
 5   Actors              1000 non-null   object
 6   Year                1000 non-null   int64
 7   Runtime (Minutes)   1000 non-null   int64
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
movies.describe()
  • The revenue column has a high standard deviation, which indicates a large spread in the revenue generated by the movies

movies.describe(include='object')
## Checking null values
m=movies.isnull().sum()
miss= (m)/len(movies)*100
miss_data=pd.concat([m,miss],axis=1,keys=['Total','%'])
print(miss_data)
                    Total     %
Rank                    0   0.0
Title                   0   0.0
Genre                   0   0.0
Description             0   0.0
Director                0   0.0
Actors                  0   0.0
Year                    0   0.0
Runtime (Minutes)       0   0.0
Rating                  0   0.0
Votes                   0   0.0
Revenue (Millions)    128  12.8
Metascore              64   6.4
  • From the above output we can see that Revenue (Millions) and Metascore are the only columns with null values: 128 (12.8%) and 64 (6.4%) respectively
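
The null-value check above can be wrapped in a small reusable helper. This is a minimal sketch on a made-up toy frame; the function name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    total = df.isnull().sum()
    percent = total / len(df) * 100
    summary = pd.concat([total, percent], axis=1, keys=["Total", "%"])
    # Keep only the columns that actually have gaps, largest first.
    return summary[summary["Total"] > 0].sort_values("Total", ascending=False)

# Hypothetical toy frame standing in for the movies data.
toy = pd.DataFrame({
    "Rating": [7.1, 6.5, 8.0, 5.9],
    "Revenue (Millions)": [100.0, np.nan, np.nan, 45.2],
    "Metascore": [66.0, 70.0, np.nan, 51.0],
})
summary = missing_summary(toy)
print(summary)
```

Filtering out the complete columns keeps the report short when a frame has many columns and only a few of them contain gaps.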

Pre Profiling

profile = prof.ProfileReport(movies)
profile.to_file(outputfile="Movies_before_preprocessing.html")

Here, we run Pandas Profiling before preprocessing the dataset, so the HTML file is named Movies_before_preprocessing.html. Next we process the data to understand it better.

Preprocessing

Adding new features

# Split the comma-separated Genre string into at most three parts.
# n is passed by keyword because recent pandas versions accept only pat positionally.
new = movies['Genre'].str.split(",", n=2)
movies['Genre 1']=new.str.get(0)
movies['Genre 2']=new.str.get(1)
movies['Genre 3']=new.str.get(2)
movies['Genre 1'].describe(include='all')
count       1000
unique        13
top       Action
freq         293
Name: Genre 1, dtype: object
movies['Genre 2'].describe(include='all')
count       895
unique       19
top       Drama
freq        238
Name: Genre 2, dtype: object
movies['Genre 3'].describe(include='all')
count          660
unique          18
top       Thriller
freq           133
Name: Genre 3, dtype: object
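
As an alternative to calling `.str.get()` three times, `str.split` can return the parts as columns directly via `expand=True`. The sketch below uses a made-up three-row frame. It assumes at least one row has three genres, otherwise fewer than three columns come back and the rename would fail.

```python
import pandas as pd

# Hypothetical toy data with one, two and three genres per movie.
toy = pd.DataFrame({"Genre": ["Action,Adventure,Sci-Fi", "Comedy", "Drama,Romance"]})

# expand=True returns a DataFrame with one column per split part;
# rows with fewer parts are padded with missing values.
parts = toy["Genre"].str.split(",", n=2, expand=True)
parts.columns = ["Genre 1", "Genre 2", "Genre 3"]
toy = toy.join(parts)
print(toy)
```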
movies.drop('Genre', axis = 1,inplace = True)
# drop_duplicates returns a new DataFrame, so assign the result back to keep it
movies = movies.drop_duplicates(subset=None,keep="first")
movies["Rating"].describe() # Describing a particular field
count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64
movies['RB']=0
movies.loc[movies['Rating']<=4,'RB']="LOW"
movies.loc[(movies['Rating']>4)&(movies['Rating']<=7),'RB']="MEDIUM"
movies.loc[movies['Rating']>7,'RB']="HIGH"
movies.head(2)
## Create a Field for movie duration
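
The same three rating bands can also be produced in one step with `pd.cut`. This is a sketch on made-up ratings; the bin edges 0, 4, 7 and 10 mirror the thresholds used above, with 0 and 10 as assumed outer bounds.

```python
import pandas as pd

# Hypothetical ratings chosen to sit on and around the band boundaries.
ratings = pd.Series([1.9, 4.0, 4.1, 7.0, 7.1, 9.0], name="Rating")

# Intervals are closed on the right by default: (0, 4], (4, 7], (7, 10].
bands = pd.cut(ratings, bins=[0, 4, 7, 10], labels=["LOW", "MEDIUM", "HIGH"])
print(bands.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the LOW, MEDIUM, HIGH order instead of sorting alphabetically.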

Treating missing values

# Fill missing Metascore values with 0 (a placeholder value, not the mean)
movies["Metascore"]=movies["Metascore"].fillna(0)
# Fill missing Revenue values with the median
r=movies["Revenue (Millions)"].median()
movies["Revenue (Millions)"]=movies["Revenue (Millions)"].fillna(r)
movies.isnull().sum()
Rank                    0
Title                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)      0
Metascore               0
Genre 1                 0
Genre 2               105
Genre 3               340
RB                      0
dtype: int64
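
Once missing values are filled, the imputed rows can no longer be told apart from real observations. One option, sketched here on made-up numbers, is to record a flag before filling. The column name `Revenue missing` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

col = "Revenue (Millions)"
# Hypothetical toy data with two gaps.
toy = pd.DataFrame({col: [10.0, np.nan, 30.0, 200.0, np.nan]})

# Remember which rows were missing before they are overwritten.
toy["Revenue missing"] = toy[col].isnull()

# median() skips missing values, so this is the median of 10, 30 and 200.
median = toy[col].median()
toy[col] = toy[col].fillna(median)
print(toy)
```

The flag makes it possible to exclude imputed rows later, for example when plotting the revenue distribution.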

Post Pandas Profiling

profile = prof.ProfileReport(movies)
profile.to_file(outputfile="Movies_after_preprocessing.html")

Questions

Q1) Movies made on year basis?

movies.groupby(['Genre 1'])['Year'].count()
Genre 1
Action       293
Adventure     75
Animation     49
Biography     64
Comedy       175
Crime         71
Drama        195
Fantasy        4
Horror        46
Mystery       13
Romance        2
Sci-Fi         3
Thriller      10
Name: Year, dtype: int64
MDF = pd.read_excel('Movies.xlsx')
base = ['Adventure','Crime','Comedy','Sci-Fi','Action']
# One 0/1 indicator column per base genre
for var in base:
    MDF[var]= MDF['Genre'].apply(lambda x: 1 if var in str(x) else 0)
# 'Other' marks movies that belong to none of the base genres
MDF['Other'] = MDF.apply(lambda x: 0 if (x['Adventure']+ x['Crime']+ x['Comedy']+ x['Sci-Fi']+ x['Action'])>=1 else 1,axis=1)
check = MDF.groupby(['Year']).agg({'Adventure':sum, 'Crime':sum, 'Sci-Fi':sum, 'Action':sum, 'Comedy':sum, 'Other':sum})
check.reset_index(inplace=True)
check['Total'] = check.apply(lambda x: x['Adventure']+x['Crime']+x['Sci-Fi']+ x['Action']+x['Comedy']+x['Other'],axis=1)
base2 = base+ ['Other']
# Express each genre as a percentage of that year's total
for var in base2:
    check[var+'_norm']=check.apply(lambda x: float(x[var])/float(x['Total'])*100,axis=1)
base3 = [x+'_norm' for x in base2]
check.plot(kind='bar',x='Year',y=base3,stacked=True)
plt.show()
movies.groupby(['Year'])['Year'].count()
Year
2006     44
2007     53
2008     52
2009     51
2010     60
2011     63
2012     64
2013     91
2014     98
2015    127
2016    297
Name: Year, dtype: int64

Observation:

  • From the above we can see that the number of movies in the dataset grows over the years, apart from slight dips in 2008 and 2009

  • There is a sudden increase in 2016, with 297 movies compared to 127 in 2015
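
The size of the year-on-year change can be quantified with `pct_change`. The sketch below re-enters the per-year counts printed above by hand rather than reading the Excel file, so it runs on its own.

```python
import pandas as pd

# Per-year movie counts copied from the groupby output above.
counts = pd.Series(
    [44, 53, 52, 51, 60, 63, 64, 91, 98, 127, 297],
    index=range(2006, 2017),
    name="movies",
)

# Percentage change relative to the previous year (the first year has no predecessor).
pct_change = counts.pct_change() * 100
print(pct_change.round(1))
```

On these counts the 2016 figure is roughly 134% above 2015, while 2008 and 2009 show small negative changes.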

sns.set(rc={'figure.figsize':(20,20)})  # set the figure size before plotting so that it applies to this plot
sns.countplot(x='Year', data=movies).set_title('Count plot for Movies with passing Years.')
plt.show()
Image in a Jupyter notebook

Q3) What is the Rating that a movie normally gets?

movies.groupby(['RB'])['RB'].count()
RB
HIGH      399
LOW        12
MEDIUM    589
Name: RB, dtype: int64
#movies.groupby(['Rating'])['Rating'].count()
sns.set(rc={'figure.figsize':(10,10)})
sns.countplot(x='RB', data=movies).set_title('Count plot for Movies according to their Rating.')
Text(0.5, 1.0, 'Count plot for Movies according to their Rating.')
Image in a Jupyter notebook
  • The data is negatively skewed
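
Skewness can be checked numerically as well as visually. This is a minimal sketch on made-up ratings with a long tail toward low values; the numbers are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical ratings: most values cluster near 7, one sits far below.
ratings = pd.Series([2.0, 5.5, 6.5, 6.8, 7.0, 7.2, 7.4, 7.5], name="Rating")

# Series.skew() returns the sample skewness; a negative value means
# the longer tail is on the left (low ratings).
skewness = ratings.skew()
print(f"skewness = {skewness:.2f}")
print(f"mean = {ratings.mean():.2f}, median = {ratings.median():.2f}")
```

For a negatively skewed sample like this one the mean falls below the median, which is a quick sanity check on the sign.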

movies['Rating'].plot.hist()
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook

Observation

  • A large number of movies get ratings in the range 6.1-7.5

Observation

  • Most directors directed only one movie in the dataset over the 10-year span

sns.boxplot(x="Year", y="Rating", data=movies)
<AxesSubplot:xlabel='Year', ylabel='Rating'>
Image in a Jupyter notebook

Q5) What is the Revenue generated by Movies, does rating affect revenue generation?

movies['Revenue (Millions)'].sort_values().plot.hist()
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook
Observation
  • Most movies produce a revenue in the range of 0-100 (Millions)

  • Very few movies generate a revenue in the range of 380-650 (Millions)

  • The revenue distribution is positively skewed

sns.boxplot(x="Year", y="Revenue (Millions)", data=movies)
<AxesSubplot:xlabel='Year', ylabel='Revenue (Millions)'>
Image in a Jupyter notebook

Observation

  • Most movies with a rating of 6-7 generate a revenue of 100-200 (Millions)

Q6) Runtimes of movies affect revenue and ratings?

movies['Runtime (Minutes)'].sort_values().plot.hist()
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook

Observation:

  • Very few movies have a runtime of less than 80 mins and more than 165 mins

  • Generally Movies have a runtime of 90 mins to 130 mins

Observation

  • A large number of movies with a runtime of 95-110 minutes produce a revenue of 0-100 million

The ci parameter of sns.barplot sets the size of the confidence intervals to draw around estimated values. If “sd”, bootstrapping is skipped and the standard deviation of the observations is drawn. If None, no bootstrapping is performed and error bars are not drawn.
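
To make the bootstrapping idea concrete, the sketch below computes a percentile bootstrap confidence interval for a mean with NumPy. It illustrates the general technique only and is not seaborn's actual implementation; the function name `bootstrap_ci` and the runtimes are made up.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower = (100 - level) / 2
    return np.percentile(means, [lower, 100 - lower])

# Hypothetical runtimes in minutes.
runtimes = [95, 102, 110, 118, 125, 131, 140]
low, high = bootstrap_ci(runtimes)
print(f"95% CI for the mean runtime: {low:.1f} to {high:.1f} minutes")
```

Fixing the seed makes the interval reproducible between runs; a larger `n_boot` gives a smoother estimate at the cost of more computation.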

sns.barplot(x="Year", y="Runtime (Minutes)", data=movies, ci=None)
#sns.barplot(x="Year", y="Runtime (Minutes)", data=movies, color="lightblue")
<AxesSubplot:xlabel='Year', ylabel='Runtime (Minutes)'>
Image in a Jupyter notebook
movies.plot.hexbin(x='Rating', y='Runtime (Minutes)', gridsize=10)
<AxesSubplot:xlabel='Rating', ylabel='Runtime (Minutes)'>
Image in a Jupyter notebook

Observation

  • A large number of movies with runtime 95-125 mins get an average rating of 6-7.5

Q7) Effect of Metascore, are metascore and rating related ?

movies['Metascore'].sort_values().plot.hist()
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook

Observation

  • A large number of movies get a metascore of 55-70

  • Around 20 movies get a good metascore of 90-100

  • The metascore distribution is roughly symmetric and bell-shaped, apart from the movies whose missing Metascore was filled with 0

movies.plot.scatter(x='Revenue (Millions)', y='Rating')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
<AxesSubplot:xlabel='Revenue (Millions)', ylabel='Rating'>
Image in a Jupyter notebook
movies.select_dtypes(include='number').corr()  # correlations are only defined for the numeric columns
movies.plot.scatter(x='Metascore', y='Rating')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
<AxesSubplot:xlabel='Metascore', ylabel='Rating'>
Image in a Jupyter notebook

Observation

  • The scatter plot suggests that Rating and Metascore are positively correlated
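
The relationship seen in a scatter plot can be summarised with a Pearson correlation coefficient. This is a minimal sketch on made-up values, not the notebook's data.

```python
import pandas as pd

# Hypothetical scores: ratings rise steadily with metascore.
toy = pd.DataFrame({
    "Metascore": [40, 55, 60, 72, 85],
    "Rating": [5.2, 6.1, 6.4, 7.0, 7.9],
})

# Series.corr computes the Pearson correlation by default.
r = toy["Metascore"].corr(toy["Rating"])
print(f"Pearson r = {r:.3f}")
```

A value near 1 indicates a strong positive linear relationship, near 0 indicates none, and near -1 a strong negative one. When applying this to the movies data, rows whose Metascore was filled with 0 would pull the coefficient away from its true value.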

movies["Rating"].plot.hist(color='r', alpha=0.8)
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook
movies[movies['Genre 1'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6),autopct='%1.1f%%')
plt.axis('equal')
plt.show()
Image in a Jupyter notebook

movies.plot(kind="bar",x='Genre 2',y="Year",stacked=True)

movies[movies['Genre 2'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6),autopct='%1.1f%%')
plt.axis('equal')
plt.show()
Image in a Jupyter notebook
plt.figure(figsize=(5,5))  # create the figure first so that the heatmap is drawn on it
sns.heatmap(movies.select_dtypes(include='number').corr(),annot=True) # Use annotation to add correlation numbers to the Seaborn heatmap
Image in a Jupyter notebook
movies.plot.area(x='Votes',y="Rating")
<AxesSubplot:xlabel='Votes'>
Image in a Jupyter notebook
movies.plot.scatter(x='Votes', y='Rating')
Observation
  • Few movies receive a very large number of votes, but those that do tend to have a good rating

  • In general, the larger the number of votes, the higher the rating

  • There is a positive correlation between Votes and Rating

Q9) What genres of movies are made?

print("The count of different Genres for Genre 1")
movies.groupby(['Genre 1'])['Genre 1'].count().sort_values()
  • From the above data of Genre 1 we conclude that the three most common primary genres are Action, Drama, and Comedy

movies.plot.line(y="Revenue (Millions)",x="Year" )
<AxesSubplot:xlabel='Year'>
Image in a Jupyter notebook
sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 1', data=movies).set_title('Count plot for Genre 1')
print("The count of different Genres for Genre 2")
movies.groupby(['Genre 2'])['Genre 2'].count().sort_values()
  • From the above data of Genre 2 we conclude that the most common second genres are Drama, Adventure, Romance, and Comedy

  • Romance, Comedy, Crime, Thriller, Mystery, and Horror appear in similar numbers

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 2', data=movies).set_title('Count plot for Genre 2')
print("The count of different Genres for Genre 3")
movies.groupby(['Genre 3'])['Genre 3'].count().sort_values()
  • From the above data of Genre 3 we conclude that the most common third genres are Thriller, Sci-Fi, Drama, Romance, and Fantasy

sns.set(rc={'figure.figsize':(14,12)})
sns.countplot(x='Genre 3', data=movies).set_title('Count plot for Genre 3')

Observation

  • Across all three genre columns, Drama is the most common genre: 513 of the 1000 movies, or around 50%, list Drama as one of their genres

  • Action is the next most common genre, with a count of 303 out of 1000 movies

  • Comedy movies are more common than Adventure movies, with counts of 279 and 259 respectively

  • Thriller movies are more common than Crime movies, with counts of 195 and 150 respectively

  • Crime and Romance movies are made in nearly equal numbers: 150 for Crime and 141 for Romance
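
Totals that count a genre regardless of its position can be computed from the original comma-separated `Genre` column by splitting and exploding it. This is a sketch on a made-up four-row frame. It assumes the `Genre` column is still available, which in this notebook means before it is dropped or after reloading the file.

```python
import pandas as pd

# Hypothetical toy data in the same comma-separated format.
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Drama", "Comedy,Drama", "Action,Drama,Thriller"]
})

# Split each string into a list, then give every genre its own row.
genre_counts = toy["Genre"].str.split(",").explode().str.strip().value_counts()

# Share of movies that list each genre (a movie can count toward several genres).
share = genre_counts / len(toy) * 100
print(genre_counts)
print(share)
```

Because a movie can carry up to three genres, the shares add up to more than 100%.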

Q10) Which genre produces best Revenue in years 2015,2016?

# Year is stored as int64, so filter with integers rather than strings
sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 32.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 11.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 15.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 62.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 26.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 5.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning)
<AxesSubplot:xlabel='Genre 2', ylabel='Revenue (Millions)'>
Image in a Jupyter notebook
# Year is stored as int64, so filter with integers rather than strings
sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])
sns.swarmplot(x="Genre 3", y="Revenue (Millions)", hue="Year",data=movies.loc[movies['Year'].isin([2015, 2016])])
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 25.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 24.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 43.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 8.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning)
<AxesSubplot:xlabel='Genre 3', ylabel='Revenue (Millions)'>
Image in a Jupyter notebook
  • From the above plots it is observed that Drama, Adventure, Action, and Comedy movies generated a large amount of revenue in 2015 and 2016

  • 2016 generated more revenue than 2015
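
A swarm plot shows individual points, so a numeric summary can help when comparing genres. The sketch below groups a made-up frame by genre and year; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the 2014 row should be filtered out.
toy = pd.DataFrame({
    "Genre 1": ["Action", "Action", "Drama", "Drama", "Comedy", "Action"],
    "Year": [2015, 2016, 2015, 2016, 2016, 2014],
    "Revenue (Millions)": [300.0, 500.0, 40.0, 60.0, 120.0, 999.0],
})

# Year is numeric here, so the filter uses integers.
recent = toy[toy["Year"].isin([2015, 2016])]

# Count, total and median revenue per genre and year.
by_genre = recent.groupby(["Genre 1", "Year"])["Revenue (Millions)"].agg(["count", "sum", "median"])
print(by_genre)
```

Reporting the median alongside the total separates genres that earn a lot per movie from genres that simply have many movies.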

Conclusions

  • The Movies dataset shows that the number of movies grows year by year, with a sudden increase in 2016 compared to 2015

  • 128 movies released between 2006 and 2016 have no recorded revenue; these are missing values and do not by themselves show that the movies generated no revenue

  • The most common rating is 7.1, and only one movie has a rating of 9

  • A large number of movies get ratings in the range 6.1-7.5

  • Director Ridley Scott directed the most movies, 8 in the period 2006-2016; most directors directed only one movie in the 10-year span

  • Most movies produce a revenue in the range of 0-100 (Millions); very few movies generate a revenue in the range of 380-650 (Millions)

  • Movies generally have a runtime of 90 to 130 mins. A large number of movies with a runtime of 95-125 mins get a rating of 6-7.5

  • Movies with a large number of votes tend to have a good rating: in general, the larger the number of votes, the higher the rating. There is a positive correlation between Votes and Rating

  • Drama is the most common genre: 513 of the 1000 movies, or around 50%, list Drama as one of their genres

  • Drama, Adventure, Action, and Comedy movies generated a large amount of revenue in 2015 and 2016, although most movies released in those years earned less than 100 million each