GitHub Repository: suyashi29/python-su
Path: blob/master/Python for Data Science/EDA Movies.ipynb
Kernel: Python 3 (ipykernel)

Problem Statement

  • The dataset consists of survey data on movies released in the years 2006-2016.

  • The variables in the dataset include genre, ratings, votes, revenue, etc.

  • The dataset comprises 1000 observations of 12 columns. The table below lists the column names and their descriptions.

| Column Name | Description |
| ------------- |:------------- |
| Rank | Rank of the movie |
| Title | Title of the movie |
| Genre | Genre(s) the movie belongs to |
| Description | Description of the movie, which gives its background |
| Director | Name of the director of the movie |
| Actors | Actors of the movie |
| Year | Year in which the movie was released |
| Runtime (Minutes) | Duration of the movie in minutes |
| Rating | Rating of the movie |
| Votes | Votes given to the movie |
| Revenue (Millions) | Revenue made by the movie in millions |
| Metascore | Score of the movie on the Metacritic website |

Importing Packages

1. Read data
2. Check shape of data
3. Check for null values
4. Replace null values
5. Describe your data
6. Drop data not acceptable for analysis
7. Remove duplicates
8. What is the count of movies by year?
##pip install --proxy http://uername:[email protected]:8000 sweetviz
import numpy as np  # Implements multi-dimensional arrays and matrices
import pandas as pd  # For data manipulation and analysis
import pandas_profiling as prof
import matplotlib.pyplot as plt  # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns  # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()  # Switch to seaborn defaults
from subprocess import check_output  # Runs other programs from Python code by creating new processes
import warnings
warnings.filterwarnings('ignore')

Importing the Movies Dataset

movies=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\movies.xlsx").parse("Sheet1")

#movies=pd.read_csv("https://raw.githubusercontent.com/suyashi29/python-su/master/Data%20Analysis%20using%20Python/imdb.csv")
movies=pd.read_excel("movies.xlsx")
## Movie trend: Rank, drop Title, Genre, drop Description, drop Director, drop Actors, Year, Runtime (duration), Rating, drop Votes, Revenue, Metascore
movies.shape
  • The Movies dataset has 1000 observations and 12 columns

#movies['Year'] = movies['Year'].astype(str)
movies.columns
  • These are the columns present in the dataset

movies.head(2)
movies.tail(1)
movies.info()
movies.describe()
  • The summary shows that revenue has a high standard deviation, so revenue generation varies widely across movies

movies.describe(include='object')
## Checking null values
m = movies.isnull().sum()
miss = (movies.isnull().sum() / len(movies)) * 100
miss_data = pd.concat([m, miss], axis=1, keys=['Total', '%'])
print(miss_data)
  • From the above output we can see that the Revenue (Millions) and Metascore columns contain the most null values

Pre Profiling

profile = prof.ProfileReport(movies)
# The file name is passed positionally because the keyword name differs between pandas_profiling versions
profile.to_file("Movies_before_preprocessing.html")

Here, we ran Pandas Profiling before preprocessing the dataset, so the HTML file is named Movies_before_preprocessing.html. Next, we preprocess the data to understand it better.

Preprocessing

Adding new features

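# Split the comma-separated Genre string into at most three parts (n must be passed by keyword in recent pandas)
new = movies['Genre'].str.split(",", n=2)
movies['Genre 1'] = new.str.get(0)
movies['Genre 2'] = new.str.get(1)
movies['Genre 3'] = new.str.get(2)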
movies['Genre 1'].describe(include='all')
movies['Genre 2'].describe(include='all')
movies['Genre 3'].describe(include='all')
movies.drop('Genre', axis = 1,inplace = True)
movies.drop('Description', axis = 1,inplace = True)
# Assign the result back; with inplace=False the de-duplicated frame is otherwise discarded
movies = movies.drop_duplicates(subset=None, keep="first")
movies["Rating"].describe() # Describing a particular field
movies['RB'] = 0
movies.loc[movies['Rating'] <= 5, 'RB'] = "LOW"
movies.loc[(movies['Rating'] > 5) & (movies['Rating'] <= 7), 'RB'] = "MEDIUM"
movies.loc[movies['Rating'] > 7, 'RB'] = "HIGH"
movies.head(2)
## Create a Field for movie duration
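
The same three rating bands can also be produced in a single step with `pd.cut`. The sketch below is only an illustration: it uses a small made-up `Rating` column (the values are not taken from the dataset) and reuses the LOW/MEDIUM/HIGH cut-offs from the cell above.

```python
import pandas as pd

# Hypothetical ratings, chosen only to exercise each band
toy = pd.DataFrame({"Rating": [4.2, 5.0, 6.3, 7.0, 7.1, 8.8]})

# Bins are right-inclusive: (-inf, 5] -> LOW, (5, 7] -> MEDIUM, (7, inf] -> HIGH
toy["RB"] = pd.cut(
    toy["Rating"],
    bins=[float("-inf"), 5, 7, float("inf")],
    labels=["LOW", "MEDIUM", "HIGH"],
)
print(toy)
```

`pd.cut` returns an ordered categorical, so plots and group-bys list the bands in LOW, MEDIUM, HIGH order rather than alphabetically.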

Treating missing values

# Fill missing Metascore values with 0
movies["Metascore"] = movies["Metascore"].fillna(0)
# Fill missing Revenue (Millions) values with the median
r = movies["Revenue (Millions)"].median()
movies["Revenue (Millions)"] = movies["Revenue (Millions)"].fillna(r)
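
The choice of fill value changes the summary statistics that follow. A minimal sketch on a made-up frame (the column names match the dataset, the values are invented) compares filling with 0 against filling with the median:

```python
import numpy as np
import pandas as pd

# Invented values with a few gaps
toy = pd.DataFrame({
    "Revenue (Millions)": [10.0, np.nan, 50.0, 300.0, np.nan],
    "Metascore": [60.0, 70.0, np.nan, 80.0, 50.0],
})

# Median of the observed values (NaN is skipped by default)
rev_median = toy["Revenue (Millions)"].median()
toy["Revenue (Millions)"] = toy["Revenue (Millions)"].fillna(rev_median)

# Two fill strategies for Metascore, kept side by side for comparison
meta_zero = toy["Metascore"].fillna(0)
meta_median = toy["Metascore"].fillna(toy["Metascore"].median())

print(toy["Revenue (Millions)"].tolist())
print(meta_zero.mean(), meta_median.mean())
```

In this toy data, filling with 0 pulls the Metascore mean below every observed value's neighbourhood, while the median fill leaves the centre of the distribution where it was. Which behaviour is appropriate depends on how the column is used later.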

Post Pandas Profiling

profile = prof.ProfileReport(movies)
# The file name is passed positionally because the keyword name differs between pandas_profiling versions
profile.to_file("Movies_after_preprocessing.html")

Questions

Q1) How many movies are made each year?

movies.groupby(['Genre 1'])['Year'].count()
MDF = pd.read_excel('Movies.xlsx')
base = ['Adventure', 'Crime', 'Comedy', 'Sci-Fi', 'Action']
# Flag each movie with 1 if the genre name appears in its Genre string
for var in base:
    MDF[var] = MDF['Genre'].apply(lambda x: 1 if var in str(x) else 0)
# 'Other' marks movies that match none of the base genres
MDF['Other'] = MDF.apply(lambda x: 0 if (x['Adventure'] + x['Crime'] + x['Comedy'] + x['Sci-Fi'] + x['Action']) >= 1 else 1, axis=1)
check = MDF.groupby(['Year']).agg({'Adventure': sum, 'Crime': sum, 'Sci-Fi': sum, 'Action': sum, 'Comedy': sum, 'Other': sum})
check.reset_index(inplace=True)
check['Total'] = check.apply(lambda x: x['Adventure'] + x['Crime'] + x['Sci-Fi'] + x['Action'] + x['Comedy'] + x['Other'], axis=1)
base2 = base + ['Other']
# Convert each genre count to a percentage of the yearly total
for var in base2:
    check[var + '_norm'] = check.apply(lambda x: float(x[var]) / float(x['Total']) * 100, axis=1)
base3 = [x + '_norm' for x in base2]
# x must be a single column label, not a list
check.plot(kind='bar', x='Year', y=base3, stacked=True)
plt.show()
movies.groupby(['Year'])['Year'].count()

Observation:

  • The counts above show that more movies are made with each passing year

  • There is a sharp increase in the number of movies made in 2016 compared with 2015
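
To put a number on the year-over-year growth, the yearly counts can be passed through `pct_change`. The sketch below uses invented counts for three years, not the real figures from `movies`; in the notebook the input would be `movies["Year"]`.

```python
import pandas as pd

# Hypothetical release years standing in for movies["Year"]
years = pd.Series([2014] * 2 + [2015] * 4 + [2016] * 10)

# Movies per year, sorted chronologically
per_year = years.value_counts().sort_index()

# Percentage change relative to the previous year (the first year has no predecessor)
growth = per_year.pct_change() * 100

print(pd.DataFrame({"count": per_year, "growth_%": growth}))
```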

sns.countplot(x='Year', data=movies).set_title('Count plot for Movies with passing Years.')
sns.set(rc={'figure.figsize': (25, 30)})
plt.show()

Q3) What is the Rating that a movie normally gets?

movies.groupby(['RB'])['RB'].count()
#movies.groupby(['Rating'])['Rating'].count()
sns.set(rc={'figure.figsize': (10, 10)})
sns.countplot(x='RB', data=movies).set_title('Count plot for Movies according to their Rating.')
  • The data is negatively skewed

movies['Rating'].plot.hist()

Observation

  • A large share of movies receive ratings in the range 6.1-7.5

Observation

  • Most directors direct only one movie in the 10-year span
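
A claim like this can be checked by counting movies per director and then counting how many directors share each total. The sketch below assumes a `Director` column as in the dataset, but the names and rows are placeholders.

```python
import pandas as pd

# Hypothetical directors; "A" to "D" are placeholder names
toy = pd.DataFrame({"Director": ["A", "B", "A", "C", "D", "A"]})

# Number of movies per director, most prolific first
per_director = toy["Director"].value_counts()

# How many directors made exactly 1, 2, 3, ... movies
distribution = per_director.value_counts().sort_index()

print(per_director.head(3))
print(distribution)
```

The first row of `distribution` is the number of directors with a single movie, which is the figure the observation relies on.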

sns.boxplot(x="Year", y="Rating", data=movies)

Q5) What revenue do movies generate, and does rating affect revenue generation?

movies['Revenue (Millions)'].sort_values().plot.hist()
Observation
  • Most movies produce revenue in the range of 0-100 (Millions)

  • Very few movies generate revenue in the range of 380-650 (Millions)

  • The data is positively skewed
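
Skew can be read off a histogram, but it can also be measured. A small sketch with invented revenue values (not the dataset's) uses `Series.skew` and the mean/median comparison:

```python
import pandas as pd

# Invented revenue values with a long right tail
revenue = pd.Series([5, 10, 20, 30, 40, 60, 90, 400, 650], dtype=float)

# Sample skewness: above 0 indicates a right (positive) skew, below 0 a left (negative) skew
skew = revenue.skew()

# In a right-skewed distribution the mean is pulled above the median
print(round(skew, 2), revenue.mean(), revenue.median())
```

The same call on `movies['Rating']` would be expected to return a negative value if the ratings are negatively skewed, as noted earlier.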

sns.boxplot(x="RB", y="Revenue (Millions)", data=movies)

Observation

  • Most movies with a rating of 6-7 generate revenue of 100-200 (Millions)

Q6) Does the runtime of a movie affect revenue and ratings?

movies['Runtime (Minutes)'].sort_values().plot.hist()

Observation:

  • Very few movies have a runtime of less than 80 mins or more than 165 mins

  • Most movies have a runtime of 90 to 130 mins

Observation

  • A large number of movies with a runtime of 95-110 minutes produce revenue of 0-100 million

The ci parameter of sns.barplot sets the size of the confidence intervals to draw around estimated values. If “sd”, skip bootstrapping and draw the standard deviation of the observations. If None, no bootstrapping will be performed, and error bars will not be drawn.
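
To make the difference between the two kinds of error bar concrete, the sketch below computes both by hand with NumPy on made-up runtimes: a bootstrapped 95% confidence interval for the mean and a plain standard-deviation band. It illustrates the idea only and is not seaborn's internal code; the number of resamples and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical runtimes in minutes
runtimes = np.array([92, 98, 101, 105, 110, 113, 118, 125, 131, 140], dtype=float)

# Bootstrap: resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(runtimes, size=runtimes.size, replace=True).mean()
    for _ in range(2000)
])

# 95% interval taken from the 2.5th and 97.5th percentiles of the resampled means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# "sd"-style band: mean plus/minus one sample standard deviation
mean = runtimes.mean()
sd = runtimes.std(ddof=1)

print(f"bootstrap 95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
print(f"mean +/- sd: [{mean - sd:.1f}, {mean + sd:.1f}]")
```

The confidence interval describes uncertainty about the mean and narrows as more observations are added, whereas the standard deviation describes the spread of the observations themselves.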

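# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x="RB", y="Runtime (Minutes)", data=movies)
#sns.barplot(x="Year", y="Runtime (Minutes)", data=movies, color="lightblue")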
movies.plot.hexbin(x='Revenue (Millions)', y='Runtime (Minutes)', gridsize=10)

Observation

  • A large number of movies with runtime 95-125 mins get an average rating of 6-7.5

Q7) What is the effect of Metascore? Are Metascore and rating related?

movies['Metascore'].sort_values().plot.hist()

Observation

  • A large number of movies get a Metascore of 55-70

  • Around 20 movies get a good Metascore of 90-100

  • The Metascore distribution looks approximately normal

movies.plot.scatter(x='Revenue (Millions)', y='Rating')
movies.plot.scatter(x='Metascore', y='Rating')

Observation

  • The scatter plot shows that Rating and Metascore are closely related

movies["Rating"].plot.hist(color='r', alpha=0.7)
movies[movies['Genre 1'] == 'Action'].Year.groupby(movies.Year).count().plot(kind='pie', figsize=(6, 6), autopct='%2.2f%%')
plt.axis('equal')
plt.show()

movies.plot(kind="bar",x='Genre 2',y="Year",stacked=True)

# print (movies["Year"].pct_change())
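plt.figure(figsize=(5, 5))  # Set the figure size before drawing
# numeric_only=True restricts the correlation to numeric columns; annot=True adds the correlation numbers to the heatmap
sns.heatmap(movies.corr(numeric_only=True), annot=True)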
movies.plot.area(x='Votes',y="Rating")
movies.plot.scatter(x='Votes', y='Rating')
Observation
  • It is rare for a movie to receive a very large number of votes, but movies that do tend to have a good rating

  • The larger the number of votes, the better the rating tends to be.

  • There is a close correlation between Votes and Rating
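
How close the relationship is can be quantified with a correlation coefficient. The sketch below uses invented votes and ratings; it computes the Pearson coefficient with `Series.corr` and a rank-based (Spearman) coefficient as the Pearson correlation of the ranks, which is less sensitive to a few movies with extremely high vote counts.

```python
import pandas as pd

# Invented votes and ratings
toy = pd.DataFrame({
    "Votes": [1_000, 5_000, 20_000, 80_000, 300_000, 900_000],
    "Rating": [5.1, 6.8, 6.0, 7.2, 7.0, 8.5],
})

# Pearson measures linear association
pearson = toy["Votes"].corr(toy["Rating"])

# Spearman measures monotonic association: Pearson applied to the ranks
spearman = toy["Votes"].rank().corr(toy["Rating"].rank())

print(round(pearson, 2), round(spearman, 2))
```

A coefficient near 1 would support a "close" positive relationship; a moderate value would suggest the wording should be softened.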

Q9) Which genres of movies are made?

print("The count of different Genres for Genre 1")
movies.groupby(['Genre 1'])['Genre 1'].count().sort_values()
import collections
print("The count of different Genres")
g1 = dict(movies.groupby(['Genre 1'])['Genre 1'].count().sort_values())
g2 = dict(movies.groupby(['Genre 2'])['Genre 2'].count().sort_values())
g3 = dict(movies.groupby(['Genre 3'])['Genre 3'].count().sort_values())
g0 = [g1, g2, g3]
# Add up the counts of each genre across the three genre columns
counter = collections.Counter()
for d in g0:
    counter.update(d)
g = dict(counter)
pd.DataFrame(list(g.values()), index=list(g.keys()), columns=['cnt']).plot(kind='pie', y='cnt', figsize=(10, 10), autopct='%1.1f%%', legend=False)
plt.axis('equal')
plt.show()
  • From the above data for Genre 1 we conclude that the top three primary genres are Action, Drama, and Comedy

movies.plot.line(y="Runtime (Minutes)",x="Year" )
sns.set(rc={'figure.figsize': (14, 12)})
sns.countplot(x='Genre 1', data=movies).set_title('Count plot for Genre 1')
print("The count of different Genres for Genre 2")
movies.groupby(['Genre 2'])['Genre 2'].count().sort_values()
  • From the above data for Genre 2 we conclude that the top genres are Drama, Adventure, Romance, and Comedy

  • Romance, Comedy, Crime, Thriller, Mystery, and Horror are made at about the same level

sns.set(rc={'figure.figsize': (14, 12)})
sns.countplot(x='Genre 2', data=movies).set_title('Count plot for Genre 2')
print("The count of different Genres for Genre 3")
movies.groupby(['Genre 3'])['Genre 3'].count().sort_values()
  • From the above data for Genre 3 we conclude that the top genres are Thriller, Sci-Fi, Drama, Romance, and Fantasy

sns.set(rc={'figure.figsize': (14, 12)})
sns.countplot(x='Genre 3', data=movies).set_title('Count plot for Genre 3')

Observation

  • Across the three genre columns, Drama is the most common genre: 513 of the 1000 movies, or around 50%, list Drama as a genre

  • Action movies are also common, though less so than Drama, with a count of 303 out of 1000 movies

  • Comedy movies are made more often than Adventure movies, with a count of 279 for Comedy and 259 for Adventure

  • Thriller movies are made more often than Crime movies, with counts of 195 and 150

  • Crime and Romance movies are made in nearly equal numbers, with counts of 150 for Crime and 141 for Romance
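
Totals like these can also be obtained directly from the original comma-separated `Genre` column, without merging three per-column counts. The sketch below is one way to do it on made-up rows; in the notebook it would have to run before `Genre` is dropped, or on a fresh copy of the data.

```python
import pandas as pd

# Hypothetical Genre strings in the same comma-separated format as the dataset
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Drama", "Comedy,Drama", "Action,Drama,Thriller"]
})

# Split each string into a list, then give every genre its own row
genres = toy["Genre"].str.split(",").explode().str.strip()

# Number of movies that list each genre
genre_counts = genres.value_counts()
print(genre_counts)

# Share of movies that include Drama
drama_share = genre_counts["Drama"] / len(toy) * 100
print(f"{drama_share:.0f}% of movies include Drama")
```

Because a movie can list several genres, the counts add up to more than the number of movies, so shares are computed against `len(toy)` rather than against the sum of the counts.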

Q10) Which genre produces the best revenue in the years 2015 and 2016?

# Year is numeric (the astype(str) conversion earlier is commented out), so the filter uses integers
sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year", data=movies.loc[movies['Year'].isin([2015, 2016])])
sns.violinplot(x='Genre 1', y='Revenue (Millions)', hue='Year', data=movies.loc[movies['Year'].isin([2015, 2016])], split=True)
sns.swarmplot(x="Genre 2", y="Revenue (Millions)", hue="Year", data=movies.loc[movies['Year'].isin([2015, 2016])])
sns.swarmplot(x="Genre 3", y="Revenue (Millions)", hue="Year", data=movies.loc[movies['Year'].isin([2015, 2016])])
  • The plots above show that Drama, Adventure, Action, and Comedy movies generated large amounts of revenue in 2015 and 2016

  • 2016 generated more revenue than 2015
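
The comparison between the two years can be backed by totals as well as plots. A minimal sketch on invented rows (the column names follow the notebook, the figures do not) filters to 2015 and 2016 and sums revenue per primary genre and year:

```python
import pandas as pd

# Invented rows; Year is stored as an integer here
toy = pd.DataFrame({
    "Genre 1": ["Action", "Drama", "Action", "Comedy", "Drama", "Action"],
    "Year": [2014, 2015, 2015, 2016, 2016, 2016],
    "Revenue (Millions)": [100.0, 20.0, 300.0, 80.0, 40.0, 500.0],
})

# Keep only 2015 and 2016; the values must match the dtype of the Year column
recent = toy[toy["Year"].isin([2015, 2016])]

# Total revenue per primary genre (rows) and year (columns)
revenue = recent.pivot_table(
    index="Genre 1", columns="Year", values="Revenue (Millions)",
    aggfunc="sum", fill_value=0,
)
print(revenue)

# Total revenue per year across all genres
print(revenue.sum())
```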

Conclusions

  • The Movies dataset shows that more movies are made year after year, with a sharp increase in 2016 compared with 2015

  • 128 movies released in 2006-2016 have no revenue value recorded in the dataset

  • The most common rating is 7.1; only one movie received a rating of 9

  • A large share of movies receive ratings in the range 6.1-7.5

  • Ridley Scott directed the most movies, 8 in the period 2006-2016; most directors directed only one movie in the 10-year span

  • Most movies produce revenue in the range of 0-100 (Millions); very few generate revenue in the range of 380-650 (Millions)

  • Most movies have a runtime of 90 to 130 mins. A large number of movies with a runtime of 95-125 mins get an average rating of 6-7.5

  • When a movie gets a large number of votes it tends to have a good rating. The larger the number of votes, the better the rating tends to be. There is a close correlation between Votes and Rating

  • Drama is the most common genre: 513 of the 1000 movies, or around 50%, list Drama as a genre

  • Drama, Adventure, Action, and Comedy movies generated large amounts of revenue in 2015 and 2016. Most of the movies contributing to each year's revenue earned less than 100 million.