suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analysis using Python/EDA Report on Netflix Data Set.ipynb
Kernel: Python 3

EDA on Netflix Data


%%time
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

netflix = pd.read_csv(r"C:\Users\suyashi144893\Documents\data Sets\netflix_data.csv")
netflix.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
Wall time: 1.56 s
netflix.shape
(8807, 12)

About Data

This dataset contains data collected from Netflix on TV shows and movies from 2008 to 2021.

  • type: Takes one of two values, TV Show or Movie

  • title: The title of the Movie or TV Show

  • director: The director of the Movie or TV Show

  • cast: The cast who play roles in the Movie or TV Show

  • release_year: The year the Movie or TV Show was released

  • rating: The audience category of the Movie or TV Show (e.g. whether it is meant only for students, adults, etc.)

  • duration: The duration of the Movie or TV Show

  • listed_in: The genre(s) of the Movie or TV Show

  • description: A short description of the Movie or TV Show
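The duration and listed_in columns are stored as text, so they usually need a little parsing before analysis. The sketch below is only an illustration: the sample rows are hypothetical, and it assumes duration looks like "90 min" or "2 Seasons" and that listed_in is a comma-separated list of genres.

```python
import pandas as pd

# Hypothetical sample rows; the values only mimic the formats assumed above.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons"],
    "listed_in": ["Dramas, International Movies", "Crime TV Shows"],
})

# Split "90 min" / "2 Seasons" into a numeric value and a unit.
parts = sample["duration"].str.extract(
    r"(?P<duration_value>\d+)\s*(?P<duration_unit>\w+)"
)
sample["duration_value"] = parts["duration_value"].astype(int)
sample["duration_unit"] = parts["duration_unit"]

# Turn the comma-separated genre string into a list of genres.
sample["genres"] = sample["listed_in"].str.split(", ")

print(sample[["type", "duration_value", "duration_unit", "genres"]])
```

Keeping the unit in its own column makes it harder to mix minutes and seasons by accident when summarising durations.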

netflix.head(2)

Data Preparation

  • Check Missing Values

m = netflix.isnull().sum()
miss = (netflix.isnull().sum() / len(netflix)) * 100
miss_data = pd.concat([m, miss], axis=1, keys=['Total', '%'])
print(miss_data)
              Total          %
show_id           0   0.000000
type              0   0.000000
title             0   0.000000
director       2634  29.908028
cast            825   9.367549
country         831   9.435676
date_added       10   0.113546
release_year      0   0.000000
rating            4   0.045418
duration          3   0.034064
listed_in         0   0.000000
description       0   0.000000
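The same missing-value summary can be wrapped in a small helper so it is easy to re-run after each cleaning step. This is a minimal sketch: the function name missing_summary and the toy frame are hypothetical, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    total = df.isnull().sum()
    percent = df.isnull().mean() * 100  # share of rows that are null
    summary = pd.concat([total, percent], axis=1, keys=["Total", "%"])
    # Show the columns with the most missing values first.
    return summary.sort_values("Total", ascending=False)

# Hypothetical toy data: two of the four director values are missing.
toy = pd.DataFrame({
    "title": ["w", "x", "y", "z"],
    "director": ["A", None, None, "B"],
})
result = missing_summary(toy)
print(result)
```

On the real data this would be called as missing_summary(netflix).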

Dropping the rows where cast or director is missing. Note that this removes rows, not the cast and director columns themselves, which we are not going to use right now.

netflix = netflix.dropna(how='any', subset=['cast', 'director'])
netflix.fillna({'country': 'missing', 'rating': 'missing'}, inplace=True)
netflix.isnull().sum()
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        3
listed_in       0
description     0
dtype: int64
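Because dropna with subset removes rows while drop with columns removes features, it helps to compare the two on a small example. The frame below is hypothetical toy data, used only to show the difference in the resulting shapes.

```python
import pandas as pd

# Hypothetical toy data with one missing cast and one missing director.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "director": ["A", None, "C"],
    "cast": [None, "x", "y"],
})

# Removes every ROW that has a null in cast or director; columns are kept.
rows_dropped = toy.dropna(how="any", subset=["cast", "director"])

# Removes the two COLUMNS entirely; all rows are kept.
cols_dropped = toy.drop(columns=["cast", "director"])

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Which one is appropriate depends on whether later steps need the director and cast values or just the remaining columns.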
netflix['date_added'] = pd.to_datetime(netflix['date_added'])
netflix.head()
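Once date_added is a datetime column, the month of addition can be extracted for the "best month" question. The sketch below uses hypothetical date strings and assumes the "Month day, year" layout; stripping whitespace first is a defensive step in case some values carry stray spaces.

```python
import pandas as pd

# Hypothetical date strings in the "Month day, year" style assumed above;
# the second one carries a stray leading space.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "September 1, 2019"])

# Strip whitespace first so every value matches the explicit format.
added = pd.to_datetime(raw.str.strip(), format="%B %d, %Y")

# Count titles per calendar month of addition.
month_counts = added.dt.month_name().value_counts()
print(month_counts)
```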
## Finding how many unique values there are in each column of the dataset
netflix.nunique()
show_id         5700
type               2
title           5700
director        4152
cast            5512
country          605
date_added      1478
release_year      72
rating            18
duration         205
listed_in        346
description     5677
dtype: int64

EDA

  • What different types of shows or movies are uploaded on Netflix?

  • Correlation between the features

  • Most watched shows on Netflix

  • Distribution of ratings

  • Which has the highest rating: TV shows or movies?

  • Finding the best month for releasing content

  • Highest watched genres on Netflix

  • Movies released over the years
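Two of these questions can be approached with plain pandas counts. The sketch below is an illustration on hypothetical rows, not the notebook's own analysis: it counts genres by splitting listed_in, and cross-tabulates rating against type. It assumes rating is an audience category rather than a score, so it compares how categories are distributed instead of ranking "higher" ratings.

```python
import pandas as pd

# Hypothetical rows that mimic the columns described earlier.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "rating": ["TV-MA", "PG-13", "TV-MA"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Crime TV Shows, TV Dramas"],
})

# One row per (title, genre) pair, then count how often each genre appears.
genre_counts = toy["listed_in"].str.split(", ").explode().value_counts()

# Number of titles for each rating category, split by type.
rating_by_type = pd.crosstab(toy["rating"], toy["type"])

print(genre_counts)
print(rating_by_type)
```

Note that these counts measure how many titles carry a genre or rating, which is a proxy for availability, not for how often titles are watched.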

netflix.type.value_counts().to_frame('values_count')
%pip install matplotlib
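%config InlineBackend.figure_format = 'svg'
plt.style.use('seaborn')  # renamed to 'seaborn-v0_8' in Matplotlib 3.6 and later
fig, ax = plt.subplots()
# Take labels and heights from the same Series so each bar matches its label
type_counts = netflix.type.value_counts()
plot = ax.bar(type_counts.index, type_counts.values, edgecolor="black", linewidth=1)
ax.set_ylabel('Value Count')
#ax.bar_label(plot, padding=-15, color='white')
ax.set_title('Type', fontweight='bold');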
[Figure: bar chart titled "Type", y-axis "Value Count", one bar per content type]