EDA on Netflix Data

In [1]:

%%time
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

netflix = pd.read_csv(r"C:\Users\suyashi144893\Documents\data Sets\netflix_data.csv")
netflix.info()

Out[1]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
Wall time: 1.56 s

In [3]:

netflix.shape

Out[3]:

(8807, 12)

About Data

This dataset contains information collected from Netflix about TV shows and movies from 2008 to 2021.

type: Whether the title is a TV Show or a Movie (the only two values)
title: The title of the Movie or TV Show
director: The director of the Movie or TV Show
cast: The actors who appear in the Movie or TV Show
release_year: The year the Movie or TV Show was released
rating: The audience rating category of the Movie or TV Show (for example, whether it is suitable for children or for adults only)
duration: The duration of the Movie or TV Show
listed_in: The genre(s) of the Movie or TV Show
description: A short description of the Movie or TV Show

In [2]:

netflix.head(2)

Out[2]:

Data Preparation

Check Missing Values

In [3]:

m=netflix.isnull().sum()
miss= (netflix.isnull().sum()/len(netflix))*100
miss_data=pd.concat([m,miss],axis=1,keys=['Total','%'])
print(miss_data)

Out[3]:

              Total          %
show_id           0   0.000000
type              0   0.000000
title             0   0.000000
director       2634  29.908028
cast            825   9.367549
country         831   9.435676
date_added       10   0.113546
release_year      0   0.000000
rating            4   0.045418
duration          3   0.034064
listed_in         0   0.000000
description       0   0.000000

Dropping the rows where cast or director is missing, because we are not going to impute those features right now. Note that this removes rows, not the cast and director columns themselves.

In [4]:

netflix = netflix.dropna( how='any',subset=['cast', 'director'])
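The difference between dropping rows and dropping columns can be sketched on a toy frame. The data below is made up purely for illustration and is not taken from the Netflix file.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are invented).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, "Z"],
    "cast": ["p, q", "r", None],
})

# dropna(subset=...) removes ROWS that have a missing value in any listed column.
rows_dropped = toy.dropna(how="any", subset=["cast", "director"])

# drop(columns=...) removes the COLUMNS themselves and keeps every row.
cols_dropped = toy.drop(columns=["cast", "director"])

print(rows_dropped.shape)  # (1, 3): only row "A" has both cast and director
print(cols_dropped.shape)  # (3, 1): all rows kept, only "title" remains
```

If the goal is simply to ignore cast and director for now, `drop(columns=...)` would keep all rows; `dropna(subset=...)` is the stricter choice.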

In [5]:

netflix = netflix.fillna({'country': 'missing', 'rating': 'missing'})
netflix.isnull().sum()

Out[5]:

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        3
listed_in       0
description     0
dtype: int64
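The duration column still shows missing values because a dictionary passed to `fillna` only fills the columns it names. A minimal sketch on invented data:

```python
import pandas as pd

# Toy frame with one gap in each column (values are invented).
toy = pd.DataFrame({
    "country": ["India", None],
    "rating": [None, "PG"],
    "duration": ["90 min", None],
})

# Only the columns named in the dict are filled; "duration" is left untouched.
filled = toy.fillna({"country": "missing", "rating": "missing"})

remaining = filled.isnull().sum()
print(remaining)
```

Here `remaining` reports zero missing values for country and rating and one for duration, mirroring the pattern in the output above.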

In [6]:

netflix['date_added'] = pd.to_datetime(netflix['date_added'])
netflix.head()

Out[6]:
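Date strings in a CSV sometimes carry stray whitespace or unparseable entries. The sketch below assumes dates written like "September 25, 2021" (an assumption about the file, not something shown above) and uses invented sample values; it strips whitespace first and turns anything unparseable into NaT instead of raising an error.

```python
import pandas as pd

# Invented sample values; the "Month day, year" format is an assumption.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date"])

# Strip surrounding whitespace, parse with an explicit format,
# and coerce anything that does not match into NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

Checking the NaT count after parsing is a quick way to see whether the assumed format actually fits the column.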

In [7]:

# Count the unique values in each column
netflix.nunique()

Out[7]:

show_id         5700
type               2
title           5700
director        4152
cast            5512
country          605
date_added      1478
release_year      72
rating            18
duration         205
listed_in        346
description     5677
dtype: int64

EDA

What types of content (TV shows or movies) are uploaded on Netflix?
Correlation between the features
Most-watched shows on Netflix
Distribution of ratings
Which has the highest rating: TV shows or movies?
Finding the best month for releasing content
Most-watched genres on Netflix
Movies released over the years
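As one possible starting point for the month question, titles can be counted by the month in which they were added. This sketch reads "best month" as "month with the most additions", which is an assumption, and it uses invented dates rather than the real data.

```python
import pandas as pd

# Invented rows; date_added is already a datetime column here.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "date_added": pd.to_datetime(
        ["2021-01-05", "2020-01-20", "2019-07-01", "2021-12-15"]
    ),
})

# Count titles per month name, ignoring the year.
by_month = toy["date_added"].dt.month_name().value_counts()

print(by_month)
print(by_month.idxmax())  # month with the most additions in the toy data
```

The same `value_counts()` pattern applies to columns such as rating or release_year for the distribution questions in the list.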

In [8]:

netflix.type.value_counts().to_frame('values_count')

Out[8]:

pip install matplotlib

In [11]:

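%config InlineBackend.figure_format = 'svg'
# The 'seaborn' style name was removed in Matplotlib 3.8; sns.set_theme() applies the seaborn look instead
sns.set_theme()
# Take labels and heights from the same Series so each bar gets the right label
type_counts = netflix['type'].value_counts()
fig, ax = plt.subplots()
plot = ax.bar(type_counts.index, type_counts.values, edgecolor="black", linewidth=1)
ax.set_ylabel('Value Count')
#ax.bar_label(plot, padding=-15, color='white')
ax.set_title('Type', fontweight='bold');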

Out[11]:
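When plotting category counts, the labels and the bar heights should come from the same Series. A small sketch with invented values shows that `unique()` and `value_counts()` can order the categories differently, which would put the wrong label under a bar.

```python
import pandas as pd

# Invented values: "TV Show" appears first but "Movie" is more frequent.
types = pd.Series(["TV Show", "Movie", "Movie", "Movie", "TV Show"])

labels = types.unique()         # order of first appearance
heights = types.value_counts()  # sorted by count, largest first

# The two orderings disagree, so pairing them positionally mislabels the bars.
mismatched = list(labels) != list(heights.index)
print(mismatched)

# Safer: take both labels and heights from the value_counts() result.
counts = types.value_counts()
pairs = list(zip(counts.index, counts.values))
print(pairs)
```

Passing `counts.index` and `counts.values` to `ax.bar` keeps each label attached to its own count.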
