CoCalc -- EDAcarsales.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analysis using Python/EDAcarsales.ipynb
Kernel: Python 3

1. Problem Statement

"This dataset contains data on more than 9.5K car sales in Ukraine. Most of them are used cars, which opens the possibility of analyzing features related to car operation. This is a subset of all car data in Ukraine. Using it, we will analyze the various parameters of used car sales in Ukraine."

1.1. Introduction

This exploratory data analysis is an exercise in applying the Python skills learned so far to a structured dataset: loading, inspecting, wrangling, exploring, and drawing conclusions from data. Each step is accompanied by observations that explain how to approach the dataset. Some questions are answered on the basis of those observations for reference, although not all of them are explored in the analysis.

1.2. Data source and dataset

a. How was it collected?

Name: "Car Sales"
Sponsoring Organization: Unknown
Year: 2019
Description: "This is a case study of more than 9.5K car sales in Ukraine."

b. Is it a sample? If yes, was it properly sampled?

Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume that it is not representative.

2. Load the packages and data

Importing Packages

In [ ]:

import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
#import pandas_profiling
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()
from subprocess import check_output

Loading Dataset

In [ ]:

CarSales_Data = pd.read_excel("Car_Sales.xlsx")
CarSales_Data.shape

3. Data Profiling

3.1 Understanding the Dataset

In [ ]:

CarSales_Data.shape                                                    # This will print the number of rows and columns of the Data Frame

CarSales_Data has 9576 rows and 10 columns.

In [ ]:

CarSales_Data.columns  
# This will print the names of all columns.

In [ ]:

CarSales_Data

In [ ]:

CarSales_Data.describe()

In [ ]:

CarSales_Data.describe(include="all")

In [ ]:

CarSales_Data.sort_values(by=['price'],ascending= False)

In [ ]:

sns.set()
plt.subplots(figsize=(10,20))
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(CarSales_Data.select_dtypes(include="number").corr(),annot=True)

In [ ]:

CarSales_Data.head(5)  

# This will print the first n rows of the Data Frame

In [ ]:

CarSales_Data.info()                                                   # This will give Index, Datatype and Memory information

In [ ]:

CarSales_Data.isnull().sum()

From the above output we can see that the engV and drive columns contain the most null values. We will see how to deal with them.
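
To put the null counts in proportion, they can be shown next to the share of rows they affect. The sketch below uses a small made-up frame (`toy` and its values are hypothetical and only stand in for CarSales_Data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for CarSales_Data (values are made up).
toy = pd.DataFrame({
    "price": [15500.0, 20500.0, 35000.0, 17800.0],
    "engV": [2.5, np.nan, 5.5, 1.8],
    "drive": ["full", "rear", None, None],
})

# Number of missing values per column.
missing = toy.isnull().sum()

# Share of rows with a missing value per column, in percent.
missing_share = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_share})
print(summary.sort_values("missing", ascending=False))
```

On the real data the same two lines can be applied to CarSales_Data directly.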

1. Fill missing
2. Sort() according to price (ascending)
3. Group via drive
4. Dummy
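
The sorting, grouping and "Dummy" notes above could look roughly like this on a toy frame. This is only a sketch: it assumes that "Dummy" refers to dummy (one-hot) encoding of the drive column, and the frame `toy` with its values is made up.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "car": ["Ford", "BMW", "Audi", "Kia"],
    "price": [9000.0, 21000.0, 15000.0, 7000.0],
    "drive": ["front", "rear", "full", "front"],
})

# Sort according to price, ascending.
by_price = toy.sort_values(by="price", ascending=True)

# Group by drive and summarise the price of each group.
mean_price_by_drive = toy.groupby("drive")["price"].mean()

# Assumption: "Dummy" means one indicator column per drive category.
encoded = pd.get_dummies(toy, columns=["drive"])

print(by_price)
print(mean_price_by_drive)
print(encoded.columns.tolist())
```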

3.2 Pre Profiling

!pip install pandas_profiling    # Installing the pandas_profiling package

Now performing pandas profiling to understand data better.

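import pandas_profiling                                                # Needed here because the import in section 2 is commented out
profile = pandas_profiling.ProfileReport(CarSales_Data)
profile.to_file(outputfile="CarSales_before_preprocessing.html")      # Newer pandas_profiling releases name this argument output_file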

3.3 Preprocessing

Dealing with duplicate rows
- Find number of duplicate rows in the dataset.
- Print the duplicate entries and analyze.
- Drop the duplicate entries from the dataset.
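
The three steps above can be sketched on a toy frame as follows. The frame and its values are made up. Note that `drop_duplicates` returns a new frame, so its result has to be assigned for the rows to actually go away.

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 repeat row 0.
toy = pd.DataFrame({
    "car": ["Ford", "Ford", "BMW", "Ford"],
    "price": [9000.0, 9000.0, 21000.0, 9000.0],
    "year": [2010, 2010, 2014, 2010],
})

# 1. Number of rows that repeat an earlier row.
n_duplicates = toy.duplicated().sum()

# 2. All copies of duplicated rows, including the first occurrence.
duplicate_rows = toy.loc[toy.duplicated(keep=False), :]

# 3. Keep the first occurrence of each row and assign the result.
deduplicated = toy.drop_duplicates(keep="first")

print(n_duplicates, duplicate_rows.shape, deduplicated.shape)
```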

In [ ]:

b=CarSales_Data["drive"].mode()

In [ ]:

CarSales_Data["drive"]=CarSales_Data["drive"].fillna("front")
CarSales_Data.isnull().sum()

In [ ]:

print(CarSales_Data.duplicated().sum())

In [ ]:

CarSales_Data.loc[CarSales_Data.duplicated(keep=False), :]

In [ ]:

CarSales_Data = CarSales_Data.drop_duplicates(keep='first')            # Assign the result; drop_duplicates returns a new Data Frame
CarSales_Data.shape

Duplicate entries are removed now.

Dealing with missing values
- 434 missing entries of engV. Replace them with the median engV of the same car and body group.
- 511 missing entries of drive. Replace them with the most common drive value of the same car and body group.
- Drop entries whose price is 0 or less.
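
As a minimal sketch of the plan above, here is the same group-wise filling on a made-up frame in which every (car, body) group has at least one observed value. The names `toy`, `most_common` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({
    "car":   ["Ford", "Ford", "Ford", "BMW", "BMW"],
    "body":  ["sedan", "sedan", "sedan", "sedan", "sedan"],
    "engV":  [1.6, 2.0, np.nan, 3.0, np.nan],
    "drive": ["front", "front", None, "rear", None],
    "price": [8000.0, 9500.0, 0.0, 21000.0, 18000.0],
})

# engV: fill gaps with the median of the same (car, body) group.
toy["engV"] = toy.groupby(["car", "body"])["engV"].transform(
    lambda s: s.fillna(s.median())
)

def most_common(s):
    """Most frequent non-null value of a group, or NaN if there is none."""
    counts = s.value_counts()  # NaN is excluded by default
    return counts.index[0] if len(counts) else np.nan

# drive: fill gaps with the most common value of the same (car, body) group.
group_mode = toy.groupby(["car", "body"])["drive"].transform(most_common)
toy["drive"] = toy["drive"].fillna(group_mode)

# price: keep only rows with a strictly positive price.
cleaned = toy[toy["price"] > 0]

print(cleaned)
```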

In [ ]:

CarSales_Data['engV'] = CarSales_Data.groupby(['car', 'body'])['engV'].transform(lambda x: x.fillna(x.median()))

Now let's check whether the missing values of engV have been replaced.

In [ ]:

CarSales_Data.isnull().sum()

424 missing values of engV have been replaced; however, 10 entries are still missing. Let's look at the rows with missing values.

In [ ]:

CarSales_Data[CarSales_Data.engV.isnull()]
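
A group-median fill leaves a value missing when its (car, body) group has no observed engV at all, because the median of an all-NaN group is itself NaN. That is the likely reason for the leftover rows. The sketch below shows how such rows could be identified on a made-up frame (`toy` and `unfillable` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the Audi/vagon group has no observed engV.
toy = pd.DataFrame({
    "car":  ["Ford", "Ford", "Audi", "Audi"],
    "body": ["sedan", "sedan", "vagon", "vagon"],
    "engV": [1.6, np.nan, np.nan, np.nan],
})

# Number of observed (non-null) engV values in each row's (car, body) group.
observed = toy.groupby(["car", "body"])["engV"].transform("count")

# Rows a group-median fill cannot repair: engV is missing and the
# group offers no observed value to take a median from.
unfillable = toy[toy["engV"].isnull() & (observed == 0)]

print(unfillable)
```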

Replacing NaN values of drive with the most common value of drive from the same car and body group.

In [ ]:

def f(x):
    # Return the most frequent non-null value of the group, or NaN if the group has none
    if x.count()<=0:
        return np.nan
    return x.value_counts().index[0]

CarSales_Data['drive'] = CarSales_Data['drive'].fillna(CarSales_Data.groupby(['car','body'])['drive'].transform(f))
#CarSales_Data[CarSales_Data.drive.isnull()]

Let's check the count of NaN values of engV and drive.

In [ ]:

CarSales_Data.isnull().sum()

Dropping remaining NaN values of engV and drive.

In [ ]:

CarSales_Data.dropna(subset=['engV'],inplace=True)
CarSales_Data.dropna(subset=['drive'],inplace=True)
CarSales_Data.isnull().sum()

Dropping entries with price <= 0.

In [ ]:

CarSales_Data = CarSales_Data.drop(CarSales_Data[CarSales_Data.price <= 0 ].index)

In [ ]:

CarSales_Data.price[CarSales_Data.price ==0].count()

In [ ]:

b=CarSales_Data["mileage"].median()
CarSales_Data["mileage"]=CarSales_Data["mileage"].replace(0,b)

In [ ]:

CarSales_Data[CarSales_Data.mileage == 0]
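
The median used for the replacement above is computed over all rows, so the zero entries themselves pull it down. One possible variation, which is not what the notebook does, is to take the median of the non-zero values only. A sketch on made-up numbers, with hypothetical names:

```python
import pandas as pd

# Hypothetical toy mileage values; two of them are recorded as zero.
toy = pd.DataFrame({"mileage": [0, 0, 80, 120, 200]})

# Median over all rows, zeros included (the approach of the cell above).
median_all = toy["mileage"].median()

# Variation: median over the non-zero rows only.
median_nonzero = toy.loc[toy["mileage"] > 0, "mileage"].median()

# Replace the zeros with the non-zero median in a new column.
toy["mileage_filled"] = toy["mileage"].replace(0, median_nonzero)

print(median_all, median_nonzero)
print(toy)
```

Which of the two medians is more appropriate depends on whether a zero mileage is a genuine value (a new car) or a placeholder for missing data; the dataset description does not say.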

profile = pandas_profiling.ProfileReport(CarSales_Data)
profile.to_file(outputfile="CarSales_post_preprocessing.html")

The data are now processed. The dataset no longer contains missing or zero values. The pandas profiling report generated after preprocessing gives a clearer picture of the data, and we can compare the two reports.

4. Questions

4.1 Which type of car is sold the most?

Using Countplot

In [ ]:

sns.countplot(x='body', data=CarSales_Data).set_title('Count plot for car variants.')

Sedans are sold the most, followed by crossover, hatch, van, other and vagon.
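
The ranking read off the count plot can also be obtained as numbers with `value_counts`. A minimal sketch on made-up body types (the series and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy column of body types.
body = pd.Series(
    ["sedan", "crossover", "sedan", "hatch", "sedan", "crossover"],
    name="body",
)

# Absolute counts, sorted from most to least frequent.
counts = body.value_counts()

# The same ranking as a share of all rows.
shares = body.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```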

4.2 What is the correlation between price and mileage?

In [ ]:

sns.regplot(x='mileage',y='price',data=CarSales_Data)

There are some outliers here. Excluding those, it seems that the majority of car prices are below 150000 and mileage is in the range of 0 to 400.
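
The regression plot shows the trend visually; a correlation coefficient puts a number on it. The sketch below computes Pearson's r and, because outliers affect it strongly, a rank-based (Spearman) coefficient as well, obtained here by correlating the ranks. The data are made up.

```python
import pandas as pd

# Hypothetical toy data in which price falls as mileage grows.
toy = pd.DataFrame({
    "mileage": [10, 60, 120, 180, 250],
    "price": [30000.0, 21000.0, 15000.0, 9000.0, 6000.0],
})

# Pearson correlation (the default of Series.corr).
r = toy["mileage"].corr(toy["price"])

# Spearman correlation, computed as the Pearson correlation of the ranks.
rho = toy["mileage"].rank().corr(toy["price"].rank())

print(round(r, 3), round(rho, 3))
```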

4.3. How many cars are registered?

In [ ]:

sns.countplot(x='registration', data=CarSales_Data)

More than 8000 cars are registered and very few are not.

4.4. Price distribution between registered and non-registered cars.

In [ ]:

sns.boxplot(x='registration',y='price',data=CarSales_Data)

The majority of the cars are registered, and the prices of those cars are below 300000. Non-registered cars are cheaper.
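
The comparison made from the box plot can be backed with a per-group summary. A minimal sketch with made-up prices (the frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: registration status and price.
toy = pd.DataFrame({
    "registration": ["yes", "yes", "yes", "no", "no"],
    "price": [12000.0, 18000.0, 30000.0, 4000.0, 6000.0],
})

# Count, median and maximum price for each registration status.
summary = toy.groupby("registration")["price"].agg(["count", "median", "max"])

print(summary)
```

The median is used rather than the mean because the price column contains outliers.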

4.5. What is the car price distribution based on Engine Value?

In [ ]:

sns.regplot(x='engV',y='price',data=CarSales_Data)

Except for a few outliers, car prices clearly lie between 0 and 150000, with engine values between 0 and 6.

4.6. Which engine type do users prefer the most?

In [ ]:

sns.countplot(x='engType', data=CarSales_Data)

Petrol cars are the most preferred, followed by Diesel, Gas and others.

4.7 Establish correlation between all the features using a heatmap.

In [ ]:

corr = CarSales_Data.select_dtypes(include="number").corr()           # Correlation is only defined for numeric columns
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')

mileage and engV are negatively correlated with year.
engV is positively correlated with mileage and price.
Positive correlation is observed between year and price too.
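
Reading pairs off a heatmap is easy to get wrong, so it can help to list them sorted by strength. The sketch below does this on a made-up frame; the values are hypothetical and do not reproduce the correlations of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with numeric and non-numeric columns.
toy = pd.DataFrame({
    "price":   [30000.0, 21000.0, 15000.0, 9000.0, 6000.0, 12000.0],
    "mileage": [10, 60, 120, 180, 250, 150],
    "engV":    [3.0, 2.0, 2.0, 1.6, 1.4, 1.8],
    "year":    [2016, 2013, 2010, 2007, 2003, 2008],
    "body":    ["sedan", "sedan", "hatch", "van", "vagon", "sedan"],
})

# Correlation matrix of the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Keep each pair once: the strict upper triangle of the matrix.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per pair, sorted by absolute correlation, strongest first.
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```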

4.8 Distribution of price.

In [ ]:

sns.histplot(CarSales_Data['price'],kde=True,color ='g')              # histplot replaces distplot, which is deprecated in recent seaborn releases
plt.title('Distribution of price')
plt.show()

The 'price' mostly varies between 0 and 80000.
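
A statement such as "mostly between 0 and 80000" can be checked with quantiles and the share of rows under a threshold. A minimal sketch on made-up prices (the series is hypothetical; 80000 is the threshold from the observation above):

```python
import pandas as pd

# Hypothetical toy prices with a long right tail.
price = pd.Series(
    [3000.0, 5500.0, 8000.0, 9500.0, 12000.0,
     16000.0, 25000.0, 48000.0, 120000.0, 540000.0],
    name="price",
)

# Selected quantiles of the distribution.
quantiles = price.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Share of rows with a price below 80000.
share_below = (price < 80000).mean()

# Positive skewness indicates a long tail towards high prices.
skewness = price.skew()

print(quantiles)
print(share_below, round(skewness, 2))
```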

5 Conclusion

Sedans are sold the most.
Price increases as the engine value increases.
Price and mileage go down as the engine value decreases.
