CoCalc -- EDAcarsales.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Python for Data Science/EDAcarsales.ipynb
⁴⁷⁵² views

Kernel: Python 3 (ipykernel)

1. Problem Statement

"This dataset contains data for more than 9.5K cars sale in Ukraine. Most of them are used cars so it opens the possibility to analyze features related to car operation. This is a subset of all car data in Ukraine. Using this We will analyze the various parameters of used car sales in Ukraine."

1.1. Introduction

This Exploratory Data Analysis is to practice Python skills learned till now on a structured data set including loading, inspecting, wrangling, exploring, and drawing conclusions from data. The notebook has observations with each step in order to explain thoroughly how to approach the data set. Based on the observation some questions also are answered in the notebook for the reference though not all of them are explored in the analysis.

1.2. Data source and dataset

a. How was it collected?

Name: "Car Sales"
Sponsoring Organization: Don't know
Year: 2019
Description: "This is a case study of more than 9.5K cars sale in Ukraine."

b. Is it a sample? If yes, was it properly sampled?

Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume that it is not representative.

2. Load the packages and data

Importing Packages

In [1]:

import numpy as np                                                 # Implemennts milti-dimensional array and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling
import matplotlib.pyplot as plt                                    # Plotting library for Python programming language and it's numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()
from subprocess import check_output

Loading Dataset

In [2]:

CarSales_Data = pd.read_excel("Car_Sales.xlsx")
CarSales_Data.shape

Out[2]:

(9576, 10)

3. Data Profiling

3.1 Understanding the Dataset

In [3]:

CarSales_Data                                                 # This will print the number of rows and comlumns of the Data Frame

Out[3]:

CarSales_Data has 9576 rows and 10 columns.

In [4]:

CarSales_Data.columns  
# This will print the names of all columns.

Out[4]:

Index(['car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration',
       'year', 'model', 'drive'],
      dtype='object')

In [5]:

CarSales_Data.head()

Out[5]:

In [6]:

CarSales_Data.describe()

Out[6]:

In [7]:

CarSales_Data.describe(include="all")

Out[7]:

In [8]:

CarSales_Data.sort_values(by=['price'],ascending= False)

Out[8]:

1 - +ive
-1 - -IVE
0 : NO RELATION

In [9]:

import seaborn as sns                                              # Provides a high level interface for drawing attractive and informative statistical graphics
sns.set()
plt.subplots(figsize=(10,20))
sns.heatmap(CarSales_Data.corr(),annot=True)

Out[9]:

<AxesSubplot:>

In [10]:

CarSales_Data.head(5)  

# This will print the last n rows of the Data Frame

Out[10]:

In [11]:

CarSales_Data.info()                                                   # This will give Index, Datatype and Memory information

Out[11]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9576 entries, 0 to 9575
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   car           9576 non-null   object 
 1   price         9576 non-null   float64
 2   body          9576 non-null   object 
 3   mileage       9576 non-null   int64  
 4   engV          9142 non-null   float64
 5   engType       9576 non-null   object 
 6   registration  9576 non-null   object 
 7   year          9576 non-null   int64  
 8   model         9576 non-null   object 
 9   drive         9065 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 748.2+ KB

In [12]:

CarSales_Data.isnull().sum()

Out[12]:

car               0
price             0
body              0
mileage           0
engV            434
engType           0
registration      0
year              0
model             0
drive           511
dtype: int64

From the above output we can see that engV and drive columns contains maximum null values. We will see how to deal with them.

Fill missing 2.Sort()according to price (Asending)
Group via drive
Dummy

### 3.2 Pre Profiling

!pip install pandas_profiling    #Installing pandas_profiling packages

Now performing pandas profiling to understand data better.

profile = pandas_profiling.ProfileReport(CarSales_Data)
profile.to_file(outputfile="CarSales_before_preprocessing.html")

3.3 Preprocessing

Dealing with duplicate rows
- Find number of duplicate rows in the dataset.
- Print the duplicate entries and analyze.
- Drop the duplicate entries from the dataset.

In [13]:

b=CarSales_Data["drive"].mode()

In [14]:

CarSales_Data["drive"]=CarSales_Data["drive"].fillna("front")
CarSales_Data.isnull().sum()

Out[14]:

car               0
price             0
body              0
mileage           0
engV            434
engType           0
registration      0
year              0
model             0
drive             0
dtype: int64

In [15]:

print(CarSales_Data.duplicated().sum())

Out[15]:

113

In [16]:

CarSales_Data.loc[CarSales_Data.duplicated(keep=False), :]

Out[16]:

In [17]:

CarSales_Data.drop_duplicates(keep='first').shape

Out[17]:

(9463, 10)

Duplicate entries are removed now.

Dealing with missing values
- 434 missing entries of engV. Replace it with median value of engV from the same Car and body group of cars.
- 511 missing entries of drive. Replace it with most common value of drive from the same Car and body group of cars.
- Drop entries having price is 0 or less than 0.

In [18]:

CarSales_Data['engV'] = CarSales_Data.groupby(['car', 'body'])['engV'].transform(lambda x: x.fillna(x.median()))

Now let's check if the missing values of engV has been replaced.

In [19]:

CarSales_Data.isnull().sum()

Out[19]:

car              0
price            0
body             0
mileage          0
engV            10
engType          0
registration     0
year             0
model            0
drive            0
dtype: int64

424 missing values of engV has been replaced however, still 10 entries are left as missing. Let's see the missing value data.

In [20]:

CarSales_Data[CarSales_Data.engV.isnull()]

Out[20]:

Replacing NaN values of drive with most common values of drive from Car and body group.

In [21]:

def f(x):
    if x.count()<=0:
        return np.nan
    return x.value_counts().index[0]

CarSales_Data['drive'] = CarSales_Data['drive'].fillna(CarSales_Data.groupby(['car','body'])['drive'].transform(f))
#CarSales_Data[CarSales_Data.drive.isnull()]

Let's check the count of NaN values of engV and drive.

In [22]:

CarSales_Data.isnull().sum()

Out[22]:

car              0
price            0
body             0
mileage          0
engV            10
engType          0
registration     0
year             0
model            0
drive            0
dtype: int64

Dropping remaining NaN values of engV and drive.

In [23]:

CarSales_Data.dropna(subset=['engV'],inplace=True)
CarSales_Data.dropna(subset=['drive'],inplace=True)
CarSales_Data.isnull().sum()

Out[23]:

car             0
price           0
body            0
mileage         0
engV            0
engType         0
registration    0
year            0
model           0
drive           0
dtype: int64

Dropping entries with price <= 0 .

In [24]:

CarSales_Data = CarSales_Data.drop(CarSales_Data[CarSales_Data.price <= 0 ].index)

In [25]:

CarSales_Data.price[CarSales_Data.price ==0].count()

Out[25]:

0

In [26]:

b=CarSales_Data["mileage"].median()
CarSales_Data["mileage"]=CarSales_Data["mileage"].replace(0,b)

In [27]:

CarSales_Data[CarSales_Data.mileage == 0]

Out[27]:

profile = pandas_profiling.ProfileReport(CarSales_Data)
profile.to_file(outputfile="CarSales_post_preprocessing.html")

The data are processed now. The dataset doesnot contain missing and zero values. The pandas profiling report generated after processing the data giving us more clear data. We can compare the two reports.

4. Questions

4.1 Which type of cars are sold maximum?

Using Countplot

In [28]:

sns.countplot(x='body', data=CarSales_Data).set_title('Count plot for car variants.')

Out[28]:

Text(0.5, 1.0, 'Count plot for car variants.')

You can see sedan cars are sold maximum and followed that crossover,hatch,van,other and vagon

4.2 What is the co-relation between price and mileage?

In [29]:

sns.regplot(x='mileage',y='price',data=CarSales_Data)

Out[29]:

<AxesSubplot:xlabel='mileage', ylabel='price'>

You can see there are some outliers here. Excluding those,it seems that majority of car price is below 150000 and gives mileage in the range of 0 to 400.

4.3. How many cars are registered?

In [30]:

sns.countplot(CarSales_Data['registration'])

Out[30]:

C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

<AxesSubplot:xlabel='registration', ylabel='count'>

8000+ cars are registered and very few are not registered

4.4. Price distribution between registered and non-registered cars.

In [31]:

sns.boxplot(x='registration',y='price',data=CarSales_Data)

Out[31]:

<AxesSubplot:xlabel='registration', ylabel='price'>

Majority of the cars are registered and the price of those cars are below 300000. Non-registered cars are cheaper in cost.

4.5. What is the car price distribution based on Engine Value?

In [32]:

sns.regplot(x='engV',y='price',data=CarSales_Data)

Out[32]:

<AxesSubplot:xlabel='engV', ylabel='price'>

Except few outliers, it is clearly observed that the range of car price is between 0 to 150000 having the range of engine value between 0 to 6.

4.6. Which engine type of cars users preferred maximum?

In [33]:

sns.countplot(CarSales_Data['engType'])

Out[33]:

C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

<AxesSubplot:xlabel='engType', ylabel='count'>

Petrol cars are more preferred and followed by Diesel, Gas and others.

4.7 Establish coorelation between all the features using heatmap.

In [34]:

corr = CarSales_Data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')

Out[34]:

Text(0.5, 1.0, 'Correlation between features')

mileage and engV are negatively corelated with year.
mileage is also negatively corelated with year.
engV is positively coorelated with mileage and price.
Positive corelation observed between year and price too.

4.8 Distribution of price.

In [35]:

sns.distplot(CarSales_Data['price'],color ='g')
plt.title('Distribution of price')
plt.show()

Out[35]:

C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

The 'price' mostly varies between 0 and 80000.

5 Conclusion

Sedan cars sold maximum.
Price is increasing as the engine value is increasing.
The price and mileage goes down as engine values decreasing.

Table of Contents