Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Python for Data Science/EDAcarsales.ipynb
3074 views
Kernel: Python 3 (ipykernel)

1. Problem Statement CarSalesImage.png

"This dataset contains data for more than 9.5K cars sale in Ukraine. Most of them are used cars so it opens the possibility to analyze features related to car operation. This is a subset of all car data in Ukraine. Using this We will analyze the various parameters of used car sales in Ukraine."

1.1. Introduction

This Exploratory Data Analysis is to practice Python skills learned till now on a structured data set including loading, inspecting, wrangling, exploring, and drawing conclusions from data. The notebook has observations with each step in order to explain thoroughly how to approach the data set. Based on the observation some questions also are answered in the notebook for the reference though not all of them are explored in the analysis.

1.2. Data source and dataset

a. How was it collected?

  • Name: "Car Sales"

  • Sponsoring Organization: Don't know

  • Year: 2019

  • Description: "This is a case study of more than 9.5K cars sale in Ukraine."

b. Is it a sample? If yes, was it properly sampled?

  • Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume that it is not representative.

2. Load the packages and data

Importing Packages

import numpy as np # Implemennts milti-dimensional array and matrices import pandas as pd # For data manipulation and analysis import pandas_profiling import matplotlib.pyplot as plt # Plotting library for Python programming language and it's numerical mathematics extension NumPy import seaborn as sns # Provides a high level interface for drawing attractive and informative statistical graphics %matplotlib inline sns.set() from subprocess import check_output

Loading Dataset

CarSales_Data = pd.read_excel("Car_Sales.xlsx") CarSales_Data.shape
(9576, 10)

3. Data Profiling

3.1 Understanding the Dataset

CarSales_Data # This will print the number of rows and comlumns of the Data Frame

CarSales_Data has 9576 rows and 10 columns.

CarSales_Data.columns # This will print the names of all columns.
Index(['car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration', 'year', 'model', 'drive'], dtype='object')
CarSales_Data.head()
CarSales_Data.describe()
CarSales_Data.describe(include="all")
CarSales_Data.sort_values(by=['price'],ascending= False)
1 - +ive -1 - -IVE 0 : NO RELATION
import seaborn as sns # Provides a high level interface for drawing attractive and informative statistical graphics sns.set() plt.subplots(figsize=(10,20)) sns.heatmap(CarSales_Data.corr(),annot=True)
<AxesSubplot:>
Image in a Jupyter notebook
CarSales_Data.head(5) # This will print the last n rows of the Data Frame
CarSales_Data.info() # This will give Index, Datatype and Memory information
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9576 entries, 0 to 9575 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 car 9576 non-null object 1 price 9576 non-null float64 2 body 9576 non-null object 3 mileage 9576 non-null int64 4 engV 9142 non-null float64 5 engType 9576 non-null object 6 registration 9576 non-null object 7 year 9576 non-null int64 8 model 9576 non-null object 9 drive 9065 non-null object dtypes: float64(2), int64(2), object(6) memory usage: 748.2+ KB
CarSales_Data.isnull().sum()
car 0 price 0 body 0 mileage 0 engV 434 engType 0 registration 0 year 0 model 0 drive 511 dtype: int64

From the above output we can see that engV and drive columns contains maximum null values. We will see how to deal with them.

  1. Fill missing 2.Sort()according to price (Asending)

  2. Group via drive

  3. Dummy

### 3.2 Pre Profiling
!pip install pandas_profiling #Installing pandas_profiling packages

Now performing pandas profiling to understand data better.

profile = pandas_profiling.ProfileReport(CarSales_Data) profile.to_file(outputfile="CarSales_before_preprocessing.html")

3.3 Preprocessing

  • Dealing with duplicate rows

    • Find number of duplicate rows in the dataset.

    • Print the duplicate entries and analyze.

    • Drop the duplicate entries from the dataset.

b=CarSales_Data["drive"].mode()
CarSales_Data["drive"]=CarSales_Data["drive"].fillna("front") CarSales_Data.isnull().sum()
car 0 price 0 body 0 mileage 0 engV 434 engType 0 registration 0 year 0 model 0 drive 0 dtype: int64
print(CarSales_Data.duplicated().sum())
113
CarSales_Data.loc[CarSales_Data.duplicated(keep=False), :]
CarSales_Data.drop_duplicates(keep='first').shape
(9463, 10)

Duplicate entries are removed now.

  • Dealing with missing values

    • 434 missing entries of engV. Replace it with median value of engV from the same Car and body group of cars.

    • 511 missing entries of drive. Replace it with most common value of drive from the same Car and body group of cars.

    • Drop entries having price is 0 or less than 0.

CarSales_Data['engV'] = CarSales_Data.groupby(['car', 'body'])['engV'].transform(lambda x: x.fillna(x.median()))

Now let's check if the missing values of engV has been replaced.

CarSales_Data.isnull().sum()
car 0 price 0 body 0 mileage 0 engV 10 engType 0 registration 0 year 0 model 0 drive 0 dtype: int64

424 missing values of engV has been replaced however, still 10 entries are left as missing. Let's see the missing value data.

CarSales_Data[CarSales_Data.engV.isnull()]

Replacing NaN values of drive with most common values of drive from Car and body group.

def f(x): if x.count()<=0: return np.nan return x.value_counts().index[0] CarSales_Data['drive'] = CarSales_Data['drive'].fillna(CarSales_Data.groupby(['car','body'])['drive'].transform(f)) #CarSales_Data[CarSales_Data.drive.isnull()]

Let's check the count of NaN values of engV and drive.

CarSales_Data.isnull().sum()
car 0 price 0 body 0 mileage 0 engV 10 engType 0 registration 0 year 0 model 0 drive 0 dtype: int64

Dropping remaining NaN values of engV and drive.

CarSales_Data.dropna(subset=['engV'],inplace=True) CarSales_Data.dropna(subset=['drive'],inplace=True) CarSales_Data.isnull().sum()
car 0 price 0 body 0 mileage 0 engV 0 engType 0 registration 0 year 0 model 0 drive 0 dtype: int64

Dropping entries with price <= 0 .

CarSales_Data = CarSales_Data.drop(CarSales_Data[CarSales_Data.price <= 0 ].index)
CarSales_Data.price[CarSales_Data.price ==0].count()
0
b=CarSales_Data["mileage"].median() CarSales_Data["mileage"]=CarSales_Data["mileage"].replace(0,b)
CarSales_Data[CarSales_Data.mileage == 0]

profile = pandas_profiling.ProfileReport(CarSales_Data) profile.to_file(outputfile="CarSales_post_preprocessing.html")

The data are processed now. The dataset doesnot contain missing and zero values. The pandas profiling report generated after processing the data giving us more clear data. We can compare the two reports.

4. Questions

4.1 Which type of cars are sold maximum?

  • Using Countplot

sns.countplot(x='body', data=CarSales_Data).set_title('Count plot for car variants.')
Text(0.5, 1.0, 'Count plot for car variants.')
Image in a Jupyter notebook

You can see sedan cars are sold maximum and followed that crossover,hatch,van,other and vagon

4.2 What is the co-relation between price and mileage?

sns.regplot(x='mileage',y='price',data=CarSales_Data)
<AxesSubplot:xlabel='mileage', ylabel='price'>
Image in a Jupyter notebook

You can see there are some outliers here. Excluding those,it seems that majority of car price is below 150000 and gives mileage in the range of 0 to 400.

4.3. How many cars are registered?

sns.countplot(CarSales_Data['registration'])
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
<AxesSubplot:xlabel='registration', ylabel='count'>
Image in a Jupyter notebook

8000+ cars are registered and very few are not registered

4.4. Price distribution between registered and non-registered cars.

sns.boxplot(x='registration',y='price',data=CarSales_Data)
<AxesSubplot:xlabel='registration', ylabel='price'>
Image in a Jupyter notebook

Majority of the cars are registered and the price of those cars are below 300000. Non-registered cars are cheaper in cost.

4.5. What is the car price distribution based on Engine Value?

sns.regplot(x='engV',y='price',data=CarSales_Data)
<AxesSubplot:xlabel='engV', ylabel='price'>
Image in a Jupyter notebook

Except few outliers, it is clearly observed that the range of car price is between 0 to 150000 having the range of engine value between 0 to 6.

4.6. Which engine type of cars users preferred maximum?

sns.countplot(CarSales_Data['engType'])
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
<AxesSubplot:xlabel='engType', ylabel='count'>
Image in a Jupyter notebook

Petrol cars are more preferred and followed by Diesel, Gas and others.

4.7 Establish coorelation between all the features using heatmap.

corr = CarSales_Data.corr() plt.figure(figsize=(10,10)) sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black') plt.title('Correlation between features')
Text(0.5, 1.0, 'Correlation between features')
Image in a Jupyter notebook
  • mileage and engV are negatively corelated with year.

  • mileage is also negatively corelated with year.

  • engV is positively coorelated with mileage and price.

  • Positive corelation observed between year and price too.

4.8 Distribution of price.

sns.distplot(CarSales_Data['price'],color ='g') plt.title('Distribution of price') plt.show()
C:\Users\suyashi144893\Anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Image in a Jupyter notebook

The 'price' mostly varies between 0 and 80000.

5 Conclusion

  • Sedan cars sold maximum.

  • Price is increasing as the engine value is increasing.

  • The price and mileage goes down as engine values decreasing.