1. Problem Statement 
"This dataset contains data for more than 9.5K car sales in Ukraine. Most of them are used cars, which makes it possible to analyze features related to car operation. This is a subset of all car data in Ukraine. Using it, we will analyze the various parameters of used car sales in Ukraine."
1.1. Introduction
This Exploratory Data Analysis is a practice exercise for the Python skills learned so far, applied to a structured data set: loading, inspecting, wrangling, exploring, and drawing conclusions from data. The notebook records observations at each step to explain thoroughly how to approach the data set. Based on these observations, some questions are also answered in the notebook for reference, though not all of them are explored in the analysis.
1.2. Data source and dataset
a. How was it collected?
Name: "Car Sales"
Sponsoring Organization: Unknown
Year: 2019
Description: "This is a case study of more than 9.5K cars sale in Ukraine."
b. Is it a sample? If yes, was it properly sampled?
Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume that it is not representative.
Importing Packages
Loading Dataset
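The import and loading steps might look like the following minimal sketch. The CSV file name is not given in this notebook, so a small inline sample stands in for it; the sample rows are hypothetical, and the column names other than engV, drive, price, mileage, year, car and body are assumptions.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV file. Column names
# engType, registration and model are assumptions, as are all the values.
sample_csv = io.StringIO(
    "car,price,body,mileage,engV,engType,registration,year,model,drive\n"
    "BrandA,15500.0,crossover,68,2.5,Gas,yes,2010,ModelX,full\n"
    "BrandB,20500.0,sedan,173,1.8,Gas,yes,2011,ModelY,rear\n"
    "BrandC,16600.0,crossover,83,2.0,Petrol,yes,2013,ModelZ,full\n"
)

# With the real data this would be pd.read_csv("<path to the CSV file>").
CarSales_Data = pd.read_csv(sample_csv)

rows, cols = CarSales_Data.shape
print(f"CarSales_Data has {rows} rows and {cols} columns.")
```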
3. Data Profiling
3.1 Understanding the Dataset
CarSales_Data has 9576 rows and 10 columns.
From the above output we can see that the engV and drive columns contain the most null values. We will see how to deal with them.
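One way to obtain per-column counts of missing values is sketched here on a hypothetical toy frame (the values are made up; only the column names engV, drive and price come from this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for CarSales_Data.
df = pd.DataFrame({
    "price": [15500.0, 20500.0, 0.0, 9000.0],
    "engV": [2.5, np.nan, 1.6, np.nan],
    "drive": ["full", None, "front", "front"],
})

# Count missing values per column, largest first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)
```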
1. Fill missing
2. Sort() according to price (ascending)
3. Group via drive
4. Dummy
Now run pandas profiling to understand the data better.
3.3 Preprocessing
Dealing with duplicate rows
Find number of duplicate rows in the dataset.
Print the duplicate entries and analyze.
Drop the duplicate entries from the dataset.
Duplicate entries are removed now.
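The three duplicate-handling steps above can be sketched with standard pandas calls. The toy frame below is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are exact duplicates.
df = pd.DataFrame({
    "car": ["BrandA", "BrandA", "BrandB", "BrandA"],
    "body": ["sedan", "sedan", "sedan", "van"],
    "price": [5000.0, 5000.0, 12000.0, 7000.0],
})

# Number of rows that repeat an earlier row.
n_duplicates = df.duplicated().sum()

# Every copy of a duplicated row, for inspection.
duplicate_rows = df[df.duplicated(keep=False)]

# Keep the first occurrence of each row and drop the rest.
deduplicated = df.drop_duplicates().reset_index(drop=True)
```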
Dealing with missing values
There are 434 missing entries of engV. Replace them with the median engV of the same Car and body group.
There are 511 missing entries of drive. Replace them with the most common drive value of the same Car and body group.
Drop entries whose price is 0 or less.
Now let's check whether the missing values of engV have been replaced.
424 missing values of engV have been replaced; however, 10 entries are still missing. Let's look at the rows with missing values.
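A group-wise median fill can leave some values missing when every engV in a group is NaN, because such a group has no median to fill with. The sketch below shows this on hypothetical data; the lowercase column name car is an assumption about the exact spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the (BrandB, van) group has no known engV at all.
df = pd.DataFrame({
    "car": ["BrandA", "BrandA", "BrandA", "BrandB", "BrandB"],
    "body": ["sedan", "sedan", "sedan", "van", "van"],
    "engV": [1.6, 2.0, np.nan, np.nan, np.nan],
})

# Median engV of each (car, body) group, aligned to the original rows.
group_median = df.groupby(["car", "body"])["engV"].transform("median")

# Fill only the missing entries with their group's median.
df["engV"] = df["engV"].fillna(group_median)

# Rows in all-NaN groups stay missing.
remaining = df["engV"].isnull().sum()
```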
Replacing NaN values of drive with most common values of drive from Car and body group.
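The same pattern works for the categorical drive column, using the mode instead of the median. This is a minimal sketch on hypothetical data; the helper most_common is an illustrative name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one group has a clear mode, one has no data.
df = pd.DataFrame({
    "car": ["BrandA", "BrandA", "BrandA", "BrandB"],
    "body": ["sedan", "sedan", "sedan", "van"],
    "drive": ["front", "front", None, None],
})

def most_common(values: pd.Series):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores missing values by default
    return modes.iloc[0] if not modes.empty else np.nan

# Most common drive of each (car, body) group, aligned to the original rows.
group_mode = df.groupby(["car", "body"])["drive"].transform(most_common)

# Fill only the missing entries with their group's mode.
df["drive"] = df["drive"].fillna(group_mode)
```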
Let's check the count of NaN values of engV and drive.
Dropping remaining NaN values of engV and drive.
Dropping entries with price <= 0.
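The two drop steps can be sketched as follows on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one zero price and two rows with missing values.
df = pd.DataFrame({
    "price": [15500.0, 0.0, 9000.0, 7000.0],
    "engV": [2.5, 1.6, np.nan, 2.0],
    "drive": ["full", "front", "front", None],
})

# Drop rows where engV or drive is still missing after the group-wise fill.
cleaned = df.dropna(subset=["engV", "drive"])

# Keep only rows with a strictly positive price.
cleaned = cleaned[cleaned["price"] > 0].reset_index(drop=True)
```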
The data are processed now. The dataset does not contain missing or zero values. The pandas profiling report generated after processing gives a clearer picture of the data. We can compare the two reports.
4. Questions
4.1 Which type of car is sold the most?
Using Countplot
Sedans are sold the most, followed by crossover, hatch, van, other and vagon.
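A count plot draws one bar per category; the numbers behind the bars can be checked directly with value_counts. The sketch below uses hypothetical rows, so the counts do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with a few body types.
df = pd.DataFrame({
    "body": ["sedan", "sedan", "sedan", "crossover", "crossover", "hatch"],
})

# Number of cars per body type, most frequent first.
body_counts = df["body"].value_counts()
print(body_counts)

# The most frequent body type.
top_body = body_counts.index[0]
```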
4.2 What is the correlation between price and mileage?
There are some outliers here. Excluding those, most cars are priced below 150000 and have a mileage between 0 and 400.
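To put a number on the relationship, the outliers can be filtered out before computing a correlation coefficient. This is a sketch on hypothetical data; the cut-offs 150000 and 400 are taken from the observation above, and Pearson correlation is an assumption about which measure is wanted.

```python
import pandas as pd

# Hypothetical toy frame: the last two rows act as outliers.
df = pd.DataFrame({
    "price": [30000.0, 20000.0, 10000.0, 500000.0, 8000.0],
    "mileage": [10, 100, 200, 5, 999],
})

# Keep the bulk of the data: price below 150000 and mileage up to 400.
trimmed = df[(df["price"] < 150000) & (df["mileage"] <= 400)]

# Pearson correlation between price and mileage on the trimmed data.
r = trimmed["price"].corr(trimmed["mileage"])
```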
4.3. How many cars are registered?
More than 8000 cars are registered; very few are not.
4.4. Price distribution between registered and non-registered cars.
The majority of the cars are registered, and most of them are priced below 300000. Non-registered cars are cheaper.
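A per-group summary gives the numbers behind such a comparison. In this sketch the column name registration and its yes/no values are assumptions, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; "registration" is an assumed column name.
df = pd.DataFrame({
    "registration": ["yes", "yes", "yes", "no", "no"],
    "price": [12000.0, 30000.0, 8000.0, 3000.0, 5000.0],
})

# Count, median and maximum price for registered and non-registered cars.
summary = df.groupby("registration")["price"].agg(["count", "median", "max"])
print(summary)
```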
4.5. What is the car price distribution based on Engine Value?
Apart from a few outliers, most cars are priced between 0 and 150000 and have an engine value between 0 and 6.
4.6. Which engine type do users prefer the most?
Petrol cars are the most preferred, followed by Diesel, Gas and others.
4.7 Establish correlation between all the features using heatmap.
mileage and engV are negatively correlated with year.
engV is positively correlated with mileage and price.
A positive correlation is observed between year and price too.
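The matrix that a heatmap visualises can be computed with DataFrame.corr on the numeric columns. The rows below are hypothetical, so the resulting coefficients illustrate the mechanics only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with the four numeric columns named in this notebook.
df = pd.DataFrame({
    "year": [2005, 2008, 2011, 2014, 2016],
    "mileage": [250, 180, 120, 60, 20],
    "price": [4000.0, 7000.0, 11000.0, 18000.0, 25000.0],
    "engV": [2.5, 2.0, 2.0, 1.6, 1.8],
})

# Pairwise Pearson correlations; this is the matrix a heatmap would colour.
corr = df[["price", "mileage", "engV", "year"]].corr()
print(corr.round(2))
```

A negative entry means the two columns tend to move in opposite directions, a positive entry that they move together.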
4.8 Distribution of price.
The 'price' mostly varies between 0 and 80000.
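A statement like this can be checked by measuring the share of rows under the threshold. The sketch uses hypothetical prices; the 80000 cut-off comes from the observation above.

```python
import pandas as pd

# Hypothetical toy prices; one value lies far above the rest.
df = pd.DataFrame({"price": [5000.0, 12000.0, 30000.0, 79000.0, 250000.0]})

# Fraction of cars priced at or below 80000.
share_below = (df["price"] <= 80000).mean()

# Median price as a robust summary of the distribution's centre.
median_price = df["price"].median()
```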
5 Conclusion
Sedans are sold the most.
Price increases as engine value increases.
Price and mileage go down as engine value decreases.