Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/main/07. Data Analysis with Python/02. Data Wrangling/02. Data Wrangling.ipynb
Views: 4598
Data Wrangling
Objectives
After completing this lab you will be able to:
Handle missing values
Correct data format
Standardize and normalize data
What is the purpose of data wrangling?
Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.
What is the fuel consumption (L/100k) rate for the diesel car?
Import data
You can find the "Automobile Dataset" from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data. We will be using this dataset throughout this course.
Import packages
Reading the dataset from the URL and adding the related headers
First, we assign the URL of the dataset to "filename".
Then, we create a Python list headers containing name of headers.
Use the Pandas method read_csv() to load the data from the web address. Set the parameter "names" equal to the Python list "headers".
Use the method head() to display the first five rows of the dataframe.
As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis.
How to work with missing data?
Steps for working with missing data:
- Identify missing data
- Deal with missing data
- Correct data format
Identify and handle missing values
Identify missing values
Convert "?" to NaN
In the car dataset, missing data comes with the question mark "?". We replace "?" with NaN (Not a Number), Python's default missing value marker for reasons of computational speed and convenience. Here we use the function:.replace(A, B, inplace = True)to replace A by B.
Evaluating for Missing Data
The missing values are converted by default. We use the following functions to identify these missing values. There are two methods to detect missing data:
- .isnull()
- .notnull()
"True" means the value is a missing value while "False" means the value is not a missing value.
Count missing values in each column
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop the method ".value_counts()" counts the number of "True" values.
Based on the summary above, each column has 205 rows of data and seven of the columns containing missing data:
- "normalized-losses": 41 missing data
- "num-of-doors": 2 missing data
- "bore": 4 missing data
- "stroke" : 4 missing data
- "horsepower": 2 missing data
- "peak-rpm": 2 missing data
- "price": 4 missing data
Deal with missing data
How to deal with missing data?- Drop data
a. Drop the whole row
b. Drop the whole column - Replace data
a. Replace it by mean
b. Replace it by frequency
c. Replace it based on other functions
Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely. We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:
Replace by mean:
- "normalized-losses": 41 missing data, replace them with mean
- "stroke": 4 missing data, replace them with mean
- "bore": 4 missing data, replace them with mean
- "horsepower": 2 missing data, replace them with mean
- "peak-rpm": 2 missing data, replace them with mean
Replace by frequency:
- "num-of-doors": 2 missing data, replace them with "four".
- Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur
Drop the whole row:
- "price": 4 missing data, simply delete the whole row
- Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us
Calculate the mean value for the "normalized-losses" column
Replace "NaN" with mean value in "normalized-losses" column
Calculate the mean value for the "bore" column
Replace "NaN" with the mean value in the "bore" column
Question #1:
Based on the example above, replace NaN in "stroke" column with the mean value.
Calculate the mean value for the "horsepower" column
Replace "NaN" with the mean value in the "horsepower" column
Calculate the mean value for "peak-rpm" column
Replace "NaN" with the mean value in the "peak-rpm" column
To see which values are present in a particular column, we can use the ".value_counts()" method:
We can see that four doors are the most common type. We can also use the ".idxmax()" method to calculate the most common type automatically:
The replacement procedure is very similar to what we have seen previously:
Finally, let's drop all rows that do not have price data:
Now we have a dataset with no missing values.
The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
In Pandas, we use:
.dtype() to check the data type
.astype() to change the data type
Let's list the data types for each column
As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.
Convert data types to proper format
Let us list the columns after the conversion
Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.
Data Standardization
Data is usually collected from different agencies in different formats. (Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
What is standardization?
Standardization is the process of transforming data into a common format, allowing the researcher to make the meaningful comparison.
Example
Transform mpg to L/100km:
In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accepts the fuel consumption with L/100km standard.
We will need to apply data transformation to transform mpg into L/100km.
The formula for unit conversion is:
L/100km = 235 / mpg
We can do many mathematical operations directly in Pandas.
Question #2:
According to the example above, transform mpg to L/100km in the column of "highway-mpg" and change the name of column to "highway-L/100km".
Data Normalization
Why normalization?
Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
Example
To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".
Target: would like to normalize those variables so their value ranges from 0 to 1
Approach: replace original value by (original value)/(maximum value)
Question #3:
According to the example above, normalize the column "height".
Here we can see we've normalized "length", "width" and "height" in the range of [0,1].
Binning
Why binning?Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
Example:
In our dataset, "horsepower" is a real valued variable ranging from 48 to 288 and it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?
We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.
Example of Binning Data In Pandas
Convert data to correct format:
Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.
We would like 3 bins of equal size bandwidth so we use numpy's linspace(start_value, end_value, numbers_generated
function.
Since we want to include the minimum value of horsepower, we want to set start_value = min(df["horsepower"]).
Since we want to include the maximum value of horsepower, we want to set end_value = max(df["horsepower"]).
Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.
We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.
We set group names:
We apply the function "cut" to determine what each value of df['horsepower']
belongs to.
Let's see the number of vehicles in each bin:
Let's plot the distribution of each bin:
Look at the dataframe above carefully. You will find that the last column provides the bins for "horsepower" based on 3 categories ("Low", "Medium" and "High").
We successfully narrowed down the intervals from 57 to 3!
Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above.The plot above shows the binning result for the attribute "horsepower".
Indicator Variable (or Dummy Variable)
What is an indicator variable?An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.
Why we use indicator variables?
We use indicator variables so we can use categorical variables for regression analysis in the later modules.
ExampleWe see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" to indicator variables.
We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type.
Get the indicator variables and assign it to data frame "dummy_variable_1":
Change the column names for clarity:
In the dataframe, column 'fuel-type' has values for 'gas' and 'diesel' as 0s and 1s now.
The last two columns are now the indicator variable representation of the fuel-type variable. They're all 0s and 1s now.
Question #4:
Similar to before, create an indicator variable for the column "aspiration"
Question #5:
Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.
Save the new csv: