Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/main/07. Data Analysis with Python/02. Data Wrangling/README.md
Views: 4598
Data Wrangling
Data wrangling is the process of converting data from the initial format to a format that may be better for analysis, so naturally, this lab involves the following activities:
Identifying and handling missing values
Data standardization
Data normalization (centering/scaling)
Binning
Labelling categories using indicator variables (dummy variables)
The output of this data wrangling is the 02. Data Wrangling - clean_df.csv
file.
The methods of data wrangling mentioned above are shown in the sections below.
Missing values
Dropping Missing values
To drop missing values, use dataframe.dropna()
. This won't change the dataframe, unless the parameter inplace = True
is used, i.e.
The argument axis=0
will drop the entire row, and axis=1
will drop the entire column, i.e.
Replacing Missing values
To replace missing values, use
Correcting Data Types
Identifying Data Types
To identify data types, use
Converting Data Types
To convert datatypes, use
For example, converting data type of the column 'price' to integer would be
Normalising Data
The following are ways to normalise data, and each type has its own use case.
Simple Feature Scaling
Min-Max
Z-Score
Binning Data
Binning data is grouping of values into "bins". It converts numeric data into categorical variables, e.g. if item prices varies between £5,000 and £50,000, they can be placed into "low", "medium", and "high" priced items, by defining the bin ranges.
To group into N
bins, we need N+1
numbers as dividers that are equal distance apart. To do so, use the numpy function linspace
to return the array "bins" that contains N+1
equally spaced numbers over the specified interval of the price. A list can then be created that contains the different bin names and the pandas function cut
can be used to segment and sort the data values into bins.
So, for the low, medium, high example:
Note that histograms are a good visualization for data split into bins
One-Hot Encoding
One-hot encoding allows us to go from categorial variables to numerical variables, by adding dummy variables for each unique category and assigning a 0 or 1 in each category.
To do this with Pandas, use the method
E.g. for a column 'fuel' with values of 'petrol' and 'diesel', the following code will create 2 columns called 'petrol' and 'diesel', and assign 0s and 1s in each row depending on which value appeared in the 'fuel' column.
Grouping Data
groupby Method
The Pandas method
can be applied on categorical variables, and groups the data into categories. An aggregate function, e.g. .mean()
can also be applied. For example:
pivot Method
The Pandas pivot
method can be used to show 1 variable displayed along the columns and the other variables displayed along the rows, and the syntax is as follows:
A useful way to visualise this is by using a heatmap, so using matplotlib: