Machine Learning with PyTorch and Scikit-Learn
-- Code Examples
Package version checks
Add the folder to the Python path so that the check_packages.py script can be imported:
Check recommended package versions:
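The check itself can be sketched as follows. This is only an illustration of the pattern: the book's own check_packages.py helper does this more robustly, and the `'..'` path and the version number used below are assumptions, not values from the book.

```python
import sys
from importlib.metadata import version

sys.path.insert(0, '..')  # assumed location of check_packages.py

def meets_minimum(package, minimum):
    """Return True if the installed version of `package` is >= `minimum`,
    comparing only the numeric major/minor components."""
    installed = tuple(int(p) for p in version(package).split('.')[:2] if p.isdigit())
    required = tuple(int(p) for p in minimum.split('.')[:2] if p.isdigit())
    return installed >= required

print('numpy >= 1.21:', meets_minimum('numpy', '1.21'))
```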
Chapter 4 – Building Good Training Datasets – Data Preprocessing
Overview
Dealing with missing data
Identifying missing values in tabular data
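A minimal sketch of the idea, using a toy CSV in the spirit of this section's example (the values are made up):

```python
import pandas as pd
from io import StringIO

# toy CSV with two missing cells
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

df = pd.read_csv(StringIO(csv_data))

# NaN cells show up as True in isnull(); summing counts them per column
print(df.isnull().sum())
```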
Eliminating training examples or features with missing values
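The main `dropna` variants can be sketched on the same toy data (column and threshold values are illustrative):

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

rows_complete = df.dropna(axis=0)     # drop rows containing any NaN
cols_complete = df.dropna(axis=1)     # drop columns containing any NaN
enough_values = df.dropna(thresh=4)   # keep rows with at least 4 non-NaN values
c_present = df.dropna(subset=['C'])   # drop rows where column 'C' is NaN
```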
Imputing missing values
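Mean imputation with scikit-learn's `SimpleImputer`, sketched on made-up numbers (a pandas one-liner such as `df.fillna(df.mean())` achieves the same effect):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

# replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X)
```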
Understanding the scikit-learn estimator API
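The fit/transform contract can be illustrated with any transformer; a minimal sketch (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 10.0]])

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                    # learn parameters (column means) from training data
X_train_t = imp.transform(X_train)  # apply them to the training data
X_test_t = imp.transform(X_test)    # crucially, the SAME fitted parameters transform test data
```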
Handling categorical data
Nominal and ordinal features
Mapping ordinal features
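A sketch in the style of this section's example: an ordinal feature such as size has a known order, so we can define the integer mapping by hand.

```python
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

# hand-crafted mapping encodes the known order M < L < XL
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)

# the inverse mapping recovers the original labels if needed
inv_size_mapping = {v: k for k, v in size_mapping.items()}
```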
Encoding class labels
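Class labels have no order, so any integer assignment works; `LabelEncoder` handles the bookkeeping (label names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(['class2', 'class1', 'class2'])

le = LabelEncoder()
y_enc = le.fit_transform(y)           # labels become integers 0..n_classes-1
y_orig = le.inverse_transform(y_enc)  # and can be mapped back
```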
Performing one-hot encoding on nominal features
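A sketch on a toy nominal column; each row of the one-hot matrix has exactly one 1 (pandas' `get_dummies` is an equivalent shortcut, and passing `drop_first=True` would remove the redundant, collinear column):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'color': ['green', 'red', 'blue']})

ohe = OneHotEncoder()
color_onehot = ohe.fit_transform(X[['color']]).toarray()

# pandas equivalent
dummies = pd.get_dummies(X['color'])
```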
Optional: Encoding Ordinal Features
If we are unsure about the numerical differences between the categories of an ordinal feature, or if the difference between two ordinal values is not defined, we can also encode them using a threshold encoding with 0/1 values. For example, we can split the feature "size", with values M, L, and XL, into two new features, "x > M" and "x > L". Let's consider the original DataFrame:
We can use the apply method of pandas' DataFrames to write custom lambda expressions in order to encode these variables using the value-threshold approach:
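A sketch of the value-threshold approach on a toy size column mirroring the M/L/XL example above:

```python
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3]],
                  columns=['color', 'size', 'price'])

# "x > M" is 1 for L and XL; "x > L" is 1 only for XL
df['x > M'] = df['size'].apply(lambda x: 1 if x in {'L', 'XL'} else 0)
df['x > L'] = df['size'].apply(lambda x: 1 if x == 'XL' else 0)
del df['size']  # the two threshold columns replace the original feature
```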
Partitioning a dataset into separate training and test sets
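A sketch of a stratified split; the chapter uses the UCI Wine dataset, and scikit-learn's bundled copy is substituted here so the example runs offline:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```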
Bringing features onto the same scale
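The two standard approaches, min-max normalization and standardization, sketched on a made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# min-max normalization rescales values to the [0, 1] interval
X_norm = MinMaxScaler().fit_transform(X)

# standardization centers the feature at 0 with unit variance
X_std = StandardScaler().fit_transform(X)
```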
A visual example:
Selecting meaningful features
...
L1 and L2 regularization as penalties against model complexity
A geometric interpretation of L2 regularization
Sparse solutions with L1 regularization
For regularized models in scikit-learn that support L1 regularization, we can simply set the penalty parameter to 'l1' to obtain a sparse solution:
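A sketch on the bundled Wine data (the C value below is illustrative, not the one used in the chapter's runs):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 'liblinear' is one of the solvers that supports the L1 penalty;
# a smallish C (i.e. stronger regularization) drives more weights to exactly zero
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_std, y)

n_zero_weights = (lr.coef_ == 0).sum()
```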
Applied to the standardized Wine data ...
Sequential feature selection algorithms
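The chapter implements sequential backward selection (SBS) from scratch; as a hedged aside, scikit-learn also ships a `SequentialFeatureSelector` whose `direction='backward'` mode removes features greedily in the same spirit (the KNN estimator and target of 3 features below are illustrative choices):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)

# greedily drop features, keeping the subset with the best cross-validated score
sfs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sfs.fit(X_std, y)

X_selected = sfs.transform(X_std)
```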
Assessing feature importance with Random Forests
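A sketch on the bundled Wine data; the forest's impurity-based importances sum to 1 across all features (hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X, y)

# rank features by impurity-based importance, most important first
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for i in indices[:5]:
    print(f'{data.feature_names[i]:<30} {importances[i]:.4f}')
```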
Now, let's print the three features that met the feature-selection threshold we set earlier (note that this code snippet does not appear in the actual book; it was added to this notebook later for illustrative purposes):
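The thresholding step can be sketched with `SelectFromModel`; the 0.1 threshold below is illustrative rather than a value taken from the book's run:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X, y)

# prefit=True reuses the already-trained forest; only features whose
# importance clears the threshold survive the transform
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X)
```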
Summary
...
Readers may ignore the next cell.