Pandas
Pandas is a widely used Python library for storing and manipulating tabular data, where feature columns may be of different types (e.g., scalar, ordinal, categorical, text). We give some examples of how to use it below.
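As a minimal sketch of a table whose columns have different types, the following builds a small DataFrame by hand. The column names and values are made up for illustration and do not come from any dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data: every name and number here is illustrative only.
df = pd.DataFrame(
    {
        "weight_kg": [1200.5, 980.0, 1510.2],  # real-valued column
        "cylinders": [4, 4, 8],  # integer column
        # Ordinal column: an ordered categorical with an explicit level order.
        "size": pd.Categorical(
            ["mid", "low", "high"], categories=["low", "mid", "high"], ordered=True
        ),
        "origin": pd.Categorical(["usa", "japan", "usa"]),  # unordered categorical
        "name": ["car_a", "car_b", "car_c"],  # free text
    }
)

# Each column keeps its own dtype.
print(df.dtypes)

# Ordered categoricals support comparisons such as max().
largest = df["size"].max()
```

Declaring the level order explicitly is what lets pandas treat "size" as ordinal rather than as plain labels.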
For very large datasets, you might want to use modin, which aims to provide the same API as pandas but scales to multiple cores by using dask or ray as the backend.
Install necessary libraries
We notice that there are only 392 non-missing horsepower values, but 398 for each of the other columns. This is because the HP column has 6 missing values (also called NA, or not available). There are 3 main ways to deal with this:
Drop the rows with any missing values using dropna()
Drop any columns with any missing values using dropna(axis=1), or drop a specific column by name using drop()
Replace the missing values with some other value (e.g., the median) using fillna(). (This is called missing value imputation.)
For simplicity, we adopt the first approach.
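The three options can be sketched on a tiny hand-made table. The DataFrame below is a hypothetical stand-in for the car data (its names and values are invented), and the column-dropping variant uses dropna(axis=1), which removes every column that contains a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the car data; values are made up.
df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 36.0, 26.0],
        "horsepower": [130.0, np.nan, 58.0, np.nan],
        "name": ["car_a", "car_b", "car_c", "car_d"],
    }
)

# count() reports the number of non-missing entries per column.
non_missing = df.count()

# Option 1: drop every row that has at least one missing value.
rows_dropped = df.dropna()

# Option 2: drop every column that has at least one missing value.
cols_dropped = df.dropna(axis=1)

# Option 3: impute missing horsepower values with the column median.
hp_median = df["horsepower"].median()
imputed = df.fillna({"horsepower": hp_median})

print(non_missing)
print(rows_dropped.shape, cols_dropped.shape)
print(imputed)
```

None of these calls modifies df in place; each returns a new DataFrame, so the original stays available for comparison.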
Xarray
Xarray generalizes pandas to multi-dimensional indexing. Put another way, xarray is a way to create multi-dimensional numpy arrays where each dimension has a name (instead of having to remember axis ordering), and each position along a dimension can be identified by a coordinate label (instead of only by an integer index). This allows for easier slicing and dicing of data. We give some examples below.
DataArray
A DataArray stores a single, multiply-indexed variable. It is a generalization of a pandas Series.
We create a 2d DataArray, where the first dimension is labeled 'gender' and has values 'male', 'female' and 'other' for its coordinates; the second dimension is labeled 'age', and has integer coordinates. We also associate some arbitrary attributes with the array.
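A minimal sketch of such an array follows. The numeric values, the specific ages, the attribute keys, and the name "population" are all assumptions chosen for illustration.

```python
import numpy as np
import xarray as xr

# Made-up values: one row per gender, one column per age.
values = np.arange(12).reshape(3, 4)

da = xr.DataArray(
    values,
    dims=("gender", "age"),
    coords={"gender": ["male", "female", "other"], "age": [20, 30, 40, 50]},
    attrs={"units": "count", "note": "illustrative only"},  # arbitrary metadata
    name="population",  # hypothetical variable name
)

# Label-based selection: no need to remember which axis is which.
female_at_30 = da.sel(gender="female", age=30).item()

# Position-based selection is still available through isel.
first_row = da.isel(gender=0).values

print(da)
```

Here sel looks values up by coordinate label, while isel uses integer positions, so both styles of indexing can be mixed as needed.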
We can also do broadcasting on DataArrays. Dimensions are matched by name rather than by axis position.
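As a small sketch of name-based broadcasting, adding two 1d arrays that live on different dimensions yields a 2d result. The arrays and their values are made up for illustration.

```python
import xarray as xr

# Two 1d arrays on different named dimensions (illustrative values).
a = xr.DataArray(
    [1, 2, 3], dims="gender", coords={"gender": ["male", "female", "other"]}
)
b = xr.DataArray([10, 20], dims="age", coords={"age": [20, 30]})

# The dimensions do not overlap, so the sum spans both of them.
total = a + b

# Look up one cell by label: 3 (other) + 10 (age 20).
cell = total.sel(gender="other", age=20).item()

print(total)
```

With plain numpy the same operation would need explicit reshaping, for example a[:, None] + b[None, :].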
Dataset
An xarray Dataset is a collection of related DataArrays that can share dimensions and coordinates.
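The sketch below groups two arrays that share the same dimensions into one container. The variable names "height" and "weight" and their constant values are hypothetical.

```python
import numpy as np
import xarray as xr

# Shared coordinates for both variables (illustrative values).
coords = {"gender": ["male", "female", "other"], "age": [20, 30]}

height = xr.DataArray(np.full((3, 2), 170.0), dims=("gender", "age"), coords=coords)
weight = xr.DataArray(np.full((3, 2), 65.0), dims=("gender", "age"), coords=coords)

# A Dataset maps variable names to DataArrays.
ds = xr.Dataset({"height": height, "weight": weight})

# Selecting on the Dataset applies the selection to every variable at once.
subset = ds.sel(age=30)

# Each variable can be pulled out as a DataArray and combined with others.
bmi = ds["weight"] / (ds["height"] / 100) ** 2

print(ds)
```

Keeping related variables in one Dataset means a single sel call slices all of them consistently.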