Understanding Pandas Data Types
When working with pandas on small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and code can fail entirely due to insufficient memory. While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas rather than switching to a different tool.
In this documentation, we'll learn about memory usage in pandas and how to make a pandas DataFrame smaller and faster simply by selecting the appropriate data types for its columns.
We'll first look at the memory usage of each column. Because we're interested in accuracy, we'll set the argument deep to True to get an accurate number.
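Since the original code cell isn't shown here, the following is a sketch of the idea; the DataFrame name df and the file name are assumptions:

```python
import pandas as pd

# Hypothetical: load the data used throughout this notebook.
df = pd.read_csv('data.csv')

# deep=True makes pandas inspect object columns for their true memory footprint.
print(df.memory_usage(deep=True))

# df.info accepts the same option and also reports a total.
df.info(memory_usage='deep')
```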
Under the hood, pandas groups the columns into blocks of values of the same type. Because each data type is stored separately, we're going to examine the memory usage by each data type.
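A sketch of how that per-dtype breakdown might be computed, reusing the assumed df from above:

```python
# Average memory usage for each block of columns sharing a dtype.
for dtype in ['float64', 'int64', 'object']:
    selected = df.select_dtypes(include=[dtype])
    mean_usage_mb = selected.memory_usage(deep=True).mean() / 1024 ** 2
    print(f"Average memory usage for {dtype} columns: {mean_usage_mb:.2f} MB")
```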
Immediately, we can see that most of our memory is used by our object columns. We'll look at those later, but first let's see if we can improve the memory usage of our numeric columns.
Optimizing Numeric Columns
For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. This storage model consumes less space and allows us to access the values themselves quickly.
Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes. The number portion of a type's name indicates the number of bits that type uses to represent values, so the subtypes we just listed use 2, 4, and 8 bytes, respectively. The following table shows the subtypes for the most common pandas types:
memory usage | float | int | uint | datetime | bool |
---|---|---|---|---|---|
1 byte | | int8 | uint8 | | bool |
2 bytes | float16 | int16 | uint16 | | |
4 bytes | float32 | int32 | uint32 | | |
8 bytes | float64 | int64 | uint64 | datetime64 | |
An int8 value uses 1 byte (or 8 bits) to store a value, and can represent 256 values (2^8) in binary. This means that we can use this subtype to represent values ranging from -128 to 127 (including 0). And uint8, which is an unsigned int, can only hold non-negative values, so it represents 256 values ranging from 0 to 255.
We can use the numpy.iinfo class to verify the minimum and maximum values for each integer subtype. Let's look at an example:
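A minimal sketch:

```python
import numpy as np

# Print the representable range for a few integer subtypes.
for int_type in ['int8', 'int16', 'int32', 'int64']:
    print(np.iinfo(int_type))
```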
We can use the function pd.to_numeric() to downcast our numeric types. We'll use DataFrame.select_dtypes to select only the integer columns, then we'll optimize the types and compare the memory usage.
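A sketch of that workflow, again assuming the df from earlier; the mem_usage helper is an illustrative convenience, not necessarily the notebook's original code:

```python
import pandas as pd

def mem_usage(pandas_obj):
    """Return the total (deep) memory usage of a DataFrame or Series, in MB."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_bytes = pandas_obj.memory_usage(deep=True).sum()
    else:  # assume a Series
        usage_bytes = pandas_obj.memory_usage(deep=True)
    return f"{usage_bytes / 1024 ** 2:.2f} MB"

# Select only the int64 columns, downcast them, and compare memory usage.
df_int = df.select_dtypes(include=['int64'])
converted_int = df_int.apply(pd.to_numeric, downcast='integer')

print(mem_usage(df_int))
print(mem_usage(converted_int))
```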
Let's do the same thing with our float columns.
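A corresponding sketch for the floats, reusing df and mem_usage from above:

```python
# Select the float64 columns and downcast them to the smallest sufficient float subtype.
df_float = df.select_dtypes(include=['float64'])
converted_float = df_float.apply(pd.to_numeric, downcast='float')

print(mem_usage(df_float))
print(mem_usage(converted_float))
```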
Optimizing Object Types
The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't give us fine-grained control over how values are stored in memory.
We'll use sys.getsizeof() to prove this out, first by looking at individual strings, and then at items in a pandas Series.
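An illustrative sketch (the example strings are made up):

```python
import sys
import pandas as pd

strings = ['working out', 'memory usage for', 'strings in python is fun!']

# Size of each string as a standalone Python object...
for s in strings:
    print(s, sys.getsizeof(s))

# ...and the size of the same strings when held in a pandas Series.
obj_series = pd.Series(strings)
print(obj_series.apply(sys.getsizeof))
```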
We can see that the sizes of strings stored in a pandas Series are identical to their sizes as separate strings in Python. This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the "address" of the actual value's location in memory. For more background on this, refer to the following link. Blog: Why Python is Slow: Looking Under the Hood
To overcome this problem, pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space-efficient int subtype that can represent all of the unique values in the column.
Since the country and continent columns are strings, they are represented as object types in pandas. Now let's say that, instead of storing strings, we want to store the continent column as integers to reduce the memory required, by converting it to the categorical type. To apply this conversion, we simply have to convert the column type to category using the .astype method.
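A sketch of the conversion, assuming df has a continent column (stored in a new variable here so we can compare it with the original column later):

```python
# Convert the object column to the category dtype.
continent_cat = df['continent'].astype('category')

print(continent_cat.dtype)
print(continent_cat.head())
```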
As we can see, apart from the fact that the type of the column has changed, the data looks exactly the same. Pandas internals will smooth out the user experience so we don’t notice that we’re actually using a compact array of integers.
Let's take a look at what's happening under the hood. In the following code chunk, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.
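Continuing the sketch with the continent_cat series from above:

```python
# The integer codes that back each category value.
print(continent_cat.cat.codes.head())
```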
This column doesn't have any missing values, but if it did, the category subtype would handle them by setting their code to -1.
We can also access the unique categories using the Series.cat.categories attribute. This information serves as the lookup table that maps the integer representation back to the original category.
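Again continuing the sketch:

```python
# The lookup table: position i holds the raw value that code i stands for.
print(continent_cat.cat.categories)
```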
Lastly, let’s look at the memory usage for this column before and after converting to the category type.
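A sketch of that comparison, with the mem_usage helper from the earlier sketch:

```python
# Memory usage before (object) and after (category) the conversion.
print(mem_usage(df['continent']))
print(mem_usage(continent_cat))
```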
We can see that by converting the continent column to integers, we're being more space-efficient. Apart from that, it can actually speed up later operations such as sorting and groupby, since we're storing the strings as compactly as integers. Let's apply this notion again to the country column.
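The same conversion, sketched for the country column:

```python
country_cat = df['country'].astype('category')

print(mem_usage(df['country']))
print(mem_usage(country_cat))
```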
This time, the memory usage for the country column is larger. The reason is that the country column's values are unique. If all of the values in a column are unique, the category type will end up using more memory, because the column stores all of the raw string values in addition to the integer category codes.
Thus we're actually creating 193 (shown below) unique categories, and we also have to store a lookup table for them.
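A sketch of how to check that:

```python
# Number of distinct categories that the conversion has to store.
print(df['country'].nunique())
```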
In summary, if we're working with an object column of strings, we can convert it to the category type to make it more efficient. But this rests on the assumption that the column takes a limited number of unique values; in this case, the continent column only has 6 unique values.
Selecting Types While Reading the Data In
So far, we've explored ways to reduce the memory footprint of an existing dataframe. By reading the dataframe in first and then iterating on ways to save memory, we were able to better understand how much memory we could expect to save from each optimization. As we mentioned earlier, however, we often won't have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can't even create the dataframe in the first place?
Fortunately, we can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and numpy type objects as the values.
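A sketch of how this looks; the file name and the column names/types below are assumptions for illustration:

```python
import pandas as pd

# Map column names to the optimized dtypes we settled on above (hypothetical values).
col_types = {
    'continent': 'category',
    'year': 'int16',
    'population': 'float32',
}

df = pd.read_csv('data.csv', dtype=col_types)
print(df.dtypes)
```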
Or, instead of manually specifying each type, we can leverage a function to automatically perform the memory reduction for us.
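One possible helper along those lines (a sketch, not necessarily the notebook's original function): it downcasts numeric columns and converts low-cardinality object columns to category.

```python
import pandas as pd

def reduce_mem_usage(df, cat_threshold=0.5):
    """Downcast numeric columns and convert low-cardinality object columns to category."""
    df = df.copy()
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif col_type == object:
            # Only convert when the ratio of unique values to rows is small.
            if df[col].nunique() / len(df) < cat_threshold:
                df[col] = df[col].astype('category')
    return df

df_reduced = reduce_mem_usage(df)
print(df_reduced.dtypes)
```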
The idea is that, after performing the memory reduction, we should save the dataframe back to disk so we won't have to go through this process every time in the future (assuming we'll be reading this data again and again).
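For example, we might write it out to a binary format that preserves the dtypes; parquet is used here purely as an illustration (it requires pyarrow or fastparquet), since saving back to csv would drop the type information:

```python
# Persist the optimized dataframe with its dtypes intact.
df_reduced.to_parquet('data_reduced.parquet')

# Later sessions can read the already-optimized dataframe directly.
df = pd.read_parquet('data_reduced.parquet')
```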
Ordered Categorical
Another use of category is to specify an explicit order on its categories, which we can then use for sorting.
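A small sketch with made-up data: an ordered categorical sorts by the declared category order rather than alphabetically.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare an explicit ordering for the categories.
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)

sizes = pd.Series(['medium', 'large', 'small', 'large']).astype(size_type)
print(sizes.sort_values())       # small < medium < large
print(sizes.min(), sizes.max())  # min/max respect the declared order
```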