In this lecture we're going to talk about querying DataFrames. The first step in the process is to understand Boolean masking. Boolean masking is the heart of fast and efficient querying in numpy and pandas, and it's analogous to the bit masking used in other areas of computer science. By the end of this lecture you'll understand how Boolean masking works, and how to apply it to a DataFrame to get out the data you're interested in.
A Boolean mask is an array which can be of one dimension, like a Series, or two dimensions, like a DataFrame, where each value in the array is either True or False. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a True value will be admitted into our final result, and any cell aligned with a False value will not.
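To make the overlay idea concrete, here is a minimal sketch. The toy values and the 'gre score' column are made up for illustration; only the 'chance of admit' column name comes from the lecture's own code.

```python
import pandas as pd

# Hypothetical toy data, not the lecture's dataset.
df = pd.DataFrame(
    {
        "gre score": [337, 324, 316, 322],
        "chance of admit": [0.92, 0.76, 0.72, 0.65],
    },
    index=[1, 2, 3, 4],
)

# Comparing a column to a scalar is applied element by element
# and produces a Boolean Series with the same index as the column.
admit_mask = df["chance of admit"] > 0.7

# Indexing the DataFrame with the mask keeps only the rows
# whose mask value is True.
likely_admits = df[admit_mask]

print(admit_mask)
print(likely_admits)
```

The mask has one True or False per row, and the rows that line up with True are the ones that survive the query.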
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-3d7e76efc1e4> in <module>
5 # Unfortunatly, it doesn't feel quite as natural in pandas. For instance, if you want to take two boolean
6 # series and and them together
----> 7 (df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
1554 "The truth value of a {0} is ambiguous. "
1555 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1556 self.__class__.__name__
1557 )
1558 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
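The error comes from Python's `and` keyword, which asks each Series for a single truth value. A sketch of the usual workaround, using the element-wise `&` operator on the same two comparisons, might look like this. The data values are hypothetical; the column name is taken from the traceback.

```python
import pandas as pd

# Hypothetical toy data; only the column name comes from the traceback.
df = pd.DataFrame({"chance of admit": [0.92, 0.76, 0.72, 0.65, 0.90]})

# `and` needs one truth value per operand, so pandas raises ValueError.
try:
    (df["chance of admit"] > 0.7) and (df["chance of admit"] < 0.9)
    raised = False
except ValueError:
    raised = True

# `&` combines the two masks element by element. The parentheses matter,
# because & binds more tightly than the comparison operators.
combined = (df["chance of admit"] > 0.7) & (df["chance of admit"] < 0.9)

# The comparison methods gt() and lt() give the same result
# without needing the extra parentheses.
combined_methods = df["chance of admit"].gt(0.7) & df["chance of admit"].lt(0.9)

print(df[combined])
```

Either form yields a single Boolean Series that can be used to index the DataFrame; for "or" conditions the corresponding operator is `|`.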
In this lecture, we have learned to query a DataFrame using Boolean masking, which is extremely important and often used in the world of data science. With Boolean masking, we can select data based on the criteria we desire and, frankly, you'll use it everywhere. We've also seen that there are many different ways to query a DataFrame, and the interesting side effects that come up when doing so.