Why Numpy and Pandas?
For a long time, when working with data, programmers who used Python had to use the core Python libraries to manipulate data, which was a bit painful. The modules Numpy and Pandas give us the tools we need to look at data quickly and efficiently, in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and Pandas series and dataframes.
Introduction to Numpy
Numpy is a module that gives us arrays, a more efficient alternative to Python lists that can also be multi-dimensional. Before we look at what Numpy can do, we first have to import the module. As with matplotlib, we will import it under a shorter name for brevity. Here we use "np".
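The import described above conventionally looks like the sketch below. The version print is only an illustrative check that the import worked and is not part of the original guide.

```python
import numpy as np  # "np" is the conventional short alias for numpy

# Optional sanity check: print the installed version
print(np.__version__)
```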
Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly operate like Python lists, but use less memory and have some cool features. To initialise an array, we use the function np.array:
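A minimal sketch of initialising an array; the variable name "a" and the values are chosen here purely for illustration.

```python
import numpy as np

# Build a one-dimensional array from a Python list
a = np.array([1, 2, 3, 4])

print(a)        # [1 2 3 4]
print(a.shape)  # (4,) : one dimension holding four elements
```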
A very important note is that the values must be passed to "np.array" as a single argument, usually a list - Python won't understand you if you pass the numbers separately! For example - the command...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-e75c477c5330> in <module>()
----> 1 aError = np.array(1,2,3,4)
ValueError: only 2 non-keyword arguments accepted
... gives us an error because np.array expects a single list, not four separate numbers.
We can also make two dimensional arrays like so:
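As a sketch with illustrative values, a two-dimensional array can be built by passing a list of lists, where each inner list becomes one row:

```python
import numpy as np

# Each inner list becomes one row of the array
b = np.array([[1, 2, 3],
              [4, 5, 6]])

print(b.shape)  # (2, 3) : two rows, three columns
print(b.ndim)   # 2 : the number of dimensions
```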
Numpy also comes with some functions to generate arrays - for example the function linspace() gives us an array of numbers equally spread out between its first two arguments, with the third argument setting how many numbers to generate. For example:
np.linspace(0, 10, 21) gives us an array of 21 elements equally spaced from 0 to 10.
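A short sketch of the linspace call described above; the checks on length and spacing are added here for illustration.

```python
import numpy as np

# 21 points from 0 to 10 inclusive, so neighbouring points are 0.5 apart
x = np.linspace(0, 10, 21)

print(len(x))       # 21
print(x[0], x[-1])  # 0.0 10.0 : both endpoints are included
```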
Numpy arrays, unlike Python lists, can have operations performed on them directly:
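The sketch below, using illustrative values, shows arithmetic applied to every element at once, with the equivalent list expression for contrast.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

doubled = a * 2   # every element multiplied by 2
summed = a + a    # element-by-element addition
squared = a ** 2  # every element squared: 1, 4, 9, 16

print(doubled)  # [2 4 6 8]
print(summed)   # [2 4 6 8]

# The same operator on a plain list repeats the list instead
repeated = [1, 2, 3, 4] * 2
print(repeated)  # [1, 2, 3, 4, 1, 2, 3, 4]
```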
We can also use numpy arrays in the same way as Python lists for graphing:
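A minimal plotting sketch, assuming matplotlib is installed; the curve (x squared) and the axis labels are chosen here only as an example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 21)  # 21 evenly spaced x values
y = x ** 2                  # arrays can be transformed directly before plotting

fig, ax = plt.subplots()
ax.plot(x, y)               # arrays are passed exactly as lists would be
ax.set_xlabel("x")
ax.set_ylabel("x squared")
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.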
Numpy arrays can also be indexed and sliced in the same way as Python lists:
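A sketch of indexing and slicing on a one-dimensional array, using illustrative values:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

first = a[0]          # first element
last = a[-1]          # negative indices count from the end
middle = a[1:4]       # elements at positions 1, 2 and 3
every_other = a[::2]  # every second element

print(first, last)   # 10 50
print(middle)        # [20 30 40]
print(every_other)   # [10 30 50]
```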
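Apart from direct operations on arrays, this may seem a little redundant. So why do we use Numpy arrays over Python lists? Numpy arrays store elements of a single type in one contiguous block of memory and carry out their operations in compiled code behind the scenes, so they use less memory and run much faster than lists on large datasets.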
Pandas
Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important of these libraries is Pandas, which we use to create series and dataframes, i.e. tables.
As always, we first need to import the library. With pandas we use the alias "pd" by convention:
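The conventional import is sketched below; as before, the version print is only an illustrative check.

```python
import pandas as pd  # "pd" is the conventional short alias for pandas

# Optional sanity check: print the installed version
print(pd.__version__)
```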
Pandas' functionality is built around two Python objects - the series and the dataframe. To make a series, we can use the Series function (note: this is case sensitive!):
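A minimal sketch of creating a series from a list; the values are illustrative.

```python
import pandas as pd

# With no index given, pandas labels the rows 0, 1, 2, ...
s = pd.Series([1.5, 2.5, 3.5, 4.5])

print(s)              # index labels on the left, values on the right
print(list(s.index))  # [0, 1, 2, 3]
```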
The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we add another argument, index, to the Series function:
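A sketch of supplying a custom index; the month labels and values here are hypothetical.

```python
import pandas as pd

# The index argument replaces the default 0, 1, 2, ... labels
s = pd.Series([1.5, 2.5, 3.5, 4.5], index=["Jan", "Feb", "Mar", "Apr"])

print(s["Feb"])  # 2.5 : values can now be looked up by label
```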
Dataframes are just a collection of series. To make a dataframe, we have a few options, all using the DataFrame function (again notice the capitals!).
We can pass a two-dimensional numpy array as an argument, along with column names (here we use the random sublibrary of numpy to give us a 6x4 array of random numbers):
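A sketch of this approach; the column names "A" to "D" are an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# 6 rows and 4 columns of random numbers
data = np.random.randn(6, 4)

# The columns argument names each column of the array
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])

print(df.shape)  # (6, 4)
```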
Another way is to pass a dictionary as our argument to the function, where each key becomes a column name and each value becomes that column's data - here we use the function "pd.date_range" to generate a sequence of dates, starting from 30/03/2017 and ending at 02/04/2017:
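A sketch of the dictionary approach. The date range matches the one described above; the column names and the other values are hypothetical.

```python
import pandas as pd

# Daily dates from 30 March 2017 to 2 April 2017 inclusive (4 days)
dates = pd.date_range("2017-03-30", "2017-04-02")

# Each key becomes a column name, each value that column's data
df2 = pd.DataFrame({
    "date": dates,
    "value": [1.0, 2.0, 3.0, 4.0],   # hypothetical numbers
    "label": ["a", "b", "c", "d"],   # hypothetical labels
})

print(df2)
```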
Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands to give us summary information about our data without bringing up the whole frame.
The head and tail methods give us the first and last rows of the frame (five by default, or however many we ask for):
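A sketch using a hypothetical 20-row frame of random numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 4), columns=["A", "B", "C", "D"])

top = df.head()      # first 5 rows by default
bottom = df.tail(3)  # last 3 rows

print(top)
print(bottom)
```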
The describe method can give us some summary statistics of our data:
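A sketch of describe on a small hypothetical frame, so the statistics are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# One row per statistic: count, mean, std, min, quartiles and max
summary = df.describe()

print(summary)
print(summary.loc["mean", "A"])  # 2.5
```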
Accessing columns and rows of a dataframe is similar to lists and arrays - for columns we index as normal:
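A sketch of column access on a hypothetical frame; selecting several columns with a list of names is an extra shown here for completeness.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

col = df["A"]            # a single column comes back as a Series
subset = df[["A", "C"]]  # a list of names gives a smaller DataFrame

print(col)
print(subset)
```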
For rows, we need to use loc, which takes the row's index label in square brackets rather than parentheses:
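A sketch of row access with loc, using a hypothetical frame whose rows are labelled "x", "y" and "z":

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# loc looks a row up by its index label and returns it as a Series
row = df.loc["y"]

print(row)
```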
And finally, to get individual values, we can use a double index:
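A sketch of picking out a single value from a hypothetical frame. The double index selects the column first and then the row label; the loc form with both labels is an alternative added here, not taken from the original guide.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

value = df["A"][2]          # column "A", then the row labelled 2
same_value = df.loc[2, "A"]  # row label and column name in one step

print(value, same_value)  # 3 3
```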
Worked Example
We're going to take a closer look at the random.randn function to see how this data is distributed:
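One way to look at the distribution is sketched below. The name my_data, the sample size of 10000 and the fixed seed are assumptions made here for illustration; the original cell is not available.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the numbers are repeatable

# One column of 10000 samples from random.randn
my_data = pd.DataFrame(np.random.randn(10000, 1))

# Summary statistics: look at the "mean" and "std" rows
print(my_data.describe())
```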
So our random data has a mean of about 0 and a standard deviation of around 1. This seems to be a standard normal distribution, and in fact that's true - the "n" of "randn" stands for normal. We can illustrate this using a graph:
Here we can see the cumulative distribution function of the normal distribution!
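A graph of this kind could be produced along the lines sketched below, assuming matplotlib is installed. This is not the original cell: the name my_data, the sample size and the seed are assumptions. The sketch sorts the samples and plots, for each one, the fraction of samples at or below it, which traces out an empirical cumulative distribution function.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed so the plot is repeatable
my_data = pd.DataFrame(np.random.randn(10000, 1))

# Sort the single column (labelled 0) in ascending order
sorted_values = my_data.sort_values(0)[0].to_numpy()

# Fraction of samples at or below each sorted value: 1/n, 2/n, ..., 1
fraction = np.arange(1, len(sorted_values) + 1) / len(sorted_values)

fig, ax = plt.subplots()
ax.plot(sorted_values, fraction)
ax.set_xlabel("value")
ax.set_ylabel("fraction of samples at or below value")
```

The curve should rise in an S shape from 0 to 1 and pass close to 0.5 at a value of 0.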
Mini Project
Below we have a dataframe of marks in a class - can you find out:
The average mark for English?
Each student's average mark? (DON'T do this manually!!!)
The subject which students scored the least marks in?
The student who got the most marks in the class overall?