
Why Numpy and Pandas?

For a long time, Python programmers working with data had to rely on the core Python libraries to manipulate it, which was a bit painful. The Numpy and Pandas modules give us the tools to look at data quickly and efficiently, and in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and with Pandas series and dataframes.

Introduction to Numpy

Numpy is a module that gives us more efficient, optionally multi-dimensional, list-like arrays. Before we look at what Numpy can do, we first have to import the module. As with matplotlib, we import it under a shorter name for brevity; here we use "np".

import numpy as np

Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly behave like Python lists, but use less memory and have some cool extra features. To initialise an array, we use the function np.array:

a = np.array([1,2,3,4])
print(a)
[1 2 3 4]

A very important note is that the "np.array" function's argument is a list - Python won't understand you if you give it something else! For example - the command...

aError = np.array(1,2,3,4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-e75c477c5330> in <module>()
----> 1 aError = np.array(1,2,3,4)

ValueError: only 2 non-keyword arguments accepted

... gives us an error, because np.array expects a single list, not four separate numbers.

We can also make two dimensional arrays like so:

a = np.array([ [1,2,3,4] , [5,6,7,8] ])
print(a)
[[1 2 3 4]
 [5 6 7 8]]
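Two-dimensional arrays also keep track of their own shape, and can be indexed with a pair of row and column numbers. As a quick sketch (not part of the original notebook), using the array we just made:

print(a.shape)    # (2, 4): 2 rows and 4 columns
print(a[1, 2])    # the element in row 1, column 2, i.e. 7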

Numpy also comes with some functions to generate arrays - for example the function linspace() gives us an array of evenly spaced numbers: the first two arguments are the start and end points, and the third is how many numbers we want. For example:

b = np.linspace(0,10,21)
print(b)
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5 10. ]

This gives us an array of 21 elements equally spaced from 0 to 10.
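linspace isn't the only generator worth knowing - a couple of others, as a small sketch that isn't part of the original notebook:

print(np.arange(0, 10, 2))   # like Python's range: start, stop (exclusive), step
print(np.zeros(5))           # an array of five zeros
print(np.ones((2, 3)))       # a 2x3 array filled with ones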

Numpy arrays, unlike Python lists, can have mathematical operations applied to them directly:

print(b**2)
[ 0. 0.25 1. 2.25 4. 6.25 9. 12.25 16. 20.25 25. 30.25 36. 42.25 49. 56.25 64. 72.25 81. 90.25 100. ]

We can also use numpy arrays in the same way as Python lists for graphing:

import matplotlib.pyplot as plt #We need to import our graphing library!
plt.plot(b,b**2)
plt.show()
[Plot of b against b**2 produced by the cell above]

Numpy arrays can also be indexed and sliced in the same way as Python lists:

print(b[0])
0.0
print(b[0:11])
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. ]
for element in b:
    print(str(element) + " is an element of our array!")
0.0 is an element of our array!
0.5 is an element of our array!
1.0 is an element of our array!
1.5 is an element of our array!
2.0 is an element of our array!
2.5 is an element of our array!
3.0 is an element of our array!
3.5 is an element of our array!
4.0 is an element of our array!
4.5 is an element of our array!
5.0 is an element of our array!
5.5 is an element of our array!
6.0 is an element of our array!
6.5 is an element of our array!
7.0 is an element of our array!
7.5 is an element of our array!
8.0 is an element of our array!
8.5 is an element of our array!
9.0 is an element of our array!
9.5 is an element of our array!
10.0 is an element of our array!

Apart from performing operations on whole arrays at once, this may all seem a little redundant. So why use Numpy arrays over Python lists? The short answer is that, thanks to some behind-the-scenes work, they use less memory and run faster.
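We can get a rough feel for this ourselves. The sketch below (not part of the original notebook, and assuming np is already imported as above) compares the memory used and the time taken to square a million numbers stored as a list and as an array; exact figures will vary from machine to machine:

import sys
import time

n = 1000000
pyList = list(range(n))
npArray = np.arange(n)

# Memory: getsizeof only counts the list structure itself, not the number
# objects it points to, so the true gap is even bigger than it looks here.
print(sys.getsizeof(pyList), "bytes for the list structure")
print(npArray.nbytes, "bytes for the array's data")

# Speed: square every element both ways and time it.
start = time.time()
listSquares = [x**2 for x in pyList]
print("list comprehension:", time.time() - start, "seconds")

start = time.time()
arraySquares = npArray**2
print("numpy operation:", time.time() - start, "seconds")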

Pandas

Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build on. One of the most important of these is Pandas, which we use to create series and dataframes, i.e. tables.

As always, we first need to import the library. With pandas we use the alias "pd" by convention:

import pandas as pd

Pandas' functionality is built around two Python objects - the series and the dataframe. To make a series, we use the Series function (note: this is case sensitive!):

c = pd.Series([1,1,2,3,5,8])
print(c)
0    1
1    1
2    2
3    3
4    5
5    8
dtype: int64

The numbers in the left column are our index - it can be helpful to change this, for example if our data is time-based. To do this, we pass another argument to the Series function:

c = pd.Series([1,1,2,3,5,8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)
Monday       1
Tuesday      1
Wednesday    2
Thursday     3
Friday       5
Saturday     8
dtype: int64
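One payoff of a labelled index is that we can look values up by label instead of by position. A quick sketch (not from the original notebook), using the series we just made:

print(c["Wednesday"])          # the value labelled Wednesday, i.e. 2
print(c["Tuesday":"Friday"])   # label slices include both endpoints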

Dataframes are just collections of series. To make a dataframe we have a few options, all using the DataFrame function (again, notice the capitals!).

We can pass a two-dimensional numpy array as an argument, along with column names (here we use the random sublibrary of numpy to give us a 6x4 array of random numbers):

d = pd.DataFrame(np.random.randn(6,4), columns=['A','B','C','D'])
print(d) #Note, if you are using jupyter notebook, just outputting d here instead of printing it will give you a nicer format.
          A         B         C         D
0  1.198576 -0.969093 -1.067570 -0.192506
1  0.540026  0.378931  0.475101  0.670622
2  0.653270  0.987377 -0.456903 -0.447462
3 -0.085078  0.696090  0.790374  1.335124
4  0.518464  0.426724  0.971381 -0.514355
5 -1.825887  0.891358  0.999850 -0.022633

Another way is to pass a dictionary as our argument to the function - here we use the function "pd.date_range" to generate an array of dates, starting at 30/03/2017 and ending at 02/04/2017:

e = pd.DataFrame({'A': [1,2,3,4], 'B':"hello", 'C': pd.date_range('20170330', periods=4)})
print(e)
   A      B          C
0  1  hello 2017-03-30
1  2  hello 2017-03-31
2  3  hello 2017-04-01
3  4  hello 2017-04-02
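Since a dataframe really is just a collection of series, we could also build one from a dictionary of series - a small sketch, not part of the original notebook:

f = pd.DataFrame({"values": pd.Series([1,1,2,3,5,8]),
                  "doubled": pd.Series([2,2,4,6,10,16])})
print(f)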

Most of the time during data analysis, we will be looking at tables that are much larger than just four or five rows, so it is helpful to know some commands that give us summary information about our data without bringing up the whole frame.

The head and tail methods give us the first and last row(s) of the frame:

print(d.head(3)) #Looks at the first 3 rows of our "d" dataframe
          A         B         C         D
0  1.198576 -0.969093 -1.067570 -0.192506
1  0.540026  0.378931  0.475101  0.670622
2  0.653270  0.987377 -0.456903 -0.447462
print(e.tail(1)) #Looks at the last row of our "e" dataframe
   A      B          C
3  4  hello 2017-04-02

The describe method can give us some summary statistics of our data:

print(d.describe())
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.166562  0.401898  0.285372  0.138132
std    1.058152  0.714096  0.854814  0.723901
min   -1.825887 -0.969093 -1.067570 -0.514355
25%    0.065807  0.390880 -0.223902 -0.383723
50%    0.529245  0.561407  0.632738 -0.107570
75%    0.624959  0.842541  0.926129  0.497308
max    1.198576  0.987377  0.999850  1.335124

Accessing columns and rows of a dataframe is similar to lists and arrays - for columns we index as normal:

print(d['A'])
0    1.198576
1    0.540026
2    0.653270
3   -0.085078
4    0.518464
5   -1.825887
Name: A, dtype: float64

For rows, we need to use the loc indexer:

print(d.loc[0])
A    1.198576
B   -0.969093
C   -1.067570
D   -0.192506
Name: 0, dtype: float64

And finally, to get individual values, we can use a double index:

d['A'][0]
1.1985760283116749

Worked Example

We're going to take a closer look at the random.randn function to see how the data it produces is distributed:

import numpy as np
import pandas as pd

myData = pd.DataFrame(np.random.randn(100000))
print(myData.describe())
                   0
count  100000.000000
mean       -0.001092
std         0.997733
min        -4.349732
25%        -0.675322
50%        -0.000223
75%         0.671825
max         4.731422

So our random data has a mean of about 0 and a standard deviation of around 1. This looks like a standard normal distribution, and in fact that's exactly what it is - the "n" of "randn" stands for normal. We can illustrate this with a graph:

import matplotlib.pyplot as plt

mySortedData = myData.sort_values(0) #sorts the data in ascending order
x = np.linspace(-10, 10, 100000) #setting up a dummy array of equally spaced values
plt.plot(mySortedData, x)
plt.show()
[Plot of the sorted random data against the dummy array]

Here we can see the shape of the cumulative distribution function of the normal distribution!
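Another way to see the same thing (a sketch that isn't part of the original notebook, assuming the myData frame from above is still defined) is to plot a histogram of the raw values, which should trace out the familiar bell curve:

import matplotlib.pyplot as plt   # already imported above, repeated for completeness

plt.hist(myData[0], bins=100)     # column 0 holds our 100,000 random numbers
plt.show()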

Mini Project

Below we have a dataframe of marks in a class - can you find out:

  • The average mark for English?

  • Each student's average mark? (DON'T do this manually!!!)

  • The subject which students scored the least marks in?

  • The student who got the most marks in the class overall

import numpy as np
import pandas as pd

subjects = ["Maths","English","Science","Geography","History","Languages"]
marks = pd.DataFrame({"Alice": [85, 86, 98, 94, 2, 39],
                      "Billy": [55, 26, 69, 39, 47, 15],
                      "Cameron": [80, 5, 28, 28, 44, 37],
                      "David": [5, 22, 95, 71, 62, 6],
                      "Ellie": [75, 93, 66, 18, 87, 60],
                      "Faye": [72, 0, 63, 51, 65, 83],
                      "Garry": [67, 92, 62, 35, 0, 79],
                      "Harriet": [51, 17, 87, 31, 91, 99],
                      "Izzy": [63, 37, 58, 26, 39, 51],
                      "James": [17, 7, 88, 27, 6, 16],
                      "Katie": [15, 77, 12, 54, 81, 0],
                      "Liam": [25, 35, 80, 71, 71, 9],
                      "Mason": [70, 78, 4, 19, 61, 77],
                      "Noah": [78, 96, 86, 42, 73, 51],
                      "Olivia": [75, 81, 23, 19, 76, 3],
                      "Patrick": [43, 50, 87, 94, 33, 65],
                      "Quinn": [72, 1, 80, 96, 76, 56],
                      "Ross": [3, 25, 30, 49, 84, 7],
                      "Sam": [67, 29, 91, 64, 11, 43],
                      "Terri": [63, 36, 70, 73, 13, 25],
                      "Umar": [70, 30, 47, 71, 25, 57],
                      "Veronica": [88, 34, 29, 92, 82, 62],
                      "Will": [89, 11, 14, 56, 78, 63]},
                     index=subjects)
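If you get stuck, here is a hint (demonstrated on the small d frame from earlier rather than on the marks table, so the exercise stays yours): the mean and sum methods work on whole dataframes, the axis argument switches between columns and rows, and idxmax/idxmin give the label of the largest or smallest entry.

print(d.mean())           # average of each column (one number per column)
print(d.mean(axis=1))     # average of each row instead
print(d.sum().idxmax())   # label of the column with the largest total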