Lab 2: Arrays and DataFrames
Due Saturday, July 12th at 11:59pm
Welcome to Lab 2! This week, we'll learn about arrays, which allow us to store sequences of data, and DataFrames, which let us work with multiple arrays of data about the same things. These topics are covered in Notes 7-10 of the course notes.
Please do not use for-loops for any questions in this lab. If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.
First, set up imports by running the cell below.
1. Arrays
Computers are most useful when you can use a small amount of code to do the same action to many different things.
For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day. (That's if you're pretty fast at doing arithmetic in your head!)
Arrays are how we put many values in one place so that we can operate on them as a group. For example, if billions_of_numbers is an array of numbers, the expression billions_of_numbers * 0.18 gives a new array of numbers that's the result of multiplying each number in billions_of_numbers by 0.18 (18%). Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
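To see this on a small scale, here is a minimal sketch that uses three made-up bill amounts in place of billions_of_numbers. The name bills and its values are assumptions chosen for this example only.

```python
import numpy as np

# Hypothetical bill amounts, standing in for billions_of_numbers.
bills = np.array([20.0, 45.5, 100.0])

# Multiplying an array by a number multiplies every element by that number.
tips = bills * 0.18
print(tips)
```

The result has the same number of elements as bills, one tip per bill.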
Concretely, an array is a collection of values of the same type, like a column in a spreadsheet (think Google Sheets or Microsoft Excel).

1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how we'll create arrays. Normally, we create arrays by loading them from an external source, like a data file.
First, though, let's learn how to do it the hard way. To begin, we can make a list of numbers by putting them within square brackets:
Just like int, float, and str, the list is a datatype provided by Python. Lists are very flexible and easy to work with, but they are slowwww 🐢.
As data scientists, we'll often be working with millions or even billions of numbers. For this, we need something faster than a list. Instead of lists, we will use arrays.
Arrays are provided by a package called NumPy (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee"). The package is called numpy, but it's standard to rename it np for brevity. You can do that with:
Data scientists, as well as engineers and scientists of all kinds, use numpy frequently, and you'll see quite a bit of it if you're a data science major.
Now, to create an array, call the function np.array with a list of numbers. Run this cell to see an example:
Note that you need the square-brackets here. If you were to try running the following code, Python would yell at you because you forgot them:
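The sketch below shows both cases side by side: a call with the square brackets, and a call without them wrapped in try/except so the example keeps running. The exact error message depends on your NumPy version, so treat the printed text as illustrative.

```python
import numpy as np

# With square brackets: the list [1, 2, 3] is the single argument.
numbers = np.array([1, 2, 3])
print(numbers)

# Without square brackets: 1, 2, and 3 are passed as three separate
# arguments, which np.array does not accept.
try:
    np.array(1, 2, 3)
    failed = False
except (TypeError, ValueError) as error:
    failed = True
    print("Error:", error)
```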

Arrays themselves are also values, just like numbers and strings. That means you can assign them names or use them as arguments to functions.
Question 1.1.1. Make an array containing the numbers 2, 4, and 6, in that order. Name it even_numbers.
Question 1.1.2. Make an array containing the numbers 0, -1, 1, π, and e, in that order. Name it odd_numbers.
Hint: π and e are available from the np module, which has already been imported. Just as you used math.pi to get π in the last lab, you can use np.pi to get π as well. Do not import the math module.
Question 1.1.3. Make an array containing the five strings "Hello", ",", " ", "world", and "!". (The third one is a single space inside quotes.) Name it hello_world_components.
Note: If you print hello_world_components, you'll notice some extra information in addition to its contents: dtype='<U5'. That's just NumPy's extremely odd way of saying that the things in the array are strings. In case you're interested, the U means that the strings are Unicode, the 5 means that every string in the array is at most 5 characters long, and the < describes the byte order in which the characters are stored.
Very often in data science, we want to work with many numbers that are evenly spaced within some range. NumPy provides a special function for this called arange. The expression np.arange(start, stop, space) produces an array with all the numbers starting at start and counting up by space, stopping before stop is reached.
For example, the value of np.arange(1, 8, 2) is an array with elements 1, 3, 5, and 7 -- it starts at 1 and counts up by 2, then stops before 8. In other words, it makes the same array as np.array([1, 3, 5, 7]).
np.arange(4, 9, 1) is an array with elements 4, 5, 6, 7, and 8. (It doesn't contain 9 because np.arange stops before the stop value is reached.)
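The two examples above can be checked directly. This is a small sketch; the variable names are ours, not part of the lab.

```python
import numpy as np

odds = np.arange(1, 8, 2)           # start at 1, count up by 2, stop before 8
four_to_eight = np.arange(4, 9, 1)  # start at 4, count up by 1, stop before 9

print(odds)                # elements 1, 3, 5, 7
print(four_to_eight)       # elements 4, 5, 6, 7, 8
print(len(four_to_eight))  # 5 elements; 9 is not included
```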
Question 1.1.4. Use np.arange to create an array with all the multiples of 99 from 0 up to (and including) 9999. (So its elements are 0, 99, 198, 297, etc.)
Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States. The hourly readings are publicly available.
Suppose we download all the hourly data from the San Diego, California site for the month of December 2021. To analyze the data, we want to know when each reading was taken, but we find that the data doesn't include the timestamps of the readings (the time at which each one was taken).
However, we know the first reading was taken at the first instant of December 2021 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.
Question 1.1.5. Create an array of the time, in seconds, since the start of the month at which each hourly reading was taken. Name it collection_times.
Hint 1: There are 31 days in December, which is equivalent to 31 × 24 hours or 31 × 24 × 60 × 60 seconds.
Hint 2: The len function works on arrays, too. If your collection_times isn't passing the tests, check its length and make sure it has 31 × 24 elements, since readings are taken hourly for 31 days.
1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset. The next cell creates an array called population that includes estimated world populations in every year from 1950 to 2022. (The estimates come from the International Database, maintained by the US Census Bureau.)
Rather than type in the data manually, we've loaded them from a file on your computer called world_population_2022.csv. You'll learn how to read in data from files very soon.
Here's how we get the first element of population, which is the world population in the first year in the dataset, 1950.
Notice that we use square brackets here. The square brackets signal that we are accessing an element of the array. Square brackets in Python are kind of like subscripts in math.
The value of that expression is the number 2557619597 (around 2.5 billion), because that's the first thing in the array population.
Notice that we wrote population[0], not population[1], to get the first element. This is a weird convention in computer science. 0 is called the index of the first item. It's the number of elements that appear before that item. So 3 is the index of the 4th item.
Here are some more examples. In the examples, we've given names to the things we get out of population. Read and run each cell.
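If you'd like to experiment with indexing away from the population data, here is a minimal sketch on a tiny made-up array. The name heights and its values are assumptions for illustration only.

```python
import numpy as np

# Made-up values, not the real population data.
heights = np.array([150, 160, 170, 180])

first = heights[0]                # index 0: no elements come before it
fourth = heights[3]               # index 3: three elements come before it
last = heights[len(heights) - 1]  # the last index is one less than the length

print(first, fourth, last)
```

In this four-element array, the fourth element is also the last one, so fourth and last are the same value.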
Question 1.2.1. Set population_1998 to the world population in 1998, by getting the appropriate element from population.
1.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to access and work with single elements.
Logarithms
Here is one simple question we might ask about world population:
How big was the population in orders of magnitude in each year?
The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
We could try to answer our question like this, using the log10 function from NumPy on each element of the population array:
But this is tedious and repetitive. There must be a better way!
It turns out that NumPy's log10 is pretty powerful. Not only can it take in a single number (like population[0]) as input and return the logarithm of a single number, but it can also take in an entire array of numbers and return the logarithm of each element in that array!
If you give NumPy's log10 an array as input, it will return an array of the same length, where the first element of the result is the logarithm of the first element of the input, the second element of the result is the logarithm of the second element of the input, and so on.

This is called elementwise application of the function, since it operates separately on each element of the array it's called on.
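Here is a minimal sketch of elementwise application on a small made-up array (the name values is an assumption, not something defined in the lab).

```python
import numpy as np

values = np.array([10, 1000, 100000])  # made-up numbers

# One call computes the logarithm of every element.
magnitudes = np.log10(values)

print(magnitudes)
print(len(magnitudes) == len(values))  # the result has the same length as the input
```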
Question 1.3.1. Use NumPy's log10 function to compute the logarithms of the world population in every year. Give the result (an array of 73 numbers) the name population_magnitudes. Your code should be very short.
Arithmetic
Arithmetic also works elementwise on arrays. For example, you can divide all the population numbers by 1 billion to get numbers in billions:
You can do the same with addition, subtraction, multiplication, and exponentiation (**). For example, you can calculate a twenty percent tip on several restaurant bills at once:
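As a rough sketch of elementwise arithmetic, the cell below uses a few hypothetical bills. The array example_bills is made up; the lab's own restaurant_bills array may hold different values.

```python
import numpy as np

# Hypothetical bills, made up for this example.
example_bills = np.array([20.12, 39.90, 31.01])

tips = 0.2 * example_bills         # twenty percent of each bill
in_hundreds = example_bills / 100  # division is elementwise too
squared = example_bills ** 2       # and so is exponentiation

print(tips)
```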

Question 1.3.2. Suppose the total charge at a restaurant is the original bill plus the tip (20%). That means we can multiply the original bill by 1.2 to get the total charge. Compute the total charge for each bill in restaurant_bills and give the resulting array the name total_charges.
Let's read in some data to use in the next question.
Question 1.3.3. The array more_restaurant_bills contains 100,000 bills! Compute the total charge for each one, assuming again a twenty percent tip, and give the resulting array the name more_total_charges.
The function sum takes a single array of numbers as its argument. It returns the sum of all the numbers in that array (so it returns a single number, not an array).
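For instance, with a short made-up array (the name amounts is an assumption for this sketch):

```python
import numpy as np

amounts = np.array([1.5, 2.5, 6.0])  # made-up numbers

total = sum(amounts)  # a single number, not an array
print(total)
```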
Question 1.3.4. What was the sum of all the bills in more_restaurant_bills, including tips?
Powers of Two
The powers of 2 (2⁰ = 1, 2¹ = 2, 2² = 4, etc.) arise frequently in computer science. (For example, you may have noticed that storage on smartphones or computers comes in powers of 2, like 64 GB, 128 GB, or 256 GB.)
Question 1.3.5. Use np.arange and the exponentiation operator ** to create an array containing the first 40 powers of 2, starting from 2⁰.
Hint 1: Did your kernel "die" when you ran your solution? There is a common incorrect response to this problem that tries to create an array with so many entries that Python gives up and crashes. If this happens to you, double-check your answer!
Hint 2: Maybe just start with the first 5 powers of two. Once you get that working, then try all 40. At no point should you have to manually write 0, 1, 2, 3, 4, ...; if you find yourself trying that, scroll up to earlier in the lab notebook.
2. DataFrames (i.e. Tables)
2.1. Introduction
For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection. In a table of states, for example, we might keep track of land area, population, state capital, and the name of the governor. In other words, tables keep track of many entities (individuals, stored as rows), and for each entity, many attributes (features, stored as columns).
In the cell below we have two arrays. The first one contains the world population in each year (as estimated by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).
Suppose we want to answer this question:
When did the world's population surpass 7 billion?
You could technically answer this question just by staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 7 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a table.
Just as numpy provides arrays, a popular package called pandas provides DataFrames, which is pandas' name for tables. pandas is the tool for doing data science in Python. Unfortunately, pandas isn't as cute as its name might suggest: it's very complicated and can be somewhat hard to learn.
Instead of using pandas, we'll use a package that we've created specifically for DSC 10. It is a subset of pandas, including only the parts that we think are necessary and throwing out all of the rest. Because it is smaller (and cuter), we've called it babypandas.

You can import babypandas using the following code:
The nice thing about babypandas is that it is easier to learn but every bit of code you write using babypandas will work with pandas, too. If you're a data science major, or just going to be doing a lot of data analysis in Python, you'll see quite a lot of pandas in your future.
The cell below:
- creates an empty DataFrame using the expression bpd.DataFrame(),
- assigns two columns to the DataFrame by calling assign,
- assigns the resulting DataFrame to the name population_df, and finally
- displays population_df so that we can see the DataFrame we've made.
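The steps above can be sketched on a tiny made-up dataset. The arrays example_populations and example_years and the name example_df are assumptions for illustration; they are not the real population data. The import falls back to pandas if babypandas isn't installed, relying on the lab's statement that babypandas code also works in pandas.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up data, not the real population estimates.
example_populations = np.array([100, 110, 125])
example_years = np.array([2000, 2001, 2002])

# Start from an empty DataFrame and add two labeled columns.
example_df = bpd.DataFrame().assign(
    Population=example_populations,
    Year=example_years,
)
print(example_df)
```

Each keyword passed to assign becomes a column label, and the array given for it becomes that column's values.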
"Population" and "Year" are column labels that we have chosen. We could have chosen anything, but it's a good idea to choose names that are descriptive and not too long.
Now the data are all together in a single DataFrame! It's much easier to parse this data. If you need to know what the population was in 2011, for example, you can tell from a single glance. We'll revisit this DataFrame later.
Question 2.1.1. In the cell below, we've created 2 arrays. Using the steps above, assign top_10_movies to a DataFrame that has two columns called "Rating" and "Name", which hold top_10_movie_ratings and top_10_movie_names respectively.
Suppose you want to add your own ratings to this table. The cell below contains your ranking of each movie:
Question 2.1.2. You can use the assign method to add a column to an already-existing table, too. Create a new DataFrame called with_ranking by adding a column named "Ranking" to the table in top_10_movies.
2.2. Indexes
You may have noticed that the table of population numbers has what looks like an extra, unlabeled column on the left with the numbers 0 through 72. This is not a column; it's what we call an index. The index contains the row labels. Whereas the columns of this table are labeled "Population" and "Year", the rows are labeled 0, 1, ..., 72.
By default, babypandas doesn't know how to label the rows, and so it just numbers them (starting with 0). Of course, in this case it makes more sense to use the year as a row's label. We can do this by telling babypandas to set the "Year" column as the index:
As we'll see, this does more than make the DataFrame look nicer -- it is very useful, too.
Question 2.2.1. Create a new DataFrame named top_10_movies_by_name by taking the DataFrame you made above, top_10_movies, and setting the index to be the "Name" column.
You can get an array of row names using .index. For instance, the array of row names of the population_by_year DataFrame is:
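Here is a minimal sketch of setting an index and then reading the row labels back with .index. The data and the names example_df and example_by_year are made up for this example, and the import falls back to pandas if babypandas isn't available.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up data, not the real population estimates.
example_df = bpd.DataFrame().assign(
    Population=np.array([100, 110, 125]),
    Year=np.array([2000, 2001, 2002]),
)

# Use the "Year" column as the row labels instead of 0, 1, 2.
example_by_year = example_df.set_index("Year")

print(example_by_year.index)            # the row labels
first_label = example_by_year.index[0]  # square brackets work here too
```

After set_index, "Year" is no longer a regular column, so example_by_year has one column left.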
Question 2.2.2. Using code, assign to tenth_movie the name of the tenth movie in top_10_movies_by_name.
Hint: Remember that the index is an array, and we use square brackets to access elements of an array.
2.3 Reading a DataFrame from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use functions provided by babypandas to read in data from external files.
The bpd.read_csv() function takes one argument, a path to a data file (a string), and returns a DataFrame. There are many formats for data files, but CSV ("comma-separated values") is the most common.
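To try read_csv without touching the lab's data files, the sketch below first writes a tiny made-up CSV file to a temporary folder and then reads it back. The file name example_pets.csv and its contents are assumptions for illustration only.

```python
import os
import tempfile

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

with tempfile.TemporaryDirectory() as folder:
    # Write a tiny made-up CSV file so the example has something to read.
    path = os.path.join(folder, "example_pets.csv")
    with open(path, "w") as f:
        f.write("Name,Legs\nRex,4\nTweety,2\n")

    # The path is a string; the result is a DataFrame.
    pets = bpd.read_csv(path)

print(pets)
```

The first line of the file supplies the column labels, and each later line becomes one row.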
Question 2.3.1. The file data/imdb.csv contains a table of information about the 250 highest-rated movies on IMDb. Load it as a DataFrame called imdb.
Notice the dots in the middle of the DataFrame. This means that a lot of the rows have been omitted. This DataFrame is big enough that only a few of its rows are displayed, but the others are still there. There are 250 movies total.
Where did imdb.csv come from? Take a look at this lab's folder. If you go into the data/ directory, you should see a file called imdb.csv.
Open up the imdb.csv file in that folder and look at the format. What do you notice? The .csv filename ending says that this file is in the CSV (comma-separated value) format.
Question 2.3.2. This is a data set of movies, so it makes sense to use the movie title as the row label. Create a new DataFrame called imdb_by_name which uses the movie title as the index.
2.4. Series
Suppose we're interested primarily in movie ratings. To extract just this column from the table, we use the .get method:
Notice how not only the movie ratings have been returned, but also the name of the movie! This is precisely because we have set the movie title to be the index! For example, if we had asked for the "Rating" column of the original DataFrame, imdb, we would see:
This is one way in which indices are very useful - they provide meaningful labels for the data.
At first glance, it might look like asking for a column using .get returns a table with one column, but that's not quite right. Instead, it returns a special type of thing called a Series:
You can think of a Series as an array with an index. Whereas arrays are simple sequences of numbers without labels, Series can have labels. This is often very useful.
ratings is now a Series which contains the column of movie ratings. Suppose we're interested in the rating of a particular movie: Alien. To do so, we will use the .loc accessor which pulls a value from the Series at a particular location:
There are a couple of things to note here. First, those are square brackets around "Alien". This is because .loc is not a method, but an accessor. The square brackets signal that we're going to be extracting an element from the Series. Second, we passed in the label as a string.
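A minimal sketch of .get followed by .loc, using a made-up table of three films (the names films and film_ratings, the titles, and the ratings are all hypothetical):

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up films, indexed by title.
films = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Rating=np.array([8.1, 8.5, 7.9]),
).set_index("Title")

film_ratings = films.get("Rating")          # a Series, labeled by title
film_b_rating = film_ratings.loc["Film B"]  # square brackets, label as a string

print(type(film_ratings).__name__)
print(film_b_rating)
```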
Question 2.4.1. Find the rating of 3 Idiots.
Now suppose we wanted to know the year in which Alien was released. We could do this by first getting the column of years:
And then using .loc to get the right entry:
We could also do this in one step by chaining the operations together:
This works because Python first evaluates imdb_by_name.get('Year') to a Series. It then evaluates the .loc['Alien'] to return the year.
Chaining is used pretty frequently and can be handy. Just be sure not to chain so many things together that your code gets hard to read. You can always save an intermediate result to a variable.
Question 2.4.2. Find the decade in which Gone Girl was released using chaining.
Hint: imdb_by_name has a column named "Decade".
3. Analyzing datasets
With just a few DataFrame methods, we can answer some interesting questions about the IMDb dataset.
If we want just the ratings of the movies, we can use .get:
Remember that ratings is a Series. Series objects have some useful methods.
Question 3.1. Find the rating of the highest-rated movie in the dataset.
Hint: Type ratings. and hit Tab to see a list of the available methods. Is there one that looks useful?
You probably want to know the name of the movie whose rating you found! To do that, we can sort the whole Series using the .sort_values method:
So there are actually 2 highest-rated movies in the dataset: The Shawshank Redemption and The Godfather.
Notice that we are sorting by the ratings, not the labels! Moreover, the label follows the rating as it is sorted. This is exactly what we want.
When we use the sort_values method, the resulting Series has the data sorted in ascending order, from small to large. This is the default behavior of sort_values, but we can change that. Had we wanted the highest rated movies on top, we would need to specify that the sorting should not be in ascending order with an optional keyword argument:
If we set the keyword argument ascending to True, we get the same result as if we did not set it at all. This is what we mean when we say that the default behavior of sort_values is to sort in ascending order. Confirm that the next two cells give the same output.
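Here is a small sketch of the three ways to call sort_values on a Series, using made-up ratings (the name film_ratings and its values are hypothetical).

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up ratings, labeled by title.
film_ratings = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Rating=np.array([8.1, 8.5, 7.9]),
).set_index("Title").get("Rating")

lowest_first = film_ratings.sort_values()                   # default: ascending
also_lowest_first = film_ratings.sort_values(ascending=True)
highest_first = film_ratings.sort_values(ascending=False)   # largest on top

print(lowest_first)
print(highest_first)
```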
Not only can we sort Series, but we can sort entire DataFrames, too. When we do that, we have to specify the column to sort by:
Similarly, we can specify that the sort should be in descending order:
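As a sketch of sorting a whole DataFrame, the cell below sorts a made-up table of films in both directions. It also checks that the original table is left alone, because sort_values returns a new DataFrame. All names and values here are hypothetical.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up films, indexed by title.
films = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Rating=np.array([8.1, 8.5, 7.9]),
    Year=np.array([1999, 1954, 2010]),
).set_index("Title")

by_rating = films.sort_values("Rating")                        # lowest rating first
by_rating_desc = films.sort_values("Rating", ascending=False)  # highest rating first

print(by_rating_desc)
print(films.index[0])  # still "Film A": the original order is unchanged
```

Whole rows move together, so each film keeps its own year and rating after sorting.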
Some details about sorting a DataFrame:
- The first argument to sort_values is the name of a column to sort by.
- If the column has strings in it, sort_values will sort alphabetically; if the column has numbers, it will sort numerically.
- imdb_by_name.sort_values("Rating") returns a new DataFrame; the imdb_by_name DataFrame doesn't get modified. For example, if we called imdb_by_name.sort_values("Rating"), then running imdb_by_name by itself would still return the unsorted DataFrame. To save the result, you should assign it to a new variable.
- Rows always stick together when a table is sorted. It wouldn't make sense to sort just one column and leave the other columns alone. For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
Question 3.2. Create a version of imdb_by_name that's sorted chronologically, with the earliest movies first. Call it imdb_sorted.
Question 3.3. What's the title of the earliest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.
Hint: Remember that the index is an array.
Suppose we want to get the rating of the oldest movie in the table. One way to do this is to first find the index label of the oldest movie (which we've already done). We then extract the "Rating" column and use .loc to find the rating of the oldest movie.
There's a faster way, though. A Series not only has a .loc accessor, but also an .iloc accessor. While .loc looks up things by label, .iloc looks up elements by integer position.
Let's remember what is in the "Rating" column:
If we want the rating of the first row, we can use .iloc[0]:
This returns the exact same thing as imdb_sorted.get('Rating').loc['The Kid']; these are two ways of doing the same thing. Usually it is more convenient to access an element by its label rather than by its integer position, but both .loc and .iloc are good to know.
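The difference between the two accessors can be sketched on a made-up Series (the name film_ratings, the titles, and the ratings are hypothetical):

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up ratings, labeled by title.
film_ratings = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Rating=np.array([8.1, 8.5, 7.9]),
).set_index("Title").get("Rating")

by_label = film_ratings.loc["Film A"]  # look up by row label
by_position = film_ratings.iloc[0]     # look up by integer position

print(by_label == by_position)
```

Here "Film A" is the label of the row at position 0, so both lookups reach the same entry.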
Question 3.4. What is the rating of the fifth oldest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.
4. Finding pieces of a dataset
Suppose you're interested in movies from the 1950s. Sorting the table by year doesn't help you, because the 1950s are in the middle of the dataset. Instead, we'll use a feature of Series that allows us to easily compare each element in a column to a particular value.
First remember that we can use .get to extract a single column. The result is not a DataFrame, but rather a Series:
We want to check whether each movie was released in the decade 1950. Python gives us a way of checking whether two things are equal with == (remember that = is already being used for another purpose: it assigns values to variable names):
True and False are instances of a type that we haven't seen before:
bool stands for "Boolean", named after the English logician George Boole. We say that "True" and "False" are Boolean values.
It turns out that we can easily check if each of the elements in a Series is equal to something:
We see that the result is a new series which has True only where the decade was 1950, and False everywhere else. We say that the resulting series is a series of Booleans, or a Boolean Series.
Let's call this result is_from_1950s. Its name can be read like it is a question: "Is this movie from the 1950s?"
Each row is an answer to this question. Is The Elephant Man from the 1950s? False. Is All About Eve from the 1950s? True.
We can use is_from_1950s to select only the rows from imdb_by_name for which the answer is True. The syntax for this is:
What imdb_by_name[is_from_1950s] does, precisely, is to go through the table imdb_by_name row by row. If the row named Singin' in the Rain has the value True in is_from_1950s, that row is kept. If the value is False, the row is discarded. And so on, for every row.
Note that we could have accomplished this without ever creating the variable is_from_1950s by simply placing the code that we used to create the boolean series directly inside the [...]. This is a typical pattern you'll be using a lot!
It helps to read the square brackets as "where." So the command in the cell above says to keep all rows from imdb_by_name where the decade is the 1950s.
Creating a new DataFrame by selecting only certain rows from an existing DataFrame which satisfy some condition is called querying. The line of code imdb_by_name[imdb_by_name.get('Decade') == 1950] is a query.
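Putting the pieces of this section together, here is a minimal sketch of a query on a made-up table. The table films, its titles, and its decades are hypothetical; only the pattern matches the lab.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up films, indexed by title.
films = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C", "Film D"]),
    Decade=np.array([1950, 1990, 1950, 2000]),
).set_index("Title")

print(type(1950 == 1950))  # comparing two values gives a bool

# Comparing a whole Series to a value gives a Boolean Series, one answer per row.
is_from_1950s = films.get("Decade") == 1950

# Keep only the rows where the answer is True.
fifties = films[is_from_1950s]

# The same query without the intermediate variable.
fifties_again = films[films.get("Decade") == 1950]

print(fifties)
```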
Question 4.1. Create a DataFrame called ninety_eight containing the movies that came out in 1998.
So far we've only been finding where a column is exactly equal to a certain value. However, there are many other comparison operators we could use. Here are a few:
| Operator | Tests |
|---|---|
| == | thing on left is equal to thing on right |
| != | thing on left is not equal to thing on right |
| > | thing on left is greater than (and not equal to) thing on right |
| >= | thing on left is greater than or equal to thing on right |
| < | thing on left is less than (and not equal to) thing on right |
Note 10 in the course notes has more examples.
Question 4.2. Using operators from the table above, find all the movies with a rating higher than 8.6. Put their data in a DataFrame called really_highly_rated.
What is the highest rating of any movie from the 1990s? We now have the tools to answer questions like these. Breaking it into pieces, we first find all of the movies from the 1990s:
We then select only these movies from our table:
We then find the highest rating out of just these movies:
Or, if we wanted to do all of this more concisely using chaining:
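The same steps can be sketched on a made-up table, first one piece at a time and then as a single chain. The table films and every value in it are hypothetical.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up films, indexed by title.
films = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Decade=np.array([1990, 1990, 2000]),
    Rating=np.array([8.1, 8.5, 9.0]),
).set_index("Title")

# Step by step: Boolean Series, then query, then the largest rating.
is_from_1990s = films.get("Decade") == 1990
nineties = films[is_from_1990s]
best_nineties = nineties.get("Rating").max()

# All at once, using chaining.
best_chained = films[films.get("Decade") == 1990].get("Rating").max()

print(best_nineties, best_chained)
```

The highest rating overall in this made-up table belongs to a film from 2000, so the query is what keeps it out of the answer.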
Question 4.3. Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in imdb.
Hint: Series have a .mean() method. Note that the year 2000 is in the 20th century, and that the earliest movie in the dataset is from 1921!
The property shape tells you how many rows and columns are in a DataFrame. (A "property" is like a method that doesn't need to be called by adding parentheses.)
Like an array, you can get the first element of the shape using [0], and the second element using [1]. For instance, the number of rows in imdb_by_name is:
We can use this to answer "How many movies are from the 20th century?":
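Here is a small sketch of shape on a made-up table, including counting the rows that survive a query. The table films and its values are hypothetical.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: fall back to pandas, which the lab says runs babypandas code too.
    import pandas as bpd

# Made-up films, indexed by title.
films = bpd.DataFrame().assign(
    Title=np.array(["Film A", "Film B", "Film C"]),
    Decade=np.array([1990, 1990, 2000]),
    Rating=np.array([8.1, 8.5, 9.0]),
).set_index("Title")

n_rows = films.shape[0]     # number of rows
n_columns = films.shape[1]  # number of columns (the index is not counted)

# Query first, then count the rows that are left.
n_from_1990s = films[films.get("Decade") == 1990].shape[0]

print(n_rows, n_columns, n_from_1990s)
```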
Question 4.4. Use shape (and arithmetic) to find the proportion of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.
Hint: The proportion of movies released in the 20th century is the number of movies released in the 20th century, divided by the total number of movies in the dataset.
Question 4.5. Finally, let's revisit the population_by_year DataFrame from earlier in the lab. Compute the year when the world population first went above 7 billion.
Finish Line
Congratulations! You are done with Lab 02.
To submit your assignment:
1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.