Genomics Data Notebook 1: Preparing Our Data for Analysis
Goals For This Notebook:
1 - Load the algae RNA sequencing data and perform some Exploratory Data Analysis.
2 - Clean the data.
3 - Save our data into a new csv file.
Table of Contents
1 - Loading the Data
1.1 - What are the dimensions of our data?
1.2 - What is tracking ID?
1.3 - What are our other columns?
2 - Understanding and Cleaning the Data
2.1 - Making More Readable Column Names
2.2 - Understanding the Data Values
For the genomics data notebooks, we will be working with RNA sequencing data from an experiment measuring gene expression in the algae Chromochloris zofingiensis.
This algae is important for a couple of reasons:
It stores large amounts of energy from the sun, which can then be turned into biofuel.
It produces molecules that are beneficial for human health, like antioxidants.
Recently, scientists performed an experiment to figure out which genes were most important for these functions. You can read more about the experiment here. Specifically pay attention to the section titled "High Light-Induced Gene Expression" (under Results and Discussion), where scientists looked at which genes were 'turned on' (i.e. increased their expression levels) when C. zofingiensis samples were exposed to stronger light.
Question 1 Why would it be helpful for scientists to know which genes were expressed when the algae was exposed to high light?
It would be helpful for scientists to know which genes are expressed when exposed to high light to see if different light levels result in more or less energy stored in fat.
In this notebook, you will get to know the genomics dataset through Exploratory Data Analysis. You will also clean the dataset and then perform further data analysis in future notebooks.
Let's first get started by importing the libraries we need:
Now we want to import "rnaseq_raw_counts.txt" into the variable rna_data. The file is saved in the folder data, so we must include the folder name for the computer to know where to look for the file! We add the folder name before the filename with a slash (/) between them, e.g. 'data/results.csv'.
Before importing it, take a look at "rnaseq_raw_counts.txt". Notice that the values are separated by tabs instead of commas. This means we want to use the argument sep='\t' in the pd.read_csv() function call. The argument tells the computer that the file's values are separated by tabs instead of commas.
Great! Now let's learn more about our data set.
Find out how many rows and columns there are in our rna_data table.
Hint: You can use the len() function. For columns, you will need an extra step: call len() on the .columns attribute of your dataframe. Alternatively, you can use .shape; look up online how to use the .shape attribute (it is an attribute, so it is written without parentheses).
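Here is one possible way to follow the hint, shown on a small made-up table rather than the real rna_data (the values are invented). Both approaches should give the same numbers.

```python
import pandas as pd

# Toy table standing in for rna_data; values are made up.
rna_data = pd.DataFrame(
    {
        "tracking_id": ["Cz01g00010", "Cz01g00020", "Cz02g00030"],
        "HL.0.5h0": [12, 0, 1040],
        "HL.0.5h1": [15, 3, 998],
    }
)

# Approach 1: len() on the dataframe counts rows;
# len() on .columns counts columns.
n_rows = len(rna_data)
n_cols = len(rna_data.columns)

# Approach 2: .shape is an attribute holding (rows, columns).
print(n_rows, n_cols)
print(rna_data.shape)
```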
That's a lot of data! Let's take a quick peek at the data table and see what we are working with. Notice that we cannot see every single column name; there is a "column" with ellipses (...) in place of the hidden ones.
The column tracking_id refers to the id of a specific gene we are tracking. Each one is in the form 'Cz##g#####'.
'Cz' means that the gene is from the algae species Chromochloris zofingiensis.
'##g' (the next two digits + 'g') tell us which chromosome the gene is on.
'#####' (the last few digits) are a randomly assigned ID number.
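To make the ID format concrete, the sketch below splits a few made-up IDs into their parts with a regular expression. The pattern is an assumption based only on the 'Cz##g#####' description above; real IDs that deviate from it would come back as missing values.

```python
import pandas as pd

# Made-up IDs that follow the 'Cz##g#####' form described above.
ids = pd.Series(["Cz01g00010", "Cz01g00020", "Cz02g00030"])

# Assumed pattern: 'Cz', two chromosome digits, 'g', then the ID digits.
pattern = r"^Cz(?P<chromosome>\d{2})g(?P<gene_number>\d+)$"

# str.extract returns one column per named group (as strings).
parts = ids.str.extract(pattern)
print(parts)
```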
Let's check if each gene ID is unique. Look back at notebook 07 (pandas dataframes) to remind yourself what function we can use to find the number of unique tracking IDs.
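One possible way to do this check is with .nunique(), sketched here on a made-up table. The .is_unique attribute gives the same answer as a single True/False.

```python
import pandas as pd

# Toy table standing in for rna_data; values are made up.
rna_data = pd.DataFrame(
    {
        "tracking_id": ["Cz01g00010", "Cz01g00020", "Cz02g00030"],
        "HL.0.5h0": [12, 0, 1040],
    }
)

# Count the distinct tracking IDs and compare with the number of rows.
n_unique = rna_data["tracking_id"].nunique()
n_rows = len(rna_data)
print(n_unique, n_rows)

# True when no ID appears more than once.
all_unique = rna_data["tracking_id"].is_unique
print(all_unique)
```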
Question 2 Compare the number of unique IDs to the number of rows. Notice that they are the same number. What does that mean?
That means that every row has its own unique tracking ID, so no gene appears more than once in the table.
Now that we have explored the first column, let's take a look at the remaining columns. In the cell below, print out the names of all of the columns.
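A short sketch of printing every column name, using an empty toy table with a few assumed column names. Looping over the names avoids the ellipses (...) that appear when a wide table is displayed.

```python
import pandas as pd

# Empty toy table; the column names are illustrative only.
rna_data = pd.DataFrame(
    columns=["tracking_id", "HL.0.5h0", "HL.0.5h1", "ML.0h0"]
)

# Convert the column index to a plain list and print one name per line.
column_names = list(rna_data.columns)
for name in column_names:
    print(name)
```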
Question 3 Take a look at all of the column names other than the tracking_id column. What is similar about all of the names? What is different? Do you have any guesses about what these might mean?
HL is probably high light, while ML is probably medium light.
Now that we have an idea of what our columns look like, we can make our table easier to work with through data cleaning. Here are some of our main goals for cleaning the data:
Creating a table index that is useful
Changing column labels to be more readable
Finding and addressing null values
Let's start with the first objective. Earlier, we found out that each tracking_id is unique, so we can make tracking_id our table index. This makes it easier to find data associated with a specific gene.
Be a data scientist and do some online research to find out how to use the pandas function set_index(), then use it in the cell below:
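As a rough sketch on a made-up table, set_index() can be used like this. Note that it returns a new dataframe by default, so the result is assigned back to rna_data.

```python
import pandas as pd

# Toy table standing in for rna_data; values are made up.
rna_data = pd.DataFrame(
    {
        "tracking_id": ["Cz01g00010", "Cz01g00020", "Cz02g00030"],
        "HL.0.5h0": [12, 0, 1040],
        "HL.0.5h1": [15, 3, 998],
    }
)

# Make tracking_id the row index instead of a regular column.
rna_data = rna_data.set_index("tracking_id")

# Rows can now be looked up directly by gene ID.
print(rna_data.loc["Cz01g00020"])
```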
Well done! Now let's talk about the other columns.
You may have noticed earlier that they follow a structure.
'HL' or 'ML' refers to whether the algae grew in 'high light' or 'medium light' respectively.
'##h' tells us how many hours the algae was exposed to the light before a sample was collected.
For HL, the range of times is [0.5, 12, 1, 3, 6].
For ML, the range of times is [0.5, 0, 12, 1, 3, 6].
'#' (the last digit) is an indicator of what replication of the sample it is.
Each experiment has 4 replications labeled 0, 1, 2, or 3.
The column HL.0.5h0 can be read as "high light for 0.5 hours -- sample 0".
This format is hard to read with the period after 'HL'/'ML' and the 'h' denoting that the time is in hours. Let's change the column names so that they are easier for us to read. We have provided new column names for you to use in the following format:
'##' (the first few digits) tell us the number of hours of light exposure.
'HL' or 'ML' denotes the light intensity.
'-#' gives us the replication number.
The column 0.5HL-0 can be read as "0.5 hours of high light for sample 0".
Quick Note: This format is better in terms of readability, but it might not be best for future coding uses and analyses! In data science, we sometimes have to compromise between readability and practicality. In this case, we want you to understand the data well, so we chose to emphasize readability over practicality.
Relabel the columns of rna_data to the given labels in rna_new_columns.
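The sketch below shows one way the relabeling could work on a made-up table. In the notebook, rna_new_columns is provided for you; here it is rebuilt with a hypothetical helper, readable_label(), that assumes every old name follows the 'HL.##h#' / 'ML.##h#' pattern described above.

```python
import re

import pandas as pd


def readable_label(old_name):
    """Convert e.g. 'HL.0.5h0' to '0.5HL-0' (hypothetical helper)."""
    match = re.fullmatch(r"(HL|ML)\.(\d+(?:\.\d+)?)h(\d)", old_name)
    if match is None:
        # Leave names that do not fit the assumed pattern untouched.
        return old_name
    light, hours, replicate = match.groups()
    return f"{hours}{light}-{replicate}"


# Toy table standing in for rna_data; values are made up.
rna_data = pd.DataFrame(
    [[12, 15, 9], [0, 3, 1]],
    index=pd.Index(["Cz01g00010", "Cz01g00020"], name="tracking_id"),
    columns=["HL.0.5h0", "HL.12h3", "ML.0h1"],
)

# Stand-in for the provided list of new labels (same order as the columns).
rna_new_columns = [readable_label(name) for name in rna_data.columns]

# Assigning to .columns relabels every column at once.
rna_data.columns = rna_new_columns
print(rna_data.columns.tolist())
```

Assigning a list to .columns only works when the list has exactly one label per column, in the same order as the existing columns.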
To get a better understanding of the data we are working with, take a look at the data in the first 10 rows and first 10 columns. Use iloc to do this (you can look back at notebook 07 to remind yourself how iloc works).
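A minimal sketch of position-based slicing with iloc, using a made-up 15 x 12 table of consecutive numbers so the selected block is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Toy table of consecutive integers; names and values are made up.
values = np.arange(15 * 12).reshape(15, 12)
rna_data = pd.DataFrame(values, columns=[f"col{i}" for i in range(12)])

# iloc selects by position: rows first, then columns.
# The end of each slice is excluded, so :10 means positions 0 to 9.
first_block = rna_data.iloc[:10, :10]
print(first_block)
```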
Question 4 We have a full understanding of our data table's labels, so let's now consider what the values in the table represent. What data type(s) are the values in the table? What do you think they might represent?
The table represents the lipid level at different light levels for different tracking id's.
Check if there is any missing data in the following cell (look back at notebook 07). Based on your answer above, think about whether this would affect our data analysis later.
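One common way to check for missing data is isnull() followed by sum(). The toy table below deliberately contains one missing value (unlike the real dataset) so that the output shows something to count.

```python
import numpy as np
import pandas as pd

# Toy table with one deliberately missing value; values are made up.
rna_data = pd.DataFrame(
    {
        "0.5HL-0": [12, 0, 1040],
        "0.5HL-1": [15, np.nan, 998],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = rna_data.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the table.
total_missing = int(missing_per_column.sum())
print(total_missing)
```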
It's a good thing we have no missing data! It seems like all of our data values are numbers, so let's see what range of values is in our 0.5HL-0 column.
Find the minimum value, maximum value, and mean in the 0.5HL-0 column (again, notebook 07 can give you helpful reminders on what functions to use).
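As a sketch, the three summary values can be computed with the column's min(), max(), and mean() methods. The numbers below are made up, chosen only to mimic a wide range.

```python
import pandas as pd

# Toy column standing in for the real one; values are made up.
rna_data = pd.DataFrame({"0.5HL-0": [12, 0, 1040, 250000]})

column = rna_data["0.5HL-0"]

# Each method summarizes the whole column as a single number.
col_min = column.min()
col_max = column.max()
col_mean = column.mean()
print(col_min, col_max, col_mean)
```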
It seems like there is a large range of numbers under this column. Choose another column and check if it has a range of values that is just as large as 0.5HL-0. Feel free to try multiple different columns.
Why is the range of values so broad for most columns?
The values in our data table represent how strongly each gene is "turned on" (its expression level, recorded as raw counts) under the given light conditions. Some genes may turn on more under lower light conditions while others may turn on more under higher light conditions. We might also see that some genes turn on more after longer light exposure than they do after shorter light exposure.
In order to analyze this, however, we need to be able to look at numbers that range from 0 to the hundreds of thousands! We will address this issue in the next notebook.
Let's save our progress by using the .to_csv() function to save the dataframe rna_data to "rna_data_cleaned.csv". Don't forget to specify to save it in the data folder.
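A minimal sketch of the save step on a made-up table. To avoid touching your real files, it creates a data folder inside a temporary directory; in the notebook you would write straight to 'data/rna_data_cleaned.csv'. Reading the file back is an optional check that the index was saved too.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table standing in for the cleaned rna_data; values are made up.
rna_data = pd.DataFrame(
    {"0.5HL-0": [12, 0, 1040], "0.5HL-1": [15, 3, 998]},
    index=pd.Index(
        ["Cz01g00010", "Cz01g00020", "Cz02g00030"], name="tracking_id"
    ),
)

# Temporary stand-in for the notebook's data folder.
out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir()
out_path = out_dir / "rna_data_cleaned.csv"

# to_csv writes the index (tracking_id) as the first column by default.
rna_data.to_csv(out_path)

# In the notebook itself, the call would be:
# rna_data.to_csv("data/rna_data_cleaned.csv")

# Optional check: read the file back with tracking_id as the index.
reloaded = pd.read_csv(out_path, index_col="tracking_id")
print(reloaded)
```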
Notebook developed by: Ciara Acosta, Sharon Greenblum, Alisa Bettale