Genomics Data Notebook 2: Normalizing Our Data
Goal For This Notebook:
1 - Normalize the data so we can better compare the data, by using a relative scale instead of directly comparing them.
Table of Contents
2 - Normalization by Read Depth
3 - Checking the Effect of Normalization
4 - Transposing Data
5 - Nomalization by Gene Length
In this notebook, we will look at our cleaned data from the previous notebook and then normalize the data. Normalization allows us to compare our data on a relative scale instead of directly comparing them. This will allow us to visualize data easier for the next notebook and help us find relationships in the data.
Normalization is a new topic so this notebook will break it down into steps. But first, as always, let us import the needed libraries:
Load yesterday's cleaned table into rna_data
. Recall that we saved yesterday's work into a file called "rna_data_cleaned.csv" in the data folder. Double check that rna_data
is what it looked like yesterday.
The index_col=0
argument allows us to set the first column as the index of the dataframe. If you are curious, you can try running the code without the argument and see the difference.
1. What Is Normalization?
Previously, we noted that the values in a single column ranged from 0 to the hundreds of thousands. Since it is hard to look at numbers in this wide of a range, we want to "normalize" the data. Normalizing data means that we adjust the scale of our data by comparing our numbers on a relative scale instead of directly comparing them. This helps with data that takes on a large range of numbers, and it eliminates the issue of comparing categories that are on different scales.
For example, let's say we have two classes, Class 1 and Class 2, with class sizes of 10 and 1,000 respectively. If 10 students from each class earn a perfect score on a test, we can tell that there is a significant difference between Class 1 and Class 2's performances. However, both classes had the same number of perfect test scores when we look at the raw numbers. In order to better portray this, we want to work with relative numbers.
Let's see how this would look with the data table classes
.
If we were to only look at the data associated with "A" grades, we would not know that Class 1 is performing better than Class 2. Instead, we see that both classes had the same number of scores.
In order to normalize the data, we could divide our Class 1
data by the total number of students in Class 1
and our Class 2
data by the total number of students in Class 2
. This tells us what proportion of students received a perfect score instead of what number of students did.
Now, if we only look at the data associated with "A" grades, we can see the difference between Class 1 scores and Class 2 scores.
2. Normalization By Read Depth
Let's apply this logic to our genomics data. Here, our columns are experiment conditions, and our rows are specific genes. It is possible that a condition has the fewest counts for all genes because the sample actually "turns on" fewer genes than other conditions, but it is also possible that the condition had fewer counts overall. In order to analyze this, we need to find out how many counts were obtained across all genes under a specific condition.
Let's define what we will be working with.
reads: the # of genes that "turned on" (ie. the values in our data)
read depth: the total # of reads across all genes under a specific condition
Consider this oversimplification of our rna_data
table. For fictional gene Cz1, we see that high light (HL) conditions produce more reads than medium light (ML) conditions. However, when compared to other genes, we notice that the relative reads are larger under the ML
condition than under the HL
condition.
Let's write some code that calculates the read depths of all experiment conditions/samples.
Create an empty list of read depths called
read_depths
.Create a for loop (see notebook 04) that goes through each experiment condition (i.e. each sample):
Calculate the read depth by finding the total number of reads across all genes in one column. Store it in the variable
one_read_depth
.Add the variable
one_read_depth
to our list,read_depths
.
Looking back at the classes example earlier, this is synonymous to calculating all of our class sizes.
Now make a scatter plot (see notebook 08) that shows the range of read depths across all samples.
We want the x-axis to be the different samples (i.e. the column labels) and the y-axis to be the read depths corresponding to those samples.
Through this visualization, you can see that the read depth differs by a large amount under different conditions and samples. This is like how our class sizes differed, so it is a good thing that we have decided to normalize our data!
Now let's create a function (see notebook 01) that normalizes the data in our rna_data
table. Write some code that calculates relative reads based on read depths.
For reference, we have provided the normalization of our example dataframe simple_rna
by read depth below. You can see that to normalize our data, we divide the dataframe by the list simple_read_depths
, which contain the read depths (the total # of gene counts) for each column.
Now that we have a function that normalizes by read depth (the total # of gene counts), we can normalize rna_data
by read depth. Recall that we calculated the read depths of all experiment conditions/samples and stored it in a list called read_depths
. Now, use rna_data
and read_depths
with our new function to get a normalized data table:
Because each value is a proportion, the values under each column should each sum up to 1. Let's verify that.
Let's visualize our column totals once again. Previously, our column totals were our read depths, because they were the sum of all reads under a certain condition/sample. Now, our data is in relative reads (i.e. proportions), so our column totals should all be lined up at 1.
Now, our various conditions/samples are all on the same "scale", so it will be easier to compare between different light conditions and samples.
To verify this, let's compare the data for gene Cz01g00040
before and after normalizating by read depth.
3. Checking the Effect of Normalization
Create a plot that shows reads for only gene Cz01g00040
in the dataframe rna_data
. Each row corresponds to a specific gene, so you will need to access the Cz01g00040
row in the table (hint: use loc
- see notebook 07).
Going back to our classes example, this visualization would be similar to plotting the raw grade counts for the classes
data (i.e. it will not show the relative differences).
Question 1 What is the scale of Cz01g00040
reads? How much larger is the highest read compared to the lowest read? Is there a pattern or trend in the visualization?
Your answer here.
Now let's look at the normalized reads for gene Cz01g00040
in the normalized dataframe rna_by_read_depth
. Once again, Cz01g00040
is a row in our table whose data you need to access using loc
. This visualization shows the normalized reads of this gene, so it accounts for the total number of reads each condition produced.
Question 2 What is the scale of normalized Cz01g00040
reads? How much larger is the highest read compared to the lowest read? Is there a pattern or trend in the visualization? Did it change from the previous visualization?
Your answer here.
Recall that we have normalized the data such that the table's values represent the reads (or gene counts) relative to the total reads (or read depth) based on the corresponding experiment conditions and sample number. As an example, we saw how gene Cz01g00040
's data pattern became more clear after normalization.
Even though this made our data easier to analyze, we can still do more normalization to aide the interpretation of our data.
When we normalized by read depth, we normalized by column. More specifically, we took the totals of each column and represented each data value as a proportion of each total. In order to improve our analysis, we can also do normalization by row.
Question 3 What does each row represent? What characteristics do genes have? What do you think we could do normalization by?
Your answer here.
4. Transposing Data
Before we can normalize by row, we need to go over the concept of transposing data. Recall that normalizing by column consisted of dividing our columns by the sum of each column. We did this by dividing the dataframe by a list that contained the column sums.
Unfortunately, computers can only divide columns by arrays/lists and cannot divide rows by arrays/lists. In order to account for this, we can transpose our data -- flip our data table such that the rows are columns and the columns are rows. After normalizing our columns, we can transpose our data again, so that our columns are back to being rows.
Image credit: w3resource
Let's first take a look at the classes
data table that we worked with previously.
Try using the function np.transpose()
so that the classes
table has the following structure.
Columns: "Grade", "A", "B", "C"
Rows: "Class 1", "Class 2"
Now we can perform normalization just like before. To do this, divide the transposed dataframe by the totals for each grade (A, B, C).
Note that these values should not be the same as the normalized values from section 1
Now we will apply np.transpose()
once again and update our classes
dataframe:
Altogether, the whole process (denoted below) normalizes our original table, classes
, by row.
Transpose the table so that the rows are columns and the columns are rows.
Do column division by an array/list.
Transpose the table again so that the structure is the same as the original table.
5. Normalization by Gene Length
Now you know how to apply normalization by row. Let's think about what we want to normalize by. Each row represents a specific gene, so it might be worthwhile to consider the length of a gene. Different genes have different lengths, so let's check out the lengths of our genes.
In the cell below, load the information we have on each gene via the "gene_info.info" file (located in the data folder). Like the original rna_data
table, this file is separated by tabs instead of commas. Indicate this in your function call with the argument sep=...
.
Hint: if you forgot what to put in the ellipses (...), check what you did in the first genomics notebook!
This data table gives us a lot of interesting information! First, our index is geneID
, which seems similar to our tracking_id
column in rna_data
. The column chrom
tells us what chromosome the gene can be found on, and the columns start
and stop
tell us where the gene begins and ends on that chromosome. Finally, length
refers to how long that specific gene is.
The last column would be the most useful to us for row normalization. Genes that are of longer lengths may produce more reads than those of shorter lengths, so we need to take that into account. Let's normalize the dataframe rna_by_read_depth
by each row's gene length.
First though we have to check whether or not the gene IDs of gene_info
are in the same order as the gene IDs in rna_by_read_depth
. Do this in the next cell and keep in mind that the gene IDs of both data tables are the table indices.
Get the indices of both tables (by using
.index
) and assign them them togene_info_ids
andrna_data_ids
.Compare the gene ID arrays using a boolean statement.
Count how many indices match up.
Recall that 'True' = 1 and 'False' = 0. You can use the
sum()
function.
Verify that the number of matching gene IDs is the number of genes in the
rna_by_read_depth
table.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_1410/3843919298.py in <cell line: 6>()
4 rna_data_ids = rna_by_read_depth.index
5
----> 6 matches = (gene_info_ids == rna_data_ids)
7 num_matches = sum(matches)
8
/usr/local/lib/python3.10/dist-packages/pandas/core/ops/common.py in new_method(self, other)
79 other = item_from_zerodim(other)
80
---> 81 return method(self, other)
82
83 return new_method
/usr/local/lib/python3.10/dist-packages/pandas/core/arraylike.py in __eq__(self, other)
38 @unpack_zerodim_and_defer("__eq__")
39 def __eq__(self, other):
---> 40 return self._cmp_method(other, operator.eq)
41
42 @unpack_zerodim_and_defer("__ne__")
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in _cmp_method(self, other, op)
6756 self
6757 ) != len(other):
-> 6758 raise ValueError("Lengths must match to compare")
6759
6760 if not isinstance(other, ABCMultiIndex):
ValueError: Lengths must match to compare
Based on the above statement, we can see that the gene ID columns are in the same order for gene_info
and rna_by_read_depth
. This means we can take the length
information from gene_info
with full confidence that it will be in the same order as the corresponding genes in rna_data
.
Let's store that in the variable gene_lengths
in the next cell:
Let's take a quick peek at the gene lengths of the first 100 genes!
Wow! Gene lengths vary by a lot, and these are only the first 100 gene lengths! Once again, it is a good thing we have decided to normalize our data to account for this variability in gene lengths.
Let's create a function to normalizes rna_by_read_depth
by rows. Looking back at section 4, recall the steps are as follows:
Transpose the table so that the rows are columns and the columns are rows.
Do column division by an array/list.
Transpose the table again so that the structure is the same as the original table.
Well done! We now have a function that normalizes by row and an array of values that we would like to normalize by. In the next cell, let's normalize rna_by_read_depth
by gene_lengths
.
You might notice that the numbers are incredibly small. Recall that these values are gene reads as proportions of their gene lengths as well as gene reads as proportions of all samples' total reads.
What effect did normalization have?
Let's see how our data changed for gene Cz01g00040
after normalizing the data by gene lengths.
Once again, let's create a plot that visualizes the normalized reads for Cz01g00040
in rna_by_read_depth
. Remember that rna_by_read_depth
is our data normalized by read depths only.
In this next cell, you should visualize the normalized reads for Cz01g00040
in rna_by_gene_len
. This is the data from the previous visualization normalized by gene lengths.
Question 1 How did the visualization change after the data was normalized by gene length? Did the pattern change? Did the scale of normalized Cz01g00040
reads change? How much larger is the highest read compared to the lowest read?
Your answer here.
Question 2 The changes that occurred may not seem too drastic in the visualization. However, they are still important. Why might that be? Hint: Look back at your genes vs. gene length graph
Your answer here.
To finish off this notebook, let's save our progress by using the .to_csv()
function to save the dataframe rna_by_gene_len
, as "rna_normalized.csv" in the data folder.
Notebook developed by: Sharon Greenblum, Ciara Acosta, Kseniya Usovich, Alisa Bettale