
Genomics Data Notebook 2: Normalizing Our Data


Goal For This Notebook:

1 - Normalize the data so we can compare values on a relative scale instead of comparing raw counts directly.


Table of Contents

1 - What Is Normalization?

2 - Normalization by Read Depth

3 - Checking the Effect of Normalization

4 - Transposing Data

5 - Normalization by Gene Length


In this notebook, we will look at our cleaned data from the previous notebook and then normalize it. Normalization allows us to compare our data on a relative scale instead of comparing raw values directly. This will make the data easier to visualize in the next notebook and help us find relationships in the data.

Normalization is a new topic, so this notebook will break it down into steps. But first, as always, let us import the needed libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load yesterday's cleaned table into rna_data. Recall that we saved yesterday's work into a file called "rna_data_cleaned.csv" in the data folder. Double check that rna_data is what it looked like yesterday.

The index_col=0 argument allows us to set the first column as the index of the dataframe. If you are curious, you can try running the code without the argument and see the difference.
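If you would rather not re-run the exercise cell to see the difference, here is a minimal sketch using a tiny made-up CSV (the column and gene names below are hypothetical, just to illustrate what index_col=0 changes):

```python
import io
import pandas as pd

# a tiny stand-in CSV, not the real data file
csv_text = "tracking_id,HL1,ML1\nCz1,50,20\nCz2,100,1\n"

# without index_col, the gene IDs stay as an ordinary column
no_index = pd.read_csv(io.StringIO(csv_text))

# with index_col=0, the first column becomes the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(no_index.columns.tolist())    # ['tracking_id', 'HL1', 'ML1']
print(with_index.columns.tolist())  # ['HL1', 'ML1']
print(with_index.index.tolist())    # ['Cz1', 'Cz2']
```

With the index set, we can later look up rows by gene ID with .loc, which is exactly what we will do in this notebook.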

# EXERCISE
rna_data = pd.read_csv("data/rna_data_cleaned.csv", index_col=0, nrows=100)
rna_data.head()

1. What Is Normalization?

Previously, we noted that the values in a single column ranged from 0 to the hundreds of thousands. Since it is hard to compare numbers across such a wide range, we want to "normalize" the data. Normalizing data means adjusting its scale so that we compare numbers on a relative scale instead of directly. This helps with data that spans a large range of values, and it eliminates the issue of comparing categories that are on different scales.

For example, let's say we have two classes, Class 1 and Class 2, with class sizes of 10 and 1,000 respectively. If 10 students from each class earn a perfect score on a test, the raw numbers look identical, yet there is clearly a significant difference between the two classes' performances: every student in Class 1 earned a perfect score, versus only 1% of Class 2. In order to portray this, we want to work with relative numbers.

Let's see how this would look with the data table classes.

# EXAMPLE
classes = pd.DataFrame([["A", 10, 10], ["B", 0, 900], ["C", 0, 90]],
                       columns=["Grade", "Class 1", "Class 2"])
classes = classes.set_index("Grade")
classes

If we were to only look at the data associated with "A" grades, we would not know that Class 1 is performing better than Class 2. Instead, we see that both classes had the same number of scores.

# EXAMPLE
classes.loc["A"]
Class 1    10
Class 2    10
Name: A, dtype: int64

In order to normalize the data, we could divide our Class 1 data by the total number of students in Class 1 and our Class 2 data by the total number of students in Class 2. This tells us what proportion of students received a perfect score instead of what number of students did.

# EXAMPLE
num_students_1 = sum(classes.iloc[:, 0])  # taking the sum of column 0, Class 1
num_students_2 = sum(classes.iloc[:, 1])  # taking the sum of column 1, Class 2
classes["Class 1"] = classes["Class 1"] / num_students_1  # divide Class 1 data by the total number of students in Class 1
classes["Class 2"] = classes["Class 2"] / num_students_2  # divide Class 2 data by the total number of students in Class 2
print("# of students in Class 1: " + str(num_students_1))
print("# of students in Class 2: " + str(num_students_2))
classes
# of students in Class 1: 10
# of students in Class 2: 1000

Now, if we only look at the data associated with "A" grades, we can see the difference between Class 1 scores and Class 2 scores.

# EXAMPLE
classes.loc["A"]
Class 1    1.00
Class 2    0.01
Name: A, dtype: float64

2. Normalization By Read Depth

Let's apply this logic to our genomics data. Here, our columns are experiment conditions, and our rows are specific genes. It is possible that a condition has the fewest counts for all genes because the sample actually "turns on" fewer genes than other conditions, but it is also possible that the condition had fewer counts overall. In order to analyze this, we need to find out how many counts were obtained across all genes under a specific condition.

Let's define what we will be working with.

  • reads: the # of times a gene "turned on" (i.e. the values in our data)

  • read depth: the total # of reads across all genes under a specific condition

Consider this oversimplification of our rna_data table. For fictional gene Cz1, we see that high light (HL) conditions produce more reads than medium light (ML) conditions. However, when compared to other genes, we notice that the relative reads are larger under the ML condition than under the HL condition.

# EXAMPLE
simple_rna = pd.DataFrame([["Cz1", 50, 20], ["Cz2", 100, 1], ["Cz3", 100, 1]],
                          columns=["Gene", "HL", "ML"])
simple_rna = simple_rna.set_index("Gene")
simple_rna

Let's write some code that calculates the read depths of all experiment conditions/samples.

  • Create an empty list of read depths called read_depths.

  • Create a for loop (see notebook 04) that goes through each experiment condition (i.e. each sample):

    • Calculate the read depth by finding the total number of reads across all genes in one column. Store it in the variable one_read_depth.

    • Add the variable one_read_depth to our list, read_depths.

Looking back at the classes example earlier, this is analogous to calculating all of our class sizes.

sum(rna_data.iloc[:, 0])
32957
# EXERCISE
read_depths = []
num_samples = len(rna_data.columns)
for i in range(num_samples):
    one_read_depth = sum(rna_data.iloc[:, i])
    read_depths.append(one_read_depth)
print("These are the read depths of all samples:")
print(read_depths)
These are the read depths of all samples: [32957, 29176, 35282, 29898, 23592, 26639, 20156, 28573, 17893, 21955, 28462, 21877, 32008, 28603, 26906, 29191, 14000, 6864, 12072, 27248, 24738, 24488, 24385, 25407, 26685, 3413, 37023, 27658, 30623, 31021, 27898, 30620, 24308, 38429, 30648, 37750, 29454, 28353, 26801, 29369, 30515, 27875, 26818, 11886]
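The loop above mirrors the style of the earlier notebooks. As an aside, pandas can also compute all the column sums in one call with .sum(); here is a sketch on a miniature stand-in table (the numbers and labels are made up, borrowed from the simple_rna example):

```python
import pandas as pd

# a miniature stand-in for rna_data (hypothetical counts)
mini = pd.DataFrame({"HL1": [50, 100, 100], "ML1": [20, 1, 1]},
                    index=["Cz1", "Cz2", "Cz3"])

# .sum(axis=0) adds up each column in one call -- no loop needed
depths = mini.sum(axis=0)  # a Series with one total per sample
print(depths.tolist())     # [250, 22]
```

Both approaches give the same totals; the loop just makes each step explicit.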

Now make a scatter plot (see notebook 08) that shows the range of read depths across all samples.

We want the x-axis to be the different samples (i.e. the column labels) and the y-axis to be the read depths corresponding to those samples.

# EXERCISE
x_data = rna_data.columns
y_data = read_depths
plt.scatter(x_data, y_data)
plt.title("Samples vs. Read Depth")
plt.xlabel("Samples")
plt.ylabel("Read Depth")
plt.xticks(rotation=90);
Image in a Jupyter notebook

Through this visualization, you can see that the read depth differs by a large amount under different conditions and samples. This is like how our class sizes differed, so it is a good thing that we have decided to normalize our data!

Now let's create a function (see notebook 01) that normalizes the data in our rna_data table. Write some code that calculates relative reads based on read depths.

For reference, we have provided the normalization of our example dataframe simple_rna by read depth below. You can see that to normalize our data, we divide the dataframe by the list simple_read_depths, which contains the read depth (the total # of gene counts) for each column.

# EXAMPLE
simple_read_depths = [250, 22]
simple_rna_by_read_depth = simple_rna / simple_read_depths
simple_rna_by_read_depth
# EXERCISE
def normalizeByReadDepth(data, read_depth_list):
    return data / read_depth_list

Now that we have a function that normalizes by read depth (the total # of gene counts), we can normalize rna_data by read depth. Recall that we calculated the read depths of all experiment conditions/samples and stored it in a list called read_depths. Now, use rna_data and read_depths with our new function to get a normalized data table:

# EXERCISE
rna_by_read_depth = normalizeByReadDepth(rna_data, read_depths)
rna_by_read_depth.head()

Because each value is a proportion, the values under each column should each sum up to 1. Let's verify that.

column_totals = []
for i in range(num_samples):
    one_column_total = np.round(sum(rna_by_read_depth.iloc[:, i]))
    column_totals.append(one_column_total)
print("These are sums of all columns:")
print(column_totals)
These are sums of all columns: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Let's visualize our column totals once again. Previously, our column totals were our read depths, because they were the sum of all reads under a certain condition/sample. Now, our data is in relative reads (i.e. proportions), so our column totals should all be lined up at 1.

x_data = rna_data.columns
y_data = column_totals
plt.scatter(x_data, y_data)
plt.title("Samples vs. Column Total")
plt.xlabel("Samples")
plt.ylabel("Column Total")
plt.xticks(rotation=90);
Image in a Jupyter notebook

Now, our various conditions/samples are all on the same "scale", so it will be easier to compare between different light conditions and samples.

To verify this, let's compare the data for gene Cz01g00040 before and after normalizing by read depth.

3. Checking the Effect of Normalization

Create a plot that shows reads for only gene Cz01g00040 in the dataframe rna_data. Each row corresponds to a specific gene, so you will need to access the Cz01g00040 row in the table (hint: use loc - see notebook 07).

Going back to our classes example, this visualization would be similar to plotting the raw grade counts for the classes data (i.e. it will not show the relative differences).

# EXERCISE
x_data = rna_data.columns
y_data = rna_data.loc["Cz01g00040"]
plt.scatter(x_data, y_data)
plt.title("Samples vs. Cz01g00040 Reads")
plt.xlabel("Samples")
plt.ylabel("Cz01g00040 Reads")
plt.xticks(rotation=90);
Image in a Jupyter notebook

Question 1 What is the scale of Cz01g00040 reads? How much larger is the highest read compared to the lowest read? Is there a pattern or trend in the visualization?

Your answer here.

Now let's look at the normalized reads for gene Cz01g00040 in the normalized dataframe rna_by_read_depth. Once again, Cz01g00040 is a row in our table whose data you need to access using loc. This visualization shows the normalized reads of this gene, so it accounts for the total number of reads each condition produced.

# EXERCISE
x_data = rna_by_read_depth.columns
y_data = rna_by_read_depth.loc["Cz01g00040"]
plt.scatter(x_data, y_data)
plt.title("Samples vs. Normalized Cz01g00040 Reads")
plt.xlabel("Samples")
plt.ylabel("Normalized Cz01g00040 Reads")
plt.xticks(rotation=90);
Image in a Jupyter notebook

Question 2 What is the scale of normalized Cz01g00040 reads? How much larger is the highest read compared to the lowest read? Is there a pattern or trend in the visualization? Did it change from the previous visualization?

Your answer here.

Recall that we have normalized the data such that the table's values represent the reads (or gene counts) relative to the total reads (or read depth) based on the corresponding experiment conditions and sample number. As an example, we saw how gene Cz01g00040's data pattern became more clear after normalization.

Even though this made our data easier to analyze, we can still do more normalization to aid the interpretation of our data.

When we normalized by read depth, we normalized by column. More specifically, we took the totals of each column and represented each data value as a proportion of each total. In order to improve our analysis, we can also do normalization by row.

Question 3 What does each row represent? What characteristics do genes have? What do you think we could do normalization by?

Your answer here.

4. Transposing Data

Before we can normalize by row, we need to go over the concept of transposing data. Recall that normalizing by column consisted of dividing our columns by the sum of each column. We did this by dividing the dataframe by a list that contained the column sums.

There is a catch: when we divide a dataframe by an array/list, pandas lines the list up with the columns, not the rows, so we cannot divide rows by a list directly. To work around this, we can transpose our data -- flip our data table so that the rows become columns and the columns become rows. After normalizing our (now transposed) columns, we can transpose our data again, so that our columns are back to being rows.

Image credit: w3resource
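The column-versus-row behavior described above can be sketched with the simple_rna numbers (a toy example, not the real table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"HL": [50, 100], "ML": [20, 1]}, index=["Cz1", "Cz2"])

# dividing by a list lines the list up with the COLUMNS:
# HL is divided by 150, ML by 21
by_column = df / [150, 21]

# to divide each ROW instead: transpose, divide, transpose back
row_totals = [70, 101]  # Cz1 total, Cz2 total
by_row = np.transpose(np.transpose(df) / row_totals)

print(by_row.loc["Cz1", "HL"])  # 50/70, about 0.714
```

The transpose trick simply turns the rows into columns long enough for the column-wise division to do what we want.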

Let's first take a look at the classes data table that we worked with previously.

# EXAMPLE
classes = pd.DataFrame([["A", 10, 10], ["B", 0, 900], ["C", 0, 90]],
                       columns=["Grade", "Class 1", "Class 2"])
classes = classes.set_index("Grade")
classes

Try using the function np.transpose() so that the classes table has the following structure.

  • Columns: "Grade", "A", "B", "C"

  • Rows: "Class 1", "Class 2"

classes_by_grade = np.transpose(classes)
classes_by_grade

Now we can perform normalization just like before. To do this, divide the transposed dataframe by the totals for each grade (A, B, C).

Note that these values will not be the same as the normalized values from section 1, since we are now dividing by grade totals instead of class sizes.

# EXERCISE
grade_totals = [20, 900, 90]
classes_by_grade_shares = classes_by_grade / grade_totals
classes_by_grade_shares

Now we will apply np.transpose() once again and update our classes dataframe:

classes = np.transpose(classes_by_grade_shares)
classes

Altogether, the whole process (denoted below) normalizes our original table, classes, by row.

  1. Transpose the table so that the rows are columns and the columns are rows.

  2. Do column division by an array/list.

  3. Transpose the table again so that the structure is the same as the original table.
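The three steps above can be collapsed into a single expression. Here is a sketch using the same classes numbers (the totals are taken from the table itself):

```python
import numpy as np
import pandas as pd

classes = pd.DataFrame({"Class 1": [10, 0, 0], "Class 2": [10, 900, 90]},
                       index=["A", "B", "C"])

grade_totals = [20, 900, 90]  # total number of each grade across both classes

# 1. transpose  2. divide the (now) columns by the list  3. transpose back
by_grade = np.transpose(np.transpose(classes) / grade_totals)

print(by_grade.loc["A", "Class 1"])  # 10/20 = 0.5
```

After this, each row sums to 1: every value is that class's share of the grade's total.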

5. Normalization by Gene Length

Now you know how to apply normalization by row. Let's think about what we want to normalize by. Each row represents a specific gene, so it might be worthwhile to consider the length of a gene. Different genes have different lengths, so let's check out the lengths of our genes.

In the cell below, load the information we have on each gene via the "gene_info.info" file (located in the data folder). Like the original rna_data table, this file is separated by tabs instead of commas. Indicate this in your function call with the argument sep=....

Hint: if you forgot what to put in the ellipses (...), check what you did in the first genomics notebook!

# EXERCISE
gene_info = pd.read_csv("data/gene_info.info", sep="\t", index_col=0)
gene_info.head()

This data table gives us a lot of interesting information! First, our index is geneID, which seems similar to our tracking_id column in rna_data. The column chrom tells us what chromosome the gene can be found on, and the columns start and stop tell us where the gene begins and ends on that chromosome. Finally, length refers to how long that specific gene is.

The last column would be the most useful to us for row normalization. Genes that are of longer lengths may produce more reads than those of shorter lengths, so we need to take that into account. Let's normalize the dataframe rna_by_read_depth by each row's gene length.

First, though, we have to check whether or not the gene IDs of gene_info are in the same order as the gene IDs in rna_by_read_depth. Do this in the next cell, and keep in mind that the gene IDs of both data tables are the table indices.

  1. Get the indices of both tables (by using .index) and assign them to gene_info_ids and rna_data_ids.

  2. Compare the gene ID arrays using a boolean statement.

  3. Count how many indices match up.

    • Recall that 'True' = 1 and 'False' = 0. You can use the sum() function.

  4. Verify that the number of matching gene IDs is the number of genes in the rna_by_read_depth table.
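Step 3's trick of summing booleans can be sketched with made-up IDs (the gene names below are hypothetical):

```python
import numpy as np

ids_a = np.array(["Cz1", "Cz2", "Cz3"])
ids_b = np.array(["Cz1", "Cz2", "Cz3"])

matches = (ids_a == ids_b)  # element-wise comparison: an array of booleans
num_matches = sum(matches)  # True counts as 1, False as 0
print(num_matches)          # 3
print(num_matches == len(ids_a))  # True: every ID matched
```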

# EXERCISE
gene_info_ids = gene_info.index
rna_data_ids = rna_by_read_depth.index

# gene_info lists every gene, but rna_by_read_depth may hold fewer rows
# (we loaded it with nrows=100), so compare only the first rows --
# comparing indices of different lengths raises a ValueError
matches = (gene_info_ids[:len(rna_data_ids)] == rna_data_ids)
num_matches = sum(matches)

rna_data_num_rows = len(rna_data_ids)
same_order = (num_matches == rna_data_num_rows)
print("The gene IDs are in the same order as the tracking IDs: " + str(same_order))

Based on the above statement, we can see that the gene ID columns are in the same order for gene_info and rna_by_read_depth. This means we can take the length information from gene_info with full confidence that it will be in the same order as the corresponding genes in rna_data.

Let's store that in the variable gene_lengths in the next cell:

gene_info
# EXERCISE
# use .loc to keep only the lengths for the genes present in rna_by_read_depth
gene_lengths = gene_info["length"].loc[rna_by_read_depth.index]

Let's take a quick peek at the gene lengths of the first 100 genes!

x_data = rna_by_read_depth.index[0:100]
y_data = gene_lengths[0:100]
plt.bar(x_data, y_data)
plt.title("Genes vs. Gene Length")
plt.xlabel("Genes")
plt.ylabel("Length");
Image in a Jupyter notebook

Wow! Gene lengths vary by a lot, and these are only the first 100 gene lengths! Once again, it is a good thing we have decided to normalize our data to account for this variability in gene lengths.

Let's create a function that normalizes rna_by_read_depth by rows. Looking back at section 4, recall that the steps are as follows:

  1. Transpose the table so that the rows are columns and the columns are rows.

  2. Do column division by an array/list.

  3. Transpose the table again so that the structure is the same as the original table.

# EXERCISE
def normalizeByGene(data, gene_length_list):
    transposed_data = np.transpose(data)
    transposed_data1 = transposed_data / gene_length_list
    return np.transpose(transposed_data1)

Well done! We now have a function that normalizes by row and an array of values that we would like to normalize by. In the next cell, let's normalize rna_by_read_depth by gene_lengths.

You might notice that the numbers are incredibly small. Recall that each value has now been divided twice: once by its sample's total reads, and once by its gene's length.

# EXERCISE
rna_by_gene_len = normalizeByGene(rna_by_read_depth, gene_lengths)
rna_by_gene_len.head()

What effect did normalization have?

Let's see how our data changed for gene Cz01g00040 after normalizing the data by gene lengths.

Once again, let's create a plot that visualizes the normalized reads for Cz01g00040 in rna_by_read_depth. Remember that rna_by_read_depth is our data normalized by read depths only.

x_data = rna_by_read_depth.columns
y_data = rna_by_read_depth.loc["Cz01g00040", :]
plt.scatter(x_data, y_data)
plt.title("Samples vs. Normalized Cz01g00040 (Read Depth)")
plt.xlabel("Samples")
plt.ylabel("Normalized Cz01g00040 Reads")
plt.xticks(rotation=90);
Image in a Jupyter notebook

In this next cell, you should visualize the normalized reads for Cz01g00040 in rna_by_gene_len. This is the data from the previous visualization normalized by gene lengths.

# EXERCISE
x_data = rna_by_gene_len.columns
y_data = rna_by_gene_len.loc["Cz01g00040"]
plt.scatter(x_data, y_data)
plt.title("Samples vs. Normalized Cz01g00040 (Read Depth, Gene Length)")
plt.xlabel("Samples")
plt.ylabel("Normalized Cz01g00040 Reads")
plt.xticks(rotation=90);
Image in a Jupyter notebook

Question 1 How did the visualization change after the data was normalized by gene length? Did the pattern change? Did the scale of normalized Cz01g00040 reads change? How much larger is the highest read compared to the lowest read?

Your answer here.

Question 2 The changes that occurred may not seem too drastic in the visualization. However, they are still important. Why might that be? Hint: Look back at your genes vs. gene length graph

Your answer here.

To finish off this notebook, let's save our progress by using the .to_csv() function to save the dataframe rna_by_gene_len as "rna_normalized.csv" in the data folder.

# EXERCISE
rna_by_gene_len.to_csv("data/rna_normalized.csv")

Notebook developed by: Sharon Greenblum, Ciara Acosta, Kseniya Usovich, Alisa Bettale