⁴⁶⁹ views
ubuntu2204

Kernel: Python 3 (system-wide)

Genomics Data Notebook 1: Preparing Our Data for Analysis

Goals For This Notebook:

1 - Load the algae RNA sequencing data and perform some Exploratory Data Analysis.

2 - Clean the data.

3 - Save our data into a new csv file.

1 - Loading the Data

1.1 - What are the dimensions of our data?

1.2 - What is tracking ID?

1.3 - What are our other columns?

2 - Understanding and Cleaning the Data

2.1 - Making More Readable Column Names

2.2 - Understanding the Data Values

For the genomics data notebooks, we will be working with RNA sequencing data from an experiment measuring gene expression in the algae Chromochloris zofingiensis.

This algae is important for a couple of reasons:

It stores large amounts of energy from the sun, which can then be turned into biofuel.
It produces molecules that are beneficial for human health, like antioxidants.

Recently, scientists performed an experiment to figure out which genes were most important for these functions. You can read more about the experiment here. Specifically pay attention to the section titled "High Light-Induced Gene Expression" (under Results and Discussion), where scientists looked at which genes were 'turned on' (i.e. increased their expression levels) when C. zofingiensis samples were exposed to stronger light.

Question 1 Why would it be helpful for scientists to know which genes were expressed when the algae was exposed to high light?

It would be helpful for scientists to know which genes are expressed when exposed to high light to see if different light levels result in more or less energy stored in fat.

In this notebook, you will get to know the genomics dataset through Exploratory Data Analysis. You will also clean the dataset and then perform further data analysis in future notebooks.

Let's first get started by importing the libraries we need:

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1. Loading the Data

Now we want to import "rnaseq_raw_counts.txt" to the variable name rna_data. We must include the fact that the file is saved in the folder data, so the computer knows where to look for the csv file! We add the foldername before the filename and add a slash (/) between - e.g. 'data/results.csv'

Before importing it, take a look at "rnaseq_raw_counts.txt". Notice that each value is separated by tabs instead of commas. This means we want to use the argument sep='\t' in the pd.read_csv() function call. The argument tells the computer that the file's values are separated by tabs instead of commas.

In [4]:

# EXERCISE

rna_data = pd.read_csv("data/rnaseq_raw_counts.txt", sep="\t")

1.1 What are the dimensions of our data?

Great! Now let's learn more about our data set.

Find out how many rows and columns there are in our rna_data table.

Hint: You can use the len() function. For columns, you will need an extra step and use the len() function after you use .columns on your dataframe. Alternatively, you can use .shape - look up on Google how to use the .shape function.

In [5]:

# EXERCISE

num_rows = len(rna_data)
num_columns = rna_data.columns

print("# of rows: " + str(num_rows))
print("# of columns: " + str(num_columns))

Out[5]:

# of rows: 15369
# of columns: Index(['tracking_id', 'HL.0.5h0', 'HL.0.5h1', 'HL.0.5h2', 'HL.0.5h3',
       'HL.12h0', 'HL.12h1', 'HL.12h2', 'HL.12h3', 'HL.1h0', 'HL.1h1',
       'HL.1h2', 'HL.1h3', 'HL.3h0', 'HL.3h1', 'HL.3h2', 'HL.3h3', 'HL.6h0',
       'HL.6h1', 'HL.6h2', 'HL.6h3', 'ML.0.5h0', 'ML.0.5h1', 'ML.0.5h2',
       'ML.0.5h3', 'ML.0h0', 'ML.0h1', 'ML.0h2', 'ML.0h3', 'ML.12h0',
       'ML.12h1', 'ML.12h2', 'ML.12h3', 'ML.1h0', 'ML.1h1', 'ML.1h2', 'ML.1h3',
       'ML.3h0', 'ML.3h1', 'ML.3h2', 'ML.3h3', 'ML.6h0', 'ML.6h1', 'ML.6h2',
       'ML.6h3'],
      dtype='object')

That's a lot of data! Let's take a quick peek at the data table and see what we are working with. Notice that we cannot see every single column name; instead there is a "column" with ellipses (...) instead.

In [6]:

rna_data.head()

Out[6]:

2. What is `tracking_id`?

The column tracking_id refers to the id of a specific gene we are tracking. Each one is in the form 'Cz##g#####'.

'Cz' means that the gene is from the algae species Chromochloris zofingiensis.
'##g' (the next two digits + 'g') tell us which chromosome the gene is on.
'#####' (the last few digits) are a randomly assigned ID number.

Let's check if each gene ID is unique. Look back at notebook 07 (pandas dataframes) to remind yourself what function we can use to find the number of unique tracking IDs.

In [10]:

# EXERCISE

rna_data["tracking_id"].unique()

Out[10]:

array(['Cz01g00020', 'Cz01g00030', 'Cz01g00040', ..., 'UNPLg00911',
       'UNPLg00914', 'UNPLg00915'], dtype=object)

In [11]:

rna_data["tracking_id"].nunique()

Out[11]:

15369

Question 2 Compare the number of unique IDs to the number of rows. Notice that they are the same number. What does that mean?

That means that each ID is unique to the number of rows.

3. What are our other columns?

Now that we have explored the first column, let's take a look at the remaining columns. In the cell below, print out the names of all of the columns.

In [12]:

# EXERCISE

rna_data.columns

Out[12]:

Index(['tracking_id', 'HL.0.5h0', 'HL.0.5h1', 'HL.0.5h2', 'HL.0.5h3',
       'HL.12h0', 'HL.12h1', 'HL.12h2', 'HL.12h3', 'HL.1h0', 'HL.1h1',
       'HL.1h2', 'HL.1h3', 'HL.3h0', 'HL.3h1', 'HL.3h2', 'HL.3h3', 'HL.6h0',
       'HL.6h1', 'HL.6h2', 'HL.6h3', 'ML.0.5h0', 'ML.0.5h1', 'ML.0.5h2',
       'ML.0.5h3', 'ML.0h0', 'ML.0h1', 'ML.0h2', 'ML.0h3', 'ML.12h0',
       'ML.12h1', 'ML.12h2', 'ML.12h3', 'ML.1h0', 'ML.1h1', 'ML.1h2', 'ML.1h3',
       'ML.3h0', 'ML.3h1', 'ML.3h2', 'ML.3h3', 'ML.6h0', 'ML.6h1', 'ML.6h2',
       'ML.6h3'],
      dtype='object')

Question 3 Take a look at all of the column names other than the tracking_id column. What is similar about all of the names? What is different? Do you have any guesses about what these might mean?

HL is probably high light while Ml is probably medium light.

2. Understanding and Cleaning the Data

Now that we have an idea of what our columns look like, we can make our table easier to work with through data cleaning. Here are some of our main goals for cleaning the data:

Creating a table index that is useful
Changing column labels to be more readable
Finding and addressing null values

Let's start with the first objective. Earlier, we found out that each tracking_id is unique, so we can make tracking_id our table index. This makes it easier to find data associated with a specific gene.

Be a data scientist and do some online research to find out how to use the pandas function set_index(), then use it in the cell below:

In [14]:

# EXERCISE

rna_data = rna_data.set_index("tracking_id")

rna_data.head()

Out[14]:

2.1 Making More Readable Column Names

Well done! Now let's talk about the other columns.

You may have noticed earlier that they follow a structure.

'HL' or 'ML' refers to whether the algae grew in 'high light' or 'medium light' respectively.
'##h' tells us how many hours the algae was exposed to the light before a sample was collected.
- For HL, the range of times is [0.5, 12, 1, 3, 6].
- For ML, the range of times is [0.5, 0, 12, 1, 3, 6].
'#' (the last digit) is an indicator of what replication of the sample it is.
- Each experiment has 4 replications labeled 0, 1, 2, or 3.

The column HL.0.5h0 can be read as "high light for 0.5 hours -- sample 0".

This format is hard to read with the period after 'HL'/'ML' and the 'h' denoting that the time is in hours. Let's change the column names so that it is easier for us to read. We have provided new column names for you to use in the following format.

'##' (the first few digits) tell us the number of hours of light exposure.
'HL' or 'ML' denotes the light intensity.
'-#' gives us the replication number.

The column 0.5HL-0 can be read as "0.5 hours of high light for sample 0".

Quick Note: This format is better in terms of readability, but it might not be best for future coding uses and analyses! In data science, we sometimes have to compromise between readability and practicality. In this case, we want you to understand the data well, so we chose to emphasize readibility over practicality.

In [15]:

rna_new_columns = ['0.5HL-0', '0.5HL-1', '0.5HL-2', '0.5HL-3',
        '12HL-0', '12HL-1', '12HL-2', '12HL-3', 
        '1HL-0', '1HL-1', '1HL-2', '1HL-3', 
        '3HL-0', '3HL-1', '3HL-2', '3HL-3', 
        '6HL-0', '6HL-1', '6HL-2', '6HL-3',

        '0.5ML-0', '0.5ML-1', '0.5ML-2','0.5ML-3', 
        '0ML-0', '0ML-1', '0ML-2', '0ML-3', 
        '12ML-0', '12ML-1', '12ML-2', '12ML-3', 
        '1ML-0', '1ML-1', '1ML-2', '1ML-3',
        '3ML-0', '3ML-1', '3ML-2', '3ML-3', 
        '6ML-0', '6ML-1', '6ML-2', '6ML-3']

Relabel the columns of rna_data to the given labels in rna_new_columns.

In [16]:

# EXERCISE

rna_data.columns = rna_new_columns

rna_data.head()

Out[16]:

2.2 Understanding the Data Values

To get a better understanding of the data we are working with, take a look at the data in the first 10 rows and first 10 columns. Use iloc to do this (you can look back at notebook 07 to remind yourself how iloc works).

In [23]:

# EXERCISE

rna_data.iloc[:10,]

Out[23]:

Question 4 We have a full understanding of our data table's labels, so let's now consider what the values in the table represent. What data type(s) are the values in the table? What do you think they might represent?

The table represents the lipid level at different light levels for different tracking id's.

Check if there is any missing data in the following cell (look back at notebook 07). Based on your answer above, think about whether this would affect our data analysis later.

In [24]:

# EXERCISE

rna_data.isnull().sum()

Out[24]:

0.5HL-0    0
0.5HL-1    0
0.5HL-2    0
0.5HL-3    0
12HL-0     0
12HL-1     0
12HL-2     0
12HL-3     0
1HL-0      0
1HL-1      0
1HL-2      0
1HL-3      0
3HL-0      0
3HL-1      0
3HL-2      0
3HL-3      0
6HL-0      0
6HL-1      0
6HL-2      0
6HL-3      0
0.5ML-0    0
0.5ML-1    0
0.5ML-2    0
0.5ML-3    0
0ML-0      0
0ML-1      0
0ML-2      0
0ML-3      0
12ML-0     0
12ML-1     0
12ML-2     0
12ML-3     0
1ML-0      0
1ML-1      0
1ML-2      0
1ML-3      0
3ML-0      0
3ML-1      0
3ML-2      0
3ML-3      0
6ML-0      0
6ML-1      0
6ML-2      0
6ML-3      0
dtype: int64

It's a good thing we have no missing data! It seems like all of our data values are numbers, so let's see what range of values are under our 0.5HL-0 column.

Find the minimum value, maximum value, and mean in the 0.5HL-0 column (again notebook 07 can give you helpful reminders on what functions to use).

In [25]:

#EXERCISE

min_val_1 = rna_data["0.5HL-0"].min
max_val_1 = rna_data["0.5HL-0"].max
mean_val_1 = rna_data["0.5HL-0"].mean 

print("Minimum 0.5HL-0 value: " + str(min_val_1))
print("Maximum 0.5HL-0 value: " + str(max_val_1))
print("Mean 0.5HL-0 value: " + str(mean_val_1))

Out[25]:

Minimum 0.5HL-0 value: <bound method NDFrame._add_numeric_operations.<locals>.min of tracking_id
Cz01g00020    657
Cz01g00030    255
Cz01g00040     90
Cz01g00050     49
Cz01g00060     36
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      1
Name: 0.5HL-0, Length: 15369, dtype: int64>
Maximum 0.5HL-0 value: <bound method NDFrame._add_numeric_operations.<locals>.max of tracking_id
Cz01g00020    657
Cz01g00030    255
Cz01g00040     90
Cz01g00050     49
Cz01g00060     36
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      1
Name: 0.5HL-0, Length: 15369, dtype: int64>
Mean 0.5HL-0 value: <bound method NDFrame._add_numeric_operations.<locals>.mean of tracking_id
Cz01g00020    657
Cz01g00030    255
Cz01g00040     90
Cz01g00050     49
Cz01g00060     36
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      1
Name: 0.5HL-0, Length: 15369, dtype: int64>

It seems like there is a large range of numbers under this column. Choose another column and check if it has a range of values that is just as large as 0.5HL-0. Feel free to try multiple different columns.

In [26]:

#EXERCISE

min_val_2 = rna_data["6HL-3"].min
max_val_2 = rna_data["6HL-3"].max
mean_val_2 = rna_data["6HL-3"].mean 

print("Minimum value: " + str(min_val_2))
print("Maximum value: " + str(max_val_2))
print("Mean value: " + str(mean_val_2))

Out[26]:

Minimum value: <bound method NDFrame._add_numeric_operations.<locals>.min of tracking_id
Cz01g00020    336
Cz01g00030    135
Cz01g00040    145
Cz01g00050    104
Cz01g00060     37
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      0
Name: 6HL-3, Length: 15369, dtype: int64>
Maximum value: <bound method NDFrame._add_numeric_operations.<locals>.max of tracking_id
Cz01g00020    336
Cz01g00030    135
Cz01g00040    145
Cz01g00050    104
Cz01g00060     37
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      0
Name: 6HL-3, Length: 15369, dtype: int64>
Mean value: <bound method NDFrame._add_numeric_operations.<locals>.mean of tracking_id
Cz01g00020    336
Cz01g00030    135
Cz01g00040    145
Cz01g00050    104
Cz01g00060     37
             ... 
UNPLg00909      0
UNPLg00910      0
UNPLg00911      0
UNPLg00914      0
UNPLg00915      0
Name: 6HL-3, Length: 15369, dtype: int64>

Why are the range of values so broad for most columns?

The values in our data table represent the number of "turned on" genes under the given light conditions. Some genes may turn on more under lower light conditions while others may turn on more under higher light conditions. We might also see that some genes may turn on more after longer light exposure than they will under shorter light exposure.

In order to analyze, this however, we need to be able to look at numbers that range from 0 to the hundreds of thousands! We will address this issue in the next notebook.

Let's save our progress by using the .to_csv() function to save the dataframe rna_data, to "rna_data_cleaned.csv". Don't forget to specify to save it in the data folder.

In [28]:

# EXERCISE

rna_data.to_csv("data/rna_data_cleaned.csv")

Notebook developed by: Ciara Acosta, Sharon Greenblum, Alisa Bettale

Genomics Data Notebook 1: Preparing Our Data for Analysis

Goals For This Notebook:

Table of Contents

1. Loading the Data

1.1 What are the dimensions of our data?

2. What is `tracking_id`?

3. What are our other columns?

2. Understanding and Cleaning the Data

2.1 Making More Readable Column Names

2.2 Understanding the Data Values

Product

Resources

Company

Genomics Data Notebook 1: Preparing Our Data for Analysis

Goals For This Notebook:

Table of Contents

1. Loading the Data

1.1 What are the dimensions of our data?

2. What is tracking_id?

3. What are our other columns?

2. Understanding and Cleaning the Data

2.1 Making More Readable Column Names

2.2 Understanding the Data Values

2. What is `tracking_id`?